Data visualization

 
import pandas as pd
import scikit_na as na
data = pd.read_csv('../../_tests/data/titanic_dataset.csv')

Heatmap

NA values

Missing data can be visualized on a heatmap to quickly grasp its patterns. We will use the Altair + Vega backend. To plot a heatmap of NAs, pass your DataFrame to the scikit_na.altair.plot_heatmap() function.

Droppables are the values that would be dropped if we called pandas.DataFrame.dropna() on the whole dataset, that is, non-missing values in rows that contain at least one NA. By default, columns are sorted by the number of NA values.

 
na.altair.plot_heatmap(data)
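
To make the idea of droppables concrete, here is a minimal pandas-only sketch on a made-up toy DataFrame. It reflects one reading of the definition above and is not necessarily how scikit_na computes the heatmap internally; the names toy, na_mask, rows_with_na, and droppable_mask are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic dataset (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", "C123", None, "E46"],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

na_mask = toy.isna()                 # True where a value is missing
rows_with_na = na_mask.any(axis=1)   # rows that toy.dropna() would remove

# A value is "droppable" (assumed meaning) if it is present itself
# but sits in a row that contains an NA somewhere else.
droppable_mask = (~na_mask).apply(lambda col: col & rows_with_na)

# Per-column number of non-missing values lost to a plain dropna().
droppable_counts = droppable_mask.sum()
print(droppable_counts)
```

Here Fare has no missing values of its own, yet it still loses two values because two rows are removed on account of Age and Cabin.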

Correlations

Correlations can be plotted with the scikit_na.altair.plot_corr() function. Under the hood, it calls the scikit_na.correlate() function with your input DataFrame as the first argument:

 
na.altair.plot_corr(data).properties(width=125, height=125)
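
As a rough illustration of what correlating missingness means, the sketch below correlates 0/1 NA indicators with plain pandas. This is an assumption about the general technique, not a reproduction of scikit_na.correlate(), whose defaults (column selection, correlation method) may differ; the toy data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with missing values in two of three columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 41.0, 30.0],
    "Cabin": ["C85", None, "E46", None, None, "B20"],
    "Fare": [7.25, 71.28, 8.05, 51.86, 13.0, 26.55],
})

# Keep only columns that contain NAs: an indicator that is constant
# (no missing values) has no defined correlation.
cols_with_na = toy.columns[toy.isna().any()]

# 1 where the value is missing, 0 where it is present.
na_indicators = toy[cols_with_na].isna().astype(int)

# Pearson correlation between the missingness indicators.
na_corr = na_indicators.corr()
print(na_corr)
```

A high positive value means the two columns tend to be missing in the same rows.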

Stairs plot

A stairs plot visualizes how the dataset shrinks as the pandas.Series.dropna() method is applied to each column sequentially (by default, columns are sorted by the number of NA values):

 
na.altair.plot_stairs(data)

After dropping all NAs in the Cabin column, we are left with 21 more NAs (in the Age and Embarked columns). This plot also shows tooltips with the exact number of NA values dropped per column.

 
na.altair.plot_stairbars(data)
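
The shrinkage that the stairs plots display can be approximated by hand. The following is a minimal sketch on made-up data, assuming the columns are processed in descending order of NA count; it is not scikit_na's own implementation, and the names toy, order, remaining, and steps are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up): Cabin has 4 NAs, Age has 3, Fare has none.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 41.0, np.nan],
    "Cabin": ["C85", None, None, None, None, "B20"],
    "Fare": [7.25, 71.28, 8.05, 51.86, 13.0, 26.55],
})

# Columns ordered by NA count, largest first.
order = toy.isna().sum().sort_values(ascending=False).index

remaining = toy
steps = []  # (column, rows dropped at this step, rows left afterwards)
for col in order:
    before = len(remaining)
    # Drop rows where this column is missing, keeping earlier drops.
    remaining = remaining.dropna(subset=[col])
    steps.append((col, before - len(remaining), len(remaining)))

for col, dropped, left in steps:
    print(f"{col}: dropped {dropped}, {left} rows left")
```

Age has three missing values in total, but only one row is dropped at its step because the other two were already removed together with the Cabin NAs.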

Histogram

A readable histogram may require extra configuration, such as the chart size and the axis label angle.

 
chart = na.altair.plot_hist(data, col='Pclass', col_na='Age')\
    .properties(width=200, height=200)
chart.configure_axisX(labelAngle=0)
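
The comparison behind such a histogram can also be tabulated directly. The sketch below assumes that col is the plotted variable and col_na is the column whose missingness splits the data into two groups; it uses made-up data and plain pandas rather than scikit_na, and the names toy, age_status, and counts are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up): passenger class and age with some missing ages.
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 3, 1],
    "Age": [22.0, np.nan, 35.0, np.nan, np.nan, 30.0],
})

# Label each row by whether Age is missing.
age_status = toy["Age"].isna().map({True: "Age missing", False: "Age present"})

# Count rows per passenger class within each group.
counts = pd.crosstab(toy["Pclass"], age_status)
print(counts)
```

If the two columns of counts have clearly different shapes, missingness in Age is likely related to Pclass.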

Density plot

 
chart = na.altair.plot_kde(data, col='Age', col_na='Cabin')\
    .properties(width=200, height=200)
chart.configure_axisX(labelAngle=0)