Statistics
In missing data analysis, an important step is calculating simple descriptive and aggregate statistics for both missing and non-missing data in each column, as well as for the entire dataset. Scikit-na provides useful functions to facilitate these operations.
Summary
We will use Titanic dataset, which contains missing values (NA) in three columns: Age, Cabin, and Embarked.
Per column
To generate a simple summary for each column, we will load the dataset using pandas
and pass it to the summary() function. This function supports subsetting the dataset
using the columns argument, which we will use to reduce the width of the results table.
import scikit_na as na
import pandas as pd
data = pd.read_csv('titanic_dataset.csv')
# Excluding three columns without NA to fit the table here
na.summary(data, columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))
Age |
Cabin |
Embarked |
Fare |
Name |
PassengerId |
Pclass |
Sex |
Survived |
|
|---|---|---|---|---|---|---|---|---|---|
na_count |
177 |
687 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
na_pct_per_col |
19.87 |
77.1 |
0.22 |
0 |
0 |
0 |
0 |
0 |
0 |
na_pct_total |
20.44 |
79.33 |
0.23 |
0 |
0 |
0 |
0 |
0 |
0 |
na_unique_per_col |
19 |
529 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
na_unique_pct_per_col |
10.73 |
77 |
100 |
0 |
0 |
0 |
0 |
0 |
0 |
rows_after_dropna |
714 |
204 |
889 |
891 |
891 |
891 |
891 |
891 |
891 |
rows_dropna_pct |
80.13 |
22.9 |
99.78 |
100 |
100 |
100 |
100 |
100 |
100 |
Those measures were meant to be self-explanatory:
na_count is the number of NA values in each column.
na_unique_per_col is the number of missing values in each column that are unique to it, meaning they do not overlap with NA values in other columns (or the number of values that would remain in the dataset if we drop rows with NA values from the other columns).
rows_after_dropna is the number of rows that would remain in the dataset if we applied
pandas.Series.dropna()method to each column.
Whole dataset
By default, the summary() function returns the results for each column. To get
the summary of missing data for the entire dataset, we should set the per_column
argument to False.
na.summary(data, per_column=False)
dataset |
|
|---|---|
total_columns |
12 |
total_rows |
891 |
na_rows |
708 |
non_na_rows |
183 |
total_cells |
10692 |
na_cells |
866 |
na_cells_pct |
8.1 |
non_na_cells |
9826 |
non_na_cells_pct |
91.9 |
Descriptive statistics
The next step is to calculate descriptive statistics for columns with quantitative and qualitative data. First, let’s filter the columns by data types:
# Presumably, qualitative data, needs checking
cols_nominal = data.columns[data.dtypes == object]
# Quantitative data
cols_numeric = data.columns[(data.dtypes == float) | (data.dtypes == int)]
We should also specify a column with missing values (NAs) to be used for splitting the data in the selected columns into two groups: NA (missing) and Filled (non-missing).
Qualitative data
na.describe(data, columns=cols_nominal)
Embarked |
Name |
Sex |
Ticket |
|||||
|---|---|---|---|---|---|---|---|---|
Cabin |
Filled |
NA |
Filled |
NA |
Filled |
NA |
Filled |
NA |
count |
202 |
687 |
204 |
687 |
204 |
687 |
204 |
687 |
unique |
3 |
3 |
204 |
687 |
2 |
2 |
142 |
549 |
top |
S |
S |
Levy, Mr. Rene Jacques |
Nasser, Mr. Nicholas |
male |
male |
113760 |
347082 |
freq |
129 |
515 |
1 |
1 |
107 |
470 |
4 |
7 |
Let’s check the results by hand:
data.groupby(
data['Cabin'].isna().replace({False: 'Filled', True: 'NA'}))['Sex']\
.value_counts()
Cabin |
Sex |
Count |
|---|---|---|
Filled |
male |
107 |
female |
97 |
|
NA |
male |
470 |
female |
217 |
Here we take Cabin column, encode missing/non-missing data as Filled/NA, and then use it to group and count values in Sex column: among the passengers with missing cabin data, 470 were males, while 217 were females.
Quantitative data
Now, let’s look at the statistics calculated for the numeric data:
# Selecting just two columns
na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')
Age |
Fare |
|||
|---|---|---|---|---|
Cabin |
Filled |
NA |
Filled |
NA |
count |
185 |
529 |
204 |
687 |
mean |
35.8293 |
27.5553 |
76.1415 |
19.1573 |
std |
15.6794 |
13.4726 |
74.3917 |
28.6633 |
min |
0.92 |
0.42 |
0 |
0 |
25% |
24 |
19 |
29.4531 |
7.8771 |
50% |
36 |
26 |
55.2208 |
10.5 |
75% |
48 |
35 |
89.3282 |
23 |
max |
80 |
74 |
512.329 |
512.329 |
The mean age of passengers with missing cabin data was 27.6 years.