Statistics

In missing data analysis, an important step is to calculate simple descriptive and aggregating statistics of missing and non-missing data for each column and for the whole dataset. Scikit-na attempts to provide useful functions for such operations.

Summary

We will use Titanic dataset that contains NA values in three columns: Age, Cabin, and Embarked.

Per column

To get a simple summary per each column, we will load a dataset using pandas and pass it to summary() function. The latter supports subsetting a dataset with columns argument. And we will make use of it to cut the width of the results table.

import scikit_na as na
import pandas as pd

data = pd.read_csv('titanic_dataset.csv')

# Excluding three columns without NA to fit the table here
na.summary(data, columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))

Age

Cabin

Embarked

Fare

Name

PassengerId

Pclass

Sex

Survived

NA count

177

687

2

0

0

0

0

0

0

NA, % (per column)

19.87

77.1

0.22

0

0

0

0

0

0

NA, % (of all NAs)

20.44

79.33

0.23

0

0

0

0

0

0

NA unique (per column)

19

529

2

0

0

0

0

0

0

NA unique, % (per column)

10.73

77

100

0

0

0

0

0

0

Rows left after dropna()

714

204

889

891

891

891

891

891

891

Rows left after dropna(), %

80.13

22.9

99.78

100

100

100

100

100

100

Those measures were meant to be self-explanatory:

  • NA count is the number of NA values in each column.

  • NA unique is the number of NA values in each column that are unique for it, i.e. do not intersect with NA values in the other columns (or that will remain in dataset if we drop NA values in the other columns).

  • Rows left after dropna() shows how many rows will be left in a dataset if we apply pandas.Series.dropna() method to each column.

Whole dataset

By default, summary() function returns the results for each column. To get the summary of missing data for the whole DataFrame, we should set per_column argument to False.

na.summary(data, per_column=False)

dataset

Total columns

12

Total rows

891

Rows with NA

708

Rows without NA

183

Total cells

10692

Cells with NA

866

Cells with NA, %

8.1

Cells with non-missing data

9826

Cells with non-missing data, %

91.9

Descriptive statistics

The next step is to calculate descriptive statistics for columns with quantitative and qualitative data. First, let’s filter the columns by data types:

# Presumably, qualitative data, needs checking
cols_nominal = data.columns[data.dtypes == object]

# Quantitative data
cols_numeric = data.columns[(data.dtypes == float) | (data.dtypes == int)]

We should also specify a column with missing values (NAs) that will be used to split the data in the selected columns into two groups: NA (missing) and Filled (non-missing).

Qualitative data

na.describe(data, columns=cols_nominal)

Embarked

Name

Sex

Ticket

Cabin

Filled

NA

Filled

NA

Filled

NA

Filled

NA

count

202

687

204

687

204

687

204

687

unique

3

3

204

687

2

2

142

549

top

S

S

Levy, Mr. Rene Jacques

Nasser, Mr. Nicholas

male

male

113760

347082

freq

129

515

1

1

107

470

4

7

Let’s check the results by hand:

data.groupby(
  data['Cabin'].isna().replace({False: 'Filled', True: 'NA'}))['Sex']\
.value_counts()

Cabin

Sex

Count

Filled

male

107

female

97

NA

male

470

female

217

Here we take Cabin column, encode missing/non-missing data as Filled/NA, and then use it to group and count values in Sex column: among the passengers with missing cabin data, 470 were males, while 217 were females.

Quantitative data

Now, let’s look at the statistics calculated for the numeric data:

# Selecting just two columns
na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')

Age

Fare

Cabin

Filled

NA

Filled

NA

count

185

529

204

687

mean

35.8293

27.5553

76.1415

19.1573

std

15.6794

13.4726

74.3917

28.6633

min

0.92

0.42

0

0

25%

24

19

29.4531

7.8771

50%

36

26

55.2208

10.5

75%

48

35

89.3282

23

max

80

74

512.329

512.329

The mean age of passengers with missing cabin data was 27.6 years.