Statistical Analysis

Understanding missing data patterns through statistical analysis is crucial for making informed decisions about data handling strategies. scikit-na provides comprehensive statistical functions to analyze missing data at both column and dataset levels.

This guide demonstrates key statistical functions using the Titanic dataset, which contains missing values in three columns: Age, Cabin, and Embarked.

Getting Started

import pandas as pd
import scikit_na as na

# Load the Titanic dataset
data = pd.read_csv('titanic_dataset.csv')

# Quick overview of missing data
print(f"Dataset shape: {data.shape}")
print(f"Missing values per column:")
print(data.isnull().sum())

Summary Statistics

Column-Level Analysis

Generate detailed statistics for each column to understand individual patterns:

# Comprehensive per-column summary
summary_stats = na.summary(data, per_column=True)
print(summary_stats)

# Focus on columns with missing data only
na.summary(data, columns=['Age', 'Cabin', 'Embarked'])

Age

Cabin

Embarked

Fare

Name

PassengerId

Pclass

Sex

Survived

na_count

177

687

2

0

0

0

0

0

0

na_pct_per_col

19.87

77.1

0.22

0

0

0

0

0

0

na_pct_total

20.44

79.33

0.23

0

0

0

0

0

0

na_unique_per_col

19

529

2

0

0

0

0

0

0

na_unique_pct_per_col

10.73

77

100

0

0

0

0

0

0

rows_after_dropna

714

204

889

891

891

891

891

891

891

rows_dropna_pct

80.13

22.9

99.78

100

100

100

100

100

100

Understanding the Summary Metrics

The summary provides several key metrics for missing data analysis:

Missing Data Counts
  • na_count: Absolute number of missing values in each column

  • na_pct_per_col: Percentage of missing values within each column

  • na_pct_total: This column’s missing values as percentage of all missing values

Missing Data Patterns
  • na_unique_per_col: Missing values unique to this column (don’t overlap with other columns)

  • na_unique_pct_per_col: Percentage of this column’s missing values that are unique

Impact Analysis
  • rows_after_dropna: Rows remaining after dropping missing values from this column

  • rows_after_dropna_pct: Percentage of original rows that would remain

Dataset-Level Analysis

For an overall dataset perspective, use aggregate statistics:

# Dataset-level summary
dataset_summary = na.summary(data, per_column=False)
print(dataset_summary)
na.summary(data, per_column=False)

dataset

total_columns

12

total_rows

891

na_rows

708

non_na_rows

183

total_cells

10692

na_cells

866

na_cells_pct

8.1

non_na_cells

9826

non_na_cells_pct

91.9

Descriptive statistics

The next step is to calculate descriptive statistics for columns with quantitative and qualitative data. First, let’s filter the columns by data types:

# Presumably, qualitative data, needs checking
cols_nominal = data.columns[data.dtypes == object]

# Quantitative data
cols_numeric = data.columns[(data.dtypes == float) | (data.dtypes == int)]

We should also specify a column with missing values (NAs) to be used for splitting the data in the selected columns into two groups: NA (missing) and Filled (non-missing).

Qualitative data

na.describe(data, columns=cols_nominal)

Embarked

Name

Sex

Ticket

Cabin

Filled

NA

Filled

NA

Filled

NA

Filled

NA

count

202

687

204

687

204

687

204

687

unique

3

3

204

687

2

2

142

549

top

S

S

Levy, Mr. Rene Jacques

Nasser, Mr. Nicholas

male

male

113760

347082

freq

129

515

1

1

107

470

4

7

Let’s check the results by hand:

data.groupby(
  data['Cabin'].isna().replace({False: 'Filled', True: 'NA'}))['Sex']\
.value_counts()

Cabin

Sex

Count

Filled

male

107

female

97

NA

male

470

female

217

Here we take Cabin column, encode missing/non-missing data as Filled/NA, and then use it to group and count values in Sex column: among the passengers with missing cabin data, 470 were males, while 217 were females.

Quantitative data

Now, let’s look at the statistics calculated for the numeric data:

# Selecting just two columns
na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')

Age

Fare

Cabin

Filled

NA

Filled

NA

count

185

529

204

687

mean

35.8293

27.5553

76.1415

19.1573

std

15.6794

13.4726

74.3917

28.6633

min

0.92

0.42

0

0

25%

24

19

29.4531

7.8771

50%

36

26

55.2208

10.5

75%

48

35

89.3282

23

max

80

74

512.329

512.329

The mean age of passengers with missing cabin data was 27.6 years.