Statistical Analysis

Understanding missing data patterns through statistical analysis is crucial for making informed decisions about data handling strategies. scikit-na provides comprehensive statistical functions to analyze missing data at both column and dataset levels.

This guide demonstrates key statistical functions using the Titanic dataset, which contains missing values in three columns: Age, Cabin, and Embarked.

Getting Started

import pandas as pd
import scikit_na as na

# Load the Titanic dataset
data = pd.read_csv('titanic_dataset.csv')

# Quick overview of missing data
print(f"Dataset shape: {data.shape}")
print("Missing values per column:")
print(data.isnull().sum())

Summary Statistics

Column-Level Analysis

Generate detailed statistics for each column to understand individual patterns:

# Comprehensive per-column summary
summary_stats = na.summary(data, per_column=True)
print(summary_stats)

# Focus on columns with missing data only
na.summary(data, columns=['Age', 'Cabin', 'Embarked'])

	Age	Cabin	Embarked	Fare	Name	PassengerId	Pclass	Sex	Survived
na_count	177	687	2	0	0	0	0	0	0
na_pct_per_col	19.87	77.1	0.22	0	0	0	0	0	0
na_pct_total	20.44	79.33	0.23	0	0	0	0	0	0
na_unique_per_col	19	529	2	0	0	0	0	0	0
na_unique_pct_per_col	10.73	77	100	0	0	0	0	0	0
rows_after_dropna	714	204	889	891	891	891	891	891	891
rows_dropna_pct	80.13	22.9	99.78	100	100	100	100	100	100

Understanding the Summary Metrics

The summary provides several key metrics for missing data analysis:

Missing Data Counts

na_count: Absolute number of missing values in each column
na_pct_per_col: Percentage of missing values within each column
na_pct_total: This column’s missing values as a percentage of all missing values in the dataset
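
As a rough cross-check, the three count metrics can be reproduced with plain pandas. The sketch below uses a small hypothetical DataFrame (toy) rather than the Titanic data, and the formulas are an assumption inferred from the definitions above, not taken from the scikit-na source:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, 71.28, 13.0],
})

na_mask = toy.isna()

# Absolute number of missing values in each column
na_count = na_mask.sum()

# Missing values as a percentage of the rows in each column
na_pct_per_col = na_count / len(toy) * 100

# Each column's share of all missing cells in the dataset
na_pct_total = na_count / na_count.sum() * 100

counts = pd.DataFrame({
    "na_count": na_count,
    "na_pct_per_col": na_pct_per_col,
    "na_pct_total": na_pct_total,
}).T.round(2)
print(counts)
```

Applied to the Titanic data, the same arithmetic gives 177 / 891 ≈ 19.87% for Age, which matches the na_pct_per_col value in the table.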

Missing Data Patterns

na_unique_per_col: Missing values in this column that do not coincide with a missing value in any other column of the same row
na_unique_pct_per_col: Percentage of this column’s missing values that are unique
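
One way to read the "unique" metrics is to count rows in which the given column is the only missing one. The following sketch works under that assumption on a hypothetical toy DataFrame; it illustrates the idea and is not the library's implementation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan, "E46"],
    "Fare": [7.25, 8.05, 71.28, 13.0, 9.5],
})

na_mask = toy.isna()

# Rows in which exactly one column is missing
single_na_rows = na_mask.sum(axis=1) == 1

# Within those rows, count the missing values per column
na_unique_per_col = na_mask[single_na_rows].sum()

# Share of each column's missing values that are unique to it;
# columns without missing values produce 0/0, which is reported as 0
na_unique_pct_per_col = (
    (na_unique_per_col / na_mask.sum() * 100).fillna(0).round(2)
)

print(na_unique_per_col)
print(na_unique_pct_per_col)
```

This reading is consistent with the Titanic table: Age has 177 missing values, 19 of them in rows where nothing else is missing, and 19 / 177 ≈ 10.73%.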

Impact Analysis

rows_after_dropna: Rows remaining after dropping missing values from this column
rows_after_dropna_pct: Percentage of original rows that would remain (labeled rows_dropna_pct in the output above)
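
The impact metrics can be checked by dropping rows one column at a time. This is a minimal sketch on a hypothetical toy DataFrame, assuming the metric corresponds to pandas dropna with a single-column subset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, 71.28, 13.0],
})

# Rows left if we drop the rows where this one column is missing
rows_after_dropna = pd.Series(
    {col: len(toy.dropna(subset=[col])) for col in toy.columns}
)

# The same figure as a percentage of the original row count
rows_after_dropna_pct = (rows_after_dropna / len(toy) * 100).round(2)

impact = pd.DataFrame({
    "rows_after_dropna": rows_after_dropna,
    "rows_after_dropna_pct": rows_after_dropna_pct,
}).T
print(impact)
```

For the Titanic data this corresponds to 891 - 177 = 714 rows remaining for Age, or about 80.13%.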

Dataset-Level Analysis

For an overall dataset perspective, use aggregate statistics:

# Dataset-level summary
dataset_summary = na.summary(data, per_column=False)
print(dataset_summary)

	dataset
total_columns	12
total_rows	891
na_rows	708
non_na_rows	183
total_cells	10692
na_cells	866
na_cells_pct	8.1
non_na_cells	9826
non_na_cells_pct	91.9
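
The dataset-level figures follow from simple row and cell counts. The sketch below recomputes them with pandas on a hypothetical toy DataFrame; the key names mirror the output above, while the formulas are an assumption based on those names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, 71.28, 13.0],
})

na_mask = toy.isna()
total_rows, total_columns = toy.shape
total_cells = toy.size

# Rows with at least one missing value, and missing cells overall
na_rows = int(na_mask.any(axis=1).sum())
na_cells = int(na_mask.to_numpy().sum())

summary = {
    "total_columns": total_columns,
    "total_rows": total_rows,
    "na_rows": na_rows,
    "non_na_rows": total_rows - na_rows,
    "total_cells": total_cells,
    "na_cells": na_cells,
    "na_cells_pct": round(na_cells / total_cells * 100, 2),
    "non_na_cells": total_cells - na_cells,
    "non_na_cells_pct": round((total_cells - na_cells) / total_cells * 100, 2),
}

for name, value in summary.items():
    print(f"{name}\t{value}")
```

The Titanic output is consistent with this arithmetic: 891 rows × 12 columns = 10692 cells, and 866 / 10692 ≈ 8.1%.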

Descriptive Statistics

The next step is to calculate descriptive statistics for columns with quantitative and qualitative data. First, let’s filter the columns by data types:

# Object-dtype columns are presumably qualitative; verify this manually
cols_nominal = data.columns[data.dtypes == object]

# Quantitative data: select every numeric dtype, regardless of
# platform-specific integer or float width
cols_numeric = data.select_dtypes(include='number').columns

We also need to specify a column with missing values (NAs), passed as col_na, which splits the data in the selected columns into two groups: NA (missing) and Filled (non-missing).

Qualitative Data

na.describe(data, columns=cols_nominal, col_na='Cabin')

	Embarked		Name		Sex		Ticket
Cabin	Filled	NA	Filled	NA	Filled	NA	Filled	NA
count	202	687	204	687	204	687	204	687
unique	3	3	204	687	2	2	142	549
top	S	S	Levy, Mr. Rene Jacques	Nasser, Mr. Nicholas	male	male	113760	347082
freq	129	515	1	1	107	470	4	7

Let’s check the results by hand:

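# Label each row by whether its Cabin value is present or missing
cabin_status = data['Cabin'].isna().replace({False: 'Filled', True: 'NA'})

# Count values of Sex within each group
data.groupby(cabin_status)['Sex'].value_counts()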

Cabin	Sex	Count
Filled	male	107
	female	97
NA	male	470
	female	217

Here we take the Cabin column, encode non-missing/missing values as Filled/NA, and then use it to group and count values in the Sex column: among the passengers with missing cabin data, 470 were male and 217 were female. Both counts match the freq and count rows for Sex in the table above.

Quantitative Data

Now, let’s look at the statistics calculated for the numeric data:

# Selecting just two columns
na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')

	Age		Fare
Cabin	Filled	NA	Filled	NA
count	185	529	204	687
mean	35.8293	27.5553	76.1415	19.1573
std	15.6794	13.4726	74.3917	28.6633
min	0.92	0.42	0	0
25%	24	19	29.4531	7.8771
50%	36	26	55.2208	10.5
75%	48	35	89.3282	23
max	80	74	512.329	512.329

The mean age of passengers with missing cabin data was 27.6 years, compared with 35.8 years for passengers whose cabin was recorded.
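
The numeric results can be checked by hand in the same way as the qualitative ones. The sketch below groups a hypothetical toy DataFrame by the missingness of Cabin and describes Age within each group using plain pandas; it is an approximation of what the table reports, not the library's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 40.0, 50.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, "E46", np.nan],
})

# Label each row by whether Cabin is present or missing
cabin_status = toy["Cabin"].isna().map({False: "Filled", True: "NA"})

# Describe Age within each group; transpose so that the statistics
# are rows and the groups are columns, as in the table above
by_group = toy.groupby(cabin_status)["Age"].describe().T
print(by_group)
```

Note that count only includes non-missing Age values, which is why the Age counts in the Titanic table (185 and 529) are lower than the group sizes (204 and 687).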