Predictive Modeling of Missingness

Understanding what predicts missingness patterns can reveal important insights about your data collection process and help determine appropriate missing data mechanisms. scikit-na provides logistic regression modeling to predict the probability of missingness based on other variables.

Why Model Missingness?

Modeling missingness helps you:

  • Test missing data mechanisms: Distinguish between MCAR, MAR, and MNAR

  • Identify predictors: Understand which variables are associated with missingness

  • Inform imputation: Use predictive relationships for better imputation strategies

  • Assess bias: Evaluate potential selection bias in your analysis

The Model

The scikit_na.model() function fits a logistic regression where:

  • Dependent variable: Missing (1) vs. Non-missing (0) in the target column

  • Independent variables: Other columns that might predict missingness

  • Backend: Uses statsmodels for robust statistical inference

Basic Example

import pandas as pd
import scikit_na as na

# Load the Titanic dataset
data = pd.read_csv("titanic_dataset.csv")

# Select numeric predictors
predictors = ['Fare', 'Parch', 'Pclass', 'SibSp', 'Survived']

# Fit logistic regression model
model = na.model(data, col_na='Age', columns=predictors)

# Display comprehensive results
print(model.summary())

Interpreting Results

# Extract key information
print("Model Coefficients:")
print(model.params)

print("\\nStatistical Significance:")
print(model.pvalues)

print("\\nConfidence Intervals:")
print(model.conf_int())
Optimization terminated successfully.
Current function value: 0.467801
Iterations 7
                        Logit Regression Results
==============================================================================
Dep. Variable:                    Age   No. Observations:                  891
Model:                          Logit   Df Residuals:                      885
Method:                           MLE   Df Model:                            5
Date:                Sat, 05 Jun 2021   Pseudo R-squ.:                 0.06164
Time:                        17:51:31   Log-Likelihood:                -416.81
converged:                       True   LL-Null:                       -444.19
Covariance Type:            nonrobust   LLR p-value:                 1.463e-10
===============================================================================
                coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
(intercept)    -2.7294      0.429     -6.369      0.000      -3.569      -1.890
Fare            0.0010      0.003      0.376      0.707      -0.004       0.006
Parch          -0.8874      0.223     -3.984      0.000      -1.324      -0.451
Pclass          0.5953      0.147      4.046      0.000       0.307       0.884
SibSp           0.2548      0.095      2.684      0.007       0.069       0.441
Survived       -0.1026      0.198     -0.519      0.604      -0.490       0.285
===============================================================================