Statistical Analysis
===================

Understanding missing data patterns through statistical analysis is crucial for making
informed decisions about data handling strategies. **scikit-na** provides comprehensive
statistical functions to analyze missing data at both column and dataset levels.

This guide demonstrates key statistical functions using the Titanic dataset, which
contains missing values in three columns: *Age*, *Cabin*, and *Embarked*.

Getting Started
~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd
   import scikit_na as na

   # Load the Titanic dataset
   data = pd.read_csv('titanic_dataset.csv')

   # Quick overview of missing data
   print(f"Dataset shape: {data.shape}")
   print(f"Missing values per column:")
   print(data.isnull().sum())

Summary Statistics
~~~~~~~~~~~~~~~~~~

Column-Level Analysis
---------------------

Generate detailed statistics for each column to understand individual patterns:

.. code-block:: python

   # Comprehensive per-column summary
   summary_stats = na.summary(data, per_column=True)
   print(summary_stats)

   # Focus on columns with missing data only
   na.summary(data, columns=['Age', 'Cabin', 'Embarked'])

===========================  ======  =======  ==========  ======  ======  =============  ========  =====  ==========
..                              Age    Cabin    Embarked    Fare    Name    PassengerId    Pclass    Sex    Survived
===========================  ======  =======  ==========  ======  ======  =============  ========  =====  ==========
na_count                     177      687           2          0       0              0         0      0           0
na_pct_per_col                19.87    77.1         0.22       0       0              0         0      0           0
na_pct_total                  20.44    79.33        0.23       0       0              0         0      0           0
na_unique_per_col             19      529           2          0       0              0         0      0           0
na_unique_pct_per_col         10.73    77         100          0       0              0         0      0           0
rows_after_dropna            714      204         889        891     891            891       891    891         891
rows_dropna_pct               80.13    22.9        99.78     100     100            100       100    100         100
===========================  ======  =======  ==========  ======  ======  =============  ========  =====  ==========

Understanding the Summary Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The summary provides several key metrics for missing data analysis:

**Missing Data Counts**
  - *na_count*: Absolute number of missing values in each column
  - *na_pct_per_col*: Percentage of missing values within each column
  - *na_pct_total*: This column's missing values as percentage of all missing values

**Missing Data Patterns**
  - *na_unique_per_col*: Missing values unique to this column (don't overlap with other columns)
  - *na_unique_pct_per_col*: Percentage of this column's missing values that are unique

**Impact Analysis**
  - *rows_after_dropna*: Rows remaining after dropping missing values from this column
  - *rows_after_dropna_pct*: Percentage of original rows that would remain

Dataset-Level Analysis
----------------------

For an overall dataset perspective, use aggregate statistics:

.. code-block:: python

   # Dataset-level summary
   dataset_summary = na.summary(data, per_column=False)
   print(dataset_summary)

.. code:: python

    na.summary(data, per_column=False)

==============================  =========
..                                dataset
==============================  =========
total_columns                          12 
total_rows                            891 
na_rows                               708 
non_na_rows                           183 
total_cells                         10692 
na_cells                              866 
na_cells_pct                          8.1 
non_na_cells                         9826 
non_na_cells_pct                     91.9 
==============================  =========

Descriptive statistics
~~~~~~~~~~~~~~~~~~~~~~

The next step is to calculate descriptive statistics for columns with
quantitative and qualitative data. First, let's filter the columns by data
types:

.. code:: python

    # Presumably, qualitative data, needs checking
    cols_nominal = data.columns[data.dtypes == object]

    # Quantitative data
    cols_numeric = data.columns[(data.dtypes == float) | (data.dtypes == int)]

We should also specify a column with missing values (NAs) to be used for
splitting the data in the selected columns into two groups: NA (missing)
and Filled (non-missing).

Qualitative data
----------------

.. code:: python

    na.describe(data, columns=cols_nominal)

======  ======  ===  ======================  ====================  ======  ====  ======  ======
..        Embarked                        Name                     Sex           Ticket         
------  -----------  --------------------------------------------  ------------  --------------
Cabin   Filled  NA   Filled                  NA                    Filled   NA   Filled    NA  
======  ======  ===  ======================  ====================  ======  ====  ======  ======
count   202     687  204                     687                   204     687      204     687
unique  3       3    204                     687                   2       2        142     549
top     S       S    Levy, Mr. Rene Jacques  Nasser, Mr. Nicholas  male    male  113760  347082
freq    129     515  1                       1                     107     470        4       7
======  ======  ===  ======================  ====================  ======  ====  ======  ======

Let's check the results by hand:

.. code:: python

    data.groupby(
      data['Cabin'].isna().replace({False: 'Filled', True: 'NA'}))['Sex']\
    .value_counts()

======  ======  =====
Cabin   Sex     Count  
======  ======  =====
Filled  male    107  
..      female  97   
NA      male    470  
..      female  217  
======  ======  =====

Here we take *Cabin* column, encode missing/non-missing data as Filled/NA, and
then use it to group and count values in *Sex* column: among the passengers with
missing *cabin* data, 470 were males, while 217 were females.

Quantitative data
-----------------

Now, let's look at the statistics calculated for the numeric data:

.. code:: python

  # Selecting just two columns
  na.describe(data, columns=['Age', 'Fare'], col_na='Cabin')

=====  ========  ========  ========  =========
..     Age                 Fare               
-----  ------------------  -------------------
Cabin  Filled    NA        Filled    NA       
=====  ========  ========  ========  =========
count  185       529       204        687     
mean    35.8293   27.5553   76.1415    19.1573
std     15.6794   13.4726   74.3917    28.6633
min      0.92      0.42      0          0     
25%     24        19        29.4531     7.8771
50%     36        26        55.2208    10.5   
75%     48        35        89.3282    23     
max     80        74       512.329    512.329 
=====  ========  ========  ========  =========

The mean *age* of passengers with missing *cabin* data was 27.6 years.