scikit_na

scikit_na.correlate(data: DataFrame, columns: Sequence | None = None, drop: bool = True, **kwargs) → DataFrame

Calculate correlations between columns in terms of NA values.

Parameters:
  • data (DataFrame) – Input data.

  • columns (Optional[List, ndarray, Index] = None) – Column names.

  • drop (bool = True, optional) – Drop columns without NA values.

  • kwargs (dict, optional) – Keyword arguments passed to pandas.DataFrame.corr() method.

Returns:

Correlation values.

Return type:

DataFrame
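
As a rough illustration of what correlating columns "in terms of NA values" means, the sketch below builds NA indicators with pandas and correlates them. It uses only pandas and NumPy and is not scikit_na's implementation; the helper name na_correlation_sketch and the sample data are made up for this example.

```python
import numpy as np
import pandas as pd


def na_correlation_sketch(data, columns=None, drop=True, **kwargs):
    """Correlate NA indicators of the selected columns (illustrative only)."""
    cols = list(data.columns) if columns is None else list(columns)
    # True where a value is missing, False otherwise.
    indicators = data[cols].isna()
    if drop:
        # A column without NA values has a constant indicator and would
        # only produce NaN correlations, so it is left out.
        indicators = indicators.loc[:, indicators.any()]
    # Cast to int so that corr() treats the indicators as numeric.
    return indicators.astype(int).corr(**kwargs)


# Hypothetical sample data: "weight" has no missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan, 40, 52],
    "height": [170, np.nan, 182, np.nan, 165, np.nan],
    "weight": [60, 72, 81, 90, 55, 77],
})
na_corr = na_correlation_sketch(df)
```

A high value in na_corr means that two columns tend to be missing in the same rows.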

scikit_na.describe(data: DataFrame, col_na: str, columns: Sequence | None = None, na_mapping: dict = None) → DataFrame

Describe data grouped by a column with NA values.

Parameters:
  • data (DataFrame) – Input data.

  • col_na (str) – Column with NA values to group the other data by.

  • columns (Optional[Sequence]) – Columns to calculate descriptive statistics on.

  • na_mapping (dict, optional) – Dictionary with NA mappings. By default, it is {True: "NA", False: "Filled"}.

Returns:

Descriptive statistics (mean, median, etc.).

Return type:

DataFrame
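
The grouping idea can be sketched with plain pandas: label each row by whether col_na is missing, then describe the other columns per label. This is an assumption about the general approach, not scikit_na's code; describe_by_na_sketch and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def describe_by_na_sketch(data, col_na, columns=None, na_mapping=None):
    """Descriptive statistics split by missingness in col_na (illustrative only)."""
    if na_mapping is None:
        na_mapping = {True: "NA", False: "Filled"}
    if columns is None:
        columns = [c for c in data.columns if c != col_na]
    # Label every row by whether col_na is missing in that row.
    groups = data[col_na].isna().map(na_mapping)
    return data[list(columns)].groupby(groups).describe()


# Hypothetical sample data.
df = pd.DataFrame({
    "income": [np.nan, 30.0, np.nan, 45.0, 50.0],
    "age": [20, 30, 40, 50, 70],
})
stats = describe_by_na_sketch(df, col_na="income", columns=["age"])
```

Here stats has one row per label ("Filled", "NA") and one column per (column, statistic) pair, so a difference in age between rows with and without income shows up directly.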

scikit_na.model(data: DataFrame, col_na: str, columns: Sequence | None = None, intercept: bool = True, fit_kws: dict = None, logit_kws: dict = None)

Logistic regression modeling.

Fit a logistic regression model to the NA values in column col_na, encoded as 0 (non-missing) and 1 (NA), using the predictors given in the columns argument. The statsmodels package is used as the backend for model fitting.

Parameters:
  • data (DataFrame) – Input data.

  • col_na (str) – Column with NA values to use as a dependent variable.

  • columns (Optional[Sequence]) – Columns to use as independent variables.

  • intercept (bool, optional) – Fit intercept.

  • fit_kws (dict, optional) – Keyword arguments passed to fit() method of model.

  • logit_kws (dict, optional) – Keyword arguments passed to statsmodels.discrete.discrete_model.Logit() class.

Returns:

Model after applying fit method.

Return type:

statsmodels.discrete.discrete_model.BinaryResultsWrapper

Example

>>> import scikit_na as na
>>> model = na.model(
...     data,
...     col_na='column_with_NAs',
...     columns=['age', 'height', 'weight'])
>>> model.summary()

scikit_na.report(data: DataFrame, columns: Sequence[str] | None = None, layout: Layout = None, round_dec: int = 2, corr_kws: dict = None, heat_kws: dict = None, dist_kws: dict = None)

Interactive report.

Parameters:
  • data (DataFrame) – Input data.

  • columns (Optional[Sequence[str]], optional) – Column names.

  • layout (widgets.Layout, optional) – Layout object for use in GridBox.

  • round_dec (int, optional) – Number of decimals for rounding.

  • corr_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_corr().

  • heat_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_heatmap().

  • dist_kws (dict, optional) – Keyword arguments passed to scikit_na.altair.plot_hist().

Returns:

Interactive report with multiple tabs.

Return type:

widgets.Tab

scikit_na.stairs(data: DataFrame, columns: Sequence | None = None, xlabel: str = 'Columns', ylabel: str = 'Instances', tooltip_label: str = 'Size difference', dataset_label: str = '(Whole dataset)')

DataFrame shrinkage on cumulative pandas.DataFrame.dropna().

Parameters:
  • data (DataFrame) – Input data.

  • columns (Optional[Sequence], optional) – Column names.

  • xlabel (str, optional) – X axis label.

  • ylabel (str, optional) – Y axis label.

  • tooltip_label (str, optional) – Tooltip label.

  • dataset_label (str, optional) – Label for a whole dataset.

Returns:

Dataset shrinkage results after cumulative pandas.DataFrame.dropna().

Return type:

DataFrame
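
To make "cumulative dropna" concrete, the sketch below drops rows column by column and records how many rows remain after each step. It is a pandas-only illustration under stated assumptions: the column order is taken as given, and the output column names simply reuse the default labels from the signature. stairs_sketch is a hypothetical helper, not the library's implementation.

```python
import numpy as np
import pandas as pd


def stairs_sketch(data, columns=None):
    """Row counts after dropping NA column by column (illustrative only)."""
    cols = list(data.columns) if columns is None else list(columns)
    remaining = data
    rows = [{"Columns": "(Whole dataset)", "Instances": len(remaining)}]
    for col in cols:
        # Drop rows that are NA in this column; earlier drops are kept.
        remaining = remaining.dropna(subset=[col])
        rows.append({"Columns": col, "Instances": len(remaining)})
    result = pd.DataFrame(rows)
    # Change in size relative to the previous step (0 for the first row).
    result["Size difference"] = result["Instances"].diff().fillna(0).astype(int)
    return result


# Hypothetical sample data.
df = pd.DataFrame({
    "a": [1, np.nan, 3, 4, np.nan],
    "b": [np.nan, 2, 3, 4, 5],
    "c": [1, 2, 3, np.nan, 5],
})
shrinkage = stairs_sketch(df)
```

Each row of shrinkage shows how much of the dataset survives once that column, and every column before it, is required to be non-missing.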

scikit_na.summary(data: DataFrame, columns: Sequence | None = None, per_column: bool = True, round_dec: int = 2) → DataFrame

Summary statistics on NA values.

Parameters:
  • data (DataFrame) – Data object.

  • columns (Optional[Sequence]) – Columns or indices to observe.

  • per_column (bool = True, optional) – Show stats for each selected column.

  • round_dec (int = 2, optional) – Number of decimals for rounding.

Returns:

Summary on NA values in the input data.

Return type:

DataFrame
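
A per-column NA summary of this kind can be approximated with pandas alone. The sketch below reports only two statistics, and their names (na_count, na_pct) are assumptions chosen for the example; the statistics that scikit_na.summary() actually returns are not listed here.

```python
import numpy as np
import pandas as pd


def na_summary_sketch(data, columns=None, round_dec=2):
    """Per-column NA counts and percentages (illustrative only)."""
    cols = list(data.columns) if columns is None else list(columns)
    # Number of missing values in each selected column.
    na_count = data[cols].isna().sum()
    summary = pd.DataFrame({
        "na_count": na_count,
        "na_pct": (na_count / len(data) * 100).round(round_dec),
    })
    # Transpose: one row per statistic, one column per input column.
    return summary.T


# Hypothetical sample data.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1, 2, 3],
})
na_stats = na_summary_sketch(df)
```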

scikit_na.test_hypothesis(data: DataFrame, col_na: str, test_fn: callable, test_kws: dict = None, columns: Sequence[str] | Dict[str, callable] | None = None, dropna: bool = True) → Dict[str, object]

Test a statistical hypothesis.

This function can be used to find evidence against the missing completely at random (MCAR) mechanism by comparing two samples grouped by missingness in another column.

Parameters:
  • data (DataFrame) – Input data.

  • col_na (str) – Column to group values by. pandas.Series.isna() method is applied before grouping.

  • columns (Optional[Union[Sequence[str], Dict[str, callable]]]) – Columns to test hypotheses on.

  • test_fn (callable) – Function to test a hypothesis on NA/non-NA data. Must be a two-sample test function, such as scipy.stats.mannwhitneyu(), that accepts two arrays and, optionally, keyword arguments.

  • test_kws (dict, optional) – Keyword arguments passed to test_fn function.

  • dropna (bool = True, optional) – Drop NA values in two samples before running a hypothesis test.

Returns:

Dictionary of test results, mapping each column to the output of its test function.

Return type:

Dict[str, object]

Example

>>> import scikit_na as na
>>> import pandas as pd
>>> import scipy.stats as st
>>> data = pd.read_csv('some_dataset.csv')
>>> # Simple example
>>> na.test_hypothesis(
...     data,
...     col_na='some_column_with_NAs',
...     columns=['age', 'height', 'weight'],
...     test_fn=st.mannwhitneyu)
>>> # Example with `columns` as a dictionary of column => function pairs
>>> from functools import partial
>>> # Passing keyword arguments to functions
>>> kstest_mod = partial(st.kstest, N=100)
>>> mannwhitney_mod = partial(st.mannwhitneyu, use_continuity=False)
>>> # Running tests
>>> results = na.test_hypothesis(
...     data,
...     col_na='some_column_with_NAs',
...     columns={
...         'age': kstest_mod,
...         'height': mannwhitney_mod,
...         'weight': mannwhitney_mod})
>>> pd.DataFrame(results, index=['statistic', 'p-value'])