Documentation

NoGANSynthesizer is a library that generates synthetic tabular data using multivariate binning. It offers a faster, more accurate, and less complex alternative to GANs.

Class

  • NoGANSynth: Synthetic data generator that is fitted to a tabular dataset

Functions

  • wrap_category_columns: Function to compress all specified categorical columns into one
  • unwrap_category_columns: Function to expand all wrapped categorical columns

Authors

Installation

The package can be installed with

pip install nogan_synthesizer

Tests

The tests can be run by cloning the repo and running:

pytest tests

If you run into any issues running the tests, install the package locally first and then run them again:

pip install -e .

Usage

Start by importing the class, the preprocessing helpers, and the evaluation functions:

from nogan_synthesizer import NoGANSynth
from nogan_synthesizer.preprocessing import wrap_category_columns, unwrap_category_columns
from genai_evaluation import multivariate_ecdf, ks_statistic

Assume we have a pandas DataFrame of real data (real_data) that contains some categorical columns, and we want to generate synthetic data from it. We first preprocess the categorical columns. This returns the preprocessed (wrapped) real dataset together with two dictionaries: one mapping each flag vector index to its key (idx_to_key) and one mapping each key back to its index (key_to_idx).

cat_cols = [category columns list...]
wrapped_real_data, idx_to_key, key_to_idx = \
                        wrap_category_columns(real_data, cat_cols)
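To make the wrapping step concrete, here is a minimal, dependency-free sketch of the idea: each distinct combination of categorical values is replaced by one integer label, and two lookup dictionaries record the mapping. This is an illustration only, not the package's implementation; the function name `wrap_columns_sketch`, the column name `cat_label`, and the use of plain dicts instead of a DataFrame are all assumptions made for the example.

```python
def wrap_columns_sketch(rows, cat_cols, wrapped_name="cat_label"):
    """Replace several categorical fields with one integer label per row.

    Hypothetical helper for illustration; not part of nogan_synthesizer.
    """
    key_to_idx = {}  # tuple of category values -> integer label
    idx_to_key = {}  # integer label -> tuple of category values
    wrapped_rows = []
    for row in rows:
        key = tuple(row[c] for c in cat_cols)
        if key not in key_to_idx:
            idx = len(key_to_idx)
            key_to_idx[key] = idx
            idx_to_key[idx] = key
        # Keep the non-categorical fields and add the single label.
        new_row = {k: v for k, v in row.items() if k not in cat_cols}
        new_row[wrapped_name] = key_to_idx[key]
        wrapped_rows.append(new_row)
    return wrapped_rows, idx_to_key, key_to_idx


rows = [
    {"age": 34, "sex": "F", "smoker": "no"},
    {"age": 51, "sex": "M", "smoker": "yes"},
    {"age": 29, "sex": "F", "smoker": "no"},
]
wrapped, idx_to_key, key_to_idx = wrap_columns_sketch(rows, ["sex", "smoker"])
```

Rows that share the same combination of categories receive the same label, so the synthesizer only has to model one extra numeric column instead of several categorical ones.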

We then fit the NoGANSynth Model on the wrapped dataset and generate synthetic data

nogan = NoGANSynth(wrapped_real_data)
nogan.fit()

n_synth_rows = len(real_data)
synth_data = nogan.generate_synthetic_data(no_of_rows=n_synth_rows)
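For intuition about what a binning-based synthesizer does, the following is a heavily simplified sketch: count the real rows in each multivariate bin, sample bins in proportion to those counts, and then draw a point uniformly inside each sampled bin. It is not the NoGANSynth algorithm itself. The equal-width bins, the function names `fit_bins` and `sample_rows`, and the uniform within-bin sampling are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def fit_bins(rows, n_bins):
    """Count how many real rows fall into each multivariate bin."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    # Equal-width bins (an assumption); use width 1.0 for a constant column.
    widths = [((highs[j] - lows[j]) / n_bins) or 1.0 for j in range(n_cols)]

    def bin_key(row):
        # Clamp so the column maximum lands in the last bin.
        return tuple(
            min(int((row[j] - lows[j]) / widths[j]), n_bins - 1)
            for j in range(n_cols)
        )

    counts = Counter(bin_key(r) for r in rows)
    return lows, widths, counts


def sample_rows(lows, widths, counts, n_rows, rng):
    """Pick bins in proportion to their counts, then draw a point inside each."""
    keys = list(counts)
    weights = [counts[k] for k in keys]
    synth = []
    for key in rng.choices(keys, weights=weights, k=n_rows):
        synth.append(
            [lows[j] + (b + rng.random()) * widths[j] for j, b in enumerate(key)]
        )
    return synth


real = [[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0]]
lows, widths, counts = fit_bins(real, n_bins=2)
synth = sample_rows(lows, widths, counts, n_rows=6, rng=random.Random(0))
```

Because only bins that contain real rows can be sampled, the synthetic rows preserve the joint structure of the columns: in this toy data, a low first coordinate always comes with a low second coordinate.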

We can then compare the synthetic and real data distributions using the genai_evaluation package

_, ecdf_val1, ecdf_synth = \
            multivariate_ecdf(wrapped_real_data,
                              synth_data,
                              n_nodes=1000,
                              verbose=True,
                              random_seed=42)

ks_stat = ks_statistic(ecdf_val1, ecdf_synth)
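As a rough guide to what these two quantities measure, the sketch below evaluates a multivariate empirical CDF at a few nodes and takes the largest absolute gap between the two resulting vectors, which is the usual Kolmogorov-Smirnov distance. It does not reproduce the internals of genai_evaluation: the helper names `ecdf_at` and `ks_distance` are hypothetical, and the nodes are hand-picked here rather than generated from `n_nodes` and `random_seed`.

```python
def ecdf_at(rows, node):
    """Fraction of rows whose every coordinate is <= the node's coordinate."""
    hits = sum(all(x <= t for x, t in zip(row, node)) for row in rows)
    return hits / len(rows)


def ks_distance(ecdf_a, ecdf_b):
    """Largest absolute difference between two ECDFs on the same nodes."""
    return max(abs(a - b) for a, b in zip(ecdf_a, ecdf_b))


real = [[1, 1], [2, 2], [3, 3], [4, 4]]
synth = [[1, 2], [2, 1], [3, 4], [4, 3]]
nodes = [[1, 1], [2, 2], [3, 3], [4, 4]]  # hand-picked evaluation points

ecdf_real = [ecdf_at(real, n) for n in nodes]    # [0.25, 0.5, 0.75, 1.0]
ecdf_synth = [ecdf_at(synth, n) for n in nodes]  # [0.0, 0.5, 0.5, 1.0]
ks = ks_distance(ecdf_real, ecdf_synth)          # 0.25
```

A value close to 0 means the two joint distributions agree at the evaluated nodes, while a value close to 1 means they differ strongly.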

Once we are satisfied with the evaluation results, we can unwrap the generated synthetic dataset, that is, expand the wrapped column back into the original categorical columns, using the flag vector index to key dictionary (idx_to_key) returned earlier

unwrapped_synth_data = unwrap_category_columns(synth_data, idx_to_key, cat_cols)
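Conceptually, unwrapping is a dictionary lookup: each integer label is mapped back to the tuple of category values it stands for. The sketch below shows this on plain dicts. It is an illustration under assumptions, not the package's code; `unwrap_columns_sketch`, the `cat_label` column name, and the sample mapping are invented for the example.

```python
def unwrap_columns_sketch(rows, idx_to_key, cat_cols, wrapped_name="cat_label"):
    """Expand one integer label back into the original categorical fields.

    Hypothetical helper for illustration; not part of nogan_synthesizer.
    """
    unwrapped = []
    for row in rows:
        # Drop the wrapped label and restore one field per categorical column.
        new_row = {k: v for k, v in row.items() if k != wrapped_name}
        new_row.update(zip(cat_cols, idx_to_key[row[wrapped_name]]))
        unwrapped.append(new_row)
    return unwrapped


# Example mapping of the kind a wrapping step would produce (assumed values).
idx_to_key = {0: ("F", "no"), 1: ("M", "yes")}
synth_rows = [{"age": 40.2, "cat_label": 1}, {"age": 31.7, "cat_label": 0}]
restored = unwrap_columns_sketch(synth_rows, idx_to_key, ["sex", "smoker"])
```

The order of `cat_cols` must match the order used when wrapping, since each position in the stored tuple corresponds to one categorical column.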

Motivation

The motivation for this package comes from Dr. Vincent Granville's paper "Generative AI Technology Break-through: Spectacular Performance of New Synthesizer".

If you have any tips or suggestions, please contact us by email.