MilaimKas/DataGenerator
Disclaimer: AI usage

This project started as a "vibe-coding" project: the package's initial commit was entirely generated by Claude Code, from design to implementation and documentation, based on my original ideas and specifications. While I have reviewed and tested the code and the documentation, there may still be issues or limitations. Subsequent changes (and commits) have a higher human-to-AI ratio, but the code still relies heavily on Claude.

DataGenerator

Synthetic data generation using DAG-based structural causal models. I mainly use this package to create datasets where the following properties are controlled:

  • non-linearity
  • noise
  • confounding
  • causal effect

I then use the datasets to test causal inference methods or ML models, and for didactic purposes.

The structural equations originating from the DAG have the general form:

$$ node_{j} = \sum^{parents}_{i} w_{ij} f_{ij}(node_{i}; p_{ij}) + \epsilon_{j}(n_{j}) $$

where:

  • $w_{ij}$ is the weight of the edge between parent node $i$ and node $j$
  • $f_{ij}$ is a function of node $i$ with a set of parameters $p_{ij}$. It describes the type of relation between node $i$ and node $j$.
  • $\epsilon_{j}$ is a distribution with a set of parameters $n_{j}$. It describes the underlying noise distribution of node $j$.

Different types of functions/transforms are available (see below).
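As a minimal sketch of this structural equation in plain NumPy (not the package API; the node names, weights, and transforms below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Root nodes (no parents): pure noise draws
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(0.0, 1.0, size=n)

# Child node: weighted sum of per-edge transforms plus its own noise term
w = {"x1": 0.8, "x2": -0.5}           # edge weights w_ij
f = {"x1": np.tanh, "x2": np.square}  # edge transforms f_ij
noise = rng.normal(0.0, 0.5, size=n)  # epsilon_j with n_j = (0, 0.5)

y = w["x1"] * f["x1"](x1) + w["x2"] * f["x2"](x2) + noise
```

Sampling in topological order of the DAG (parents before children) is what makes each node a function of already-generated values plus independent noise.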

Categorical variables

Categorical variables are expanded into one-hot sub-nodes (one per category).

Root categorical (no parents): sampled from a multinomial distribution with user-defined probabilities:

$$ C \sim \text{Multinomial}(p_1, p_2, \ldots, p_K) $$

Child categorical (has parents): a latent score is computed for each category $k$ from the parent contributions, then a softmax converts scores to probabilities:

$$ s_k = \sum^{parents}_{i} w_{ik} f_{ik}(node_{i}; p_{ik}) $$

$$ P(C = k) = \frac{e^{s_k}}{\sum_{l=1}^{K} e^{s_l}} $$

Note that each category $k$ has its own weight $w_{ik}$ and transform $f_{ik}$ per parent, allowing different categories to respond differently to the same parent variable.

Categorical as parent: the one-hot encoding (0/1 per category) feeds into downstream nodes, each category-edge with its own weight and transform.
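The score-then-softmax mechanism can be sketched in plain NumPy (again, not the package API; the weights and transforms are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)  # continuous parent node

# Per-category weight w_ik and transform f_ik for the single parent x
weights = {"A": 0.8, "B": -0.3, "C": 0.0}
transforms = {"A": lambda v: v, "B": lambda v: v, "C": np.exp}

# Latent score s_k per category, then a row-wise softmax to probabilities
scores = np.stack([w * transforms[k](x) for k, w in weights.items()], axis=1)
probs = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized softmax
probs /= probs.sum(axis=1, keepdims=True)

# Sample one category per row, then one-hot encode for downstream nodes
cats = np.array([rng.choice(3, p=p) for p in probs])
one_hot = np.eye(3)[cats]
```

The one-hot columns are what downstream children see, so each category can carry its own edge weight and transform into a child node.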

Installation

pip install .

For full functionality (plotting, probit link):

pip install ".[full]"

Quick Start

Basic DAG Data Generation

from datagenerator import DAG, DataGenerator

# Create a DAG with confounding
dag = DAG()
dag.add_node("Z", noise_std=1.0)  # Confounder
dag.add_node("X", noise_std=0.5)
dag.add_node("Y", noise_std=0.5)
dag.add_edge("Z", "X", weight=0.8)
dag.add_edge("Z", "Y", weight=0.6)
dag.add_edge("X", "Y", weight=1.0, transform="quadratic") # quadratic relationship between X and Y

# add categorical variable
dag.add_categorical_node('cat', categories=['A', 'B', 'C'])
# add edge from continuous to categorical variable (per-category weights and transforms)
dag.add_edge('X', 'cat',
             weights={'A': 0.8, 'B': -0.3, 'C': 0.0},
             transforms={'A':'linear', 'B':'linear', 'C':'exponential'}) # exponential relationship between X and category C

# Generate data
generator = DataGenerator(dag, seed=42)
data = generator.sample(n=1000)

# Or as a dictionary
data_dict = generator.sample(n=1000, return_dict=True)

Classification Data

from datagenerator import ClassificationDataGenerator, FeatureSpec

# Generative mode: control class balance directly
gen_class = ClassificationDataGenerator(
    mode="generative",
    class_balance=0.1,  # 10% positive class
    feature_specs=[
        FeatureSpec("f0", loc_by_class=(0.0, 2.0), noise_std=1.0),
        FeatureSpec("f1", loc_by_class=(-0.5, 1.0), noise_std=1.0),
        FeatureSpec("f2", parents=["f0", "f1"], parent_weights=[1.0, 0.5],
                    output_transform="tanh", noise_std=0.5),
    ],
    n_noise_features=3,
    seed=42
)
X, y = gen_class.generate_batch(1000)
gen_class.plot_dag()

# Or generate randomly configured data
gen_class = ClassificationDataGenerator.from_random(
    n_features=10,
    n_informative=6,
    n_direct_to_y=3,
    connectivity=0.4,
    class_balance=0.2,
    mode="causal",
    seed=42
)
X, y = gen_class.generate_batch(1000)

Common DAG Patterns

from datagenerator import (
    create_chain,
    create_fork,
    create_collider,
    create_mediator,
    create_instrument,
    create_random_dag,
)

# Chain: X0 -> X1 -> X2
chain = create_chain(n_nodes=3)

# Fork (confounder): Z -> X, Z -> Y
fork = create_fork(n_children=2)

# Collider: X -> Y, Z -> Y
collider = create_collider(n_parents=2)

# Mediation: X -> M -> Y and X -> Y
mediator = create_mediator(direct_effect=0.5, indirect_effect_xm=1.0, indirect_effect_my=0.8)

# Instrumental variable: Z -> X -> Y with U -> X and U -> Y
iv = create_instrument(x_y_weight=2.0)

# Random DAG
random_dag = create_random_dag(n_nodes=5, edge_probability=0.4, seed=42)

Interventions (do-calculus)

from datagenerator import DAG, DataGenerator

dag = DAG()
dag.add_node("X", noise_std=1.0)
dag.add_node("Y", noise_std=0.5)
dag.add_edge("X", "Y", weight=1.0)

generator = DataGenerator(dag, seed=42)

# Sample with intervention do(X=2)
interventional_data = generator.sample_interventional(
    n=1000,
    interventions={"X": 2.0},
    return_dict=True
)

# Also possible with classification data in causal mode
# Be careful when interpreting the results
interventional_data = gen_class.sample_interventional(
    n=1000,
    interventions={"f1": 2.0},
    return_dataframe=True
)

Non-linear Transforms

Available transforms: linear, quadratic, cubic, sigmoid, tanh, sin, exp, log, relu, leaky_relu, threshold

from datagenerator import DAG, PolynomialTransform, CompositeTransform, SigmoidTransform

dag = DAG()
dag.add_node("X")
dag.add_node("Y")

# Using string name
dag.add_edge("X", "Y", weight=1.0, transform="quadratic")

# Using transform instance
dag.add_edge("X", "Y", weight=1.0, transform=PolynomialTransform(degrees=[1, 2, 3]))

# Composing transforms
dag.add_edge("X", "Y", weight=1.0, transform=CompositeTransform([
    PolynomialTransform(degrees=[2]),
    SigmoidTransform(scale=0.5)
]))

print(dag.show_equations())

Noise Distributions

from datagenerator import DAG, GaussianNoise, LaplacianNoise, StudentTNoise, MixtureNoise

dag = DAG()

# Using convenience parameters
dag.add_node("X", noise_std=1.0)  # Gaussian
dag.add_node("Y", noise_type="laplacian", noise_std=0.5)
dag.add_node("Z", noise_type="student_t", noise_std=1.0, noise_params={"df": 3})

# Using noise generator instances
dag.add_node("W", noise=MixtureNoise(
    components=[GaussianNoise(std=1.0), LaplacianNoise(scale=2.0)],
    weights=[0.7, 0.3]
))

print(dag.show_equations())

Visualization

# Requires matplotlib
generator.dag.plot(figsize=(10, 8), show_weights=True)

# ASCII representation
print(gen_class.dag.to_ascii())

# Detailed description
print(dag.describe())

# display structural equations (both for DataGenerator and ClassificationDataGenerator)
print(gen_class.show_equations())
print(generator.show_equations())

# For ClassificationDataGenerator
gen_class.plot_dag()

Features

  • Flexible DAG construction with automatic cycle detection
  • Multiple noise distributions: Gaussian, Uniform, Laplacian, Student's t, Mixture
  • Non-linear edge transformations: polynomial, sigmoid, tanh, sinusoidal, exponential, log, ReLU, threshold
  • Interventional sampling for causal inference experiments
  • Classification data generation with generative or causal modes
  • Common DAG patterns: chain, fork, collider, mediator, instrumental variable
  • Visualization with matplotlib or ASCII

License

MIT

Next Steps

  • Add free text variable support
  • Add categorical-to-categorical edge support

Known bugs and inconsistencies

  • sample_interventional() uses different return-format parameters in DataGenerator (return_dict) and ClassificationDataGenerator (return_dataframe)
