MilaimKas/DataGenerator
Disclaimer: AI usage

This project started as a "vibe-coding" project: the package's initial commit was entirely generated by Claude Code, from design to implementation and documentation, based on my original ideas and specifications. While I have reviewed and tested the code and the documentation, there may still be issues or limitations. Subsequent changes (and commits) have a higher human-to-AI ratio, but the code still relies heavily on Claude.

DataGenerator

Synthetic data generation using DAG-based structural causal models. I mainly use this package to create datasets where the following properties are controlled:

  • non-linearity
  • noise
  • confounding
  • causal effect

I then use the datasets to test causal inference methods or ML models, and for didactic purposes.

The structural equations originating from the DAG have the general form:

$$ node_{j} = \sum^{parents}_{i} w_{ij} f_{ij}(node_{i}; p_{ij}) + \epsilon_{j}(n_{j}) $$

where:

  • $w_{ij}$ is the weight of the edge between parent node $i$ and node $j$
  • $f_{ij}$ is a function of node $i$ with a set of parameters $p_{ij}$. It describes the type of relation between node $i$ and node $j$.
  • $\epsilon_{j}$ is a distribution with a set of parameters $n_{j}$. It describes the underlying noise distribution of node $j$.

Different types of functions/transforms are available (see below).
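As a minimal sketch of this structural equation in plain NumPy (not the package API; the node names, weights, and transforms below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Root nodes (no parents): pure noise draws
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(0.0, 1.0, size=n)

# Child node: weighted sum of per-edge transforms plus its own noise term
w = {"x1": 0.8, "x2": -0.5}           # edge weights w_ij
f = {"x1": np.tanh, "x2": np.square}  # edge transforms f_ij
noise = rng.normal(0.0, 0.5, size=n)  # epsilon_j with n_j = (0, 0.5)

y = w["x1"] * f["x1"](x1) + w["x2"] * f["x2"](x2) + noise
```

Sampling in topological order of the DAG (parents before children) is what makes each node a function of already-generated values plus independent noise.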

Categorical variables

Categorical variables are expanded into one-hot sub-nodes (one per category).

Root categorical (no parents): sampled from a multinomial distribution with user-defined probabilities:

$$ C \sim \text{Multinomial}(p_1, p_2, \ldots, p_K) $$

Child categorical (has parents): a latent score is computed for each category $k$ from the parent contributions, then a softmax converts scores to probabilities:

$$ s_k = \sum^{parents}_{i} w_{ik} f_{ik}(node_{i}; p_{ik}) $$

$$ P(C = k) = \frac{e^{s_k}}{\sum_{l=1}^{K} e^{s_l}} $$

Note that each category $k$ has its own weight $w_{ik}$ and transform $f_{ik}$ per parent, allowing different categories to respond differently to the same parent variable.

Categorical as parent: the one-hot encoding (0/1 per category) feeds into downstream nodes, each category-edge with its own weight and transform.
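The score-then-softmax mechanism can be sketched in plain NumPy (again, not the package API; the weights and transforms are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)  # continuous parent node

# Per-category weight w_ik and transform f_ik for the single parent x
weights = {"A": 0.8, "B": -0.3, "C": 0.0}
transforms = {"A": lambda v: v, "B": lambda v: v, "C": np.exp}

# Latent score s_k per category, then a row-wise softmax to probabilities
scores = np.stack([w * transforms[k](x) for k, w in weights.items()], axis=1)
probs = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized softmax
probs /= probs.sum(axis=1, keepdims=True)

# Sample one category per row, then one-hot encode for downstream nodes
cats = np.array([rng.choice(3, p=p) for p in probs])
one_hot = np.eye(3)[cats]
```

The one-hot columns are what downstream children see, so each category can carry its own edge weight and transform into a child node.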

Installation

pip install .

For full functionality (plotting, probit link):

pip install ".[full]"

Quick Start

Basic DAG Data Generation

from datagenerator import DAG, DataGenerator

# Create a DAG with confounding
dag = DAG()
dag.add_node("Z", noise_std=1.0)  # Confounder
dag.add_node("X", noise_std=0.5)
dag.add_node("Y", noise_std=0.5)
dag.add_edge("Z", "X", weight=0.8)
dag.add_edge("Z", "Y", weight=0.6)
dag.add_edge("X", "Y", weight=1.0, transform="quadratic") # quadratic relationship between X and Y

# add categorical variable
dag.add_categorical_node('cat', categories=['A', 'B', 'C'])
# add edge from continuous to categorical variable (per-category weights and transforms)
dag.add_edge('X', 'cat',
             weights={'A': 0.8, 'B': -0.3, 'C': 0.0},
             transforms={'A':'linear', 'B':'linear', 'C':'exponential'}) # exponential relationship between X and category C

# Generate data
generator = DataGenerator(dag, seed=42)
data = generator.sample(n=1000)

# Or as a dictionary
data_dict = generator.sample(n=1000, return_dict=True)

Classification Data

from datagenerator import ClassificationDataGenerator, FeatureSpec

# Generative mode: control class balance directly
gen_class = ClassificationDataGenerator(
    mode="generative",
    class_balance=0.1,  # 10% positive class
    feature_specs=[
        FeatureSpec("f0", loc_by_class=(0.0, 2.0), noise_std=1.0),
        FeatureSpec("f1", loc_by_class=(-0.5, 1.0), noise_std=1.0),
        FeatureSpec("f2", parents=["f0", "f1"], parent_weights=[1.0, 0.5],
                    output_transform="tanh", noise_std=0.5),
    ],
    n_noise_features=3,
    seed=42
)
X, y = gen_class.generate_batch(1000)
gen_class.plot_dag()

# Or generate randomly configured data
gen_class = ClassificationDataGenerator.from_random(
    n_features=10,
    n_informative=6,
    n_direct_to_y=3,
    connectivity=0.4,
    class_balance=0.2,
    mode="causal",
    seed=42
)
X, y = gen_class.generate_batch(1000)

Common DAG Patterns

from datagenerator import (
    create_chain,
    create_fork,
    create_collider,
    create_mediator,
    create_instrument,
    create_random_dag,
)

# Chain: X0 -> X1 -> X2
chain = create_chain(n_nodes=3)

# Fork (confounder): Z -> X, Z -> Y
fork = create_fork(n_children=2)

# Collider: X -> Y, Z -> Y
collider = create_collider(n_parents=2)

# Mediation: X -> M -> Y and X -> Y
mediator = create_mediator(direct_effect=0.5, indirect_effect_xm=1.0, indirect_effect_my=0.8)

# Instrumental variable: Z -> X -> Y with U -> X and U -> Y
iv = create_instrument(x_y_weight=2.0)

# Random DAG
random_dag = create_random_dag(n_nodes=5, edge_probability=0.4, seed=42)

Interventions (do-calculus)

from datagenerator import DAG, DataGenerator

dag = DAG()
dag.add_node("X", noise_std=1.0)
dag.add_node("Y", noise_std=0.5)
dag.add_edge("X", "Y", weight=1.0)

generator = DataGenerator(dag, seed=42)

# Sample with intervention do(X=2)
interventional_data = generator.sample_interventional(
    n=1000,
    interventions={"X": 2.0},
    return_dict=True
)

# Also possible with classification data in causal mode
# Be careful when interpreting the results
interventional_data = gen_class.sample_interventional(
    n=1000,
    interventions={"f1": 2.0},
    return_dataframe=True
)

Non-linear Transforms

Available transforms: linear, quadratic, cubic, sigmoid, tanh, sin, exp, log, relu, leaky_relu, threshold

from datagenerator import DAG, PolynomialTransform, CompositeTransform, SigmoidTransform

dag = DAG()
dag.add_node("X")
dag.add_node("Y")

# Using string name
dag.add_edge("X", "Y", weight=1.0, transform="quadratic")

# Using transform instance
dag.add_edge("X", "Y", weight=1.0, transform=PolynomialTransform(degrees=[1, 2, 3]))

# Composing transforms
dag.add_edge("X", "Y", weight=1.0, transform=CompositeTransform([
    PolynomialTransform(degrees=[2]),
    SigmoidTransform(scale=0.5)
]))

print(dag.show_equations())

Noise Distributions

from datagenerator import DAG, GaussianNoise, LaplacianNoise, StudentTNoise, MixtureNoise

dag = DAG()

# Using convenience parameters
dag.add_node("X", noise_std=1.0)  # Gaussian
dag.add_node("Y", noise_type="laplacian", noise_std=0.5)
dag.add_node("Z", noise_type="student_t", noise_std=1.0, noise_params={"df": 3})

# Using noise generator instances
dag.add_node("W", noise=MixtureNoise(
    components=[GaussianNoise(std=1.0), LaplacianNoise(scale=2.0)],
    weights=[0.7, 0.3]
))

print(dag.show_equations())

Visualization

# Requires matplotlib
generator.dag.plot(figsize=(10, 8), show_weights=True)

# ASCII representation
print(gen_class.dag.to_ascii())

# Detailed description
print(dag.describe())

# display structural equations (both for DataGenerator and ClassificationDataGenerator)
print(gen_class.show_equations())
print(generator.show_equations())

# For ClassificationDataGenerator
gen_class.plot_dag()

Features

  • Flexible DAG construction with automatic cycle detection
  • Multiple noise distributions: Gaussian, Uniform, Laplacian, Student's t, Mixture
  • Non-linear edge transformations: polynomial, sigmoid, tanh, sinusoidal, exponential, log, ReLU, threshold
  • Interventional sampling for causal inference experiments
  • Classification data generation with generative or causal modes
  • Common DAG patterns: chain, fork, collider, mediator, instrumental variable
  • Visualization with matplotlib or ASCII

License

MIT

Next Steps

  • Add free text variable support
  • Add categorical-to-categorical edge support

Known bugs and inconsistencies

  • sample_interventional() uses different return-format parameters in DataGenerator (return_dict) and ClassificationDataGenerator (return_dataframe)
