Metadata-Version: 2.4
Name: dods-datagen
Version: 0.1.1
Summary: Reusable autonomous data generator for the DODS toolkit
Author-email: Maximilian Häusler <mahaeu@yahoo.de>
License: Apache-2.0
Project-URL: Homepage, https://github.com/mahaeu
Project-URL: Repository, https://github.com/mahaeu/dods-datagen
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: pandas<3,>=1.5
Requires-Dist: numpy<2,>=1.22
Dynamic: license-file

# dods-datagen

Config-driven synthetic data generator for the DODS toolkit.

Generate CSV datasets from a YAML template:
- categorical + numeric features
- derived features (expressions referencing earlier columns)
- optional noise, clipping, missing values, outliers

## Install

```bash
pip install dods-datagen
```

## Quick start

```python
import dods.datagen as dg

# 1) Create a starter template (basic)
dg.create_template()           # -> template.yaml

# 2) Generate a dataset from that template
df = dg.run("template")        # reads template.yaml, writes CSV as configured inside YAML
```

Use the long template (with more explanations/examples):

```python
import dods.datagen as dg

dg.create_template("template_long")  # -> template_long.yaml
df = dg.run("template_long")         # reads template_long.yaml
```

Run from any path:

```python
import dods.datagen as dg

df = dg.run("path/to/my_config")     # reads path/to/my_config.yaml
df = dg.run("path/to/my_config.yaml")
```
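
Both calls above refer to the same file. The sketch below shows one way such name resolution could work; `resolve_config_path` is a hypothetical helper written for illustration, not part of the `dods.datagen` API, and the suffix rule is an assumption inferred from the examples above.

```python
from pathlib import Path


def resolve_config_path(name: str) -> Path:
    """Return the YAML path for a config name (hypothetical helper).

    Assumption: a name without a YAML suffix gets ".yaml" appended;
    a name that already ends in ".yaml" or ".yml" is used as-is.
    """
    path = Path(name)
    if path.suffix.lower() not in {".yaml", ".yml"}:
        path = path.with_name(path.name + ".yaml")
    return path


print(resolve_config_path("template").as_posix())                # template.yaml
print(resolve_config_path("path/to/my_config").as_posix())       # path/to/my_config.yaml
print(resolve_config_path("path/to/my_config.yaml").as_posix())  # path/to/my_config.yaml
```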


## Long template (annotated)

```yaml
# =====================================================================
# 🧩 DODS Data Generator: template_long.yaml (single "features:" list)
# ---------------------------------------------------------------------
# This template shows the recommended structure for your generator:
#   1) Meta (scenario/seed/n/output)
#   2) Features (categorical + numeric + derived) — evaluated top to bottom
#   3) Optional: noise, clip, postprocess (missing/outliers)
#
# Key rule:
# - If you want multi-line formulas, use YAML "|" (literal block) AND wrap
#   the expression in parentheses "( ... )".
#
# Why?
# - Your parser will wrap expressions into a Python lambda:
#     lambda income, score: <EXPR>
# - A lambda body must be a single expression.
# - Parentheses make multi-line math still "one expression".
# =====================================================================

# ------------------------------------------------------------
# 0) META
# ------------------------------------------------------------
scenario: synthetic
seed: 42
n: 50000

output_path: "data"
filename: "template_long.csv"

# ------------------------------------------------------------
# 1) FEATURES (evaluated in order)
# ------------------------------------------------------------
features:
  # -------------------------
  # 1.1 CATEGORICAL — sampled from a list
  # -------------------------
  - city:
      type: categorical
      values: ["Berlin", "London", "Rome"]            # 3 categories
      p: [0.40, 0.30, 0.30]                           # optional (default uniform)

  - plan:
      type: categorical
      values: ["free", "basic", "pro", "enterprise"]  # 4 categories (different from city)
      p: [0.55, 0.25, 0.15, 0.05]

  # -------------------------
  # 1.2 NUMERIC — primitives (all shortcuts + sequences)
  # Notes:
  # - Shortcuts are expanded automatically:
  #     randint(...)  -> np.random.randint(...)
  #     normal(...)   -> np.random.normal(...)
  #     uniform(...)  -> np.random.uniform(...)
  #     choice(...)   -> np.random.choice(...)
  #
  # - Sequence helpers:
  #     range(a,b) / arange(a,b)  -> np.arange(a,b)
  #     seq(a,b)                  -> np.linspace(a,b,n)
  #     linspace(a,b)             -> np.linspace(a,b,n)   (if only 2 args given)
  #
  # - Format:
  #     Only lambda expressions need quotation marks!
  # -------------------------
  - age: randint(18, 70, n)                           # integers in [18, 70)
  - rate: uniform(1.0, 5.0, n)                        # floats in [1, 5)
  - var: normal(0, 1, n)                              # gaussian noise (mean=0, std=1)
  - dev: choice([0.6, 0.9, 1.2], n)                   # pick from a fixed set
  - extra: "lambda n: np.random.exponential(2.0, n)"  # custom lambda (needs quotes)

  # Sequence examples (useful for trends/seasonality/time)
  - idx: range(0, n)                                  # 0..n-1 (np.arange)
  - wave: np.sin(age)                                 # uses existing column (robust with your engine)

  # -------------------------
  # 1.3 NUMERIC — mixing (start simple → then richer)
  #
  # Tip:
  # - Comparisons like (city=='Rome') become True/False → behave like 1/0 in math.
  # - Keep "inc" deterministic here; add randomness later via "noise:".
  # -------------------------

  # 1) Simple: purely numeric dependency
  - score: age * rate

  # 2) Slightly richer: numeric + one categorical factor (still one-liner)
  - inc: 2000 + 50*age + (city=='Rome')*400

  # 3) Richer: numeric + multiple categorical factors (multi-line, wrapped in parentheses)
  - inc_adj: |
      (inc
       + (plan=='pro')*250
       + (plan=='enterprise')*800)

  # -------------------------
  # 1.4 CATEGORICAL — derived from numeric
  # -------------------------
  - segment:
      type: categorical
      func: np.where(age >= 45, 'senior', 'junior')

  # -------------------------
  # 1.5 CATEGORICAL — derived from categorical (mapping)
  # -------------------------
  - region:
      type: categorical
      func: |
        np.where(city=='Berlin','DE',
        np.where(city=='London','UK','IT'))

  # -------------------------
  # 1.6 NUMERIC — derived formula (multi-line, complex mixing)
  #
  # IMPORTANT:
  # - Use "|" to keep line breaks (readability)
  # - Wrap expression in parentheses "( ... )" so it stays ONE expression
  #   when your parser converts it into a lambda.
  # -------------------------
  - spend: |
      (inc_adj * 0.6
       + rate * 12
       + (plan=='pro')*250
       + (plan=='enterprise')*600
       - score * 0.15
       + wave * 25)

  # Another derived numeric (one-liner)
  - widx: (inc_adj + score) / (spend + 1)

# ------------------------------------------------------------
# 2) NOISE (optional)
# Applied AFTER the column is computed.
# Integer columns may be converted to float once noise is added.
#
# Accepted formats per column:
#   - scalar value         → stddev
#   - [mean, stddev]       → tuple
#   - lambda / expression  → custom generator
# ------------------------------------------------------------
noise:
  inc_adj: 150                                     # stddev = 150 (applied to the whole column)
  age: [0, 1.5]                                    # mean=0, stddev=1.5
  score: "lambda n: np.random.normal(0, 3, n)"     # custom generator (must return array-like of length n)

# ------------------------------------------------------------
# 3) CLIP (optional)
# Enforce min/max bounds per feature: [min, max]
# ------------------------------------------------------------
clip:
  age: [18, 70]
  inc: [0, 20000]
  inc_adj: [0, 25000]
  score: [0, 500000]
  spend: [0, 50000]
  widx: [0, 1000000]

# ------------------------------------------------------------
# 4) POSTPROCESS (optional)
# Missing values:
#   - default_ratio: fallback ratio for all columns not listed below
#   - columns: per-column missing ratio (0.0–1.0)
#
# Outliers:
#   - default_ratio: fallback ratio for all columns not listed below
#   - columns: { ratio, magnitude } per column
#     magnitude > 1 → values become larger (scaled up)
#     magnitude < 1 → values become smaller (scaled down)
# ------------------------------------------------------------
postprocess:
  missing:
    default_ratio: 0.0
    columns:
      inc_adj: 0.02                                 # 2% missing
      city: 0.01                                    # 1% missing
      plan: 0.01                                    # 1% missing

  outliers:
    default_ratio: 0.0
    columns:
      inc_adj: { ratio: 0.01, magnitude: 2.5 }       # 1% extreme high values
      age: { ratio: 0.02, magnitude: 0.7 }           # 2% unusually low ages (scaled down)
      spend: { ratio: 0.01, magnitude: 1.8 }         # 1% high spending spikes

# =====================================================================
# END
# =====================================================================

```
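
The key rule in the template header (use `|` and wrap the expression in parentheses) follows from how Python parses a lambda body. The sketch below illustrates this with plain `eval`; `build_lambda` is a hypothetical stand-in for the generator's parser, whose actual implementation may differ.

```python
# Multi-line expression as it would arrive from a YAML "|" block.
wrapped = """(inc
 + (plan == 'pro') * 250
 + (plan == 'enterprise') * 800)
"""

# The same expression without the surrounding parentheses.
unwrapped = """inc
 + (plan == 'pro') * 250
 + (plan == 'enterprise') * 800
"""


def build_lambda(expr: str, columns: list[str]):
    """Wrap an expression string into a lambda (illustrative only)."""
    source = f"lambda {', '.join(columns)}: {expr.strip()}"
    return eval(source)


# Parentheses let the expression continue across lines.
func = build_lambda(wrapped, ["inc", "plan"])
print(func(3000, "pro"))  # 3250

# Without them, the lambda body ends at the first line break.
try:
    build_lambda(unwrapped, ["inc", "plan"])
except SyntaxError as exc:
    print("unwrapped expression fails:", type(exc).__name__)
```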

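To make sections 2 to 4 of the template concrete, here is a minimal NumPy sketch of what noise, clipping, outliers, and missing values could do to a single column. It assumes additive Gaussian noise, multiplicative outliers, and the processing order noise, clip, outliers, missing; `add_noise` and `pick_rows` are hypothetical helpers, and the generator's real behavior may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000


def add_noise(values, spec, rng):
    """Add noise given a scalar, a [mean, stddev] pair, or a callable f(n)."""
    values = np.asarray(values, dtype=float)   # integer columns become float
    if callable(spec):
        return values + np.asarray(spec(len(values)))
    mean, std = spec if isinstance(spec, (list, tuple)) else (0.0, spec)
    return values + rng.normal(mean, std, len(values))


def pick_rows(size, ratio, rng):
    """Choose round(ratio * size) distinct row positions at random."""
    return rng.choice(size, size=int(round(ratio * size)), replace=False)


age = rng.integers(18, 70, n)                  # integer column

# 2) noise: age: [0, 1.5]
age = add_noise(age, [0, 1.5], rng)

# 3) clip: age: [18, 70]
age = np.clip(age, 18, 70)
in_bounds = bool(age.min() >= 18 and age.max() <= 70)

# 4) postprocess / outliers: age: { ratio: 0.02, magnitude: 0.7 }
outlier_rows = pick_rows(n, 0.02, rng)
age[outlier_rows] *= 0.7                       # magnitude < 1 scales down

# 4) postprocess / missing, using an illustrative 1% ratio
missing_rows = pick_rows(n, 0.01, rng)
age[missing_rows] = np.nan

print(age.dtype, in_bounds)                          # float64 True
print(len(outlier_rows), int(np.isnan(age).sum()))   # 200 100
```

Under these assumptions, outliers are injected after clipping, so scaled-down ages can fall below the clip minimum.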

## Notes

- Features are evaluated in order, top to bottom: define a column before any feature that references it.
- A multi-line expression must still form a single Python expression:
  use a YAML `|` block and wrap the expression in `( ... )`.
- The CSV output location is defined inside the YAML via:
  - `output_path`
  - `filename`
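
The ordering rule in the first note can be pictured as a loop that evaluates each expression against the columns built so far. This is a simplified sketch, not the generator's actual engine: the `helpers` mapping and the use of `eval` are assumptions for illustration, and only two of the template's shortcuts are modeled.

```python
import numpy as np

n = 5
rng = np.random.default_rng(42)

# Names an expression may use besides earlier columns (illustrative subset).
helpers = {
    "np": np,
    "n": n,
    "randint": lambda low, high, size: rng.integers(low, high, size),
    "uniform": lambda low, high, size: rng.uniform(low, high, size),
}

# Order matters: each expression may only reference columns above it.
features = [
    ("age", "randint(18, 70, n)"),
    ("rate", "uniform(1.0, 5.0, n)"),
    ("score", "age * rate"),
    ("wave", "np.sin(age)"),
]

columns = {}
for name, expr in features:
    # Earlier columns are visible to the expression as plain variables.
    columns[name] = eval(expr, {**helpers, **columns})

print(list(columns))  # ['age', 'rate', 'score', 'wave']

# Referencing a column before it exists fails.
try:
    eval("score * 2", {**helpers})
except NameError as exc:
    print("not defined yet:", exc)
```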

## Links

- Repository: https://github.com/mahaeu/dods-datagen
