Metadata-Version: 2.4
Name: fakesmith
Version: 0.1.1
Summary: Generate realistic fake data that mirrors your real data's shape — safe to share with LLMs.
License: MIT
Keywords: fake data,data masking,privacy,testing,mock data,llm safety
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Security
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faker>=24.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

# FakeSmith

[![PyPI Version](https://img.shields.io/pypi/v/fakesmith.svg)](https://pypi.org/project/fakesmith/)
[![Python Versions](https://img.shields.io/pypi/pyversions/fakesmith.svg)](https://pypi.org/project/fakesmith/)
[![Tests](https://github.com/jeffasante/fakesmith/actions/workflows/tests.yml/badge.svg)](https://github.com/jeffasante/fakesmith/actions/workflows/tests.yml)
[![License](https://img.shields.io/github/license/jeffasante/fakesmith.svg)](https://github.com/jeffasante/fakesmith/blob/main/LICENSE)


> Generate realistic fake data that **mirrors your real data's shape** — safe to share with LLMs, teammates, or in public repos.

FakeSmith is a Python package and CLI that converts real configs, payloads, logs, and datasets into schema-preserving synthetic versions that are safe to share with LLMs. It exists because sanitizing real developer artifacts before sharing them with an LLM is a real and growing workflow problem.

When you share code with an AI assistant, you shouldn't have to expose real emails, API keys, card numbers, or user data. FakeSmith lets you describe (or just paste) a sample of your data and instantly get structurally identical but completely fake replacements.

---

## Install

```bash
pip install fakesmith
```

---

## Quick Start

### Option 1 — Auto-detect from a sample

```python
from fakesmith import FakeSmith

# Paste a real (or representative) sample — FakeSmith reads its shape
sample = '''[{
    "user_id": "3f2e1a4b-0000-0000-0000-000000000000",
    "email": "john.doe@company.com",
    "phone": "+1-800-555-0199",
    "api_key": "sk-abc123def456ghi789jkl012",
    "amount": 199.99,
    "status": "active",
    "created_at": "2024-01-15T09:30:00"
}]'''

smith = FakeSmith.from_sample(sample)
smith.describe()          # see what was detected
print(smith.to_json(5))   # 5 fake records, same shape
```
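
To make "reads its shape" more concrete, here is a minimal, standard-library-only sketch of how a sample's fields could be mapped to coarse type labels. It is not FakeSmith's actual detection logic: `PATTERNS`, `guess_type`, and `infer_shape` are hypothetical names, and the regular expressions are assumptions chosen for illustration.

```python
import json
import re

# Hypothetical detectors, tried in order. FakeSmith's real rules may differ.
PATTERNS = [
    ("UUID", re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")),
    ("EMAIL", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("DATETIME", re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")),
    ("API_KEY", re.compile(r"^sk-[A-Za-z0-9]{8,}$")),
]


def guess_type(value):
    """Return a coarse type label for a single sample value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        for label, pattern in PATTERNS:
            if pattern.match(value):
                return label
    return "SENTENCE"  # fallback when nothing more specific matches


def infer_shape(sample_json):
    """Map each field of the first record to a guessed type label."""
    records = json.loads(sample_json)
    first = records[0] if isinstance(records, list) else records
    return {key: guess_type(value) for key, value in first.items()}


sample = (
    '[{"user_id": "3f2e1a4b-0000-0000-0000-000000000000", '
    '"email": "john.doe@company.com", "amount": 199.99, "status": "active"}]'
)
shape = infer_shape(sample)
print(shape)
```

In this sketch a free-form string such as `"active"` falls through to the generic `SENTENCE` label, which is the kind of guess that an explicit override is meant to correct.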

### Option 2 — Define the schema explicitly

```python
from fakesmith import FakeSmith, SchemaField, FieldType

smith = FakeSmith([
    SchemaField("user_id",  FieldType.UUID),
    SchemaField("email",    FieldType.EMAIL),
    SchemaField("name",     FieldType.FULL_NAME),
    SchemaField("amount",   FieldType.AMOUNT, min_value=10, max_value=5000),
    SchemaField("status",   FieldType.STATUS, choices=["active", "inactive", "pending"]),
    SchemaField("api_key",  FieldType.API_KEY, prefix="sk-live-"),
])

# Generate deterministic records with a seed
result = smith.generate(10, seed=42)
result.print_summary()  # See which fields were faked
records = result.records # Access the list of dicts
```
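
The idea behind seeded generation can be sketched with the standard library alone: a private random number generator created from the seed yields the same records on every run. This is an illustration of the general technique, not FakeSmith's implementation, and the `generate` function below is a hypothetical stand-in.

```python
import random
import uuid


def generate(count, seed=None):
    """Build `count` fake records from a private, optionally seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    records = []
    for _ in range(count):
        records.append({
            "user_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.uniform(10, 5000), 2),
            "status": rng.choice(["active", "inactive", "pending"]),
        })
    return records


first = generate(3, seed=42)
second = generate(3, seed=42)
print(first == second)  # same seed, same records
```

Reproducible output of this kind is useful for test fixtures and for diffs that should not change between runs.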

### Option 3 — Quick dict shorthand

```python
from fakesmith import FakeSmith, FieldType

smith = FakeSmith.from_dict({
    "id":       FieldType.UUID,
    "email":    FieldType.EMAIL,
    "score":    FieldType.INTEGER,
    "verified": FieldType.BOOLEAN,
})
```

---

## Output Formats

```python
smith.to_json(10)                          # JSON string
smith.to_csv(10)                           # CSV string
smith.to_sql(10, table_name="users")       # SQL INSERT statements
smith.to_env()                             # .env file format

smith.save_json("fake_users.json", 100)    # save to file
smith.save_csv("fake_users.csv",  100)
smith.save_sql("seed.sql",        100, table_name="users")
smith.save_env(".env.fake")
```
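
As a rough picture of what the CSV and SQL serializers produce, the sketch below renders a list of record dicts with the standard library. The helper names (`to_csv`, `to_sql`, `sql_literal`) and the quoting rules are assumptions made for illustration. FakeSmith's own output may differ in quoting and layout.

```python
import csv
import io


def to_csv(records):
    """Render a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def sql_literal(value):
    """Format a Python value as a SQL literal, doubling single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before the numeric types
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_sql(records, table_name):
    """Render one INSERT statement per record."""
    statements = []
    for record in records:
        columns = ", ".join(record.keys())
        values = ", ".join(sql_literal(v) for v in record.values())
        statements.append(f"INSERT INTO {table_name} ({columns}) VALUES ({values});")
    return "\n".join(statements)


records = [{"id": 1, "name": "Pat O'Neil", "verified": True}]
print(to_sql(records, table_name="users"))
print(to_csv(records))
```

The sketch interpolates the table and column names directly, which is acceptable for seed files built from trusted field names but not for untrusted input.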

---

## CLI

```bash
# Generate 20 fake records from a JSON sample
fakesmith generate --file real_sample.json --count 20 --format json

# From CSV, output as SQL inserts
fakesmith generate --file data.csv --count 50 --format sql --table transactions

# Deterministic output using a seed
fakesmith generate --file data.json --seed 42 --out fake_data.json

# Sanitize raw text (log lines, configs) in-place
fakesmith sanitize --file server.log --out clean.log --summary

# Inspect detected schema and sensitivity flags
fakesmith describe --file data.json
```

---

## In-place Sanitization

FakeSmith can scan raw text (log lines, configuration blocks, or emails) and replace PII/secrets in-place without needing a schema.

```python
from fakesmith import sanitize_text

raw_text = "My email is alex@example.com and my key is sk-12345"
result = sanitize_text(raw_text, seed=42)

print(result.sanitized)
# "My email is fake.user@domain.com and my key is sk-a1b2c3d4..."

result.print_summary() # See exactly what was replaced and why
```
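
Schema-free sanitization of this kind is commonly built on pattern matching. The sketch below shows the general approach with two regular expressions and a seeded RNG. It is a simplified stand-in, not the logic behind `sanitize_text`: the patterns, the `sanitize` and `random_token` helpers, and the `example.org` replacement domain are all assumptions.

```python
import random
import re
import string

# Deliberately simple patterns for illustration; real detectors need more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]+")


def random_token(rng, alphabet, length, avoid=None):
    """Draw a random token, retrying in the unlikely case it equals `avoid`."""
    while True:
        token = "".join(rng.choices(alphabet, k=length))
        if token != avoid:
            return token


def sanitize(text, seed=None):
    """Replace emails and sk- keys; return the new text and a replacement log."""
    rng = random.Random(seed)
    replaced = []

    def fake_email(match):
        fake = random_token(rng, string.ascii_lowercase, 8) + "@example.org"
        replaced.append((match.group(0), fake))
        return fake

    def fake_key(match):
        body = match.group(0)[len("sk-"):]
        alphabet = string.ascii_lowercase + string.digits
        # Keep the prefix and length so the fake key has the same shape.
        fake = "sk-" + random_token(rng, alphabet, len(body), avoid=body)
        replaced.append((match.group(0), fake))
        return fake

    text = EMAIL_RE.sub(fake_email, text)
    text = KEY_RE.sub(fake_key, text)
    return text, replaced


raw_text = "My email is alex@example.com and my key is sk-12345"
clean, replaced = sanitize(raw_text, seed=42)
print(clean)
for original, fake in replaced:
    print(f"{original} -> {fake}")
```

Keeping a log of `(original, fake)` pairs is what makes a summary of replacements possible. The log itself contains the real values, so it should stay local.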

---

## Run the Samples

Try out FakeSmith on the included sample datasets (JSON, CSV, and .env) using the demo script:

1. **Setup Environment**
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   pip install faker pytest
   ```

2. **Run the Samples**
   To run any script in the `examples/` folder against a source checkout, add the repository root (the current directory) to `PYTHONPATH`:
   
   ```bash
   # Set PYTHONPATH to the root so Python can find the 'fakesmith' package
   export PYTHONPATH=$PYTHONPATH:.
   
   # Run the main demo
   python3 examples/demo_all.py
   
   # Or run any individual sample
   python3 examples/export_to_sql_csv.py
   python3 examples/sanitize_logs_in_place.py
   ```

3. **Explore the examples/ directory**
   The `examples/` folder contains several targeted scripts illustrating different features (auto-detection, manual schemas, in-place sanitization, etc.).

---

## Override Auto-Detection

```python
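from fakesmith import FakeSmith, SchemaField, FieldType

# Illustrative sample; substitute your own JSON string
my_json = '[{"status": "open", "balance": 1250.75}]'

smith = FakeSmith.from_sample(
    my_json,
    overrides={
        # "status" was auto-detected as SENTENCE; override it to a proper STATUS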
        "status": SchemaField("status", FieldType.STATUS, choices=["open", "closed", "resolved"]),
        # Keep a realistic amount range
        "balance": SchemaField("balance", FieldType.AMOUNT, min_value=0, max_value=100000),
    }
)
```

---

## Custom Fields

```python
import random

from fakesmith import FakeSmith, SchemaField, FieldType

smith = FakeSmith([
    SchemaField("ref_code", FieldType.CUSTOM,
        generator=lambda: f"REF-{random.randint(10000, 99999)}"
    ),
    SchemaField("tier", FieldType.CUSTOM,
        generator=lambda: random.choice(["bronze", "silver", "gold", "platinum"])
    ),
])
```
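
The generators above draw from the module-level `random` state. This README does not say whether the `seed` passed to `generate` also controls that state, so if custom values need to be reproducible, one option is to give each generator its own seeded RNG. The factory below is a hypothetical helper, not part of FakeSmith.

```python
import random


def make_ref_code_generator(seed=None):
    """Return a zero-argument callable backed by its own private RNG."""
    rng = random.Random(seed)
    return lambda: f"REF-{rng.randint(10000, 99999)}"


gen_a = make_ref_code_generator(seed=7)
gen_b = make_ref_code_generator(seed=7)
codes_a = [gen_a() for _ in range(3)]
codes_b = [gen_b() for _ in range(3)]
print(codes_a == codes_b)  # two generators with the same seed agree
```

Because the returned callable takes no arguments, it has the same shape as the lambdas passed to `generator=` above and could presumably be used in their place, for example `generator=make_ref_code_generator(seed=7)`.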

---

## Supported Field Types

| Category   | Types |
|------------|-------|
| Identity   | UUID, FULL_NAME, FIRST_NAME, LAST_NAME, USERNAME, EMAIL, PHONE, PASSWORD, PASSWORD_HASH |
| Location   | ADDRESS, CITY, STATE, COUNTRY, ZIP_CODE, LATITUDE, LONGITUDE |
| Finance    | CARD_NUMBER, CARD_EXPIRY, CARD_CVV, BANK_ACCOUNT, IBAN, AMOUNT, CURRENCY |
| Business   | COMPANY, JOB_TITLE, DEPARTMENT, API_KEY, SECRET_TOKEN, JWT_TOKEN, WEBHOOK_URL |
| Dates      | DATETIME, DATE, TIME, DATE_OF_BIRTH, TIMESTAMP |
| Web & Tech | IP_ADDRESS, IPV6, MAC_ADDRESS, USER_AGENT, URL, DOMAIN, SLUG |
| Content    | WORD, SENTENCE, PARAGRAPH, TITLE, DESCRIPTION, TAG |
| Numeric    | INTEGER, FLOAT, BOOLEAN, PERCENTAGE |
| Enums      | STATUS, GENDER, CUSTOM |

---

## Why FakeSmith?

- **LLM-safe** — no real credentials, PII, or secrets ever leave your machine
- **Zero config** — paste a sample and go
- **Structurally identical** — same field names, same types, realistic values
- **All formats** — JSON, CSV, SQL, .env
- **Extensible** — override any field with a custom generator

---

## License

MIT
