Metadata-Version: 2.4
Name: ai-data-scrubber
Version: 0.1.1
Summary: A lightweight tool for removing personal data from text before uploading to LLMs
Author: Catherine Nelson
License: MIT
Project-URL: Homepage, https://github.com/catherinenelson1/ai-data-scrubber
Project-URL: Repository, https://github.com/catherinenelson1/ai-data-scrubber
Project-URL: Issues, https://github.com/catherinenelson1/ai-data-scrubber/issues
Keywords: privacy,data-cleaning,pii,llm,spacy,anonymization,text-processing
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.5.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# AI Data Scrubber

The AI Data Scrubber is a lightweight, privacy-focused tool that removes personal information from text documents before you upload them to Large Language Models (LLMs). You can use it to clean sensitive documents such as resumes or contracts. It is not guaranteed to remove everything, so review the output before you upload a file to an LLM.

It uses a mixture of regular expressions and named entity recognition models from [spaCy](https://spacy.io/). It's less accurate than asking an LLM to remove PII, but it means you don't need to run an LLM on your own machine or upload the unscrubbed document to one.
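
To give a feel for the regular-expression half of that approach, here is a minimal sketch using only the standard library. The patterns and the `regex_scrub` helper are illustrative assumptions, not the package's actual rules; the placeholder tags follow the ones shown in the Example section. Names are not handled here, since those would typically come from the spaCy named entity recognition step.

```python
import re

# Illustrative US-format patterns only (assumed, not taken from the package).
PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[PHONE]", re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")),
    ("[ZIP_CODE]", re.compile(r"\b\d{5}(?:-\d{4})?\b")),
]


def regex_scrub(text: str) -> str:
    """Replace each pattern match with its placeholder tag (hypothetical helper)."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(regex_scrub("Email: john.smith@example.com, Phone: (555) 123-4567"))
```

Simple patterns like these miss many real-world variations, which is one reason to review the scrubbed output by hand.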

## What It Does

The AI Data Scrubber removes personal information including:
- Names
- Email addresses
- Phone numbers
- Street addresses & ZIP codes
- URLs
- License plates

Currently, only US formats are supported.

You can run it from the command line or import it into a Python script.

## Quick Start

```bash
# Install using pip
pip install ai-data-scrubber

# Download required language model (~560MB)
python -m spacy download en_core_web_lg

# Clean your file
ai-data-scrubber your-file.txt
```

## Usage

**Command Line:**
```bash
# Auto-generates an output file with _scrubbed suffix
ai-data-scrubber input.txt

# Or you can specify the output file with the -o flag
ai-data-scrubber input.txt -o output.txt
```
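
If you want to predict the auto-generated output name in your own scripts, the sketch below shows one plausible way to derive it. It assumes the `_scrubbed` suffix is inserted before the file extension (for example `input_scrubbed.txt`); the README only states that a `_scrubbed` suffix is used, so check the actual output. `default_output_path` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Return the input path with `_scrubbed` added before the extension.

    Assumption: the CLI names its output this way; verify against a real run.
    """
    p = Path(input_path)
    return p.with_name(f"{p.stem}_scrubbed{p.suffix}")


if __name__ == "__main__":
    print(default_output_path("input.txt"))
```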

**Python:**
```python
from ai_data_scrubber import scrub_text, scrub_file

# Scrub text directly
cleaned = scrub_text("Your text with personal information here")

# Or scrub a file
scrub_file("input.txt", "output.txt")
```
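
For more than a handful of files, a small wrapper can apply a scrubbing function to every `.txt` file in a folder. The sketch below is an assumption-laden example, not part of the package: `scrub_directory` is a hypothetical helper that takes any `str -> str` callable, so you would pass in `scrub_text` from `ai_data_scrubber`.

```python
from pathlib import Path
from typing import Callable


def scrub_directory(src: str, dst: str, scrub: Callable[[str], str]) -> list[Path]:
    """Scrub every .txt file in `src` and write the results to `dst`.

    `scrub` is any function mapping text to cleaned text (hypothetical helper;
    pass `scrub_text` from ai_data_scrubber in real use).
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src).glob("*.txt")):
        cleaned = scrub(path.read_text(encoding="utf-8"))
        target = out_dir / path.name
        target.write_text(cleaned, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, a call might look like `scrub_directory("resumes", "resumes_clean", scrub_text)`, where both folder names are placeholders.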

## Example

**Original text:**
```
John Smith
123 Main Street, Apt 4B
New York, NY 10001
Email: john.smith@example.com
Phone: (555) 123-4567
```

**Cleaned text:**
```
[NAME]
[ADDRESS], [UNIT]
New York, NY [ZIP_CODE]
Email: [EMAIL]
Phone: [PHONE]
```
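
Because replacements appear as bracketed tags, you can get a quick summary of what was removed by counting them in the cleaned text. This is a minimal sketch that assumes every placeholder looks like the ones above (uppercase letters and underscores inside square brackets); `count_placeholders` is a hypothetical helper, not part of the package.

```python
import re
from collections import Counter

# Assumed placeholder shape, based on the tags in the example above.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def count_placeholders(cleaned: str) -> Counter:
    """Count how many times each placeholder tag appears in scrubbed text."""
    return Counter(PLACEHOLDER.findall(cleaned))


if __name__ == "__main__":
    sample = "[NAME]\n[ADDRESS], [UNIT]\nNew York, NY [ZIP_CODE]\nEmail: [EMAIL]\nPhone: [PHONE]"
    print(count_placeholders(sample))
```

A category with a count of zero where you expected matches is a hint to inspect the output more closely before uploading it.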

## Documentation

You can find further documentation in the `docs` folder:

- [Installation Guide](docs/INSTALLATION.md) - Detailed installation including from source
- [Performance Testing](docs/TESTING.md) - How to run a testing script with 20 sample resumes


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions welcome! Feel free to submit a Pull Request.
