Metadata-Version: 2.4
Name: data-atlas
Version: 0.0.1
Summary: Generate clinical data dictionaries using language models
Author-email: Raffaele Giancotti <raffaele.giancotti@unical.it>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: requests>=2.25.0
Requires-Dist: ollama>=0.1.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: sentence-transformers>=2.0.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: reportlab>=3.6.0
Requires-Dist: tqdm>=4.60.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Dynamic: license-file

# Data Dictionary Generator

A Python package that automatically generates data dictionaries for clinical datasets using a large language model (LLM) via Ollama. The tool reads dataset files in CSV format and generates a description for each table and each column, together with metadata such as data types and sample values. It also reports data-quality information, including missing values, outliers, and redundant (duplicate) rows and columns, and it can detect relationships between tables and columns.

## Features
### Metadata Generation
- **Column Descriptions**: AI-generated clinical context for each field
- **Table Summaries**: Dataset-level documentation
- **Smart Sampling**: Automatic data type detection with sample values

### Quality Analysis
- Missing value statistics
- Outlier detection (numeric fields)
- Duplicate row/column identification
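
The three checks listed above can be sketched with pandas, which the package already depends on. This is only an illustration: the function name `quality_report`, the dictionary keys, and the use of the 1.5 × IQR rule for outliers are assumptions, not necessarily what the package does internally.

```python
from itertools import combinations

import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise missing values, IQR outliers and duplicates for one table."""
    report = {
        # Fraction of missing cells in each column.
        "missing_fraction": df.isna().mean().to_dict(),
        # Rows that are exact copies of an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns whose values are identical to an earlier column.
        "duplicate_columns": [
            b for a, b in combinations(df.columns, 2) if df[a].equals(df[b])
        ],
        "outliers": {},
    }
    # Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
    for name in df.select_dtypes(include="number").columns:
        col = df[name].dropna()
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        iqr = q3 - q1
        mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
        report["outliers"][name] = int(mask.sum())
    return report


# Small made-up table: one implausible age, one copied column, one repeated row.
df = pd.DataFrame(
    {
        "age": [34, 36, 35, 33, 120, 34],
        "age_copy": [34, 36, 35, 33, 120, 34],
        "sex": ["F", "M", None, "F", "M", "F"],
    }
)
report = quality_report(df)
```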

### Advanced Capabilities
- **Semantic Relationship Detection**: Finds connected columns across tables
- **Multi-Format Outputs**: JSON, Markdown
- **Custom LLM Integration**: Supports Ollama, OpenAI, and Google Gemini models
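
Relationship detection can be pictured as scoring every pair of columns from different tables and keeping the pairs above a similarity threshold. The package lists `sentence-transformers` and `scikit-learn` among its dependencies, which suggests embedding-based similarity. The sketch below deliberately swaps that for character trigrams of the column names so that it runs with the standard library only. The function names, the threshold, and the example tables are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def trigram_vector(name: str) -> Counter:
    """Character-trigram counts: a crude stand-in for a sentence embedding."""
    padded = f"  {name.lower()}  "
    return Counter(padded[i : i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related_columns(tables: dict, threshold: float = 0.6) -> list:
    """Return (table.column, table.column, score) for cross-table pairs."""
    columns = [
        (table, col, trigram_vector(col))
        for table, cols in tables.items()
        for col in cols
    ]
    pairs = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only compare columns from different tables
        score = cosine(v1, v2)
        if score >= threshold:
            pairs.append((f"{t1}.{c1}", f"{t2}.{c2}", round(score, 3)))
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical table layout used only for this example.
tables = {
    "patients": ["subject_id", "gender", "dob"],
    "admissions": ["subject_id", "hadm_id", "admittime"],
}
pairs = related_columns(tables)
```

Name-based matching like this misses columns that mean the same thing under different names, which is where text embeddings of names and descriptions would be expected to do better.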

## Requirements

Make sure you have Python 3.8+ installed. All dependencies are automatically installed when you install the package.

## Installation

Clone the repository:

```bash
git clone https://github.com/rafgia/data-dictionary-generator.git
cd data-dictionary-generator
```

Install the package:

```bash
pip install -e .
```

## Usage

### Command-line Interface

Once the package is installed, you can use the command line to generate metadata for your dataset(s).

To run the tool, use the following command:

```bash
python generate_dictionary.py <folder_path> --output-dir <output_dir> --model llama3.1
```

#### Parameters:
- `<folder_path>`: The path to the folder containing your CSV files.
- `--output-dir <output_dir>`: The directory where the generated metadata will be saved.
- `--model`: The model to use for generating metadata (e.g., `llama3.1` via Ollama or `gpt-4o-mini` via OpenAI).
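
To drive the tool from a Python script (for example, to process several folders in turn), the command above can be assembled and run with `subprocess`. This is a minimal sketch: `build_command` and `run_generator` are hypothetical helper names, and it assumes `generate_dictionary.py` is in the current working directory.

```python
import subprocess
import sys


def build_command(folder: str, output_dir: str, model: str) -> list:
    """Assemble the argument list for generate_dictionary.py."""
    return [
        sys.executable,  # use the same interpreter that runs this script
        "generate_dictionary.py",
        folder,
        "--output-dir",
        output_dir,
        "--model",
        model,
    ]


def run_generator(folder: str, output_dir: str, model: str) -> int:
    """Run the CLI and return its exit code (0 normally means success)."""
    return subprocess.run(build_command(folder, output_dir, model)).returncode


cmd = build_command("data/MIMIC", "output", "llama3.1")
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when folder names contain spaces.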

### Example

1. **Prepare your data files**:
   Place all your CSV files (representing tables in your dataset) in a folder, e.g., `data/MIMIC`.

2. **Run the generator**:

```bash
python generate_dictionary.py data/MIMIC --output-dir output --model gpt-4o-mini
```

This generates metadata for each column in the dataset and saves the dictionary as a JSON file with a Markdown summary. The detected relationships are saved as a `.png` plot with their own Markdown summary.

### Sample Output

For each **table** in your dataset, the following metadata will be generated:
- **Table Name**: The name of the table (CSV filename).
- **Number of Rows**: The total number of rows in the table.
- **Number of Columns**: The total number of columns in the table.
- **Table Description**: A generated description of what the table contains.

For each **column** in your dataset:
- **Column Name**: The name of the column.
- **Sample Data**: A sample of 5 data points from the column.
- **Data Type**: The inferred data type (e.g., integer, float, string).
- **Column Description**: A description generated by the model for the column.
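
The non-LLM parts of these fields can be derived directly from a DataFrame. The sketch below shows one way to do it with pandas. The function name `profile_table` and the dictionary keys are hypothetical and may not match the package's actual JSON layout, and the description fields are left empty because they would be filled in by the model.

```python
import pandas as pd


def profile_table(name: str, df: pd.DataFrame, n_samples: int = 5) -> dict:
    """Collect table-level and column-level metadata for one CSV table."""
    columns = []
    for col in df.columns:
        values = df[col].dropna()
        columns.append(
            {
                "column_name": col,
                # First non-missing values, converted to plain Python types.
                "sample_data": values.head(n_samples).tolist(),
                # pandas' own label, e.g. "integer", "floating", "string".
                "data_type": pd.api.types.infer_dtype(values, skipna=True),
                "column_description": None,  # to be generated by the LLM
            }
        )
    return {
        "table_name": name,
        "num_rows": len(df),
        "num_columns": df.shape[1],
        "table_description": None,  # to be generated by the LLM
        "columns": columns,
    }


# Made-up data standing in for a CSV loaded with pd.read_csv.
df = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4, 5, 6],
        "gender": ["F", "M", "F", "F", "M", "M"],
    }
)
profile = profile_table("patients", df)
```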

### Example of generated metadata:


## Troubleshooting

### 1. If you encounter an error such as `model not found`, make sure you have set up Ollama correctly and the model is available.
- Ensure that you can manually run the model using `ollama run llama3.1` from the command line before using it in the Python script.
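
The same check can be scripted. A running Ollama server lists its locally available models at `GET /api/tags`, by default on `http://localhost:11434`. The helper names below are hypothetical, and the sketch assumes the default host and port.

```python
import json
import urllib.request


def model_in_tags(tags: dict, model: str) -> bool:
    """Return True if `model` matches a name in an /api/tags payload."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    # Ollama reports names with a tag suffix such as "llama3.1:latest",
    # so accept either the full name or the part before the colon.
    return any(n == model or n.split(":")[0] == model for n in names)


def ollama_has_model(model: str, host: str = "http://localhost:11434") -> bool:
    """Ask a local Ollama server; return False if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            return model_in_tags(json.load(resp), model)
    except OSError:  # covers connection errors and urllib's URLError
        return False
```

Calling `ollama_has_model("llama3.1")` before a long run gives an early, explicit failure instead of a `model not found` error partway through.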

### 2. If the dependencies fail to install, make sure you are using a supported Python version (3.8+) and that all required libraries are listed in `pyproject.toml`.

### 3. If the dataset is very large, consider breaking it down into smaller CSV files for more efficient processing.
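
One way to do the split with the standard library is sketched below. The function name `split_csv`, the part-file naming scheme, and the default chunk size are assumptions. Because the tool uses the CSV filename as the table name, each part will appear as a separate table in the output.

```python
import csv
from pathlib import Path


def split_csv(src: str, out_dir: str, rows_per_file: int = 100_000) -> list:
    """Split `src` into numbered CSV parts, repeating the header in each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(src).stem
    parts = []
    handle = None
    writer = None
    with open(src, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header is None:
            return parts  # empty input file: nothing to split
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:
                # Start a new part file every `rows_per_file` data rows.
                if handle is not None:
                    handle.close()
                path = out / f"{stem}_part{len(parts) + 1:03d}.csv"
                handle = open(path, "w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if handle is not None:
        handle.close()
    return parts
```

Reading row by row keeps memory use flat regardless of the size of the input file.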

## Contributing

If you would like to contribute to this project, feel free to fork the repository and submit a pull request. Make sure to add tests and document any new features.

### To contribute:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature-name`).
3. Make your changes and commit them (`git commit -am 'Add new feature'`).
4. Push to the branch (`git push origin feature-name`).
5. Open a pull request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Author

Raffaele Giancotti
