Metadata-Version: 2.1
Name: qullm
Version: 0.3.0
Summary: A Python package for preprocessing and augmenting data for large language models by quantum Neural networks.
Home-page: https://github.com/1nobodybutme1/qullm
Author: C.C.K Leon
Author-email: ckchau1@asu.edu
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas

Of course, here is the detailed example of the `README.md` file, which describes the functionality, installation method, and usage examples of the `qullm` package.

```markdown
# QULLM

QUllm is a Python package for preprocessing and augmenting data for large language models using quantum neural networks.

## Features

- **Data Loading**: Load data from various file formats.
- **Data Preprocessing**: Clean, normalize, and filter text data.
- **Data Augmentation**: Enhance data with techniques like synonym replacement, random insertion, deletion, and swapping.
- **Data Analysis**: Analyze data to get word frequencies, unique words, and other statistics.

## Installation

You can install the package using pip:

```bash
pip install qullm
```

## Usage

Here is an example of how to use the `qullm` package:

```python
from qullm.llm_data_processor import DataLoader, DataPreprocessor, DataAugmenter, DataAnalyzer, save_data

# Specify the path to your data file
data_file_path = 'path/to/your/data.txt'
processed_data_file_path = 'path/to/save/processed_data.txt'

# Step 1: Load data
data_loader = DataLoader(data_file_path)
data = data_loader.load_data()
print("Original Data:")
print(data[:5])  # Print first 5 lines for inspection

# Step 2: Preprocess data
preprocessor = DataPreprocessor(data)
cleaned_data = preprocessor.clean_data()
normalized_data = preprocessor.normalize_data()
filtered_data = preprocessor.filter_data()

print("Cleaned Data:")
print(cleaned_data[:5])  # Print first 5 lines for inspection

print("Normalized Data:")
print(normalized_data[:5])  # Print first 5 lines for inspection

print("Filtered Data:")
print(filtered_data[:5])  # Print first 5 lines for inspection

# Step 3: Augment data
augmenter = DataAugmenter(filtered_data)
augmented_data = augmenter.augment_data()
print("Augmented Data:")
print(augmented_data[:5])  # Print first 5 lines for inspection

# Step 4: Analyze data
analyzer = DataAnalyzer(augmented_data)
word_freq = analyzer.get_word_frequency()
stats = analyzer.get_data_statistics()

print("Word Frequency:")
print(word_freq.most_common(10))  # Print top 10 most common words

print("Data Statistics:")
print(stats)

# Step 5: Save processed data
save_data(augmented_data, processed_data_file_path)
print(f"Processed data saved to {processed_data_file_path}")
```

## Modules

### DataLoader

A module for loading data from files.

#### Methods

- `load_data()`: Load data from a file and return as a list of strings.
- `load_data_as_string()`: Load data from a file and return as a single string.
- `load_data_as_lines()`: Load data from a file and return as a list of lines.

### DataPreprocessor

A module for preprocessing text data.

#### Methods

- `clean_data()`: Clean the data by removing extra whitespace and newlines.
- `normalize_data()`: Normalize the data by converting text to lowercase.
- `filter_data(min_length)`: Filter out lines with fewer words than `min_length`.
- `remove_special_characters()`: Remove special characters from the data.
- `remove_stopwords(stopwords)`: Remove stopwords from the data.

### DataAugmenter

A module for augmenting text data.

#### Methods

- `augment_data(augment_factor)`: Augment the data by duplicating and shuffling.
- `synonym_replacement(word_map)`: Replace words with their synonyms.
- `random_insertion(insertion_words, insert_prob)`: Randomly insert words into the data.
- `random_deletion(delete_prob)`: Randomly delete words from the data.
- `random_swap(swap_prob)`: Randomly swap words in the data.

### DataAnalyzer

A module for analyzing text data.

#### Methods

- `get_word_frequency()`: Get the frequency of words in the data.
- `get_data_statistics()`: Get statistics such as the number of lines and words.
- `get_unique_words()`: Get unique words in the data.
- `get_average_line_length()`: Get the average length of lines in the data.
- `get_most_common_words(n)`: Get the top `n` most common words in the data.

### Utils

Utility functions for saving data and loading configurations.

#### Methods

- `save_data(data, file_path)`: Save data to a specified file path.
- `load_config(config_path)`: Load configuration from a JSON file.
- `save_config(config, config_path)`: Save configuration to a JSON file.
- `split_data(data, train_ratio)`: Split data into training and testing sets.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements

Thanks to the contributors and the open-source community for their support and contributions.
```

### Explanation

1. **Features**:
   - Briefly introduces the main functionalities of the package, including data loading, preprocessing, augmentation, and analysis.

2. **Installation**:
   - Provides the command to install the package.

3. **Usage**:
   - Offers a complete usage example demonstrating how to load, preprocess, augment, analyze, and save data.

4. **Modules**:
   - Detailed introduction of each module and its methods, including `DataLoader`, `DataPreprocessor`, `DataAugmenter`, `DataAnalyzer`, and `Utils`.

5. **License**:
   - Provides information about the project's license.

6. **Acknowledgements**:
   - Thanks the contributors and the open-source community for their support and contributions.

In this way, users can easily understand the functionality and usage of the `qullm` package to better preprocess and augment large language data.
