Metadata-Version: 2.1
Name: safe_store
Version: 0.1.0
Summary: A library for safe document storage and vectorization.
Home-page: https://github.com/ParisNeo/safe_store
Author: ALOUI Saifeddine (ParisNeo)
Author-email: aloui.seifeddine@email.com
License: Apache 2.0
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Description-Content-Type: text/markdown
License-File: LICENSE

# safe_store
![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)
![PyPI](https://img.shields.io/pypi/v/safe_store)
![Python](https://img.shields.io/pypi/pyversions/safe_store)

# safe_store Library - TextVectorizer Class

The `TextVectorizer` class is part of the `safe_store` library, which is available on PyPI under the Apache 2.0 license. It vectorizes text using either TF-IDF or model-based embeddings, and also supports document decomposition, visualization, and querying.

## Installation

You can install the `safe_store` library using pip:

```bash
pip install safe_store
```

## Usage

Import `TextVectorizer` from the library, create an instance, add documents, index them, and then query:

```python
from safe_store import TextVectorizer, VectorizationMethod

# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
    database_path="database.json",
    save_db=True,
    visualize_data_at_startup=True,
    visualize_data_at_add_file=True,
    visualize_data_at_generate=True
)

# Add a document for vectorization (replace the name and text with your own)
document_name = "example.txt"
text = "This is an example document for vectorization."
vectorizer.add_document(
    document_name,
    text,
    chunk_size=100,
    overlap_size=20,
    force_vectorize=False,
    add_as_a_bloc=False,
)

# Index the documents (perform vectorization)
vectorizer.index()

# Embed a query and retrieve similar documents
query_text = "vectorization"
query_embedding = vectorizer.embed_query(query_text)
similar_texts, _ = vectorizer.recover_text(query_embedding, top_k=3)
print("Similar Documents:")
# Use a distinct loop variable so the `text` defined above is not overwritten
for i, similar_text in enumerate(similar_texts, start=1):
    print(f"{i}: {similar_text}")
```
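
The `chunk_size` and `overlap_size` arguments suggest that a document is split into overlapping chunks before it is vectorized. The sketch below illustrates that general idea only. It is not the library's implementation: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words, whereas `safe_store` may measure chunks in tokens or characters.

```python
def chunk_words(text, chunk_size, overlap_size):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap_size` words. Hypothetical helper,
    not part of safe_store.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words(
    "one two three four five six seven", chunk_size=4, overlap_size=2
)
for chunk in chunks:
    print(chunk)
```

With a window of 4 words and an overlap of 2, each chunk repeats the last two words of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.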

## Constructor Parameters

- `vectorization_method`: Specify the vectorization method to use. Options are `VectorizationMethod.MODEL_EMBEDDING` or `VectorizationMethod.TFIDF_VECTORIZER`.
- `model`: Provide a model instance when using model-based embedding (required if `vectorization_method` is `VectorizationMethod.MODEL_EMBEDDING`).
- `database_path`: Path to the JSON database file where vectorized data is stored.
- `save_db`: Boolean to determine whether to save vectorized data to the database file.
- `visualize_data_at_startup`: Boolean to enable visualization of data at startup.
- `visualize_data_at_add_file`: Boolean to enable visualization of data when adding a file.
- `visualize_data_at_generate`: Boolean to enable visualization of data when generating embeddings.
- `data_visualization_method`: Specify the visualization method for data. Options are "PCA" or "t-SNE".
- `database_dict`: Optional dictionary to initialize the `TextVectorizer` state from a previous session.
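
To make the relationship between `database_path`, `save_db`, and `database_dict` more concrete, here is a minimal sketch of a JSON round trip using only the standard library. The dictionary layout (`chunks`, `embeddings`) and the variable names are invented for illustration; the format that `safe_store` actually writes may be different.

```python
import json
import os
import tempfile

# Hypothetical state: one text chunk and its embedding vector.
# These keys are assumptions, not safe_store's real schema.
state = {
    "chunks": ["This is an example document for vectorization."],
    "embeddings": [[0.1, 0.2, 0.3]],
}

# Write the state to a JSON file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "database.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(state, f)

# Read it back as a plain dictionary, as a later session would
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == state)
```

Storing the state as plain JSON keeps it readable and easy to inspect, at the cost of larger files than a binary format would produce for big embedding matrices.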

## Methods

- `add_document`: Add a document for vectorization.
- `index`: Index the documents to perform vectorization.
- `embed_query`: Embed a query text for similarity search.
- `recover_text`: Retrieve similar documents based on a query embedding.
- `show_document`: Visualize the data and embeddings.
- `file_exists`: Check if a document file already exists in the database.
- `remove_document`: Remove a document from the database.
- `toJson`: Serialize the current state of the `TextVectorizer` to JSON.
- `setVectorizer`: Set the vectorizer using a dictionary representation.
- `save_to_json`: Save the current state to a JSON file.
- `load_from_json`: Load vectorized documents and state from a JSON file.
- `clear_database`: Clear the database and reset the `TextVectorizer` instance.
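
Conceptually, `index`, `embed_query`, and `recover_text` correspond to three steps: vectorize the stored text, vectorize the query the same way, and rank the stored text by similarity. The following pure-Python sketch shows those steps with a simple smoothed TF-IDF weighting and cosine similarity. All function names here are hypothetical, and the weighting and ranking used inside `safe_store` may differ.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def build_idf(token_lists):
    """Compute a smoothed inverse document frequency for every seen term."""
    n = len(token_lists)
    document_frequency = Counter()
    for tokens in token_lists:
        document_frequency.update(set(tokens))
    return {
        term: math.log((1 + n) / (1 + count)) + 1.0
        for term, count in document_frequency.items()
    }


def tfidf(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen at indexing time are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    token_lists = [tokenize(chunk) for chunk in chunks]
    idf = build_idf(token_lists)                      # "index" step
    vectors = [tfidf(tokens, idf) for tokens in token_lists]
    query_vector = tfidf(tokenize(query), idf)        # "embed query" step
    scored = [(cosine(query_vector, v), c) for v, c in zip(vectors, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # "recover text" step
    return scored[:k]


chunks = [
    "Cats purr when they are content.",
    "TF-IDF vectorization weights rare terms more heavily.",
    "Vectorization turns text into numbers.",
]
results = top_k("vectorization", chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

In this example the shorter chunk ranks first: both matching chunks contain the query term once, but cosine similarity normalizes by vector length, so the chunk with fewer unrelated terms scores higher.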

## License

This library is released under the Apache 2.0 license. See the LICENSE file for more details.

## Contributing

We welcome contributions! If you find issues or have suggestions for improvements, please open an issue or create a pull request on the [GitHub repository](https://github.com/ParisNeo/safe_store).

