Metadata-Version: 2.4
Name: safe-store
Version: 3.2.2
Summary: Simple, concurrent SQLite-based vector store optimized for local RAG pipelines, with optional encryption.
Project-URL: Homepage, https://github.com/ParisNeo/safe_store
Project-URL: Repository, https://github.com/ParisNeo/safe_store
Project-URL: Documentation, https://github.com/ParisNeo/safe_store#readme
Project-URL: Issues, https://github.com/ParisNeo/safe_store/issues
Author-email: ParisNeo <parisneo_ai@gmail.com>
License-File: LICENSE
Keywords: concurrent,database,embedding,encryption,llm,local,rag,semantic search,sqlite,vector,webui
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.8
Requires-Dist: ascii-colors>=0.7.0
Requires-Dist: filelock>=3.9
Requires-Dist: numpy>=1.21
Requires-Dist: pipmaster>=1.0.8
Requires-Dist: sqlalchemy
Provides-Extra: all
Requires-Dist: beautifulsoup4>=4.11; extra == 'all'
Requires-Dist: cryptography>=40.0; extra == 'all'
Requires-Dist: lxml>=4.9; extra == 'all'
Requires-Dist: ollama; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: pypdf>=3.10; extra == 'all'
Requires-Dist: python-docx>=1.0; extra == 'all'
Requires-Dist: scikit-learn>=1.0; extra == 'all'
Requires-Dist: sentence-transformers==4.1.0; extra == 'all'
Provides-Extra: all-vectorizers
Requires-Dist: ollama; extra == 'all-vectorizers'
Requires-Dist: openai>=1.0; extra == 'all-vectorizers'
Requires-Dist: scikit-learn>=1.0; extra == 'all-vectorizers'
Requires-Dist: sentence-transformers==4.1.0; extra == 'all-vectorizers'
Provides-Extra: dev
Requires-Dist: beautifulsoup4>=4.11; extra == 'dev'
Requires-Dist: black>=22.0; extra == 'dev'
Requires-Dist: cryptography>=40.0; extra == 'dev'
Requires-Dist: flake8>=5.0; extra == 'dev'
Requires-Dist: hatchling; extra == 'dev'
Requires-Dist: lxml; extra == 'dev'
Requires-Dist: lxml>=4.9; extra == 'dev'
Requires-Dist: mypy>=0.9; extra == 'dev'
Requires-Dist: ollama; extra == 'dev'
Requires-Dist: openai>=1.0; extra == 'dev'
Requires-Dist: pypdf>=3.10; extra == 'dev'
Requires-Dist: pytest-cov>=3.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: python-docx>=1.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.0; extra == 'dev'
Requires-Dist: sentence-transformers==4.1.0; extra == 'dev'
Requires-Dist: sphinx-rtd-theme>=1.0; extra == 'dev'
Requires-Dist: sphinx>=5.0; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Requires-Dist: types-cryptography; extra == 'dev'
Requires-Dist: types-filelock; extra == 'dev'
Requires-Dist: wheel; extra == 'dev'
Provides-Extra: encryption
Requires-Dist: cryptography>=40.0; extra == 'encryption'
Provides-Extra: ollama
Requires-Dist: ollama; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: parsing
Requires-Dist: beautifulsoup4>=4.11; extra == 'parsing'
Requires-Dist: lxml>=4.9; extra == 'parsing'
Requires-Dist: pypdf>=3.10; extra == 'parsing'
Requires-Dist: python-docx>=1.0; extra == 'parsing'
Provides-Extra: sentence-transformers
Requires-Dist: sentence-transformers==4.1.0; extra == 'sentence-transformers'
Provides-Extra: tfidf
Requires-Dist: scikit-learn>=1.0; extra == 'tfidf'
Description-Content-Type: text/markdown

# safe_store: Transform Your Digital Chaos into a Queryable Knowledge Base

[![PyPI version](https://img.shields.io/pypi/v/safe_store.svg)](https://pypi.org/project/safe_store/)
[![PyPI license](https://img.shields.io/pypi/l/safe_store.svg)](https://github.com/ParisNeo/safe_store/blob/main/LICENSE)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/safe_store.svg)](https://pypi.org/project/safe_store/)

**`safe_store` is a Python library that turns your local folders of documents into a powerful, private, and intelligent knowledge base.** It achieves this by combining two powerful AI concepts into a single, seamless tool:

1.  **Deep Semantic Search:** It reads and *understands* the content of your files, allowing you to search by meaning and context, not just keywords.
2.  **AI-Powered Knowledge Graph:** It uses a Large Language Model (LLM) to automatically identify key entities (people, companies, concepts) and the relationships between them, building an interconnected web of your knowledge.

All of this happens entirely on your local machine, using a single, portable SQLite file. Your data never leaves your control.

---

## The Journey from Search to Understanding

`safe_store` is designed to grow with your needs. You can start with a simple, powerful RAG system in minutes, and then evolve it into a sophisticated knowledge engine.

### Level 1: Build a Powerful RAG System with Semantic Search
**The Foundation: Retrieval-Augmented Generation (RAG)**

RAG is a widely used technique for making Large Language Models (LLMs) answer questions about your private documents. The process is simple:
1.  **Retrieve:** Find the most relevant text chunks from your documents related to a user's query.
2.  **Augment:** Add those chunks as context to your prompt.
3.  **Generate:** Ask the LLM to generate an answer based *only* on the provided context.

`SafeStore` is the perfect tool for the "Retrieve" step. It uses vector embeddings to understand the *meaning* of your text, allowing you to find relevant passages even if they don't contain the exact keywords.

**Example: A Simple RAG Pipeline**

```python
import safe_store

# 1. Create a store. This will create a 'my_notes.db' file.
store = safe_store.SafeStore(db_path="my_notes.db", vectorizer_name="st")

# 2. Add your documents. It will scan the folder and process all supported files.
with store:
    store.add_document("path/to/my_notes_and_articles/")

# 3. Query the store to find context for your RAG prompt.
user_query = "What were the main arguments about AI consciousness in my research?"
context_chunks = store.query(user_query, top_k=3)

# 4. Build the prompt and send to your LLM.
context_text = "\n\n".join([chunk['chunk_text'] for chunk in context_chunks])
prompt = f"""
Based on the following context, please answer the user's question.
Do not use any external knowledge.

Context:
---
{context_text}
---

Question: {user_query}
"""

# result = my_llm_function(prompt) # Send to your LLM of choice
```
With just this, you have a powerful, private RAG system running on your local files.

### Level 2: Uncover Hidden Connections with a Knowledge Graph
**The Next Dimension: From Passages to a Web of Knowledge**

Semantic search is great for finding *relevant passages*, but it struggles with questions about *specific facts* and *relationships* scattered across multiple documents.

`GraphStore` complements this by building a structured knowledge graph of the key **instances** (like the person "Geoffrey Hinton") and their **relationships** (like `PIONEERED` the concept "Backpropagation"). This allows you to ask precise, factual questions.

### Level 3: Visualize Your Knowledge with an Interactive Point Cloud

Understanding the structure of your knowledge base can be challenging. `safe_store` provides a powerful tool to visually explore the semantic relationships within your documents.

The `export_point_cloud()` method performs a Principal Component Analysis (PCA) on all the vectors in your store to create a 2D "map" of your data. When combined with a simple web interface, this allows you to:

-   **See Clusters:** Identify natural groupings of related content at a glance.
-   **Explore Relationships:** Understand how different documents and topics relate to each other in the vector space.
-   **Debug and Refine:** Visually inspect the results of different chunking strategies or vectorization models to see how they affect the semantic layout of your data.
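
The 2D projection described above can be illustrated with plain NumPy. This is a minimal sketch of PCA via singular value decomposition, not the library's own implementation; the `pca_2d` helper and the random sample vectors are hypothetical stand-ins for real chunk embeddings.

```python
import numpy as np

def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project (n_samples, n_dims) vectors onto their top two principal components."""
    # Center the data so the components describe variance, not the mean offset.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical stand-in for chunk embeddings: 6 vectors of dimension 8.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))
points = pca_2d(embeddings)
print(points.shape)  # (6, 2)
```

Each row of `points` is one chunk's position on the 2D map; chunks with similar embeddings tend to land near each other.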

**Example Visualization:**

![Interactive Point Cloud Visualization](https://github.com/ParisNeo/safe_store/assets/15175224/8c4e0431-7e8c-4a3e-8c33-4f9e4f1659a1)
*(This UI is generated by the example script below)*

This entire interactive application, including the web server and the API to fetch chunk text on hover, is available as a complete, runnable example. It's the perfect starting point for building your own custom knowledge exploration tools.

Save the following code as `run_point_cloud_app.py` and execute it with `python run_point_cloud_app.py`.

```python
# examples/point_cloud_and_api.py
import safe_store
from pathlib import Path
import shutil
import json
import webbrowser
from http.server import HTTPServer, SimpleHTTPRequestHandler
import threading
import pipmaster as pm

# Ensure necessary packages for PCA and the example are installed
pm.ensure_packages(["scikit-learn", "pandas"])

# --- Helper Functions ---
def print_header(title):
    print("\n" + "="*10 + f" {title} " + "="*10)

def setup_environment():
    """Cleans up old files and creates new ones for the example."""
    print_header("Setting Up Example Environment")
    db_file = Path("point_cloud_example.db")
    doc_dir = Path("temp_docs_point_cloud")
    
    # Clean up DB and its artifacts
    for p in [db_file, Path(f"{db_file}.lock"), Path(f"{db_file}-wal"), Path(f"{db_file}-shm")]:
        p.unlink(missing_ok=True)
    
    # Clean up and create doc directory
    if doc_dir.exists():
        shutil.rmtree(doc_dir)
    doc_dir.mkdir(exist_ok=True)

    # Create sample documents (metadata is attached later, when they are indexed)
    (doc_dir / "animals.txt").write_text(
        "The quick brown fox jumps over the lazy dog. A fast red fox is athletic. The sleepy dog rests."
    )
    (doc_dir / "tech.txt").write_text(
        "Python is a versatile programming language. Many developers use Python for AI. RAG pipelines are a common use case."
    )
    (doc_dir / "space.txt").write_text(
        "The sun is a star at the center of our solar system. The Earth revolves around the sun. Space exploration is fascinating."
    )
    
    print("- Created sample documents and cleaned up old database.")
    return db_file, doc_dir

# --- Main Logic ---
DB_FILE, DOC_DIR = setup_environment()

print_header("Initializing SafeStore and Indexing Documents")
# Initialize SafeStore
store = safe_store.SafeStore(
    db_path=DB_FILE,
    vectorizer_name="st",
    vectorizer_config={"model": "all-MiniLM-L6-v2"},
    chunk_size=10, # small chunks for more points
    chunk_overlap=2
)

# Add documents to the store with metadata
with store:
    store.add_document(DOC_DIR / "animals.txt", metadata={"topic": "animals", "source": "fiction"})
    store.add_document(DOC_DIR / "tech.txt", metadata={"topic": "technology", "source": "documentation"})
    store.add_document(DOC_DIR / "space.txt", metadata={"topic": "space", "source": "science"})

print("- Documents indexed successfully.")

# --- Data Export for Visualization ---
print_header("Exporting Point Cloud Data")
with store:
    point_cloud_data = store.export_point_cloud(output_format='dict')

# Save data to a JSON file for the web page to fetch
web_dir = Path("point_cloud_web_app")
web_dir.mkdir(exist_ok=True)
data_file = web_dir / "data.json"
with open(data_file, "w") as f:
    json.dump(point_cloud_data, f)

print(f"- Point cloud data exported to {data_file}")

# --- Web Server and HTML Page ---
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>SafeStore | 2D Chunk Visualization</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdn.plot.ly/plotly-2.32.0.min.js"></script>
</head>
<body class="bg-slate-50 dark:bg-slate-900 text-slate-800 dark:text-slate-200">
    <main class="container mx-auto p-8">
        <header class="text-center mb-12">
            <h1 class="text-4xl font-bold text-slate-900 dark:text-white">2D Document Chunk Visualization</h1>
            <p class="mt-2 text-lg text-slate-600 dark:text-slate-400">Interactive PCA plot of vectorized chunks. Hover to inspect.</p>
        </header>
        <div class="grid grid-cols-1 lg:grid-cols-5 gap-8">
            <div class="lg:col-span-3 bg-white dark:bg-slate-800 rounded-xl shadow-lg p-6 h-[70vh]">
                <div id="plot" class="w-full h-full"></div>
            </div>
            <div class="lg:col-span-2 bg-white dark:bg-slate-800 rounded-xl shadow-lg p-6">
                <h2 class="text-2xl font-semibold text-slate-900 dark:text-white mb-4">Chunk Inspector</h2>
                <div id="chunk-info-container" class="relative h-[calc(70vh-80px)]"></div>
            </div>
        </div>
    </main>
    <script>
        document.addEventListener('DOMContentLoaded', function() {
            // ... (JavaScript remains the same as in the example file) ...
        });
    </script>
</body>
</html>
"""
# (For brevity, the full JavaScript is in the example file but the structure is shown here)

# Write the HTML file
index_file = web_dir / "index.html"
index_file.write_text(html_content)

# Define a custom request handler to serve files and provide an API
class CustomHandler(SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=str(web_dir), **kwargs)

    def do_GET(self):
        if self.path.startswith('/chunk/'):
            try:
                chunk_id = int(self.path.split('/')[-1])
                with store:
                    chunk_data = store.get_chunk_by_id(chunk_id)
                
                if chunk_data:
                    self.send_response(200)
                    self.send_header('Content-type', 'application/json')
                    self.end_headers()
                    self.wfile.write(json.dumps(chunk_data).encode('utf-8'))
                else:
                    self.send_error(404, "Chunk not found")
            except Exception as e:
                self.send_error(500, str(e))
            return
        
        super().do_GET()

print(f"- Wrote web application files to '{web_dir.resolve()}'")

# --- Run Server ---
PORT = 8008
server_address = ('', PORT)
httpd = HTTPServer(server_address, CustomHandler)
url = f"http://localhost:{PORT}"

print_header("Starting Web Server")
print(f"Serving visualization at: {url}")
print("Please open the URL in your web browser.")
print("Press Ctrl+C to stop the server.")

threading.Timer(1.5, lambda: webbrowser.open(url)).start()

try:
    httpd.serve_forever()
except KeyboardInterrupt:
    print("\n- Server stopped.")
finally:
    httpd.server_close()
```

---

## Dynamic Vectorizer Discovery & Configuration

One of `safe_store`'s most powerful features is its ability to self-document. You don't need to guess which vectorizers are available or what parameters they need. You can discover everything at runtime.

This makes it easy to experiment with different embedding models and build interactive tools that guide users through the setup process.

### Step 1: Discovering Available Vectorizers

The `SafeStore.list_available_vectorizers()` class method scans the library for all built-in and custom vectorizers and returns their complete configuration metadata.

```python
import safe_store
import pprint

# Get a list of all available vectorizer configurations
available_vectorizers = safe_store.SafeStore.list_available_vectorizers()

# Pretty-print the result to see what's available
pprint.pprint(available_vectorizers)
```
This will produce a detailed output like this:
```
[{'author': 'ParisNeo',
  'class_name': 'CohereVectorizer',
  'creation_date': '2025-10-10',
  'description': "A vectorizer that uses Cohere's API...",
  'input_parameters': [{'default': 'embed-english-v3.0',
                        'description': 'The name of the Cohere embedding model...',
                        'mandatory': True,
                        'name': 'model'},
                       {'default': '',
                        'description': 'Your Cohere API key...',
                        'mandatory': False,
                        'name': 'api_key'},
                        ...],
  'last_update_date': '2025-10-10',
  'name': 'cohere',
  'title': 'Cohere Vectorizer'},
 {'author': 'ParisNeo',
  'class_name': 'OllamaVectorizer',
  'name': 'ollama',
  'title': 'Ollama Vectorizer',
  ...},
  ...
]
```
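
Because the result is a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below, which assumes only the `input_parameters`, `name`, and `mandatory` fields visible in the output above, lists the parameters a vectorizer marks as mandatory. The `required_parameters` helper and the sample entry are hypothetical.

```python
def required_parameters(vectorizer_info: dict) -> list:
    """Return the names of parameters flagged as mandatory in one metadata entry."""
    return [
        param["name"]
        for param in vectorizer_info.get("input_parameters", [])
        if param.get("mandatory", False)
    ]

# Hand-written sample entry mirroring the fields shown above (values shortened).
sample_entry = {
    "name": "cohere",
    "title": "Cohere Vectorizer",
    "input_parameters": [
        {"name": "model", "default": "embed-english-v3.0", "mandatory": True},
        {"name": "api_key", "default": "", "mandatory": False},
    ],
}

print(required_parameters(sample_entry))  # ['model']
```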

### Step 2: Listing Available Models for a Vectorizer

Once you know which vectorizer you want to use, you can ask `safe_store` what specific models it supports. This is especially useful for API-based or local server-based vectorizers like `ollama`, which can have many different models available.

```python
import safe_store

# Example: List all embedding models available from a running Ollama server
try:
    # This requires a running Ollama instance to succeed
    ollama_models = safe_store.SafeStore.list_models("ollama")
    print("Available Ollama embedding models:")
    for model in ollama_models:
        print(f"- {model}")
except Exception as e:
    print(f"Could not list Ollama models. Is the server running? Error: {e}")

```

### Step 3: Building an Interactive Configurator

You can use this metadata to create an interactive setup script, guiding the user to choose and configure their desired vectorizer on the fly.

**Full Interactive Example:**
Copy and run this script. It will guide you through selecting and configuring a vectorizer, then initialize `SafeStore` with your choices.

```python
# interactive_setup.py
import safe_store
import pprint

def interactive_vectorizer_setup():
    """
    An interactive CLI to guide the user through selecting and configuring a vectorizer.
    """
    print("--- Welcome to the safe_store Interactive Vectorizer Setup ---")
    
    # 1. List all available vectorizers
    vectorizers = safe_store.SafeStore.list_available_vectorizers()
    
    print("\nAvailable Vectorizers:")
    for i, vec in enumerate(vectorizers):
        print(f"  [{i+1}] {vec['name']} - {vec.get('title', 'No Title')}")

    # 2. Get user's choice
    choice = -1
    while choice < 0 or choice >= len(vectorizers):
        try:
            raw_choice = input(f"\nPlease select a vectorizer (1-{len(vectorizers)}): ")
            choice = int(raw_choice) - 1
            if not (0 <= choice < len(vectorizers)):
                print("Invalid selection. Please try again.")
        except ValueError:
            print("Please enter a number.")

    selected_vectorizer = vectorizers[choice]
    selected_name = selected_vectorizer['name']
    
    print(f"\nYou have selected: {selected_name}")
    print(f"Description: {selected_vectorizer.get('description', 'N/A').strip()}")

    # 3. Dynamically build the configuration dictionary
    vectorizer_config = {}
    print("\nPlease provide the following configuration values (press Enter to use default):")
    
    params = selected_vectorizer.get('input_parameters', [])
    if not params:
        print("This vectorizer requires no special configuration.")
    else:
        for param in params:
            param_name = param['name']
            description = param.get('description', 'No description.')
            default_value = param.get('default', None)
            
            prompt = f"- {param_name} ({description})"
            if default_value is not None:
                prompt += f" [default: {default_value}]: "
            else:
                prompt += ": "
                
            user_input = input(prompt)
            
            # Use user input if provided, otherwise use default
            final_value = user_input if user_input else default_value
            
            # Simple type conversion for demonstration (can be expanded)
            if final_value is not None:
                if param.get('type') == 'int':
                    vectorizer_config[param_name] = int(final_value)
                elif param.get('type') == 'dict':
                    # For simplicity, we don't parse dicts here, but a real app might use json.loads
                    vectorizer_config[param_name] = final_value
                else:
                    vectorizer_config[param_name] = str(final_value)

    # 4. Initialize SafeStore with the dynamically created configuration
    print("\n--- Configuration Complete ---")
    print(f"Vectorizer Name: '{selected_name}'")
    print("Vectorizer Config:")
    pprint.pprint(vectorizer_config)
    
    try:
        print("\nInitializing SafeStore with your configuration...")
        store = safe_store.SafeStore(
            db_path=f"{selected_name}_store.db",
            vectorizer_name=selected_name,
            vectorizer_config=vectorizer_config
        )
        print("\n✅ SafeStore initialized successfully!")
        print(f"Database file is at: {selected_name}_store.db")
        store.close()
    except Exception as e:
        print(f"\n❌ Failed to initialize SafeStore: {e}")


if __name__ == "__main__":
    interactive_vectorizer_setup()
```
This script demonstrates how the self-documenting nature of `safe_store` enables you to build powerful, user-friendly applications on top of it.
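
The type conversion in the script is deliberately simple. One possible extension is sketched below, assuming each parameter may carry a `type` field as the script does; the `coerce_value` helper is hypothetical, and the `'float'` and `'bool'` cases are assumptions that do not appear in the script.

```python
import json

def coerce_value(raw, param_type=None):
    """Convert a raw CLI string into the type named by param_type."""
    if raw is None or not isinstance(raw, str):
        return raw  # defaults may already have the right type
    if param_type == "int":
        return int(raw)
    if param_type == "float":
        return float(raw)
    if param_type == "bool":
        return raw.strip().lower() in {"1", "true", "yes", "y"}
    if param_type == "dict":
        parsed = json.loads(raw)  # expects a JSON object such as {"a": 1}
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return parsed
    return raw  # unknown or missing type: keep the string

print(coerce_value("42", "int"))           # 42
print(coerce_value('{"a": 1}', "dict"))    # {'a': 1}
```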

---

## Core Concepts for Advanced RAG

### Understanding Tokenization for Chunking
`safe_store` can chunk your documents based on character count (`character` strategy) or token count (`token` strategy). Using the `token` strategy is often more effective as it aligns better with how Large Language Models (LLMs) process text.

When you select `chunking_strategy='token'`, `safe_store` intelligently handles tokenization:

1.  **Vectorizer's Native Tokenizer:** If the chosen vectorizer (like a local `sentence-transformers` model) has its own tokenizer, `safe_store` will use it. This is the most accurate method, as the chunking tokens will perfectly match the vectorizer's tokens.

2.  **Fallback to `tiktoken`:** Some vectorizers, especially those accessed via an API (like OpenAI or Cohere), do not expose their tokenizer for local use. In these cases, `safe_store` falls back to `tiktoken` with the `cl100k_base` encoding. `cl100k_base` is the encoding used by several OpenAI models; for other models it only approximates the true token count, which is usually close enough to keep chunks within the intended size.

You can also specify a custom tokenizer during `SafeStore` initialization if you have specific needs.
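
To make the idea of token-based chunking concrete, here is a minimal sliding-window sketch. It is not `safe_store`'s internal implementation: the `chunk_by_tokens` helper is hypothetical, and it uses whitespace splitting as a stand-in tokenizer so that it runs without extra dependencies. A real tokenizer's encode and decode functions could be passed in instead.

```python
from typing import Callable, List

def chunk_by_tokens(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> List[str]:
    """Split text into windows of chunk_size tokens sharing chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_by_tokens(
    "the quick brown fox jumps over the lazy dog", chunk_size=4, chunk_overlap=1
)
print(chunks)
# ['the quick brown fox', 'fox jumps over the', 'the lazy dog']
```

Each chunk repeats the last token of the previous one, which is the role `chunk_overlap` plays in keeping context across chunk boundaries.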

### Enriching Your Data with Metadata
Metadata is extra information about your documents that provides crucial context. You can attach a dictionary of key-value pairs to any document you add to `safe_store`.

**How to Add Metadata:**
Simply pass a dictionary to the `metadata` parameter when adding content.

```python
# Example of adding a document with metadata
doc_info = {
    "title": "Quantum Entanglement in Nanostructures",
    "author": "Dr. Alice Smith",
    "year": 2024,
    "topic": "Quantum Physics"
}

with store:
    store.add_document(
        "path/to/research_paper.txt",
        metadata=doc_info
    )
```

**How Metadata is Used in Queries:**
When you perform a `query`, the document's metadata is returned in two ways for maximum flexibility:

1.  **As a structured dictionary:** The `document_metadata` field contains the parsed metadata, which your application can use for filtering, logging, or display purposes.
2.  **Prepended to the `chunk_text`:** A human-readable version of the metadata is automatically added to the beginning of the returned `chunk_text`. This "just-in-time" context injection dramatically improves an LLM's ability to understand the source and relevance of the information, leading to better-quality responses without any extra work on your part.

A query result object looks like this:
```json
[
  {
    "chunk_id": 123,
    "similarity_percent": 95.4,
    "file_path": "/path/to/research_paper.txt",
    "document_metadata": {
      "title": "Quantum Entanglement in Nanostructures",
      "author": "Dr. Alice Smith",
      "year": 2024,
      "topic": "Quantum Physics"
    },
    "chunk_text": "--- Document Context ---\nTitle: Quantum Entanglement in Nanostructures\nAuthor: Dr. Alice Smith\nYear: 2024\nTopic: Quantum Physics\n------------------------\n\n...the actual text from the document chunk begins here..."
  }
]
```
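
Since `document_metadata` is an ordinary dictionary, results can be filtered after the query with a few lines of Python. The sketch below works on a hand-written list shaped like the result above; the `filter_by_metadata` helper and the second sample entry are hypothetical.

```python
def filter_by_metadata(results: list, **required) -> list:
    """Keep only results whose document_metadata matches every key=value given."""
    return [
        result
        for result in results
        if all(
            (result.get("document_metadata") or {}).get(key) == value
            for key, value in required.items()
        )
    ]

# Sample data shaped like a query result; the second entry is made up for contrast.
sample_results = [
    {"chunk_id": 123, "similarity_percent": 95.4,
     "document_metadata": {"topic": "Quantum Physics", "year": 2024},
     "chunk_text": "..."},
    {"chunk_id": 124, "similarity_percent": 90.1,
     "document_metadata": {"topic": "Biology", "year": 2021},
     "chunk_text": "..."},
]

physics = filter_by_metadata(sample_results, topic="Quantum Physics")
print([result["chunk_id"] for result in physics])  # [123]
```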

### Reconstructing Original Content
After indexing, you may need to retrieve the full, original text of a document as it was processed by `safe_store`. The `reconstruct_document_text` method does this by fetching and reassembling all of a document's stored chunks.

```python
# Assuming 'store' is an initialized SafeStore instance
# with "path/to/research_paper.txt" already added.
full_text = store.reconstruct_document_text("path/to/research_paper.txt")

if full_text:
    print("--- Reconstructed Text ---")
    print(full_text[:500] + "...")

# Note: If a chunk_overlap was used during indexing, the reconstructed text
# will contain these repeated, overlapping segments. This method provides a
# raw reassembly of the stored data.
```

### Pre-processing Chunks on the Fly with `chunk_processor`
For advanced RAG, you might need to transform the text of a chunk *before* it's vectorized and stored. The `chunk_processor` is a powerful hook that lets you do exactly that.

It's an optional callable that you can pass to `add_document` or `add_text`. The function receives the raw text of each chunk and the document's metadata, and it must return the string that you want to be stored and vectorized instead.

This enables powerful workflows like:
- **Summarization:** Replace long chunks with concise summaries generated by an LLM.
- **Keyword Extraction:** Prepend important keywords to each chunk to boost relevance for certain queries.
- **Translation:** Translate chunks into a different language before indexing.
- **Formatting:** Clean or reformat text in a specific way for your RAG pipeline.

**Example: Prepending Metadata to Each Chunk**

```python
import safe_store

store = safe_store.SafeStore(db_path="processed_store.db")

def prepend_topic_processor(chunk_text: str, metadata: dict) -> str:
    """A processor that adds the 'topic' from metadata to the chunk text."""
    topic = metadata.get("topic", "general")
    return f"[Topic: {topic}] {chunk_text}"

with store:
    store.add_text(
        unique_id="processed_doc_1",
        text="This chunk is about quantum mechanics.",
        metadata={"topic": "Physics"},
        chunk_processor=prepend_topic_processor,
        force_reindex=True
    )

# When you query this, the stored text will be:
# "[Topic: Physics] This chunk is about quantum mechanics."
# This can make the vector more specific to the topic.
results = store.query("information related to physics", top_k=1)
if results:
    print(results[0]['chunk_text'])  # query() returns a list of result dicts

store.close()
```
This simple hook provides immense flexibility for customizing your data ingestion pipeline.
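
As a second illustration, the keyword extraction workflow listed above could look like the sketch below. It relies only on the processor signature described in this section (chunk text and metadata in, string out). The `keyword_processor` function, its small stopword list, and the frequency-based keyword choice are hypothetical simplifications rather than a recommended extraction method.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "are", "with", "that", "this", "from", "across"}

def keyword_processor(chunk_text: str, metadata: dict) -> str:
    """Prepend the most frequent non-stopword terms to the chunk text."""
    words = re.findall(r"[a-z]+", chunk_text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    keywords = [word for word, _ in counts.most_common(3)]
    if not keywords:
        return chunk_text  # nothing useful to add
    return f"[Keywords: {', '.join(keywords)}] {chunk_text}"

sample = "Entanglement links particles. Entanglement persists across distance."
processed = keyword_processor(sample, {})
print(processed)
```

A function like this would be passed as `chunk_processor=keyword_processor` in the same way as `prepend_topic_processor` above.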

---

## Data Safety and Recovery

Because `safe_store` is built on a single, portable SQLite database file, ensuring the safety of your knowledge base is straightforward.

**Backup:**
To back up your entire store, copy the main database file (e.g., `my_notes.db`) while no process is writing to it; copying a SQLite file in the middle of a write can produce an inconsistent backup. If the store was not closed cleanly, recent changes may still sit in SQLite's write-ahead log, so copy the sidecar files along with the database:
- `my_notes.db` (the main database file)
- `my_notes.db-shm` (shared-memory index for the write-ahead log)
- `my_notes.db-wal` (the write-ahead log itself)

Copying these three files together to a secure location (like a separate hard drive or a cloud storage folder) creates a complete snapshot of your store at that moment.

**Recovery:**
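To recover from a backup, close the store, then replace the corrupted or lost `.db`, `.db-shm`, and `.db-wal` files with the copies from your backup. Restore the files as a set so that the write-ahead log matches the database it belongs to.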

This file-based approach avoids the complexity of database dumps and restores, giving you a simple and robust way to protect your data.
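
If the database may be in use while you back it up, an alternative to copying files is SQLite's online backup API, exposed in Python's standard library as `sqlite3.Connection.backup`. The sketch below assumes the store is an ordinary SQLite file, as described above; the `backup_sqlite` helper and the throwaway `notes` table are hypothetical and are not part of `safe_store`.

```python
import sqlite3
import tempfile
from pathlib import Path

def backup_sqlite(src_path, dest_path) -> None:
    """Copy a SQLite database into a single consistent file via the backup API."""
    src = sqlite3.connect(str(src_path))
    dest = sqlite3.connect(str(dest_path))
    try:
        src.backup(dest)  # pages from the WAL are included in the copy
    finally:
        dest.close()
        src.close()

# Demo on a throwaway database so the sketch runs on its own.
workdir = Path(tempfile.mkdtemp())
source_db = workdir / "my_notes.db"
conn = sqlite3.connect(str(source_db))
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('hello')")
conn.commit()
conn.close()

backup_db = workdir / "my_notes.backup.db"
backup_sqlite(source_db, backup_db)

restored = sqlite3.connect(str(backup_db))
rows = restored.execute("SELECT body FROM notes").fetchall()
restored.close()
print(rows)  # [('hello',)]
```

The result is one self-contained `.db` file, so there are no `-wal` or `-shm` files to keep in sync with it.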

---

## 🏁 Quick Start Guide

This example shows the end-to-end workflow: indexing a document, then building and querying a knowledge graph of its **instances** using a simple string-based ontology.

```python
import safe_store
from safe_store import GraphStore, LogLevel
from lollms_client import LollmsClient
from pathlib import Path
import shutil

# --- 0. Configuration & Cleanup ---
DB_FILE = "quickstart.db"
DOC_DIR = Path("temp_docs_qs")
if DOC_DIR.exists(): shutil.rmtree(DOC_DIR)
DOC_DIR.mkdir()
Path(DB_FILE).unlink(missing_ok=True)

# --- 1. LLM Executor & Sample Document ---
def llm_executor(prompt: str) -> str:
    try:
        client = LollmsClient()
        return client.generate_code(prompt, language="json", temperature=0.1) or ""
    except Exception as e:
        raise ConnectionError(f"LLM call failed: {e}")

doc_path = DOC_DIR / "doc.txt"
doc_path.write_text("Dr. Aris Thorne is the CEO of QuantumLeap AI, a firm in Geneva.")

# --- 2. Level 1: Semantic Search with SafeStore ---
print("--- LEVEL 1: SEMANTIC SEARCH ---")
store = safe_store.SafeStore(db_path=DB_FILE, vectorizer_name="st", log_level=LogLevel.INFO)
with store:
    store.add_document(doc_path)
    results = store.query("who leads the AI firm in Geneva?", top_k=1)
    print(f"Semantic search result: '{results[0]['chunk_text']}'")

# --- 3. Level 2: Knowledge Graph with GraphStore ---
print("\n--- LEVEL 2: KNOWLEDGE GRAPH ---")
ontology = "Extract People and Companies. A Person can be a CEO_OF a Company."
try:
    graph_store = GraphStore(store=store, llm_executor_callback=llm_executor, ontology=ontology)
    with graph_store:
        graph_store.build_graph_for_all_documents()
        graph_result = graph_store.query_graph("Who is the CEO of QuantumLeap AI?", output_mode="graph_only")
        
        print("Graph query result:")
        for rel in graph_result.get('relationships', []):
            source = rel['source_node']['properties'].get('identifying_value')
            target = rel['target_node']['properties'].get('identifying_value')
            print(f"- Relationship: '{source}' --[{rel['type']}]--> '{target}'")
except ConnectionError as e:
    print(f"[SKIP] GraphStore part failed: {e}")

store.close()
```

---

## ⚙️ Installation

```bash
pip install safe-store
```
Install optional dependencies for the features you need:

```bash
# For Sentence Transformers (recommended for local use)
pip install safe-store[sentence-transformers]

# For API-based vectorizers
pip install safe-store[openai,ollama]

# For parsing PDF, DOCX, etc.
pip install safe-store[parsing]

# For encryption
pip install safe-store[encryption]

# To install everything:
pip install safe-store[all] 
```
---

## 💡 API Highlights

#### `SafeStore` (The Foundation)
*   `__init__(db_path, vectorizer_name, ...)`: Creates or loads a database. The vectorizer is locked in at creation.
*   `add_document(path, ...)`: Parses, chunks, vectorizes, and stores a document or an entire folder.
*   `query(query_text, top_k, ...)`: Performs a semantic search and returns the most relevant text chunks for your RAG pipeline.
*   `get_chunk_by_id(chunk_id)`: Retrieves the full text and metadata for a specific chunk by its ID.
*   `reconstruct_document_text(file_path)`: Reassembles and returns the full, original text of a document by joining its stored chunks.
*   `export_point_cloud()`: Exports all vectors as a 2D point cloud for visualization, using PCA for dimensionality reduction.

#### `GraphStore` (The Intelligence Layer)
*   `__init__(store, llm_executor_callback, ontology)`: Creates the graph manager on an existing `SafeStore` instance.
*   `build_graph_for_all_documents()`: Scans documents and uses an LLM to build the knowledge graph based on your ontology.
*   `query_graph(natural_language_query, ...)`: Translates a question into a graph traversal, returning nodes, relationships, and/or the original source text.
*   `add_node(...)`, `add_relationship(...)`: Manually edit the graph to add your own expert knowledge.

---

## 🤝 Contributing & License

Contributions are highly welcome! Please open an issue to discuss a new feature or submit a pull request on [GitHub](https://github.com/ParisNeo/safe_store).

Licensed under Apache 2.0. See [LICENSE](LICENSE).