Metadata-Version: 2.1
Name: pgml
Version: 0.9.2
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Summary: Python SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases.
Keywords: postgres,machine learning,vector databases,embeddings
Home-Page: https://postgresml.org/
Author: PostgresML <team@postgresml.org>
Author-email: PostgresML <team@postgresml.org>
License: MIT
Requires-Python: >=3.7
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://postgresml.org
Project-URL: Repository, https://github.com/postgresml/postgresml
Project-URL: Documentation, https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/python/

# Open Source Alternative for Building End-to-End Vector Search Applications without OpenAI & Pinecone

# Table of Contents

- [Overview](#overview)
- [Quickstart](#quickstart)
- [Usage](#usage)
- [Examples](./examples/README.md)
- [Developer setup](#developer-setup)
- [Roadmap](#roadmap)

# Overview

The Python SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With this SDK, you can manage the database tables related to documents, text chunks, text splitters, large language models (LLMs), and embeddings. The SDK indexes LLM embeddings with PgVector for fast and accurate queries.

## Key Features

- **Automated Database Management**: The SDK creates and maintains the database tables for documents, text chunks, text splitters, LLM models, and embeddings. This simplifies setting up and maintaining your vector search application's data structure.

- **Embedding Generation from Open Source Models**: The Python SDK provides the ability to generate embeddings using hundreds of open source models. These models, trained on vast amounts of data, capture the semantic meaning of text and enable powerful analysis and search capabilities.

- **Flexible and Scalable Vector Search**: The Python SDK integrates with PgVector, a PostgreSQL extension designed for vector-based indexing and querying. With these indices, you can perform similarity searches, rank results by relevance, and retrieve accurate and meaningful information from your database.

## Use Cases

Embeddings, the core concept of the Python SDK, find applications in various scenarios, including:

- Search: Embeddings are commonly used for search functionalities, where results are ranked by relevance to a query string. By comparing the embeddings of query strings and documents, you can retrieve search results in order of their similarity or relevance.

- Clustering: With embeddings, you can group text strings by similarity, enabling clustering of related data. By measuring the similarity between embeddings, you can identify clusters or groups of text strings that share common characteristics.

- Recommendations: Embeddings play a crucial role in recommendation systems. By identifying items with related text strings based on their embeddings, you can provide personalized recommendations to users.

- Anomaly Detection: Anomaly detection involves identifying outliers or anomalies that have little relatedness to the rest of the data. Embeddings can aid in this process by quantifying the similarity between text strings and flagging outliers.

- Classification: Embeddings are utilized in classification tasks, where text strings are classified based on their most similar label. By comparing the embeddings of text strings and labels, you can classify new text strings into predefined categories.
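The search use case above boils down to comparing vectors. The following is a minimal sketch of ranking by cosine similarity in plain Python. It does not use the SDK, and the three-dimensional vectors and document names are made up for illustration (real embedding models produce vectors with hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, not produced by any real model.
documents = {
    "doc_music": [0.9, 0.1, 0.0],
    "doc_sports": [0.1, 0.9, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In practice the SDK and PgVector perform this comparison inside the database, so you would not compute similarities by hand.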

## How the Python SDK Works

The Python SDK streamlines the development of vector search applications by abstracting away the complexities of database management and indexing. Here's an overview of how the SDK works:

- **Automatic Document and Text Chunk Management**: The SDK provides a convenient interface to manage documents and pipelines, automatically handling chunking and embedding for you. You can easily organize and structure your text data within the PostgreSQL database.

- **Open Source Model Integration**: With the SDK, you can seamlessly incorporate a wide range of open source models to generate high-quality embeddings. These models capture the semantic meaning of text and enable powerful analysis and search capabilities.

- **Embedding Indexing**: The Python SDK utilizes the PgVector extension to efficiently index the embeddings generated by the open source models. This indexing process optimizes search performance and allows for fast and accurate retrieval of relevant results.

- **Querying and Search**: Once the embeddings are indexed, you can perform vector-based searches on the documents and text chunks stored in the PostgreSQL database. The SDK provides intuitive methods for executing queries and retrieving search results.

# Quickstart

Follow the steps below to quickly get started with the Python SDK for building scalable vector search applications on PostgresML databases.

## Prerequisites

Before you begin, make sure you have the following:

- PostgresML Database: Ensure you have a PostgresML database version >= `2.7.7`. You can spin up a database using [Docker](https://github.com/postgresml/postgresml#installation) or [sign up for a free GPU-powered database](https://postgresml.org/signup).

- Set the `DATABASE_URL` environment variable to the connection string of your PostgresML database.

- Python version >=3.8.1
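Before running the sample code, it can help to confirm that `DATABASE_URL` is set and looks like a PostgreSQL connection string. The check below is a rough sketch using only the standard library. The helper name `check_database_url` is hypothetical, and the accepted schemes (`postgres` and `postgresql`) are an assumption rather than something the SDK documents here.

```python
import os
from urllib.parse import urlparse


def check_database_url(url):
    """Return a list of problems found in a PostgreSQL connection string."""
    if not url:
        return ["DATABASE_URL is not set"]
    problems = []
    parsed = urlparse(url)
    # Assumption: both common PostgreSQL URL schemes are acceptable.
    if parsed.scheme not in ("postgres", "postgresql"):
        problems.append(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        problems.append("missing host")
    if not parsed.path.strip("/"):
        problems.append("missing database name")
    return problems


problems = check_database_url(os.environ.get("DATABASE_URL"))
if problems:
    print("DATABASE_URL looks wrong:", "; ".join(problems))
else:
    print("DATABASE_URL looks plausible")
```

This only inspects the shape of the string. It does not open a connection or verify credentials.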

## Installation

To install the Python SDK, use pip:

```
pip install pgml
```

## Sample Code

Once you have the Python SDK installed, you can use the following sample code as a starting point for your vector search application:

```python
from pgml import Collection, Model, Splitter, Pipeline
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
import asyncio

async def main():
    load_dotenv()
    console = Console()

    # Initialize collection
    collection = Collection("quora_collection")
```

**Explanation:**

- The code imports the necessary modules and packages, including pgml, datasets, dotenv, rich, time, and asyncio.
- It creates an instance of the `Collection` class, to which we will add pipelines and documents.

Continuing within `async def main():`

```python
    # Create a pipeline using the default model and splitter
    model = Model()
    splitter = Splitter()
    pipeline = Pipeline("quorav1", model, splitter)
    await collection.add_pipeline(pipeline)
```

**Explanation:**

- The code creates an instance of `Model` and `Splitter` using their default arguments.
- Finally, the code constructs a pipeline called `"quorav1"` and adds it to the collection we initialized above. This pipeline automatically generates chunks and embeddings for every upserted document.

Continuing within `async def main():`

```python
    # Prep documents for upserting
    data = load_dataset("squad", split="train")
    data = data.to_pandas()
    data = data.drop_duplicates(subset=["context"])
    documents = [
        {"id": r["id"], "text": r["context"], "title": r["title"]}
        for r in data.to_dict(orient="records")
    ]

    # Upsert documents
    await collection.upsert_documents(documents[:200])
```

**Explanation:**

- The code loads the "squad" dataset, converts it to a pandas DataFrame, and drops any duplicate context values.
- It creates a list of dictionaries representing the documents to be indexed, with each dictionary containing the document's id, text, and title.
- Finally, the first 200 documents are upserted. As mentioned above, the pipeline added earlier automatically runs and generates chunks and embeddings for each document.

Continuing within `async def main():`

```python
    # Query
    query = "Who won 20 grammy awards?"
    results = await collection.query().vector_recall(query, pipeline).limit(5).fetch_all()
    console.print(results)
    # Archive collection
    await collection.archive()
```

**Explanation:**

- The `query` method is called to perform a vector-based search on the collection. The query string is `Who won 20 grammy awards?`, and the top 5 results are requested.
- The search results are printed.
- Finally, the `archive` method is called to archive the collection and free up resources in the PostgresML database.

Call the `main` function in an async event loop.

```python
asyncio.run(main())
```

**Running the Code**

Save the code above in a file named `vector_search.py`, then open a terminal or command prompt and navigate to the directory where the file is saved.

Execute the following command:

```
python vector_search.py
```

You should see the search results printed in the terminal. As you can see, our vector search engine found the right text chunk with the answer we are looking for.

```python
[
    (
        0.8423336495860181,
        'Beyoncé has won 20 Grammy Awards, both as a solo artist and member of Destiny\'s Child, making her the second most honored female artist by the Grammys, behind Alison Krauss and the most nominated woman in Grammy Award history with 52 nominations. "Single Ladies (Put a Ring on It)" won Song of the Year in 2010 while "Say My Name" and "Crazy in Love" had previously won Best R&B Song. Dangerously in Love, B\'Day and I Am... Sasha Fierce have all won Best Contemporary R&B Album. Beyoncé set the record for the most Grammy awards won by a female artist in one night in 2010 when she won six awards, breaking the tie she previously held with Alicia Keys, Norah Jones, Alison Krauss, and Amy Winehouse, with Adele equaling this in 2012. Following her role in Dreamgirls she was nominated for Best Original Song for "Listen" and Best Actress at the Golden Globe Awards, and Outstanding Actress in a Motion Picture at the NAACP Image Awards. Beyoncé won two awards at the Broadcast Film Critics Association Awards 2006; Best Song for "Listen" and Best Original Soundtrack for Dreamgirls: Music from the Motion Picture.',
        {'id': '56becc903aeaaa14008c949f', 'title': 'Beyoncé'}
    ),
    (
        0.8210567582713351,
        'A self-described "modern-day feminist", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny\'s Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award\'s history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.',
        {'id': '56be88473aeaaa14008c9080', 'title': 'Beyoncé'}
    )
]
```
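Each result in the output above is a tuple of similarity score, chunk text, and document metadata. The sketch below shows one way to unpack that shape into short summary lines. The `summarize` helper is hypothetical, and the chunk texts are shortened stand-ins for the real output.

```python
# Shortened stand-in for the printed results: (score, chunk text, metadata).
results = [
    (
        0.8423336495860181,
        "Beyoncé has won 20 Grammy Awards, both as a solo artist and member of Destiny's Child",
        {"id": "56becc903aeaaa14008c949f", "title": "Beyoncé"},
    ),
    (
        0.8210567582713351,
        "A self-described \"modern-day feminist\", Beyoncé creates songs",
        {"id": "56be88473aeaaa14008c9080", "title": "Beyoncé"},
    ),
]


def summarize(results, preview_chars=40):
    """Format each (score, chunk, metadata) tuple as a one-line summary."""
    lines = []
    for score, chunk, metadata in results:
        preview = chunk[:preview_chars]
        lines.append(f"{score:.3f} | {metadata['title']} | {preview}")
    return lines


for line in summarize(results):
    print(line)
```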

# Usage

## High-level Description

The Python SDK provides the building blocks for scalable vector search applications on PostgreSQL databases. It enables users to create a collection, which represents a schema in the database, to store tables for documents, chunks, models, splitters, and embeddings. The `Collection` class in the SDK handles all operations related to these tables, allowing users to interact with the collection and perform various tasks.

## Collections

Collections are the organizational building blocks of the SDK. They manage all documents and related chunks, embeddings, tsvectors, and pipelines.

### Creating Collections

By default, collections will read and write to the database specified by `DATABASE_URL`.

**Create a Collection that uses the default `DATABASE_URL` environment variable.**
```python
collection = Collection("test_collection")
```

**Create a Collection that reads from a different database than that set by the environment variable `DATABASE_URL`.**
```python
collection = Collection("test_collection", CUSTOM_DATABASE_URL)
```

### Upserting Documents

The `upsert_documents` method can be used to insert new documents and update existing documents.

New documents are dictionaries with two required keys: `id` and `text`. All other key/value pairs are stored as metadata for the document.

**Upsert new documents with metadata**
```python
documents = [
    {
        "id": "Document 1",
        "text": "Here are the contents of Document 1",
        "random_key": "this will be metadata for the document"
    },
    {
        "id": "Document 2",
        "text": "Here are the contents of Document 2",
        "random_key": "this will be metadata for the document"
    }
]
collection = Collection("test_collection")
await collection.upsert_documents(documents)
```

Document metadata can be updated by upserting the document without the `text` key.

**Update document metadata**
```python
documents = [
    {
        "id": "Document 1",
        "random_key": "this will be NEW metadata for the document"
    },
    {
        "id": "Document 2",
        "random_key": "this will be NEW metadata for the document"
    }
]
collection = Collection("test_collection")
await collection.upsert_documents(documents)
```

### Getting Documents

Documents can be retrieved using the `get_documents` method on the collection object.

**Get the first 100 documents**
```python
collection = Collection("test_collection")
documents = await collection.get_documents({ "limit": 100 })
```

#### Pagination

The Python SDK supports limit-offset pagination and keyset pagination.

**Limit-Offset pagination**
```python
collection = Collection("test_collection")
documents = await collection.get_documents({ "limit": 100, "offset": 10 })
```

**Keyset pagination**
```python
collection = Collection("test_collection")
documents = await collection.get_documents({ "limit": 100, "last_row_id": 10 })
```

The `last_row_id` can be taken from the `row_id` field in the returned document's dictionary.
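Keyset pagination is typically used in a loop that feeds the last `row_id` of each page into the next request. The sketch below shows that loop. To keep it runnable without a database it uses `FakeCollection`, a hypothetical in-memory stand-in; with the SDK you would pass a real `Collection` instead. It assumes rows are returned in `row_id` order and that `last_row_id` is exclusive.

```python
import asyncio


async def fetch_all_documents(collection, page_size=100):
    """Page through a collection with keyset pagination until it is exhausted."""
    documents = []
    last_row_id = None
    while True:
        options = {"limit": page_size}
        if last_row_id is not None:
            options["last_row_id"] = last_row_id
        page = await collection.get_documents(options)
        if not page:
            break
        documents.extend(page)
        # Resume after the last row seen on this page.
        last_row_id = page[-1]["row_id"]
    return documents


class FakeCollection:
    """Hypothetical in-memory stand-in so the loop can run without a database."""

    def __init__(self, total):
        self.rows = [{"row_id": i, "id": f"Document {i}"} for i in range(1, total + 1)]

    async def get_documents(self, options):
        last = options.get("last_row_id", 0)
        # Assumption: rows come back ordered by row_id, strictly after last_row_id.
        remaining = [row for row in self.rows if row["row_id"] > last]
        return remaining[: options["limit"]]


all_documents = asyncio.run(fetch_all_documents(FakeCollection(total=250), page_size=100))
print(len(all_documents))
```

Unlike limit-offset pagination, this pattern does not skip or repeat rows when documents are inserted while you are paging, provided the ordering assumption holds.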

#### Filtering

Metadata and full text filtering are supported just like they are in vector recall.

**Metadata and full text filtering**
```python
collection = Collection("test_collection")
documents = await collection.get_documents({
    "limit": 100,
    "offset": 10,
    "filter": {
        "metadata": {
            "id": {
                "$eq": 1
            }
        },
        "full_text_search": {
            "configuration": "english",
            "text": "Some full text query"
        }
    }
})
```

### Deleting Documents

Documents can be deleted with the `delete_documents` method on the collection object.

Metadata and full text filtering are supported just like they are in vector recall.

```python
documents = await collection.delete_documents({
    "metadata": {
        "id": {
            "$eq": 1
        }
    },
    "full_text_search": {
        "configuration": "english",
        "text": "Some full text query"
    }
})
```

### Searching Collections

The Python SDK is specifically designed to provide powerful, flexible vector search.

Pipelines are required to perform search. See the [Pipelines Section](#pipelines) for more information about using Pipelines.

**Basic vector search**
```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
```

**Vector search with custom limit**
```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
```

#### Metadata Filtering

We provide powerful and flexible arbitrarily nested metadata filtering based on [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support every operator listed there except `$nin`.

**Vector search with $eq metadata filtering**
```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = (
    await collection.query()
    .vector_recall("Here is some query", pipeline)
    .limit(10)
    .filter({
        "metadata": {
            "uuid": {
                "$eq": 1
            }    
        }
    })
    .fetch_all()
)
```

The above query would filter out all documents that do not contain a key `uuid` equal to `1`.

**Vector search with $gte metadata filtering**
```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = (
    await collection.query()
    .vector_recall("Here is some query", pipeline)
    .limit(10)
    .filter({
        "metadata": {
            "index": {
                "$gte": 3
            }    
        }
    })
    .fetch_all()
)
```

The above query would filter out all documents that do not contain a key `index` with a value greater than or equal to `3`.

**Vector search with $or and $and metadata filtering**
```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = (
    await collection.query()
    .vector_recall("Here is some query", pipeline)
    .limit(10)
    .filter({
        "metadata": {
            "$or": [
                {
                    "$and": [
                        {
                            "uuid": {
                                "$eq": 1
                            }    
                        },
                        {
                            "index": {
                                "$lt": 100 
                            }
                        }
                    ] 
                },
                {
                    "special": {
                        "$ne": True
                    }
                }
            ]    
        }
    })
    .fetch_all()
)
```

The above query would keep only documents whose key `special` is not equal to `True`, or that have a key `uuid` equal to 1 and a key `index` less than 100.
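To make the nesting rules concrete, here is a small local evaluator for this style of filter. It is an illustrative sketch only: the SDK evaluates filters inside the database, and the `matches` helper, the sample documents, and the handling of missing keys are all assumptions that may differ from the SDK's actual behavior.

```python
import operator

# Comparison operators mapped to their Python equivalents.
COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def matches(metadata, condition):
    """Return True if a metadata dict satisfies a nested filter condition."""
    for key, expected in condition.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in expected):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in expected):
                return False
        else:
            # Assumption: a document missing the key never matches.
            if key not in metadata:
                return False
            for op, operand in expected.items():
                if not COMPARISONS[op](metadata[key], operand):
                    return False
    return True


# The same structure as the "metadata" filter in the example above.
metadata_filter = {
    "$or": [
        {"$and": [{"uuid": {"$eq": 1}}, {"index": {"$lt": 100}}]},
        {"special": {"$ne": True}},
    ]
}

# Hypothetical documents used only to exercise the filter.
print(matches({"uuid": 1, "index": 50, "special": True}, metadata_filter))
print(matches({"uuid": 2, "index": 50, "special": True}, metadata_filter))
```

The first document passes through the `$and` branch. The second fails both branches, because its `uuid` is not 1 and its `special` value is `True`.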

#### Full Text Filtering

If full text search is enabled for the associated Pipeline, documents can be first filtered by full text search and then recalled by embedding similarity.

```python
collection = Collection("test_collection")
pipeline = Pipeline("test_pipeline")
results = (
    await collection.query()
    .vector_recall("Here is some query", pipeline)
    .limit(10)
    .filter({
        "full_text_search": {
            "configuration": "english",
            "text": "Match Me"
        }
    })
    .fetch_all()
)
```

The above query would first filter out all documents that do not match the full text search criteria, and then perform vector recall on the remaining documents.

## Pipelines

Collections can have any number of Pipelines. Each Pipeline is run every time documents are upserted.

Pipelines are composed of a Model, Splitter, and additional optional arguments.

### Models

Models are used for embedding chunked documents. We support almost every open source model on [Hugging Face](https://huggingface.co/), and also OpenAI's embedding models.

**Create a default Model "intfloat/e5-small" with default parameters: {}**
```python
model = Model()
```

**Create a Model with custom parameters**
```python
model = Model(
    name="hkunlp/instructor-base",
    parameters={"instruction": "Represent the Wikipedia document for retrieval: "}    
)
```

**Use an OpenAI model**
```python
model = Model(name="text-embedding-ada-002", source="openai")
```

### Splitters

Splitters are used to split documents into chunks before embedding them. We support splitters found in [LangChain](https://www.langchain.com/).

**Create a default Splitter "recursive_character" with default parameters: {}**
```python
splitter = Splitter()
```

**Create a Splitter with custom parameters**
```python
splitter = Splitter(
    name="recursive_character", 
    parameters={"chunk_size": 1500, "chunk_overlap": 40}
)
```
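To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a string into fixed-size character windows that share a few characters with their neighbor. This is a deliberately simplified stand-in, not the `recursive_character` algorithm, which also tries to break on separators such as paragraphs and sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is how overlap preserves context across chunk boundaries.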

### Adding Pipelines to a Collection 

When adding a Pipeline to a Collection, the Pipeline must have a Model and a Splitter.

The first time a Pipeline is added to a Collection it will automatically chunk and embed any documents already in that Collection. 

```python
model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter)
await collection.add_pipeline(pipeline)
```

### Enabling full text search

Pipelines can take additional arguments enabling full text search. When full text search is enabled, in addition to automatically chunking and embedding, the Pipeline will create the necessary tsvectors to perform full text search.

For more information on full text search please see: [Postgres Full Text Search](https://www.postgresql.org/docs/15/textsearch.html).

```python
model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter, {
    "full_text_search": {
        "active": True,
        "configuration": "english"
    }
})
await collection.add_pipeline(pipeline)
```

### Configuring HNSW Indexing Parameters

Our SDK uses [pgvector](https://github.com/pgvector/pgvector) for storing vectors and performing recall. We use HNSW indexing because it offers the best trade-off between query speed and recall.

The SDK lets you configure two HNSW parameters per pipeline: `m`, the maximum number of connections per layer (16 by default), and `ef_construction`, the size of the dynamic candidate list used when constructing the graph (64 by default).

```python
model = Model()
splitter = Splitter()
pipeline = Pipeline("test_pipeline", model, splitter, {
    "hnsw": {
        "m": 100,
        "ef_construction": 200 
    }
})
await collection.add_pipeline(pipeline)
```
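When choosing values, it may be useful to sanity-check them before creating the Pipeline. The helper below is a hypothetical sketch that builds the `"hnsw"` options dictionary used in the example above. The bounds it enforces (`m` between 2 and 100, `ef_construction` between 4 and 1000 and at least `2 * m`) are an assumption based on pgvector's limits, so verify them against the pgvector version you run.

```python
def hnsw_options(m=16, ef_construction=64):
    """Build the "hnsw" pipeline options after basic sanity checks."""
    # Assumption: these bounds mirror pgvector's limits; check your version.
    if not 2 <= m <= 100:
        raise ValueError("m should be between 2 and 100")
    if not 4 <= ef_construction <= 1000:
        raise ValueError("ef_construction should be between 4 and 1000")
    if ef_construction < 2 * m:
        raise ValueError("ef_construction should be at least 2 * m")
    return {"hnsw": {"m": m, "ef_construction": ef_construction}}


options = hnsw_options(m=100, ef_construction=200)
print(options)
```

The returned dictionary has the same shape as the fourth argument passed to `Pipeline` in the example above. Larger values generally improve recall at the cost of slower index builds and more memory.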

### Searching with Pipelines

Pipelines are a required argument when performing vector search. After a Pipeline has been added to a Collection, the Model and Splitter can be omitted when instantiating it.

```python
pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()    
```

### Enabling, Disabling, and Removing Pipelines

Pipelines can be disabled or removed to prevent them from running automatically when documents are upserted. 

**Disable a Pipeline**
```python
pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.disable_pipeline(pipeline)
```

Disabling a Pipeline prevents it from running automatically, but leaves all chunks and embeddings already created by that Pipeline in the database.

**Enable a Pipeline**
```python
pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.enable_pipeline(pipeline)
```

Enabling a Pipeline causes it to run automatically, chunking and embedding all documents it missed while disabled.

**Remove a Pipeline**
```python
pipeline = Pipeline("test_pipeline")
collection = Collection("test_collection")
await collection.remove_pipeline(pipeline)
```

Removing a Pipeline deletes it and all associated data from the database. Removed Pipelines cannot be re-enabled but can be recreated.

## Upgrading

Changes between SDK versions are not necessarily backwards compatible. We provide a migrate function to help transition smoothly.

```python
from pgml import migrate
await migrate()
```

This will migrate all collections to be compatible with the latest SDK version.

# Developer Setup

This Python library is generated from our core rust-sdk. Please check [rust-sdk documentation](../README.md) for developer setup.

# Roadmap

- [x] Enable filters on document metadata in `vector_search`. [Issue](https://github.com/postgresml/postgresml/issues/663)
- [x] `text_search` functionality on documents using Postgres text search. [Issue](https://github.com/postgresml/postgresml/issues/664)
- [x] `hybrid_search` functionality that does a combination of `vector_search` and `text_search`. [Issue](https://github.com/postgresml/postgresml/issues/665)
- [x] Ability to call and manage OpenAI embeddings for comparison purposes. [Issue](https://github.com/postgresml/postgresml/issues/666)
- [x] Perform chunking on the DB with multiple langchain splitters. [Issue](https://github.com/postgresml/postgresml/issues/668)
- [ ] Save `vector_search` history for downstream monitoring of model performance. [Issue](https://github.com/postgresml/postgresml/issues/667)

