Metadata-Version: 2.1
Name: searchflow
Version: 0.0.88
Summary: An assistant helping you to index webpages into structured datasets.
License: MIT
Author: Ben Selleslagh
Author-email: ben@vectrix.ai
Requires-Python: >=3.11,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: asyncpg (>=0.29.0,<0.30.0)
Requires-Dist: colorlog (>=6.8.2,<7.0.0)
Requires-Dist: fake-useragent (>=1.5.1,<2.0.0)
Requires-Dist: fastapi (>=0.111.0,<0.112.0)
Requires-Dist: greenlet (>=3.0.3,<4.0.0)
Requires-Dist: langchain (>=0.2.7,<0.3.0)
Requires-Dist: langchain-anthropic (>=0.1.23,<0.2.0)
Requires-Dist: langchain-cohere
Requires-Dist: langchain-community (>=0.2.7,<0.3.0)
Requires-Dist: langchain-openai (>=0.1.14,<0.2.0)
Requires-Dist: langchain-postgres (>=0.0.9,<0.0.10)
Requires-Dist: langgraph (>=0.2.14,<0.3.0)
Requires-Dist: langgraph-cli (>=0.1.48,<0.2.0)
Requires-Dist: o365 (>=2.0.36,<3.0.0)
Requires-Dist: poppler-utils (>=0.1.0,<0.2.0)
Requires-Dist: protobuf (>=4.21.6,<5.0.0)
Requires-Dist: psycopg-binary (>=3.2.1,<4.0.0)
Requires-Dist: psycopg2-binary (>=2.9.9,<3.0.0)
Requires-Dist: spider-client (>=0.0.69,<0.0.70)
Requires-Dist: streamlit (>=1.36.0,<2.0.0)
Requires-Dist: supabase (>=2.7.2,<3.0.0)
Requires-Dist: tiktoken (>=0.7.0,<0.8.0)
Requires-Dist: trafilatura (>=1.9.0,<2.0.0)
Requires-Dist: unstructured-client (>=0.25.5,<0.26.0)
Requires-Dist: unstructured[all-docs] (>=0.15.7,<0.16.0)
Requires-Dist: uvicorn (>=0.30.1,<0.31.0)
Requires-Dist: validators (>=0.28.1,<0.29.0)
Description-Content-Type: text/markdown

![PyPI - Version](https://img.shields.io/pypi/v/searchflow) ![Website](https://img.shields.io/website?url=https%3A%2F%2Fvectrix.ai) ![GitHub License](https://img.shields.io/github/license/vectrix-ai/SearchFlow)
 ![X (formerly Twitter) Follow](https://img.shields.io/twitter/follow/bselleslagh) ![Docker Image Version (tag)](https://img.shields.io/docker/v/bselleslagh/searchflow/latest)



# SearchFlow

SearchFlow is an assistant that helps you index webpages into structured datasets. It scrapes web content, processes it into chunks, and stores it in PostgreSQL for vector search and LLM-based question answering.

## Features

- **Web Scraping**: Uses `trafilatura` for focused crawling and web scraping.
- **Document Processing**: Supports chunking and processing of various document types.
- **Database Management**: Manages projects, documents, and prompts using PostgreSQL.
- **Vector Search**: Utilizes vector search for document retrieval.
- **LLM Integration**: Integrates with language models for question answering and document grading.
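
To make the chunking idea from the feature list concrete, here is a minimal sketch of splitting text into overlapping character windows. It is an illustration only, not SearchFlow's actual implementation: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions chosen for this example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper for illustration; SearchFlow's own chunker may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps sentences that straddle a boundary visible in both neighboring chunks, which tends to help retrieval at the cost of some duplicated storage.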

## Installation

To set up a consistent development environment, use the provided `Dockerfile` and `.devcontainer/devcontainer.json`.

### Prerequisites

- Docker
- Python 3.11 or 3.12 (the package metadata requires `>=3.11,<3.13`)

### Steps



## Usage

Install SearchFlow via pip:
```bash
pip install searchflow
```

### Quickstart
1. **Initialize the Database**

```python
from searchflow.db.postgresql import DB

# Create the database handle used in the following steps
db = DB()
```

2. **Create a Project**

```python
db.create_project(project_name="example_project")
```

3. **Import Data from a URL**

```python
from searchflow.importers import WebScraper
scraper = WebScraper(project_name='MyProject', db=db)
scraper.full_import("https://example.com", max_pages=100)
```

4. **Upload a File to the Project**

```python
from searchflow.importers import Files

# Read the file as raw bytes
with open("path/to/your/file.pdf", "rb") as f:
    bytes_data = f.read()

files = Files()
# document_data is a list of (file_bytes, file_name) tuples
files.upload_file(
    document_data=[(bytes_data, "file.pdf")],
    project_name="MyProject",
    inference_type="local",
)
```

5. **List Files in a Project**

```python
files.list_files(project_name="MyProject")
```

6. **Remove a File from a Project**

```python
files.remove_file(project_name="MyProject", file_name="file.pdf")
```
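
The upload step above passes `document_data` as a list of `(bytes, file_name)` tuples. The sketch below shows one way to build that list from several paths. The helper name `build_document_data` is hypothetical and not part of SearchFlow, and passing more than one tuple to `upload_file` is an assumption based on the list shape shown in step 4.

```python
from pathlib import Path


def build_document_data(paths):
    """Return a list of (file_bytes, file_name) tuples, one per path.

    Hypothetical helper; it only mirrors the tuple shape used in step 4.
    """
    document_data = []
    for path in map(Path, paths):
        # read_bytes() opens the file in binary mode and closes it afterwards
        document_data.append((path.read_bytes(), path.name))
    return document_data
```

The result could then be passed as `document_data=build_document_data([...])` in the `files.upload_file` call, assuming the method accepts multiple entries.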

### Question Answering



### Vector Search
To perform a similarity search:

```python
from searchflow.db.postgresql import DB
db = DB()
results = db.similarity_search(project_name="example_project", query="example query")
```
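
For readers new to similarity search, the following sketch shows the general idea of ranking stored vectors against a query vector. It uses cosine similarity, one common metric; this README does not state which metric SearchFlow uses, and the names `cosine_similarity` and `top_k` are assumptions made for this example rather than SearchFlow APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (name, score) pairs with the highest similarity to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([name for name, _ in top_k([1.0, 0.0], docs, k=2)])
# ['a', 'c']
```

In practice the vectors come from an embedding model and the ranking is typically done inside the database rather than in Python.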
