Metadata-Version: 2.1
Name: lexoid
Version: 0.1.8
Summary: 
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: bs4 (>=0.0.2,<0.0.3)
Requires-Dist: docx2pdf (>=0.1.8,<0.2.0)
Requires-Dist: google-generativeai (>=0.8.1,<0.9.0)
Requires-Dist: huggingface-hub (>=0.27.0,<0.28.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: markdown (>=3.7,<4.0)
Requires-Dist: markdownify (>=0.13.1,<0.14.0)
Requires-Dist: nest-asyncio (>=1.6.0,<2.0.0)
Requires-Dist: openai (>=1.47.0,<2.0.0)
Requires-Dist: opencv-python (>=4.10.0.84,<5.0.0.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pdfplumber (>=0.11.4,<0.12.0)
Requires-Dist: pikepdf (>=9.3.0,<10.0.0)
Requires-Dist: playwright (>=1.49.0,<2.0.0)
Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
Requires-Dist: pyqt5 (>=5.15.11,<6.0.0) ; platform_system != "debian"
Requires-Dist: pyqtwebengine (>=5.15.7,<6.0.0) ; platform_system != "debian"
Requires-Dist: python-docx (>=1.1.2,<2.0.0)
Requires-Dist: python-dotenv (>=1.0.0,<2.0.0)
Requires-Dist: tabulate (>=0.9.0,<0.10.0)
Description-Content-Type: text/markdown

# Lexoid

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oidlabs-com/Lexoid/blob/main/examples/example_notebook_colab.ipynb)
[![GitHub license](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://github.com/oidlabs-com/Lexoid/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/lexoid)](https://pypi.org/project/lexoid/)

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

## Motivation
- Leverage the multimodal capabilities of LLMs
- Make document parsing convenient for users
- Encourage collaboration through a permissive license

## Installation
### Installing with pip
```
pip install lexoid
```

To use LLM-based parsing, define the following environment variables, or create a `.env` file containing the same definitions:
```
OPENAI_API_KEY=""
GOOGLE_API_KEY=""
```
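
Before calling an LLM-based parser, it can help to confirm that the keys are visible to the Python process. The following is a minimal sketch using only the standard library; `missing_api_keys` is a hypothetical helper and not part of Lexoid, and which key you actually need presumably depends on the model provider you use.

```python
import os

# Variable names taken from the definitions above.
API_KEY_NAMES = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of API key variables that are unset or empty.

    Hypothetical helper for illustration; not part of the Lexoid API.
    """
    env = os.environ if env is None else env
    return [name for name in API_KEY_NAMES if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All API keys are set")
```

Keys stored in a `.env` file only appear in `os.environ` once something loads that file, for example `load_dotenv()` from the `python-dotenv` package listed among the dependencies.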

Optionally, to use `Playwright` for retrieving web content (instead of the `requests` library):
```
playwright install --with-deps --only-shell chromium
```

### Building `.whl` from source
```
make build
```

### Creating a local installation
To install dependencies:
```
make install
```
or, to install with dev-dependencies:
```
make dev
```

To activate the virtual environment:
```
source .venv/bin/activate
```

## Usage
[Example Notebook](https://github.com/oidlabs-com/Lexoid/blob/main/examples/example_notebook.ipynb)

[Example Colab Notebook](https://drive.google.com/file/d/1v9R6VOUp9CEGalgZGeg5G57XzHqh_tB6/view?usp=sharing)

Here's a quick example of parsing documents with Lexoid:
```python
from lexoid.api import parse, ParserType

# Parse a web page by URL
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE", raw=True)
# or parse a local PDF file
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE", raw=True)

print(parsed_md)
```

### Parameters
- `path` (str): The file path or URL of the document to parse.
- `parser_type` (str, optional): The type of parser to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Defaults to "AUTO".
- `raw` (bool, optional): If True, return raw text; otherwise return structured data. Defaults to False.
- `pages_per_split` (int, optional): Number of pages per split for chunking. Defaults to 4.
- `max_threads` (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
- `**kwargs`: Additional arguments for the parser.
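
To make `pages_per_split` and `max_threads` more concrete, here is a rough sketch of how a document could be chunked into page ranges and processed in parallel. This illustrates the general idea only and is not Lexoid's actual implementation; `split_page_ranges`, `process_chunks`, and `describe` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_page_ranges(total_pages, pages_per_split=4):
    """Return half-open (start, end) page ranges that cover total_pages."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


def process_chunks(total_pages, handle_chunk, pages_per_split=4, max_threads=4):
    """Apply handle_chunk to each page range in parallel, keeping input order."""
    ranges = split_page_ranges(total_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(handle_chunk, ranges))


def describe(page_range):
    """Hypothetical stand-in for parsing one chunk: it only labels the range."""
    start, end = page_range
    return f"pages {start + 1}-{end}"


print(process_chunks(10, describe))
# prints ['pages 1-4', 'pages 5-8', 'pages 9-10']
```

Under this sketch, a smaller `pages_per_split` yields more chunks, and `max_threads` caps how many of them are handled at once.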

## Benchmark
Initial results (_more updates soon_).

_Note:_ Benchmarks are currently run in a zero-shot setting.

| Rank | Model/Framework | Similarity | Time (s) |
|------|-----------|------------|----------|
| 1 | gpt-4o | 0.799 | 21.77 |
| 2 | gemini-2.0-flash-exp | 0.797 | 13.47 |
| 3 | gemini-exp-1121 | 0.779 | 30.88 |
| 4 | gemini-1.5-pro | 0.742 | 15.77 |
| 5 | gpt-4o-mini | 0.721 | 14.86 |
| 6 | gemini-1.5-flash | 0.702 | 4.56 |
| 7 | Llama-3.2-11B-Vision-Instruct (via HF) | 0.582 | 21.74 |
| 8 | Llama-3.2-11B-Vision-Instruct-Turbo (via Together AI) | 0.556 | 4.58 |
| 9 | Llama-3.2-90B-Vision-Instruct-Turbo (via Together AI) | 0.527 | 10.57 |
| 10 | Llama-Vision-Free (via Together AI) | 0.435 | 8.42 |
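
The table does not state how the Similarity column is computed. As one plausible way to score parsed output against a reference document on a 0 to 1 scale, the sketch below uses the standard library's `difflib.SequenceMatcher`. This is an assumption for illustration, not necessarily the metric behind the numbers above; `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(parsed: str, reference: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical.

    Hypothetical scoring helper; the benchmark's real metric may differ.
    """
    return SequenceMatcher(None, parsed, reference).ratio()


reference_md = "# Report\n\nTotal revenue rose in 2023."
parsed_md = "# Report\n\nTotal revenue rose in 2O23."  # one OCR-style character error

print(round(similarity(parsed_md, reference_md), 3))
```

A score close to 1 means the parsed text closely matches the reference, while the Time (s) column is what the score is traded off against.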

