Metadata-Version: 2.4
Name: navexa
Version: 0.1.2
Summary: Navexa document indexing and reasoning workflows
Author: Navexa Team
License-Expression: MIT
Project-URL: Homepage, https://github.com/deBUGger404/navexa.git
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai==1.101.0
Requires-Dist: pymupdf==1.26.4
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: tiktoken==0.11.0
Requires-Dist: python-dotenv==1.1.1
Requires-Dist: pyyaml==6.0.2
Requires-Dist: docling>=2.74.0
Provides-Extra: notebook
Requires-Dist: ipywidgets>=8.0.0; extra == "notebook"
Requires-Dist: jupyterlab_widgets>=3.0.0; extra == "notebook"
Requires-Dist: widgetsnbextension>=4.0.0; extra == "notebook"
Dynamic: license-file

# Navexa

Navexa is a Python library for turning PDFs into a hierarchical tree and running reasoning-based retrieval (vectorless RAG) on top of that tree.

Core capabilities:
- PDF indexing into `tree_navexa.json`
- Structured, semi-structured, unstructured, and transcript indexing flows
- Optional LLM-assisted TOC/outline/summaries
- Reasoning-based node retrieval and grounded answering
- Built-in usage and estimated cost tracking for LLM calls

## Installation

### Option 1: Local editable install (recommended during development)

```bash
cd /path/to/navexa  # root of your local Navexa checkout
python3 -m pip install -e .
```

### Option 2: Regular local install

```bash
cd /path/to/navexa  # root of your local Navexa checkout
python3 -m pip install .
```

### Option 3: From PyPI (once published)

```bash
python3 -m pip install navexa
```

For Jupyter Notebook/Lab, install the notebook extra to avoid the common tqdm warning
(`TqdmWarning: IProgress not found`):

```bash
python3 -m pip install "navexa[notebook]"
```

### Option 4: From Git (if hosted)

```bash
python3 -m pip install git+https://github.com/<your-org>/navexa.git
```

## Environment and LLM Setup

Navexa reads its configuration from OS environment variables and `.env` files.

`.env` loading order:
1. `NAVEXA_ENV_FILE` (explicit file path)
2. `.env` in the current working directory (or a parent directory)
3. repo-local `.env` (kept for backward compatibility)
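
The loading order above can be sketched roughly as follows. This is an illustration only, not Navexa's actual loader: the function name `find_env_file`, its `repo_root` argument, the choice to stop at the first file found, and the decision to walk every parent directory are all assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def find_env_file(cwd: Path, repo_root: Optional[Path] = None) -> Optional[Path]:
    """Return the first .env candidate that exists, following the documented order."""
    # 1. Explicit file path from NAVEXA_ENV_FILE (ignored here if the file is missing).
    explicit = os.environ.get("NAVEXA_ENV_FILE")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # 2. .env in the working directory, then in each parent directory.
    for directory in (cwd, *cwd.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate

    # 3. Repo-local .env (hypothetical location passed in by the caller).
    if repo_root is not None and (repo_root / ".env").is_file():
        return repo_root / ".env"

    return None
```

Calling `find_env_file(Path.cwd())` before importing Navexa is one way to check which file you would expect to be picked up.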

Minimal setup for OpenAI:

```bash
export NAVEXA_LLM_PROVIDER="openai"
export OPENAI_API_KEY="..."
export OPENAI_MODEL_NAME="gpt-4.1-mini"
```

Minimal setup for Azure OpenAI:

```bash
export NAVEXA_LLM_PROVIDER="azure"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_BASE_URL="https://<resource>.openai.azure.com/openai/v1"
export AZURE_DEPLOYMENT_NAME="<deployment-name>"
export AZURE_DEPLOYMENT_RAW_NAME="gpt-4.1-mini"
```

To use a custom `.env` path:

```bash
export NAVEXA_ENV_FILE="/absolute/path/to/.env"
```

You can also copy `.env.example` to `.env` and fill in your values.

## Supported Environment Variables

### LLM credentials and routing

| Variable | Purpose | Example |
|---|---|---|
| `NAVEXA_LLM_PROVIDER` | LLM provider switch | `openai` or `azure` |
| `OPENAI_API_KEY` | OpenAI API key (openai provider) | `sk-...` |
| `OPENAI_MODEL_NAME` | OpenAI model name (openai provider) | `gpt-4.1-mini` |
| `AZURE_OPENAI_API_KEY` | Azure API key (azure provider) | `...` |
| `AZURE_OPENAI_BASE_URL` | Azure base URL (azure provider) | `https://.../openai/v1` |
| `AZURE_DEPLOYMENT_NAME` | Azure deployment name (azure provider) | `my-deployment` |
| `AZURE_DEPLOYMENT_RAW_NAME` | Raw model name for pricing map/metadata | `gpt-4.1-mini` |

### Model resolution order

When resolving the model for an API call, Navexa checks, in order:
1. the explicit `model=` parameter, if it is not `None`
2. `OPENAI_MODEL_NAME`, if the provider is `openai`
3. `AZURE_DEPLOYMENT_NAME`, if the provider is `azure`
4. if none of these is set, a configuration error is raised (there is no fallback model)
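
As a minimal sketch of this resolution order (not Navexa's internal code): the helper name `resolve_model` and the use of `RuntimeError` for the configuration error are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_model(model: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a model name: explicit argument first, then provider-specific variable."""
    env = os.environ if env is None else env

    # 1. An explicit model= argument always wins.
    if model:
        return model

    # 2./3. Otherwise pick the variable that matches the configured provider.
    provider = env.get("NAVEXA_LLM_PROVIDER", "").strip().lower()
    if provider == "openai":
        name = env.get("OPENAI_MODEL_NAME")
    elif provider == "azure":
        name = env.get("AZURE_DEPLOYMENT_NAME")
    else:
        name = None

    # 4. No fallback: fail loudly (the exception type is an assumption).
    if not name:
        raise RuntimeError("No model configured: pass model= or set the provider variable")
    return name
```

Passing a plain dict as `env` makes the behavior easy to check without touching the real environment.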

### Pipeline defaults

| Variable | Purpose | Default |
|---|---|---|
| `NAVEXA_MODE` | Runtime mode (`llm` or `no-llm`) | `no-llm` |
| `NAVEXA_DOCUMENT_TYPE` | Default doc type | `structured` |
| `NAVEXA_VERBOSE` | Verbosity (`low`,`medium`,`high` or `1`,`2`,`3`) | `medium` |
| `NAVEXA_IF_ADD_NODE_SUMMARY` | Include summaries (`yes`/`no`) | `yes` |
| `NAVEXA_MAX_TOKEN_NUM_EACH_NODE` | Max tokens per node | `12000` |
| `NAVEXA_MAX_PAGE_NUM_EACH_NODE` | Max pages per node | `8` |
| `NAVEXA_DISABLE_SUMMARY` | Force-disable summaries | `0` |

### Docling parser controls

| Variable | Purpose | Default |
|---|---|---|
| `NAVEXA_PARSER_MODEL` | Parser backend | `docling` |
| `NAVEXA_DOCLING_OUTPUT_FORMAT` | `markdown` or `text` | `markdown` |
| `NAVEXA_DOCLING_OCR` | Enable OCR (`0/1`) | `0` |
| `NAVEXA_RAPIDOCR_BACKEND` | OCR backend | `torch` |
| `NAVEXA_DOCLING_FORCE_FULL_PAGE_OCR` | Force OCR on full page | `0` |
| `NAVEXA_DOCLING_PICTURE_DESCRIPTION` | Picture descriptions | `0` |
| `NAVEXA_DOCLING_REMOTE_SERVICES` | Enable remote services | `0` |
| `NAVEXA_DOCLING_IMAGE_MODE` | `placeholder`, `embedded`, `referenced`, `none` | `placeholder` |
| `NAVEXA_DOCLING_QUIET` | Reduce Docling/RapidOCR logs | `1` |

## Quick Start

```python
from navexa import index_structured_document_tree, save_document_tree

result = index_structured_document_tree(
    pdf_path="/path/to/file.pdf",
    mode="llm",
    model="gpt-4.1-mini",
    verbosity="medium",
    parser_model="docling",
    output_format="markdown",
    max_token_num_each_node=12000,
    max_page_num_each_node=8,
    if_add_node_summary="yes",
)

save = save_document_tree(
    index_result=result,
    out_dir=None,               # defaults to <pdf_dir>/<pdf_stem>_navexa_out
    write_tree=True,
    write_validation=True,
    write_compat=False,
)

print(save.out_dir)
print(save.paths)
print(result.tree_navexa["cost"])
```

## Public APIs

Top-level imports (from `navexa`):
- `index_structured_document_tree`
- `index_semi_structured_document_tree`
- `index_unstructured_document_tree`
- `index_transcript_document_tree`
- `save_document_tree`
- `index_and_save_document_tree`
- `fetch_document_tree`
- `fetch_validation_report`
- `fetch_compat_tree`
- `build_search_tree_view`
- `build_node_index`
- `reason_over_tree`
- `print_reasoning_trace`
- `extract_selected_context`
- `answer_from_context`
- `run_reasoning_rag`
- `load_navexa_env`

## Indexing Functions

### 1) `index_structured_document_tree(...)`

LLM requirement: optional  
Best for: documents with clear headings/TOC

### 2) `index_semi_structured_document_tree(...)`

LLM requirement: required  
Best for: inconsistent headings/order

### 3) `index_unstructured_document_tree(...)`

LLM requirement: optional  
Best for: weak heading structure

### 4) `index_transcript_document_tree(...)`

LLM requirement: required  
Best for: meeting/interview transcript documents

## Shared Indexing Parameters and Values

| Parameter | Type | Allowed Values | Default |
|---|---|---|---|
| `pdf_path` | `str` | valid PDF path | required |
| `model` | `Optional[str]` | provider model/deployment name | `None` |
| `mode` | `Optional[str]` | `llm`, `use-llm`, `with-llm`, `no-llm` | `None` |
| `verbosity` | `Optional[str]` | `low`, `medium`, `high`, `1`, `2`, `3`, `debug`, `detailed` | `None` |
| `parser_model` | `Optional[str]` | currently `docling` only | `None` |
| `output_format` | `Optional[str]` | `markdown`, `text` | `None` |
| `max_token_num_each_node` | `int` | `>=1` | `12000` |
| `max_page_num_each_node` | `int` | `>=1` | `8` |
| `if_add_node_summary` | `str` | `yes`, `no` | `yes` |

Function-specific extra parameters:
- `semi_heading_prompt_template` in `index_semi_structured_document_tree`
- `transcript_topic_prompt_template` in `index_transcript_document_tree`

Backward compatibility:
- `index_document_tree(...)` still exists and routes to the structured flow.

## Save and Fetch APIs

### Save

`save_document_tree(index_result, out_dir=None, save_mode="explicit", write_tree=True, write_validation=False, write_compat=False)`

Notes:
- At least one of `write_tree`, `write_validation`, `write_compat` must be `True`
- If `out_dir=None`, default is `<pdf_dir>/<pdf_stem>_navexa_out`
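
The default output location can be reproduced with `pathlib`. This is a sketch of the documented pattern only; the helper name `default_out_dir` is hypothetical and not part of the Navexa API.

```python
from pathlib import Path


def default_out_dir(pdf_path: str) -> Path:
    """Build <pdf_dir>/<pdf_stem>_navexa_out for a given PDF path."""
    pdf = Path(pdf_path)
    # stem drops the final extension: "annual_report.pdf" -> "annual_report"
    return pdf.parent / f"{pdf.stem}_navexa_out"


print(default_out_dir("/data/reports/annual_report.pdf").as_posix())
# /data/reports/annual_report_navexa_out
```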

### Fetch

- `fetch_document_tree(source, file_name="tree_navexa.json")`
- `fetch_validation_report(source)`
- `fetch_compat_tree(source)`

`source` can be:
- in-memory dict
- JSON file path
- output directory path
- `IndexResult` object
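
To make the four `source` forms concrete, here is a rough sketch of how such a value could be normalized into a tree dict. It is not the implementation of `fetch_document_tree`; the function name `load_tree_sketch` is hypothetical, and the only detail taken from this README is that an index result exposes a `tree_navexa` attribute.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_tree_sketch(source: Any, file_name: str = "tree_navexa.json") -> Dict[str, Any]:
    """Accept a dict, a result object, a JSON file path, or an output directory."""
    # In-memory dict: already a tree.
    if isinstance(source, dict):
        return source

    # Result-like object: use its tree_navexa attribute.
    if hasattr(source, "tree_navexa"):
        return source.tree_navexa

    # Otherwise treat the value as a path; a directory implies the default file name.
    path = Path(source)
    if path.is_dir():
        path = path / file_name
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```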

## Reasoning and RAG APIs

### Tree preparation

`build_search_tree_view(tree, strip_fields=("exclusive_text", "full_text"))`

`build_node_index(tree, include_page_ranges=True, exclude_fields=None)`

### Tree reasoning

`reason_over_tree(query, tree, model=None, prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, strip_fields=("exclusive_text","full_text"), prompt_extra=None)`

Return object fields:
- `thinking`
- `node_list`
- `raw_response`
- `used_prompt` (if requested)
- `parsed_json`

### Trace printing

`print_reasoning_trace(reasoning_result, node_index)`

### Context extraction

`extract_selected_context(tree, node_list, text_mode="inclusive", dedupe_ancestor=True)`

`text_mode` values:
- `inclusive` -> uses `full_text`
- `exclusive` -> uses `exclusive_text`

`dedupe_ancestor` behavior:
- `True`: if parent and child are both selected, child is dropped
- `False`: keeps all selected nodes
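
The `dedupe_ancestor=True` rule can be illustrated with a small stand-alone sketch. It assumes each node is a dict with a `node_id` key and an optional `children` list; the `node_id` field name and both helper names are assumptions, and this is not Navexa's own code.

```python
from typing import Any, Dict, List, Optional


def build_parent_map(nodes: List[Dict[str, Any]],
                     parent: Optional[str] = None,
                     out: Optional[Dict[str, Optional[str]]] = None) -> Dict[str, Optional[str]]:
    """Map every node_id to its parent's node_id (None for root nodes)."""
    out = {} if out is None else out
    for node in nodes:
        out[node["node_id"]] = parent
        build_parent_map(node.get("children") or [], node["node_id"], out)
    return out


def dedupe_ancestors(selected: List[str],
                     parents: Dict[str, Optional[str]]) -> List[str]:
    """Drop any selected node that has a selected ancestor; keep input order."""
    chosen = set(selected)
    kept = []
    for node_id in selected:
        # Walk upward until a selected ancestor is found or the root is passed.
        ancestor = parents.get(node_id)
        while ancestor is not None and ancestor not in chosen:
            ancestor = parents.get(ancestor)
        if ancestor is None:
            kept.append(node_id)
    return kept


structure = [
    {"node_id": "0001", "children": [
        {"node_id": "0002", "children": [{"node_id": "0003"}]},
    ]},
    {"node_id": "0004"},
]
parents = build_parent_map(structure)
print(dedupe_ancestors(["0001", "0003", "0004"], parents))
# ['0001', '0004']
```

Here `0003` is dropped because its ancestor `0001` is also selected, which avoids repeating the same text when the ancestor's `full_text` is used.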

### Answer generation

`answer_from_context(query, context_text, model=None, prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, prompt_extra=None)`

### End-to-end

`run_reasoning_rag(query, tree_or_source, model=None, tree_prompt_template=None, answer_prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, strip_fields=("exclusive_text","full_text"), text_mode="inclusive", dedupe_ancestor=True, tree_prompt_extra=None, answer_prompt_extra=None)`

Returns:
- `tree`, `tree_view`, `node_index`
- `reasoning`, `context`, `answer`
- `cost_before`, `cost_after`, `cost_delta`
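
Conceptually, `cost_delta` is the difference between the two snapshots. The sketch below shows that idea on plain dicts; the key names `calls` and `total_tokens` are hypothetical, and the real cost objects may be structured differently.

```python
from typing import Dict


def cost_delta_sketch(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Subtract the 'before' snapshot from the 'after' snapshot, key by key."""
    keys = sorted(set(before) | set(after))
    # Keys missing from one snapshot are treated as zero.
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


before = {"calls": 2, "total_tokens": 1500}
after = {"calls": 5, "total_tokens": 4200}
print(cost_delta_sketch(before, after))
# {'calls': 3, 'total_tokens': 2700}
```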

## End-to-End Example (Index + RAG)

```python
from navexa import (
    index_structured_document_tree,
    save_document_tree,
    fetch_document_tree,
    run_reasoning_rag,
    print_reasoning_trace,
)

pdf_path = "/path/to/file.pdf"
query = "What are the key warnings?"

index_result = index_structured_document_tree(
    pdf_path=pdf_path,
    mode="llm",
    model=None,
    verbosity="high",
)

save_result = save_document_tree(
    index_result=index_result,
    out_dir="/path/to/output",
    write_tree=True,
    write_validation=True,
    write_compat=False,
)

tree = fetch_document_tree(save_result.out_dir)

rag = run_reasoning_rag(
    query=query,
    tree_or_source=tree,
    model=None,
    return_prompt=True,
    verbosity="high",
)

print_reasoning_trace(rag.reasoning, rag.node_index)
print("\nAnswer:\n", rag.answer.answer)
print("\nCost delta:\n", rag.cost_delta)
```

## CLI Usage

After installation, the `navexa-index` CLI entry point is available:

```bash
navexa-index --pdf /path/file.pdf --out-dir /path/out
```

If the shell reports `navexa-index: command not found`:
- If you installed into a virtualenv, activate it first: `source <venv>/bin/activate`
- If you installed with `pip install --user`, the script is usually at `~/.local/bin/navexa-index`. Add it to PATH once:

```bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

All CLI flags:

| Flag | Allowed Values | Default |
|---|---|---|
| `--pdf` | PDF path | required |
| `--out-dir` | output dir | required |
| `--model` | model/deployment string | `None` |
| `--mode` | `llm`, `no-llm` | `NAVEXA_MODE` or `no-llm` |
| `--document-type` | `structured`, `semi_structured`, `unstructured`, `transcript` | `structured` |
| `--parser-model` | `docling` | `docling` |
| `--output-format` | `markdown`, `text` | `markdown` |
| `--verbose` | `1`,`2`,`3`,`low`,`medium`,`high` | `medium` |
| `--max-token-num-each-node` | int | `12000` |
| `--max-page-num-each-node` | int | `8` |
| `--if-add-node-summary` | `yes`, `no` | `yes` |
| `--with-validation` | switch | off |
| `--with-compat` | switch | off |

Example:

```bash
navexa-index \
  --pdf /path/file.pdf \
  --out-dir /path/out \
  --mode llm \
  --document-type structured \
  --output-format markdown \
  --verbose high \
  --if-add-node-summary yes \
  --with-validation
```

## Output Files

Generated files:
- `tree_navexa.json` (canonical)
- `validation_report.json` (optional)
- `tree_legacy_compat.json` (optional)

`tree_navexa.json` top-level keys include:
- `doc_id`, `doc_name`, `pages`, `pipeline_version`
- `source` (path/hash/page count)
- `cost` (calls/tokens/estimated cost)
- `pipeline` (mode, document type, parser settings, node limits, steps)
- `structure` (hierarchical nodes with `children`)
- `transcript` (present for transcript flow)
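
Because `structure` is hierarchical, a short recursive walk is usually enough to inspect it. The sketch below assumes `structure` is a list of root nodes and that nodes carry a `title` field; both are assumptions, only the `children` key comes from the description above.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_nodes(nodes: List[Dict[str, Any]], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in document order."""
    for node in nodes:
        yield depth, node
        yield from iter_nodes(node.get("children") or [], depth + 1)


# Tiny stand-in for a loaded tree_navexa.json.
tree = {
    "structure": [
        {"title": "Introduction", "children": [{"title": "Scope"}]},
        {"title": "Warnings"},
    ]
}

for depth, node in iter_nodes(tree["structure"]):
    print("  " * depth + node.get("title", "<untitled>"))
```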

## Logging and Cost

Verbosity:
- `low`: compact result logs
- `medium`: stage-by-stage logs
- `high`: debug-level details and usage deltas

Cost fields are available in:
- `tree_navexa["cost"]`
- `IndexResult.meta["cost"]`
- `RAGResult.cost_delta` for reasoning runs

## Prompt Overrides and Custom LLM

You can override prompts:
- `semi_heading_prompt_template`
- `transcript_topic_prompt_template`
- `tree_prompt_template`
- `answer_prompt_template`

You can provide your own LLM callable:

```python
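def my_llm(prompt: str, model: str, stage: str) -> str:
    # The tree-search stage returns a JSON string with `thinking` and `node_list`.
    if stage == "tree_search":
        return '{"thinking":"picked relevant nodes","node_list":["0003","0010"]}'
    # Other stages (for example, answer generation) return plain text.
    return "Answer based on selected context."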
```

Pass it with `llm_callable=my_llm`.
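
For offline tests it can help to use a stub that also records which stages were called. The factory below is a hypothetical helper, not part of Navexa; it keeps the same `(prompt, model, stage)` parameters as the callable above and assumes, as that example does, that the tree-search stage is named `tree_search`.

```python
import json
from typing import Callable, List, Tuple


def make_recording_llm(node_ids: List[str]) -> Tuple[Callable[[str, str, str], str], List[str]]:
    """Return a stub LLM callable plus the list it appends stage names to."""
    stages: List[str] = []

    def llm(prompt: str, model: str, stage: str) -> str:
        stages.append(stage)
        if stage == "tree_search":
            # Build valid JSON instead of hand-writing the string.
            return json.dumps({"thinking": "stubbed selection", "node_list": node_ids})
        return "Stubbed answer."

    return llm, stages


llm, stages = make_recording_llm(["0003", "0010"])
```

Pass `llm_callable=llm` to the reasoning functions, then inspect `stages` afterwards to see which stages ran without making any real API calls.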

## Failure Behavior and Troubleshooting

Common cases:
- Missing API key in LLM-required flows (`semi_structured`, `transcript`) -> raises `RuntimeError`
- Missing API key in optional LLM flow -> switches to `no-llm`
- Invalid `parser_model` -> `ValueError` (currently only `docling`)
- Invalid `output_format` -> `ValueError` (`markdown` or `text`)
- Empty/invalid tree input to reasoning APIs -> `ValueError`

Notebook note:
- The reasoning functions handle `asyncio.run()` event-loop conflicts by falling back to a worker thread.
- If you still run into notebook event-loop issues, restart the kernel and re-run the imports.
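
For readers unfamiliar with this kind of thread fallback, the general pattern looks like the sketch below. It is a generic illustration under the assumption that Navexa does something similar; the helper name `run_coro_sync` is hypothetical.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code, even inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running (typical in notebooks): asyncio.run() would raise here,
    # so run the coroutine on a fresh loop in a worker thread and wait for the result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0)
    return a + b


print(run_coro_sync(add(1, 2)))
# 3
```

Note that the fallback blocks the calling thread until the coroutine finishes, which is acceptable for a notebook cell but not for code running inside a busy event loop.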


## Acknowledgment

Navexa is an independent implementation. Some architecture patterns and selected adapted code paths were informed by PageIndex.

Attribution files:
- `THIRD_PARTY_NOTICES.md`
- `LICENSE`

## License

This project is licensed under the MIT License.

Practical compliance checklist:
1. Keep this repository's `LICENSE` file.
2. Keep third-party attribution notices.
3. Preserve upstream MIT notice for copied/substantially adapted portions.

This documentation is technical guidance, not legal advice.
