Metadata-Version: 2.4
Name: navexa
Version: 0.1.6
Summary: Navexa document indexing and reasoning workflows
Author: Navexa Team
License-Expression: MIT
Project-URL: Homepage, https://debugger404.github.io/navexa-docs
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai==1.101.0
Requires-Dist: pymupdf==1.26.4
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: tiktoken==0.11.0
Requires-Dist: python-dotenv==1.1.1
Requires-Dist: pyyaml==6.0.2
Requires-Dist: docling>=2.74.0
Requires-Dist: onnxruntime>=1.18.0
Provides-Extra: notebook
Requires-Dist: ipywidgets>=8.0.0; extra == "notebook"
Requires-Dist: jupyterlab_widgets>=3.0.0; extra == "notebook"
Requires-Dist: widgetsnbextension>=4.0.0; extra == "notebook"
Dynamic: license-file

<p align="center">
  <img src="https://debugger404.github.io/navexa-docs/fill_icon2.png" alt="Navexa icon" width="120" />
</p>

<h1 align="center">Navexa</h1>

<p align="center">
  Tree-first PDF indexing and reasoning RAG for structured, semi-structured,
  unstructured, and transcript documents.
</p>

<p align="center">
  <a href="https://pypi.org/project/navexa/"><img src="https://img.shields.io/pypi/v/navexa?logo=pypi&label=PyPI" alt="PyPI version" /></a>
  <a href="https://pypi.org/project/navexa/"><img src="https://img.shields.io/pypi/pyversions/navexa" alt="Python versions" /></a>
  <a href="https://github.com/debugger404/navexa/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License: MIT" /></a>
  <a href="https://debugger404.github.io/navexa-docs/"><img src="https://img.shields.io/badge/docs-live-blue" alt="Documentation" /></a>
  <a href="https://github.com/debugger404/navexa/issues"><img src="https://img.shields.io/github/issues/debugger404/navexa" alt="GitHub issues" /></a>
  <a href="https://github.com/debugger404/navexa/blob/main/CONTRIBUTING.md"><img src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg" alt="PRs welcome" /></a>
  <a href="https://github.com/debugger404/navexa"><img src="https://img.shields.io/badge/github-navexa-black?logo=github" alt="GitHub Repo" /></a>
</p>

<p align="center">
  <a href="https://debugger404.github.io/navexa-docs/">Documentation</a>
  ·
  <a href="https://github.com/debugger404/navexa">GitHub Repository</a>
  ·
  <a href="https://github.com/debugger404/navexa/blob/main/CONTRIBUTING.md">Contributing</a>
  ·
  <a href="RELEASE.md">Release Notes</a>
  ·
  <a href="https://github.com/debugger404/navexa/issues">Issues</a>
  ·
  <a href="#installation">Installation</a>
  ·
  <a href="#quick-start">Quick Start</a>
</p>

---

Navexa is a Python library for turning PDFs into a hierarchical tree and running reasoning-based retrieval (vectorless RAG) on top of that tree.

Core capabilities:
- PDF indexing into `tree_navexa.json`
- Structured, semi-structured, unstructured, and transcript indexing flows
- Optional LLM-assisted TOC/outline/summaries
- Reasoning-based node retrieval and grounded answering
- Built-in usage and estimated cost tracking for LLM calls

## What's New in 0.1.6

- Preferred grouped runtime config:
  - `model_config={...}` for provider/model credentials and routing
  - `parser_config={...}` for parser name, output format, and Docling options
- CLI now matches the Python API:
  - `--model-config`
  - `--parser-config`
- `semi_structured` supports both `llm` and `no-llm`
- `transcript` remains the only indexing flow that is strictly LLM-required
- Recoverable Docling issues are surfaced as:
  - one-line Navexa warnings at runtime
  - structured entries in `tree_navexa.pipeline.warnings`
- External reasoning integrations use `BaseExternalLLM`

For the full version history, see [`RELEASE.md`](RELEASE.md).

## Installation

### Option 1: Use the current local source (editable)

```bash
cd navexa
python3 -m pip install -e .
```

### Option 2: From source (regular)

```bash
cd navexa
python3 -m pip install .
```

### Option 3: From PyPI (recommended for users)

```bash
python3 -m pip install navexa
```

### Update Navexa

Upgrade to latest:

```bash
python3 -m pip install --upgrade navexa
```

Install a specific version:

```bash
python3 -m pip install navexa==0.1.6
```

For Jupyter Notebook/Lab, install the notebook extra to avoid the common tqdm warning
(`TqdmWarning: IProgress not found`):

```bash
python3 -m pip install "navexa[notebook]"
```

### Check the installed version

```python
import navexa
print(navexa.__version__)
```

### Option 4: From GitHub source

```bash
python3 -m pip install git+https://github.com/debugger404/navexa.git
```

### Use the local source in notebooks

If you want a notebook to use the current local copy of Navexa:

```bash
cd navexa
python3 -m pip install -e .
```

If a notebook shows `ModuleNotFoundError: navexa`, check which Python
environment the notebook kernel is using:

```python
import navexa, sys
print(navexa.__file__)
print(sys.executable)
```

## Environment and LLM Setup

You do not need a `.env` file for parser/indexing behavior.
Use `parser_config={...}` in API calls when you want per-run parser behavior.

Use environment variables only when needed for:
- LLM credentials/provider routing, or
- global defaults you want shared across runs.

Navexa reads environment from OS variables and `.env` files.

Env loading order:
1. `NAVEXA_ENV_FILE` (explicit file path)
2. `.env` in current working directory (or parent)
3. repo-local `.env` (backward compatibility)

Minimal setup for OpenAI:

```bash
export NAVEXA_LLM_PROVIDER="openai"
export OPENAI_API_KEY="..."
export OPENAI_MODEL_NAME="gpt-4.1-mini"
```

Minimal setup for Azure OpenAI:

```bash
export NAVEXA_LLM_PROVIDER="azure"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_BASE_URL="https://<resource>.openai.azure.com"
export AZURE_DEPLOYMENT_NAME="<deployment-name>"
export AZURE_DEPLOYMENT_RAW_NAME="gpt-4.1-mini"
```

If `AZURE_OPENAI_BASE_URL` is set to just the resource URL, Navexa appends
`/openai/v1` automatically.
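
The suffix rule above can be sketched as follows. This is a minimal illustration, not Navexa's internal code: the helper name `normalize_azure_base_url` is hypothetical, and it assumes a URL that already ends in `/openai/v1` is left unchanged.

```python
def normalize_azure_base_url(base_url: str) -> str:
    """Return base_url ending in '/openai/v1' (illustrative helper, not a Navexa API)."""
    trimmed = base_url.rstrip("/")
    # Assumption: a URL that already carries the suffix is returned as-is.
    if trimmed.endswith("/openai/v1"):
        return trimmed
    return f"{trimmed}/openai/v1"


print(normalize_azure_base_url("https://example-resource.openai.azure.com"))
```

In practice you only need to set the resource URL; the sketch just makes the resulting endpoint shape explicit.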

Use a custom `.env` path (optional):

```bash
export NAVEXA_ENV_FILE="./.env"
```

You can also copy `.env.example` and fill in your values (optional).

## Direct `model_config` in Code

If you do not want to rely on `.env` loading, you can pass provider/model
settings directly in API calls with `model_config={...}`.

Recommended shape:

```python
model_config = {
    "provider": "openai",      # openai | azure
    "model": "gpt-4.1-mini",   # Azure: deployment name
    "api_key": "...",
    "base_url": None,          # required for azure
    "pricing_model": None,     # optional; useful for Azure custom deployment names
}
```

Azure example:

```python
model_config = {
    "provider": "azure",
    "model": "my-gpt41mini-deployment",
    "api_key": "...",
    "base_url": "https://<resource>.openai.azure.com",
    "pricing_model": "gpt-4.1-mini",
}
```

Azure aliases also work:
- `deployment_name` instead of `model`
- `deployment_raw_name` instead of `pricing_model`

For Azure, `model_config` takes the same core values as the env setup:
- `provider="azure"`
- `api_key` -> same value as `AZURE_OPENAI_API_KEY`
- `base_url` -> same value as `AZURE_OPENAI_BASE_URL`
- `model` or `deployment_name` -> same value as `AZURE_DEPLOYMENT_NAME`
- `pricing_model` or `deployment_raw_name` -> same value as `AZURE_DEPLOYMENT_RAW_NAME`
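
To make the alias mapping concrete, here is a small sketch that folds the Azure aliases into the canonical keys. The helper `normalize_model_config` is hypothetical, and the tie-break (the canonical key wins when both spellings are present) is an assumption rather than documented Navexa behavior.

```python
# Alias -> canonical key, as listed in the bullets above.
ALIASES = {"deployment_name": "model", "deployment_raw_name": "pricing_model"}


def normalize_model_config(model_config: dict) -> dict:
    """Return a copy of model_config that uses only canonical keys (illustrative)."""
    normalized = {k: v for k, v in model_config.items() if k not in ALIASES}
    for alias, canonical in ALIASES.items():
        # Assumption: an explicitly set canonical key takes priority over its alias.
        if alias in model_config and normalized.get(canonical) is None:
            normalized[canonical] = model_config[alias]
    return normalized


azure_config = normalize_model_config({
    "provider": "azure",
    "deployment_name": "my-gpt41mini-deployment",
    "deployment_raw_name": "gpt-4.1-mini",
})
print(azure_config["model"], azure_config["pricing_model"])
```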

Use it in API code:

```python
from navexa import index_structured_document_tree

result = index_structured_document_tree(
    pdf_path="/path/to/file.pdf",
    mode="llm",
    model_config={
        "provider": "azure",
        "deployment_name": "my-gpt41mini-deployment",
        "deployment_raw_name": "gpt-4.1-mini",
        "api_key": "...",
        "base_url": "https://<resource>.openai.azure.com",
    },
)
```

Support:
- Python API supports `model_config={...}`
- CLI supports `--model-config`
- legacy `model=` / `--model` still work for compatibility but are deprecated

Model/credential precedence:

Preferred usage (choose one):
1. env defaults, or
2. `model_config={...}` in code

Deprecated compatibility path:
- explicit `model=...`

Actual resolution order today (first match wins):
1. explicit `model=...` (deprecated but still supported)
2. `model_config["model"]`
3. env default model (`OPENAI_MODEL_NAME` or `AZURE_DEPLOYMENT_NAME`)
4. if still missing: configuration error

Credential/base URL precedence (first match wins):

1. `model_config`
2. `.env` / OS environment

Output safety:
- Navexa stores a redacted summary of resolved model settings in
  `tree_navexa.pipeline.model_config`
- API keys are never written to output JSON

## Recommended Parser Setup

Use this rule:

1. Put long-lived defaults in `.env`
2. Pass `parser_config={...}` in the function when you want per-run overrides
3. Avoid mixing `parser_config` with old parser fields

Recommended parser shape:

```python
parser_config = {
    "name": "docling",
    "output_format": "markdown",
    "options": {
        "profile": "balanced",
        "do_ocr": True,
        "force_full_page_ocr": True,
        "do_table_structure": True,
        "do_picture_description": False,
        "enable_remote_services": False,
        "backend": "torch",
        "image_mode": "placeholder",
        "quiet": True,
    },
}
```

Precedence order (later entries override earlier ones):

1. built-in defaults
2. `.env` defaults
3. explicit `parser_config`

For Docling parser options specifically, the merge order is (later entries override earlier ones):

1. selected profile defaults
2. matching `.env` values
3. explicit `parser_config["options"]`

Example:

- profile: `fast_text` sets `do_ocr=False`
- `.env`: `NAVEXA_DOCLING_OCR=1`
- code: `parser_config["options"]={"do_ocr": False}`

Final value:
- `do_ocr=False`

Why:
- profile is the starting point
- `.env` overrides the profile
- explicit function config overrides both
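
The three-layer merge can be sketched in a few lines. This is an illustration under stated assumptions, not Navexa's implementation: the function `merge_docling_options` is hypothetical, the profile values are copied from the profile table in this README, the env-variable-to-option pairing is inferred from the variable names, and `"1"` is assumed to mean enabled.

```python
# Profile defaults, taken from the "Profile behavior" table in this README.
PROFILE_DEFAULTS = {
    "balanced": {"do_ocr": True, "force_full_page_ocr": True,
                 "do_table_structure": True, "do_picture_description": False},
    "image_manual": {"do_ocr": True, "force_full_page_ocr": True,
                     "do_table_structure": True, "do_picture_description": True},
    "fast_text": {"do_ocr": False, "force_full_page_ocr": False,
                  "do_table_structure": True, "do_picture_description": False},
}

# Assumption: this pairing is inferred from the variable names.
ENV_TO_OPTION = {
    "NAVEXA_DOCLING_OCR": "do_ocr",
    "NAVEXA_DOCLING_FORCE_FULL_PAGE_OCR": "force_full_page_ocr",
    "NAVEXA_DOCLING_TABLE_STRUCTURE": "do_table_structure",
    "NAVEXA_DOCLING_PICTURE_DESCRIPTION": "do_picture_description",
}


def merge_docling_options(profile: str, environ: dict, explicit_options: dict) -> dict:
    """Merge profile defaults, then env values, then explicit options (illustrative)."""
    merged = dict(PROFILE_DEFAULTS[profile])       # 1. profile is the starting point
    for variable, option in ENV_TO_OPTION.items():
        if variable in environ:
            merged[option] = environ[variable] == "1"  # 2. env overrides the profile
    merged.update(explicit_options)                # 3. explicit options override both
    return merged


env = {"NAVEXA_DOCLING_OCR": "1"}
print(merge_docling_options("fast_text", env, {})["do_ocr"])                 # env wins over profile
print(merge_docling_options("fast_text", env, {"do_ocr": False})["do_ocr"])  # explicit wins over env
```

Running the merge once without the explicit option shows the `.env` layer taking effect, which the single example above cannot show because the profile and the explicit value agree.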

Deprecated Python API parser inputs:
- `parser_model`
- `output_format`
- `docling_options`

These legacy Python API fields still work for compatibility, but they are deprecated.
If `parser_config` is provided, it takes precedence and the legacy parser fields are ignored.

## Supported Environment Variables

These env values are defaults. If you pass `model_config={...}` or
`parser_config={...}` in code, the explicit code values win.

### LLM credentials and routing

| Variable | Purpose | Example |
|---|---|---|
| `NAVEXA_LLM_PROVIDER` | LLM provider switch | `openai` or `azure` |
| `OPENAI_API_KEY` | OpenAI API key (openai provider) | `sk-...` |
| `OPENAI_MODEL_NAME` | OpenAI model name (openai provider) | `gpt-4.1-mini` |
| `AZURE_OPENAI_API_KEY` | Azure API key (azure provider) | `...` |
| `AZURE_OPENAI_BASE_URL` | Azure base URL (azure provider) | `https://<resource>.openai.azure.com` |
| `AZURE_DEPLOYMENT_NAME` | Azure deployment name (azure provider) | `my-deployment` |
| `AZURE_DEPLOYMENT_RAW_NAME` | Raw model name for pricing map/metadata | `gpt-4.1-mini` |

### Model resolution order

Navexa resolves the model in this order (first match wins):
1. explicit `model=` parameter (deprecated but still supported)
2. `model_config["model"]`
3. if provider is `openai`: `OPENAI_MODEL_NAME`
4. if provider is `azure`: `AZURE_DEPLOYMENT_NAME`
5. if still missing: raise configuration error (no fallback)

Provider resolution order:
1. `model_config["provider"]`
2. `NAVEXA_LLM_PROVIDER`
3. default: `openai`
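
The two resolution orders above can be written as a short sketch. The function names and the `ConfigurationError` class are hypothetical stand-ins; the README does not state which exception type Navexa raises for a missing model.

```python
from typing import Mapping, Optional


class ConfigurationError(Exception):
    """Hypothetical stand-in for the configuration error described above."""


def resolve_provider(model_config: Optional[Mapping], environ: Mapping[str, str]) -> str:
    """model_config['provider'], then NAVEXA_LLM_PROVIDER, then 'openai'."""
    config = model_config or {}
    return config.get("provider") or environ.get("NAVEXA_LLM_PROVIDER") or "openai"


def resolve_model(model: Optional[str], model_config: Optional[Mapping],
                  environ: Mapping[str, str]) -> str:
    """Explicit model=, then model_config['model'], then the provider's env default."""
    config = model_config or {}
    if model:                       # deprecated explicit parameter still wins
        return model
    if config.get("model"):
        return config["model"]
    provider = resolve_provider(config, environ)
    env_var = "AZURE_DEPLOYMENT_NAME" if provider == "azure" else "OPENAI_MODEL_NAME"
    if environ.get(env_var):
        return environ[env_var]
    raise ConfigurationError(f"No model configured; set model_config['model'] or {env_var}")


print(resolve_model(None, {"provider": "azure"}, {"AZURE_DEPLOYMENT_NAME": "my-deployment"}))
```

Passing `environ` as a plain mapping (for example `os.environ`) keeps the sketch easy to test without touching real environment variables.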

### Pipeline defaults

| Variable | Purpose | Default |
|---|---|---|
| `NAVEXA_MODE` | Runtime mode (`llm` or `no-llm`) | `no-llm` |
| `NAVEXA_DOCUMENT_TYPE` | Default doc type | `structured` |
| `NAVEXA_VERBOSE` | Verbosity (`low`,`medium`,`high` or `1`,`2`,`3`) | `medium` |
| `NAVEXA_IF_ADD_NODE_SUMMARY` | Include summaries (`yes`/`no`) | `yes` |
| `NAVEXA_MAX_TOKEN_NUM_EACH_NODE` | Max tokens per node | `12000` |
| `NAVEXA_MAX_PAGE_NUM_EACH_NODE` | Max pages per node | `8` |
| `NAVEXA_DISABLE_SUMMARY` | Force-disable summaries | `0` |

### Parser Environment Defaults

| Variable | Purpose | Default |
|---|---|---|
| `NAVEXA_DOCLING_PROFILE` | Docling profile preset (`balanced`, `image_manual`, `fast_text`) | `balanced` |
| `NAVEXA_PARSER_MODEL` | Parser backend | `docling` |
| `NAVEXA_DOCLING_OUTPUT_FORMAT` | `markdown` or `text` | `markdown` |
| `NAVEXA_DOCLING_OCR` | Enable OCR (`0/1`) | `1` |
| `NAVEXA_RAPIDOCR_BACKEND` | OCR backend | `torch` |
| `NAVEXA_DOCLING_FORCE_FULL_PAGE_OCR` | Force OCR on full page | `1` |
| `NAVEXA_DOCLING_TABLE_STRUCTURE` | Enable table structure extraction (`0/1`) | `1` |
| `NAVEXA_DOCLING_PICTURE_DESCRIPTION` | Picture descriptions | `0` |
| `NAVEXA_DOCLING_REMOTE_SERVICES` | Enable remote services | `0` |
| `NAVEXA_DOCLING_IMAGE_MODE` | `placeholder`, `embedded`, `referenced`, `none` | `placeholder` |
| `NAVEXA_DOCLING_QUIET` | Reduce Docling/RapidOCR logs | `1` |

These env values act as defaults for `parser_config` when the matching field is not passed explicitly in code.

## Quick Start

```python
from navexa import index_structured_document_tree, save_document_tree

result = index_structured_document_tree(
    pdf_path="/path/to/file.pdf",
    mode="llm",
    model_config={
        "provider": "openai",
        "model": "gpt-4.1-mini",
        "api_key": "...",
    },
    verbosity="medium",
    parser_config={
        "name": "docling",
        "output_format": "markdown",
        "options": {
            "profile": "balanced",            # balanced | image_manual | fast_text
            "do_ocr": True,                   # optional override
            "force_full_page_ocr": True,      # optional override
            "do_table_structure": True,       # optional override
            "do_picture_description": False,  # optional override
            "backend": "torch",               # torch | onnxruntime
            "image_mode": "placeholder",      # placeholder | embedded | referenced | none
            "quiet": True,
        },
    },
    max_token_num_each_node=12000,
    max_page_num_each_node=8,
    if_add_node_summary="yes",
)

save = save_document_tree(
    index_result=result,
    out_dir=None,               # defaults to <pdf_dir>/<pdf_stem>_navexa_out
    write_tree=True,
    write_validation=True,
    write_compat=False,
)

print(save.out_dir)
print(save.paths)
print(result.tree_navexa["cost"])
```

## Public APIs

Top-level imports (from `navexa`):
- `index_structured_document_tree`
- `index_semi_structured_document_tree`
- `index_unstructured_document_tree`
- `index_transcript_document_tree`
- `save_document_tree`
- `index_and_save_document_tree`
- `fetch_document_tree`
- `fetch_validation_report`
- `fetch_compat_tree`
- `build_search_tree_view`
- `build_node_index`
- `reason_over_tree`
- `print_reasoning_trace`
- `extract_selected_context`
- `answer_from_context`
- `run_reasoning_rag`
- `load_navexa_env`

## Indexing Functions

### 1) `index_structured_document_tree(...)`

LLM requirement: optional  
Best for: documents with clear headings/TOC

### 2) `index_semi_structured_document_tree(...)`

LLM requirement: optional  
Best for: documents with inconsistent headings or ordering, where deterministic parsing still works but an LLM can improve weak headings through heading normalization

Behavior:
- without LLM, Navexa still builds headings deterministically from the markdown tree and
  base outline pipeline
- if that outline is weak, it falls back to heuristic heading generation from page text
- with LLM enabled, Navexa uses the same deterministic base and then improves weak
  headings via heading normalization

Runtime signal:
- logs include `heading_source=existing|heuristic|llm` for semi-structured flow
- output JSON includes `pipeline.semi_structured_source`

### 3) `index_unstructured_document_tree(...)`

LLM requirement: optional  
Best for: weak heading structure; builds chunk-based synthetic sections

### 4) `index_transcript_document_tree(...)`

LLM requirement: required  
Best for: meeting/interview transcript documents

Note: transcript indexing does not use parser/Docling configuration, so this API does not expose
`parser_config`, `parser_model`, `output_format`, or `docling_options`.

### TOC/Section Strategy by Document Type

| Document Type | Uses TOC Detection Pipeline | Uses LLM for TOC/Heading | Deterministic Fallback |
|---|---|---|---|
| `structured` | yes | optional | yes |
| `semi_structured` | yes | optional (heading normalization enhancer) | yes |
| `unstructured` | no (chunk-first synthetic sections) | optional (title generation only) | yes |
| `transcript` | n/a (topic grouping flow) | required | no (requires LLM) |

## Parser Config Summary

New grouped parameter: `parser_config`

Available in:
- `index_structured_document_tree(...)`
- `index_semi_structured_document_tree(...)`
- `index_unstructured_document_tree(...)`
- compatibility wrappers (`index_document_tree(...)`, `index_and_save_document_tree(...)`)

Recommended usage:

```python
parser_config={
    "name": "docling",
    "output_format": "markdown",
    "options": {"profile": "balanced"},
}
```

Effective parser config is:
- logged at runtime, and
- stored in output JSON at `tree_navexa.pipeline.parser_config`

Recoverable Docling extraction warnings are stored at `tree_navexa.pipeline.warnings`.

Recoverable warning behavior:
- Navexa suppresses raw Docling tracebacks when usable content was still recovered
- Navexa logs a one-line warning instead
- the warning is preserved in output JSON for later inspection

Precedence reminder:
- parser profile defaults are applied first
- matching `.env` values override the profile defaults
- explicit `parser_config["options"]` overrides both
- if legacy Python API fields are also passed, `parser_config` wins

Inside `parser_config`:
- `name`
- `output_format`
- `options`

Current default behavior:
- `profile="balanced"`
- `do_ocr=True`
- `force_full_page_ocr=True`
- `do_table_structure=True`
- `do_picture_description=False`
- `enable_remote_services=False`
- `backend="torch"`
- `image_mode="placeholder"`
- `quiet=True`

Deprecated Python API parser fields:
- `parser_model`
- `output_format`
- `docling_options`

These remain accepted in the Python API for compatibility but are deprecated. Prefer `parser_config`.

## Shared Indexing Parameters and Values

| Parameter | Type | Allowed Values | Default |
|---|---|---|---|
| `pdf_path` | `str` | valid PDF path | required |
| `model` | `Optional[str]` | provider model/deployment name | deprecated |
| `model_config` | `Optional[dict]` | see model section above | `None` |
| `mode` | `Optional[str]` | `llm`, `use-llm`, `with-llm`, `no-llm` | `None` |
| `verbosity` | `Optional[str]` | `low`, `medium`, `high`, `1`, `2`, `3`, `debug`, `detailed` | `None` |
| `parser_config` | `Optional[dict]` | see parser section below | `None` |
| `parser_model` | `Optional[str]` | currently `docling` only | deprecated |
| `output_format` | `Optional[str]` | `markdown`, `text` | deprecated |
| `docling_options` | `Optional[dict]` | legacy Docling options | deprecated |
| `max_token_num_each_node` | `int` | `>=1` | `12000` |
| `max_page_num_each_node` | `int` | `>=1` | `8` |
| `if_add_node_summary` | `str` | `yes`, `no` | `yes` |

The parser-related parameters above apply to:
- `index_structured_document_tree(...)`
- `index_semi_structured_document_tree(...)`
- `index_unstructured_document_tree(...)`

Model-related note:
- prefer `model_config` for in-code configuration
- `model` still works for compatibility, but it is deprecated
- all indexing APIs accept `model`
- all indexing APIs accept `model_config`
- if both are passed, `model` wins over `model_config["model"]`

Transcript-specific note:
- `index_transcript_document_tree(...)` does not expose `mode`, `parser_config`,
  `parser_model`, `output_format`, or `docling_options`

Structured/semi-structured note:
- if you set `parser_config["output_format"]="text"` or legacy `output_format="text"`
  for `structured` or `semi_structured`, Navexa will force it to `markdown`
- reason: heading/tree extraction for these flows depends on markdown heading structure
- Navexa logs a warning when this coercion happens
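
The coercion rule can be sketched as a small function. This is an illustration only: `effective_output_format` and the logger name are hypothetical, and the `ValueError` for unknown formats follows the troubleshooting section of this README.

```python
import logging

logger = logging.getLogger("navexa_example")  # hypothetical logger name

MARKDOWN_ONLY_TYPES = {"structured", "semi_structured"}
VALID_FORMATS = {"markdown", "text"}


def effective_output_format(document_type: str, requested: str) -> str:
    """Return the format that would actually be used for a document type (illustrative)."""
    if requested not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    if requested == "text" and document_type in MARKDOWN_ONLY_TYPES:
        # Heading/tree extraction for these flows depends on markdown headings.
        logger.warning("output_format='text' coerced to 'markdown' for %s", document_type)
        return "markdown"
    return requested


print(effective_output_format("structured", "text"))
print(effective_output_format("unstructured", "text"))
```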

Function-specific extra parameters:
- `semi_heading_prompt_template` in `index_semi_structured_document_tree`
- `transcript_topic_prompt_template` in `index_transcript_document_tree`

Backward compatibility:
- `index_document_tree(...)` still exists and routes to structured flow.

## `parser_config` Dictionary Reference

Use `parser_config` when you want parser behavior controlled directly in code.

Step by step:

1. set parser name in `parser_config["name"]`
2. set parser output in `parser_config["output_format"]`
3. put parser-specific options in `parser_config["options"]`
4. pass only the fields you want to override

Example:

```python
parser_config = {
    "name": "docling",
    "output_format": "markdown",
    "options": {
        "profile": "image_manual",
        "backend": "torch",
        "do_picture_description": True,
    },
}
```

Current parser names:
- `docling`

Current output formats:
- `markdown`
- `text`

Important behavior:
- `structured` and `semi_structured` do not truly run in `text` parser mode
- if `output_format="text"` is requested for those flows, Navexa coerces it to `markdown`
- `unstructured` can still use either `markdown` or `text`

`parser_config["options"]` for `name="docling"`:

| Key | Type | Allowed Values | Default | Scenario |
|---|---|---|---|---|
| `profile` | `str` | `balanced`, `image_manual`, `fast_text` | `balanced` | Start-point preset |
| `do_ocr` | `bool` | `True`, `False` | `True` | Turn OCR on/off |
| `force_full_page_ocr` | `bool` | `True`, `False` | `True` | Scanned/image PDFs |
| `do_table_structure` | `bool` | `True`, `False` | `True` | Table-heavy docs |
| `do_picture_description` | `bool` | `True`, `False` | `False` | Image/manual docs |
| `enable_remote_services` | `bool` | `True`, `False` | `False` | Remote enrichment |
| `backend` | `str` | `torch`, `onnxruntime` | `torch` | OCR runtime choice |
| `image_mode` | `str` | `placeholder`, `embedded`, `referenced`, `none` | `placeholder` | Markdown image policy |
| `quiet` | `bool` | `True`, `False` | `True` | Reduce parser logs |

Profile behavior:

| Profile | OCR | Full-page OCR | Table Structure | Picture Description | Best For |
|---|---|---|---|---|---|
| `balanced` | on | on | on | off | Most documents |
| `image_manual` | on | on | on | on | Image-heavy manuals/decks |
| `fast_text` | off | off | on | off | Native digital text PDFs |

Backend note:
- `backend="torch"`: default, generally most stable.
- `backend="onnxruntime"`: often lighter/faster on CPU-only environments.
- `onnxruntime` is included in Navexa install dependencies.

Quiet note:
- `quiet=True` reduces normal Docling/RapidOCR log noise
- major failures still raise clear Navexa errors
- recoverable parser issues are still reported through `pipeline.warnings`

Legacy compatibility note:
- `parser_model`, `output_format`, and `docling_options` still work
- they are deprecated in the Python API
- if `parser_config` is also passed, `parser_config` wins

## Save and Fetch APIs

### Save

`save_document_tree(index_result, out_dir=None, save_mode="explicit", write_tree=True, write_validation=False, write_compat=False)`

Notes:
- At least one of `write_tree/write_validation/write_compat` must be `True`
- If `out_dir=None`, default is `<pdf_dir>/<pdf_stem>_navexa_out`
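
As a sketch of the two notes above, the default directory can be derived with `pathlib`, and the flag check is a single condition. Both helper names are hypothetical, and the `ValueError` is an assumption since the README does not name the exception type.

```python
from pathlib import PurePosixPath


def default_out_dir(pdf_path: str) -> PurePosixPath:
    """Build <pdf_dir>/<pdf_stem>_navexa_out for a POSIX-style path (illustrative)."""
    pdf = PurePosixPath(pdf_path)
    return pdf.parent / f"{pdf.stem}_navexa_out"


def check_write_flags(write_tree: bool, write_validation: bool, write_compat: bool) -> None:
    """Reject a call that would write nothing (exception type is an assumption)."""
    if not (write_tree or write_validation or write_compat):
        raise ValueError("At least one of write_tree/write_validation/write_compat must be True")


print(default_out_dir("/data/reports/q3.pdf"))
```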

### Fetch

- `fetch_document_tree(source, file_name="tree_navexa.json")`
- `fetch_validation_report(source)`
- `fetch_compat_tree(source)`

`source` can be:
- in-memory dict
- JSON file path
- output directory path
- `IndexResult` object

## Reasoning and RAG APIs

### Tree preparation

`build_search_tree_view(tree, strip_fields=("exclusive_text", "full_text"))`

`build_node_index(tree, include_page_ranges=True, exclude_fields=None)`

### Tree reasoning

`reason_over_tree(query, tree, model=None, model_config=None, prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, strip_fields=("exclusive_text","full_text"), prompt_extra=None)`

Return object fields:
- `thinking`
- `node_list`
- `raw_response`
- `used_prompt` (if requested)
- `parsed_json`

### Trace print

`print_reasoning_trace(reasoning_result, node_index)`

### Context extraction

`extract_selected_context(tree, node_list, text_mode="inclusive", dedupe_ancestor=True)`

`text_mode` values:
- `inclusive` -> uses `full_text`
- `exclusive` -> uses `exclusive_text`

`dedupe_ancestor` behavior:
- `True`: if parent and child are both selected, child is dropped
- `False`: keeps all selected nodes
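
The `dedupe_ancestor=True` rule can be illustrated with a plain parent map. This is a sketch, not Navexa's implementation: `dedupe_by_ancestor` and the `parent_of` mapping are hypothetical, and the sketch assumes the rule applies to any selected ancestor, not only the direct parent.

```python
from typing import Dict, List, Optional


def dedupe_by_ancestor(selected: List[str], parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Drop every selected node that has a selected ancestor (illustrative)."""
    chosen = set(selected)
    kept = []
    for node_id in selected:
        ancestor = parent_of.get(node_id)
        covered = False
        while ancestor is not None:          # walk up toward the root
            if ancestor in chosen:
                covered = True
                break
            ancestor = parent_of.get(ancestor)
        if not covered:
            kept.append(node_id)
    return kept


# Hypothetical tree: 0001 -> 0002 -> 0003, and a separate root 0004.
parents = {"0001": None, "0002": "0001", "0003": "0002", "0004": None}
print(dedupe_by_ancestor(["0001", "0003", "0004"], parents))
```

Node `0003` is dropped because its ancestor `0001` is already selected; `0004` has no selected ancestor and is kept.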

### Answer generation

`answer_from_context(query, context_text, model=None, model_config=None, prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, prompt_extra=None)`

### End-to-end

`run_reasoning_rag(query, tree_or_source, model=None, model_config=None, tree_prompt_template=None, answer_prompt_template=None, llm_callable=None, return_prompt=False, verbosity=None, strip_fields=("exclusive_text","full_text"), text_mode="inclusive", dedupe_ancestor=True, tree_prompt_extra=None, answer_prompt_extra=None)`

Returns:
- `tree`, `tree_view`, `node_index`
- `reasoning`, `context`, `answer`
- `cost_before`, `cost_after`, `cost_delta`

## Prompt Template + `prompt_extra` Examples

Important:
- `prompt_extra` (and `tree_prompt_extra` / `answer_prompt_extra`) is passed as JSON payload.
- If your template includes `{extra_json}`, Navexa replaces it with payload JSON.
- If your template does not include `{extra_json}` and payload is provided, Navexa appends an `Additional ... (JSON)` section automatically.
- If payload is empty, nothing is appended.
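
The placeholder rules above can be sketched roughly as follows. The helper `apply_prompt_extra` is hypothetical, the section title is left as the elided `Additional ... (JSON)` because its exact wording is not given here, and replacing `{extra_json}` with an empty string when the payload is empty is an assumption. Plain string replacement is used so that literal JSON braces in a template stay untouched.

```python
import json
from typing import Optional


def apply_prompt_extra(prompt: str, extra: Optional[dict],
                       section_title: str = "Additional ... (JSON)") -> str:
    """Insert or append the extra payload as JSON (illustrative, not Navexa's code)."""
    payload = json.dumps(extra, indent=2) if extra else ""
    if "{extra_json}" in prompt:
        # Assumption: an empty payload leaves an empty string in place of the placeholder.
        return prompt.replace("{extra_json}", payload)
    if not extra:
        return prompt  # nothing is appended for an empty payload
    return f"{prompt.rstrip()}\n\n{section_title}:\n{payload}\n"


print(apply_prompt_extra("Question: {query}\nPolicy: {extra_json}", {"max_nodes": 2}))
```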

Example: `reason_over_tree(...)` with custom prompt template and `prompt_extra`

```python
from navexa import reason_over_tree

custom_tree_prompt = """
You are a strict node selector.
Question: {query}

Tree:
{tree_json}

Constraints:
{extra_json}

Return JSON:
{"thinking":"...", "node_list":["0001"]}
"""

reasoning = reason_over_tree(
    query="What are key warnings?",
    tree=tree,
    model_config={"provider": "openai", "model": "gpt-4.1-mini", "api_key": "..."},
    prompt_template=custom_tree_prompt,
    prompt_extra={"must_include_terms": ["warning", "precaution"], "max_nodes": 2},
    return_prompt=True,
)

print(reasoning.node_list)
print(reasoning.used_prompt)
```

Example: `answer_from_context(...)` with `prompt_extra`

```python
from navexa import answer_from_context

answer_template = """
Answer only from context.
Question: {query}
Context: {context}
Policy: {extra_json}
"""

answer = answer_from_context(
    query="What are key warnings?",
    context_text="Warnings: liver toxicity; monitor ALT/AST.",
    model_config={"provider": "openai", "model": "gpt-4.1-mini", "api_key": "..."},
    prompt_template=answer_template,
    prompt_extra={"style": "bullet", "max_points": 3},
    return_prompt=True,
)

print(answer.answer)
print(answer.used_prompt)
```

Example: `run_reasoning_rag(...)` with separate extras

```python
from navexa import run_reasoning_rag

rag = run_reasoning_rag(
    query="What are key warnings?",
    tree_or_source=tree,
    model_config={"provider": "openai", "model": "gpt-4.1-mini", "api_key": "..."},
    tree_prompt_template=custom_tree_prompt,
    answer_prompt_template=answer_template,
    tree_prompt_extra={"max_nodes": 2},
    answer_prompt_extra={"style": "short"},
    return_prompt=True,
)

print(rag.reasoning.used_prompt)
print(rag.answer.used_prompt)
```

## End-to-End Example (Index + RAG)

```python
from navexa import (
    index_structured_document_tree,
    save_document_tree,
    fetch_document_tree,
    run_reasoning_rag,
    print_reasoning_trace,
)

pdf_path = "/path/to/file.pdf"
query = "What are the key warnings?"

index_result = index_structured_document_tree(
    pdf_path=pdf_path,
    mode="llm",
    model_config={
        "provider": "openai",
        "model": "gpt-4.1-mini",
        "api_key": "...",
    },
    verbosity="high",
)

save_result = save_document_tree(
    index_result=index_result,
    out_dir="/path/to/output",
    write_tree=True,
    write_validation=True,
    write_compat=False,
)

tree = fetch_document_tree(save_result.out_dir)

rag = run_reasoning_rag(
    query=query,
    tree_or_source=tree,
    model_config={
        "provider": "openai",
        "model": "gpt-4.1-mini",
        "api_key": "...",
    },
    return_prompt=True,
    verbosity="high",
)

print_reasoning_trace(rag.reasoning, rag.node_index)
print("\nAnswer:\n", rag.answer.answer)
print("\nCost delta:\n", rag.cost_delta)
```

## CLI Usage

After installation, the CLI entry point is `navexa-index`:

```bash
navexa-index --pdf /path/file.pdf --out-dir /path/out
```

If `navexa-index` is "command not found":
- If you installed into a virtualenv, activate it first: `source <venv>/bin/activate`
- If you installed with `pip install --user`, the script is usually at `~/.local/bin/navexa-index`. Add it to PATH once:

```bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

All CLI flags:

| Flag | Allowed Values | Default |
|---|---|---|
| `--pdf` | PDF path | required |
| `--out-dir` | output dir | required |
| `--model-config` | JSON object or JSON file path | `None` |
| `--model` | model/deployment string | deprecated |
| `--mode` | `llm`, `no-llm` | `NAVEXA_MODE` or `no-llm` |
| `--document-type` | `structured`, `semi_structured`, `unstructured`, `transcript` | `structured` |
| `--parser-config` | JSON object or JSON file path | `None` |
| `--parser-model` | `docling` | `docling` |
| `--output-format` | `markdown`, `text` | `markdown` |
| `--docling-profile` | `balanced`, `image_manual`, `fast_text` | `balanced` |
| `--docling-ocr` | `0`, `1` | profile/env resolved |
| `--docling-force-full-page-ocr` | `0`, `1` | profile/env resolved |
| `--docling-table-structure` | `0`, `1` | profile/env resolved |
| `--docling-picture-description` | `0`, `1` | profile/env resolved |
| `--docling-backend` | string (e.g. `torch`, `onnxruntime`) | profile/env resolved |
| `--docling-image-mode` | `placeholder`, `embedded`, `referenced`, `none` | profile/env resolved |
| `--docling-remote-services` | `0`, `1` | profile/env resolved |
| `--docling-quiet` | `0`, `1` | profile/env resolved |
| `--verbose` | `1`,`2`,`3`,`low`,`medium`,`high` | `medium` |
| `--max-token-num-each-node` | int | `12000` |
| `--max-page-num-each-node` | int | `8` |
| `--if-add-node-summary` | `yes`, `no` | `yes` |
| `--with-validation` | switch | off |
| `--with-compat` | switch | off |

Preferred CLI model and parser setup:

- use `--model-config` for grouped provider/model credentials and routing
- use legacy `--model` only as a compatibility override
- if both `--model-config` and `--model` are passed, legacy `--model` wins for compatibility
- use `--parser-config` for the grouped parser settings
- use the old flat parser flags only as compatibility shorthands
- if `--parser-config` is passed together with old parser flags, `--parser-config` wins and the old parser flags are ignored
- if `--document-type transcript` is used, parser configuration is ignored
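
Because `--model-config` and `--parser-config` accept either a JSON object or a JSON file path, a loader for such a value might look like the sketch below. The helper `load_json_arg` is hypothetical, and telling the two forms apart by a leading `{` is an assumption about one reasonable approach, not a description of the actual CLI.

```python
import json
from pathlib import Path


def load_json_arg(value: str) -> dict:
    """Parse an inline JSON object, or read one from a file path (illustrative)."""
    text = value.strip()
    if not text.startswith("{"):
        # Assumption: anything that does not look like an inline object is a file path.
        text = Path(text).read_text(encoding="utf-8")
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


parser_config = load_json_arg('{"name": "docling", "options": {"profile": "balanced"}}')
print(parser_config["options"]["profile"])
```

Keeping the settings in a file avoids shell-quoting problems with long inline JSON and keeps API keys out of shell history.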

Example with grouped model + parser config:

```bash
navexa-index \
  --pdf /path/file.pdf \
  --out-dir /path/out \
  --mode llm \
  --document-type structured \
  --model-config '{"provider":"azure","deployment_name":"my-deploy","deployment_raw_name":"gpt-4.1-mini","api_key":"...","base_url":"https://<resource>.openai.azure.com"}' \
  --parser-config '{"name":"docling","output_format":"markdown","options":{"profile":"balanced","backend":"torch","quiet":true}}' \
  --verbose high \
  --if-add-node-summary yes \
  --with-validation
```

Example with grouped parser config:

```bash
navexa-index \
  --pdf /path/file.pdf \
  --out-dir /path/out \
  --mode llm \
  --document-type structured \
  --parser-config '{"name":"docling","output_format":"markdown","options":{"profile":"balanced","backend":"torch","quiet":true}}' \
  --verbose high \
  --if-add-node-summary yes \
  --with-validation
```

Equivalent legacy shorthand example:

```bash
navexa-index \
  --pdf /path/file.pdf \
  --out-dir /path/out \
  --mode llm \
  --document-type structured \
  --output-format markdown \
  --verbose high \
  --if-add-node-summary yes \
  --with-validation
```

## Output Files

Generated files:
- `tree_navexa.json` (canonical)
- `validation_report.json` (optional)
- `tree_legacy_compat.json` (optional)

`tree_navexa.json` top-level keys include:
- `doc_id`, `doc_name`, `pages`, `pipeline_version`
- `source` (path/hash/page count)
- `cost` (calls/tokens/estimated cost)
- `pipeline` (mode, document type, parser settings, node limits, steps)
- `structure` (hierarchical nodes with `children`)
- `transcript` (present for transcript flow)

Useful `pipeline` fields:
- `model_config`: resolved safe model/provider metadata
- `parser_config`: resolved parser configuration
- `warnings`: recoverable parser/runtime warnings
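
A quick way to review these fields after a run is to summarize the `pipeline` block. The sketch below works on a small hand-written sample; the inner keys of `model_config` and `parser_config` and the empty `warnings` list are illustrative assumptions, not real output.

```python
import json

# Hypothetical, hand-written sample; a real tree_navexa.json has many more keys.
sample_tree = json.loads("""
{
  "pipeline": {
    "model_config": {"provider": "openai", "model": "gpt-4.1-mini"},
    "parser_config": {"name": "docling", "output_format": "markdown"},
    "warnings": []
  }
}
""")


def summarize_pipeline(tree: dict) -> dict:
    """Collect a few pipeline fields, tolerating missing keys (illustrative)."""
    pipeline = tree.get("pipeline", {})
    return {
        "provider": pipeline.get("model_config", {}).get("provider"),
        "parser": pipeline.get("parser_config", {}).get("name"),
        "warning_count": len(pipeline.get("warnings", [])),
    }


print(summarize_pipeline(sample_tree))
```

For a real run, load the file with `json.load` or `fetch_document_tree(...)` and pass the resulting dict instead of the sample.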

## Logging and Cost

Verbosity:
- `low`: compact result logs
- `medium`: stage-by-stage logs
- `high`: debug-level details and usage deltas

Cost fields are available in:
- `tree_navexa["cost"]`
- `IndexResult.meta["cost"]`
- `RAGResult.cost_delta` for reasoning runs

## Prompt Overrides and Custom LLM

Detailed docs:
- `docs/README.md`
- `docs/custom_llm_integration.md`
- `docs/prompt_templates.md`

You can override prompts:
- `semi_heading_prompt_template`
- `transcript_topic_prompt_template`
- `tree_prompt_template`
- `answer_prompt_template`

Use a provider-backed `BaseExternalLLM` adapter with `llm_callable=...`.

Behavior:
- if `llm_callable` is provided, Navexa uses your external `BaseExternalLLM` adapter
- if `llm_callable` is not provided, Navexa uses its internal provider/env client

## Failure Behavior and Troubleshooting

Common cases:
- Missing API key in LLM-required flows (`transcript`) -> raises `RuntimeError`
- `mode="llm"` with missing credentials (any document type) -> raises `RuntimeError`
- Missing API key in optional LLM flow with `mode` unset (`structured` / `semi_structured` / `unstructured`) -> runs in effective `no-llm`
- Invalid `parser_config["name"]` or legacy `parser_model` -> `ValueError` (currently only `docling`)
- Invalid `parser_config["output_format"]` or legacy `output_format` -> `ValueError` (`markdown` or `text`)
- Empty/invalid tree input to reasoning APIs -> `ValueError`
- Recoverable Docling OCR/parser issues -> Navexa logs a one-line warning and stores it in `tree_navexa.pipeline.warnings`
- Major Docling failure with no usable extracted content -> Navexa raises `RuntimeError` with retry guidance instead of showing a raw third-party traceback

Notebook note:
- `asyncio.run()` loop conflicts are handled in reasoning functions with thread fallback.
- If you still run into notebook event-loop issues, restart the kernel and re-run the imports.


## Acknowledgment

Navexa is an independent implementation. Some architecture patterns and selected adapted code paths were informed by PageIndex.

Attribution files:
- `THIRD_PARTY_NOTICES.md`
- `LICENSE`

## License

This project is licensed under the MIT License.

Practical compliance checklist:
1. Keep this repository's `LICENSE` file.
2. Keep third-party attribution notices.
3. Preserve upstream MIT notice for copied/substantially adapted portions.

This documentation is technical guidance, not legal advice.
