Metadata-Version: 2.4
Name: onprem
Version: 0.22.1
Summary: A tool for running on-premises large language models on non-public data
Author-email: "Arun S. Maiya" <arun@maiya.net>
License: Apache-2.0
Project-URL: Repository, https://github.com/amaiya/onprem
Project-URL: Documentation, https://amaiya.github.io/onprem
Keywords: nbdev,jupyter,notebook,python
Classifier: Natural Language :: English
Classifier: Intended Audience :: Developers
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: unstructured[all-docs]
Requires-Dist: nltk>=3.9.1
Requires-Dist: PyMuPDF
Requires-Dist: pymupdf4llm==0.0.17
Requires-Dist: extract-msg
Requires-Dist: tabulate
Requires-Dist: pandoc
Requires-Dist: pypandoc
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: syntok
Requires-Dist: pandas
Requires-Dist: sentence_transformers
Requires-Dist: cmake
Requires-Dist: setfit
Requires-Dist: langchain>=0.3.18
Requires-Dist: langchain-community>=0.3.18
Requires-Dist: langchain_litellm==0.1.4
Requires-Dist: litellm<=1.82.6
Requires-Dist: langchain-openai
Requires-Dist: langchain-huggingface
Requires-Dist: huggingface_hub
Requires-Dist: transformers<5
Requires-Dist: accelerate
Requires-Dist: langdetect
Requires-Dist: charset_normalizer
Requires-Dist: python-magic
Requires-Dist: whoosh-reloaded
Requires-Dist: pyparsing
Requires-Dist: openpyxl
Requires-Dist: streamlit
Requires-Dist: gmft
Provides-Extra: dev
Requires-Dist: nbdev; extra == "dev"
Provides-Extra: chroma
Requires-Dist: langchain_chroma; extra == "chroma"
Requires-Dist: chromadb; extra == "chroma"
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch; extra == "elasticsearch"
Provides-Extra: sharepoint
Requires-Dist: requests_ntlm; extra == "sharepoint"
Provides-Extra: explain
Requires-Dist: shap; extra == "explain"
Requires-Dist: matplotlib; extra == "explain"
Provides-Extra: chinese
Requires-Dist: jieba; extra == "chinese"
Provides-Extra: aws
Requires-Dist: boto3; extra == "aws"
Provides-Extra: agent
Requires-Dist: patchpal; extra == "agent"
Provides-Extra: guidance
Requires-Dist: guidance>=0.1.5; extra == "guidance"
Provides-Extra: all
Requires-Dist: langchain_chroma; extra == "all"
Requires-Dist: chromadb; extra == "all"
Requires-Dist: elasticsearch; extra == "all"
Requires-Dist: requests_ntlm; extra == "all"
Requires-Dist: shap; extra == "all"
Requires-Dist: matplotlib; extra == "all"
Requires-Dist: jieba; extra == "all"
Requires-Dist: boto3; extra == "all"
Requires-Dist: patchpal; extra == "all"
Requires-Dist: guidance>=0.1.5; extra == "all"
Dynamic: license-file

# OnPrem.LLM


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> A privacy-conscious toolkit for document intelligence — local by
> default, cloud-capable

**[OnPrem.LLM](https://github.com/amaiya/onprem)** (or “OnPrem” for
short) is a Python-based toolkit for applying large language models
(LLMs) to sensitive, non-public data in offline or restricted
environments. Inspired largely by the
[privateGPT](https://github.com/imartinez/privateGPT) project,
**OnPrem.LLM** is designed for fully local execution, but also supports
integration with a wide range of cloud LLM providers (e.g., OpenAI,
Anthropic).

**Key Features:**

- Fully local execution with the option to leverage cloud models as
  needed. See [the cheat sheet](https://amaiya.github.io/onprem/#cheat-sheet).
- Analysis pipelines for [many different
  tasks](https://amaiya.github.io/onprem/#examples), including
  information extraction, summarization, classification,
  question-answering, and agents.
- Support for environments with modest computational resources through
  modules like the
  [SparseStore](https://amaiya.github.io/onprem/examples_rag.html#advanced-example-nsf-awards)
  (e.g., RAG without having to store embeddings in advance).
- Easily integrate with existing tools in your local environment like
  [Elasticsearch and
  Sharepoint](https://amaiya.github.io/onprem/examples_vectorstore_factory.html).
- A [visual workflow
  builder](https://amaiya.github.io/onprem/workflows.html#visual-workflow-builder)
  to assemble complex document analysis pipelines with a point-and-click
  interface.

The full documentation is [here](https://amaiya.github.io/onprem/).

<!--A Google Colab demo of installing and using **OnPrem.LLM** is [here](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing).
-->

**Quick Start**

``` python
# install
!pip install onprem[chroma]
from onprem import LLM, utils

# local LLM with Ollama as backend
!ollama pull llama3.2
llm = LLM('ollama/llama3.2')

# basic prompting
result = llm.prompt('Give me a short one sentence definition of an LLM.')

# RAG
utils.download('https://www.arxiv.org/pdf/2505.07672', '/tmp/my_documents/paper.pdf')
llm.ingest('/tmp/my_documents')
result = llm.ask('What is OnPrem.LLM?')

# switch to cloud LLM using Anthropic as backend
llm = LLM("anthropic/claude-3-7-sonnet-latest")

# structured outputs
from pydantic import BaseModel, Field
class MeasuredQuantity(BaseModel):
    value: str = Field(description="numerical value")
    unit: str = Field(description="unit of measurement")
structured_output = llm.pydantic_prompt('He was going 35 mph.', pydantic_model=MeasuredQuantity)
print(structured_output.value) # 35
print(structured_output.unit)  # mph

# Safely launch a sandboxed AI agent
from onprem.pipelines import AgentExecutor
executor = AgentExecutor(model='openai/gpt-5-mini', sandbox=True)
result = executor.run("""
Search this directory for all .md files and:
1. Extract all headings (# ## ###)
2. Count total words in each file
3. Create an index file 'documentation_index.md' with:
   - List of all markdown files
   - Word count for each
   - Main topics covered (from headings)
""")
```

Many LLM backends are supported (e.g.,
[llama_cpp](https://github.com/abetlen/llama-cpp-python),
[transformers](https://github.com/huggingface/transformers),
[Ollama](https://ollama.com/),
[vLLM](https://github.com/vllm-project/vllm),
[OpenAI](https://platform.openai.com/docs/models),
[Anthropic](https://docs.anthropic.com/en/docs/about-claude/models/overview),
etc.).

------------------------------------------------------------------------

<center>
<p align="center">
<img src="https://raw.githubusercontent.com/amaiya/onprem/refs/heads/master/images/onprem.png" border="0" alt="onprem.llm" width="200"/>
</p>
</center>
<center>
<p align="center">

**[Install](https://amaiya.github.io/onprem/#install) \|
[Usage](https://amaiya.github.io/onprem/#how-to-use) \| [Web
UI](https://amaiya.github.io/onprem/webapp.html) \|
[Examples](https://amaiya.github.io/onprem/#examples) \|
[FAQ](https://amaiya.github.io/onprem/#faq) \| [How to
Cite](https://amaiya.github.io/onprem/#how-to-cite)**

</p>
</center>

*Latest News* 🔥

- \[2026/03\] v0.22.0 released and now includes the **AgentExecutor**:
  safely launch AI agents in a sandboxed environment to solve problems
  in two lines of code. See [the example notebook on
  agents](https://amaiya.github.io/onprem/examples_agent.html).
- \[2026/01\] v0.21.0 released and now includes support for
  **metadata-based query routing**. See the [query routing example
  here](https://amaiya.github.io/onprem/pipelines.rag.html#example-using-query-routing-with-rag).
  Also included in this release: [provider-implemented structured
  outputs](https://amaiya.github.io/onprem/#natively-supported-structured-outputs)
  (e.g., structured outputs with OpenAI, Anthropic, and AWS GovCloud
  Bedrock).
- \[2025/12\] v0.20.0 released and now includes support for
  **asynchronous prompts**. See [the
  example](https://amaiya.github.io/onprem/examples.html#asynchronous-prompts).
- \[2025/09\] v0.19.0 released and now includes support for
  **workflows**: YAML-configured pipelines for complex document
  analyses. See [the workflow
  documentation](https://amaiya.github.io/onprem/workflows.html) for
  more information.
- \[2025/08\] v0.18.0 released and can now be used with AWS GovCloud
  LLMs. See [this
  example](https://amaiya.github.io/onprem/llm.backends.html#examples)
  for more information.
- \[2025/07\] v0.17.0 released and now allows you to connect directly to
  SharePoint for search and RAG. See the [example notebook on vector
  stores](https://amaiya.github.io/onprem/examples_vectorstore_factory.html#rag-with-sharepoint-documents)
  for more information.

------------------------------------------------------------------------

## Install

Once you have [installed
PyTorch](https://pytorch.org/get-started/locally/), you can install
**OnPrem.LLM** with:

``` sh
pip install onprem
```

**Chroma**: If using RAG with the default Chroma “Dense” vectorstore
(instead of the [sparse
vectorstore](https://amaiya.github.io/onprem/#step-1-ingest-the-documents-into-a-vector-database)),
run `pip install onprem[chroma]`.

**AI Agents**: If using OnPrem.LLM to launch [AI
agents](https://amaiya.github.io/onprem/examples_agent.html), run
`pip install onprem[agent]`.
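
To see whether the packages behind an optional extra are already present, a small check along these lines can help. This is a minimal sketch: the extra-to-distribution mapping is copied from this package's metadata, while `EXTRAS`, `is_installed`, and `missing_for_extra` are hypothetical helper names, not part of OnPrem.LLM.

```python
from importlib import metadata

# Extras and their distributions, as listed in the package metadata.
EXTRAS = {
    "chroma": ["langchain_chroma", "chromadb"],
    "elasticsearch": ["elasticsearch"],
    "sharepoint": ["requests_ntlm"],
    "explain": ["shap", "matplotlib"],
    "chinese": ["jieba"],
    "aws": ["boto3"],
    "agent": ["patchpal"],
    "guidance": ["guidance"],
}


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_for_extra(extra: str) -> list:
    """List the distributions of an extra that are not installed."""
    return [name for name in EXTRAS[extra] if not is_installed(name)]


print(missing_for_extra("chroma"))
```

If the list printed for `chroma` is non-empty, running `pip install onprem[chroma]` should pull in the missing pieces.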

**Llama-cpp-python is optional:**

If using llama-cpp-python as the LLM backend:

- **CPU:** `pip install llama-cpp-python` ([extra
  steps](https://github.com/amaiya/onprem/blob/master/MSWindows.md)
  required for Microsoft Windows)
- **GPU**: Follow [instructions
  below](https://amaiya.github.io/onprem/#on-gpu-accelerated-inference-with-llama-cpp-python).

Installing llama-cpp-python is *optional* if any of the following is
true:

- You are using [Ollama](https://ollama.com/) as the LLM backend.
- You use Hugging Face Transformers (instead of llama-cpp-python) as the
  LLM backend by supplying the `model_id` parameter when instantiating
  an LLM, as [shown
  here](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp).
- You are using **OnPrem.LLM** with an LLM being served through an
  [external REST API](https://amaiya.github.io/onprem/#cheat-sheet)
  (e.g., vLLM, OpenLLM).
- You are using **OnPrem.LLM** with a [cloud
  LLM](https://amaiya.github.io/onprem/#cheat-sheet) (see cheat sheet
  below).

### On GPU-Accelerated Inference With `llama-cpp-python`

When installing **llama-cpp-python** with
`pip install llama-cpp-python`, the LLM will run on your **CPU**. To
generate answers much faster, you can run the LLM on your **GPU** by
building **llama-cpp-python** based on your operating system.

- **Linux**:
  `CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`
- **Mac**: `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python`
- **Windows 11**: Follow the instructions
  [here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-the-system-python-in-windows-11s).
- **Windows Subsystem for Linux (WSL2)**: Follow the instructions
  [here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-wsl2-with-gpu-acceleration).

For Linux and Windows, you will need [an up-to-date NVIDIA
driver](https://www.nvidia.com/en-us/drivers/) along with the [CUDA
toolkit](https://developer.nvidia.com/cuda-downloads) installed before
running the installation commands above.
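
Before building, it can be worth a quick check that the NVIDIA tools are visible from your environment. The sketch below only looks for the `nvidia-smi` executable (installed with the driver) and the `nvcc` executable (installed with the CUDA toolkit) on `PATH`. The function name is hypothetical, and finding both is a hint rather than proof that the installed versions are compatible.

```python
import shutil


def cuda_tools_on_path() -> dict:
    """Report whether the NVIDIA driver and CUDA toolkit executables are on PATH."""
    return {tool: shutil.which(tool) is not None for tool in ("nvidia-smi", "nvcc")}


for tool, found in cuda_tools_on_path().items():
    print(f"{tool}: {'found' if found else 'not found'}")
```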

After following the instructions above, supply the `n_gpu_layers=-1`
parameter when instantiating an LLM to use your GPU for fast inference:

``` python
llm = LLM(n_gpu_layers=-1, ...)
```

Quantized models with 8B parameters and below can typically run on GPUs
with as little as 6GB of VRAM. If a model does not fit on your GPU
(e.g., you get a “CUDA Error: Out-of-Memory” error), you can offload a
subset of layers to the GPU by experimenting with different values for
the `n_gpu_layers` parameter (e.g., `n_gpu_layers=20`). Setting
`n_gpu_layers=-1`, as shown above, offloads all layers to the GPU.
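
The trial-and-error described above can be automated. The following is a minimal sketch, not part of OnPrem.LLM: `find_gpu_layers` and its `try_load` callback are hypothetical names, and it assumes that a failed load surfaces as a Python exception. Some out-of-memory failures may instead terminate the process, in which case each attempt would need to run in a subprocess.

```python
from typing import Callable, Optional


def find_gpu_layers(try_load: Callable[[int], object],
                    total_layers: int,
                    step: int = 4) -> Optional[int]:
    """Try progressively fewer GPU layers and return the first count that loads.

    try_load(n) should attempt to load the model with n_gpu_layers=n and
    raise an exception on failure. Returns None if no tested value works.
    """
    for n in range(total_layers, 0, -step):
        try:
            try_load(n)
            return n
        except Exception:
            # Assumed to mean "did not fit"; try a smaller offload.
            continue
    return None


# Stand-in loader for illustration: pretends only 20 layers fit in VRAM.
def fake_loader(n_gpu_layers: int) -> None:
    if n_gpu_layers > 20:
        raise RuntimeError("CUDA Error: Out-of-Memory (simulated)")


print(find_gpu_layers(fake_loader, total_layers=32))  # 20 with this simulated loader
```

With a real model, `try_load` could wrap a call such as `LLM(default_model='llama', n_gpu_layers=n)`.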

See [the FAQ](https://amaiya.github.io/onprem/#faq) for extra tips, if
you experience issues with
[llama-cpp-python](https://pypi.org/project/llama-cpp-python/)
installation.

## How to Use

### Setup

``` python
from onprem import LLM

llm = LLM(verbose=False) # default model and backend are used
```

#### Cheat Sheet

*Local Models:* A number of different local LLM backends are supported.

- **Llama-cpp**: `llm = LLM(default_model="llama", n_gpu_layers=-1)`

- **Llama-cpp with selected GGUF model via URL**:

  ``` python
   # prompt templates are required for user-supplied GGUF models (see FAQ)
   llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf', 
             prompt_template= "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>", n_gpu_layers=-1)
  ```

- **Llama-cpp with selected GGUF model via file path**:

  ``` python
   # prompt templates are required for user-supplied GGUF models (see FAQ)
   llm = LLM(model_url='zephyr-7b-beta.Q4_K_M.gguf', 
             model_download_path='/path/to/folder/to/where/you/downloaded/model',
             prompt_template= "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>", n_gpu_layers=-1)
  ```

- **Hugging Face Transformers**:
  `llm = LLM(model_id='Qwen/Qwen2.5-0.5B-Instruct', device='cuda')`

- **Ollama**: `llm = LLM(model_url="ollama://llama3.2", api_key='na')`

- **Also Ollama**:
  `llm = LLM(model_url="ollama/llama3.2", api_key='na')`

- **Also Ollama**:
  `llm = LLM(model_url='http://localhost:11434/v1', api_key='na', model='llama3.2')`

- **vLLM**:
  `llm = LLM(model_url='http://localhost:8666/v1', api_key='na', model='Qwen/Qwen2.5-0.5B-Instruct')`

- **Also vLLM**:
  `llm = LLM('hosted_vllm/served-model-name', api_base="http://localhost:8666/v1", api_key="test123")`
  (assumes `served-model-name` parameter is supplied to
  `vllm.entrypoints.openai.api_server`).

- **vLLM with gpt-oss** (assumes `served-model-name` parameter is
  supplied to vLLM):

  ``` python
  # important: set max_tokens to high value due to intermediate reasoning steps that are generated
  llm = LLM(model_url='http://localhost:8666/v1', api_key='your_api_key', model=served_model_name, max_tokens=32000)
  result = llm.prompt(prompt, reasoning_effort="high")
  ```

*Cloud Models:* In addition to local LLMs, all cloud LLM providers
supported by [LiteLLM](https://github.com/BerriAI/litellm) are
compatible:

- **Anthropic Claude**:
  `llm = LLM(model_url="anthropic/claude-3-7-sonnet-latest")`

- **OpenAI GPT-4o**: `llm = LLM(model_url="openai/gpt-4o")`

- **AWS GovCloud Bedrock** (assumes AWS_ACCESS_KEY_ID and
  AWS_SECRET_ACCESS_KEY are set as environment variables)

  ``` python
  from onprem import LLM
  inference_arn = "YOUR INFERENCE ARN"
  endpoint_url = "YOUR ENDPOINT URL"
  region_name = "us-gov-east-1" # replace as necessary
  # set up LLM connection to Bedrock on AWS GovCloud
  llm = LLM( f"govcloud-bedrock://{inference_arn}", region_name=region_name, endpoint_url=endpoint_url)
  response = llm.prompt("Write a haiku about the moon.")
  ```

The instantiations above are described in more detail below.

#### GGUF Models and Llama.cpp

The default LLM backend is
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), and the
default model is currently a 7B-parameter model called
**Zephyr-7B-beta**, which is automatically downloaded and used.
Llama.cpp runs models in [GGUF](https://huggingface.co/docs/hub/en/gguf)
format. The two other default models are `llama` and `mistral`. For
instance, if `default_model='llama'` is supplied, then a
**Llama-3.1-8B-Instruct** model is automatically downloaded and used:

``` python
# Llama 3.1 is downloaded here and the correct prompt template for Llama-3.1 is automatically configured and used
llm = LLM(default_model='llama')
```

*Choosing Your Own Models:* Of course, you can also easily supply the
URL or path to an LLM of your choosing to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) (see the
[FAQ](https://amaiya.github.io/onprem/#faq) for an example).

*Supplying Extra Parameters:* Any extra parameters supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) are forwarded
directly to
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), the
default LLM backend.

#### Changing the Default LLM Backend

If `default_engine="transformers"` is supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm), Hugging Face
[transformers](https://github.com/huggingface/transformers) is used as
the LLM backend. Extra parameters to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) (e.g.,
`device='cuda'`) are forwarded directly to `transformers.pipeline`. If
supplying a `model_id` parameter, the default LLM backend is
automatically changed to Hugging Face
[transformers](https://github.com/huggingface/transformers).

``` python
# Llama-3.1 model quantized using AWQ is downloaded and run with Hugging Face transformers (requires GPU)
llm = LLM(default_model='llama', default_engine='transformers')

# Using a custom model with Hugging Face Transformers
llm = LLM(model_id='Qwen/Qwen2.5-0.5B-Instruct', device_map='cpu')
```

See
[here](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp)
for more information about using Hugging Face
[transformers](https://github.com/huggingface/transformers) as the LLM
backend.

You can also connect to **Ollama**, local LLM APIs (e.g., vLLM), and
cloud LLMs.

``` python
# connecting to an LLM served by Ollama
llm = LLM(model_url='ollama/llama3.2')

# connecting to an LLM served through vLLM (set API key as needed)
llm = LLM(model_url='http://localhost:8000/v1', api_key='token-abc123', model='Qwen/Qwen2.5-0.5B-Instruct')

# connecting to a cloud-backed LLM (e.g., OpenAI, Anthropic).
llm = LLM(model_url="openai/gpt-4o-mini")  # OpenAI
llm = LLM(model_url="anthropic/claude-3-7-sonnet-20250219") # Anthropic
```

**OnPrem.LLM** supports any provider and model supported by the
[LiteLLM](https://github.com/BerriAI/litellm) package.

See
[here](https://amaiya.github.io/onprem/#connecting-to-llms-served-through-rest-apis)
for more information on *local* LLM APIs.

More information on using OpenAI models specifically with **OnPrem.LLM**
is [here](https://amaiya.github.io/onprem/examples_openai.html).

#### Supplying Parameters to the LLM Backend

Extra parameters supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) and
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt)
are passed directly to the LLM backend. Parameter names vary
depending on the backend you choose.

For instance, with the default llama-cpp backend, the default context
window size (`n_ctx`) is set to 3900 and the default output size
(`max_tokens`) is set to 512. Both are configurable parameters to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm). Increase them if
you have larger prompts or need longer outputs. Other parameters (e.g.,
`api_key`, `device_map`, etc.) can be supplied directly to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) and will be
routed to the LLM backend or API (e.g., llama-cpp-python, Hugging Face
transformers, vLLM, OpenAI, etc.). The `max_tokens` parameter can also
be adjusted on-the-fly by supplying it to
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt).

On the other hand, for Ollama models, context window and output size are
controlled by `num_ctx` and `num_predict`, respectively.

With Hugging Face transformers, setting the context window size is
not needed, but the output size is controlled by the `max_new_tokens`
parameter to
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt).
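
Since the names differ by backend, it can help to keep the mapping in one place. The sketch below restates the parameter names from this section in a lookup table. `SIZE_PARAMS` and `size_kwargs` are hypothetical helpers, not part of the OnPrem.LLM API, and the backend labels are just dictionary keys chosen for this example.

```python
from typing import Optional

# Parameter names as described in this section; None means "not needed".
SIZE_PARAMS = {
    "llama_cpp": {"context": "n_ctx", "output": "max_tokens"},
    "ollama": {"context": "num_ctx", "output": "num_predict"},
    "transformers": {"context": None, "output": "max_new_tokens"},
}


def size_kwargs(backend: str,
                context_size: Optional[int] = None,
                output_size: Optional[int] = None) -> dict:
    """Translate generic size settings into backend-specific keyword arguments."""
    names = SIZE_PARAMS[backend]
    kwargs = {}
    if context_size is not None and names["context"] is not None:
        kwargs[names["context"]] = context_size
    if output_size is not None:
        kwargs[names["output"]] = output_size
    return kwargs


print(size_kwargs("llama_cpp", context_size=8192, output_size=1024))
print(size_kwargs("ollama", context_size=8192, output_size=1024))
print(size_kwargs("transformers", context_size=8192, output_size=1024))
```

The resulting dictionary can then be unpacked into `LLM(...)` or `LLM.prompt(...)`, whichever the backend expects (for transformers, `max_new_tokens` goes to `LLM.prompt`).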

#### Using Hugging Face Transformers Instead of Llama.cpp

By default, the LLM backend employed by **OnPrem.LLM** is
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), which
requires models in [GGUF format](https://huggingface.co/docs/hub/gguf).
As of v0.5.0, it is now possible to use [Hugging Face
transformers](https://github.com/huggingface/transformers) as the LLM
backend instead. This is accomplished by using the `model_id` parameter
(instead of supplying a `model_url` argument). In the example below, we
run the
[Llama-3.1-8B](https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
model.

``` python
# llama-cpp-python does NOT need to be installed when using model_id parameter
llm = LLM(model_id="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", device_map='cuda')
```

This allows you to more easily use any model on the Hugging Face hub in
[SafeTensors format](https://huggingface.co/docs/safetensors/index)
provided it can be loaded with the Hugging Face `transformers.pipeline`.
Note that, when using the `model_id` parameter, the `prompt_template` is
set automatically by `transformers`.

The Llama-3.1 model loaded above was quantized using
[AWQ](https://huggingface.co/docs/transformers/main/en/quantization/awq),
which allows the model to fit onto smaller GPUs (e.g., laptop GPUs with
6GB of VRAM) similar to the default GGUF format. AWQ models require
the [autoawq](https://pypi.org/project/autoawq/) package to be
installed: `pip install autoawq` (AWQ only supports Linux systems,
including Windows Subsystem for Linux). If you do need to load a model
that is not quantized, you can supply a quantization configuration at
load time (known as “inflight quantization”). In the following example,
we load an unquantized [Zephyr-7B-beta
model](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) that will be
quantized during loading to fit on GPUs with as little as 6GB of VRAM:

``` python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
llm = LLM(model_id="HuggingFaceH4/zephyr-7b-beta", device_map='cuda', 
          model_kwargs={"quantization_config":quantization_config})
```

Supplying a `quantization_config` uses the
[bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/installation)
library, a lightweight Python wrapper around custom CUDA functions (in
particular, 8-bit optimizers, LLM.int8() matrix multiplication, and
8-bit and 4-bit quantization functions). There are ongoing efforts by the
bitsandbytes team to support multiple backends in addition to CUDA. If
you receive errors related to bitsandbytes, please refer to the
[bitsandbytes
documentation](https://huggingface.co/docs/bitsandbytes/main/en/installation).
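
A back-of-the-envelope calculation shows why 4-bit quantization matters for small GPUs. The sketch below counts only the memory needed for the weights (number of parameters times bits per parameter). It ignores the KV cache, activations, and quantization metadata, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

At 16 bits, the weights of a 7B-parameter model alone need about 14 GB, while at 4 bits they need about 3.5 GB, which leaves some headroom on a GPU with 6GB of VRAM.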

## Built-In Web App

**OnPrem.LLM** includes a built-in Web app to access the LLM. To start
it, run the following command after installation:

``` shell
onprem --port 8000
```

Then, enter `localhost:8000` (or `<domain_name>:8000` if running on a
remote server) in a Web browser to access the application:

<img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/onprem_welcome.png" border="1" alt="screenshot" width="775"/>

For more information, [see the corresponding
documentation](https://amaiya.github.io/onprem/webapp.html).
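
If the page does not load, a quick way to tell whether anything is listening on the port is a plain TCP connection attempt. This is a generic sketch using Python's standard library, not an OnPrem.LLM feature. The function name is hypothetical, and the host and port simply mirror the example above.

```python
import socket


def is_listening(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


print(is_listening("localhost", 8000))
```

A result of `False` suggests the Web app is not running on that port (or is blocked by a firewall), rather than a problem in the browser.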

## Examples

The [documentation](https://amaiya.github.io/onprem/) includes many
examples.

### 💡 Getting Started

| Documentation Link                                                  | Example                      |
|---------------------------------------------------------------------|------------------------------|
| [Prompting Examples](https://amaiya.github.io/onprem/examples.html) | Problem-Solving With Prompts |

### 📚 Document Processing

| Documentation Link                                                                             | Example                                           |
|------------------------------------------------------------------------------------------------|---------------------------------------------------|
| [Text Extraction](https://amaiya.github.io/onprem/examples_text_extraction.html)               | Document Text Extraction (PDFs, Word, PowerPoint) |
| [Document Summarization](https://amaiya.github.io/onprem/examples_summarization.html)          | Document Summarization                            |
| [Information Extraction](https://amaiya.github.io/onprem/examples_information_extraction.html) | Information Extraction from Documents             |

### 🧠 Question-Answering & Search

| Documentation Link                                                                          | Example                                     |
|---------------------------------------------------------------------------------------------|---------------------------------------------|
| [RAG Example](https://amaiya.github.io/onprem/examples_rag.html)                            | Question-Answering with RAG                 |
| [Vector Stores Tutorial](https://amaiya.github.io/onprem/examples_vectorstore_factory.html) | Using Different Vector Stores               |
| [Semantic Similarity](https://amaiya.github.io/onprem/examples_semantic.html)               | Computing Semantic Similarity Between Texts |

### 🎯 Classification & Analysis

| Documentation Link                                                                           | Example                                  |
|----------------------------------------------------------------------------------------------|------------------------------------------|
| [Text Classification](https://amaiya.github.io/onprem/examples_classification.html)          | Few-Shot Text Classification             |
| [Survey Analysis](https://amaiya.github.io/onprem/examples_qualitative_survey_analysis.html) | Auto-Coding Qualitative Survey Responses |
| [Legal Analysis](https://amaiya.github.io/onprem/examples_legal_analysis.html)               | Legal and Regulatory Document Analysis   |

### 🛠️ Advanced Features

| Documentation Link                                                                 | Example                                            |
|------------------------------------------------------------------------------------|----------------------------------------------------|
| [Agent Examples](https://amaiya.github.io/onprem/examples_agent.html)              | Agent-Based Task Execution with Tools              |
| [Structured Outputs](https://amaiya.github.io/onprem/examples_guided_prompts.html) | Structured and Guided Outputs with Pydantic Models |
| [Workflow Builder](https://amaiya.github.io/onprem/workflows.html)                 | Workflow Builder for Document Analysis             |

## FAQ

1.  **How do I use other models with OnPrem.LLM?**

    > You can supply any model of your choice using the `model_url` and
    > `model_id` parameters to `LLM` (see cheat sheet above).

    > Here, we will go into detail on how to supply a custom GGUF model
    > using the llama.cpp backend.

    > You can find llama.cpp-supported models with `GGUF` in the file
    > name on
    > [huggingface.co](https://huggingface.co/models?sort=trending&search=gguf).

    > Make sure you are pointing to the URL of the actual GGUF model
    > file, which is the “download” link on the model’s page. An example
    > for **Mistral-7B** is shown below:

    > <img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/model_download_link.png" border="1" alt="screenshot" width="775"/>

    > When using the llama.cpp backend, GGUF models have specific prompt
    > formats that need to be supplied to `LLM`. For instance, the prompt
    > template required for **Zephyr-7B**, as described on the [model’s
    > page](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF), is:
    >
    > `<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>`
    >
    > So, to use the **Zephyr-7B** model, you must supply the
    > `prompt_template` argument to the `LLM` constructor (or specify it
    > in the `webapp.yml` configuration for the Web app).
    >
    > ``` python
    > # how to use Zephyr-7B with OnPrem.LLM
    > llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
    >           prompt_template = "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
    >           n_gpu_layers=33)
    > llm.prompt("List three cute names for a cat.")
    > ```

    > Prompt templates are **not** required for any other LLM backend
    > (e.g., when using Ollama as backend or when using `model_id`
    > parameter for transformers models). Prompt templates are also not
    > required if using any of the default models.

2.  **When installing `onprem`, I’m getting “build” errors related to
    `llama-cpp-python` (or `chroma-hnswlib`) on Windows/Mac/Linux?**

    > See [this LangChain documentation on
    > LLama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp)
    > for help on installing the `llama-cpp-python` package for your
    > system. Additional tips for different operating systems are shown
    > below:

    > For **Linux** systems like Ubuntu, try this:
    > `sudo apt-get install build-essential g++ clang`. Other tips are
    > [here](https://github.com/oobabooga/text-generation-webui/issues/1534).

    > For **Windows** systems, please try following [these
    > instructions](https://github.com/amaiya/onprem/blob/master/MSWindows.md).
    > We recommend you use [Windows Subsystem for Linux
    > (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install)
    > instead of using Microsoft Windows directly. If you do need to use
    > Microsoft Windows directly, be sure to install the [Microsoft C++
    > Build
    > Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
    > and make sure the **Desktop development with C++** option is selected.

    > For **Macs**, try following [these
    > tips](https://github.com/imartinez/privateGPT/issues/445#issuecomment-1563333950).

    > There are also various other tips for each of the above OSes in
    > [this privateGPT repo
    > thread](https://github.com/imartinez/privateGPT/issues/445). Of
    > course, you can also [easily
    > use](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)
    > **OnPrem.LLM** on Google Colab.

    > Finally, if you still can’t overcome issues with building
    > `llama-cpp-python`, you can try [installing the pre-built wheel
    > file](https://abetlen.github.io/llama-cpp-python/whl/cpu/llama-cpp-python/)
    > for your system:

    > **Example:**
    > `pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`
    >
    > **Tip:** There are [pre-built wheel files for
    > `chroma-hnswlib`](https://pypi.org/project/chroma-hnswlib/#files),
    > as well. If running `pip install onprem` fails on building
    > `chroma-hnswlib`, it may be because a pre-built wheel doesn’t yet
    > exist for the version of Python you’re using (in which case you
    > can try downgrading Python).

3.  **I’m behind a corporate firewall and am receiving an SSL error when
    trying to download the model?**

    > Try this:
    >
    > ``` python
    > from onprem import LLM
    > LLM.download_model(url, ssl_verify=False)
    > ```

    > You can download the embedding model (used by `LLM.ingest` and
    > `LLM.ask`) as follows:
    >
    > ``` sh
    > wget --no-check-certificate https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/all-MiniLM-L6-v2.zip
    > ```

    > Supply the unzipped folder name as the `embedding_model_name`
    > argument to `LLM`.

    > If you’re getting SSL errors when even running `pip install`, try
    > this:
    >
    > ``` sh
    > pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs
    > ```

4.  **How do I use this on a machine with no internet access?**

    > Use the `LLM.download_model` method to download the model files to
    > `<your_home_directory>/onprem_data` and transfer them to the same
    > location on the air-gapped machine.

    > For the `ingest` and `ask` methods, you will need to also download
    > and transfer the embedding model files:
    >
    > ``` python
    > from sentence_transformers import SentenceTransformer
    > model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    > model.save('/some/folder')
    > ```

    > Copy the `some/folder` folder to the air-gapped machine and supply
    > the path to `LLM` via the `embedding_model_name` parameter.

5.  **My model is not loading when I call `llm = LLM(...)`?**

    > This can happen if the model file is corrupt (in which case you
    > should delete it from `<home directory>/onprem_data` and
    > re-download). It can also happen if `llama-cpp-python` needs to
    > be upgraded to the latest version.

6.  **I’m getting an `Illegal instruction (core dumped)` error when
    instantiating a `langchain.llms.Llamacpp` or `onprem.LLM` object?**

    > Your CPU may not support instructions that `cmake` is using for
    > one reason or another (e.g., [due to Hyper-V in VirtualBox
    > settings](https://stackoverflow.com/questions/65780506/how-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit)).
    > You can try turning them off when building and installing
    > `llama-cpp-python`:

    > ``` sh
    > # example
    > CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir
    > ```

7.  **How can I speed up
    [`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest)?**

    > By default, a GPU, if available, will be used to compute
    > embeddings, so ensure PyTorch is installed with GPU support. You
    > can explicitly control the device used for computing embeddings
    > with the `embedding_model_kwargs` argument.
    >
    > ``` python
    > from onprem import LLM
    > llm  = LLM(embedding_model_kwargs={'device':'cuda'})
    > ```

    > You can also supply `store_type="sparse"` to `LLM` to use a sparse
    > vector store, which sacrifices a small amount of inference speed
    > (`LLM.ask`) for significant speed ups during ingestion
    > (`LLM.ingest`).
    >
    > ``` python
    > from onprem import LLM
    > llm  = LLM(store_type="sparse")
    > ```
    >
    > Note, however, that, unlike dense vector stores, sparse vector
    > stores assume answer sources will contain at least one word in
    > common with the question.
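
The caveat about sparse vector stores can be illustrated with a toy word-overlap check. This is a simplified sketch, not how the sparse store in OnPrem.LLM is implemented: the function name is hypothetical, and a real keyword index may apply stemming, stopword removal, or other text analysis.

```python
import re


def shared_words(question: str, passage: str) -> set:
    """Return the lowercase word tokens that appear in both texts."""
    def tokenize(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokenize(question) & tokenize(passage)


question = "How do I fix my automobile?"
print(shared_words(question, "A guide to automobile maintenance"))  # {'automobile'}
print(shared_words(question, "Car repair handbook"))                # set()
```

The second passage is relevant but shares no words with the question, so keyword-style retrieval would miss it, whereas a dense (embedding-based) store could still match it.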

<!--
8. **What are ways in which OnPrem.LLM has been used?**
    > Examples include:
    > - extracting key performance parameters and other performance attributes from engineering documents
    > - auto-coding responses to government requests for information (RFIs)
    > - analyzing the Federal Aquisition Regulations (FAR)
    > - understanding where and how Executive Order 14028 on cybersecurity aligns with the National Cybersecurity Strategy
    > - generating a summary of ways to improve a course from thousdands of reviews
    > - extracting specific information of interest from resumes for talent acquisition.
-->

## How to Cite

Please cite the [following paper](https://arxiv.org/abs/2509.21040) when
using **OnPrem.LLM**:

    @article{maiya2025generativeaiffrdcs,
          title={Generative AI for FFRDCs}, 
          author={Arun S. Maiya},
          year={2025},
          eprint={2509.21040},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2509.21040}, 
    }
