Metadata-Version: 2.4
Name: onprem
Version: 0.21.0
Summary: A tool for running on-premises large language models on non-public data
Home-page: https://github.com/amaiya/onprem
Author: Arun S. Maiya
Author-email: arun@maiya.net
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: unstructured[all-docs]
Requires-Dist: nltk>=3.9.1
Requires-Dist: PyMuPDF
Requires-Dist: pymupdf4llm==0.0.17
Requires-Dist: extract-msg
Requires-Dist: tabulate
Requires-Dist: pandoc
Requires-Dist: pypandoc
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: syntok
Requires-Dist: pandas
Requires-Dist: sentence_transformers
Requires-Dist: cmake
Requires-Dist: setfit
Requires-Dist: guidance>=0.1.5
Requires-Dist: langchain>=0.3.18
Requires-Dist: langchain-community>=0.3.18
Requires-Dist: langchain_litellm==0.1.4
Requires-Dist: litellm
Requires-Dist: langchain-openai
Requires-Dist: langchain-huggingface
Requires-Dist: huggingface_hub
Requires-Dist: transformers
Requires-Dist: accelerate
Requires-Dist: langdetect
Requires-Dist: charset_normalizer
Requires-Dist: python-magic
Requires-Dist: whoosh-reloaded
Requires-Dist: pyparsing
Requires-Dist: openpyxl
Requires-Dist: streamlit
Requires-Dist: smolagents
Requires-Dist: markdownify
Requires-Dist: mcpadapt
Requires-Dist: gmft
Requires-Dist: datasets==3.6.0
Provides-Extra: dev
Requires-Dist: nbdev; extra == "dev"
Provides-Extra: chroma
Requires-Dist: chromadb; extra == "chroma"
Requires-Dist: langchain_chroma; extra == "chroma"
Provides-Extra: explain
Requires-Dist: shap; extra == "explain"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# OnPrem.LLM


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> A privacy-conscious toolkit for document intelligence — local by
> default, cloud-capable

**[OnPrem.LLM](https://github.com/amaiya/onprem)** (or “OnPrem” for
short) is a Python-based toolkit for applying large language models
(LLMs) to sensitive, non-public data in offline or restricted
environments. Inspired largely by the
[privateGPT](https://github.com/imartinez/privateGPT) project,
**OnPrem.LLM** is designed for fully local execution, but also supports
integration with a wide range of cloud LLM providers (e.g., OpenAI,
Anthropic).

**Key Features:**

- Fully local execution, with the option to leverage the cloud as
  needed. See
  [the cheatsheet](https://amaiya.github.io/onprem/#cheat-sheet).
- Analysis pipelines for [many different
  tasks](https://amaiya.github.io/onprem/#examples), including
  information extraction, summarization, classification,
  question-answering, and agents.
- Support for environments with modest computational resources through
  modules like the
  [SparseStore](https://amaiya.github.io/onprem/examples_rag.html#advanced-example-nsf-awards)
  (e.g., RAG without having to store embeddings in advance).
- Easily integrate with existing tools in your local environment like
  [Elasticsearch and
  Sharepoint](https://amaiya.github.io/onprem/examples_vectorstore_factory.html).
- A [visual workflow
  builder](https://amaiya.github.io/onprem/workflows.html#visual-workflow-builder)
  to assemble complex document analysis pipelines with a point-and-click
  interface.

The full documentation is [here](https://amaiya.github.io/onprem/).

<!--A Google Colab demo of installing and using **OnPrem.LLM** is [here](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing).
-->

**Quick Start**

``` python
# install
!pip install onprem[chroma]
from onprem import LLM, utils

# local LLM with Ollama as backend
!ollama pull llama3.2
llm = LLM('ollama/llama3.2')

# basic prompting
result = llm.prompt('Give me a short one sentence definition of an LLM.')

# RAG
utils.download('https://www.arxiv.org/pdf/2505.07672', '/tmp/my_documents/paper.pdf')
llm.ingest('/tmp/my_documents')
result = llm.ask('What is OnPrem.LLM?')

# switch to cloud LLM using Anthropic as backend
llm = LLM("anthropic/claude-3-7-sonnet-latest")

# structured outputs
from pydantic import BaseModel, Field
class MeasuredQuantity(BaseModel):
    value: str = Field(description="numerical value")
    unit: str = Field(description="unit of measurement")
structured_output = llm.pydantic_prompt('He was going 35 mph.', pydantic_model=MeasuredQuantity)
print(structured_output.value) # 35
print(structured_output.unit)  # mph
```

Many LLM backends are supported (e.g.,
[llama_cpp](https://github.com/abetlen/llama-cpp-python),
[transformers](https://github.com/huggingface/transformers),
[Ollama](https://ollama.com/),
[vLLM](https://github.com/vllm-project/vllm),
[OpenAI](https://platform.openai.com/docs/models),
[Anthropic](https://docs.anthropic.com/en/docs/about-claude/models/overview),
etc.).

------------------------------------------------------------------------

<center>
<p align="center">
<img src="https://raw.githubusercontent.com/amaiya/onprem/refs/heads/master/images/onprem.png" border="0" alt="onprem.llm" width="200"/>
</p>
</center>
<center>
<p align="center">

**[Install](https://amaiya.github.io/onprem/#install) \|
[Usage](https://amaiya.github.io/onprem/#how-to-use) \|
[Examples](https://amaiya.github.io/onprem/#examples) \| [Web
UI](https://amaiya.github.io/onprem/webapp.html) \|
[FAQ](https://amaiya.github.io/onprem/#faq) \| [How to
Cite](https://amaiya.github.io/onprem/#how-to-cite)**

</p>
</center>

*Latest News* 🔥

- \[2026/01\] v0.21.0 released and now includes support for
  **metadata-based query routing**. See the [query routing example
  here](https://amaiya.github.io/onprem/pipelines.rag.html#example-using-query-routing-with-rag).
  Also included in this release: [provider-implemented structured
  outputs](https://amaiya.github.io/onprem/#natively-supported-structured-outputs)
  (e.g., structured outputs with OpenAI, Anthropic, and AWS GovCloud
  Bedrock).
- \[2025/12\] v0.20.0 released and now includes support for
  **asynchronous prompts**. See [the
  example](https://amaiya.github.io/onprem/examples.html#asynchronous-prompts).
- \[2025/09\] v0.19.0 released and now includes support for
  **workflows**: YAML-configured pipelines for complex document
  analyses. See [the workflow
  documentation](https://amaiya.github.io/onprem/workflows.html) for
  more information.
- \[2025/08\] v0.18.0 released and can now be used with AWS GovCloud
  LLMs. See [this
  example](https://amaiya.github.io/onprem/llm.backends.html#examples)
  for more information.
- \[2025/07\] v0.17.0 released and now allows you to connect directly to
  SharePoint for search and RAG. See the [example notebook on vector
  stores](https://amaiya.github.io/onprem/examples_vectorstore_factory.html#rag-with-sharepoint-documents)
  for more information.
- \[2025/07\] v0.16.0 released and now includes out-of-the-box support
  for **Elasticsearch** as a vector store for RAG and semantic search in
  addition to other vector store backends. See the [example notebook on
  vector
  stores](https://amaiya.github.io/onprem/examples_vectorstore_factory.html)
  for more information.
- \[2025/06\] v0.15.0 released and now includes support for solving
  tasks with **agents**. See the [example notebook on
  agents](https://amaiya.github.io/onprem/examples_agent.html) for more
  information.
- \[2025/05\] v0.14.0 released and now includes a point-and-click
  interface for **Document Analysis**: applying prompts to individual
  passages in uploaded documents. See the [Web UI
  documentation](https://amaiya.github.io/onprem/webapp.html) for more
  information.
- \[2025/04\] v0.13.0 released and now includes streamlined support for
  Ollama and many cloud LLMs via special URLs (e.g.,
  `model_url="ollama://llama3.2"`,
  `model_url="anthropic://claude-3-7-sonnet-latest"`). See the [cheat
  sheet](https://amaiya.github.io/onprem/#how-to-use) for examples.
  (**Note: Please use `onprem>=0.13.1` due to a bug in v0.13.0.**)
- \[2025/04\] v0.12.0 released and now includes a revamped and improved
  Web UI with support for interactive chatting, document
  question-answering (RAG), and document search (both keyword searches
  and semantic searches). See the [Web UI
  documentation](https://amaiya.github.io/onprem/webapp.html) for more
  information.

------------------------------------------------------------------------

## Install

Once you have [installed
PyTorch](https://pytorch.org/get-started/locally/), you can install
**OnPrem.LLM** with the following steps:

1.  Install **llama-cpp-python** (*optional* - see below):
    - **CPU:** `pip install llama-cpp-python` ([extra
      steps](https://github.com/amaiya/onprem/blob/master/MSWindows.md)
      required for Microsoft Windows)
    - **GPU**: Follow [instructions
      below](https://amaiya.github.io/onprem/#on-gpu-accelerated-inference).
2.  Install **OnPrem.LLM** with Chroma packages:
    `pip install onprem[chroma]`

For RAG using only a [sparse
vectorstore](https://amaiya.github.io/onprem/#step-1-ingest-the-documents-into-a-vector-database),
you can install OnPrem.LLM without the extra chroma packages:
`pip install onprem`.

**Note:** Installing **llama-cpp-python** is *optional* if any of the
following is true:

- You are using [Ollama](https://ollama.com/) as the LLM backend.
- You use Hugging Face Transformers (instead of llama-cpp-python) as the
  LLM backend by supplying the `model_id` parameter when instantiating
  an LLM, as [shown
  here](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp).
- You are using **OnPrem.LLM** with an LLM being served through an
  [external REST API](https://amaiya.github.io/onprem/#cheat-sheet)
  (e.g., vLLM, OpenLLM).
- You are using **OnPrem.LLM** with a [cloud
  LLM](https://amaiya.github.io/onprem/#cheat-sheet) (more information
  below).
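
The conditions above can be condensed into a small check. This is a minimal sketch: the backend labels (`"llama_cpp"`, `"ollama"`, `"transformers"`, `"rest_api"`, `"cloud"`) and the helper names are illustrative assumptions and are not part of the OnPrem.LLM API.

```python
import importlib.util

# Hypothetical labels for the backends listed above (not OnPrem.LLM identifiers).
BACKENDS_WITHOUT_LLAMA_CPP = {"ollama", "transformers", "rest_api", "cloud"}


def needs_llama_cpp(backend: str) -> bool:
    """Return True if the chosen backend requires llama-cpp-python."""
    return backend not in BACKENDS_WITHOUT_LLAMA_CPP


def llama_cpp_installed() -> bool:
    """Check whether the llama_cpp module is importable, without importing it."""
    return importlib.util.find_spec("llama_cpp") is not None


for backend in ("llama_cpp", "ollama", "transformers"):
    if needs_llama_cpp(backend) and not llama_cpp_installed():
        print(f"{backend}: install llama-cpp-python first")
    else:
        print(f"{backend}: ready")
```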

### On GPU-Accelerated Inference With `llama-cpp-python`

When installing **llama-cpp-python** with
`pip install llama-cpp-python`, the LLM will run on your **CPU**. To
generate answers much faster, you can run the LLM on your **GPU** by
building **llama-cpp-python** based on your operating system.

- **Linux**:
  `CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`
- **Mac**: `CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python`
- **Windows 11**: Follow the instructions
  [here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-the-system-python-in-windows-11s).
- **Windows Subsystem for Linux (WSL2)**: Follow the instructions
  [here](https://github.com/amaiya/onprem/blob/master/MSWindows.md#using-wsl2-with-gpu-acceleration).

For Linux and Windows, you will need [an up-to-date NVIDIA
driver](https://www.nvidia.com/en-us/drivers/) along with the [CUDA
toolkit](https://developer.nvidia.com/cuda-downloads) installed before
running the installation commands above.

After following the instructions above, supply the `n_gpu_layers=-1`
parameter when instantiating an LLM to use your GPU for fast inference:

``` python
llm = LLM(n_gpu_layers=-1, ...)
```

Quantized models with 8B parameters and below can typically run on GPUs
with as little as 6GB of VRAM. If a model does not fit on your GPU
(e.g., you get a “CUDA Error: Out-of-Memory” error), you can offload a
subset of layers to the GPU by experimenting with different values for
the `n_gpu_layers` parameter (e.g., `n_gpu_layers=20`). Setting
`n_gpu_layers=-1`, as shown above, offloads all layers to the GPU.
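
The "experiment with different values" advice can be automated with a simple fallback loop. The sketch below is an illustration only: `load_with_fallback`, the candidate values, and the stand-in factory are hypothetical, and it assumes the backend reports an out-of-memory condition as a Python exception, which may not hold in every setup.

```python
from typing import Any, Callable, Iterable


def load_with_fallback(
    make_llm: Callable[[int], Any],
    candidates: Iterable[int] = (-1, 32, 24, 16, 8, 0),
) -> tuple[Any, int]:
    """Try each n_gpu_layers value in order and return the first that loads."""
    last_error: Exception | None = None
    for n_gpu_layers in candidates:
        try:
            return make_llm(n_gpu_layers), n_gpu_layers
        except Exception as err:  # in practice, catch the specific out-of-memory error
            last_error = err
    raise RuntimeError("no n_gpu_layers value worked") from last_error


# Stand-in factory (hypothetical) that pretends only 20 layers fit on the GPU.
def fake_factory(n_gpu_layers: int) -> str:
    if n_gpu_layers == -1 or n_gpu_layers > 20:
        raise MemoryError("CUDA out of memory (simulated)")
    return f"llm(n_gpu_layers={n_gpu_layers})"


llm_stub, chosen = load_with_fallback(fake_factory)
print(chosen)  # 16
```

With OnPrem.LLM, the factory could be something like `lambda n: LLM(n_gpu_layers=n)`, using the same `n_gpu_layers` parameter shown above.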

See [the FAQ](https://amaiya.github.io/onprem/#faq) for extra tips, if
you experience issues with
[llama-cpp-python](https://pypi.org/project/llama-cpp-python/)
installation.

## How to Use

### Setup

``` python
from onprem import LLM

llm = LLM(verbose=False) # default model and backend are used
```

#### Cheat Sheet

*Local Models:* A number of different local LLM backends are supported.

- **Llama-cpp**: `llm = LLM(default_model="llama", n_gpu_layers=-1)`

- **Llama-cpp with selected GGUF model via URL**:

  ``` python
  # prompt templates are required for user-supplied GGUF models (see FAQ)
  llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
            prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>", n_gpu_layers=-1)
  ```

- **Llama-cpp with selected GGUF model via file path**:

  ``` python
  # prompt templates are required for user-supplied GGUF models (see FAQ)
  llm = LLM(model_url='zephyr-7b-beta.Q4_K_M.gguf',
            model_download_path='/path/to/folder/to/where/you/downloaded/model',
            prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>", n_gpu_layers=-1)
  ```

- **Hugging Face Transformers**:
  `llm = LLM(model_id='Qwen/Qwen2.5-0.5B-Instruct', device='cuda')`

- **Ollama**: `llm = LLM(model_url="ollama://llama3.2", api_key='na')`

- **Also Ollama**:
  `llm = LLM(model_url="ollama/llama3.2", api_key='na')`

- **Also Ollama**:
  `llm = LLM(model_url='http://localhost:11434/v1', api_key='na', model='llama3.2')`

- **vLLM**:
  `llm = LLM(model_url='http://localhost:8666/v1', api_key='na', model='Qwen/Qwen2.5-0.5B-Instruct')`

- **Also vLLM**:
  `llm = LLM('hosted_vllm/served-model-name', api_base="http://localhost:8666/v1", api_key="test123")`
  (assumes `served-model-name` parameter is supplied to
  `vllm.entrypoints.openai.api_server`).

- **vLLM with gpt-oss** (assumes `served-model-name` parameter is
  supplied to vLLM):

  ``` python
  # important: set max_tokens to high value due to intermediate reasoning steps that are generated
  llm = LLM(model_url='http://localhost:8666/v1', api_key='your_api_key', model=served_model_name, max_tokens=32000)
  result = llm.prompt(prompt, reasoning_effort="high")
  ```

*Cloud Models:* In addition to local LLMs, all cloud LLM providers
supported by [LiteLLM](https://github.com/BerriAI/litellm) are
compatible:

- **Anthropic Claude**:
  `llm = LLM(model_url="anthropic/claude-3-7-sonnet-latest")`

- **OpenAI GPT-4o**: `llm = LLM(model_url="openai/gpt-4o")`

- **AWS GovCloud Bedrock** (assumes AWS_ACCESS_KEY_ID and
  AWS_SECRET_ACCESS_KEY are set as environment variables)

  ``` python
  from onprem import LLM
  inference_arn = "YOUR INFERENCE ARN"
  endpoint_url = "YOUR ENDPOINT URL"
  region_name = "us-gov-east-1" # replace as necessary
  # set up LLM connection to Bedrock on AWS GovCloud
  llm = LLM( f"govcloud-bedrock://{inference_arn}", region_name=region_name, endpoint_url=endpoint_url)
  response = llm.prompt("Write a haiku about the moon.")
  ```

The instantiations above are described in more detail below.
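
The cheat sheet uses several `model_url` styles: `provider://model`, `provider/model`, an HTTP endpoint, and a GGUF file name or download link. The sketch below shows one way to tell them apart. `describe_model_url` is a hypothetical helper written for illustration; it is not how OnPrem.LLM itself parses these strings.

```python
def describe_model_url(model_url: str) -> dict:
    """Classify a model_url string in the styles shown in the cheat sheet."""
    if model_url.startswith(("http://", "https://")):
        # A link ending in .gguf is a model download; anything else is treated as an API endpoint.
        kind = "gguf_download" if model_url.endswith(".gguf") else "rest_endpoint"
        return {"kind": kind, "target": model_url}
    if "://" in model_url:
        provider, model = model_url.split("://", 1)
        return {"kind": "provider", "provider": provider, "model": model}
    if model_url.endswith(".gguf"):
        return {"kind": "gguf_file", "target": model_url}
    if "/" in model_url:
        provider, model = model_url.split("/", 1)
        return {"kind": "provider", "provider": provider, "model": model}
    return {"kind": "unknown", "target": model_url}


for url in ("ollama://llama3.2", "anthropic/claude-3-7-sonnet-latest",
            "http://localhost:11434/v1", "zephyr-7b-beta.Q4_K_M.gguf"):
    print(url, "->", describe_model_url(url)["kind"])
```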

#### GGUF Models and Llama.cpp

The default LLM backend is
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), and the
default model is currently a 7B-parameter model called
**Zephyr-7B-beta**, which is automatically downloaded and used.
Llama.cpp runs models in [GGUF](https://huggingface.co/docs/hub/en/gguf)
format. The two other default models are `llama` and `mistral`. For
instance, if `default_model='llama'` is supplied, then a
**Llama-3.1-8B-Instruct** model is automatically downloaded and used:

``` python
# Llama 3.1 is downloaded here and the correct prompt template for Llama-3.1 is automatically configured and used
llm = LLM(default_model='llama')
```

*Choosing Your Own Models:* Of course, you can also easily supply the
URL or path to an LLM of your choosing to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) (see the
[FAQ](https://amaiya.github.io/onprem/#faq) for an example).

*Supplying Extra Parameters:* Any extra parameters supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) are forwarded
directly to
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), the
default LLM backend.

#### Changing the Default LLM Backend

If `default_engine="transformers"` is supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm), Hugging Face
[transformers](https://github.com/huggingface/transformers) is used as
the LLM backend. Extra parameters to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) (e.g.,
`device='cuda'`) are forwarded directly to `transformers.pipeline`. If a
`model_id` parameter is supplied, the default LLM backend is
automatically changed to Hugging
Face [transformers](https://github.com/huggingface/transformers).

``` python
# LLama-3.1 model quantized using AWQ is downloaded and run with Hugging Face transformers (requires GPU)
llm = LLM(default_model='llama', default_engine='transformers')

# Using a custom model with Hugging Face Transformers
llm = LLM(model_id='Qwen/Qwen2.5-0.5B-Instruct', device_map='cpu')
```

See
[here](https://amaiya.github.io/onprem/#using-hugging-face-transformers-instead-of-llama.cpp)
for more information about using Hugging Face
[transformers](https://github.com/huggingface/transformers) as the LLM
backend.

You can also connect to **Ollama**, local LLM APIs (e.g., vLLM), and
cloud LLMs.

``` python
# connecting to an LLM served by Ollama
llm = LLM(model_url='ollama/llama3.2')

# connecting to an LLM served through vLLM (set API key as needed)
llm = LLM(model_url='http://localhost:8000/v1', api_key='token-abc123', model='Qwen/Qwen2.5-0.5B-Instruct')

# connecting to a cloud-backed LLM (e.g., OpenAI, Anthropic).
llm = LLM(model_url="openai/gpt-4o-mini")  # OpenAI
llm = LLM(model_url="anthropic/claude-3-7-sonnet-20250219") # Anthropic
```

**OnPrem.LLM** supports any provider and model supported by the
[LiteLLM](https://github.com/BerriAI/litellm) package.

See
[here](https://amaiya.github.io/onprem/#connecting-to-llms-served-through-rest-apis)
for more information on *local* LLM APIs.

More information on using OpenAI models specifically with **OnPrem.LLM**
is [here](https://amaiya.github.io/onprem/examples_openai.html).

#### Supplying Parameters to the LLM Backend

Extra parameters supplied to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) and
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt)
are passed directly to the LLM backend. Parameter names vary
depending on the backend you choose.

For instance, with the default llama-cpp backend, the default context
window size (`n_ctx`) is set to 3900 and the default output size
(`max_tokens`) is set to 512. Both are configurable parameters to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm). Increase them
if you have larger prompts or need longer outputs. Other parameters (e.g.,
`api_key`, `device_map`, etc.) can be supplied directly to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm) and will be
routed to the LLM backend or API (e.g., llama-cpp-python, Hugging Face
transformers, vLLM, OpenAI, etc.). The `max_tokens` parameter can also
be adjusted on-the-fly by supplying it to
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt).

On the other hand, for Ollama models, context window and output size are
controlled by `num_ctx` and `num_predict`, respectively.

With Hugging Face transformers, setting the context window size is
not needed, but the output size is controlled by the `max_new_tokens`
parameter to
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt).
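
The backend-specific parameter names described above can be summarized in a small lookup. This is a sketch only: `size_parameters` and the backend labels are hypothetical names used for illustration, while the parameter names themselves (`n_ctx`, `max_tokens`, `num_ctx`, `num_predict`, `max_new_tokens`) come from the text above.

```python
def size_parameters(backend: str, context_size: int, output_size: int) -> dict:
    """Map generic size settings to the parameter names used by each backend."""
    if backend == "llama_cpp":
        return {"n_ctx": context_size, "max_tokens": output_size}
    if backend == "ollama":
        return {"num_ctx": context_size, "num_predict": output_size}
    if backend == "transformers":
        # The context window size does not need to be set for this backend;
        # the output size is supplied to LLM.prompt as max_new_tokens.
        return {"max_new_tokens": output_size}
    raise ValueError(f"unknown backend label: {backend!r}")


print(size_parameters("llama_cpp", 3900, 512))  # {'n_ctx': 3900, 'max_tokens': 512}
print(size_parameters("ollama", 8192, 1024))    # {'num_ctx': 8192, 'num_predict': 1024}
```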

### Send Prompts to the LLM to Solve Problems

This is an example of few-shot prompting, where we provide an example of
what we want the LLM to do.

``` python
prompt = """Extract the names of people in the supplied sentences.
Separate names with commas and place on a single line.

# Example 1:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman

# Example 2:
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""

saved_output = llm.prompt(prompt, stop=['\n\n'])
```


    Cillian Murphy, Florence Pugh

**Additional prompt examples are [shown
here](https://amaiya.github.io/onprem/examples.html).**
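
When there are several worked examples, it can be convenient to assemble a few-shot prompt programmatically. The helper below is a hypothetical sketch (`build_few_shot_prompt` is not part of OnPrem.LLM) that follows the same layout as the prompt above.

```python
def build_few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble an instruction, worked examples, and a final unanswered query."""
    parts = [instruction.strip(), ""]
    for i, (sentence, people) in enumerate(examples, start=1):
        parts += [f"# Example {i}:", f"Sentence: {sentence}", "People:", people, ""]
    # The final example is left open so that the LLM completes the answer.
    parts += [f"# Example {len(examples) + 1}:", f"Sentence: {query}", "People:"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract the names of people in the supplied sentences.\n"
    "Separate names with commas and place on a single line.",
    [("James Gandolfini and Paul Newman were great actors.", "James Gandolfini, Paul Newman")],
    "I like Cillian Murphy's acting. Florence Pugh is great, too.",
)
print(prompt)
```

The resulting string can be passed to `llm.prompt` in the same way as the hand-written prompt.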

### Talk to Your Documents

Answers are generated from the content of your documents (i.e.,
[retrieval augmented generation](https://arxiv.org/abs/2005.11401) or
RAG). Here, we will use [GPU
offloading](https://amaiya.github.io/onprem/#speeding-up-inference-using-a-gpu)
to speed up answer generation using the default model. However, the
Zephyr-7B model may perform even better, responds faster, and is used in
our **[RAG example
notebook](https://amaiya.github.io/onprem/examples_rag.html)**.

``` python
from onprem import LLM

llm = LLM(n_gpu_layers=-1, store_type='sparse', verbose=False)
```

    llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (32768) -- the full capacity of the model will not be utilized

The default embedding model is:
`sentence-transformers/all-MiniLM-L6-v2`. You can change it by supplying
the `embedding_model_name` to
[`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm).

#### Step 1: Ingest the Documents into a Vector Database

As of v0.10.0, you have the option of storing documents in either a
dense vector store (i.e., Chroma) or a sparse vector store (i.e., a
built-in keyword search index). Sparse vector stores sacrifice a small
amount of inference speed for significant improvements in ingestion
speed (useful for larger document sets) and also assume answer sources
will include at least one word from the question. To select the store
type, supply either `store_type="dense"` or `store_type="sparse"` when
creating the [`LLM`](https://amaiya.github.io/onprem/llm.base.html#llm).
As you can see above, we use a sparse vector store here.
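
To see why a sparse store assumes that answer sources share at least one word with the question, consider this toy keyword-overlap search. It is a simplified illustration with hypothetical helper names and does not reflect how the built-in keyword index actually scores documents.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(question: str, chunks: list[str], limit: int = 2) -> list[str]:
    """Rank chunks by how many question words they contain; drop chunks with none."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(chunk)), chunk) for chunk in chunks]
    scored = [(score, chunk) for score, chunk in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:limit]]


chunks = [
    "ktrain is a low-code machine learning library.",
    "The weather was sunny all week.",
    "Low-code tools help domain experts build models.",
]
print(keyword_search("What is ktrain?", chunks))
# ['ktrain is a low-code machine learning library.']
```

The third chunk is topically related but shares no word with the question, so a purely keyword-based search never returns it. A dense store can retrieve such passages, at the cost of computing embeddings during ingestion.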

``` python
llm.ingest("./tests/sample_data")
```

    Creating new vectorstore at /home/amaiya/onprem_data/vectordb/sparse
    Loading documents from ./tests/sample_data
    Split into 354 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
    Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods

    Loading new documents: 100%|██████████████████████| 6/6 [00:09<00:00,  1.51s/it]
    Processing and chunking 43 new documents: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 116.11it/s]
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 354/354 [00:00<00:00, 2548.70it/s]

The default `chunk_size` is set quite low at 500 characters, as the
ingestion log above shows. You can increase it by supplying `chunk_size`
to `llm.ingest`. You can customize
the ingestion process even further by accessing the underlying vector
store directly, as illustrated in the [advanced RAG
example](https://amaiya.github.io/onprem/examples_rag.html#advanced-example-nsf-awards).
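
As a rough illustration of what `chunk_size` controls, the sketch below splits text into fixed-size character windows. It is a simplification: the `overlap` parameter and the `chunk_text` helper are assumptions made for this example, and the splitter used during ingestion may behave differently (for example, by respecting sentence or paragraph boundaries).

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


pieces = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz', 'yz']
```

Larger chunks give the LLM more context per source but leave room for fewer sources in the context window.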

#### Step 2: Answer Questions About the Documents

``` python
question = """What is  ktrain?"""
result = llm.ask(question)
```

     ktrain is a low-code machine learning platform. It provides out-of-the-box support for training models on various types of data such as text, vision, graph, and tabular.

The sources used by the model to generate the answer are stored in
`result['source_documents']`. You can adjust the number of sources
(i.e., chunks) considered by supplying the `limit` parameter to
`llm.ask`.

``` python
print("\nSources:\n")
for i, document in enumerate(result["source_documents"]):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)
```


    Sources:


    1.> /home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ktrain_paper/ktrain_paper.pdf:
    transferred to, and executed on new data in a production environment.
    ktrain is a Python library for machine learning with the goal of presenting a simple,
    uniﬁed interface to easily perform the above steps regardless of the type of data (e.g., text
    vs. images vs. graphs). Moreover, each of the three steps above can be accomplished in
    ©2022 Arun S. Maiya.
    License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are

    2.> /home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ktrain_paper/ktrain_paper.pdf:
    custom models and data formats, as well. Inspired by other low-code (and no-code) open-
    source ML libraries such as fastai (Howard and Gugger, 2020) and ludwig (Molino et al.,
    2019), ktrain is intended to help further democratize machine learning by enabling begin-
    ners and domain experts with minimal programming or data science experience to build
    sophisticated machine learning models with minimal coding. It is also a useful toolbox for

    3.> /home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ktrain_paper/ktrain_paper.pdf:
    Apache license, and available on GitHub at: https://github.com/amaiya/ktrain.
    2. Building Models
    Supervised learning tasks in ktrain follow a standard, easy-to-use template.
    STEP 1: Load and Preprocess Data. This step involves loading data from diﬀerent
    sources and preprocessing it in a way that is expected by the model. In the case of text,
    this may involve language-speciﬁc preprocessing (e.g., tokenization). In the case of images,

    4.> /home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ktrain_paper/ktrain_paper.pdf:
    AutoKeras (Jin et al., 2019) and AutoGluon (Erickson et al., 2020) lack some key “pre-
    canned” features in ktrain, which has the strongest support for natural language processing
    and graph-based data. Support for additional features is planned for the future.
    5. Conclusion
    This work presented ktrain, a low-code platform for machine learning. ktrain currently in-
    cludes out-of-the-box support for training models on text, vision, graph, and tabular
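
When many chunks come from the same file, as in the output above, it can help to tally sources per file. The sketch below uses stand-in objects that only mimic the `metadata["source"]` and `page_content` attributes shown above; `count_sources` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def count_sources(source_documents) -> Counter:
    """Count how many retrieved chunks came from each source file."""
    return Counter(doc.metadata["source"] for doc in source_documents)


# Stand-ins for the Document objects found in result["source_documents"].
docs = [
    SimpleNamespace(page_content="chunk one", metadata={"source": "ktrain_paper.pdf"}),
    SimpleNamespace(page_content="chunk two", metadata={"source": "ktrain_paper.pdf"}),
    SimpleNamespace(page_content="chunk three", metadata={"source": "notes.txt"}),
]

for path, n_chunks in count_sources(docs).most_common():
    print(f"{path}: {n_chunks} chunk(s)")
```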

### Extract Text from Documents

The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function can extract text from a range of different document formats
(e.g., PDFs, Microsoft PowerPoint, Microsoft Word, etc.). It is
automatically invoked when calling
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest).
Extracted text is represented as LangChain `Document` objects, where
`Document.page_content` stores the extracted text and
`Document.metadata` stores any extracted document metadata.

For PDFs, in particular, a number of different options are available
depending on your use case.

**Fast PDF Extraction (default)**

- **Pro:** Fast
- **Con:** Does not infer/retain structure of tables in PDF documents

``` python
from onprem.ingest import load_single_document

docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
docs[0].metadata
```

    {'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
     'file_path': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
     'page': 0,
     'total_pages': 9,
     'format': 'PDF 1.4',
     'title': '',
     'author': '',
     'subject': '',
     'keywords': '',
     'creator': 'LaTeX with hyperref',
     'producer': 'dvips + GPL Ghostscript GIT PRERELEASE 9.22',
     'creationDate': "D:20220406214054-04'00'",
     'modDate': "D:20220406214054-04'00'",
     'trapped': ''}

**Automatic OCR of PDFs**

- **Pro:** Automatically extracts text from scanned PDFs
- **Con:** Slow

The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function will automatically OCR PDFs that require it (i.e., PDFs that
are scanned hard-copies of documents). If a document is OCR’ed during
extraction, the `metadata['ocr']` field will be populated with `True`.

``` python
docs = load_single_document('tests/sample_data/ocr_document/lynn1975.pdf')
docs[0].metadata
```

    {'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/4/lynn1975.pdf',
     'ocr': True}
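
Because OCR'ed text tends to be noisier, you may want to handle such documents separately. The sketch below filters on the `ocr` metadata field described above. It uses stand-in objects instead of real `Document` instances and assumes the field may be absent when no OCR was performed.

```python
from types import SimpleNamespace


def split_by_ocr(docs):
    """Separate documents whose metadata marks them as OCR'ed from the rest."""
    ocr_docs = [d for d in docs if d.metadata.get("ocr", False)]
    other_docs = [d for d in docs if not d.metadata.get("ocr", False)]
    return ocr_docs, other_docs


# Stand-ins that mimic the metadata attribute of extracted documents.
docs = [
    SimpleNamespace(page_content="scanned text", metadata={"source": "lynn1975.pdf", "ocr": True}),
    SimpleNamespace(page_content="digital text", metadata={"source": "ktrain_paper.pdf"}),
]

ocr_docs, other_docs = split_by_ocr(docs)
print(len(ocr_docs), len(other_docs))  # 1 1
```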

**Markdown Conversion in PDFs**

- **Pro**: Better chunking for QA
- **Con**: Slower than default PDF extraction

The
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
function can convert PDFs to Markdown instead of plain text when
`pdf_markdown=True` is supplied as an argument:

``` python
docs = load_single_document('your_pdf_document.pdf', 
                            pdf_markdown=True)
```

Converting to Markdown can facilitate downstream tasks like
question-answering. For instance, when supplying `pdf_markdown=True` to
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest),
documents are chunked in a Markdown-aware fashion (e.g., the abstract of
a research paper tends to be kept together into a single chunk instead
of being split up). Note that Markdown will not be extracted if the
document requires OCR.
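
To illustrate what "Markdown-aware" chunking means, the following sketch splits Markdown at headings so that each section (such as an abstract) stays in one piece. It is a simplified stand-in written for this explanation, not the splitter that OnPrem.LLM uses internally.

```python
import re


def split_markdown_sections(markdown: str) -> list[str]:
    """Split Markdown text so that each heading stays with the text beneath it."""
    # Split immediately before any line starting with 1-6 '#' characters and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "# Title\n\n"
    "## Abstract\nWe present X.\nIt is fast.\n\n"
    "## Introduction\nBackground here.\n"
)
sections = split_markdown_sections(sample)
print(len(sections))  # 3
print(sections[1])
```

A plain character-based splitter could cut the abstract in half, whereas splitting at headings keeps it together.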

**Inferring Table Structure in PDFs**

- **Pro**: Makes it easier for LLMs to analyze information in tables
- **Con**: Slower than default PDF extraction

When supplying `infer_table_structure=True` to either
[`load_single_document`](https://amaiya.github.io/onprem/ingest.base.html#load_single_document)
or
[`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest),
tables are inferred and extracted from PDFs using a TableTransformer
model. Tables are represented as **Markdown** (or **HTML** if Markdown
conversion is not possible).

``` python
docs = load_single_document('your_pdf_document.pdf', 
                            infer_table_structure=True)
```

**Parsing Extracted Text Into Sentences or Paragraphs**

For some analyses (e.g., using prompts for information extraction), it
may be useful to parse the text extracted from documents into individual
sentences or paragraphs. This can be accomplished using the
[`segment`](https://amaiya.github.io/onprem/utils.html#segment)
function:

``` python
from onprem.ingest import load_single_document
from onprem.utils import segment
text = load_single_document('tests/sample_data/sotu/state_of_the_union.txt')[0].page_content
```

``` python
segment(text, unit='paragraph')[0]
```

    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.  Members of Congress and the Cabinet.  Justices of the Supreme Court.  My fellow Americans.'

``` python
segment(text, unit='sentence')[0]
```

    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.'
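
Once text is segmented, a prompt can be applied to each unit in turn. The sketch below shows the general pattern with a stand-in for the LLM call; `apply_prompt` and `fake_llm` are hypothetical names introduced for illustration.

```python
from typing import Callable


def apply_prompt(units: list[str], template: str, llm_fn: Callable[[str], str]) -> list[dict]:
    """Fill the template with each text unit and collect the LLM output."""
    results = []
    for unit in units:
        prompt = template.format(text=unit)
        results.append({"text": unit, "output": llm_fn(prompt)})
    return results


# Stand-in for a real LLM call, used so that the example runs offline.
def fake_llm(prompt: str) -> str:
    return "Congress" if "Congress" in prompt else "NA"


sentences = [
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.",
    "Members of Congress and the Cabinet.",
]
rows = apply_prompt(sentences, "List any institutions mentioned in: {text}", fake_llm)
print([row["output"] for row in rows])  # ['NA', 'Congress']
```

In practice, `sentences` could come from `segment(text, unit='sentence')` and `llm_fn` could be `llm.prompt`.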

### Summarization Pipeline

Summarize your raw documents (e.g., PDFs, MS Word) with an LLM.

#### Map-Reduce Summarization

Summarize each chunk in a document and then generate a single summary
from the individual summaries.

``` python
from onprem import LLM
llm = LLM(n_gpu_layers=-1, verbose=False, mute_stream=True) # disabling viewing of intermediate summarization prompts/inferences
```

``` python
from onprem.pipelines import Summarizer
summ = Summarizer(llm)

resp = summ.summarize('tests/sample_data/ktrain_paper/ktrain_paper.pdf', max_chunks_to_use=5) # omit max_chunks_to_use parameter to consider entire document
print(resp['output_text'])
```

     Ktrain is an open-source machine learning library that offers a unified interface for various machine learning tasks. The library supports both supervised and non-supervised machine learning, and includes methods for training models, evaluating models, making predictions on new data, and providing explanations for model decisions. Additionally, the library integrates with various explainable AI libraries such as shap, eli5 with lime, and others to provide more interpretable models.
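
The map-reduce pattern itself is simple: summarize every chunk (map), then summarize the concatenated partial summaries (reduce). The sketch below demonstrates the control flow with a toy summarizer. It is not the `Summarizer` implementation, and `map_reduce_summarize` and `first_sentence` are hypothetical names.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the combined partial summaries."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


# Toy "summarizer" that keeps only the first sentence (stands in for an LLM call).
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


chunks = [
    "ktrain is a library. It wraps Keras.",
    "It supports text. It supports images.",
]
print(map_reduce_summarize(chunks, first_sentence))  # ktrain is a library.
```

With a real LLM, the `summarize` callable would wrap a summarization prompt, and limiting the number of chunks (as `max_chunks_to_use` does above) bounds the number of map-step calls.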

#### Concept-Focused Summarization

Summarize a large document with respect to a particular concept of
interest.

``` python
from onprem import LLM
from onprem.pipelines import Summarizer
```

``` python
llm = LLM(default_model='zephyr', n_gpu_layers=-1, verbose=False, temperature=0)
summ = Summarizer(llm)
summary, sources = summ.summarize_by_concept('tests/sample_data/ktrain_paper/ktrain_paper.pdf', concept_description="question answering")
```


    The context provided describes the implementation of an open-domain question-answering system using ktrain, a low-code library for augmented machine learning. The system follows three main steps: indexing documents into a search engine, locating documents containing words in the question, and extracting candidate answers from those documents using a BERT model pretrained on the SQuAD dataset. Confidence scores are used to sort and prune candidate answers before returning results. The entire workflow can be implemented with only three lines of code using ktrain's SimpleQA module. This system allows for the submission of natural language questions and receives exact answers, as demonstrated in the provided example. Overall, the context highlights the ease and accessibility of building sophisticated machine learning models, including open-domain question-answering systems, through ktrain's low-code interface.

### Information Extraction Pipeline

Extract information from raw documents (e.g., PDFs, MS Word documents)
with an LLM.

``` python
from onprem import LLM
from onprem.pipelines import Extractor
# Notice that we're using a cloud-based, off-premises model here! See "OpenAI" section below.
llm = LLM(model_url='openai://gpt-3.5-turbo', verbose=False, mute_stream=True, temperature=0) 
extractor = Extractor(llm)
prompt = """Extract the names of research institutions (e.g., universities, research labs, corporations, etc.) 
from the following sentence delimited by three backticks. If there are no organizations, return NA.  
If there are multiple organizations, separate them with commas.
```{text}```
"""
df = extractor.apply(prompt, fpath='tests/sample_data/ktrain_paper/ktrain_paper.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
```

    /home/amaiya/projects/ghub/onprem/onprem/core.py:159: UserWarning: The model you supplied is gpt-3.5-turbo, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
      warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\

    'Institute for Defense Analyses'
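
Because the prompt above asks for comma-separated names and `NA` when nothing is found, each extraction string can be turned into a list with a small helper. The sketch below is a hypothetical post-processing step (the name `parse_extraction` and the second example input are made up for illustration); it is not part of `Extractor`.

```python
from typing import List

def parse_extraction(raw: str) -> List[str]:
    """Turn 'A, B' into ['A', 'B'] and 'NA' (or empty text) into []."""
    cleaned = raw.strip()
    if not cleaned or cleaned.upper() == "NA":
        return []
    return [item.strip() for item in cleaned.split(",") if item.strip()]

print(parse_extraction("Institute for Defense Analyses"))
print(parse_extraction("Example University, Example Labs "))  # hypothetical model output
print(parse_extraction("NA"))
```

Note that this simple split would break an organization name that itself contains a comma, so a different delimiter in the prompt may be safer for such data.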

### Few-Shot Classification

Make accurate text classification predictions using only a tiny number
of labeled examples.

``` python
# create classifier
from onprem.pipelines import FewShotClassifier
clf = FewShotClassifier(use_smaller=True)

# Fetching data
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
classes = ["soc.religion.christian", "sci.space"]
newsgroups = fetch_20newsgroups(subset="all", categories=classes)
corpus, group_labels = np.array(newsgroups.data), np.array(newsgroups.target_names)[newsgroups.target]

# Wrangling data into a dataframe and selecting training examples
data = pd.DataFrame({"text": corpus, "label": group_labels})
train_df = data.groupby("label").sample(5)
test_df = data.drop(index=train_df.index)

# X_sample only contains 5 examples of each class!
X_sample, y_sample = train_df['text'].values, train_df['label'].values

# test set
X_test, y_test = test_df['text'].values, test_df['label'].values

# train
clf.train(X_sample,  y_sample, max_steps=20)

# evaluate
print(clf.evaluate(X_test, y_test, print_report=False)['accuracy'])
#output: 0.98

# make predictions
clf.predict(['Elon Musk likes launching satellites.']).tolist()[0]
#output: sci.space
```

**TIP:** You can also easily train a wide range of [traditional text
classification
models](https://amaiya.github.io/onprem/pipelines.classifier.html) using
both Hugging Face transformers and scikit-learn as backends.

### Using Hugging Face Transformers Instead of Llama.cpp

By default, the LLM backend employed by **OnPrem.LLM** is
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python), which
requires models in [GGUF format](https://huggingface.co/docs/hub/gguf).
As of v0.5.0, it is now possible to use [Hugging Face
transformers](https://github.com/huggingface/transformers) as the LLM
backend instead. This is accomplished by using the `model_id` parameter
(instead of supplying a `model_url` argument). In the example below, we
run the
[Llama-3.1-8B](https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
model.

``` python
# llama-cpp-python does NOT need to be installed when using model_id parameter
llm = LLM(model_id="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", device_map='cuda')
```

This allows you to more easily use any model on the Hugging Face hub in
[SafeTensors format](https://huggingface.co/docs/safetensors/index)
provided it can be loaded with the Hugging Face `transformers.pipeline`.
Note that, when using the `model_id` parameter, the `prompt_template` is
set automatically by `transformers`.

The Llama-3.1 model loaded above was quantized using
[AWQ](https://huggingface.co/docs/transformers/main/en/quantization/awq),
which allows the model to fit onto smaller GPUs (e.g., laptop GPUs with
6GB of VRAM) similar to the default GGUF format. AWQ models will require
the [autoawq](https://pypi.org/project/autoawq/) package to be
installed: `pip install autoawq` (AWQ supports only Linux systems,
including Windows Subsystem for Linux). If you do need to load a model
that is not quantized, you can supply a quantization configuration at
load time (known as “inflight quantization”). In the following example,
we load an unquantized [Zephyr-7B-beta
model](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) that will be
quantized during loading to fit on GPUs with as little as 6GB of VRAM:

``` python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
llm = LLM(model_id="HuggingFaceH4/zephyr-7b-beta", device_map='cuda', 
          model_kwargs={"quantization_config":quantization_config})
```
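
As a rough back-of-the-envelope check on why 4-bit quantization lets a 7B-parameter model fit on a 6GB GPU, the following sketch estimates the memory needed for the weights alone. The function name `weight_memory_gb` is hypothetical, 7 billion is used as a round figure, and the estimate ignores activations, the KV cache, and quantization metadata, so actual usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # round figure for a "7B" model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions, 16-bit weights need about 14 GB, while 4-bit weights need about 3.5 GB, leaving headroom on a 6GB card.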

When supplying a `quantization_config`, the
[bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/installation)
library is used. It is a lightweight Python wrapper around custom CUDA
functions, in particular 8-bit optimizers, matrix multiplication
(LLM.int8()), and 8-bit and 4-bit quantization functions. There are
ongoing efforts by the
bitsandbytes team to support multiple backends in addition to CUDA. If
you receive errors related to bitsandbytes, please refer to the
[bitsandbytes
documentation](https://huggingface.co/docs/bitsandbytes/main/en/installation).

### Structured and Guided Outputs

LLMs do not always follow instructions reliably. **Structured
outputs** are a feature that ensures model responses follow a strict,
user-defined format (such as a JSON or XML schema) instead of free-form
text, making outputs predictable, machine-readable, and easy to
integrate into applications.

#### Natively Supported Structured Outputs

A number of LLM services (e.g., vLLM, OpenAI, Anthropic Claude, AWS
GovCloud Bedrock) include native support for producing structured
outputs. To take advantage of this capability when it exists, you can
supply a Pydantic model representing the desired output format to the
`response_format` parameter
of [`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt).

``` python
from onprem import LLM
from pydantic import BaseModel

class ContactInfo(BaseModel):
    name: str
    email: str
    plan_interest: str
    demo_requested: bool

# Create LLM instance for Claude
llm = LLM("anthropic/claude-3-7-sonnet-latest")

# Use structured output - this should automatically use Claude's native API
result = llm.prompt(
    "Extract info from: John Smith (john@example.com) is interested in our Enterprise plan and wants to schedule a demo for next Tuesday at 2pm.",
    response_format=ContactInfo,
)

print(f"Name: {result.name}")
print(f"Email: {result.email}")
print(f"Plan: {result.plan_interest}")
print(f"Demo: {result.demo_requested}")
```

The above approach using the `response_format` parameter works with both
**Anthropic** and **OpenAI** as LLM backends.

For **vLLM**, you can generate structured outputs as follows:

``` python

from onprem import LLM
llm = LLM(model_url='http://localhost:8666/v1', api_key='test123', model='MyGPT')
result = llm.prompt('Classify this sentiment: vLLM is wonderful!',
                     extra_body={"structured_outputs": {"choice": ["positive", "negative"]}})
```

A structured output example using **AWS GovCloud Bedrock** is [shown
here](https://amaiya.github.io/onprem/llm.backends.html#structured-outputs-with-aws-govcloud-bedrock).

When using an LLM backend that does not natively support structured
outputs, supplying the `response_format` parameter to
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt)
will result in an automatic fallback to a prompt-based approach to
structured outputs, as described next.

#### Prompt-Based Structured Outputs

The
[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)
method also allows you to specify the desired structure of the LLM’s
output as a Pydantic model. Internally,
[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)
wraps the user-supplied prompt within a larger prompt telling the LLM to
output results in a specific JSON format. This approach is sometimes
less efficient or reliable than the native methods described above, but
it works with any LLM. Since calling
[`LLM.prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.prompt)
with the `response_format` parameter will automatically invoke
[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)
when necessary, you will typically not have to call
[`LLM.pydantic_prompt`](https://amaiya.github.io/onprem/llm.base.html#llm.pydantic_prompt)
directly.

``` python
from pydantic import BaseModel, Field

class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

from onprem import LLM
llm = LLM(default_model='llama', verbose=False)
structured_output = llm.pydantic_prompt('Tell me a joke.', pydantic_model=Joke)
```

    llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

    {
      "setup": "Why couldn't the bicycle stand alone?",
      "punchline": "Because it was two-tired!"
    }

The output is a Pydantic object instead of a string:

``` python
structured_output
```

    Joke(setup="Why couldn't the bicycle stand alone?", punchline='Because it was two-tired!')

``` python
print(structured_output.setup)
print()
print(structured_output.punchline)
```

    Why couldn't the bicycle stand alone?

    Because it was two-tired!
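
The prompt-wrapping idea described above can be sketched with the standard library alone. This is an illustration, not the code behind `LLM.pydantic_prompt`: a `dataclass` stands in for the Pydantic model, and `wrap_prompt`, `parse_response`, and `fake_llm` are hypothetical helpers (the last one returns a canned reply instead of calling a model).

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Joke:
    setup: str
    punchline: str

def wrap_prompt(user_prompt: str, schema) -> str:
    """Embed the user prompt in instructions that request JSON with the schema's keys."""
    keys = ", ".join(f'"{f.name}"' for f in fields(schema))
    return (
        f"{user_prompt}\n\n"
        f"Respond only with a JSON object containing exactly these keys: {keys}."
    )

def parse_response(raw: str, schema):
    """Take the outermost {...} span from the reply and build the schema object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return schema(**json.loads(raw[start:end + 1]))

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply with extra chatter."""
    return 'Sure! {"setup": "Why couldn\'t the bicycle stand alone?", "punchline": "Because it was two-tired!"}'

joke = parse_response(fake_llm(wrap_prompt("Tell me a joke.", Joke)), Joke)
print(joke.setup)
print(joke.punchline)
```

Because nothing forces the model to comply, a reply that is not valid JSON or that uses different keys raises an error here, which is the reliability gap relative to native structured outputs.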

You can also use **OnPrem.LLM** with the
[Guidance](https://github.com/guidance-ai/guidance) package to guide the
LLM to generate outputs based on your conditions and constraints. We’ll
show a couple of examples here, but see [our documentation on guided
prompts](https://amaiya.github.io/onprem/examples_guided_prompts.html)
for more information.

``` python
from onprem import LLM

llm = LLM(n_gpu_layers=-1, verbose=False)
from onprem.pipelines.guider import Guider
guider = Guider(llm)
```

With the Guider, you can use regular expressions to control LLM
generation:

``` python
from guidance import gen  # gen() marks the part of the prompt the LLM fills in

prompt = """Question: Luke has ten balls. He gives three to his brother. How many balls does he have left?
Answer: """ + gen(name='answer', regex=r'\d+')

guider.prompt(prompt, echo=False)
```

    {'answer': '7'}

``` python
prompt = '19, 18,' + gen(name='output', max_tokens=50, stop_regex=r'[^\d]7[^\d]')
guider.prompt(prompt)
```

    19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,

    {'output': ' 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,'}
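
To see why the two patterns behave this way, they can be tried directly with Python's `re` module. The sketch below involves no LLM: the `continuation` string is a hypothetical unconstrained countdown, and it assumes that generation is cut off just before the text matching `stop_regex`, which is what the output above suggests.

```python
import re

# The same patterns as in the examples above, written as raw strings.
answer_pattern = re.compile(r"\d+")
stop_pattern = re.compile(r"[^\d]7[^\d]")

print(bool(answer_pattern.fullmatch("7")))      # a digits-only answer satisfies the constraint
print(bool(answer_pattern.fullmatch("seven")))  # a spelled-out answer does not

# Hypothetical continuation a model might produce without a stop condition.
continuation = " 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6,"
match = stop_pattern.search(continuation)
kept = continuation[:match.start()]  # everything before the stop match
print(repr(kept))
```

The `7` in `17` does not trigger the stop because it is preceded by a digit; only a standalone `7` surrounded by non-digit characters matches.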

See [the
documentation](https://amaiya.github.io/onprem/examples_guided_prompts.html)
for more examples of how to use
[Guidance](https://github.com/guidance-ai/guidance) with **OnPrem.LLM**.

### Solving Tasks With Agents

``` python
from onprem import LLM
from onprem.pipelines import Agent
llm = LLM('openai/gpt-4o-mini', mute_stream=True) 
agent = Agent(llm)
agent.add_webview_tool()
answer = agent.run("What is the highest level of education of the person listed on this page: https://arun.maiya.net?")
# ANSWER: Ph.D. in Computer Science
```

See the **[example notebook on
agents](https://amaiya.github.io/onprem/examples_agent.html)** for more
information.

## Built-In Web App

**OnPrem.LLM** includes a built-in Web app to access the LLM. To start
it, run the following command after installation:

``` shell
onprem --port 8000
```

Then, enter `localhost:8000` (or `<domain_name>:8000` if running on a
remote server) in a Web browser to access the application:

<img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/onprem_welcome.png" border="1" alt="screenshot" width="775"/>

For more information, [see the corresponding
documentation](https://amaiya.github.io/onprem/webapp.html).

## Examples

The [documentation](https://amaiya.github.io/onprem/) includes many
examples, including:

- [Prompts for
  Problem-Solving](https://amaiya.github.io/onprem/examples.html)
- [RAG Example](https://amaiya.github.io/onprem/examples_rag.html)
- [Code Generation](https://amaiya.github.io/onprem/examples_code.html)
- [Semantic
  Similarity](https://amaiya.github.io/onprem/examples_semantic.html)
- [Document
  Summarization](https://amaiya.github.io/onprem/examples_summarization.html)
- [Information
  Extraction](https://amaiya.github.io/onprem/examples_information_extraction.html)
- [Text
  Classification](https://amaiya.github.io/onprem/examples_classification.html)
- [Agent-Based Task
  Execution](https://amaiya.github.io/onprem/examples_agent.html)
- [Auto-Coding Survey
  Responses](https://amaiya.github.io/onprem/examples_qualitative_survey_analysis.html)
- [Legal and Regulatory
  Analysis](https://amaiya.github.io/onprem/examples_legal_analysis.html)

## FAQ

1.  **How do I use other models with OnPrem.LLM?**

    > You can supply any model of your choice using the `model_url` and
    > `model_id` parameters to `LLM` (see cheat sheet above).

    > Here, we will go into detail on how to supply a custom GGUF model
    > using the llama.cpp backend.

    > You can find llama.cpp-supported models with `GGUF` in the file
    > name on
    > [huggingface.co](https://huggingface.co/models?sort=trending&search=gguf).

    > Make sure you are pointing to the URL of the actual GGUF model
    > file, which is the “download” link on the model’s page. An example
    > for **Mistral-7B** is shown below:

    > <img src="https://raw.githubusercontent.com/amaiya/onprem/master/images/model_download_link.png" border="1" alt="screenshot" width="775"/>

    > When using the llama.cpp backend, GGUF models have specific prompt
    > formats that need to be supplied to `LLM`. For instance, the prompt
    > template required for **Zephyr-7B**, as described on the [model’s
    > page](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF), is:
    >
    > `<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>`
    >
    > So, to use the **Zephyr-7B** model, you must supply the
    > `prompt_template` argument to the `LLM` constructor (or specify it
    > in the `webapp.yml` configuration for the Web app).
    >
    > ``` python
    > # how to use Zephyr-7B with OnPrem.LLM
    > llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
    >           prompt_template = "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
    >           n_gpu_layers=33)
    > llm.prompt("List three cute names for a cat.")
    > ```

    > Prompt templates are **not** required for any other LLM backend
    > (e.g., when using Ollama as the backend or when using the `model_id`
    > parameter for transformers models). Prompt templates are also not
    > required if using any of the default models.

2.  **When installing `onprem`, I’m getting “build” errors related to
    `llama-cpp-python` (or `chroma-hnswlib`) on Windows/Mac/Linux?**

    > See [this LangChain documentation on
    > Llama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp)
    > for help on installing the `llama-cpp-python` package for your
    > system. Additional tips for different operating systems are shown
    > below:

    > For **Linux** systems like Ubuntu, try this:
    > `sudo apt-get install build-essential g++ clang`. Other tips are
    > [here](https://github.com/oobabooga/text-generation-webui/issues/1534).

    > For **Windows** systems, please try following [these
    > instructions](https://github.com/amaiya/onprem/blob/master/MSWindows.md).
    > We recommend you use [Windows Subsystem for Linux
    > (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install)
    > instead of using Microsoft Windows directly. If you do need to use
    > Microsoft Windows directly, be sure to install the [Microsoft C++
    > Build
    > Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
    > and make sure that **Desktop development with C++** is selected.

    > For **Macs**, try following [these
    > tips](https://github.com/imartinez/privateGPT/issues/445#issuecomment-1563333950).

    > There are also various other tips for each of the above OSes in
    > [this privateGPT repo
    > thread](https://github.com/imartinez/privateGPT/issues/445). Of
    > course, you can also [easily
    > use](https://colab.research.google.com/drive/1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)
    > **OnPrem.LLM** on Google Colab.

    > Finally, if you still can’t overcome issues with building
    > `llama-cpp-python`, you can try [installing the pre-built wheel
    > file](https://abetlen.github.io/llama-cpp-python/whl/cpu/llama-cpp-python/)
    > for your system:

    > **Example:**
    > `pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`
    >
    > **Tip:** There are [pre-built wheel files for
    > `chroma-hnswlib`](https://pypi.org/project/chroma-hnswlib/#files),
    > as well. If running `pip install onprem` fails on building
    > `chroma-hnswlib`, it may be because a pre-built wheel doesn’t yet
    > exist for the version of Python you’re using (in which case you
    > can try downgrading Python).

3.  **I’m behind a corporate firewall and am receiving an SSL error when
    trying to download the model?**

    > Try this:
    >
    > ``` python
    > from onprem import LLM
    > LLM.download_model(url, ssl_verify=False)
    > ```

    > You can download the embedding model (used by `LLM.ingest` and
    > `LLM.ask`) as follows:
    >
    > ``` sh
    > wget --no-check-certificate https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/all-MiniLM-L6-v2.zip
    > ```

    > Supply the unzipped folder name as the `embedding_model_name`
    > argument to `LLM`.

    > If you’re getting SSL errors even when running `pip install`, try
    > this:
    >
    > ``` sh
    > pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs
    > ```

4.  **How do I use this on a machine with no internet access?**

    > Use the `LLM.download_model` method to download the model files to
    > `<your_home_directory>/onprem_data` and transfer them to the same
    > location on the air-gapped machine.

    > For the `ingest` and `ask` methods, you will need to also download
    > and transfer the embedding model files:
    >
    > ``` python
    > from sentence_transformers import SentenceTransformer
    > model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    > model.save('/some/folder')
    > ```

    > Copy the `/some/folder` folder to the air-gapped machine and supply
    > its path to `LLM` via the `embedding_model_name` parameter.

5.  **My model is not loading when I call `llm = LLM(...)`?**

    > This can happen if the model file is corrupt (in which case you
    > should delete it from `<your_home_directory>/onprem_data` and
    > re-download it). It can also happen if `llama-cpp-python` needs to
    > be upgraded to the latest version.

6.  **I’m getting an `Illegal instruction (core dumped)` error when
    instantiating a `langchain.llms.LlamaCpp` or `onprem.LLM` object?**

    > Your CPU may not support instructions that `cmake` is using for
    > one reason or another (e.g., [due to Hyper-V in VirtualBox
    > settings](https://stackoverflow.com/questions/65780506/how-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit)).
    > You can try turning them off when building and installing
    > `llama-cpp-python`:

    > ``` sh
    > # example
    > CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir
    > ```

7.  **How can I speed up
    [`LLM.ingest`](https://amaiya.github.io/onprem/llm.base.html#llm.ingest)?**

    > By default, a GPU, if available, will be used to compute
    > embeddings, so ensure PyTorch is installed with GPU support. You
    > can explicitly control the device used for computing embeddings
    > with the `embedding_model_kwargs` argument.
    >
    > ``` python
    > from onprem import LLM
    > llm  = LLM(embedding_model_kwargs={'device':'cuda'})
    > ```

    > You can also supply `store_type="sparse"` to `LLM` to use a sparse
    > vector store, which sacrifices a small amount of inference speed
    > (`LLM.ask`) for significant speed ups during ingestion
    > (`LLM.ingest`).
    >
    > ``` python
    > from onprem import LLM
    > llm  = LLM(store_type="sparse")
    > ```
    >
    > Note, however, that, unlike dense vector stores, sparse vector
    > stores assume answer sources will contain at least one word in
    > common with the question.
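
The caveat about sparse vector stores in the last answer can be illustrated with a toy keyword matcher. This is only a stand-in for the idea of lexical matching, not the store that **OnPrem.LLM** uses: `tokenize`, `keyword_search`, and the sample passages are hypothetical, and a real sparse store would add stop-word handling and a proper ranking function.

```python
import re
from typing import List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(question: str, passages: List[str]) -> List[str]:
    """Return passages sharing at least one token with the question, best overlap first."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(p)), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [p for score, p in ranked if score > 0]

passages = [
    "ktrain is a low-code library for machine learning.",
    "The cat sat on the mat.",
]

print(keyword_search("Which library is low-code?", passages))  # shares words with passage 1
print(keyword_search("What simplifies ML?", passages))          # no shared words, nothing found
```

The second question is about the same topic but uses different vocabulary, so a purely lexical match returns nothing, whereas a dense (embedding-based) store could still surface the first passage.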

<!--
8. **What are ways in which OnPrem.LLM has been used?**
    > Examples include:
    > - extracting key performance parameters and other performance attributes from engineering documents
    > - auto-coding responses to government requests for information (RFIs)
    > - analyzing the Federal Acquisition Regulations (FAR)
    > - understanding where and how Executive Order 14028 on cybersecurity aligns with the National Cybersecurity Strategy
    > - generating a summary of ways to improve a course from thousands of reviews
    > - extracting specific information of interest from resumes for talent acquisition.
-->

## How to Cite

Please cite the [following paper](https://arxiv.org/abs/2509.21040) when
using **OnPrem.LLM**:

    @article{maiya2025generativeaiffrdcs,
          title={Generative AI for FFRDCs}, 
          author={Arun S. Maiya},
          year={2025},
          eprint={2509.21040},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2509.21040}, 
    }
