Metadata-Version: 2.4
Name: llm-scope-observer
Version: 0.1.0a2
Summary: Lightweight observability and diagnostics for local and self-hosted LLMs.
Author: llm-scope maintainers
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn[standard]>=0.27
Requires-Dist: psutil>=5.9
Requires-Dist: pydantic>=2.0
Requires-Dist: jinja2>=3.1
Provides-Extra: gpu
Requires-Dist: pynvml>=11.0; extra == "gpu"

# llm-scope-observer

> Lightweight observability and diagnostics for local and self-hosted LLMs.

`llm-scope-observer` is a small Python package that wraps your local LLM calls (Ollama, FastAPI backends, OpenWebUI integrations, custom Python code) and records:

- Latency per call
- Token usage (input, output, total, tokens/sec)
- CPU / RAM / (optional) GPU utilization
- Simple hallucination-risk heuristics
- Error information

All metrics are stored locally (SQLite by default) and visualized in a small FastAPI-based dashboard.

---

## Features

- **Request interceptor**: Decorator to wrap any Python function that calls an LLM.

  ```python
  from llm_scope import monitor
  import ollama

  @monitor(model="llama3")
  def generate(prompt: str) -> str:
      result = ollama.generate(model="llama3", prompt=prompt)
      return result["response"]
  ```

- **Token estimation**:
  - Approximate input and output tokens
  - Track total tokens and tokens/sec per call

- **System metrics snapshot** (per request):
  - CPU %
  - RAM %
  - GPU % (optional, via `pynvml` if installed)

- **Hallucination risk heuristic** (simple, signal-based):
  - Very long answer vs. short prompt
  - Strong claims without references
  - Repetition patterns
  - Basic self-contradiction patterns

- **Local dashboard**:
  - FastAPI backend + simple HTML UI
  - SQLite storage by default
  - Shows latency, token trends, errors, and resource correlation per model
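
To make the token-estimation and hallucination-risk bullets above more concrete, here is a minimal sketch of what such heuristics could look like. It is not the library's implementation: the function names, the 4-characters-per-token ratio, the word list, and the weights are all assumptions chosen for illustration, and the self-contradiction signal is left out.

```python
import re
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough token count, assuming about 4 characters per token."""
    return max(1, len(text) // 4) if text else 0


def tokens_per_second(output_tokens: int, latency_s: float) -> float:
    """Output throughput; returns 0.0 for a non-positive latency."""
    return output_tokens / latency_s if latency_s > 0 else 0.0


def hallucination_risk(prompt: str, response: str) -> float:
    """Add up a few illustrative signals into a score between 0.0 and 1.0."""
    score = 0.0
    # Signal 1: very long answer relative to a short prompt (ratio is arbitrary).
    if estimate_tokens(response) > 20 * max(1, estimate_tokens(prompt)):
        score += 0.4
    # Signal 2: strong claims with no URL or bracketed numeric citation.
    strong = re.search(r"\b(definitely|always|never|guaranteed)\b", response, re.I)
    cited = re.search(r"https?://|\[\d+\]", response)
    if strong and not cited:
        score += 0.3
    # Signal 3: the same sentence appears more than once.
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", response) if s.strip()]
    if sentences and max(Counter(sentences).values()) > 1:
        score += 0.3
    return min(score, 1.0)
```

A score like this is only a coarse signal for sorting calls worth a closer look; the thresholds would need tuning against real traffic.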

---

## Installation

```bash
pip install llm-scope-observer
```

Optional GPU metrics:

```bash
pip install "llm-scope-observer[gpu]"
```

Requires Python 3.9+.

---

## Quickstart

### 1. Instrument your LLM call

```python
from llm_scope import monitor
import time

@monitor(model="test-model")
def generate(prompt: str) -> str:
    time.sleep(0.1)
    return "hello from llm-scope-observer"
```

Every time `generate(...)` runs, a record is written to a local SQLite database (`llm_scope.db` by default).
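
If you want to confirm that records are arriving before starting the dashboard, one option is to inspect the database directly with the standard-library `sqlite3` module. The sketch below uses the `llm_calls` table name mentioned in this README but makes no assumption about its columns; `describe_table` is a hypothetical helper, not part of the package.

```python
import sqlite3


def describe_table(conn: sqlite3.Connection, table: str = "llm_calls"):
    """Return (column names, row count) for `table`."""
    # PRAGMA statements cannot take bound parameters, so validate the name.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return columns, count
```

Calling `describe_table(sqlite3.connect("llm_scope.db"))` from the directory where your instrumented code ran should then list whatever columns the library actually created, along with the number of recorded calls.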

### 2. Run the dashboard

After some traffic:

```bash
llm-scope ui --host 127.0.0.1 --port 8000
# or
python -m llm_scope.cli ui --host 127.0.0.1 --port 8000
```

Open http://127.0.0.1:8000/ to see:

- Average latency per model
- Slowest calls (tail latency)
- Token usage and tokens/sec
- Error counts
- CPU / RAM / GPU vs. latency
- Hallucination score per call

---

## How it works (high level)

- **Middleware / decorator**:
  - `@monitor(model="llama3")` wraps any function.
  - Captures `start`/`end` times, prompt, response, and errors.
  - Sends a metrics record to the storage backend.

- **Metrics**:
  - Token estimation from prompt and response text.
  - System stats from `psutil` (and optionally `pynvml`).
  - Simple heuristics for hallucination risk.

- **Storage**:
  - SQLite via `sqlite3` by default.
  - One table: `llm_calls` with timestamps, model, metrics, error, tags.

- **Dashboard**:
  - FastAPI app.
  - Reads from the same SQLite file.
  - Renders an HTML summary page (no external JS required).
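
The decorator-plus-storage flow described above can be sketched with only the standard library. This is an illustration of the general pattern, not the package's actual code: `make_sketch_monitor`, the column names, and the in-memory database are assumptions, and token, system, and heuristic metrics are omitted to keep it short.

```python
import functools
import sqlite3
import time


def make_sketch_monitor(conn: sqlite3.Connection):
    """Build a decorator factory that logs calls to `conn` (hypothetical helper)."""
    # Assumed schema; the real llm_calls table may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_calls ("
        "started_at REAL, model TEXT, latency_s REAL, "
        "prompt TEXT, response TEXT, error TEXT)"
    )

    def sketch_monitor(model: str):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(prompt: str, *args, **kwargs):
                started_at = time.time()      # wall-clock timestamp for the record
                t0 = time.perf_counter()      # monotonic clock for latency
                response, error = None, None
                try:
                    response = func(prompt, *args, **kwargs)
                    return response
                except Exception as exc:
                    error = repr(exc)
                    raise                     # record the error, then re-raise it
                finally:
                    latency_s = time.perf_counter() - t0
                    stored = None if response is None else str(response)
                    conn.execute(
                        "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?)",
                        (started_at, model, latency_s, prompt, stored, error),
                    )
                    conn.commit()
            return wrapper
        return decorator
    return sketch_monitor


conn = sqlite3.connect(":memory:")
sketch_monitor = make_sketch_monitor(conn)


@sketch_monitor(model="test-model")
def generate(prompt: str) -> str:
    return "hello"


generate("hi")
```

Writing the record in a `finally` block means failed calls are logged as well as successful ones, which is what lets a dashboard report error counts alongside latency.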

---

## Roadmap

This is an early MVP. Planned next steps include:

- Prompt clustering and slow-prompt detection
- Model A vs. Model B comparison
- Basic alerting hooks and export to tools like Grafana
- Optional HTTP ingestion mode (sidecar / agent model)

---

## License

MIT
