Metadata-Version: 2.4
Name: flexllm
Version: 0.2.1
Summary: High-performance LLM client with batch processing, caching, and checkpoint recovery
Project-URL: Homepage, https://github.com/KenyonY/flexllm
Project-URL: Repository, https://github.com/KenyonY/flexllm
Project-URL: Documentation, https://github.com/KenyonY/flexllm#readme
Project-URL: Issues, https://github.com/KenyonY/flexllm/issues
Author-email: kunyuan <beidongjiedeguang@gmail.com>
License: MIT
License-File: LICENSE
Keywords: async,batch,cache,gemini,llm,multimodal,openai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: loguru>=0.6.0
Requires-Dist: orjson
Requires-Dist: rich>=12.0.0
Requires-Dist: tqdm>=4.60.0
Provides-Extra: all
Requires-Dist: fire>=0.5.0; extra == 'all'
Requires-Dist: flaxkv2>=0.1.0; extra == 'all'
Requires-Dist: google-generativeai>=0.3.0; extra == 'all'
Requires-Dist: opencv-python>=4.5.0; extra == 'all'
Requires-Dist: pillow>=9.0.0; extra == 'all'
Requires-Dist: pyyaml>=6.0; extra == 'all'
Requires-Dist: requests>=2.28.0; extra == 'all'
Requires-Dist: tiktoken>=0.5.0; extra == 'all'
Provides-Extra: cache
Requires-Dist: flaxkv2>=0.1.0; extra == 'cache'
Provides-Extra: cli
Requires-Dist: fire>=0.5.0; extra == 'cli'
Requires-Dist: pyyaml>=6.0; extra == 'cli'
Requires-Dist: requests>=2.28.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: isort>=5.12.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.20.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.3.0; extra == 'gemini'
Provides-Extra: image
Requires-Dist: opencv-python>=4.5.0; extra == 'image'
Requires-Dist: pillow>=9.0.0; extra == 'image'
Provides-Extra: test
Requires-Dist: flaxkv2>=0.1.0; extra == 'test'
Requires-Dist: pillow>=9.0.0; extra == 'test'
Requires-Dist: pytest-asyncio>=0.20.0; extra == 'test'
Requires-Dist: pytest>=7.0.0; extra == 'test'
Provides-Extra: token
Requires-Dist: tiktoken>=0.5.0; extra == 'token'
Description-Content-Type: text/markdown

<h1 align="center">flexllm</h1>

<p align="center">
    <strong>High-performance LLM client with batch processing, caching, and checkpoint recovery</strong>
</p>

<p align="center">
    <a href="https://pypi.org/project/flexllm/">
        <img src="https://img.shields.io/pypi/v/flexllm?color=brightgreen&style=flat-square" alt="PyPI version">
    </a>
    <a href="https://github.com/KenyonY/flexllm/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/KenyonY/flexllm.svg?color=blue&style=flat-square">
    </a>
    <a href="https://pypistats.org/packages/flexllm">
        <img alt="pypi downloads" src="https://img.shields.io/pypi/dm/flexllm?style=flat-square">
    </a>
</p>

---

## Features

- **Batch Processing**: Process thousands of requests concurrently with QPS control
- **Response Caching**: Built-in caching with TTL support to avoid duplicate API calls
- **Checkpoint Recovery**: Resume interrupted batch jobs automatically
- **Multi-Provider**: OpenAI, Gemini, and any OpenAI-compatible API (vLLM, Ollama, DeepSeek, Qwen...)
- **Multi-Modal**: Image + text processing with automatic base64 encoding
- **Load Balancing**: Multi-endpoint client pool with failover
- **Async-First**: Built on asyncio for high-concurrency, non-blocking requests
- **CLI Tool**: Quick ask, chat, and test commands

## Installation

```bash
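pip install flexllm

# With Gemini support
pip install "flexllm[gemini]"

# With caching support
pip install "flexllm[cache]"

# With CLI support
pip install "flexllm[cli]"

# With image processing support (Pillow, OpenCV)
pip install "flexllm[image]"

# With token counting support (tiktoken)
pip install "flexllm[token]"

# All features
pip install "flexllm[all]"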
```

## Quick Start

### Single Request

```python
from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

# Async (call from inside a coroutine, e.g. one started with asyncio.run)
response = await client.chat_completions([
    {"role": "user", "content": "Hello!"}
])

# Sync
response = client.chat_completions_sync([
    {"role": "user", "content": "Hello!"}
])
```

### Batch Processing with Checkpoint Recovery

```python
from flexllm import LLMClient

client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    concurrency_limit=50,
    max_qps=100,
)

messages_list = [
    [{"role": "user", "content": "What is 1+1?"}],
    [{"role": "user", "content": "What is 2+2?"}],
    # ... thousands more
]

# Batch processing with checkpoint recovery
# If interrupted, re-running will resume from where it stopped
results = await client.chat_completions_batch(
    messages_list,
    output_file="results.jsonl",  # Auto-save progress
    show_progress=True,
)
```
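
The resume behaviour can be pictured as follows: the output file holds one JSON line per finished request, and a re-run skips every index already recorded. This is a minimal sketch under that assumption, not flexllm's actual implementation; `load_done_indices`, `run_with_checkpoint`, and the `index`/`output` field names are hypothetical.

```python
import json
import os
import tempfile


def load_done_indices(path):
    """Return the set of request indices already recorded in a JSONL file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["index"])
            except (json.JSONDecodeError, KeyError):
                # Skip lines that cannot be parsed as a result record.
                continue
    return done


def run_with_checkpoint(items, handler, path):
    """Process only items missing from the checkpoint; append each result."""
    done = load_done_indices(path)
    processed = 0
    with open(path, "a", encoding="utf-8") as f:
        for index, item in enumerate(items):
            if index in done:
                continue  # Finished in an earlier run.
            record = {"index": index, "output": handler(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # Persist progress after every item.
            processed += 1
    return processed


# Usage: the second run only handles the item that was not yet recorded.
checkpoint = os.path.join(tempfile.mkdtemp(), "results.jsonl")
first_run = run_with_checkpoint(["a", "b"], str.upper, checkpoint)
second_run = run_with_checkpoint(["a", "b", "c"], str.upper, checkpoint)
print(first_run, second_run)
```

Keying records by position means the input list must keep the same order between runs; a content hash would be a sturdier key if the inputs can change.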

### Response Caching

```python
from flexllm import LLMClient, ResponseCacheConfig

# Enable caching (avoid duplicate API calls)
client = LLMClient(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL
)

# Duplicate requests hit cache automatically
messages = [{"role": "user", "content": "Hello!"}]
result1 = await client.chat_completions(messages)  # API call
result2 = await client.chat_completions(messages)  # Cache hit (no API call)
```
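
To make the TTL idea concrete, here is a small in-memory sketch: the request is hashed into a key, and an entry is dropped once it is older than `ttl` seconds. It only illustrates the general technique; `TTLCache` and `make_key` are hypothetical names, and flexllm's own cache (backed by `flaxkv2` per the `cache` extra) may work differently.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # Injectable so expiry can be tested without sleeping.
        self._store = {}

    @staticmethod
    def make_key(model, messages):
        # Serialize with sorted keys so identical requests hash identically.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # Expired: treat as a miss.
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)
```

A caller would compute the key, try `get`, and only send the request (then `set` the result) on a miss.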

### Streaming Response

```python
# Assumes `client` and `messages` are defined as in the examples above
async for chunk in client.chat_completions_stream(messages):
    print(chunk, end="", flush=True)
```
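
If the full text is needed as well as the live output, the chunks can be collected while they are printed. The sketch below assumes each chunk is a plain string, as the `print` call above suggests; `fake_stream` is a hypothetical stand-in for `client.chat_completions_stream(messages)` so the example runs without an API key.

```python
import asyncio


async def fake_stream():
    """Stand-in for a streaming call; yields text chunks."""
    for chunk in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)  # Yield control, as a network read would.
        yield chunk


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream()))
```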

### Multi-Modal (Vision)

```python
from flexllm import MllmClient

client = MllmClient(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4o",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}  # Local path or URL
        ]
    }
]

response = await client.call_llm([messages])
```
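
For reference, encoding a local image as base64 generally means producing a data URL like the one built below. This is a sketch of the general technique using only the standard library, not flexllm's internal code; `to_data_url` is a hypothetical helper, and with the client above you should not need to call anything like it yourself.

```python
import base64
import mimetypes


def to_data_url(path):
    """Encode a local file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Unknown extension: fall back to a generic binary type.
        mime_type = "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```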

### Load Balancing with Failover

```python
from flexllm import LLMClientPool

# Create client pool with multiple endpoints
pool = LLMClientPool(
    endpoints=[
        {"base_url": "http://host1:8000/v1", "api_key": "key1", "model": "qwen"},
        {"base_url": "http://host2:8000/v1", "api_key": "key2", "model": "qwen"},
    ],
    load_balance="round_robin",  # round_robin, weighted, random, fallback
    fallback=True,  # Auto switch on failure
)

# Same API as LLMClient
result = await pool.chat_completions(messages)

# Distribute batch requests across endpoints
results = await pool.chat_completions_batch(messages_list, distribute=True)
```
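
The combination of `round_robin` and `fallback=True` can be illustrated with a small synchronous sketch: each call starts at the next endpoint in rotation and, on an error, moves on to the following ones. `RoundRobinPool` and its `call` method are hypothetical and only show the idea; they are not `LLMClientPool`'s implementation.

```python
import itertools


class RoundRobinPool:
    """Rotate through endpoints; on failure, try the next one."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(range(len(self.endpoints)))

    def call(self, send):
        """Call `send(endpoint)`, falling back to later endpoints on error."""
        start = next(self._cycle)
        last_error = None
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return send(endpoint)
            except Exception as exc:
                last_error = exc  # Remember the failure and try the next endpoint.
        raise RuntimeError("all endpoints failed") from last_error
```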

### Gemini Client

```python
from flexllm import GeminiClient

# Gemini Developer API
client = GeminiClient(
    model="gemini-2.5-flash",
    api_key="your-gemini-api-key"
)

# With thinking mode
response = await client.chat_completions(
    messages,
    thinking="high",  # False, True, "minimal", "low", "medium", "high"
)

# Vertex AI mode
client = GeminiClient(
    model="gemini-2.5-flash",
    project_id="your-project-id",
    location="us-central1",
    use_vertex_ai=True,
)
```

### Thinking Mode (DeepSeek, etc.)

```python
from flexllm import OpenAIClient

client = OpenAIClient(
    base_url="https://api.deepseek.com/v1",
    api_key="your-key",
    model="deepseek-reasoner",
)

# Enable thinking
result = await client.chat_completions(
    messages,
    thinking=True,
    return_raw=True,
)

# Parse thinking content
parsed = OpenAIClient.parse_thoughts(result.data)
print("Thinking:", parsed["thought"])
print("Answer:", parsed["answer"])
```

## CLI Usage

```bash
# Quick ask (for scripts/agents)
flexllm ask "What is Python?"
flexllm ask "Explain this" -s "You are a code expert"
echo "long text" | flexllm ask "Summarize"

# Interactive chat
flexllm chat
flexllm chat "Hello"
flexllm chat --model=gpt-4 "Hello"

# List models
flexllm models           # Remote models
flexllm list_models      # Configured models

# Test connection
flexllm test

# Initialize config
flexllm init
```

### CLI Configuration

Create `~/.flexllm/config.yaml`:

```yaml
default: "gpt-4"

models:
  - id: gpt-4
    name: gpt-4
    provider: openai
    base_url: https://api.openai.com/v1
    api_key: your-api-key

  - id: local
    name: local-ollama
    provider: openai
    base_url: http://localhost:11434/v1
    api_key: EMPTY
```

Or use environment variables:

```bash
export FLEXLLM_BASE_URL="https://api.openai.com/v1"
export FLEXLLM_API_KEY="your-key"
export FLEXLLM_MODEL="gpt-4"
```
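
The same three variables can also be read from your own Python code. The helper below is a hypothetical sketch (it is not part of flexllm, and the CLI's own precedence rules between the config file and the environment are not described here); it simply collects whichever `FLEXLLM_*` variables are set.

```python
import os


def load_env_config(environ=None):
    """Collect FLEXLLM_* settings that are set; unset or empty ones are omitted."""
    if environ is None:
        environ = os.environ
    names = {
        "base_url": "FLEXLLM_BASE_URL",
        "api_key": "FLEXLLM_API_KEY",
        "model": "FLEXLLM_MODEL",
    }
    return {key: environ[var] for key, var in names.items() if environ.get(var)}
```

The returned keys match the `LLMClient` constructor arguments of the same names, so the dictionary could be passed as `LLMClient(**config)` once `model` and `base_url` are both present.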

## API Reference

### LLMClient

Main client for OpenAI-compatible APIs.

```python
LLMClient(
    model: str,                    # Model name
    base_url: str,                 # API base URL
    api_key: str = "EMPTY",        # API key
    provider: str = "auto",        # "auto", "openai", "gemini"
    cache: ResponseCacheConfig = None,  # Cache config
    concurrency_limit: int = 50,   # Max concurrent requests
    max_qps: float = None,         # Max requests per second
    retry_times: int = 3,          # Retry count on failure
    retry_delay: float = 1.0,      # Delay between retries
    timeout: int = 120,            # Request timeout (seconds)
)
```
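
To show how `concurrency_limit`, `retry_times`, and `retry_delay` typically interact, here is a plain asyncio sketch. It assumes that `retry_times` counts retries after the first attempt and that the delay is fixed; flexllm may define either differently. `call_with_retry` and `run_limited` are hypothetical helpers, not library functions.

```python
import asyncio


async def call_with_retry(func, retry_times=3, retry_delay=1.0):
    """Await `func()`; on an exception, retry up to `retry_times` more times."""
    for attempt in range(retry_times + 1):
        try:
            return await func()
        except Exception:
            if attempt == retry_times:
                raise  # Out of retries: surface the last error.
            await asyncio.sleep(retry_delay)


async def run_limited(funcs, concurrency_limit=50, **retry_kwargs):
    """Run coroutine factories with at most `concurrency_limit` in flight."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def worker(func):
        async with semaphore:
            return await call_with_retry(func, **retry_kwargs)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(worker(f) for f in funcs))
```

Each element of `funcs` is a zero-argument callable returning a fresh coroutine, so a failed request can be re-created on retry.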

### Methods

| Method | Description |
|--------|-------------|
| `chat_completions(messages)` | Single async request |
| `chat_completions_sync(messages)` | Single sync request |
| `chat_completions_batch(messages_list)` | Batch async requests |
| `chat_completions_batch_sync(messages_list)` | Batch sync requests |
| `chat_completions_stream(messages)` | Streaming response |

### ResponseCacheConfig

```python
ResponseCacheConfig(
    enabled: bool = False,         # Enable caching
    ttl: int = 86400,              # Time-to-live in seconds (default 24h)
    cache_dir: str = "~/.cache/flexllm/llm_response",
    use_ipc: bool = True,          # Use IPC for multi-process sharing
)

# Shortcuts
ResponseCacheConfig.with_ttl(3600)     # 1 hour TTL
ResponseCacheConfig.persistent()        # Never expire
```

### Token Counting

```python
from flexllm import count_tokens, estimate_cost, estimate_batch_cost

# Count tokens
tokens = count_tokens("Hello world", model="gpt-4")

# Estimate cost
cost = estimate_cost(tokens, model="gpt-4", is_input=True)

# Estimate batch cost
total_cost = estimate_batch_cost(messages_list, model="gpt-4")
```

## Architecture

```
flexllm/
├── flexllm/
│   ├── llm_client.py          # Unified client (recommended)
│   ├── openaiclient.py        # OpenAI-compatible API
│   ├── geminiclient.py        # Google Gemini
│   ├── mllm_client.py         # Multi-modal client
│   ├── client_pool.py         # Load balancing pool
│   ├── response_cache.py      # Response caching
│   ├── token_counter.py       # Token counting & cost
│   ├── async_api/             # Async engine
│   └── processors/            # Image & message processing
```

## License

MIT
