Metadata-Version: 2.4
Name: inferencekit-core
Version: 1.0.0
Summary: A comprehensive AI inference kit with multi-provider support, caching, and monitoring.
License: MIT
License-File: LICENSE
Author: Karan Bhatia
Author-email: karanbhatiakb@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Provides-Extra: anthropic
Provides-Extra: cohere
Provides-Extra: huggingface
Provides-Extra: openai
Requires-Dist: aiohttp (>=3.9.0,<4.0.0)
Requires-Dist: anthropic (>=1.0.0,<2.0.0) ; extra == "anthropic"
Requires-Dist: asyncpg (>=0.29.0,<0.30.0)
Requires-Dist: cohere (>=1.0.0,<2.0.0) ; extra == "cohere"
Requires-Dist: faiss-cpu (>=1.7.0,<2.0.0)
Requires-Dist: fastapi (>=0.104.0,<0.105.0)
Requires-Dist: huggingface-hub (>=1.0.0,<2.0.0) ; extra == "huggingface"
Requires-Dist: numpy (>=1.24.0,<2.0.0)
Requires-Dist: openai (>=1.0.0,<2.0.0) ; extra == "openai"
Requires-Dist: pydantic (>=2.5.0,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.0,<2.0.0)
Requires-Dist: redis (>=5.0.0,<6.0.0)
Requires-Dist: sentence-transformers (>=2.2.0,<3.0.0)
Requires-Dist: uvicorn[standard] (>=0.24.0,<0.25.0)
Project-URL: Repository, https://github.com/karanbhatia007/inferencekit
Description-Content-Type: text/markdown

# InferenceKit


**Cut LLM API costs by up to 70% with semantic caching and smart routing.**

## Why InferenceKit?

The problem: every time you call GPT-4, Claude, or Llama through a paid API, you pay per token. Every single call. Even if you asked the exact same question five minutes ago, and even if someone else just asked almost the same question.

That's like paying for electricity every time you flip a light switch, even if the light is already on.

InferenceKit does three things:

1. **Semantic Caching**: If someone already asked "What is Python?" and you now ask "Explain the Python programming language", those are basically the same question. InferenceKit detects that and returns the cached answer instantly, with no API call and no cost.
2. **Smart Routing**: Not every question needs GPT-4. "What is 2+2?" should go to the cheapest model; "Write a complex legal analysis" should go to the most capable one. InferenceKit decides automatically based on prompt complexity.
3. **Cost Tracking**: You see what you're spending, per request, in real time. No surprise bills at the end of the month.

In short, InferenceKit is an inference toolkit for AI models, built to reduce wasted API spend.
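
To make the smart routing idea concrete, here is a minimal sketch of complexity-based routing. It is not InferenceKit's actual routing logic: the model names, the keyword list, and the `complexity_score` and `route` helpers are all hypothetical, and the heuristic (prompt length plus keyword hints) is only one plausible way to estimate complexity.

```python
# Hypothetical model tiers; placeholders, not InferenceKit identifiers.
CHEAP_MODEL = "small-model"
CAPABLE_MODEL = "large-model"

# Hand-picked words that hint at a demanding task (illustrative only).
COMPLEX_HINTS = {"analysis", "analyze", "legal", "prove", "design", "refactor"}


def complexity_score(prompt: str) -> float:
    """Return a rough 0..1 complexity score from length and keyword hints."""
    words = [w.strip(".,?!") for w in prompt.lower().split()]
    length_part = min(len(words) / 200, 1.0)  # longer prompts score higher
    hint_part = 1.0 if any(w in COMPLEX_HINTS for w in words) else 0.0
    return max(length_part, hint_part)


def route(prompt: str, threshold: float = 0.5) -> str:
    """Pick the cheap model unless the prompt looks complex."""
    if complexity_score(prompt) >= threshold:
        return CAPABLE_MODEL
    return CHEAP_MODEL


print(route("What is 2+2?"))
print(route("Write a complex legal analysis"))
```

With this heuristic the arithmetic question goes to the cheap tier and the legal analysis goes to the capable tier. A production router would likely use a better complexity signal than keywords.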

## Features

- **Semantic Caching**: Cache responses based on semantic similarity using vector embeddings
- **Multi-Provider Support**: OpenAI, Anthropic, Cohere, HuggingFace, and local models
- **Cost Tracking**: Detailed pricing information and cost calculation for different providers
- **Batch Processing**: Efficient batch generation with concurrency control
- **Vector Similarity**: Various similarity metrics (cosine, euclidean, manhattan, jaccard, hamming)
- **Memory Backend**: In-memory storage for requests and responses
- **Type Safety**: Full Pydantic type definitions
- **Streaming Support**: Real-time streaming responses
    - Multiple streaming formats (text, JSON, binary, mixed) with automatic format detection
    - Chunked data processing with built-in buffering and flow control
    - Integration with the existing `Model` class
    - Error handling
    - Cancellation and timeout support
    - Context management for streaming operations
    - Both sync and async streaming
    - Progress tracking and callbacks
    - Multi-provider streaming (OpenAI, Anthropic, Local)
    - Memory-efficient streaming with configurable buffer sizes
    - Concurrent stream management with limits
    - Detailed performance metrics and monitoring

## Installation

```bash
pip install inferencekit-core
```

Provider SDKs are optional extras (`anthropic`, `cohere`, `huggingface`, `openai`). To install one, append the extra name in brackets to the package name, for example `[openai]`.

Or install from source:

```bash
git clone https://github.com/karanbhatia007/inferencekit.git
cd inferencekit
poetry install
```

## Quick Start

```python
import asyncio

from inferencekit import Model, ModelConfig, ProviderConfig, ProviderType

# Initialize configuration
config = ModelConfig(
    provider=ProviderType.OPENAI,
    model_name="gpt-3.5-turbo",
    cache={
        "enabled": True,
        "max_size": 1000,
        "similarity_threshold": 0.8,
        "ttl_seconds": 3600
    },
    provider_config=ProviderConfig(
        api_key="your-api-key-here",
        timeout=10.0,
        max_retries=3
    )
)


async def main():
    # Create and initialize model
    model = Model(config)
    await model.initialize()

    # Generate text
    try:
        response = await model.generate(
            "Explain quantum computing in simple terms.",
            temperature=0.8,
            max_tokens=150
        )
        print(response.content)
    finally:
        # Always release resources, even if generation fails
        await model.shutdown()


# `await` only works inside a coroutine, so run everything through asyncio
asyncio.run(main())
```

## Advanced Usage

### Semantic Caching

```python
# The first request calls the provider and stores the response in the cache
response = await model.generate("Tell me about AI")
# A semantically similar request is served from the cache instead
similar_response = await model.generate("Explain artificial intelligence")
```
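
To illustrate what a semantic lookup involves, here is a minimal, self-contained sketch of a similarity-threshold cache. It is not InferenceKit's implementation: `TinySemanticCache`, `embed`, and `cosine` are hypothetical names, and the "embedding" is a toy bag-of-words vector that only matches prompts sharing words. Matching true paraphrases such as "Tell me about AI" and "Explain artificial intelligence" needs a real embedding model. The parameter names mirror the `similarity_threshold`, `ttl_seconds`, and `max_size` cache options shown in Quick Start.

```python
import math
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinySemanticCache:
    """Illustrative cache: linear scan, similarity threshold, TTL, and a size cap."""

    def __init__(self, similarity_threshold=0.8, ttl_seconds=3600, max_size=1000):
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.max_size = max_size
        self._entries = []  # list of (embedding, response, stored_at)

    def get(self, prompt):
        now = time.monotonic()
        # Drop expired entries before searching.
        self._entries = [e for e in self._entries if now - e[2] < self.ttl_seconds]
        query = embed(prompt)
        best = max(self._entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.similarity_threshold:
            return best[1]
        return None  # cache miss: the caller would now call the provider

    def put(self, prompt, response):
        if len(self._entries) >= self.max_size:
            self._entries.pop(0)  # evict the oldest entry
        self._entries.append((embed(prompt), response, time.monotonic()))


cache = TinySemanticCache(similarity_threshold=0.8)
cache.put("What is the Python programming language?", "Python is a programming language.")
print(cache.get("what is the python programming language"))  # same words: a hit
print(cache.get("How do I bake bread?"))  # no shared words: a miss
```

A linear scan is fine for a sketch. A real cache would typically use a vector index for lookups at scale.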

### Batch Processing

```python
prompts = [
    "Write a haiku about spring.",
    "Summarize the plot of Romeo and Juliet.",
    "Translate 'Hello, how are you?' to Spanish."
]

batch_response = await model.batch_generate(
    prompts,
    batch_size=2,
    max_concurrency=3,
    temperature=0.6
)
```
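
The interplay of `batch_size` and `max_concurrency` can be sketched with plain `asyncio`. This is an assumption about how such options commonly work, not a description of InferenceKit internals: `fake_generate` and `batch_generate_sketch` are hypothetical, and the sleep stands in for a real API call.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Stand-in for a model call; sleeps briefly instead of hitting an API."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def batch_generate_sketch(prompts, batch_size=2, max_concurrency=3):
    """Process prompts batch by batch, capping the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(prompt):
        # At most `max_concurrency` coroutines get past this point at once.
        async with semaphore:
            return await fake_generate(prompt)

    results = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        # gather() returns results in the same order as its inputs.
        results.extend(await asyncio.gather(*(limited(p) for p in batch)))
    return results


results = asyncio.run(batch_generate_sketch(["a", "b", "c"], batch_size=2, max_concurrency=3))
print(results)
```

In this sketch the semaphore only matters when `batch_size` exceeds `max_concurrency`; otherwise the batch size is the effective limit.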

### Cost Tracking

```python
from inferencekit.utils.costs import calculate_cost, get_cost_analysis

# Calculate cost for a request
cost_info = calculate_cost(
    "openai",
    "gpt-3.5-turbo",
    input_tokens=100,
    output_tokens=50
)
print(f"Total cost: ${cost_info['total_cost']}")
```
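
The arithmetic behind a per-request cost is simple enough to sketch. The prices below are made-up placeholders, not real provider rates, and `calculate_cost_sketch` is a hypothetical helper. Only the `total_cost` key is taken from the example above, and the other keys are assumptions.

```python
# Hypothetical prices in USD per 1,000 tokens; placeholders, not real rates.
EXAMPLE_PRICES = {
    ("example-provider", "example-model"): {"input": 0.001, "output": 0.002},
}


def calculate_cost_sketch(provider, model, input_tokens, output_tokens):
    """Return input, output, and total cost for one request."""
    price = EXAMPLE_PRICES[(provider, model)]
    input_cost = input_tokens / 1000 * price["input"]
    output_cost = output_tokens / 1000 * price["output"]
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


cost = calculate_cost_sketch(
    "example-provider", "example-model", input_tokens=100, output_tokens=50
)
print(f"Total cost: ${cost['total_cost']:.6f}")
```

Input and output tokens are priced separately because providers commonly charge different rates for each.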

## Streaming

InferenceKit provides comprehensive streaming support for real-time response generation with multiple providers.

### Basic Streaming

```python
import asyncio

from inferencekit import Model, ProviderType

# `config` is the ModelConfig built in Quick Start

async def main():
    model = Model(config)
    await model.initialize()

    try:
        # Stream text generation
        async for chunk in model.stream_generate(
            "This is a test of streaming functionality.",
            model="gpt-3.5-turbo",
            provider=ProviderType.OPENAI,
            max_tokens=100
        ):
            print(chunk)

    finally:
        await model.shutdown()

asyncio.run(main())
```

### Chat Streaming

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story about a dragon."},
]

async for chunk in model.stream_chat(
    messages,
    model="gpt-3.5-turbo",
    provider=ProviderType.OPENAI,
    max_tokens=200
):
    print(chunk)
```

### Progress Tracking

```python
async def progress_callback(progress: float):
    print(f"Progress: {progress * 100:.1f}%")

async for chunk in model.stream_generate(
    "This is a test with progress tracking.",
    model="gpt-3.5-turbo",
    provider=ProviderType.OPENAI,
    max_tokens=150,
    progress_callback=progress_callback
):
    print(chunk)
```

### Context Manager

```python
async with model.streaming_context(
    "This is a test using context manager.",
    model="gpt-3.5-turbo",
    provider=ProviderType.OPENAI,
    max_tokens=100
) as stream:
    async for chunk in stream:
        print(chunk)
```

### Synchronous Streaming

```python
sync_stream = model.stream_generate_sync(
    "This is a test of synchronous streaming.",
    model="gpt-3.5-turbo",
    provider=ProviderType.OPENAI,
    max_tokens=100
)

for chunk in sync_stream:
    print(chunk)
```

### Metrics and Monitoring

```python
# Get streaming statistics
stats = model.get_stream_stats()
print(f"Active streams: {stats['active_streams']}")

# Get metrics for a specific stream; `stream_id` must identify an existing stream
metrics = model.get_stream_metrics(stream_id)
if metrics:
    print(f"Total chunks: {metrics['total_chunks']}")
    print(f"Total bytes: {metrics['total_bytes']}")
    print(f"Duration: {metrics['duration_seconds']}s")
```
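
For intuition about where numbers like `total_chunks`, `total_bytes`, and `duration_seconds` can come from, here is a minimal sketch that wraps an async stream and counts as chunks pass through. `fake_stream` and `with_metrics` are hypothetical helpers, and InferenceKit may collect its metrics differently.

```python
import asyncio
import time


async def fake_stream():
    """Stand-in for a provider stream."""
    for chunk in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)
        yield chunk


async def with_metrics(stream, metrics: dict):
    """Yield chunks unchanged while updating counters in `metrics`."""
    started = time.monotonic()
    metrics.update(total_chunks=0, total_bytes=0, duration_seconds=0.0)
    async for chunk in stream:
        metrics["total_chunks"] += 1
        metrics["total_bytes"] += len(chunk.encode("utf-8"))
        metrics["duration_seconds"] = time.monotonic() - started
        yield chunk


async def demo():
    metrics = {}
    text = "".join([chunk async for chunk in with_metrics(fake_stream(), metrics)])
    return text, metrics


text, metrics = asyncio.run(demo())
print(text, metrics["total_chunks"], metrics["total_bytes"])
```

Because the wrapper is itself an async generator, the consumer code does not change; it just iterates over the wrapped stream.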

### Multiple Concurrent Streams

```python
async def run_stream(prompt: str, stream_id: int):
    print(f"Starting stream {stream_id}")
    try:
        async for chunk in model.stream_generate(
            prompt,
            model="gpt-3.5-turbo",
            provider=ProviderType.OPENAI,
            max_tokens=50
        ):
            print(f"Stream {stream_id}: {chunk}")
    except Exception as e:
        print(f"Stream {stream_id} failed: {e}")
    finally:
        print(f"Stream {stream_id} completed")

# Create multiple streams
prompts = [
    "First stream prompt",
    "Second stream prompt",
    "Third stream prompt",
    "Fourth stream prompt"
]

# Run all streams concurrently
await asyncio.gather(*[
    run_stream(prompt, i + 1) for i, prompt in enumerate(prompts)
])
```

### Error Handling

```python
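# Assumption: StreamingError is defined in inferencekit/exceptions.py
from inferencekit.exceptions import StreamingError

try: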
    async for chunk in model.stream_generate(
        "This should fail",
        model="invalid-model",
        provider=ProviderType.OPENAI,
        max_tokens=50
    ):
        pass
except StreamingError as e:
    print(f"Streaming error: {e}")
```

### Configuration

Streaming is configured through the `streaming_config` field of `ModelConfig`:

```python
config = ModelConfig(
    provider=ProviderType.OPENAI,
    model_name="gpt-3.5-turbo",
    provider_config=ProviderConfig(
        api_key="your-key",
        timeout=10.0,
        max_retries=3
    ),
    # Streaming configuration
    streaming_config={
        "max_concurrent_streams": 10,
        "default_timeout": 60.0,
        "buffer_size": 4096,
        "flow_control": True,
        "metrics": True
    }
)
```

### Supported Formats

Streaming supports multiple data formats:

- **Text**: Plain text streaming (default)
- **JSON**: JSON-formatted responses
- **Binary**: Binary data streaming
- **Mixed**: Automatic format detection
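
As a rough illustration of what automatic format detection can look like, the sketch below classifies a chunk by type and by whether it parses as JSON. `detect_format` is a hypothetical helper, and the rules InferenceKit actually applies are not documented here.

```python
import json


def detect_format(chunk) -> str:
    """Classify a chunk as 'binary', 'json', or 'text'."""
    if isinstance(chunk, (bytes, bytearray)):
        return "binary"
    stripped = chunk.strip()
    # Only attempt a JSON parse when the chunk looks like an object or array.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but is not; fall through to text
    return "text"


for sample in [b"\x00\x01", '{"token": "hi"}', "plain words", "{not json"]:
    print(repr(sample), "->", detect_format(sample))
```

Note that a JSON document split across several chunks would be classified as text here; a real detector would need to buffer partial input.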

### Flow Control

The streaming system includes built-in flow control to prevent overwhelming consumers:

- Automatic throttling based on chunk rate
- Configurable buffer sizes
- Pause/resume functionality
- Backpressure handling
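
Backpressure with a configurable buffer can be sketched with a bounded `asyncio.Queue`: the producer is suspended whenever the buffer is full, so a slow consumer is never flooded. This is a generic pattern, not InferenceKit's internal mechanism, and `producer`, `consumer`, and `demo` are hypothetical names.

```python
import asyncio


async def producer(queue: asyncio.Queue, chunks):
    for chunk in chunks:
        # put() waits while the queue is full; that wait is the backpressure.
        await queue.put(chunk)
    await queue.put(None)  # sentinel: no more chunks


async def consumer(queue: asyncio.Queue, received: list, peak: list):
    while True:
        peak[0] = max(peak[0], queue.qsize())  # track the fullest the buffer got
        chunk = await queue.get()
        if chunk is None:
            break
        await asyncio.sleep(0.001)  # simulate a slow consumer
        received.append(chunk)


async def demo(buffer_size=2):
    queue = asyncio.Queue(maxsize=buffer_size)
    received, peak = [], [0]
    await asyncio.gather(producer(queue, range(10)), consumer(queue, received, peak))
    return received, peak[0]


received, peak = asyncio.run(demo())
print(received, peak)
```

All ten chunks arrive in order, and the buffer never holds more than `buffer_size` items no matter how fast the producer is.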

### Provider Support

Currently supported providers with streaming:

- **OpenAI**: Full streaming support via OpenAI API
- **Anthropic**: Claude models streaming support
- **Local**: Simulated streaming for local models

Additional providers can be added by registering stream handlers with the StreamingManager.

## API Reference

### Model Class

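- `initialize()` - Set up the model before first use
- `shutdown()` - Release resources
- `generate(prompt, **kwargs)` - Generate text
- `chat(messages, **kwargs)` - Chat with the model
- `embed(text, **kwargs)` - Generate embeddings
- `batch_generate(prompts, **kwargs)` - Batch generation
- `stream_generate(prompt, **kwargs)` - Stream generated text asynchronously
- `stream_chat(messages, **kwargs)` - Stream a chat response asynchronously
- `stream_generate_sync(prompt, **kwargs)` - Stream generated text synchronously
- `streaming_context(prompt, **kwargs)` - Async context manager for a stream
- `get_stream_stats()` - Get overall streaming statistics
- `get_stream_metrics(stream_id)` - Get metrics for one stream
- `get_history(limit, offset)` - Get request history
- `search_history(query, limit)` - Search history
- `get_cache_stats()` - Get cache statistics
- `clear_cache()` - Clear the cache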

### Providers

- **OpenAI**: Full OpenAI API support
- **Anthropic**: Claude models support
- **Cohere**: Command models support
- **HuggingFace**: Transformers models support
- **Local**: Local model inference

### Similarity Functions

- `cosine_similarity(vec1, vec2)`
- `euclidean_distance(vec1, vec2)`
- `manhattan_distance(vec1, vec2)`
- `jaccard_similarity(set1, set2)`
- `hamming_distance(str1, str2)`
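
For reference, the standard definitions of these five measures fit in a few lines of pure Python. These are textbook versions written for illustration, not the code in `inferencekit/utils/similarity.py`. The edge-case choices (returning 0.0 for a zero vector, 1.0 for two empty sets, and raising on unequal lengths) are assumptions.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    return dot / norm if norm else 0.0


def euclidean_distance(vec1, vec2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2)))


def manhattan_distance(vec1, vec2):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(vec1, vec2))


def jaccard_similarity(set1, set2):
    """Size of the intersection divided by size of the union."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 1.0


def hamming_distance(str1, str2):
    """Number of positions at which two equal-length strings differ."""
    if len(str1) != len(str2):
        raise ValueError("hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))


print(euclidean_distance([0, 0], [3, 4]))
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
print(hamming_distance("karolin", "kathrin"))
```

Note the difference in direction: for the two similarity functions a higher value means a closer match, while for the three distance functions a lower value does.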

## Configuration

### Model Configuration

```python
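from inferencekit import ModelConfig, ProviderConfig, ProviderType
# Assumption: CacheConfig is defined alongside the other Pydantic types
from inferencekit.types import CacheConfig

config = ModelConfig(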
    provider=ProviderType.OPENAI,
    model_name="gpt-3.5-turbo",
    cache=CacheConfig(
        enabled=True,
        max_size=1000,
        similarity_threshold=0.8,
        ttl_seconds=3600
    ),
    provider_config=ProviderConfig(
        api_key="your-key",
        timeout=10.0,
        max_retries=3
    ),
    temperature=0.7,
    max_tokens=1000,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)
```

## Requirements

- Python 3.9 or newer (below 4.0)
- Poetry, only needed when installing from source or developing

## Dependencies

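- aiohttp (>=3.9.0,<4.0.0)
- asyncpg (>=0.29.0,<0.30.0)
- faiss-cpu (>=1.7.0,<2.0.0)
- fastapi (>=0.104.0,<0.105.0)
- numpy (>=1.24.0,<2.0.0)
- pydantic (>=2.5.0,<3.0.0)
- python-dotenv (>=1.0.0,<2.0.0)
- redis (>=5.0.0,<6.0.0)
- sentence-transformers (>=2.2.0,<3.0.0)
- uvicorn[standard] (>=0.24.0,<0.25.0)

Optional, installed through the matching extra:

- anthropic (>=1.0.0,<2.0.0)
- cohere (>=1.0.0,<2.0.0)
- huggingface-hub (>=1.0.0,<2.0.0)
- openai (>=1.0.0,<2.0.0)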

## Development

```bash
# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .
poetry run isort .

# Type checking
poetry run mypy .
```

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass
6. Submit a pull request

## Support

For issues and questions, please open an issue on the GitHub repository.

## Project Structure

```
inferencekit/
├── __init__.py              # Main package exports
├── core/
│   ├── cache.py             # Semantic cache implementation
│   └── model.py             # Main Model class
├── backends/
│   └── memory_backend.py    # In-memory backend
├── providers/
│   └── openai.py            # OpenAI provider
├── utils/
│   ├── similarity.py        # Vector similarity functions
│   └── costs.py             # Pricing and cost calculations
├── types.py                 # Pydantic type definitions
├── exceptions.py            # Custom exceptions
├── examples/
│   └── basic_usage.py       # Usage examples
└── tests/                   # Test files
```
