Metadata-Version: 2.4
Name: llm-talk
Version: 0.2.0
Summary: LLM-to-LLM interview framework for evaluating conversational capabilities
Project-URL: Homepage, https://github.com/am1t/llm-talk
Project-URL: Repository, https://github.com/am1t/llm-talk
Project-URL: Bug Tracker, https://github.com/am1t/llm-talk/issues
Author-email: Amit Gawande <reply@amitgawande.com>
License: MIT
License-File: LICENSE
Keywords: ai,benchmark,conversation,evaluation,interview,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: aisuite[anthropic,openai]>=0.1.11
Provides-Extra: dev
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: examples
Requires-Dist: python-dotenv>=1.0.0; extra == 'examples'
Provides-Extra: web
Requires-Dist: streamlit; extra == 'web'
Description-Content-Type: text/markdown

# llm-talk: LLM-to-LLM interview framework for exploring conversational behavior

[![PyPI](https://img.shields.io/pypi/v/llm-talk)](https://pypi.org/project/llm-talk/)
[![Tests](https://github.com/am1t/llm-talk/actions/workflows/test.yml/badge.svg)](https://github.com/am1t/llm-talk/actions/workflows/test.yml)

This started as a quick way to evaluate a new AI provider — and turned into something more interesting. Turns out LLMs have distinct interviewing personalities, struggle with conversational closure, and tend to converge on the same philosophical territory regardless of who's asking.

Full write-up, with example interview transcripts: [When LLMs Interview Each Other](https://dev.amitgawande.com/2026/when-llms-interview-each-other)

## Quick Start

```python
from llm_talk import Interview

result = Interview("openai", "claude").run()
result.save("output.md")
print(result.evaluation)
```

## Installation

```bash
pip install llm-talk
```

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```

## Usage

### Model Aliases

Use short names or full `provider:model` strings:

| Alias | Resolves to |
|-------|------------|
| `openai` | `openai:gpt-4o-mini` |
| `claude` | `anthropic:claude-sonnet-4-5` |
| `gpt4` | `openai:gpt-4o` |
| `sonnet` | `anthropic:claude-sonnet-4-5` |
| `opus` | `anthropic:claude-opus-4-5` |
| `gemini` | `google:gemini-2.0-flash` |
| `mistral` | `mistral:mistral-large-latest` |

```python
Interview("openai:gpt-4o", "anthropic:claude-opus-4-5").run()
```

### Custom Topics

```python
result = Interview(
    "openai", "claude",
    topics="Test their reasoning and code generation capabilities"
).run(turns=30)
```

### Topic Templates

```python
from llm_talk import Interview, TopicTemplate

result = Interview("openai", "claude", topics=TopicTemplate.TECHNICAL).run()
```

Available templates: `DEFAULT`, `TECHNICAL`, `CREATIVE`, `CULTURAL`.

### Control Turn Count

```python
result = Interview("openai", "claude").run(turns=100)
```

### Custom Evaluator

By default, Claude evaluates the conversation. You can change the evaluator model, dimensions, or prompt:

```python
from llm_talk import Interview, EvaluatorTemplate

# Use a different model as evaluator
result = Interview("claude", "openai", evaluator="gpt4").run()

# Focus evaluation on specific dimensions
result = Interview(
    "claude", "openai",
    evaluation_dimensions=EvaluatorTemplate.TECHNICAL_DIMENSIONS,
).run()

# Override the evaluator system prompt
result = Interview(
    "openai", "claude",
    evaluator_system_prompt="You are a terse, skeptical reviewer. One paragraph per section. No flattery.",
).run()

# Fully custom evaluator prompt
result = Interview(
    "openai", "claude",
    evaluator_user_prompt="Score the interviewee on depth, honesty, and clarity (1-10 each). Reply with a markdown table.",
).run()
```

### Access Results

```python
result = Interview("openai", "claude").run()

print(result.evaluation)        # Claude's evaluation
print(result.total_turns)       # Actual turns completed
print(result.loop_detected)     # Was a loop detected?
print(result.loop_reason)       # Why the loop was detected

result.save("interview.md")     # Save as markdown
data = result.to_dict()         # Export as dict (for JSON)
```

## How It Works

1. Two LLM agents are created — one as interviewer, one as interviewee
2. They have a natural conversation for the specified number of turns
3. Loop detection monitors for repetitive patterns (farewell loops, etc.)
4. Claude evaluates the conversation quality and generates a report
5. Results include the full conversation, evaluation, and metadata

## Examples

See the [examples/](examples/) directory for more usage patterns:

- [basic_interview.py](examples/basic_interview.py) — simplest usage
- [custom_topics.py](examples/custom_topics.py) — focused topic areas
- [custom_evaluator.py](examples/custom_evaluator.py) — evaluator model, dimensions, and prompts
- [batch_experiments.py](examples/batch_experiments.py) — run multiple interviews
- [compare_models.py](examples/compare_models.py) — compare different models

## Development

```bash
git clone https://github.com/am1t/llm-talk.git
cd llm-talk
pip install -e ".[dev]"
pytest
```

To run the example scripts, also install the examples extra:

```bash
pip install -e ".[examples]"
```

## License

MIT
