Metadata-Version: 2.1
Name: ragtime
Version: 0.0.2
Summary: Ragtime is an LLMOps framework to automatically evaluate Retrieval Augmented Generation (RAG) systems and compare different RAGs / LLMs
Project-URL: Homepage, https://github.com/recitalAI/ragtime
Project-URL: Issues, https://github.com/recitalAI/ragtime/issues
Author-email: Gilles Moyse <gilles@recital.ai>
Maintainer-email: reciTAL <ragtime@recital.ai>
License: MIT License
        
        Copyright (c) 2024 reciTAL
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: Evaluation,LLM,RAG
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# Presentation
**Ragtime** is an LLMOps framework which allows you to automatically:
1. evaluate a Retrieval Augmented Generation (RAG) system
2. compare different RAGs / LLMs
3. generate Facts to allow automatic evaluation

In Ragtime, a *RAG* is made of an optional *Retriever* and a mandatory *Large Language Model* (*LLM*).
- A *Retriever* takes a *question* as input and returns one or several *chunks*, i.e. paragraphs used to build an answer.
- An *LLM* is a text-to-text generator that takes a *prompt* as input, made of a question and optional chunks, and returns an *LLMAnswer*.

You can specify how *prompts* are generated and how the *LLMAnswer* has to be post-processed (if needed).

# How does it work?
The main idea in Ragtime is to evaluate answers returned by a RAG based on **Facts** that you define. Indeed, it is very difficult to evaluate RAGs and/or LLMs because you cannot define a single "good" answer. The system can return a correct answer formulated in many different ways, so you cannot directly compare strings to determine whether the response is good or not. And counting the number of common words, as ROUGE does for instance, is a poor proxy.

In Ragtime, answers returned by a RAG or an LLM are evaluated against a set of facts. If the answer validates all the facts, then the answer is deemed correct. Conversely, if some facts are not validated, the answer is considered wrong. The score is the number of validated facts divided by the total number of facts to validate.
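
As a minimal sketch of this scoring rule (not Ragtime's actual implementation; `fact_score` and its list-of-booleans input are hypothetical names used for illustration only), the score can be computed as the share of facts that an answer validates:

```python
from typing import List


def fact_score(validated: List[bool]) -> float:
    """Return the share of validated facts, between 0 and 1.

    `validated` holds one boolean per fact: True if the answer validates it.
    Hypothetical helper for illustration, not part of Ragtime's API.
    """
    if not validated:
        raise ValueError("at least one fact is required to compute a score")
    # True counts as 1 and False as 0, so sum() counts the validated facts
    return sum(validated) / len(validated)


# An answer validating 3 facts out of 4
print(fact_score([True, True, False, True]))
```

With this rule, an answer is deemed correct only when the score reaches 1.0, i.e. when every fact is validated.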

You can either define facts manually, or have an LLM define them for you. The evaluation of the answers is also performed by an LLM, so that **the evaluation of your questions is done automatically!**

Let's now see how to put Ragtime in action!

# Main objects
The main objects used in Ragtime are:
- `AnswerGenerator`: generate `Answer`s with 1 or several `LLM`s. Each `LLM` uses a `Prompter` to build the prompt it is fed with. A `post_processing` method can also be overridden in the `AnswerGenerator` to perform specific processing on the text returned by the `LLM`
- `FactGenerator`: generate `Facts` from the answers with a human validation equal to 1. `FactGenerator` also uses an `LLM` to generate the facts
- `EvalGenerator`: generate `Eval`s based on `Answer`s and `Facts`. `EvalGenerator` also uses an `LLM` to perform the evaluations.
- `LLM`: generate text included in `LLMAnswer` objects, themselves added in Answer, Facts, or Eval objects
- `LLMAnswer`: answer returned by an LLM. Contains a `text` field, returned by the LLM, plus a `cost`, a `duration`, a `timestamp` and a `prompt` field, being the prompt used to generate the answer
- `Prompter`: a prompter is used to generate a prompt for an LLM and to post-process the text returned by the LLM
- `Expe`: an experiment object, containing a list of question/answer `QA` objects
- `QA`: a row in an Expe. Each row contains a `Question` and, optionally, `Facts`, `Chunks` and `Answers`.
- `Question`: contains a `text` field for the question's text. Can also contain a `meta` dictionary
- `Facts`: a list of `Fact`, with a `text` field being the fact in itself and an `LLMAnswer` object if the fact has been generated by an LLM
- `Chunks`: a list of `Chunk` containing the `text` of the chunk and optionally a `meta` dictionary with extra data associated with the retriever
- `Answers`: the answer to the question is in the `text` field plus an `LLMAnswer` containing all the data related to the answer generation, plus an `Eval` object related to the evaluation of the answer
- `Eval`: contains a `human` field to store the human evaluation of the answer as well as an `auto` field when the evaluation is done automatically. In this case, it also contains an `LLMAnswer` object related to the automatic evaluation
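
To visualise how these objects nest, here is a simplified sketch using plain dataclasses. It only mirrors the fields listed above; Ragtime's real classes may be built differently and carry more attributes, and representing the prompt as a plain string is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LLMAnswer:
    text: str
    prompt: Optional[str] = None       # simplified: prompt kept as a plain string
    cost: Optional[float] = None
    duration: Optional[float] = None
    timestamp: Optional[str] = None


@dataclass
class Question:
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Fact:
    text: str
    llm_answer: Optional[LLMAnswer] = None  # set when the fact comes from an LLM


@dataclass
class Chunk:
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Eval:
    human: Optional[float] = None
    auto: Optional[float] = None
    llm_answer: Optional[LLMAnswer] = None  # set for automatic evaluations


@dataclass
class Answer:
    text: str
    llm_answer: Optional[LLMAnswer] = None
    eval: Eval = field(default_factory=Eval)


@dataclass
class QA:
    """One row of an experiment: a question plus optional facts, chunks and answers."""
    question: Question
    facts: List[Fact] = field(default_factory=list)
    chunks: List[Chunk] = field(default_factory=list)
    answers: List[Answer] = field(default_factory=list)


qa = QA(question=Question(text="How old are you?"))
qa.answers.append(Answer(text="I am a language model.", eval=Eval(human=1.0)))
print(qa.answers[0].eval.human)
```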

# Use case 1: test an internal LLM with no Retriever
In this use case, you want to test an LLM with no Retriever, i.e. the questions are sent to the LLM without retrieving chunks for each question.

Here are the steps associated with this use case:
1. Collect the questions
2. Define your `LLM`
3. Generate the answers
4. Annotate the correct ones
5. Generate facts
6. Run evaluations

## Collect the questions
First, get a list of questions and put them in a JSON file with the following format:
```json
[
    {
        "question": {
            "text": "What is the meaning of all this?",
            "meta": {
                "source": ["Wikipedia", "Wolfram"],
                "filter": null
            }
        }
    },
    {
        "question": {
            "text": "How old are you?",
            "meta": {
                "source": ["Google"],
                "filter": ".com"
            }
        }
    },
    ...
]
```

The `meta` dictionary is optional. You can define any keys inside, to be used in your `AnswerGenerator`. If you don't need metadata for your system, don't add this field. The only required field is `text`, which is the text of your question.

You can use the `first_questions.xlsx` file in `res/spreadsheets` to help you generate the JSON. Just fill in your questions in column A and copy / paste
the text generated in column B into the JSON file. Just don't forget to remove the last comma and to wrap everything in a list (character `[` before and `]` after).
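
If you prefer to build the file from Python rather than from the spreadsheet, the sketch below produces the same layout with the standard `json` module. The helper name `build_question_rows` is hypothetical and not part of Ragtime; only the `question` / `text` / `meta` structure comes from the format shown above.

```python
import json
from typing import Any, Dict, List, Optional


def build_question_rows(texts: List[str],
                        meta: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Wrap raw question strings in the JSON layout shown above.

    Hypothetical helper, not part of Ragtime. `meta` is optional and,
    when given, is copied into every question.
    """
    rows = []
    for text in texts:
        question: Dict[str, Any] = {"text": text}
        if meta is not None:
            question["meta"] = dict(meta)
        rows.append({"question": question})
    return rows


rows = build_question_rows(["What is the meaning of all this?", "How old are you?"])
# json.dumps writes Python's None as null and never adds a trailing comma
print(json.dumps(rows, indent=4, ensure_ascii=False))
```

The printed text can be saved as is to the question file passed to `Expe`.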

You can now create an `Expe` object, defining an experiment, with:
```python
expe:Expe = Expe(json_path=path_to_the_question_file)
```
## Define your `LLM`
To define your LLM, you only need to create a class overriding the `complete` method:
```python
class MyLLM(LLM):
    def complete(self, prompt:Prompt) -> LLMAnswer:
        api_result = my_API.complete(user=prompt.user, system=prompt.system)
        
        result: LLMAnswer = LLMAnswer(prompt=prompt,
                                    text=api_result.text,
                                    name=api_result.name, #name of the model, e.g. 'gpt-4'
                                    full_name=api_result.full_name, #full name, e.g. 'gpt-4-0613'
                                    duration=api_result.dur, # duration in seconds
                                    cost=api_result.cost #cost in USD
                                    )
        return result
```
Fields `cost`, `full_name` and `duration` are optional, so you don't have to provide a value if your LLM does not return them.

## Generate the answers
You can now use your LLM to generate answers. At first, you will use a simple prompter, which sends the question as is to the LLM and does not do any post-processing. When using chunks, you have to convert them to a prompt. And if the prompt you use asks the LLM to generate a specific structure, like a JSON, you need to do some post-processing.
But for simple and straightforward usage, you can use the `SimpleAnsPptr`. Its name means it is a simple prompter used to generate `Answers`, as opposed to other prompters used to generate `Facts` or `Eval` objects.

So generating answers is done with:
```python
filename:str = name_of_the_JSON_file_containing_the_questions
path:Path = Path(folder_to_the_JSON_file)
expe:Expe = Expe(json_path=path / filename)
my_llm:LLM = MyLLM(prompter=SimpleAnsPptr()) # LLMs have to be instantiated with a Prompter
ans_gen:AnsGenerator = AnsGenerator(retriever=None, llm=my_llm)
ans_gen.generate(expe)
expe.save_to_json(path=path / filename)
expe.save_to_spreadsheet(path=path / filename)
```

Once this piece of code has been executed, the expe file augmented with answers is saved with a new timestamp in the original folder. A spreadsheet with the results is also generated.

## Generate facts
To enable automatic generation of facts, the correct answer must first be marked. *A correct answer is both right and exhaustive*.
To mark the correct answers, just open the JSON file with the answers and add an `eval` field with a `human` field equal to 1.0, e.g.:
```json
[
    {
        "question": {
            "text": "question1"
        },
        "answers": {
            "items": [
                {
                    "text": "answer1",
                    "llm_answer": {"name": "my_llm"},
                    "eval": {
                        "human": 1.0
                    }
                }
            ]
        }
    }
]
```

Once the correct answers have been marked in the expe file, the facts generation can be started.
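
When there are many answers, editing the file by hand can be tedious. Below is a minimal sketch that marks answers on the loaded JSON content instead. It assumes the `question` / `answers` / `items` / `llm_answer` / `eval` layout shown above; `mark_correct` is a hypothetical helper and not part of Ragtime.

```python
from typing import Any, Dict, List


def mark_correct(rows: List[Dict[str, Any]], question_text: str, llm_name: str) -> int:
    """Set eval.human to 1.0 on the answers of `llm_name` to `question_text`.

    Works on the parsed JSON (a list of dicts) and returns the number of
    answers marked. Hypothetical helper, not part of Ragtime.
    """
    marked = 0
    for row in rows:
        if row["question"]["text"] != question_text:
            continue
        for answer in row.get("answers", {}).get("items", []):
            if answer.get("llm_answer", {}).get("name") != llm_name:
                continue
            # Create the eval dictionary when it is absent or empty
            if not isinstance(answer.get("eval"), dict):
                answer["eval"] = {}
            answer["eval"]["human"] = 1.0
            marked += 1
    return marked


rows = [{
    "question": {"text": "question1"},
    "answers": {"items": [{"text": "answer1", "llm_answer": {"name": "my_llm"}}]},
}]
print(mark_correct(rows, "question1", "my_llm"))
```

In practice, `rows` would come from `json.load` on the expe file and be written back with `json.dump` once the answers are marked.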

To do so, you can use GPT-4 as the LLM and the `SimpleFactsPptrFR` as the prompter (in French in this case):
```python
filename:str = name_of_the_JSON_file_with_questions_answers_and_human_evals
path:Path = Path(folder_to_the_JSON_file)
expe:Expe = Expe(json_path=path / filename)
facts_gen:FactsGenerator = FactsGenerator(llm_names=["gpt-4"], prompter=SimpleFactsPptrFR())
facts_gen.generate(expe)
expe.save_to_json(path=path / filename)
expe.save_to_spreadsheet(path=path / filename)
```

Now you have facts associated with your questions!

## Run evaluations
Now if you want to compare your LLM with GPT-4 and Claude for instance, you just have to provide your keys in the `keys.py` file. Just use the `keys.example.py` file and rename it to `keys.py` with your API keys.

First, generate answers with the 2 LLMs:
```python
filename:str = name_of_the_JSON_file_containing_questions_facts_and_the_answers_from_my_llm
path:Path = Path(folder_to_the_JSON_file)
expe:Expe = Expe(json_path=path / filename)
llm_names:list[str] = ["gpt-4", "claude-2.1"]
ans_gen:AnsGenerator = AnsGenerator(retriever=None, llm_names=llm_names, prompter=SimpleAnsPptr())
ans_gen.generate(expe, b_missing_only=True) # reuse the answers already generated with MyLLM
expe.save_to_json(path=path / filename)
expe.save_to_spreadsheet(path=path / filename)
```

The new expe file now contains the result of your LLM, the facts, plus the results obtained with GPT-4 and Claude 2.1.

The evaluation (using GPT-4) can then be launched with:
```python
filename:str = name_of_the_JSON_file_with_questions_facts_and_answers
path:Path = Path(folder_to_the_JSON_file)
expe:Expe = Expe(json_path=path / filename)
eval_gen:EvalGenerator = EvalGenerator(llm_names=["gpt-4"], prompter=SimpleEvalPptrFR())
eval_gen.generate(expe)
expe.save_to_json(path=path / filename)
expe.save_to_spreadsheet(path=path / filename)
```

The results can be viewed in the spreadsheet, in the "Stats" tab.

# Analyse experiments
Three kinds of files can be generated from an experiment (`Expe` object):
1. JSON: used to store an experiment with all its attributes
2. HTML: used to analyse an experiment per question
3. Spreadsheet (xlsx): used to analyse an experiment globally with stats, and also to write results and human analysis per question

# MISC
## Start a new project
Copy the `base_folder` from the main `ragtime` folder to the `user` folder and rename it according to your project.
In your new folder:
- review the default values in the `config.py` file to specify where the files are located
- add your API keys in `keys.py`


In the `user` folder, create a folder for your project and create the following subfolders within:
- `logs` - copy the file `ragtime_logging.json` inside
- `expe` - within, add `Questions`, `Answers`...
In your user folder, create a `keys.py` file from the template `keys.example.py` and add the API keys you plan to use.
In your main file, don't forget to `import keys`.

## Merge two Expes
Suppose you have run an Expe with 50 questions over 5 models but you want to add a new meta to your Questions which has no impact on the answers. You could run the experiment again, but that would be useless and costly.
In this case you would rather merge the experiment having the answers with the one having the metadata, like this:
```python
expeQuestWithMeta:Expe = Expe(json_path=FOLDER_QUESTIONS / "expe_having_questions_with_meta.json")
expeWithAns:Expe = Expe(json_path=FOLDER_ANSWERS / "expe_having_Answers.json")

for qaQuestWithMeta, qaWithAns in zip(expeQuestWithMeta, expeWithAns):
    assert qaQuestWithMeta.question.text == qaWithAns.question.text # just to make sure we don't mix up things
    qaQuestWithMeta.answers = qaWithAns.answers # copy the answers from the expe having them to the expe with the questions having meta data

expeQuestWithMeta.save_to_json(path=FOLDER_QUESTIONS / "expe_having_both.json")
```
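
Note that Python's `zip` pairs rows by position and stops at the shorter input, so the loop above assumes both files list the same questions in the same order. If that is not guaranteed, a merge keyed on the question text is safer. The sketch below works on the parsed JSON content (lists of dicts following the layout used in this document) rather than on `Expe` objects, and `merge_answers` is a hypothetical helper, not part of Ragtime.

```python
from typing import Any, Dict, List


def merge_answers(rows_with_meta: List[Dict[str, Any]],
                  rows_with_answers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy the answers onto the rows carrying the metadata, matching on question text.

    Hypothetical helper working on parsed JSON, not part of Ragtime.
    """
    answers_by_text = {row["question"]["text"]: row.get("answers")
                       for row in rows_with_answers}
    merged = []
    for row in rows_with_meta:
        text = row["question"]["text"]
        if text not in answers_by_text:
            raise KeyError(f"no answers found for question: {text!r}")
        new_row = dict(row)  # shallow copy, the input rows are left untouched
        new_row["answers"] = answers_by_text[text]
        merged.append(new_row)
    return merged


with_meta = [{"question": {"text": "q1", "meta": {"team": "D"}}},
             {"question": {"text": "q2", "meta": {"team": "D"}}}]
# Same questions, but stored in a different order
with_answers = [{"question": {"text": "q2"}, "answers": {"items": [{"text": "a2"}]}},
                {"question": {"text": "q1"}, "answers": {"items": [{"text": "a1"}]}}]
merged = merge_answers(with_meta, with_answers)
print([row["answers"]["items"][0]["text"] for row in merged])
```
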
## Rerun an Expe on post-processing only
In this case just run the `AnswerGenerator` with the parameter `start_from` set as in:
```python
generators.gen_Answers(folder_in=FOLDER_QUESTIONS, folder_out=FOLDER_ANSWERS,
                         json_file="questions.json",
                         prompter=MyAnsPptr(),
                         llm_names=["gpt-4", "gemini-pro", "mistral/mistral-large-latest"],
                         start_from=StartFrom.post_process)
```

## Restart an Expe when it has failed
An Expe may fail for some reason. In this case, if an exception is raised, it is caught and a file called "Stopped at ..." is created in the same
folder as the original Expe JSON file.
You can restart the experiment from the "Stopped at..." JSON file. To avoid redoing everything, you can use the
`b_missing_only=True` parameter so that the Answers which have already been generated are not generated again:
```python
question_file:str = "Stopped_at_94_of_100_2024-mars--100Q_465C_0F_5M_464A_0HE_0AE_2024-03-01_22,34,53.json"
expe:Expe = Expe(json_path=FOLDER_QUESTIONS / question_file)
prompter:Prompter = PptrRichAnsFR()
llm_names:list[str] = ["gpt-4", "gpt-3.5-turbo", "gemini-pro", "claude-2.1", "mistral/mistral-large-latest"]
ans_gen:AnsGenerator = AnsGenerator(retriever=LSA_Retriever(), llm_names=llm_names, prompter=prompter)
ans_gen.generate(expe, b_missing_only=True)
```
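
To illustrate the idea behind `b_missing_only=True`, the sketch below selects, for one question, the LLMs that still have to be called. How Ragtime actually detects a missing answer is not described here, so treating an answer with an empty `text` as missing is an assumption, and `llms_to_run` is a hypothetical helper.

```python
from typing import Any, Dict, List


def llms_to_run(existing_answers: List[Dict[str, Any]], llm_names: List[str]) -> List[str]:
    """Return the LLM names that have no usable answer yet for a question.

    Assumption: an answer counts as present only if its text is non-empty.
    Hypothetical helper, not part of Ragtime.
    """
    done = {answer.get("llm_answer", {}).get("name")
            for answer in existing_answers
            if answer.get("text")}
    return [name for name in llm_names if name not in done]


existing = [
    {"text": "42", "llm_answer": {"name": "gpt-4"}},
    {"text": "", "llm_answer": {"name": "claude-2.1"}},  # the call failed, nothing stored
]
print(llms_to_run(existing, ["gpt-4", "claude-2.1", "gemini-pro"]))
```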

## Change the log file location
By default, logs are added to the `logs.txt` file in the folder `logs`. Along with this file is `ragtime_logging.json` which describes the logging configuration. It follows the format from the standard Python [`logging` module](https://docs.python.org/3/library/logging.config.html).

If you want to change the location of the log file, just edit the key `filename` under `handlers`/`file` in `ragtime_logging.json`.
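
The same change can be applied from Python before the logging configuration is loaded. In the sketch below, `EXAMPLE_CONFIG` is a stand-in and not the actual content of `ragtime_logging.json`; only the `handlers`/`file`/`filename` path comes from the description above, and `with_log_file` is a hypothetical helper.

```python
import copy
from pathlib import Path
from typing import Any, Dict

# Stand-in configuration following the standard dictConfig schema.
# This is NOT the real content of ragtime_logging.json.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "version": 1,
    "handlers": {
        "file": {"class": "logging.FileHandler", "filename": "logs/logs.txt"},
    },
    "root": {"level": "INFO", "handlers": ["file"]},
}


def with_log_file(config: Dict[str, Any], log_file: Path) -> Dict[str, Any]:
    """Return a copy of `config` whose file handler writes to `log_file`."""
    updated = copy.deepcopy(config)  # leave the original configuration untouched
    updated["handlers"]["file"]["filename"] = str(log_file)
    return updated


new_config = with_log_file(EXAMPLE_CONFIG, Path("my_project") / "logs" / "run.txt")
print(new_config["handlers"]["file"]["filename"])
```

The returned dictionary can then be passed to `logging.config.dictConfig`, provided the target folder exists.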

If you want to change the location of `ragtime_logging.json`, modify `LOG_CONFIG:Path` in `ragtime.py`.


# BELOW IS TO BE SORTED - IGNORE AT THE MOMENT
## Fact generation
The goal of this first step is to enhance an initial list of questions with facts. Facts make it possible to run the questions again and automatically validate the answers returned by the RAG.

More precisely, when an answer is obtained from the RAG, the facts associated with the question are all checked against the answer: if all the facts are validated, the answer is deemed correct, otherwise not.

The evaluation of the facts against an answer is also done using an LLM so the whole process is automatic once the questions / facts list is defined.

This is why the fact generation step comes first in the process. Each step of the process is detailed below.





You can then load your questions with:
```
questions:Questions = Questions(metadata_map = {"team":"D"})
questions.from_Excel(Path('question_file.xlsx'))
```

`questions` now contains a list of `Question` objects.

You can change this default layout by creating a new `XL_map` and associating it with the `Questions` object. Below is an example with *Questions* in column B and *Facts* in column A:
```
my_map:XL_map = XL_map()
my_map.text = "B"
my_map.facts = "A"
questions.xl_map = my_map
```

### Generating the answers
You can now evaluate the questions with your RAG. You can specify one or several LLMs to generate the answers.

To do so, you must first define your RAG. A RAG is made of:

- a `Retriever`, retrieving `Chunks` based on the question and its metadata (if any)
- a `_get_user` method, converting the chunks into a prompt
- a `_get_system` method, returning the system prompt to be used
- one or several LLMs, generating an answer based on the user and system prompts
- a `_post_process` method, to perform extra processing of the answers returned by the LLMs, useful for instance when the answers are in JSON format

Creating your RAG is defined in [Create a custom RAG](#create-a-custom-rag).

Once your RAG is defined, you can use it to evaluate your questions with LLMs GPT-4, Gemini-Pro and Mistral 7B for instance:
```
llms=["gpt-4", "gemini-pro", "huggingface/mistralai/Mistral-7B-Instruct-v0.1"]
my_rag:my_RAG = my_RAG(retriever=my_Retriever(), llms=llms)
answers:list[Answers] = my_rag.gen_answers(questions)
answers.to_Excel(Path('answer_file.xlsx'))
```
You can pick any of the LLMs listed in the [litellm providers list](https://litellm.vercel.app/docs/providers).

The process can take some time depending on the number of questions and the number of LLMs you're using. For each question:

1. The chunks are retrieved using the object `my_Retriever` you have defined
2. They are converted into prompts using the `_get_user` and `_get_system` methods you have defined in `my_RAG`
3. The prompts are sent to each LLM you have specified
4. Each answer is post-processed using the `_post_process` method defined in `my_RAG`

For each question, an `Answer` object is returned with one `LLMAnswer` for each LLM. Hence the final result is `Answers`, one `Answer` per question.

The result is finally stored into an Excel file.
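
The four steps above can be sketched as a single function. This is an illustration of the control flow only: `answer_question` and the toy callables passed to it are hypothetical stand-ins for the retriever, the prompt methods, the LLMs and the post-processing, not Ragtime's API.

```python
from typing import Callable, Dict, List


def answer_question(question: str,
                    retrieve: Callable[[str], List[str]],
                    build_prompt: Callable[[str, List[str]], str],
                    llms: Dict[str, Callable[[str], str]],
                    post_process: Callable[[str], str]) -> Dict[str, str]:
    """Run the retrieval, prompting, generation and post-processing steps for one question."""
    chunks = retrieve(question)              # 1. retrieve the chunks
    prompt = build_prompt(question, chunks)  # 2. convert question and chunks into a prompt
    answers: Dict[str, str] = {}
    for name, complete in llms.items():
        raw_text = complete(prompt)          # 3. send the prompt to each LLM
        answers[name] = post_process(raw_text)  # 4. post-process each answer
    return answers


# Toy stand-ins so that the sketch runs without any API key
result = answer_question(
    question="What is the answer?",
    retrieve=lambda q: ["The answer is 42."],
    build_prompt=lambda q, chunks: "\n".join(chunks) + "\nQuestion: " + q,
    llms={"fake-llm": lambda prompt: "  42  "},
    post_process=str.strip,
)
print(result)
```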

### Evaluating the answers
To do so, open the Excel answers file and evaluate the correct answers. When an answer is correct, just put 1 in the "Hum eval" column. If it is not, you can either keep it empty or put 0.
Once you're done, just save the file.

### Generating the facts
Based on the answers you've just validated, you can use another LLM to validate a set of facts describing these answers.
# HERE

# CLASSES
## TextGenerator
The `TextGenerator` class is the parent of all the generation classes `AnswerGenerator`, `FactGenerator` and `EvalGenerator`.

It exposes a `generate(expe)` method allowing each of the child classes to generate an answer, a set of facts or an evaluation. This method calls a `gen_for_qa` method for each question in the given `Expe` object.

The `gen_for_qa` method executes the same categories of actions, each depending on the type of object to generate:

## AnswerGenerator
A standard answer generation process goes through the following steps:
1. Chunk retrieval
2. LLM generation - for each LLM:
    
    2.a. Prompt generation
    
    2.b. Text generation
    
    2.c. Post-processing

The process is executed when calling the `generate` method of an `AnswerGenerator` object.

### 1. Chunk retrieval
This step is optional since the question can be sent directly to an LLM. This step is executed only if a `Retriever` object is given to the `AnswerGenerator` at creation time, e.g.
```
ans_gen:AnswerGenerator = AnswerGenerator(retriever=MyRetriever(), llms=[...])
```
### 2. LLM generation per LLM : prompt, generation and post-processing
Once the optional step of Chunks retrieval is complete, the Question and the Chunks are used to run the actual answer generation loop, for each LLM given to the `AnswerGenerator` at creation time. The answer generation loop is made of the following 3 steps:

#### 2.a. Prompt generation
Prompt generation is done with `Prompter` objects. Each `Prompter` object returns a prompt for Answer, Fact and Eval generation. A `Prompter` is always attached to an `LLM`, so that prompting can be done per LLM.
To generate an answer, the method `Prompter.get_answer_prompt` is called. It takes a `Question` and an optional `Chunks` and returns a `Prompt`.

#### 2.b. Text generation
The LLM generation is at the heart of the Answer generation process. It takes a `Prompt` as input and returns an `LLMAnswer`. The `LLMAnswer` contains the `text` returned by the LLM, the `timestamp` corresponding to the time the LLM was called, but also the `duration`, the `full_name` of the LLM and the `cost` if provided.

#### 2.c. Post-processing
In some cases you need to extract information from the LLM's answer, for instance when you ask the LLM to return a JSON structure. This can be done with post-processing.
To do so, you can override the `post_process` method defined in `TextGenerator` and its descendants to process the `LLMAnswer` returned.
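
As an example of what such post-processing might do, the sketch below extracts a JSON object from the raw text returned by an LLM, which often surrounds the structure with extra sentences. `extract_json` is a hypothetical helper that could be called from an overridden `post_process`; it is not part of Ragtime, and the exact signature of `post_process` is not assumed here.

```python
import json
from typing import Any, Optional


def extract_json(text: str) -> Optional[Any]:
    """Return the JSON object embedded in `text`, or None if none can be parsed.

    Takes everything between the first "{" and the last "}", which is a
    simple heuristic and not a full parser.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


raw = 'Here is the result: {"answer": "42", "sources": ["doc1"]} Hope it helps!'
print(extract_json(raw))
```

Returning `None` instead of raising lets the caller decide whether to keep the raw text or to skip the answer.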

### Generating answers
Since answer generation can be a long, costly and faulty process, several options are proposed to restart without having to do all the computations again if the process is interrupted.

#### 1. Restart computations for empty values only
Sometimes LLMs do not return an answer when called through API. Instead of stopping the whole experiment when it happens, the question is simply skipped for this LLM. In order to have complete files, the Expe can be run again only on missing values.

#### 2. Restart computations from a given step
Sometimes you want to test only a part of your processing chain, post-processing for instance, or you want to replay what had been done in a previous expe file. In this case, you can run the Expe again from a given step in the chain. The different steps are all stored in the `StartFrom` enum.

#### 3. Restart on a sub-list of LLMs only
Sometimes one or several LLMs were not functioning during an experiment, but others went ok. In this case you can run the experiment again using a subset of LLMs only.