Metadata-Version: 2.2
Name: nvidia_lm_eval
Version: 25.2
Summary: A framework for evaluating language models - packaged by NVIDIA
License: MIT
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: accelerate>=0.26.0
Requires-Dist: evaluate
Requires-Dist: datasets>=2.16.0
Requires-Dist: evaluate>=0.4.0
Requires-Dist: jsonlines
Requires-Dist: numexpr
Requires-Dist: peft>=0.2.0
Requires-Dist: pybind11>=2.6.2
Requires-Dist: pytablewriter
Requires-Dist: rouge-score>=0.0.4
Requires-Dist: sacrebleu>=1.5.0
Requires-Dist: scikit-learn>=0.24.1
Requires-Dist: sqlitedict
Requires-Dist: torch>=1.8
Requires-Dist: tqdm-multiprocess
Requires-Dist: transformers>=4.1
Requires-Dist: zstandard
Requires-Dist: dill
Requires-Dist: word2number
Requires-Dist: more_itertools
Requires-Dist: pandas==2.0.0
Requires-Dist: numpy<2.0.0
Requires-Dist: httpx==0.27.0
Requires-Dist: langdetect==1.0.9
Requires-Dist: immutabledict==4.2.0
Requires-Dist: nltk>=3.9.1
Requires-Dist: requests
Requires-Dist: aiohttp
Requires-Dist: tenacity
Requires-Dist: tqdm
Requires-Dist: tiktoken
Requires-Dist: sentencepiece>=0.1.98
Requires-Dist: beautifulsoup4==4.13.1
Requires-Dist: openai==1.61.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: deepsparse
Requires-Dist: deepsparse-nightly[llm]>=1.8.0.20240404; extra == "deepsparse"
Provides-Extra: gptq
Requires-Dist: auto-gptq[triton]>=0.6.0; extra == "gptq"
Provides-Extra: hf-transfer
Requires-Dist: hf_transfer; extra == "hf-transfer"
Provides-Extra: ibm-watsonx-ai
Requires-Dist: ibm_watsonx_ai>=1.1.22; extra == "ibm-watsonx-ai"
Provides-Extra: ifeval
Requires-Dist: langdetect; extra == "ifeval"
Requires-Dist: immutabledict; extra == "ifeval"
Requires-Dist: nltk>=3.9.1; extra == "ifeval"
Provides-Extra: neuronx
Requires-Dist: optimum[neuronx]; extra == "neuronx"
Provides-Extra: mamba
Requires-Dist: mamba_ssm; extra == "mamba"
Requires-Dist: causal-conv1d==1.0.2; extra == "mamba"
Provides-Extra: math
Requires-Dist: sympy>=1.12; extra == "math"
Requires-Dist: antlr4-python3-runtime==4.11; extra == "math"
Provides-Extra: multilingual
Requires-Dist: nagisa>=0.2.7; extra == "multilingual"
Requires-Dist: jieba>=0.42.1; extra == "multilingual"
Requires-Dist: pycountry; extra == "multilingual"
Provides-Extra: optimum
Requires-Dist: optimum[openvino]; extra == "optimum"
Provides-Extra: promptsource
Requires-Dist: promptsource>=0.2.3; extra == "promptsource"
Provides-Extra: sparseml
Requires-Dist: sparseml-nightly[llm]>=1.8.0.20240404; extra == "sparseml"
Provides-Extra: testing
Requires-Dist: pytest; extra == "testing"
Requires-Dist: pytest-cov; extra == "testing"
Requires-Dist: pytest-xdist; extra == "testing"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4.2; extra == "vllm"
Provides-Extra: zeno
Requires-Dist: pandas; extra == "zeno"
Requires-Dist: zeno-client; extra == "zeno"
Provides-Extra: wandb
Requires-Dist: wandb>=0.16.3; extra == "wandb"
Requires-Dist: pandas; extra == "wandb"
Requires-Dist: numpy; extra == "wandb"
Provides-Extra: gptqmodel
Requires-Dist: gptqmodel>=1.0.9; extra == "gptqmodel"
Provides-Extra: japanese-leaderboard
Requires-Dist: emoji==2.14.0; extra == "japanese-leaderboard"
Requires-Dist: neologdn==0.5.3; extra == "japanese-leaderboard"
Requires-Dist: fugashi[unidic-lite]; extra == "japanese-leaderboard"
Requires-Dist: rouge_score>=0.1.2; extra == "japanese-leaderboard"
Provides-Extra: all
Requires-Dist: lm_eval[anthropic]; extra == "all"
Requires-Dist: lm_eval[dev]; extra == "all"
Requires-Dist: lm_eval[deepsparse]; extra == "all"
Requires-Dist: lm_eval[gptq]; extra == "all"
Requires-Dist: lm_eval[hf_transfer]; extra == "all"
Requires-Dist: lm_eval[ibm_watsonx_ai]; extra == "all"
Requires-Dist: lm_eval[ifeval]; extra == "all"
Requires-Dist: lm_eval[mamba]; extra == "all"
Requires-Dist: lm_eval[math]; extra == "all"
Requires-Dist: lm_eval[multilingual]; extra == "all"
Requires-Dist: lm_eval[promptsource]; extra == "all"
Requires-Dist: lm_eval[sparseml]; extra == "all"
Requires-Dist: lm_eval[testing]; extra == "all"
Requires-Dist: lm_eval[vllm]; extra == "all"
Requires-Dist: lm_eval[zeno]; extra == "all"
Requires-Dist: lm_eval[wandb]; extra == "all"
Requires-Dist: lm_eval[japanese_leaderboard]; extra == "all"

# NVIDIA Evals Factory

The goal of NVIDIA Evals Factory is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.

# Quick start guide

NVIDIA Evals Factory provide you with evaluation clients, that are specifically built to evaluate model endpoints using our Standard API.

## Launching an evaluation for an LLM

1. Install the package
    ```
    pip install nvidia-lm-eval
    ```

3. (Optional) Set a token to your API endpoint if it's protected
    ```bash
    export MY_API_KEY="your_api_key_here"
    ```
4. List the available evaluations:
    ```bash
    $ core_evals_lm_eval ls
    Available tasks:
    * mmlu (in lm-evaluation-harness)
    * ifeval (in lm-evaluation-harness)
    * mmlu_pro (in lm-evaluation-harness)
    * math (in lm-evaluation-harness)
    ...
    ```
5. Run the evaluation of your choice:
   ```bash
   core_evals_lm_eval run_eval \
       --eval_type mmlu_pro \
       --model_id meta/llama-3.1-70b-instruct \
       --model_url https://integrate.api.nvidia.com/v1/chat/completions \
       --model_type chat \
       --api_key_name MY_API_KEY \
       --output_dir /workspace/results
   ```
6. Gather the results
    ```bash
    cat /workspace/results/results.yml
    ```

# Command-Line Tool

Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the `lm_eval` (`lm-evaluation-harness`):

## Commands

### 1. **List Evaluation Types**

```bash
core_evals_lm_eval ls
```

Displays the evaluation types available within the harness.

### 2. **Run an evaluation**

The `core_evals_lm_eval run_eval` command executes the evaluation process. Below are the flags and their descriptions:

### Required flags
* `--eval_type <string>`
The type of evaluation to perform
* `--model_id <string>`
The name or identifier of the model to evaluate.
* `--model_url <url>`
The API endpoint where the model is accessible.
* `--model_type <string>`
The type of the model to evaluate, currently either "chat", "completions", or "vlm".
* `--output_dir <directory>`
The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

### Optional flags
* `--api_key_name <string>`
The name of the environment variable that stores the Bearer token for the API, if authentication is required.
* `--run_config <path>`
Specifies the path to a  YAML file containing the evaluation definition.

### Example

```bash
core_evals_lm_eval run_eval \
    --eval_type ifeval \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir ./evaluation_results
```

If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:

```bash
export MY_API_KEY="your_api_key_here"

core_evals_lm_eval run_eval \
    --eval_type ifeval \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
```

# Configuring evaluations via YAML

Evaluations in NVIDIA Evals Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.

Example of a YAML config:
```yaml
config:
  type: ifeval
  params:
    parallelism: 50
    limit_samples: 20
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key: NVIDIA_API_KEY
```

The priority of overrides is as follows:
1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults 

`--dry_run` option allows you to print the final run configuration and command without executing the evaluation.

### Example:

```bash
core_evals_lm_eval run_eval \
    --eval_type mmlu_pro_instruct \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir .evaluation_results \
    --dry_run
```

Output:

```bash
Rendered config:

command: '{% if target.api_endpoint.api_key is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key}}{%
  endif %} lm-eval --tasks {{config.params.task}}{% if config.params.extra.num_fewshot
  is defined %} --num_fewshot {{ config.params.extra.num_fewshot }}{% endif %} --model
  {% if target.api_endpoint.type == "completions" %}local-completions{% elif target.api_endpoint.type
  == "chat" %}local-chat-completions{% endif %} --model_args "base_url={{target.api_endpoint.url}},model={{target.api_endpoint.model_id}},tokenized_requests=false,{%
  if target.api_endpoint.type == "completions" %}tokenizer={{config.params.extra.tokenizer}}{%
  endif %},num_concurrent={{config.params.parallelism}}{% if config.params.max_new_tokens
  is not none %},max_gen_toks={{ config.params.max_new_tokens }}{% endif %},timeout={{
  config.params.timeout }},max_retries={{ config.params.max_retries }},stream={{ target.api_endpoint.stream
  }}" --log_samples --output_path {{config.output_dir}} --use_cache {{config.output_dir}}/lm_cache
  {% if config.params.limit_samples is not none %}--limit {{config.params.limit_samples}}{%
  endif %} {% if target.api_endpoint.type == "chat" %}--fewshot_as_multiturn --apply_chat_template
  {% endif %} {% if config.params.extra.args is defined %} {{config.params.extra.args}}
  {% endif %} {% if config.params.temperature is not none or config.params.top_p is
  not none %}--gen_kwargs="{% if config.params.temperature is not none %}temperature={{
  config.params.temperature }},{% endif %}{% if config.params.top_p is not none %}top_p={{
  config.params.top_p}}{% endif %}"{% endif %}'
framework_name: lm-evaluation-harness
pkg_name: lm_eval
config:
  output_dir: .evaluation_results
  params:
    limit_samples: null
    max_new_tokens: 1024
    max_retries: 5
    parallelism: 10
    task: mmlu_pro
    temperature: 1.0e-07
    timeout: 30
    top_p: 0.9999999
    extra:
      tokenizer: meta-llama/Llama-3.1-70B-Instruct
      num_fewshot: 0
  supported_endpoint_types:
  - chat
  type: mmlu_pro_instruct
target:
  api_endpoint:
    api_key: null
    model_id: my_model
    stream: false
    type: chat
    url: http://localhost:8000


Rendered command:

 lm-eval --tasks mmlu_pro --num_fewshot 0 --model local-chat-completions --model_args "base_url=http://localhost:8000,model=my_model,tokenized_requests=false,,num_concurrent=10,max_gen_toks=1024,timeout=30,max_retries=5,stream=False" --log_samples --output_path .evaluation_results --use_cache .evaluation_results/lm_cache  --fewshot_as_multiturn --apply_chat_template   --gen_kwargs="temperature=1e-07,top_p=0.9999999"
```

# FAQ

## Deploying a model as an endpoint

NVIDIA Evals Factory utilize a client-server communication architecture to interact with the model. As a prerequisite, the **model must be deployed as an endpoint with a NIM-compatible API**.

Users have the flexibility to deploy their model using their own infrastructure and tooling.

Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.
