Metadata-Version: 2.4
Name: nvidia-mmath
Version: 26.1
Summary: MMATH - packaged by NVIDIA
License: MIT License
        
        Copyright (c) 2025 RUCAIBox
        Copyright (c) 2025 NVIDIA
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: math-verify==0.7.0
Requires-Dist: tqdm
Requires-Dist: langchain
Requires-Dist: langchain-openai
Requires-Dist: nemo-evaluator>=0.1.51
Dynamic: license-file

# MMATH

This is the official repository for the paper **"MMATH: A Multilingual Benchmark for Mathematical Reasoning"**.

## 📖 Introduction

**MMATH** is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency.

## NVIDIA NeMo Evaluator

MMATH provides evaluation clients built specifically to evaluate model endpoints through our standard API.

### Launching an evaluation for an LLM

#### Install the package

```bash
pip install nvidia-mmath
```

#### (Optional) Set a token for your API endpoint if it's protected

```bash
export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"
```

#### List the available evaluations

```bash
nemo-evaluator ls
```

Available tasks:
* mmath_en (in mmath)
* mmath_zh (in mmath)
* mmath_ar (in mmath)
* mmath_es (in mmath)
* mmath_fr (in mmath)
* mmath_ja (in mmath)
* mmath_ko (in mmath)
* mmath_pt (in mmath)
* mmath_th (in mmath)
* mmath_vi (in mmath)

#### Run the evaluation of your choice

```bash
nemo-evaluator run_eval \
    --eval_type mmath_en \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results
```

#### Gather the results

```bash
cat /workspace/results/results.yml
```

### Command-Line Tool

Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the mmath evaluations:

#### Commands

##### 1. List Evaluation Types

```bash
nemo-evaluator ls
```

Displays the evaluation types available within the harness.

##### 2. Run an evaluation

The `nemo-evaluator run_eval` command executes the evaluation process. Below are the flags and their descriptions:

**Required flags:**

- `--eval_type <string>`: The type of evaluation to perform (e.g., mmath_en, mmath_zh, etc.)
- `--model_id <string>`: The name or identifier of the model to evaluate.
- `--model_url <url>`: The API endpoint where the model is accessible.
- `--model_type <string>`: The type of the model to evaluate, currently one of "chat", "completions", or "vlm".
- `--output_dir <directory>`: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

**Optional flags:**

- `--api_key_name <string>`: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
- `--run_config <path>`: Specifies the path to a YAML file containing the evaluation definition.
- `--overrides <string>`: Override configuration parameters (e.g., 'config.params.limit_samples=10').

#### Example

```bash
nemo-evaluator run_eval \
    --eval_type mmath_en \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
```

If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:

```bash
export MY_API_KEY="your_api_key_here"

nemo-evaluator run_eval \
    --eval_type mmath_en \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
```
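
To evaluate all ten languages in turn, the command above can be generated once per task. The following is a minimal sketch and not part of the package: the helper name `build_run_eval_command` and the one-subdirectory-per-task output layout are assumptions for illustration, and only flags documented above are used.

```python
import shlex

# Task suffixes as listed by `nemo-evaluator ls` above.
LANGUAGES = ["en", "zh", "ar", "es", "fr", "ja", "ko", "pt", "th", "vi"]


def build_run_eval_command(lang, model_id, model_url, output_root, api_key_name=None):
    """Return the argv list for one `nemo-evaluator run_eval` call (hypothetical helper)."""
    cmd = [
        "nemo-evaluator", "run_eval",
        "--eval_type", f"mmath_{lang}",
        "--model_id", model_id,
        "--model_type", "chat",
        "--model_url", model_url,
        # Assumption: a separate output directory per task keeps the results.yml files apart.
        "--output_dir", f"{output_root}/mmath_{lang}",
    ]
    if api_key_name:
        cmd += ["--api_key_name", api_key_name]
    return cmd


if __name__ == "__main__":
    for lang in LANGUAGES:
        cmd = build_run_eval_command(
            lang,
            model_id="meta/llama-3.1-8b-instruct",
            model_url="https://integrate.api.nvidia.com/v1/chat/completions",
            output_root="./evaluation_results",
            api_key_name="MY_API_KEY",
        )
        # Only print the command here; use subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later passed to `subprocess.run`.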

### Configuring evaluations via YAML

Evaluations in MMATH are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.

**Example of a YAML config:**

```yaml
config:
  type: mmath_en
  params:
    parallelism: 50
    limit_samples: 20
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NVIDIA_API_KEY
```

The priority of overrides, from highest to lowest, is as follows:

1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults
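
One way to picture this order is as a layered merge in which each higher-priority layer overwrites the ones beneath it. The sketch below is illustrative only and is not the NeMo Evaluator implementation; `deep_merge` and all four dictionaries are hypothetical, with values loosely borrowed from the YAML example above.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied recursively (hypothetical helper)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Layers from lowest to highest priority; the values are made up for illustration.
framework_defaults = {"config": {"params": {"parallelism": 1}}}
task_defaults = {"config": {"type": "mmath_en", "params": {"parallelism": 10}}}
user_config = {"config": {"params": {"parallelism": 50, "limit_samples": 20}}}
cli_overrides = {"config": {"params": {"limit_samples": 10}}}

final = framework_defaults
for layer in (task_defaults, user_config, cli_overrides):
    final = deep_merge(final, layer)

print(final)
```

In this example `limit_samples` ends up as 10 because the command line beats the user config, and `parallelism` ends up as 50 because the user config beats both sets of defaults.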

The `--dry_run` option allows you to print the final run configuration and command without executing the evaluation.

**Example:**

```bash
nemo-evaluator run_eval \
    --eval_type mmath_en \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --dry_run
```

## 🛠️ Code and Data Usage

MMATH provides the following resources. The `mmath` folder contains the benchmark data, and the `train` folder contains the training data described in Section 3 of our paper. `mmath_eval.py` is the main program for computing accuracy, while `calculate_lcr.py` computes LCR as defined in our paper.

```
│  calculate_lcr.py
│  mmath_eval.py
│  utils.py
│
├─mmath
│      ar.json
│      en.json
│      es.json
│      fr.json
│      ja.json
│      ko.json
│      pt.json
│      th.json
│      vi.json
│      zh.json
│
└─train
        enSFT-3k.json
        enThink-3k.json
        nativeThink-3k.json
```

Each example in `mmath/xx.json` has the following format:

```json
{
        "question": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.", # The question 
        "answer": "204",            # The answer
        "data_source": "AIME2024",  # The data source, which might be AIME2024/AIME2025/CNMO/MATH500
        "data_source_id": 0,        # The index in original data
        "lang": "en",               # Language type
        "gid": 0                    # The global id in our benchmark MMATH
},
```
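
A quick way to inspect a benchmark file is to load it and count the problems per source. This sketch assumes each `mmath/xx.json` file is a single JSON array of objects with the fields shown above (without the `#` annotations, which are not valid JSON); `load_problems` and `count_by_source` are illustrative names, not functions from this repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_problems(path):
    """Load one language file; assumes a top-level JSON array of problem objects."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_source(problems):
    """Count problems per `data_source` value (e.g. AIME2024, CNMO, MATH500)."""
    return Counter(p["data_source"] for p in problems)


if __name__ == "__main__":
    path = Path("mmath/en.json")
    if path.exists():
        problems = load_problems(path)
        print(len(problems), "problems")
        for source, n in sorted(count_by_source(problems).items()):
            print(f"{source}: {n}")
    else:
        print(f"{path} not found; run this from the repository root")
```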

Each example in `train/yy.json` has the following format:

```json
{
        "index":0,
        "answer":"364",
        "conversations":[
            {
                "from":"user",
                "value":"For a positive integer \\( n \\), consider the function\n\n\\[ \nf(n)=\\frac{4n+\\sqrt{4n^2-1}}{\\sqrt{2n+1}+\\sqrt{2n-1}} \n\\]\n\nCalculate the value of\n\n\\[ \nf(1)+f(2)+f(3)+\\cdots+f(40) \n\\]"
            },
            {
                "from":"assistant",
                "value":"<think>\nOkay, let's see. I need to find the sum of f(1) + f(2) + ...Thus, the sum is:\n\n\\[\n\\frac{1}{2} (729 - 1) = \\frac{728}{2} = 364.\n\\]\n\nThe final answer is:\n\n\\[\n\\boxed{364}\n\\]"
            }
        ]
},
```
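
Because the assistant turn ends with a `\boxed{...}` answer, a training record can be sanity-checked against its `answer` field. The sketch below assumes the final answer is the last `\boxed{}` in the assistant message. `extract_boxed` and `answer_matches` are hypothetical helpers using plain string comparison; they are not the answer checker used for evaluation, which relies on the `math-verify` dependency.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def answer_matches(record):
    """Compare the boxed answer of the last assistant turn with record['answer'] as strings."""
    turns = [t["value"] for t in record["conversations"] if t["from"] == "assistant"]
    if not turns:
        return False
    boxed = extract_boxed(turns[-1])
    return boxed is not None and boxed.strip() == record["answer"].strip()


# Record abbreviated from the example above; the elided text is not reproduced.
record = {
    "index": 0,
    "answer": "364",
    "conversations": [
        {"from": "user", "value": "Calculate the value of f(1)+f(2)+...+f(40)."},
        {"from": "assistant", "value": "<think>\n...The final answer is:\n\n\\[\n\\boxed{364}\n\\]"},
    ],
}
print(answer_matches(record))  # prints True
```

Tracking brace depth instead of using a simple regular expression lets the helper return nested expressions such as `\frac{1}{2}` intact.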

## 🧪 Experiment Setup

### Environment Setup

To speed up environment setup, we use `uv` to manage packages. Our training code is based on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), which you can install according to your requirements (e.g., with the `-e` option).

```bash
conda create -n mmath python=3.10
conda activate mmath
pip install uv
uv pip install -r requirements.txt
```

### Evaluation Commands

#### Accuracy Results

To compute accuracy on our benchmark, run:

```bash
export CUDA_VISIBLE_DEVICES=0,1
python mmath_eval.py --model_name_or_path DeepSeek-R1-Distill-Qwen-32B --tensor_parallel_size 2
```

This generates a `results` directory containing a subdirectory named after the model (e.g., `DeepSeek-R1-Distill-Qwen-32B`). Inside that subdirectory are the per-language results, such as `en.json`.

#### LCR Results

To calculate LCR, first download the fastText language identification model `lid.176.bin`:

```bash
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
```

After that, set the `model_list_full` variable in `calculate_lcr.py` as shown below, then run `python calculate_lcr.py`.

```python
model_list_full = [
    "DeepSeek-R1-Distill-Qwen-32B",
    # ...
]
```

This rewrites some keys in `results/model_name/xx.json` and outputs a LaTeX table summarizing all results.
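
For orientation, language identification with `lid.176.bin` generally looks like the sketch below. It is not the code in `calculate_lcr.py`: it assumes the `fasttext` Python package is installed, and it assumes, purely for illustration, a ratio defined as the share of responses whose detected language matches the target language. The paper's definition of LCR remains authoritative. `make_fasttext_detector` and `language_match_ratio` are hypothetical names.

```python
def make_fasttext_detector(model_path="lid.176.bin"):
    """Build a text -> language-code function backed by fastText (requires the fasttext package)."""
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def detect(text):
        # fastText predicts one line at a time, so newlines are replaced first.
        labels, _probs = model.predict(text.replace("\n", " "))
        # Labels look like "__label__en"; keep only the language code.
        return labels[0].replace("__label__", "")

    return detect


def language_match_ratio(responses, target_lang, detect):
    """Fraction of responses whose detected language equals `target_lang` (illustrative metric)."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if detect(r) == target_lang)
    return hits / len(responses)
```

With the model downloaded, a call such as `language_match_ratio(responses, "fr", make_fasttext_detector())` would score a list of French-task responses. Passing the detector in as a function keeps the ratio easy to test without the model file.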

## Training Setup

As mentioned before, our training code is based on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). Here we provide the hyperparameters used in our paper. 

```yaml
### model
model_name_or_path: Qwen2.5-32B-Instruct
trust_remote_code: true

### method
stage: sft
template: qwen
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
packing: false

### dataset
dataset: en-Think
cutoff_len: 32768
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: Qwen2.5-32B-Instruct-en-Think
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_only_model: true
save_total_limit: 10

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
enable_liger_kernel: true
```
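
How many optimizer steps these settings translate to depends on the number of GPUs and the number of training examples, neither of which is stated here. The back-of-the-envelope sketch below takes both as inputs: the values of 8 GPUs and 3000 examples are placeholders, and rounding the warmup up to a whole step is an assumption about the trainer's behaviour. `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_examples, num_gpus, per_device_batch=1, grad_accum=1,
                      epochs=3, warmup_ratio=0.1):
    """Estimate effective batch size, total optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Placeholder inputs: 3000 examples on 8 GPUs (both are assumptions).
print(training_schedule(num_examples=3000, num_gpus=8))  # prints (8, 1125, 113)
```

With `per_device_train_batch_size` and `gradient_accumulation_steps` both set to 1, the effective batch size equals the number of GPUs, so changing the hardware changes the optimization schedule unless one of the two is adjusted to compensate.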

## 📄 Attribution

This project is a fork of the original **MMATH: A Multilingual Benchmark for Mathematical Reasoning** repository created by the RUCAIBox team at Renmin University of China.

### Original Repository
- **Repository**: [RUCAIBox/MMATH](https://github.com/RUCAIBox/MMATH)
- **Original Paper**: [MMATH: A Multilingual Benchmark for Mathematical Reasoning](https://arxiv.org/abs/2505.19126)

### Original Authors
The original MMATH benchmark was created by:

- **Wenyang Luo** - Renmin University of China
- **Wayne Xin Zhao** - Renmin University of China  
- **Jing Sha** - Renmin University of China
- **Shijin Wang** - Renmin University of China
- **Ji-Rong Wen** - Renmin University of China

### License
The original MMATH repository is licensed under the MIT License. This fork maintains the same license terms while adding NVIDIA-specific packaging and evaluation capabilities.

### Acknowledgments
We thank the original MMATH authors for creating this comprehensive multilingual mathematical reasoning benchmark and making it publicly available to the research community.

## 📄 Citation

```
@article{luo2025mmath,
  title={MMATH: A Multilingual Benchmark for Mathematical Reasoning},
  author={Luo, Wenyang and Zhao, Wayne Xin and Sha, Jing and Wang, Shijin and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2505.19126},
  year={2025}
}
```
