Metadata-Version: 2.4
Name: merging-eval
Version: 0.1.4
Summary: Model Merging Scaling Laws in Large Language Models - Empirical scaling laws for language model merging measured by cross-entropy
Author-email: Merging-EVAL Team <maxuan1798@gmail.com>
License: MIT License
        
        Copyright (c) 2024 Merging-EVAL Team
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/Merging-EVAL/Merging-EVAL
Project-URL: Documentation, https://github.com/Merging-EVAL/Merging-EVAL#readme
Project-URL: Repository, https://github.com/Merging-EVAL/Merging-EVAL
Project-URL: Issues, https://github.com/Merging-EVAL/Merging-EVAL/issues
Keywords: llm,model-merging,scaling-laws,language-models,cross-entropy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: datasets>=2.12.0
Requires-Dist: accelerate>=0.20.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: requests>=2.31.0
Provides-Extra: full
Requires-Dist: torch==2.6.0; extra == "full"
Requires-Dist: torchaudio==2.6.0; extra == "full"
Requires-Dist: torchvision==0.21.0; extra == "full"
Requires-Dist: transformers==4.50.1; extra == "full"
Requires-Dist: datasets==3.4.1; extra == "full"
Requires-Dist: tokenizers==0.21.1; extra == "full"
Requires-Dist: huggingface-hub==0.29.3; extra == "full"
Requires-Dist: safetensors==0.5.3; extra == "full"
Requires-Dist: pyarrow==19.0.1; extra == "full"
Requires-Dist: swanlab>=0.3.0; extra == "full"
Requires-Dist: bitsandbytes==0.45.4; extra == "full"
Requires-Dist: peft==0.15.0; extra == "full"
Requires-Dist: deepspeed==0.16.4; extra == "full"
Requires-Dist: vllm==0.8.2; extra == "full"
Requires-Dist: triton==3.2.0; extra == "full"
Requires-Dist: xformers==0.0.29.post2; extra == "full"
Requires-Dist: flashinfer-python==0.2.3; extra == "full"
Requires-Dist: ninja==1.11.1.4; extra == "full"
Requires-Dist: compressed-tensors==0.9.2; extra == "full"
Requires-Dist: gguf==0.10.0; extra == "full"
Requires-Dist: trl==0.15.2; extra == "full"
Requires-Dist: unsloth==2025.3.18; extra == "full"
Requires-Dist: unsloth_zoo==2025.3.16; extra == "full"
Requires-Dist: outlines==0.1.11; extra == "full"
Requires-Dist: outlines_core==0.1.26; extra == "full"
Requires-Dist: lm-format-enforcer==0.10.11; extra == "full"
Requires-Dist: llguidance==0.7.10; extra == "full"
Requires-Dist: xgrammar==0.1.16; extra == "full"
Requires-Dist: diffusers==0.32.2; extra == "full"
Requires-Dist: opencv-python-headless==4.11.0.86; extra == "full"
Requires-Dist: pillow==11.1.0; extra == "full"
Requires-Dist: cupy-cuda12x==13.4.1; extra == "full"
Requires-Dist: nvidia-ml-py==12.570.86; extra == "full"
Requires-Dist: fastapi==0.115.12; extra == "full"
Requires-Dist: uvicorn==0.34.0; extra == "full"
Requires-Dist: starlette==0.46.1; extra == "full"
Requires-Dist: pydantic==2.10.6; extra == "full"
Requires-Dist: wandb==0.19.8; extra == "full"
Requires-Dist: prometheus-client==0.21.1; extra == "full"
Requires-Dist: prometheus-fastapi-instrumentator==7.1.0; extra == "full"
Requires-Dist: nvitop==1.4.2; extra == "full"
Requires-Dist: ray==2.44.0; extra == "full"
Requires-Dist: scipy==1.15.2; extra == "full"
Requires-Dist: scikit-learn==1.6.1; extra == "full"
Requires-Dist: numba==0.60.0; extra == "full"
Requires-Dist: llvmlite==0.43.0; extra == "full"
Requires-Dist: matplotlib==3.10.1; extra == "full"
Requires-Dist: seaborn==0.13.2; extra == "full"
Requires-Dist: click==8.1.8; extra == "full"
Requires-Dist: rich==13.9.4; extra == "full"
Requires-Dist: typer==0.15.2; extra == "full"
Requires-Dist: pyyaml==6.0.2; extra == "full"
Requires-Dist: python-dotenv==1.1.0; extra == "full"
Requires-Dist: httpx==0.28.1; extra == "full"
Requires-Dist: aiohttp==3.11.14; extra == "full"
Requires-Dist: openai==1.68.2; extra == "full"
Requires-Dist: einops==0.8.1; extra == "full"
Requires-Dist: sentencepiece==0.2.0; extra == "full"
Requires-Dist: tiktoken==0.9.0; extra == "full"
Requires-Dist: protobuf==3.20.3; extra == "full"
Requires-Dist: psutil==7.0.0; extra == "full"
Requires-Dist: filelock==3.18.0; extra == "full"
Requires-Dist: regex==2024.11.6; extra == "full"
Requires-Dist: packaging==24.2; extra == "full"
Requires-Dist: fsspec==2024.12.0; extra == "full"
Dynamic: license-file

# Merging-EVAL: Model Evaluation Framework

A comprehensive evaluation framework for language models with support for multiple datasets, GPU specification, and offline evaluation capabilities.

## Features

- 🚀 **Multi-dataset Support**: Evaluate on code, algebra, analysis, and other domains
- 🎯 **GPU Specification**: Run evaluations on specific GPUs
- 📊 **Offline Mode**: Evaluate models without internet connection
- 🔧 **Data Slicing**: Evaluate on specific data subsets using indices
- 💾 **Caching**: Intelligent caching for faster repeated evaluations
- 📈 **Comprehensive Metrics**: Cross-entropy loss, token-level analysis

## Repository Structure

```
.
├── scripts/
│   └── eval.py           # Main evaluation script
├── src/merge/
│   └── main_merging.py   # Model merging script
├── data/
│   ├── eval_partial/     # Evaluation datasets
│   │   ├── code.json     # Code generation tasks
│   │   ├── algebra.json  # Mathematical problems
│   │   └── analysis.json # Data analysis tasks
│   └── train_partial/    # Training datasets
├── test_result/          # Evaluation results and merged models
└── cache/                # Cached tokenized data
```

## Installation and Setup

### 1. Environment Setup

Activate the required Python environment:

```bash
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate
```

### 2. Environment Variables

Set required environment variables:

```bash
export TRANSFORMERS_NO_TORCHVISION=1
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```

## Usage

### Model Merging

Merge multiple models using Task Arithmetic:

```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
python3 src/merge/main_merging.py \
  --merge_method task_arithmetic \
  --output_dir /path/to/output \
  --base_model /path/to/base/model \
  --models_to_merge "/path/to/model1,/path/to/model2,/path/to/model3" \
  --scaling_coefficient 0.2 \
  --use_gpu
```

#### GPU-Specific Merging

Run model merging on a specific GPU:

```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
export CUDA_VISIBLE_DEVICES=7
python3 src/merge/main_merging.py \
  --merge_method task_arithmetic \
  --output_dir /zju_0038/test_merge/Merging-EVAL/test_result/scheme1_16models_task_arithmetic \
  --base_model /zju_0038/wyy/mergebench/models/Llama-3.2-3B \
  --models_to_merge "/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-algebra,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-analysis,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-number_theory,/zju_0038/yifyang/scripts/models/llama-instruct-3B-v2-physics" \
  --scaling_coefficient 0.2 \
  --use_gpu
```

### Basic Evaluation

Evaluate a model on a specific dataset:

```bash
python3 scripts/eval.py \
  --model /path/to/model \
  --tokenizer /path/to/tokenizer \
  --file /path/to/dataset.json \
  --output ./results \
  --batch_size 1 \
  --max_length 2048
```

### GPU-Specific Evaluation

Run evaluation on a specific GPU:

```bash
export CUDA_VISIBLE_DEVICES=7
python3 scripts/eval.py \
  --model /path/to/model \
  --tokenizer /path/to/tokenizer \
  --file /path/to/dataset.json \
  --output ./results \
  --batch_size 1 \
  --max_length 2048 \
  --gpu_id 0 \
  --offline
```

### Dataset-Specific Examples

#### Mathematical Problems (Algebra)
```bash
# 数学问题推荐 max_length = 2048
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate && \
export TRANSFORMERS_NO_TORCHVISION=1 && \
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && \
export CUDA_VISIBLE_DEVICES=7 && \
python3 /zju_0038/test_merge/Merging-EVAL/scripts/eval.py \
  --model /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
  --tokenizer /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
  --file /zju_0038/test_merge/Merging-EVAL/data/eval_partial/algebra.json \
  --output /zju_0038/test_merge/Merging-EVAL/test_result \
  --batch_size 1 \
  --max_length 2048 \
  --gpu_id 0 \
  --offline
```

#### Code Generation Tasks
```bash
# 代码问题推荐 max_length = 4096
source /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/merge-eval-py312/bin/activate && \
export TRANSFORMERS_NO_TORCHVISION=1 && \
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && \
export CUDA_VISIBLE_DEVICES=7 && \
python3 /zju_0038/test_merge/Merging-EVAL/scripts/eval.py \
  --model /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
  --tokenizer /zju_0038/jinjia/workspace/Merging-Scaling-Law-main/models/merged/Llama-3B-cmb/task_arithmetic_9/sc0.1_r0/6p3h \
  --file /zju_0038/test_merge/Merging-EVAL/data/eval_partial/code.json \
  --output /zju_0038/test_merge/Merging-EVAL/test_result \
  --batch_size 1 \
  --max_length 4096 \
  --gpu_id 0 \
  --offline
```

## Parameters

### Model Merging Parameters

#### Required Parameters
- `--merge_method`: Merging method (e.g., "task_arithmetic", "average_merging")
- `--base_model`: Path to the base model directory
- `--models_to_merge`: Comma-separated list of model paths to merge
- `--output_dir`: Output directory for merged model

#### Optional Parameters
- `--scaling_coefficient`: Scaling coefficient for merging (default: 1.0)
- `--use_gpu`: Use GPU for merging (default: CPU)
- `--exclude_param_names_regex`: Regex patterns for parameters to exclude
- `--param_value_mask_rate`: Parameter value mask rate (default: 0.8)
- `--mask_apply_method`: Method for applying masks (default: "average_merging")
- `--weight_mask_rates`: Comma-separated weight mask rates

### Model Evaluation Parameters

#### Required Parameters
- `--model`: Path to the model directory
- `--tokenizer`: Path to the tokenizer directory
- `--file`: Path to the evaluation dataset JSON file

#### Optional Parameters
- `--output`: Output directory for results (default: `./output`)
- `--batch_size`: Batch size for evaluation (default: 10)
- `--max_length`: Maximum sequence length (default: 2048)
- `--gpu_id`: Specific GPU ID to use (e.g., 0, 1, 2)
- `--offline`: Run in offline mode (no internet connection required)
- `--indices`: Evaluate specific data indices (e.g., "1-10,15,20-22")
- `--run_name`: Custom name for output folder
- `--no_cache`: Disable caching mechanism

### Dataset-Specific Recommendations

#### Code Generation Tasks
- **Recommended max_length**: 4096-8192
- **Reason**: Code samples are typically longer (average: 1,899 tokens)
- **Coverage**: 8192 covers 98% of samples

#### Mathematical Problems
- **Recommended max_length**: 2048
- **Reason**: Math problems are typically shorter
- **Coverage**: 2048 is sufficient for most algebra problems

## Output Format

Results are saved as CSV files in the output directory:

```
test_result/
└── model_name/
    └── all/
        └── results.csv
```

CSV format:
```csv
problem,CE Loss,class
algebra,0.6289,algebra
Avg.,0.6289,average
Overall,0.6289,overall
```

## PyPI Package Distribution

This project is available as a Python package on PyPI for easy installation and distribution.

### Package Information

- **Package Name**: `merging-eval`
- **Version**: 0.1.0
- **Description**: Model merging algorithms and scaling laws for large language models
- **License**: MIT
- **Python**: >=3.8

### Installation

#### From PyPI
```bash
pip install merging-eval
```

#### With Full Dependencies
```bash
pip install "merging-eval[full]"
```

#### Development Installation
```bash
git clone https://github.com/Merging-EVAL/Merging-EVAL.git
cd Merging-EVAL
pip install -e .
```

### Package Features

The package provides comprehensive model merging and evaluation capabilities:

```python
import merge
from merge import MergingMethod, FlopsCounter

# Available merging methods
merging_methods = [
    "average_merging",      # Equal-weight averaging
    "task_arithmetic",      # Task vector arithmetic
    "ties_merging",         # TIES merging algorithm
    "ties_merging_dare",    # TIES with DARE variant
    "mask_merging"          # Mask-based merging
]
```

### Publishing to PyPI

#### Local Development
```bash
./build_and_test.sh
```

#### Publishing
1. **Setup credentials**: Copy `.pypirc.template` to `~/.pypirc` and add API tokens
2. **Test PyPI**: `./publish_to_pypi.sh test`
3. **Production PyPI**: `./publish_to_pypi.sh production`

#### Automated Publishing
- GitHub Actions automatically publishes on new releases
- Update version in `pyproject.toml` and create GitHub release

## Troubleshooting

### Model Merging Issues

1. **Protobuf Version Conflicts**:
   - Solution: Set `export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`

2. **GPU Memory Issues**:
   - Use CPU mode (remove `--use_gpu` flag)
   - Reduce number of models to merge simultaneously
   - Use specific GPU with `export CUDA_VISIBLE_DEVICES=X`

3. **PEFT Configuration Errors**:
   - Some models may have incompatible PEFT configurations
   - Solution: Exclude problematic models or use CPU mode

4. **RoPE Configuration Warnings**:
   - Warning: `rope_scaling` configuration issues
   - Solution: These are usually non-fatal warnings

### Model Evaluation Issues

1. **NaN Loss Values**: Usually caused by very long sequences or all-masked labels
   - Solution: Increase `max_length` or check data format

2. **GPU Memory Issues**:
   - Reduce `batch_size` to 1
   - Decrease `max_length`
   - Use specific GPU with `--gpu_id`

3. **Network Connection Issues**:
   - Use `--offline` flag for local model evaluation

### Performance Tips

#### Model Merging
- Use GPU mode for faster merging when memory allows
- Start with smaller model subsets to test compatibility
- Use CPU mode for large model collections to avoid memory issues
- Set environment variables before running to avoid conflicts

#### Model Evaluation
- Use caching for repeated evaluations on the same dataset
- Specify GPU ID for better resource management
- Adjust `max_length` based on dataset characteristics
- Use offline mode when internet connection is unstable 
