Metadata-Version: 2.4
Name: jfinqa
Version: 0.3.2
Summary: Japanese Financial Numerical Reasoning QA Benchmark
Project-URL: Homepage, https://github.com/ajtgjmdjp/jfinqa
Project-URL: Repository, https://github.com/ajtgjmdjp/jfinqa
License-Expression: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: benchmark,edinet,financial-nlp,japanese,question-answering
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: datasets>=3.0
Requires-Dist: loguru>=0.7
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8; extra == 'dev'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == 'pandas'
Description-Content-Type: text/markdown

# jfinqa

Japanese Financial Numerical Reasoning QA Benchmark.

[![PyPI](https://img.shields.io/pypi/v/jfinqa)](https://pypi.org/project/jfinqa/)
[![Python](https://img.shields.io/pypi/pyversions/jfinqa)](https://pypi.org/project/jfinqa/)
[![CI](https://github.com/ajtgjmdjp/jfinqa/actions/workflows/ci.yml/badge.svg)](https://github.com/ajtgjmdjp/jfinqa/actions/workflows/ci.yml)
[![Downloads](https://img.shields.io/pypi/dm/jfinqa)](https://pypi.org/project/jfinqa/)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/ajtgjmdjp/jfinqa)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
[![Leaderboard](https://img.shields.io/badge/Leaderboard-Live-brightgreen)](https://ajtgjmdjp.github.io/jfinqa-leaderboard/)

## What is this?

**jfinqa** is a benchmark for evaluating LLMs on Japanese financial numerical reasoning. Unlike existing benchmarks that focus on classification or simple lookup, jfinqa requires **multi-step arithmetic over financial statement tables** extracted from real Japanese corporate disclosures (EDINET). Questions include DuPont decomposition (6-step), growth rate calculations, and cross-statement ratio analysis.

### Three Subtasks

| Subtask | Description | Example |
|---------|-------------|---------|
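| **Numerical Reasoning** | Calculate financial metrics from table data | "2024年3月期の売上高成長率は何%か？" (What is the revenue growth rate for the fiscal year ended March 2024, in %?) |
| **Consistency Checking** | Verify internal consistency of reported figures | "資産合計は流動資産と固定資産の合計と一致するか？" (Does total assets equal the sum of current assets and non-current assets?) |
| **Temporal Reasoning** | Analyze trends and changes across periods | "売上高が最も低かったのはどの年度か？" (In which fiscal year was revenue lowest?) |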

### Dataset Statistics

| | Total | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
|---|---|---|---|---|
| **Questions** | 1000 | 550 | 200 | 250 |
| **Companies** | 68 | — | — | — |
| **Accounting Standards** | J-GAAP 58%, IFRS 38%, US-GAAP 4% | — | — | — |
| **Avg. program steps** | 2.59 | 2.84 | 2.00 | 2.54 |
| **Avg. table rows** | 13.3 | — | — | — |
| **Max program steps** | 6 (DuPont) | — | — | — |

### Baseline Results

| Model | Overall | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
|-------|---------|--------------------|--------------------|-------------------|
| GPT-4o | **87.0%** | 80.2% | **90.5%** | **99.2%** |
| Gemini 2.0 Flash | 80.4% | **86.2%** | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B-Instruct | 39.6% | 46.4% | 51.0% | 15.6% |

*1000 questions, zero-shot, temperature=0. Evaluation uses numerical matching with 1% tolerance. Qwen2.5-3B-Instruct run locally with MLX (4-bit quantization).*
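
The 1% tolerance mentioned above can be read as a relative tolerance against the gold answer. The following is a minimal sketch of that idea, not the library's actual scoring code: the function name `numerical_match`, the relative (rather than absolute) interpretation, and the handling of a zero gold value are all assumptions.

```python
import math


def numerical_match(predicted: float, gold: float, rel_tol: float = 0.01) -> bool:
    """Return True when `predicted` is within `rel_tol` of `gold`, relative to `gold`."""
    if gold == 0:
        # Relative tolerance is undefined at zero; this fallback to a tiny
        # absolute tolerance is an assumption of this sketch.
        return math.isclose(predicted, 0.0, abs_tol=1e-9)
    return abs(predicted - gold) <= rel_tol * abs(gold)


print(numerical_match(25.2, 25.0))  # difference 0.2 is inside the 0.25 band
print(numerical_match(25.3, 25.0))  # difference 0.3 is outside the 0.25 band
```

Under this reading, a gold answer of 25.0% accepts predictions between 24.75% and 25.25%.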

**[View full leaderboard →](https://ajtgjmdjp.github.io/jfinqa-leaderboard/)**

### Error Analysis

Systematic error analysis revealed both benchmark design issues and genuine LLM failure patterns.

Key findings:
- **Clear capability gradient**: GPT-4o (87%) > Gemini 2.0 Flash (80%) > GPT-4o-mini (68%) >> Qwen2.5-3B-Instruct (40%), indicating that the benchmark discriminates across model sizes and capabilities.
- **Temporal reasoning separates frontier models**: GPT-4o achieves 99.2% on temporal reasoning (TR), while Gemini 2.0 Flash drops to 65.2% and GPT-4o-mini to 29.6%. This subtask requires strict output format compliance: answers must be "増収" (revenue increase) or "減収" (revenue decrease) rather than "はい" (yes) or "いいえ" (no), which strongly differentiates models.
- **Gemini 2.0 Flash leads on numerical reasoning** (86.2% vs GPT-4o's 80.2%), suggesting strong arithmetic capabilities, but falls behind on consistency checking and temporal reasoning where format compliance matters more.
- **DuPont decomposition is the hardest question type**: The 56 six-step ROE decomposition questions see significant accuracy drops even for frontier models, and 3B models rarely solve them correctly.
- **GPT-4o-mini has a systematic prompt compliance issue in temporal reasoning.** It answers "はい" (yes) to questions like "増収か減収か？" (did revenue increase or decrease?) despite correctly analyzing the direction in its reasoning chain (122 of its 176 TR errors follow this pattern).
- **J-GAAP balance sheet structure is a major error source.** Models confuse 純資産合計 (total net assets) with 株主資本 (shareholders' equity), and decompose 総資産 (total assets) into 4 sub-categories instead of the standard 2.
- **Qwen2.5-3B-Instruct** scores below 52% on every subtask and is weakest on temporal reasoning (15.6%). Its consistency checking score (51.0%) is also far below the 83.5% or higher reached by the other three models, suggesting that smaller models have difficulty with instruction-following and multi-step verification tasks in Japanese.
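
To make the six-step DuPont questions concrete, here is a minimal sketch of an ROE decomposition. The exact step sequence used by the benchmark programs is not shown in this README, so the breakdown below (three ratios, two multiplications, one conversion to percent) is an assumption, as are the function name `dupont_roe` and the sample figures.

```python
def dupont_roe(net_income: float, revenue: float,
               total_assets: float, equity: float) -> dict[str, float]:
    """Decompose ROE into its three DuPont factors (hypothetical helper)."""
    net_margin = net_income / revenue           # step 1: profitability
    asset_turnover = revenue / total_assets     # step 2: efficiency
    equity_multiplier = total_assets / equity   # step 3: leverage
    partial = net_margin * asset_turnover       # step 4: return on assets
    roe = partial * equity_multiplier           # step 5: return on equity
    return {
        "net_margin": net_margin,
        "asset_turnover": asset_turnover,
        "equity_multiplier": equity_multiplier,
        "roe_percent": roe * 100,               # step 6: convert to percent
    }


# Illustrative figures only, not taken from the dataset.
result = dupont_roe(net_income=100, revenue=1000, total_assets=2000, equity=500)
print(result)
```

The `equity` argument is where the J-GAAP confusion described above matters: passing 純資産合計 instead of 株主資本 (or the reverse) changes the equity multiplier and therefore the final ROE.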

### Key Features

- **FinQA-compatible**: Same data format as [FinQA](https://github.com/czyssrs/FinQA) for cross-benchmark comparison
- **Japan-specific**: Handles J-GAAP, IFRS, US-GAAP, and Japanese number formats (百万円, 億円, △)
- **Dual evaluation**: Exact match and numerical match with 1% tolerance
- **lm-evaluation-harness integration**: Ready-to-use YAML task configs
- **Source provenance**: Every question links back to its EDINET filing

## Quick Start

### Installation

```bash
pip install jfinqa
# or
uv add jfinqa
```

### Evaluate Your Model

```python
from jfinqa import load_dataset, evaluate

# Load benchmark questions
questions = load_dataset("numerical_reasoning")

# Provide predictions keyed by question ID
predictions = {"nr_001": "25.0%", "nr_002": "16.0%"}
result = evaluate(questions, predictions=predictions)
print(result.summary())
```

### Or Use a Model Function

```python
from jfinqa import load_dataset, evaluate

questions = load_dataset()

def my_model(question: str, context: str) -> str:
    # Your model inference here
    return "42.5%"

result = evaluate(questions, model_fn=my_model)
print(result.summary())
```

## CLI

```bash
# Inspect dataset questions
jfinqa inspect -s numerical_reasoning -n 5

# Evaluate predictions file
jfinqa evaluate -p predictions.json

# Evaluate with local data
jfinqa evaluate -p predictions.json -d local_data.json -s numerical_reasoning
```
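
This README does not document the layout of `predictions.json`. The sketch below assumes it mirrors the dict passed to `evaluate()` in the Quick Start, that is, a JSON object mapping question IDs to answer strings. The helper name `write_predictions` is hypothetical; check the CLI help or source before relying on this format.

```python
import json
from pathlib import Path


def write_predictions(predictions: dict[str, str], path: str) -> Path:
    """Write an ID-to-answer mapping as UTF-8 JSON (assumed file format)."""
    out = Path(path)
    # ensure_ascii=False keeps Japanese answers such as "増収" readable in the file.
    out.write_text(
        json.dumps(predictions, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out


if __name__ == "__main__":
    write_predictions({"nr_001": "25.0%", "nr_002": "16.0%"}, "predictions.json")
```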

## lm-evaluation-harness

[PR #3570](https://github.com/EleutherAI/lm-evaluation-harness/pull/3570) is pending. Once merged:

```bash
lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0
```

Until the PR is merged, point the harness at the local task configs with `--include_path`:

```bash
lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0 \
    --include_path lm_eval_tasks/
```

## Data Format

Each question follows the FinQA schema with additional metadata:

```json
{
  "id": "nr_001",
  "subtask": "numerical_reasoning",
  "pre_text": ["以下はA社の連結損益計算書の抜粋である。"],
  "post_text": ["当期は前期比で増収増益となった。"],
  "table": {
    "headers": ["", "2024年3月期", "2023年3月期"],
    "rows": [
      ["売上高", "1,500,000", "1,200,000"],
      ["営業利益", "200,000", "150,000"]
    ]
  },
  "qa": {
    "question": "2024年3月期の売上高成長率は何%か？",
    "program": ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"],
    "answer": "25.0%",
    "gold_evidence": [0]
  },
  "edinet_code": "E00001",
  "filing_year": "2024",
  "accounting_standard": "J-GAAP"
}
```
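
The `program` field is a list of steps in which `#n` refers to the result of step `n` (zero-indexed). As a minimal sketch of how such a program could be executed, the interpreter below handles the three operations that appear in the example plus `add`. The function name `run_program` and the restriction to two-argument arithmetic operations are assumptions, and this is not the library's own executor.

```python
import re

# Two-argument arithmetic operations; any other operation is out of scope here.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

STEP = re.compile(r"^(\w+)\((.+),\s*(.+)\)$")


def run_program(program: list[str]) -> float:
    """Execute the steps in order and return the result of the last one."""
    results: list[float] = []

    def resolve(token: str) -> float:
        token = token.strip()
        if token.startswith("#"):
            # "#0" means "the result of step 0".
            return results[int(token[1:])]
        return float(token)

    for step in program:
        match = STEP.match(step.strip())
        if match is None:
            raise ValueError(f"Unrecognized step: {step!r}")
        op, left, right = match.groups()
        results.append(OPS[op](resolve(left), resolve(right)))
    return results[-1]


program = ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"]
print(run_program(program))  # 25.0
```

The printed value matches the `answer` field of the example record ("25.0%").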

## Japanese Number Handling

jfinqa correctly normalizes Japanese financial number formats:

| Input | Extracted Value | Notes |
|-------|----------------|-------|
| `△1,000` | -1,000 | Triangle negative marker |
| `１２，３４５` | 12,345 | Fullwidth digits + comma removal |
| `24,956百万円` | 24,956 | Compound financial units treated as labels |
| `50億` | 5,000,000,000 | Bare kanji multiplier applied |
| `42.5%` | 42.5 | Percentage |
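
The rules in the table can be sketched with Unicode NFKC normalization plus a small suffix check. This is an illustration that reproduces the five rows above, not the library's implementation: the function name `parse_japanese_number`, the set of multipliers, and the treatment of any suffix other than a bare multiplier as a label are assumptions.

```python
import re
import unicodedata

# Bare kanji multipliers; anything else after the digits is treated as a label.
KANJI_MULTIPLIERS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def parse_japanese_number(text: str) -> float:
    """Extract a numeric value from a Japanese financial number string."""
    # NFKC folds fullwidth digits, commas, and percent signs to ASCII.
    s = unicodedata.normalize("NFKC", text).strip()
    # A leading triangle marks a negative value in Japanese financial tables.
    negative = s.startswith(("△", "▲", "-"))
    s = s.lstrip("△▲-").replace(",", "")
    match = re.match(r"(\d+(?:\.\d+)?)(.*)", s)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    value = float(match.group(1))
    suffix = match.group(2).strip()
    if suffix in KANJI_MULTIPLIERS:
        value *= KANJI_MULTIPLIERS[suffix]
    return -value if negative else value


for sample in ["△1,000", "１２，３４５", "24,956百万円", "50億", "42.5%"]:
    print(sample, parse_japanese_number(sample))
```

Inputs not covered by the table, such as `50億円`, fall into the label branch in this sketch; how jfinqa itself treats them is not stated here.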

## Development

```bash
git clone https://github.com/ajtgjmdjp/jfinqa
cd jfinqa
uv sync --dev --extra dev
uv run pytest -v
uv run ruff check .
uv run mypy src/
```

## Data Attribution

Source financial data is obtained from [EDINET](https://disclosure.edinet-fsa.go.jp/)
(Electronic Disclosure for Investors' NETwork), operated by the
Financial Services Agency of Japan (金融庁).
EDINET data is provided under the [Public Data License 1.0](https://www.digital.go.jp/resources/open_data/).

The data format is compatible with [FinQA](https://github.com/czyssrs/FinQA) (Chen et al., 2021).

## Related Projects

- [FinQA](https://github.com/czyssrs/FinQA) — English financial QA benchmark (Chen et al., 2021)
- [TAT-QA](https://github.com/NExTplusplus/TAT-QA) — Tabular and textual QA
- [edinet-mcp](https://github.com/ajtgjmdjp/edinet-mcp) — EDINET XBRL parser (companion project)
- [EDINET-Bench](https://github.com/SakanaAI/EDINET-Bench) — Sakana AI's financial classification benchmark

## Citation

If you use jfinqa in your research, please cite it as follows:

```bibtex
@dataset{jfinqa2025,
  title={jfinqa: Japanese Financial Numerical Reasoning QA Benchmark},
  author={ajtgjmdjp},
  year={2025},
  url={https://github.com/ajtgjmdjp/jfinqa},
  license={Apache-2.0}
}
```

## License

Apache-2.0. See [NOTICE](NOTICE) for third-party attributions.
