{% extends "base.html" %} {% block title %}Leaderboard — OCR Bench{% endblock %} {% block content %}

Leaderboard

{{ repo_id }}

Rankings are computed by fitting a Bradley-Terry model by maximum likelihood (MLE) to pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and cast your own votes to build a Human ELO column. Human votes are stored locally for this session only and are reset when the server restarts.
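{#
A minimal sketch of how a Bradley-Terry fit over pairwise judgements could be
mapped to ELO-style scores, kept in a template comment so it does not render.
It is not the code behind this page. The function name bradley_terry_elo, the
(model_a, model_b, winner) input shape, the half-win rule for ties, the 1e-9
floor on strengths, and the 1500 anchor / 400 scale are all illustrative
assumptions. The fit uses the standard minorise-maximise (MM) update.

```python
import math
from collections import defaultdict


def bradley_terry_elo(comparisons, iterations=200, base=1500.0, scale=400.0):
    """Fit Bradley-Terry strengths by MM updates and map them to an ELO-like scale.

    comparisons: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    Assumption: a tie counts as half a win for each side.
    """
    wins = defaultdict(float)   # wins[m]: total (possibly fractional) wins of m
    games = defaultdict(float)  # games[(i, j)]: number of comparisons of i vs j
    models = set()
    for a, b, winner in comparisons:
        models.update((a, b))
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
        if winner == "a":
            wins[a] += 1.0
        elif winner == "b":
            wins[b] += 1.0
        else:
            wins[a] += 0.5
            wins[b] += 0.5

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games.get((i, j), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            value = wins[i] / denom if denom > 0 else strength[i]
            # Assumption: floor at a tiny value so a winless model stays finite.
            updated[i] = max(value, 1e-9)
        # Rescale so the geometric mean of the strengths is 1.
        mean_log = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(mean_log) for m, v in updated.items()}

    # A strength ratio of 10 corresponds to `scale` rating points.
    return {m: base + scale * math.log10(strength[m]) for m in models}
```

Example: bradley_terry_elo([("A", "B", "a")] * 3 + [("A", "B", "b")]) places
A above B by 400 * log10(3) points, since the fitted strength ratio is 3 to 1.
#}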

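<table>
  <thead>
    <tr>
      <th>#</th>
      <th>Model</th>
      <th>Params</th>
      <th>Judge ELO</th>
      {% if has_ci %}<th>95% CI</th>{% endif %}
      <th>Wins</th>
      <th>Losses</th>
      <th>Ties</th>
      <th>Win%</th>
      {% if has_human_elo %}
      <th>Human ELO</th>
      <th>H-Win%</th>
      {% endif %}
    </tr>
  </thead>
  <tbody>
    {% for row in rows %}
    <tr>
      <td>{{ loop.index }}</td>
      <td>{{ row.model_short }}</td>
      <td>{{ row.params if row.params else "—" }}</td>
      <td>{{ row.elo }}</td>
      {% if has_ci %}<td>{{ row.elo_low }}–{{ row.elo_high }}</td>{% endif %}
      <td>{{ row.wins }}</td>
      <td>{{ row.losses }}</td>
      <td>{{ row.ties }}</td>
      <td>{{ row.win_pct }}%</td>
      {% if has_human_elo %}
      <td>{{ row.human_elo if row.human_elo is not none else "—" }}</td>
      <td>{{ row.human_win_pct if row.human_win_pct is not none else "—" }}</td>
      {% endif %}
    </tr>
    {% endfor %}
  </tbody>
</table>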
{% if chart_points|length >= 2 %}

ELO vs Parameter Count

Smaller models can win on the right documents. Error bars show 95% confidence intervals.

{% endif %} {% endblock %}