{% extends "base.html" %}
{% block title %}Leaderboard — OCR Bench{% endblock %}
{% block content %}
Rankings are computed by Bradley-Terry maximum-likelihood estimation (MLE) from pairwise comparisons judged by a vision-language model. For each comparison, the judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to build a Human ELO column. Human votes are stored locally, for the current session only, and are reset when the server restarts.
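As a rough illustration of how a Bradley-Terry fit turns pairwise verdicts into ratings, the sketch below uses the standard minorise-maximise update. It is not this site's actual implementation: the function name `bradley_terry_elo`, the `(model_a, model_b, winner)` tuple format, the treatment of a tie as half a win for each side, and the 1500-centred, 400-point Elo scale are all assumptions made for the example.

```python
import math
from collections import defaultdict


def bradley_terry_elo(comparisons, iters=200, base=1500.0, scale=400.0):
    """Fit Bradley-Terry strengths and map them to an Elo-like scale.

    comparisons: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie" (hypothetical format, assumed for this sketch).
    """
    wins = defaultdict(float)   # fractional win count per model
    games = defaultdict(float)  # number of comparisons per ordered pair
    models = set()
    for a, b, winner in comparisons:
        models.update((a, b))
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
        if winner == "a":
            wins[a] += 1.0
        elif winner == "b":
            wins[b] += 1.0
        else:
            # Assumption: a tie counts as half a win for each side.
            wins[a] += 0.5
            wins[b] += 0.5

    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: p_i <- w_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games.get((i, j), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            value = wins[i] / denom if denom > 0 else strength[i]
            # Floor keeps models with zero wins on a finite log scale.
            updated[i] = max(value, 1e-12)
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    # A strength ratio of 10 corresponds to a gap of `scale` rating points.
    return {m: base + scale * math.log10(strength[m]) for m in models}
```

For example, calling `bradley_terry_elo([("A", "B", "a")] * 3 + [("A", "B", "b")])` gives model A a rating about 191 points above model B, since the fitted strength ratio is 3 and 400·log10(3) ≈ 190.8.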
| # | Model | Params | Judge ELO | {% if has_ci %}95% CI | {% endif %}Wins | Losses | Ties | Win% | {% if has_human_elo %}Human ELO | H-Win% | {% endif %}
|---|---|---|---|{% if has_ci %}---|{% endif %}---|---|---|---|{% if has_human_elo %}---|---|{% endif %}
| {{ loop.index }} | {{ row.model_short }} | {{ row.params if row.params else "—" }} | {{ row.elo }} | {% if has_ci %}{{ row.elo_low }}–{{ row.elo_high }} | {% endif %}{{ row.wins }} | {{ row.losses }} | {{ row.ties }} | {{ row.win_pct }}% | {% if has_human_elo %}{{ row.human_elo if row.human_elo is not none else "—" }} | {{ row.human_win_pct if row.human_win_pct is not none else "—" }} | {% endif %}
Smaller models can win on the right documents. Error bars show 95% confidence intervals.