2026-02-27 08:55:15 UTC · 245.0s · 3 models · 3 prompts · Total cost: $0.31
#1 flux-pro · 4.3/5 · WINNER | Best Value: flux-schnell
3 models · 3 prompts (n=3) · $0.31 · 245s · ⚠ Small sample: scores may vary with more prompts
Consensus Analysis — 3 Judges
Two primary judges score each dimension independently. When their scores differ by at most 0.5, the dimension counts as high agreement. When they differ by more, a third tiebreaker judge is called and the median of the three scores is used as the consensus.
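The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `consensus_score`, the `call_tiebreaker` callback, and the choice to average the two primary scores when they agree are all assumptions, since the report only states how disputed dimensions are resolved.

```python
from statistics import mean, median
from typing import Callable, Tuple

# From the report: a primary-judge difference of at most 0.5 counts as agreement.
AGREEMENT_THRESHOLD = 0.5


def consensus_score(
    primary_a: float,
    primary_b: float,
    call_tiebreaker: Callable[[], float],
) -> Tuple[float, bool]:
    """Return (consensus, disputed) for one scored dimension.

    call_tiebreaker is a hypothetical callback that asks the third judge
    for a score; it is only invoked when the primaries disagree.
    """
    if abs(primary_a - primary_b) <= AGREEMENT_THRESHOLD:
        # Assumption: agreed dimensions use the mean of the two primaries.
        return mean([primary_a, primary_b]), False
    # Disputed: take the median of all three scores, as the report describes.
    tiebreak = call_tiebreaker()
    return median([primary_a, primary_b, tiebreak]), True


if __name__ == "__main__":
    # Primaries 5.0 and 4.0 disagree; a tiebreaker score of 5.0 gives a median of 5.0.
    print(consensus_score(5.0, 4.0, lambda: 5.0))
```

Passing the tiebreaker as a callback keeps the third judge from being queried on dimensions where the primaries already agree, which fits the report's note that the tiebreaker only scores disputed dimensions.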
Judge Scoring Bias
| Judge | Role | Avg Score | vs Consensus | Scores Given |
|---|---|---|---|---|
| gemini-2.5-flash | primary | 4.22 | +0.00 | 27 |
| gemini-3-flash | primary | 4.15 | -0.07 | 27 |
| gemini-2.5-pro | tiebreaker | 4.00 | n/a* | 9 |
* Tiebreaker only scores disputed dimensions, so its average is not directly comparable to primary judges.
Disputed Dimensions (9)
- flux-schnell/Visual Quality: consensus 5.0 ← gemini-2.5-flash=5.0, gemini-3-flash=4.0, gemini-2.5-pro=5.0
- flux-schnell/Prompt Adherence: consensus 4.0 ← gemini-2.5-flash=4.0, gemini-3-flash=5.0, gemini-2.5-pro=4.0
- flux-dev/Visual Quality: consensus 5.0 ← gemini-2.5-flash=5.0, gemini-3-flash=4.0, gemini-2.5-pro=5.0
- flux-pro/Visual Quality: consensus 5.0 ← gemini-2.5-flash=4.0, gemini-3-flash=5.0, gemini-2.5-pro=5.0
- flux-schnell/Visual Quality: consensus 3.0 ← gemini-2.5-flash=4.0, gemini-3-flash=3.0, gemini-2.5-pro=3.0
- flux-schnell/Prompt Adherence: consensus 4.0 ← gemini-2.5-flash=3.0, gemini-3-flash=4.0, gemini-2.5-pro=4.0
- flux-schnell/Text Rendering: consensus 1.0 ← gemini-2.5-flash=1.0, gemini-3-flash=2.0, gemini-2.5-pro=1.0
- flux-dev/Visual Quality: consensus 5.0 ← gemini-2.5-flash=5.0, gemini-3-flash=3.0, gemini-2.5-pro=5.0
- flux-pro/Visual Quality: consensus 4.0 ← gemini-2.5-flash=5.0, gemini-3-flash=4.0, gemini-2.5-pro=4.0
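As a cross-check, the nine disputed entries can be replayed in Python to confirm that each reported consensus is the median of the three judge scores, and to see where the "vs Consensus" offsets in the bias table could come from. This is a sketch under stated assumptions: the helper names are hypothetical, and the offset calculation assumes that on the 18 undisputed dimensions each primary judge's score equals the consensus, so only the disputed rows contribute to the signed difference. The report does not say which prompt each entry belongs to, so repeated model/dimension pairs are kept as separate rows.

```python
from statistics import median

# Columns: model, dimension, gemini-2.5-flash, gemini-3-flash,
# gemini-2.5-pro (tiebreaker), reported consensus. Values copied from the list above.
DISPUTED = [
    ("flux-schnell", "Visual Quality", 5.0, 4.0, 5.0, 5.0),
    ("flux-schnell", "Prompt Adherence", 4.0, 5.0, 4.0, 4.0),
    ("flux-dev", "Visual Quality", 5.0, 4.0, 5.0, 5.0),
    ("flux-pro", "Visual Quality", 4.0, 5.0, 5.0, 5.0),
    ("flux-schnell", "Visual Quality", 4.0, 3.0, 3.0, 3.0),
    ("flux-schnell", "Prompt Adherence", 3.0, 4.0, 4.0, 4.0),
    ("flux-schnell", "Text Rendering", 1.0, 2.0, 1.0, 1.0),
    ("flux-dev", "Visual Quality", 5.0, 3.0, 5.0, 5.0),
    ("flux-pro", "Visual Quality", 5.0, 4.0, 4.0, 4.0),
]
TOTAL_DIMENSIONS = 27  # "Scores Given" for each primary judge
FLASH_25, FLASH_3, CONSENSUS = 2, 3, 5  # column indexes into each row


def medians_match(rows):
    """True if every reported consensus equals the median of the three scores."""
    return all(median([a, b, t]) == c for _, _, a, b, t, c in rows)


def offset_vs_consensus(rows, judge_index, total):
    """Mean signed difference between one judge and the consensus.

    Assumption: undisputed dimensions contribute zero difference.
    """
    signed = sum(row[judge_index] - row[CONSENSUS] for row in rows)
    return signed / total


if __name__ == "__main__":
    print(medians_match(DISPUTED))
    print(round(offset_vs_consensus(DISPUTED, FLASH_25, TOTAL_DIMENSIONS), 2))
    print(round(offset_vs_consensus(DISPUTED, FLASH_3, TOTAL_DIMENSIONS), 2))
```

Under these assumptions, gemini-2.5-flash's over- and under-scores cancel out (offset 0.00), while gemini-3-flash ends up a net 2 points below consensus across 27 dimensions, which rounds to the -0.07 shown in the Judge Scoring Bias table.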