Examples

Different systems for using LLMs to assess academic texts are proliferating. Fortunately, coarse.ink, an open-source project from David Van Dijcke at the University of Michigan, has recently provided a way to evaluate them against each other. Coarse.ink's quality evaluator asks Gemini 3.1 Pro to score isitcredible.com's review against an opponent's review of the same paper on four dimensions: coverage, specificity, depth, and consistency. The scale runs 1.0–6.0, where 5.0 means the two reviews are judged equally good and scores above 5.0 mean isitcredible.com is judged better than the opponent. Each cell in the tables below is the average across the four dimensions. The protocol comes in two forms: a panel of three persona-differentiated judges plus a meta-synthesizer, and a single-judge protocol averaged across two positional orderings.
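Under our reading of that protocol, a single table cell can be sketched as follows. This is an illustration only: the function names and data layout are ours, not coarse.ink's actual API, and we sketch the panel's meta-synthesis as a plain mean.

```python
# Sketch of how one table cell is computed, under our reading of the
# coarse.ink protocol described above. Names are illustrative only.

DIMS = ("coverage", "specificity", "depth", "consistency")

def dim_mean(scores):
    """Average the four per-dimension scores (each on the 1.0-6.0 scale)."""
    return sum(scores[d] for d in DIMS) / len(DIMS)

def cell_score_single_judge(scores_ab, scores_ba):
    """Single-judge mode: one judge scores each positional ordering
    (isitcredible.com listed first, then second); the cell is the mean."""
    return (dim_mean(scores_ab) + dim_mean(scores_ba)) / 2

def cell_score_panel(panel_scores):
    """Panel mode: three persona-differentiated judges feed a
    meta-synthesizer; here we sketch the synthesis as a plain mean."""
    return sum(dim_mean(s) for s in panel_scores) / len(panel_scores)
```

For example, a judge returning coverage 6.0, specificity 5.0, depth 6.0, consistency 5.0 contributes a 5.5 toward the cell.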

We pitted isitcredible.com against eight other automated reviewers: refine.ink, reviewer3.com, paperreview.ai, and the five LLM models available through coarse.ink's own reviewer harness. The four test papers are those chosen by refine.ink, which have become a standard comparison set.

The results are striking. In panel mode, isitcredible.com won 30 of 32 cells, tied one, and lost one. In single-judge mode we won 26 of 32 cells, tied five, and lost one. The cell we lost was the same in both modes: against refine.ink on Stephens and Donnelly's population genetics paper. On that one paper, the judge flagged a few citation errors in our report, noted that we missed a handful of equation-level catches refine.ink made, and observed that several of our critiques targeted limitations the authors themselves had already acknowledged. We have already tweaked our prompts to address all three issues.

Panel mode

Opponent                       | Chaotic Balanced State | Coset Codes | Population Genetics | Targeting Interventions
coarse.ink (Claude Sonnet 4.6) | 5.83                   | 5.50        | 5.17                | 6.00
coarse.ink (DeepSeek V3.2)     | 6.00                   | 6.00        | 5.50                | 5.83
coarse.ink (GPT-5 mini)        | 5.50                   | 5.67        | 5.50                | 6.00
coarse.ink (Kimi K2.5)         | 6.00                   | 5.83        | 5.67                | 6.00
coarse.ink (Qwen 3.5 Plus)     | 6.00                   | 5.67        | 5.83                | 6.00
paperreview.ai                 | 6.00                   | 6.00        | 5.67                | 6.00
refine.ink                     | 6.00                   | 5.83        | 4.67                | 6.00
reviewer3.com                  | 6.00                   | 6.00        | 5.83                | 6.00

Single-judge mode

Opponent                       | Chaotic Balanced State | Coset Codes | Population Genetics | Targeting Interventions
coarse.ink (Claude Sonnet 4.6) | 5.88                   | 5.25        | 5.00                | 5.50
coarse.ink (DeepSeek V3.2)     | 6.00                   | 6.00        | 6.00                | 6.00
coarse.ink (GPT-5 mini)        | 6.00                   | 6.00        | 4.88                | 5.12
coarse.ink (Kimi K2.5)         | 6.00                   | 5.88        | 5.38                | 5.38
coarse.ink (Qwen 3.5 Plus)     | 6.00                   | 6.00        | 5.38                | 5.50
paperreview.ai                 | 6.00                   | 6.00        | 5.88                | 6.00
refine.ink                     | 5.88                   | 5.25        | 4.25                | 5.88
reviewer3.com                  | 6.00                   | 5.88        | 6.00                | 6.00

above 5.25 win  ·  4.75–5.25 tie  ·  below 4.75 loss

The benchmark is a pairwise comparison, not a ranking: each cell measures how isitcredible.com fares against one specific opponent on one specific paper, not a global score. The worst-case opponent average across the four papers is 5.62 (refine.ink and Claude Sonnet 4.6) in panel mode and 5.31 (refine.ink) in single-judge mode. Both are clearly above parity.
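The win/tie/loss counts and worst-case averages above can be checked mechanically. A minimal sketch over the single-judge table, treating anything above the tie band as a win (which is how the reported counts come out):

```python
# Single-judge cells from the table above, keyed by opponent,
# in paper order: Chaotic Balanced State, Coset Codes,
# Population Genetics, Targeting Interventions.
single_judge = {
    "coarse.ink (Claude Sonnet 4.6)": [5.88, 5.25, 5.00, 5.50],
    "coarse.ink (DeepSeek V3.2)":     [6.00, 6.00, 6.00, 6.00],
    "coarse.ink (GPT-5 mini)":        [6.00, 6.00, 4.88, 5.12],
    "coarse.ink (Kimi K2.5)":         [6.00, 5.88, 5.38, 5.38],
    "coarse.ink (Qwen 3.5 Plus)":     [6.00, 6.00, 5.38, 5.50],
    "paperreview.ai":                 [6.00, 6.00, 5.88, 6.00],
    "refine.ink":                     [5.88, 5.25, 4.25, 5.88],
    "reviewer3.com":                  [6.00, 5.88, 6.00, 6.00],
}

def classify(score):
    """Map a cell score to win/tie/loss using the tie band 4.75-5.25."""
    if score < 4.75:
        return "loss"
    if score <= 5.25:
        return "tie"
    return "win"

cells = [s for row in single_judge.values() for s in row]
tally = {k: sum(classify(s) == k for s in cells) for k in ("win", "tie", "loss")}
print(tally)  # {'win': 26, 'tie': 5, 'loss': 1}

worst = min(single_judge, key=lambda k: sum(single_judge[k]) / 4)
print(worst, round(sum(single_judge[worst]) / 4, 2))  # refine.ink 5.31
```

Running the same tally on the panel table reproduces the 30 wins, one tie, and one loss reported above.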

For complete transparency, we provide a replication package containing the coarse.ink judge code, all nine reviews per paper, and the raw Gemini judge reasoning for every cell. Drop the four source PDFs into place and you can reproduce both tables above on your own machine.

Download replication package

The four reports below show our analysis of each paper, paired with a brief comparison against refine.ink's coverage of the same paper. We focus on refine.ink here because it is often considered the leading AI reviewer system.

Chaotic Balanced State in a Model of Cortical Circuits

van Vreeswijk & Sompolinsky, Neural Computation, 1998

This article presents a foundational theoretical model in computational neuroscience. Our report runs to 20 pages and identifies over two dozen distinct issues: twelve structural concerns with the model's assumptions and biological claims, around ten algebraic and sign errors in the derivations, several presentation issues, and page-by-page copyediting suggestions. It also includes an editor's note advising how the authors might respond to each criticism, a credibility assessment, and suggestions for future research. refine.ink's report covers twelve issues in total: a mix of equation corrections, conceptual clarifications, and suggestions for improving the exposition.

Coset Codes, Part I

G. D. Forney Jr., IEEE Trans. Information Theory, 1988

This paper presents Forney's classification of coding schemes for band-limited channels. Our report identifies around sixteen issues, including structural concerns about the framework's reliance on heuristic metrics, calculation errors in the summary tables, mathematical typos, and presentation suggestions. refine.ink's report covers nineteen issues, with a focus on notation and clarity.

Inference in Molecular Population Genetics

Stephens & Donnelly, Journal of the Royal Statistical Society, Series B, 2000

This paper introduces Stephens and Donnelly's importance sampling algorithm for coalescent-based inference. Our report identifies around a dozen issues, including structural weaknesses in the empirical benchmarks, an algebraic error in the proof of Theorem 1, and a credibility assessment of the paper's comparative efficiency claims. refine.ink's report covers fifteen issues, focusing on notation and exposition.

Targeting Interventions in Networks

Galeotti, Golub & Goyal, Econometrica, 2020

This paper presents Galeotti, Golub, and Goyal's framework for optimal intervention targeting using network spectral decomposition. The example report below covers both the main text and the online appendix, identifying over thirty issues: structural concerns about practical applicability, around thirteen equation-level errors in the supplementary proofs, notation errors, and detailed copyediting suggestions. We always encourage users to upload supplementary materials with their papers, since these often contain the most important mathematical detail. refine.ink's report covers six issues from the main text. For the head-to-head benchmark above, we tested isitcredible.com on the same working-paper version refine.ink's system received, without the supplement, to keep the comparison fair.