An Alignment Journal: Adaptation to AI

Etchings of artificial and natural objects

This post is the second in a series that will go on to discuss our theory of change, comparison to related projects, possible partnerships and extensions, scope, personnel, and organizational structure.

Contact us if you’re interested in participating as an author, reviewer, or editor, or if you know someone who might be.

Cross-posted to LessWrong. Please go there for comments and discussion.

Summary: Adaptation to AI

This post describes the Alignment journal’s plans for adapting to ever-stronger AI presence in peer review, and in particular the tools we are developing. The first section below surveys the broader journal landscape — reviewer-finding systems, LLM-usage policies, AI review services, and editorial experiments like the AAAI-26 AI-reviewer pilot and the ICLR 2025 reviewer-feedback study. The rest of this summary focuses on what’s specific to us.

Distinct aspects of the alignment field shape our approach. First, our volume will start low and the overall community is still relatively small and fluid, so processes are less entrenched; this lets us experiment, audit by hand, and quickly deploy tools that wouldn’t be deployed by Nature or NeurIPS, but it also means we can’t develop tools that require many resources or a large user base. Second, the field is young and interdisciplinary, and we want to build bridges to neighboring fields and across academia, industry, and independent research; this makes LLM methods relatively high-leverage.

Our approach is to experiment continuously and, where AI use by participants produces negative effects, update incentives rather than restrict usage. Since alignment researchers are heavy AI users already, we are particularly interested in tools that authors and reviewers cannot easily replicate themselves. The main exception is desk review, where editors face unfiltered submissions.

Our near-term priorities, in rough order of importance, are:

  1. LLM-driven reviewer discovery — our initial focus, well-suited to LLM strengths and especially valuable for a young, interdisciplinary field with no large legacy reviewer database.
  2. Checkable desk-review assistance — an AI assessment at desk-review, potentially graduating into an automated (but bypassable) bounce-back for submissions with significant, verifiable problems.
  3. AI reviews for reviewers — an AI report made available to reviewers (after submitting their own initial report, to avoid anchoring), sourced from third-party services like Refine.

In the longer term, we are looking at three trends: ICLR-2025-style private AI feedback on reviewer drafts; the DOGE arbitration protocol, which restructures peer review around an AI acting as a neutral third party; and the possibility that LLM-mediated writing and reading will change what a paper looks like, in turn changing what review should do.

Automated tools for research journals: Lay of the land

In this section, we review automated systems for journals and conferences in general. The discussion specific to the Alignment journal begins in the section “What’s different for an alignment journal” and we describe our concrete plans in “Near-term tooling”.

We review a broad spectrum of proposed, tested, or deployed automated tools for general peer review.[1] They can be categorized by their role in the review process (with the human role they replace or augment in italics):[2]

  • Integrity screening (editor): Detecting purposeful fraud and slop
  • Desk review (editor): Provisional review of the manuscript
  • Reviewer discovery (editor): Finding qualified, interested, and unconflicted reviewers
  • Reviewing (reviewer): Writing a report assessing the manuscript
  • Review synthesis (author): Combining and organizing multiple, potentially conflicting reviews
  • Arbitration (editor): Adjudicating disputes between the authors and reviewers
  • Meta-review (editor): Reviewing the reviewers for feedback and quality tracking

Frontier LLMs are the obvious new lever for further automating peer review, but older, well-understood algorithms also deserve consideration. In particular:

  • Classical computer vision techniques like error-level and frequency-domain analysis for detecting image manipulation
  • Keyword matching, recommender systems, and semantic content embeddings (e.g., SPECTER2 in Semantic Scholar) for reviewer discovery (a minimal sketch of the embedding approach follows this list)
  • Constrained assignment optimization for reviewer assignment
  • Bridge-based ranking, as seen in X Community Notes and vTaiwan, for reconciling divergent reviews in a multi-way reviewer discussion using mutual rating rather than content-based analysis
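To make the embedding approach concrete, here is a minimal sketch of ranking candidate reviewers by cosine similarity between a submission’s abstract and the abstracts of candidates’ recent papers. The model name is an assumption: "allenai-specter" is the original SPECTER checkpoint distributed for sentence-transformers, while SPECTER2 itself is loaded differently (a base model plus adapters), so treat this as a stand-in rather than a description of any deployed system.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint: the original SPECTER model on the sentence-transformers hub.
model = SentenceTransformer("allenai-specter")

submission = "Title of the submitted manuscript. Abstract text goes here..."
candidates = {
    "Reviewer A": "Title of a recent paper by Reviewer A. Its abstract...",
    "Reviewer B": "Title of a recent paper by Reviewer B. Its abstract...",
}

# Rank candidates by cosine similarity of abstract embeddings.
sub_emb = model.encode(submission, convert_to_tensor=True)
scores = {
    name: util.cos_sim(sub_emb, model.encode(text, convert_to_tensor=True)).item()
    for name, text in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: affinity {score:.2f}")
```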

Traditional automated tools

Desk review and integrity screening. These systems are the oldest and most widely deployed, but they are usually closed-source, with scant public detail. They range widely in sophistication, from simple checklists confirming formatting requirements to deep-learning models for detecting image manipulation and AI-generated text. Springer Nature ran its Editor Evaluation tool on nearly half a million manuscripts in 2025, automating checks for data-availability statements, ethics declarations, clinical-trial registration, and misuse risk during desk review. Springer Nature has also deployed specialized detectors: “Geppetto” for AI-generated text, “SnappShot” for problematic images, and a citation-relevance checker for irrelevant references. Frontiers’ AIRA suite performs integrity checks on each submission, e.g., flagging image manipulation, plagiarism, paper-mill patterns, and suspicious references; this contributes to filtering roughly 35% of submissions before they reach an editor.[3] There are also many citation-checking tools that confirm each citation resolves to a real paper (a basic defense against fraudulent citations and hallucinations), but robust tools that check, against the contents of the cited work, whether a citation actually supports the claim it is invoked for are only just emerging.[4]
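As a concrete illustration of the most basic citation check (confirming that a citation resolves to a real record), here is a minimal sketch using Crossref’s public REST API. This is our own illustration rather than a description of any vendor’s tool, and the DOIs below are hypothetical placeholders.

```python
import requests

def doi_resolves(doi: str) -> bool:
    """True if the DOI resolves to a real record in Crossref.

    This only confirms the cited work exists; it says nothing about whether
    the citation actually supports the claim it is attached to.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Hypothetical reference list; flag entries that do not resolve for manual review.
references = ["10.1000/placeholder.one", "10.1000/placeholder.two"]
unresolved = [doi for doi in references if not doi_resolves(doi)]
print("Citations needing manual review:", unresolved)
```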

Reviewer discovery and matching. When surveyed about the hardest part of their job, 75% of editors selected “finding reviewers and getting them to accept review invitations”.[5] Springer Nature’s Reviewer Finder, Elsevier’s Find Reviewers, and Clarivate’s Reviewer Locator rank candidates by topical match, workload, review history, and conflict of interest. They are probably based on recommender systems. For ML conferences, which require matching in addition to discovery, OpenReview computes affinity scores from reviewer publications and optimizes assignments under load constraints; NeurIPS has used this since at least 2021. Reviewer-identity verification is also increasingly automated: Editorial Manager’s Identity Confidence Check and ScholarOne’s Unusual Activity Detection help screen for fraudulent reviewer accounts.
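For the matching step, here is a minimal sketch of load-constrained assignment under stated assumptions (a small affinity matrix and a uniform reviewer load cap); this is not OpenReview’s actual implementation, which handles many more constraints. Each reviewer is duplicated once per review slot, and the result is solved as a one-to-one linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy affinity scores, papers x reviewers (e.g., from embedding similarity).
affinity = np.array([
    [0.9, 0.2, 0.4],
    [0.3, 0.8, 0.5],
    [0.6, 0.4, 0.7],
    [0.2, 0.9, 0.3],
])
max_load = 2  # each reviewer handles at most two papers

# Duplicate each reviewer column once per slot, then solve the one-to-one
# assignment; negate the scores because linear_sum_assignment minimizes cost.
slots = np.repeat(affinity, max_load, axis=1)
papers, slot_idx = linear_sum_assignment(-slots)
for paper, slot in zip(papers, slot_idx):
    print(f"paper {paper} -> reviewer {slot // max_load}")
```

Each paper here receives one reviewer; assigning several reviewers per paper can be handled the same way by duplicating paper rows.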

LLM-usage policies

Perhaps because preventing reviewers from using LLMs is infeasible, many journals have adopted hybrid and disclosure-based policies rather than blanket bans. Nature Portfolio prohibits uploading manuscripts to external services, and requires reviewers to disclose any AI use in preparing their review. Taylor & Francis allows AI to improve only the language of a review. A common design principle across publishers favors integrated, auditable, in-platform AI over ad hoc use of consumer chatbots—a point emphasized in IOP’s 2025 report and AAAI-26’s FAQ, which stresses that its AI workflow runs under contractual privacy protections. The Unjournal’s working policy allows selective AI usage with disclosure requirements, ideally accompanied by direct links to the AI output: using AI to run extensive checks that are infeasible by hand is encouraged, while using it for overall evaluations or ratings is discouraged.

ICML 2026 is operating under a two-policy framework. Authors and reviewers each declare preferences: Policy A (no LLM use at all) or Policy B (LLMs allowed for comprehension and polishing, but not for generating evaluative judgments). Papers are routed accordingly. This is informed by community surveys showing that Policies A and B were strongly preferred by ~40% and ~30% of reviewers, respectively. ICML also offered authors pre-submission AI feedback via a voucher system — one paper per eligible author, typically processed within 24 hours.

LLM-written reviews

Many review-writing tools powered by frontier LLMs are available: q.e.d Science, Refine, Reviewer3, Stanford Agentic Reviewer, DeepReviewer, OpenAIReview, Hum’s Alchemist Review, xPeerd, Manusights, WorldBrain Scholar’s Eliza, Enago & Charlesworth’s Review Assistant, and Cactus’s Paperpal Review.[6] We’re still uncertain whether any of these provide enough value over a good prompt to a frontier LLM to justify the modest overhead of third-party integration and obsolescence risk.[7] If you have experience with any of these services, please share in the comments.

To our knowledge none of the services have been compared systematically by an independent party, either to each other or to human-written reviews.[8] The only independent positive evidence we have seen is for q.e.d Science, Refine, Reviewer3, OpenAIReview, and the Stanford Agentic Reviewer, and it is mostly anecdotal (see Appendix 1 for links), although this openRxiv pilot for q.e.d Science seems notable. We interpret this lukewarm post by Purpose-led Publishing[9] as weak negative evidence on Alchemist Review.

Experiments with LLM editorial tools

Supplementary review by off-the-shelf chatbots. The New England Journal of Medicine’s AI-focused journal NEJM AI experimented with GPT-5 and Gemini 2.5 Pro as supplementary reviewers for clinical trial submissions; the experiment was limited to manuscripts that multiple human editors had already judged likely to be accepted. The models flagged trial design flaws and statistical anomalies (e.g., implausible sample size justifications, incomplete randomization descriptions) that some human reviewers missed. In addition to having the chatbots one-shot a report, the editors engaged in an extensive back-and-forth conversation with the models over statistical issues.

The Unjournal has collected data comparing its human-written reviews with those produced by several frontier LLMs, but it has not deployed LLM reviews in its actual editorial process. Its benchmarking project (n≈45 paired human/LLM evaluations across 5 models, results very preliminary) finds that for the strongest LLMs tested (Claude Opus 4.6, GPT-5 Pro), overall ratings are roughly as correlated with human ratings as human ratings are with each other, although the confidence intervals for these statistics are wide. (Earlier/smaller models, such as Sonnet and GPT-4o, perform substantially worse.)

AI as an explicit, non-voting referee. The boldest experiment to date is the AAAI-26 AI Review Pilot. All 22,977 full-review submissions received a single, clearly labelled AI review from a multi-stage pipeline built on a frontier reasoning model; the reviews carried no score and no accept/reject recommendation, and confidentiality was handled contractually (ephemeral copies passed to the API, with no storage, logging, or training on submissions). A second phase added an AI-generated summary of the human discussion for senior programme-committee members. The subsequent post-mortem is the strongest piece of field evidence to date on machine peer review: across 5,834 survey responses, AI reviews were rated higher than human reviews on six of nine review-quality dimensions (the biggest gaps were in technical-error detection, raising unconsidered points, and suggesting presentation improvements), but they functioned as a complement rather than a substitute (46.6% of reviewers said the AI caught concerns humans would struggle to catch, 49.4% said it missed things humans would catch, and only 13.8% said it actually changed their evaluation). The characteristic failure modes are weak big-picture judgement on novelty and significance, nitpicking, verbosity, and occasional factual misreadings. Operationally, the pilot cost under $1 per paper and completed in under 24 hours. Further details are condensed in our Appendix 2.

AI meta-review. ICLR 2025 ran the largest controlled experiment to date. A “review feedback agent” scanned more than 20,000 randomly selected reviews for vague comments, claims already addressed in the paper, and unprofessional language, then sent private, optional suggestions to the reviewer before authors saw anything. Results: 27% of recipients revised their review, incorporating over 12,000 suggestions; updated reviews were preferred by blinded human evaluators 89% of the time; and reviewers in the feedback group wrote longer, more substantive author-discussion comments during rebuttal. The study was subsequently published in Nature Machine Intelligence.

Proposals

Wei et al.: Discussion facilitation. Wei et al. (2025) propose a broad range of mostly minor tasks that an AI assistant could perform for the participants in a review discussion: cataloging and summarizing reviews/rebuttals, review synthesis like conflict-and-gap highlighting (e.g., “Reviewer 1 praises novelty, Reviewer 2 says incremental”; “this concern was not addressed in rebuttal”), meta-review drafting, helping authors distinguish misunderstandings from substantive disagreements, and using retrieval-augmented verification (RAV) and/or coding agents to validate reviewer claims against the paper and code. Given the current technology, we think these would have only modest benefit and would be burdensome to implement well. Wei et al. also advocate for community data infrastructure efforts, but these are more appropriately targeted at large venues like OpenReview.

Kim et al.: Review re-structuring and badges. Kim et al. (2025) make three proposals: (1) share LLM-generated reviews with authors only, as both a deterrent against LLM-reliant reviewers and a reference point authors can use to flag suspected LLM reviews; (2) release reviews to authors in two stages—summary, strengths, and clarifying questions first (on which authors rate the reviewer’s comprehension), then weaknesses and overall ratings—to prevent retaliatory scoring; and (3) publicly recognize top-decile reviewers with badges. We are sympathetic to (3) but doubt new-journal badges will carry much weight, and expect signed reviewer abstracts to do more. We are unpersuaded by (1), since authors can already obtain their own LLM reviews and the core problem is reviewer incentives, not detection. We find (2) intriguing but likely not worth the overhead at start-up scale. We discuss this more fully in Appendix 3.

Allen-Zhu and Xu: AI as an arbitrator. The proposals above all leave the basic architecture of peer review intact: humans review, and AI assists. Allen-Zhu and Xu (2025) argue for a more radical restructuring. Their “DOGE” protocol proposes that instead of reviewer and author trying to convince each other—often over multiple frustrating rounds—both parties should try to convince an AI arbitrator.[10] The theoretical grounding is an intelligence hierarchy: authoring a paper (L4) is harder than reviewing it (L3), which is harder than auditing a review (L2), which is harder than arbitrating a discussion where both sides present their arguments (L1). The key claim, supported by experiments on a real ICLR 2025 rejection, is that current frontier models already operate reliably at L1 and are approaching L2—meaning they can follow the logic of a reviewer-author exchange and identify factual errors, even if they cannot yet produce a full expert review from scratch.[11] This is a provocative idea, but the underlying observation—that the interactive structure of arbitration dramatically lowers the capability bar for useful AI participation—is worth taking seriously, and is reminiscent of complexity-theoretic intuitions (IP = PSPACE vs. NP).

What’s different for an alignment journal

Most of the initiatives above were designed for venues that process thousands to hundreds of thousands of submissions per year in established fields. A journal focused on AI alignment faces a different set of constraints:

Bounded resources for experimentation. Even when AI tools work flawlessly as designed, integrating them into the review process requires significant experimentation with users to find what works. We are nimble and optimistic about new applications for LLMs, and we will be relatively well-funded on a per-submission basis, but our resources are still small in absolute terms, and we don’t have a large user base to provide statistical power. We therefore have to ration our attention and effort carefully, focusing on tools whose benefits are clear and arrive quickly.

A young field. Large publishers of journals in well-established fields have thousands of potential reviewers, many of whom have been reviewing in the field for many years. The automated reviewer-discovery tools these publishers use are often conventional and built around large databases of researchers. Although we draw data from databases like OpenReview and Semantic Scholar, we are inclined toward more aggressive, database-free techniques driven by LLMs.

Bridge building. Our goals for the journal include (1) growing the field of alignment by drawing in excellent researchers from neighboring fields and (2) building bridges between academic, industry, and independent researchers. Thus, it is especially valuable if we can help editors find high-quality reviewers whose expertise is relevant but who may not be closely connected through citation or colleague networks.

Interdisciplinary without established conventions. Alignment research draws on machine learning, decision theory, philosophy, game theory, and more. There are few settled methodological conventions, which means integrity checks designed for, say, clinical trials or standard ML benchmarks would not transfer well. We need AI tools that can be customized to our review criteria rather than off-the-shelf pipelines tuned to mainstream fields.

Our general approach and philosophy

Policy toward AI usage by review participants

Our philosophy is that constant experimentation and adaptation are the path forward. When AI tool use by review participants (editors, reviewers, and authors) leads to negative effects at venues designed before LLMs, our inclination is to update the incentives and mechanisms rather than to restrict the tools. We hope to receive strong community input on this.

Tentatively, we plan to adopt a policy where

  • All review participants are free to use AI tools.
  • Participants are responsible for the claims they make in their writing as products of their own judgement. Propagating an error made by an LLM is as serious as repeating a false claim heard offhand from a colleague.
  • When submitting their report, reviewers must disclose AI usage by selecting an appropriate checkbox, which will be visible to all participants in the review discussion.

The first two points resemble how Wikipedia usage is treated in practice by most journals. The last point is different, and is motivated by the additional transparency warranted while norms and expectations around AI usage are evolving quickly.

This candidate policy is inspired by Unjournal’s working policy.

AI usage by the journal

Many existing experiments with LLM assistance amount to the venue nudging authors and reviewers to use LLMs in ways they could do on their own: checking papers for clear errors, critiquing reviewer reports, and so on.

Because editors cannot control participant behavior, the journal itself may want to deploy LLMs where reviewers and authors haven’t, but we expect alignment researchers to be naturally inclined to find useful ways to use AI rather than needing to be pushed. There are also efficiency gains from centralizing the creation and maintenance of custom wrappers, but such wrappers can be brittle and quickly made obsolete by improved models.

Thus, we will be particularly interested in AI tools that cannot be easily replicated by the authors and reviewers themselves. The main exception is immediately following submission (i.e., during desk review), where we face unfiltered submissions of potentially widely varying quality.

Near-term AI tooling

For the reasons described above, we are currently prioritizing the following.

Checkable desk-review assistance. LLMs are still limited in their ability to assess high-level questions requiring integrated understanding of large documents, but they are quite good at finding lower-level mistakes that can be efficiently checked by experts.[12] We are working on a system to identify submissions with significant editor-checkable problems during desk review, i.e., before being sent out to reviewers.

Initially this will just be information provided to the editor. If we build confidence that the system is accurate, the next step would be for a flagged submission to be automatically bounced back to the authors (before it reaches an editor). This would come with an explanation and invitation to re-submit provided the authors affirm that the criticism was considered and any valid problems were fixed. This must be handled with care: it shifts work from editors and reviewers onto authors, and is only justified if the flagged problems are consistently real, significant, and the authors’ responsibility.
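Purely as an illustration of the shape of such a system (not a committed design), a desk-review pass might ask a frontier model for a short list of concrete, quotable problems and return them in a structured form an editor can check quickly. The prompt, the model name, and the expectation of clean JSON output are all placeholder assumptions.

```python
import json
from openai import OpenAI  # any frontier-model client would do; this choice is an assumption

client = OpenAI()

PROMPT = """You are assisting desk review for a research journal.
List up to five concrete, checkable problems in the manuscript below
(factual errors, mathematical mistakes, unsupported or irrelevant citations).
For each, quote the relevant passage and explain the problem in one sentence.
Respond only with a JSON list of objects with keys "quote" and "problem".

MANUSCRIPT:
{manuscript}
"""

def desk_review_flags(manuscript_text: str, model: str = "gpt-5") -> list[dict]:
    """Return editor-checkable flags; the editor, not the model, decides what happens next."""
    resp = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(manuscript=manuscript_text)}],
    )
    # Assumes the model returns bare JSON; a production version would validate this.
    return json.loads(resp.choices[0].message.content)
```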

A minimal-effort version of this would be a modified report from Refine, which is probably the leader in this market.[13] A Refine report would have the advantage of also being usable as an auxiliary report in the review discussion, as discussed below.

Other desk-review indicators. We will likely also give editors additional quick but noisy signals about paper quality, such as off-the-shelf detectors of plagiarism and AI slop. As mentioned, the Alignment journal is very unlikely to adopt a policy against AI usage, since we want authors to use all available tools to produce good manuscripts. But for now, text that is easily detected as mostly AI-generated is strongly correlated with low quality.[14] These signals are nudges for the editor to look more closely, not grounds for rejection in themselves.

LLM-driven reviewer finding. Editors will receive some reviewer suggestions from off-the-shelf keyword- and database-driven tools, but our own development will focus on LLM-based recommendations built for maximal flexibility and editor input. Reviewer suggestion fits LLM strengths and weaknesses unusually well. It rewards encyclopedic knowledge of researchers’ public footprints and the ability to translate concepts across fields, and a bad suggestion costs little — the editor simply dismisses it. We expect this strategy to be more powerful in the long run, and especially well suited for an interdisciplinary and rapidly growing field like alignment.[15]
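One toy version of this flow, purely illustrative: ask a frontier model for candidate names given the abstract, then ground each suggestion against Semantic Scholar’s public author-search endpoint so the editor sees a verified publication record rather than a bare name. The prompt, the model name, and the exact endpoint fields are assumptions to check against current documentation.

```python
import requests
from openai import OpenAI

client = OpenAI()

def suggest_reviewers(abstract: str, n: int = 10) -> list[str]:
    # Placeholder prompt and model; a real system would add conflict-of-interest
    # constraints, editor guidance, and field-bridging instructions.
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content":
                   f"Suggest {n} potential peer reviewers, names only, one per line, "
                   f"for a manuscript with this abstract:\n\n{abstract}"}],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def verify_author(name: str) -> dict | None:
    # Semantic Scholar Graph API author search; field names assumed, check the docs.
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/author/search",
        params={"query": name, "fields": "name,paperCount,hIndex,url"},
        timeout=10,
    )
    hits = r.json().get("data", [])
    return hits[0] if hits else None  # None: the model may have invented the name
```

The editor then sees only candidates with a verified publication record, and dismissing a bad suggestion costs seconds.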

AI reviews. As mentioned, it is infeasible and probably undesirable to prohibit reviewers from using LLMs to prepare their reports. (Similarly, it’s infeasible to prevent reviewers from being lazy by instructing them not to be.) It’s more effective to rely on reputation, trust networks, and track record.

Since reviewers will be free to use their own preferred chatbots, little value is added by the journal providing an additional report the reviewers could quickly obtain themselves. However, there are services like Refine, Reviewer3, and q.e.d Science that claim to provide higher-quality reports with specialized pipelines. The cost (~$30/report) is non-trivial for reviewers but still small compared to what we will spend on the review process. So, if the reviewers find these reports useful, it would be efficient for the journal to automatically make them available to the review participants.[16] We intend to collect reviewer feedback on the usefulness of these reports beyond what they can already produce with their preferred chatbots.

We expect to provide the AI report to reviewers only after they submit their own initial report, mirroring the standard practice for seeing other reviewers’ reports. Although this potentially wastes time if the AI report raises issues that reviewers would want to have addressed, it seems better than inducing reviewers to anchor on one report (whether AI- or human-written).

Future directions

Inspired by the ICLR 2025 experiment, we want to explore giving reviewers optional, private AI feedback on their draft reports—flagging vague claims, pointing to passages in the paper that address a reviewer’s concern, and checking for internal consistency. Because our volume is low, we can afford to have editors audit the AI feedback before it reaches reviewers, adding a human-in-the-loop layer that large conferences cannot.

We are also watching proposals like the DOGE arbitration protocol with interest. The idea that AI can serve as a neutral third party in reviewer-author disputes, rather than as an aide to one side or the other, is a structural innovation that could address some of the deepest pathologies of peer review (reviewer stubbornness, emotional bias, accountability gaps). It may also be especially tractable in a small journal, where the volume of disputes is low enough to audit the arbitrator’s performance carefully.

Finally, we expect AI to change not just how we review but also what we review. Jess Riedel has sketched a future scenario in which researchers explain ideas conversationally to an LLM, which produces a written artifact, which audience members then consume through their own LLM. The resulting papers would look quite different from what we are used to: higher information density, little to no review of existing knowledge, non-linear organization, exhaustive referencing, and explicit uncertainty markers. If something like this materializes—and alignment researchers, as heavy LLM users, are plausible early adopters—then review criteria and desk-rejection heuristics will need to evolve. We do not yet have concrete plans here, but we flag it as a medium-term design pressure that the journal’s processes should be prepared to accommodate.

It’s hard to guess how quickly we’d be able to implement the ideas in this section. Our ability to build and test novel implementations will be constrained by future funding, editorial-board support, and community participation. Currently we’re prioritizing traditional review processes and perturbing from there, but please reach out if you’re interested in contributing on more speculative ideas.

Credits and thanks

This post has been informed by gracious contribution and feedback from @Alexander Gietelink Oldenziel, Seth Lazar, and @Daniel Murfet. All responsibility for errors resides with the authors.


Appendix 1: Anecdotal AI review experiences

This appendix links to some anecdotal experiences of five AI review services: q.e.d Science, Refine, Reviewer3, Stanford Agentic Reviewer, and OpenAIReview.

q.e.d Science:

Refine:

Stanford Agentic Reviewer:

Reviewer3:

  • Diego Ghezzi on LinkedIn (direct test on his own lab paper; says Reviewer3 was more technical and structured than q.e.d Science, but described both as “very useful”)
  • César Hidalgo, blog post (hands-on experiment; says Reviewer3 gave better and more technical feedback than Gemini or GPT, but kept generating new objections on each pass)
  • Faheem Ullah on LinkedIn (mixed review; useful and fast, but also generic in places and weak on deep niche issues)

OpenAIReview:

  • Jessica Hullman on X (ran progressive mode with Opus 4.6; said it caught many of the same minor issues as Refine, but high-level feedback was weaker)
  • Ryan Briggs on X (ran OpenAIReview locally via Claude Code/Gemma 4 on a paper he thought should be desk rejected; said it was good enough to identify that)
  • Quan Le on X (positive but comparative; says OpenAIReview was “very useful” and recommends using it before paying for Refine, which was better)
  • Scalene Peer Review newsletter (says the web-interface reports are “decent” and notes the code can be refined/customized)

Appendix 2: AAAI-26 results

This appendix collects the detailed findings from Biswas et al. (2026) that inform the main text above.

Scale and cost. All 22,977 full-review submissions to the AAAI-26 main track received one AI review. The full run completed in under 24 hours at under $1 per paper — covered by an in-kind API-credit sponsorship from OpenAI, but small enough relative to a conference budget that the cost model would be sustainable without sponsorship.[17]

Quantitative ratings. Across 5,834 Likert-scale survey responses from authors, PC, SPC, and area chairs, AI reviews were rated higher than human reviews on six of nine review-quality criteria, and all nine AI–human mean differences were statistically significant under a Mann–Whitney U test at α = 0.01. The largest AI advantages (on a −2 to +2 scale, reported as mean-difference Δ): identifying technical errors Δ = +0.67, raising previously unconsidered points +0.61, suggesting presentation improvements +0.54, suggesting research-design improvements +0.49, overall thoroughness +0.48. The characteristic disadvantages: overemphasising minor issues −0.38, committing technical errors of their own −0.22, occasional wrong or unhelpful suggestions −0.11. In aggregate, 53.9% of respondents judged the AI reviews useful (vs. 20.2% not); 61.5% expected AI reviews to be useful in future peer review (vs. 14.5%); and 55.6% said the AI reviews demonstrated capabilities beyond what they had expected. Effect sizes were consistently larger for authors than for PC/SPC/AC respondents.

Top qualitative themes (as a percentage of all classified free-form mentions specific to this pilot): positive — Actionable Revision Guidance 5.3%, Breadth and Thoroughness 5.2%, Technical Error Detection 5.0%, Relative Objectivity and Consistency 4.3%, Presentation and Writing Polish 4.2%. Negative — Weak Big-Picture Judgement on Novelty, Significance, and Impact 9.1%, Nitpicking and Overemphasis on Minor Issues 8.5%, Excessive Verbosity and Cognitive Overload 8.3%, Factual Errors and Misreadings 7.7%, Shallow Contextual and Domain Understanding 7.6%.

SPECS review benchmark. Alongside the deployment report, the authors released SPECS, a reusable benchmark built by using an LLM to inject synthetic errors into the LaTeX source of 120 accepted AAAI-25 papers, one error each across five criteria (Story, Presentation, Evaluations, Correctness, Significance), recompiling the paper, and measuring whether a review system explicitly identifies the injected error. On the 783 perturbed versions, the full AAAI-26 pipeline beat a single-prompt LLM baseline with average recall gain +0.21 (p < 10⁻³⁰), with per-criterion gains ranging from +0.15 on Presentation to +0.32 on Story and +0.24 on Evaluations. Each criterion-targeted intermediate stage was most effective at catching its own intended error type, validating the multi-stage decomposition. The curation process is itself reusable — it can be re-run to generate a new benchmark for a different venue.

Principled concerns raised by respondents. A non-trivial minority of free-form responses raised concerns that are worth taking seriously, especially for an alignment-focused venue: that non-voting AI reviews can still mislead decision-makers via anchoring or authority effects; that authors will start optimising papers for AI preferences rather than for scientific quality; that heavy reliance on AI review tools may erode reviewing skill over time; and more fundamentally that AI review undermines the trust, effort, and interpersonal accountability that peer review is supposed to embody. These arguments did not dominate the survey, but they are precisely the failure modes that tend to compound slowly and are hardest to detect post hoc.

Limitations acknowledged by the authors. Self-selection bias in the survey is noted; the negative-theme count being larger than the positive-theme count is consistent with well-documented negativity bias in open-ended responses. The citation-hallucination rate in a 100-review audit was low (2 of 1,356 cited references flagged as possibly fabricated, and on manual inspection both turned out to be real citations that the automated tool misclassified). The authors flag length-of-review as straightforwardly fixable via tighter output controls and flag criterion-weighting (what counts as a “significant” concern vs. a minor one) as an area of ongoing research.

Appendix 3: Assessment of Kim et al.

This appendix expands on our response to Kim et al., discussed above.

Kim et al. (2025) suggest providing LLM-generated reviews to the authors, but not to reviewers, as part of the review process. They explain:

The inclusion of LLM-generated reviews serves two main purposes: (1) LLM reviews act as a psychological deterrent against the few irresponsible reviewers who might otherwise rely entirely on LLMs for evaluations, as they know the System is already incorporating such automated reviews, and (2) it provides authors with a soft reference point to identify and flag potential LLM-generated reviews, detailed in our second proposal.

We do not find this compelling. Authors can easily obtain their own reviews from any of the commercially available LLMs. The issue with LLM input into reviews is not detecting it or dissuading reviewers from using it; we hope and expect that most reviewers use LLMs for tasks that LLMs do well, and furthermore that this set of tasks will grow over time. The issue is with reviewers relying on LLMs inappropriately, i.e., to handle tasks LLMs currently struggle with. Using current LLM stylistic quirks (like em dashes and “it’s not X, it’s Y” structure) to detect lazy reviewers is unlikely to work for more than a short while, and does not address the real problem: incentivizing reviewers to produce high-quality reviews.

It’s editors, not authors, who should build expertise in dealing with reviews that suffer from reckless LLM usage (or other problems): editors read many such reviews and develop an intuition for their typical failure modes. This is not a task that should be pushed onto the authors.

As part of “meta-review”, Kim et al. suggest collecting author ratings of reviewer reports, an old idea we expect is modestly useful. They go on to propose that the rating be based only on the author’s reading of the positive parts of the report:

We propose a sequential release of review content rather than the conventional simultaneous disclosure of all reviews and ratings. Specifically, we divide review content into two distinct sections: section one includes neutral to positive elements, including paper summary, strengths, and clarifying questions, while section two contains more critical parts such as weaknesses and overall ratings. Between these releases, authors evaluate reviews from section one based on two key criteria: (1) the reviewer’s comprehension of the paper and (2) the constructiveness of questions in demonstrating careful reading. During this feedback phase, authors can also flag potential LLM-generated reviews by comparing them with the provided LLM reviews. This two-stage disclosure prevents retaliatory scoring while providing the minimal safeguards necessary for a fair review. Once authors complete their feedback, section two is promptly disclosed, and the authors are not allowed to modify their evaluations.

We find this idea interesting, but we worry it will generate more overhead and complications than it’s worth, at least at the start-up scale. It’s true the beginning of a reviewer’s report often functions in part to demonstrate that the reviewer understands the material, but we don’t wish to create a separate new component that essentially functions as an author-judged (and probably LLM-gameable) test of reading comprehension. If we decide to collect author ratings of reviewer reports, it seems better to statistically correct (condition) them on the reviewer’s rating of the manuscript than to force the reviewer to adopt a particular decomposition of the review into positive and negative parts where the former is supposed to avoid leaking information about the latter.
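For concreteness, one way to “condition” author ratings on the reviewer’s score is to regress the former on the latter and work with the residuals. A toy sketch with made-up numbers:

```python
import numpy as np

# Hypothetical data: each entry is (reviewer's score of the manuscript,
# author's rating of that reviewer's report).
manuscript_scores = np.array([2, 3, 5, 6, 8, 9], dtype=float)
author_ratings = np.array([2, 3, 3, 5, 7, 8], dtype=float)

# Fit author_rating ~ a + b * manuscript_score; the residual is how much better
# or worse the author rated the review than its favorability alone predicts.
b, a = np.polyfit(manuscript_scores, author_ratings, deg=1)
adjusted = author_ratings - (a + b * manuscript_scores)
print(np.round(adjusted, 2))
```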

Finally, Kim et al. propose tracking reviewer quality and recognizing high-quality reviewers with badges (top 10% and top 30%). We generally support this; gathering data on reviewer quality is prudent, and we plan to do so insofar as it can be collected with minimal burden on the authors and reviewers. That said, we expect it will be hard (not impossible) to create badges or awards that will be strongly valued by the community. This is especially true for a new journal, as opposed to the major ML conferences to which Kim et al.’s ideas are addressed. Instead, we expect that signed reviewer abstracts will be a much more powerful and immediately valued source of reviewer recognition. Whenever possible: show, don’t tell.

  1. ^ Section 3 of Wei et al. (2025) has a summary of academic research on AI in peer review.
  2. ^ (Action) editors at journals play a role that is basically equivalent to an area chair at ML conferences: they moderate the peer review discussion and usually make a final determination about the manuscript. In some circumstances, more senior editors (equivalent to senior area chairs or program chairs) will weigh in.
  3. ^ Like many journals, the Frontiers family emphasizes that the AI tool flags issues but never makes decisions.
  4. ^ See Scite.ai, Manusights’ Citation Claim Checker, and SemanticCite. Naively, it seems like this sort of capability should be integrated into a comprehensive AI review service rather than served separately. Since this is the direction we expect things to go, we are reluctant to invest time integrating such a service into our pipeline.
  5. ^ Publons’ Global State of Peer Review 2018, describing a 2016 survey result.
  6. ^ Stanford Agentic Reviewer and Paperpal Review are currently not suited for editorial use; the former is limited to 15 pages and the latter is integrated into writing assistant software for authors. Alchemist Review is integrated with Clarivate’s ScholarOne publishing platform and appears primarily marketed through publisher/platform partnerships. There are also a number of AI-review writing systems that are at the research stage but not generally available as polished services. See these links (1, 2, 3, 4, 5) and references therein for descriptions and comparisons. Of these, DeepReviewer seems to have been very recently integrated into a larger commercial offering at DeepScientist.cc.
  7. ^ It’s notable that OpenAIReview is open source, which has a different risk profile than commercial services.
  8. ^ Authors affiliated with the services have put out various comparisons: (1) Humans vs Reviewer3 vs GPT-5.2 vs Gemini 3 Pro, by Reviewer3-affiliated authors. (2) q.e.d Science vs human Review Commons reviews, by q.e.d Science-affiliated authors. (3) DeepReviewer 2.0 vs Gemini-3.1-Pro-preview vs human ICLR reviews, by DeepReviewer-affiliated authors.
  9. ^ Purpose-led Publishing is a coalition consisting of the physical-sciences publishing houses AIP, APS, and IOP.
  10. ^ If the AI did not add anything, then an N-round reviewer-author discussion would presumably still take N rounds when intermediated by the AI. But with good AI arbitration the discussants can each interact rapidly with the arbitrator, who could lower the round count by pressing for details early, raising likely counterarguments, etc.
  11. ^ We believe this holds much more promise for factual disputes than questions of the paper’s importance, which often come down to knowledge and experience — “taste” — that is slow or impossible to articulate.
  12. ^ This includes factual errors, mathematical mistakes, unreliable inferences, and neglecting standard counterarguments. Basically anything where one could quote a paragraph or two from the paper, point out what’s wrong, and have an editor confidently agree.
  13. ^ Of course, Refine reports cost at least $30 (unless we qualify for an institutional discount); this is small compared to how much we intend to spend on the review process of a serious paper, but it would be unsustainable if we got a large influx of low-quality or malicious submissions. So there will need to be some basic sanity check first, with human eyeballs and/or cheaper LLM calls.
  14. ^ Of course, once AI writing becomes so good that it cannot even be mimicked by humans, we expect the sign of this correlation to flip!
  15. ^ Interestingly, even the semantic embedding models like SPECTER2 use the citation graph as the primary supervision signal for which topics are related, which is a weakness for bridging disconnected research communities. An extreme example illustrates this: Suppose there are two communities that are working on conceptually equivalent topics, but they never co-author, never cite each other, and use non-overlapping terminology. Because their citation graphs are disconnected, SPECTER2 would not learn to embed their papers in nearby regions of topic space, but an LLM could potentially understand that the communities were doing the same thing and could recommend that they review each other. Whether editors find LLM-sourced recommendations useful in practice remains to be seen, although we have gotten some positive anecdotal reports.
  16. ^ Another minor benefit: the reports take 5–30 minutes to generate, but the journal can have them prepared automatically ahead of time.
  17. ^ Costs for social-science papers look similar: on 60 social-science papers, top reasoning models (GPT-5 Pro, GPT-5.2 Pro) ran at $0.85–$0.96 per paper, non-reasoning models were 1–2 orders of magnitude cheaper, and total compute for 208 evaluation runs was $75.77.