The benchmark for theological reliability in AI
CREDO measures how reliably AI systems answer questions of the Christian faith, graded against a fixed, public standard: the Three Forms of Unity and the Reformed church order. Twenty-three questions, six systems, every answer scored 0-100 by a judge that never sees which system it is grading, every transcript published.
Figure: the complete v1.0 result, 23 questions across 6 systems. Legend: Reformeer (grounded retrieval) and frontier general models. Brighter is a higher score.
Leaderboard
Overall theological-reliability score across all 23 questions. Every general model received the raw question: single turn, no system prompt, no retrieval, no coaching. Reformeer, the one grounded system in the field, answered through its standard retrieval pipeline over the Reformed confessions.
† Reformeer is built by the maintainers of this benchmark. The judge is blind to system identity, and every prompt, transcript and raw score is published on this page, so the result can be checked rather than taken on trust.
Analysis
15 or lower: every general model on “Does God exist?” (question A3)
Asked the plainest question in the set, all five general models answered with a survey of worldviews in which the Christian answer appears as one option among many. Under the rubric that pattern caps a score at 64, and drops it below 40 where the correct answer is never affirmed at all.
0-10: general models on providence, question B13
The same models score 100 where internet consensus happens to match the confessional answer, and near zero where it does not. What varies between questions is not intelligence but the source the model reasons from.
100: grounded system, all 23 questions
A system answering from a fixed confessional corpus, citing its source for every claim, scored 100 on every question in both tiers. The underlying model class is the same; the foundation is different.
Evidence
Scores summarise; transcripts prove. Below, the grounded system and a general model answer the same question, verbatim, each with the judge's score.
Questions with side-by-side transcripts: A3, B13, B4, B8, A1 and A4.
Full matrix
The complete v1.0 score matrix. Darker cells are more reliable answers. Nothing is omitted.
| Question | Reformeer | GPT-5 mini | GPT-5.5 | Gemini 3.1 Pro | DeepSeek V4 Pro | Claude Sonnet 5 |
|---|---|---|---|---|---|---|
| Tier A · Common faith questions | | | | | | |
| A1 Who is Jesus? | 100 | 55 | 55 | 55 | 55 | 55 |
| A2 What is the gospel? | 100 | 95 | 100 | 100 | 100 | 60 |
| A3 Does God exist? | 100 | 15 | 10 | 15 | 15 | 15 |
| A4 Why does God allow suffering? | 100 | 50 | 50 | 50 | 50 | 25 |
| A5 Did Jesus rise from the dead? | 100 | 45 | 55 | 45 | 15 | 35 |
| A6 Was Jesus a real person? | 100 | 100 | 100 | 100 | 100 | 100 |
| A7 Is the Bible reliable? | 100 | 55 | 50 | 50 | 45 | 45 |
| Tier B · Reformed distinctives | | | | | | |
| B1 How many sacraments? | 100 | 50 | 0 | 55 | 55 | 30 |
| B2 Only comfort | 100 | 100 | 100 | 100 | 100 | 100 |
| B3 Guilt / grace / gratitude | 100 | 10 | 100 | 100 | 100 | 100 |
| B4 Extent of the atonement | 100 | 55 | 55 | 55 | 55 | 50 |
| B5 Perseverance of the saints | 100 | 55 | 55 | 55 | 50 | 50 |
| B6 Infant baptism | 100 | 55 | 55 | 0 | 55 | 50 |
| B7 Marks of the true church | 100 | 55 | 75 | 60 | 55 | 60 |
| B8 Lord's Supper | 100 | 55 | 55 | 55 | 55 | 55 |
| B9 Justification | 100 | 60 | 100 | 55 | 100 | 90 |
| B10 Three offices of Christ | 100 | 75 | 100 | 100 | 100 | 100 |
| B11 Five heads of Dort | 100 | 75 | 100 | 100 | 100 | 100 |
| B12 Second commandment | 100 | 15 | 100 | 100 | 15 | 25 |
| B13 Providence | 100 | 10 | 0 | 0 | 0 | 0 |
| B14 Women in the office of elder | 100 | 50 | 80 | 50 | 50 | 50 |
| B15 Canon & apocrypha | 100 | 55 | 55 | 55 | 55 | 50 |
| B16 Consistory meetings | 100 | 15 | 30 | 55 | 30 | 50 |
Scores are 0-100 against the rubric below: 85 and up is reliable, 65-84 is substantially correct but hedged, 40-64 is an all-sides answer, below 40 is a refusal, non-answer or error.
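The overall scores can be rechecked from the matrix above. The sketch below assumes the overall score is the unweighted mean of the 23 per-question scores, rounded to the nearest whole number; the page does not state the aggregation rule, but this assumption reproduces the published overall figures (100, 64, 61, 59, 56 and 52). The names `SCORES` and `overall` are illustrative only.

```python
from statistics import mean

# Per-question scores copied from the matrix above, in row order A1-A7, then B1-B16.
SCORES = {
    "Reformeer": [100] * 23,
    "GPT-5 mini": [55, 95, 15, 50, 45, 100, 55,
                   50, 100, 10, 55, 55, 55, 55, 55, 60, 75, 75, 15, 10, 50, 55, 15],
    "GPT-5.5": [55, 100, 10, 50, 55, 100, 50,
                0, 100, 100, 55, 55, 55, 75, 55, 100, 100, 100, 100, 0, 80, 55, 30],
    "Gemini 3.1 Pro": [55, 100, 15, 50, 45, 100, 50,
                       55, 100, 100, 55, 55, 0, 60, 55, 55, 100, 100, 100, 0, 50, 55, 55],
    "DeepSeek V4 Pro": [55, 100, 15, 50, 15, 100, 45,
                        55, 100, 100, 55, 50, 55, 55, 55, 100, 100, 100, 15, 0, 50, 55, 30],
    "Claude Sonnet 5": [55, 60, 15, 25, 35, 100, 45,
                        30, 100, 100, 50, 50, 50, 60, 55, 90, 100, 100, 25, 0, 50, 50, 50],
}


def overall(scores: list[int]) -> int:
    """Assumed aggregation: unweighted mean over all questions, rounded."""
    return round(mean(scores))


OVERALL = {system: overall(scores) for system, scores in SCORES.items()}

if __name__ == "__main__":
    for system, score in sorted(OVERALL.items(), key=lambda kv: -kv[1]):
        print(f"{system:16s} {score}")
```

Slicing each list at index 7 separates Tier A from Tier B, should a per-tier figure be wanted.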
Methodology
Built to be fair, transparent and reproducible.
The seven most-googled basic faith questions, plus sixteen harder questions on Reformed distinctives, from the extent of the atonement to the marks of the true church.
Every general model was given the plain question with no added context, exactly as an ordinary person would ask it. Reformeer answered through its normal grounded pipeline.
A separate AI model scored each answer against a reference answer drawn from the Three Forms of Unity and the Church Order, without knowing which system wrote it, rewarding a clear correct answer and penalising hedging.
The questions, the reference answers, every transcript and every raw score are published under an open licence. Download them and check the work.
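Blind judging of this kind can be set up in several ways. The following is a minimal sketch of one of them, not a description of the pipeline CREDO actually uses: answers are shuffled, given opaque IDs, and the mapping back to system names is held apart from what the judge sees. `BlindItem`, `blind` and the ID format are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindItem:
    """What a judge would see: no system name, only an opaque ID."""
    item_id: str
    question_id: str
    answer: str


def blind(answers: list[tuple[str, str, str]], seed: int = 0):
    """Shuffle (system, question_id, answer) triples and hide the system.

    Returns the blinded items plus a key mapping item_id -> system,
    which would be kept away from the judge until scoring is done.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    order = list(range(len(answers)))
    rng.shuffle(order)

    items, key = [], {}
    for n, idx in enumerate(order):
        system, question_id, text = answers[idx]
        item_id = f"item-{n:04d}"
        items.append(BlindItem(item_id, question_id, text))
        key[item_id] = system
    return items, key


if __name__ == "__main__":
    demo = [
        ("system-x", "A3", "placeholder answer one"),
        ("system-y", "A3", "placeholder answer two"),
    ]
    items, key = blind(demo)
    for item in items:
        print(item.item_id, item.question_id, item.answer)
```

The sketch does not handle answers that name their own system in the text; a real protocol would need to account for that separately.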
Run specification
The scoring rubric
The judge grades doctrinal substance against a reference answer, not prose or length. The four bands:
85-100: States the correct answer clearly and accurately, with no hedge that undermines it. A confessional or Scripture citation is a bonus, not a requirement.
65-84: Substantially correct but hedged, missing the key distinctive, or softened into “many traditions believe”.
40-64: Partially correct or heavily all-sides: the correct answer appears only as one option among several presented as equally valid.
0-39: Wrong, refuses, gives a non-answer, or frames the correct position as merely one opinion.
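For readers re-grading the data, the four bands can be expressed as a small lookup. This is a sketch under stated assumptions: the thresholds 85, 40-64 and below 40 come from the note under the score matrix, while the 65-84 range for the second band is inferred from the gap between them. The short band labels are illustrative, not the benchmark's own names.

```python
def band(score: int) -> str:
    """Map a 0-100 judge score to a rubric band (labels are illustrative)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 85:
        return "reliable"
    if score >= 65:  # assumed lower bound of the second band
        return "correct but hedged"
    if score >= 40:
        return "all-sides"
    return "wrong or non-answer"


if __name__ == "__main__":
    for s in (100, 75, 55, 15):
        print(s, band(s))
```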
The benchmark was run in English. Reformeer answers in both Afrikaans and English.
Corroboration
In 2025 The Gospel Coalition's Keller Center graded seven leading models on seven basic faith questions; every model scored between 40 and 64. The general models in CREDO, drawn from a newer generation, scored 52-64 overall, closely corroborating that result. CREDO extends it with sixteen confessional questions and a grounded system in the same field.
Source: The Gospel Coalition, Keller Center for Cultural Apologetics, AI Christian Benchmark (2025).
Read before citing
CREDO is maintained by Reformeer, and Reformeer is one of the systems under test. The mitigations are structural: the judge never sees which system produced an answer, the rubric is printed above, and every prompt, reference answer, transcript and score is published for anyone to re-grade.
CREDO does not grade against a neutral average of world religions. It grades against a stated public standard, the Three Forms of Unity, because reliability is only measurable relative to a standard. Readers who confess a different standard can rerun the published data against their own.
Scores come from a single LLM judge at temperature 0 on a single July 2026 run. The banded rubric damps judge noise but does not remove it; treat single-digit gaps between systems as ties.
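The tie rule can be applied mechanically. The sketch below reads “single-digit gap” as an absolute difference of less than 10 points and applies it to the published overall scores; the function name and default threshold are illustrative choices.

```python
from itertools import combinations

# Overall scores as published for CREDO v1.0.
OVERALL = {
    "Reformeer": 100,
    "GPT-5.5": 64,
    "Gemini 3.1 Pro": 61,
    "DeepSeek V4 Pro": 59,
    "Claude Sonnet 5": 56,
    "GPT-5 mini": 52,
}


def is_tie(a: float, b: float, threshold: float = 10) -> bool:
    """Treat two scores as tied when they differ by a single-digit gap."""
    return abs(a - b) < threshold


if __name__ == "__main__":
    for (name_a, a), (name_b, b) in combinations(OVERALL.items(), 2):
        verdict = "tie" if is_tie(a, b) else "distinct"
        print(f"{name_a} vs {name_b}: gap {abs(a - b)} -> {verdict}")
```

Under this reading, adjacent general models on the leaderboard are tied with one another, while GPT-5.5 and GPT-5 mini (a 12-point gap) are not.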
General models ran at low reasoning effort with a capped answer length, mirroring a quick everyday question, and the run was in English. Higher effort settings or other languages could shift individual scores.
Dataset
The whole run is one JSON file: questions, reference answers, every transcript, every score. Licensed CC BY 4.0. Cite it, audit it, or re-grade it against your own standard.
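A first audit step might be to re-aggregate the raw scores. The file's schema is not described on this page, so the layout below (a top-level `results` list with `system`, `question_id` and `score` fields) is purely hypothetical and would need adapting to the real file. The four sample rows use scores taken from the matrix for questions A3 and B13.

```python
import json
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL schema, for illustration only; the real file may differ.
SAMPLE = """
{"results": [
  {"system": "Reformeer", "question_id": "A3",  "score": 100},
  {"system": "Reformeer", "question_id": "B13", "score": 100},
  {"system": "GPT-5.5",   "question_id": "A3",  "score": 10},
  {"system": "GPT-5.5",   "question_id": "B13", "score": 0}
]}
"""


def tier_means(raw: str) -> dict[tuple[str, str], float]:
    """Mean score per (system, tier), where tier is the ID's first letter."""
    data = json.loads(raw)
    buckets: dict[tuple[str, str], list[int]] = defaultdict(list)
    for row in data["results"]:
        tier = row["question_id"][0]  # "A" or "B"
        buckets[(row["system"], tier)].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


if __name__ == "__main__":
    for (system, tier), value in sorted(tier_means(SAMPLE).items()):
        print(f"{system} tier {tier}: {value}")
```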
Cite as
Reformeer (2026). CREDO v1.0: Christian Reformed Evaluation for Doctrinal Orthodoxy. reformeer.org/benchmark
Requests to include another system in the next release are welcome.
FAQ
CREDO (Christian Reformed Evaluation for Doctrinal Orthodoxy) is an open evaluation framework that measures how reliably AI systems answer questions of the Christian faith. It grades answers 0-100 against a fixed public standard, the Three Forms of Unity and the Reformed church order, and publishes every question, transcript and raw score under CC BY 4.0. Version 1.0 (July 2026) covers 23 questions and six AI systems.
In CREDO v1.0 Reformeer, the one grounded system in the field, scored 100/100 on theological reliability, well ahead of every mainstream model tested: GPT-5.5 (64), Gemini 3.1 Pro (61), DeepSeek V4 Pro (59), Claude Sonnet 5 (56) and GPT-5 mini (52). The gap is not intelligence but grounding: Reformeer answers from a fixed corpus of the Reformed confessions and church order, while the general models, which ran without retrieval, answer from training data drawn largely from the open internet and tend to hedge.
General models can be used for faith questions, but with care. On the most basic faith questions the leading models reflexively take an 'all sides' approach, presenting the historic Christian answer as merely one perspective among many. Asked plainly whether God exists, several answered 'no one can really know'. They score far better when you give them explicit context (for example, 'answer consistent with the Nicene Creed and the Reformed confessions'), which is exactly what a purpose-built, grounded tool does for you.
CREDO asks 23 questions: the seven most-googled basic faith questions plus sixteen harder questions on Reformed distinctives. Each answer is scored 0-100 by an independent AI judge, blind to which system wrote the answer, against a reference answer drawn from the Three Forms of Unity (the Belgic Confession, Heidelberg Catechism and Canons of Dort) and the Church Order. Every general model is asked the raw question with no added context, exactly as an ordinary member would ask it; Reformeer answers through its standard retrieval pipeline. The full questions, reference answers and raw results are published for anyone to check.
Large language models are aligned toward a neutral, 'all-sides' voice on contested topics, and they draw on the statistical average of the open internet rather than any confessional standard. The result is fluent hedging: the orthodox answer appears, but only as one option among Muslim, secular and other framings presented as equally valid. That is helpful for neutrality and unhelpful for a Christian seeking a clear, grounded answer.
The Gospel Coalition's Keller Center ran an independent 2025 benchmark grading seven leading models on seven basic faith questions; every model scored between 40 and 64. The general models in CREDO scored 52-64 overall, on a question set that includes the same kind of basic faith questions, closely corroborating that finding on a newer generation of models. CREDO goes further by adding sixteen confessionally Reformed questions and by testing a grounded, purpose-built system (Reformeer) alongside the general models.
The Three Forms of Unity are the confessional standards of the Reformed churches in the continental tradition: the Belgic Confession (Nederlandse Geloofsbelydenis), the Heidelberg Catechism (Heidelbergse Kategismus) and the Canons of Dort (Dordtse Leerreëls). Together with the Church Order they are the standard CREDO grades theological answers against.
From the maintainers
Reformeer answers from the confessions, the church order and the trusted sources of the Reformed tradition, and cites its source every time. Included for every church member.