Independent Research · 6 Studies · 27 Detectors

AI detectors are less accurate
than you've been told.

Every AI humanizer site picks the one detector they beat and puts it on their homepage. Here's the full picture — unfiltered data from 6 independent academic studies testing 27 different AI detection tools. Some are excellent. Some are barely better than a coin flip.

97%
Best detector (Originality.ai)
7%
Worst detector (ContentDetector.ai)
40%
OpenAI's own detector on GPT-4
27
Detectors tested across studies
🔬

About this data

The scores below come from 6 peer-reviewed studies and independent benchmarks that tested AI detection tools on different content types: GPT-4 papers, adversarially rephrased articles, oncology abstracts, student-written essays, and LLM-generated content. Most scores are accuracy percentages; Study 3 reports accuracy at a fixed 5% false positive rate and Study 5 reports mean AUROC, so compare across studies with care. In every study, higher means better at detecting AI. "Not included" means the detector wasn't tested in that study. We haven't cherry-picked: this is the full dataset.

Notable findings

The numbers that matter

Before the full table, here are the three findings that every content creator and student should know.

97%
Originality.ai
The most accurate detector tested

Scored 97, 100, 85, 100, 87.9, and 86.5 across all six studies. Consistently the hardest detector to beat — and the one most professional content platforms and publishers use. If you're writing for SEO agencies or publishers, this is the one that matters.

7%
ContentDetector.ai
Far below random chance

Scored 7% on GPT-4 papers in Study 2. In other words, it correctly flagged AI-generated text only 7% of the time, far worse than guessing at random. Any tool that advertises "passes ContentDetector.ai" is gaming a broken tool. This is not a flex.

40%
OpenAI Detector
Can't reliably detect its own output

OpenAI's own AI detection tool scored just 40% accuracy on GPT-4 papers. The company that built GPT-4 can't reliably detect when it was used. OpenAI eventually shut down their classifier in 2023, citing low accuracy. This tells you everything about how hard this problem actually is.

Full dataset

All 27 detectors, all 6 studies

Accuracy scores from each study. Higher = better at detecting AI-generated content. Color coding: Green = strong (≥80%), Amber = moderate (60–79%), Red = weak (<60%).

Detector | Study 1: AH&AITD Accuracy | Study 2: GPT-4 Papers | Study 3: RAID Benchmark | Study 4: LLM-Generated | Study 5: Oncology Abstracts | Study 6: Students & LLMs | Verdict
Originality.ai 97.09 100 85 100 87.9 86.5 Most Accurate
GPTZero 63.77 57 67 45 77.28 84.5 Inconsistent
Writer 69.05 40 Weak
CopyLeaks 100 Limited Data
Turnitin 100 62 Varies by Content
ZeroGPT 83 66 92 85 Decent
Sapling 66.66 33 75.92 Inconsistent
Winston AI 71 77 Moderate
Binoculars 80 Limited Data
FastDetectGPT 74 Limited Data
SEO.ai 83 Limited Data
GPTKit 55.29 Weak
GPT-2 Output Detector 85 Limited Data
Scribbr 69 Limited Data
Crossplag 69 Limited Data
Grammica 62 Weak
Zylalab 68.23 Weak
Content at Scale 52 50 Weak
GPTRadar 31 71 Inconsistent
OpenAI Detector 40 Unreliable
IvyPanda 40 Unreliable
ContentDetector.ai 7 Broken
GLTR 63 Weak
RoBERTa-Base (GPT2) 59 Outdated
RoBERTa-Large (GPT2) 57 Outdated
RoBERTa-Base (ChatGPT) 45 Outdated
LLMDet 35 Broken
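
The color bands described above the table can be written as a small lookup. This is a minimal sketch, not code from any of the studies: the function name `band` is hypothetical, and treating every score from 60 up to (but not including) 80 as amber is an assumption, since the page only states "60–79". The sample scores are copied from the table for detectors that appear in a single study.

```python
def band(score: float) -> str:
    """Map a 0-100 score to the color band used on this page."""
    if score >= 80:
        return "green (strong)"
    if score >= 60:
        # Assumption: fractional scores such as 79.5 count as amber.
        return "amber (moderate)"
    return "red (weak)"


# Scores copied from the table above (single-study detectors only).
single_study_scores = {
    "Binoculars": 80,
    "FastDetectGPT": 74,
    "OpenAI Detector": 40,
    "ContentDetector.ai": 7,
}

bands = {name: band(score) for name, score in single_study_scores.items()}

for name, label in bands.items():
    print(f"{name}: {label}")
```

A band describes one score in one study. It says nothing about how the same detector behaves on other content.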
What this means

The real takeaways

What every writer, student, and content creator should understand before trusting any AI detector — or any tool that claims to beat one.

🎭

Every site picks the detector they beat

When you see "passes GPTZero!" on a competitor's homepage, check if they also mention Originality.ai. They don't. Cherry-picking one favorable detector while ignoring the accurate ones is the industry standard for misleading marketing.

📚

Content type matters enormously

The same detector can score 45% on one type of content and 84.5% on another: GPTZero scored 45 on LLM-generated content but 84.5 on student essays. "Bypasses GPTZero" means nothing without specifying what kind of content.
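
To put a number on that spread, here is a rough sketch using only the two detectors that have a score in all six studies. The values are copied from the table; the helper name `summarize` is hypothetical. Because the studies use different metrics and content types, the mean is a loose summary at best, and the spread is the more telling figure.

```python
from statistics import mean

# Scores for Study 1 through Study 6, copied from the table.
scores = {
    "Originality.ai": [97.09, 100, 85, 100, 87.9, 86.5],
    "GPTZero": [63.77, 57, 67, 45, 77.28, 84.5],
}


def summarize(values):
    """Return the lowest score, highest score, their gap, and the mean."""
    lowest, highest = min(values), max(values)
    return {
        "min": lowest,
        "max": highest,
        "spread": round(highest - lowest, 2),
        "mean": round(mean(values), 2),
    }


summary = {name: summarize(values) for name, values in scores.items()}

for name, stats in summary.items():
    print(name, stats)
```

GPTZero's gap between its best and worst study is 39.5 points, against 15 for Originality.ai.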

🔄

Detectors update constantly

These scores are snapshots in time. Originality.ai, GPTZero, and Turnitin all update their models regularly. A tool that beats a detector today may not beat it next month. Anyone guaranteeing permanent bypass rates is making a promise they can't keep.

🏫

Turnitin scored 62 on LLM text

Universities make Turnitin sound infallible. The data says otherwise: it scored 100 on GPT-4 papers (Study 2) but only 62 on ChatGPT-generated and AI-rephrased articles (Study 4). Context is everything.

Research sources

The 6 studies behind this data

All data sourced from peer-reviewed papers and independent benchmarks. We've linked the original research where available.

Study 01
Empirical Study of AI-Generated Detection Tools
Accuracy comparison of AI text detection tools on the AH&AITD dataset. Tested Originality.ai, GPTZero, Writer, Sapling, GPTKit, and Zylalab.
Study 02
Effectiveness of Software Designed to Detect AI-Generated Writing
Percentage correct for GPT-4 papers across 16 detectors including Turnitin, CopyLeaks, ContentDetector.ai, and OpenAI's own classifier.
Study 03
RAID: Benchmark for Robust Machine-Generated Detectors
Mean accuracy at FPR=5% for detectors on non-adversarial outputs. Tested Binoculars, FastDetectGPT, RoBERTa variants, and others.
Study 04
Great Detectives: Humans vs. AI Detectors in LLM-Generated Writing
Mean accuracy for ChatGPT-generated and AI-rephrased articles across GPTZero, ZeroGPT, Turnitin, Content at Scale, and GPT-2 Output Detector.
Study 05
Characterizing AI Content Detection in Oncology Scientific Abstracts
Mean AUROC for distinguishing GPT-3.5-generated from human-written abstracts in a medical/scientific context. Tested Originality.ai, GPTZero, and Sapling.
Study 06
Students Using LLMs and AI Detectors
Mean accuracy for (a) human vs. AI and (b) human vs. disguised content. Tested Originality.ai, GPTZero, ZeroGPT, and Winston AI.
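
Two of the metrics named above, accuracy at FPR=5% (Study 03) and AUROC (Study 05), measure different things. The sketch below shows one common way to compute each from raw detector scores. It is not code from either study: the function names are hypothetical and the score lists are invented purely for illustration.

```python
def threshold_at_fpr(human_scores, target_fpr=0.05):
    """Pick a threshold so at most target_fpr of human texts score above it."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag by mistake.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    # Only scores strictly above this value get flagged as AI.
    return ranked[allowed]


def detection_rate(ai_scores, threshold):
    """Fraction of AI texts flagged at the given threshold."""
    return sum(s > threshold for s in ai_scores) / len(ai_scores)


def auroc(human_scores, ai_scores):
    """Chance that a random AI text outscores a random human text (ties = 0.5)."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Hypothetical "probability of AI" scores, invented for this example.
human = [0.02, 0.05, 0.08, 0.10, 0.11, 0.13, 0.15, 0.18, 0.20, 0.22,
         0.25, 0.27, 0.30, 0.33, 0.35, 0.40, 0.45, 0.50, 0.62, 0.90]
ai = [0.30, 0.48, 0.55, 0.60, 0.66, 0.71, 0.80, 0.88, 0.93, 0.97]

threshold = threshold_at_fpr(human, target_fpr=0.05)
rate = detection_rate(ai, threshold)
area = auroc(human, ai)

print(f"threshold={threshold}, detection rate at 5% FPR={rate:.2f}, AUROC={area:.4f}")
```

On this made-up data the same detector catches 60% of AI texts at a 5% false positive rate yet has an AUROC of about 0.91, which is why scores from Study 03 and Study 05 should not be read on the same scale.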

Naturaly shows you real scores.
Not cherry-picked ones.

Our built-in detector checks against the tools that actually matter — and we tell you when your score isn't where you need it. No fake 99% guarantees. Just honest results and better writing.

Try Naturaly free — no card needed →
AI Humanizer · Resume Tailor · Live detector scores · $12/month all features