A new study put 13 open-source AI penetration-testing agents through a single, unified benchmark and asked the question every security leader is quietly asking: are these things hackers, or hallucinators? After four months, more than ten billion tokens, and a panel of human experts reading the logs, the honest answer is "both." Here is my read, and what it means for how you should actually use AI in offensive security.
Source: "Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing," Peng et al., arXiv:2604.05719 (April 2026).
What the study actually did
This is the most thorough look yet at autonomous AI pentest tools. The authors systematized 13 representative open-source frameworks plus 2 baselines across six dimensions: agent architecture, planning, memory, execution, external knowledge, and benchmarks. Then they ran them all under one unified benchmark and had human experts review the execution logs.
The scale is the tell. Evaluating these agents consumed more than ten billion tokens, produced over 1,500 execution logs, and required 15 or more cybersecurity experts reviewing those logs over four months.
Read that last part again. Even to find out whether AI can pentest on its own, you need a room full of senior humans reading what it did.
The uncomfortable middle
The hype says AI replaces pentesters. The cynics say it is a toy that hallucinates. The research lands in the uncomfortable middle, and that is the useful place to stand.
These agents are genuinely capable. They do reconnaissance, plan multi-step attacks, chain known techniques, and move faster than any human at the repetitive work.
They are also unreliable on their own. They report findings that are not there. They declare success on exploits that never landed. They walk past the business-logic flaws that actually get companies breached, because those need an understanding of intent, not pattern-matching on payloads.
Breadth without judgment is noise. A 200-item "critical" list that is half hallucination is worse than no list at all, because it buries the three findings that matter.
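To put a number on that, here is a minimal Python sketch of the arithmetic. The counts are illustrative and come from the hypothetical list above, not from the study, and the names ReportItem and signal_ratios are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportItem:
    """One line in an agent's report (hypothetical structure)."""
    title: str
    is_real: bool   # a human confirmed the issue exists
    matters: bool   # real and high impact


def signal_ratios(items):
    """Return (share of items that are real, share that actually matter)."""
    total = len(items)
    if total == 0:
        return 0.0, 0.0
    real = sum(item.is_real for item in items)
    important = sum(item.matters for item in items)
    return real / total, important / total


# Illustrative numbers only: 200 items, half hallucinated, 3 that matter.
report = (
    [ReportItem(f"hallucinated-{i}", False, False) for i in range(100)]
    + [ReportItem(f"minor-{i}", True, False) for i in range(97)]
    + [ReportItem(f"key-{i}", True, True) for i in range(3)]
)

precision, key_share = signal_ratios(report)
print(f"real: {precision:.0%}, findings that matter: {key_share:.1%}")
# prints "real: 50%, findings that matter: 1.5%"
```

Under these assumptions a reviewer reads 200 entries to surface 3, which is why an unverified list costs more than it returns.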
What most teams get wrong
There are two failure modes, and both are expensive.
Buy the hype, point an autonomous agent at production, and you flood your team with false positives and a false sense of coverage. Dismiss the whole category as a toy, and you hand speed and scale to the attackers, who are not being nearly so precious about it.
The right answer is neither autopilot nor toy. It is a harness: AI for breadth, speed, and persistence, and senior humans for judgment, verification, and the exploit that proves a finding is real.
How our AI pentest and red team harness is built
This is exactly the system we built. Our harness gives the AI bounded autonomy with real structure: disciplined planning, memory of your specific application, and controlled execution. Then every finding crosses a senior expert's desk before it ever reaches yours.
The machine proposes. A senior offensive-security engineer disposes. We verify each finding, exploit it for real, throw out the hallucinations, reason about the business-logic abuse the agent could not, and write the evidence your board and auditors can act on.
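The propose-and-dispose split can be sketched in a few lines of Python. This is an illustration of the idea only, not our production code: the Finding, Status, review, and reportable names are hypothetical, and a real workflow would carry far more context per finding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"   # raised by the agent, not yet reviewed
    VERIFIED = "verified"   # reproduced by a human, with evidence
    REJECTED = "rejected"   # reviewed and thrown out


@dataclass
class Finding:
    title: str
    status: Status = Status.PROPOSED
    reviewer: Optional[str] = None
    evidence: Optional[str] = None


def review(finding, reviewer, reproduced, evidence=None):
    """Record a human decision. Verification requires exploit evidence."""
    if reproduced and not evidence:
        raise ValueError("a verified finding needs exploit evidence")
    finding.reviewer = reviewer
    finding.status = Status.VERIFIED if reproduced else Status.REJECTED
    finding.evidence = evidence if reproduced else None
    return finding


def reportable(findings):
    """Only human-verified findings ever reach the client report."""
    return [f for f in findings if f.status is Status.VERIFIED]


# Hypothetical agent output waiting for review.
queue = [
    Finding("SQL injection in /search"),
    Finding("RCE via file upload"),
    Finding("Weak TLS configuration"),
]
review(queue[0], "alice", True, "read schema version via UNION payload")
review(queue[1], "alice", False)  # the claimed exploit never landed
# queue[2] has not been reviewed, so it stays PROPOSED and is held back.

print([f.title for f in reportable(queue)])
# prints ['SQL injection in /search']
```

The design choice that matters is the default: a finding starts as PROPOSED and nothing leaves the queue unless a named reviewer has attached evidence to it.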
You get the speed and coverage of AI with none of the make-believe. Hackers, not hallucinators.
The real takeaway
The lesson of this research is not "AI can pentest," and it is not "AI cannot." It is that AI changes who does what. The teams that win pair machine breadth with human truth, on purpose, in a system built for it.
AI belongs in your offensive testing. Unsupervised, it does not.
See how we put this to work in our Agentic Product Penetration Test and Agentic Red Team, or book a quick scoping call.