Our years of experience in offensive security have taught us to recognize a pattern. Every few years, a new technology arrives with promises that it will finally replace the human pentester.
We’ve seen this before with vulnerability scanners like Nessus, automated frameworks like Metasploit, and now AI-assisted autonomous pentesting tools that claim to think like a hacker.
Tools like PentAGI and platforms such as XBOW have been generating significant attention, often positioned as autonomous red teamers capable of replacing traditional engagements.
A few weeks ago, our CEO, Rafay Baloch, tested PentAGI and shared his observations in a detailed review on RedSecLabs. Our team has also been closely following these systems, their claims, and how they perform in real-world environments.
After reviewing their outputs across multiple engagements, here’s what we believe:
“Autonomous pentesting will not replace manual penetration testing, and this is not due to temporary limitations that will be patched over time, but because the constraints are structural.”
In this blog, we break down why.
What Is Autonomous Pentesting?
Autonomous pentesting refers to AI-driven systems that can plan, execute, and chain penetration testing activities with minimal or no human input.
In practice, these systems:
- Take a defined scope (domain, IP range, or application)
- Plan a sequence of reconnaissance and exploitation steps
- Invoke existing security tools (nmap, sqlmap, Metasploit, Hydra) under the model's direction
- Store findings in memory across sessions
- Generate a final report
More advanced systems introduce multi-agent architectures, where different models handle scanning, exploitation, and reasoning in parallel, coordinated by a central planner. Some also maintain context using memory layers that allow them to build on previous runs.
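To make the workflow above concrete, here is a minimal sketch of the plan, execute, remember, report loop. It is our own simplification, not the architecture of PentAGI, XBOW, or any other product. The names `plan`, `run_agent`, `Memory`, and the stub tools are all hypothetical, and the "tools" are plain functions standing in for real scanners.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    step: str
    detail: str


@dataclass
class Memory:
    # Hypothetical memory layer: findings kept across steps (and, in a real
    # system, persisted across sessions).
    findings: list[Finding] = field(default_factory=list)


def plan(scope: str) -> list[str]:
    # Assumption: a fixed plan. A real system would ask a model to produce it.
    return ["recon", "scan", "exploit"]


def run_agent(scope: str, tools: dict[str, Callable[[str], str]], memory: Memory) -> str:
    for step in plan(scope):
        tool = tools.get(step)
        if tool is None:
            continue  # no tool registered for this step, so it is skipped
        output = tool(scope)
        # Note what is missing here: nothing checks whether the output is true.
        memory.findings.append(Finding(step, output))
    lines = [f"Report for {scope}"]
    lines += [f"- [{f.step}] {f.detail}" for f in memory.findings]
    return "\n".join(lines)


# Stub tools standing in for real scanners (illustrative output only).
stub_tools = {
    "recon": lambda scope: f"resolved {scope}",
    "scan": lambda scope: "ports 80, 443 open",
}
report = run_agent("example.test", stub_tools, Memory())
print(report)
```

The loop itself is simple. Whatever a tool returns goes straight into memory and then into the report, which is exactly where interpretation and validation would need to happen.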
On paper, this resembles a real penetration testing workflow. In controlled environments, it can produce impressive results.
But security testing is not just execution. It is interpretation, validation, and judgment under uncertainty.
That is where the gap begins.
Problem 1: These Systems Hallucinate, and in Security, That Is Not a Minor Issue
Modern LLMs can generate fluent, confident analysis. They can also produce outputs that are completely incorrect. These mistakes, often called hallucinations, occur when the model generates information that sounds correct but is not grounded in reality.
When an AI pentesting agent hallucinates, it does not stop. It produces a finding, assigns severity, writes a description, and moves on. There is no reliable internal check that asks whether the vulnerability actually exists.
What that looks like in practice:
- False positives: reporting vulnerabilities that do not exist
- Missed chains: failing to connect related weaknesses across the system
- Confident dead ends: building exploitation paths on incorrect interpretations of tool output
In controlled demos, this is easy to miss. In real engagements, it becomes a cost. Time saved by automation is spent reviewing output and filtering noise.
Human testers operate differently. They reproduce a finding and confirm its impact before it goes into the report.
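A minimal sketch of that verification step, assuming each reported finding can be paired with an independent re-test. The `Claim` type, the `reproduce` callback, and the sample findings are hypothetical. In a real engagement the re-test is a tester's manual work, not a lambda.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Claim:
    title: str
    severity: str
    # Hypothetical re-test: returns evidence if the issue reproduces, else None.
    reproduce: Callable[[], Optional[str]]


def triage(claims):
    """Split claims into those backed by reproduced evidence and those that are not."""
    confirmed, rejected = [], []
    for claim in claims:
        evidence = claim.reproduce()
        if evidence:
            confirmed.append((claim, evidence))
        else:
            rejected.append(claim)
    return confirmed, rejected


claims = [
    # A confident-sounding claim that does not reproduce (a hallucination).
    Claim("SQL injection in /search", "high", lambda: None),
    # A claim that reproduces with concrete evidence.
    Claim("Directory listing on /backup", "medium",
          lambda: "HTTP 200 with index of /backup"),
]
confirmed, rejected = triage(claims)
```

Only the second claim survives triage. The severity label on the first one made no difference, because nothing reaches the report without evidence.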
Problem 2: Business Logic Vulnerabilities Are Invisible to Automated Systems
This is where the real damage happens, and where the gap between AI tools and human testers is the widest.
Think of it this way. The application is working perfectly. No errors, no crashes, no suspicious responses. But the way it works can still be used against you. That is what makes these vulnerabilities hard to find. There is nothing broken to detect.
An AI tool scans for things that look wrong. A human tester asks a different question: could this be misused?
Those are two different approaches, and only one of them consistently catches this class of vulnerability.
Here are a few realistic scenarios:
Manipulating Discount Logic
An e-commerce platform applies discount codes at the session level instead of recalculating them at checkout. A user applies a one-time discount code, then reuses the same session across multiple orders. The backend fails to invalidate the discount after first use, allowing repeated application without triggering validation errors. All requests appear valid, and the system responds normally.
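A simplified sketch of this flaw, assuming a hypothetical one-time code `WELCOME10` and amounts in cents. `checkout_flawed` trusts whatever discount is attached to the session. `checkout_fixed` re-validates at checkout and invalidates the code after first use. Redemption would normally be tracked per customer; a single global set keeps the example short.

```python
ONE_TIME_CODES = {"WELCOME10": 10}  # hypothetical code -> percent off
redeemed = set()                     # codes that have already been used


class Session:
    def __init__(self):
        self.discount_code = None


def checkout_flawed(session, total_cents):
    # Trusts the session: the code is never invalidated after use.
    if session.discount_code in ONE_TIME_CODES:
        pct = ONE_TIME_CODES[session.discount_code]
        return total_cents * (100 - pct) // 100
    return total_cents


def checkout_fixed(session, total_cents):
    # Re-validates at checkout and burns the code on first use.
    code = session.discount_code
    if code in ONE_TIME_CODES and code not in redeemed:
        redeemed.add(code)
        session.discount_code = None
        return total_cents * (100 - ONE_TIME_CODES[code]) // 100
    return total_cents


session = Session()
session.discount_code = "WELCOME10"
flawed_totals = [checkout_flawed(session, 10_000) for _ in range(3)]
fixed_totals = [checkout_fixed(session, 10_000) for _ in range(3)]
```

Every call to `checkout_flawed` is a well-formed request that returns a well-formed total. The only sign of a problem is that the discount applies three times instead of once.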
Abusing Refund and Cancellation Workflows
An order system allows immediate cancellation, but refunds are processed asynchronously. By sending several cancellation requests for the same order in quick succession, a user can trigger a race condition where each request passes the state check before the order is marked as refunded, leading to duplicate payouts. There are no malformed requests, only a flaw in how the workflow is sequenced.
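The sketch below simulates two cancellation requests for the same order arriving at the same time. It is an illustration under our own assumptions: the `Order` class and function names are hypothetical, and the `threading.Barrier` exists only to force the unlucky interleaving on every run so the flaw is reproducible.

```python
import threading


class Order:
    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.state = "placed"
        self.payouts = []
        self.lock = threading.Lock()


def refund_flawed(order, gap):
    if order.state == "placed":       # check
        gap.wait()                    # both workers pass the check before either writes
        order.payouts.append(order.amount_cents)
        order.state = "refunded"      # the state update arrives too late


def refund_fixed(order):
    with order.lock:                  # check and state change happen atomically
        if order.state != "placed":
            return
        order.state = "refunded"
    order.payouts.append(order.amount_cents)


def run_concurrently(target, args):
    workers = [threading.Thread(target=target, args=args) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


flawed_order = Order(5000)
run_concurrently(refund_flawed, (flawed_order, threading.Barrier(2)))

fixed_order = Order(5000)
run_concurrently(refund_fixed, (fixed_order,))
```

The flawed order ends up with two payouts and the fixed order with one. In a production system the same guarantee usually comes from a database-level conditional update or an idempotency key rather than an in-process lock.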
Privilege Escalation Through Inconsistent Authorization
An application exposes resource identifiers in API requests. While some endpoints enforce ownership checks, others trust the identifier without validating access. By modifying these values, a user can access or manipulate data belonging to other accounts. The requests are valid, and responses are normal, but the authorization model is inconsistent.
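A minimal illustration of that inconsistency, using a hypothetical in-memory invoice store and two hypothetical handlers. One enforces ownership. The other was added later and trusts the identifier.

```python
# Hypothetical data store: invoice id -> record.
INVOICES = {
    101: {"owner": "alice", "total": 4200},
    102: {"owner": "bob", "total": 990},
}


class Forbidden(Exception):
    pass


def get_invoice(current_user, invoice_id):
    invoice = INVOICES[invoice_id]
    if invoice["owner"] != current_user:   # ownership check enforced
        raise Forbidden(invoice_id)
    return invoice


def download_invoice_pdf(current_user, invoice_id):
    # Ownership check missing: any authenticated user can pass any id.
    invoice = INVOICES[invoice_id]
    return f"PDF for invoice {invoice_id}: total {invoice['total']}"


# Alice asks for Bob's invoice through both endpoints.
try:
    get_invoice("alice", 102)
    blocked = False
except Forbidden:
    blocked = True

leaked = download_invoice_pdf("alice", 102)
```

Both calls are syntactically valid and both endpoints behave as coded. Spotting the problem means knowing that invoice 102 should never be visible to Alice, and that knowledge comes from the application's purpose, not from its responses.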
In each case, every request appears valid. Every response looks normal. An automated system evaluates correctness at the request-response level.
A human tester evaluates intent.
Business logic vulnerabilities require understanding why a system exists and how it is supposed to behave, not just how it responds under test conditions. That remains outside the reliable scope of current autonomous systems.
Problem 3: Real Attacks Are Chains, Not Isolated Findings
A penetration test is not a collection of individual vulnerabilities. It is a map of how an attacker could move through a system from entry point to impact.
The most dangerous attack paths often look like this:
- An information disclosure vulnerability reveals internal API paths
- One of those paths allows manipulation that bypasses authentication
- That access leads to privileged functionality
Each finding alone would not raise an alarm. Together, they represent full compromise.
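One way to picture such a chain is as a path through a graph, where each finding moves the attacker from one position to another. The sketch below is a plain breadth-first search over the three findings above. The position names (`internet`, `knows_internal_api`, and so on) are our own hypothetical labels.

```python
from collections import deque

# (attacker position before, position after, finding). Labels are hypothetical.
FINDINGS = [
    ("internet", "knows_internal_api",
     "Information disclosure reveals internal API paths"),
    ("knows_internal_api", "authenticated_session",
     "Internal path allows manipulation that bypasses authentication"),
    ("authenticated_session", "privileged_access",
     "Access leads to privileged functionality"),
]


def find_chain(findings, start, goal):
    """Return the shortest list of findings leading from start to goal, or None."""
    edges = {}
    for src, dst, title in findings:
        edges.setdefault(src, []).append((dst, title))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        position, path = queue.popleft()
        if position == goal:
            return path
        for nxt, title in edges.get(position, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [title]))
    return None


chain = find_chain(FINDINGS, "internet", "privileged_access")
partial = find_chain(FINDINGS[:2], "internet", "privileged_access")
```

With all three findings the search returns the full path. Remove any one and it returns `None`. The search is the easy part. The hard part is recognizing that a low-severity disclosure is an edge in this graph at all, and that depends on how the findings are interpreted.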
Systems like PentAGI and XBOW can chain steps, but in practice they tend to operate in relatively short sequences. They execute planned actions, but struggle to maintain a strategic view across an entire engagement.
This is where human testers differ. They hold the broader objective in mind and continuously reassess how individual findings connect.
Problem 4: Someone Has to Be Accountable
This is the argument that rarely appears in vendor marketing, but it is the one that matters most to enterprises.
Penetration testing reports are used to make real decisions:
- Whether a product is safe to launch
- Whether a compliance requirement has been met
- Whether a critical system is ready
Those decisions carry liability. If a test misses something and that issue is exploited, someone is responsible. That responsibility cannot be assigned to a model.
Problem 5: The Workflow Has Not Changed as Much as You Think
In our view, it is more accurate to describe these tools as smart scanners with memory rather than autonomous agents.
They plan sequences, chain tools, and retain context.
When everything works as expected, the results look strong.
But when things go wrong:
- The model hallucinates a vulnerability
- It misreads tool output
- It follows incorrect paths without reassessment
At that point, a human is required to step in and correct the process.
In practice:
- Time saved on reconnaissance is spent reviewing results
- A qualified tester is still needed to validate findings
- Someone is still accountable for the outcome
Calling it autonomous is more of a positioning choice than a technical reality.
Problem 6: The Economics Behind the Hype
Another factor that is rarely discussed is cost.
Many of these systems rely on large language models accessed through APIs. Today, usage costs remain relatively low, in part due to competitive pricing and subsidized infrastructure.
Autonomous workflows are not lightweight. They involve long reasoning chains, repeated tool execution, and retries when things fail.
This leads to high token usage and inefficient cycles, especially when agents loop or fail to converge.
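A back-of-the-envelope sketch shows how retries inflate spend. Everything here is an assumption for illustration: the step count, tokens per step, retry rate, and price are made-up inputs, not measurements from any tool or vendor. We model each step as retried independently until it succeeds, which gives an expected 1 / (1 - retry_rate) attempts per step, and we ignore context growth over a long run.

```python
def run_cost_usd(steps, tokens_per_step, retry_rate, usd_per_million_tokens):
    """Estimate API spend for one agent run under a simple retry model."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    # Expected attempts per step when each attempt fails independently.
    attempts_per_step = 1 / (1 - retry_rate)
    total_tokens = steps * tokens_per_step * attempts_per_step
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical inputs: 200 steps, 10,000 tokens each, 5 USD per million tokens.
clean_run = run_cost_usd(200, 10_000, retry_rate=0.0, usd_per_million_tokens=5)
noisy_run = run_cost_usd(200, 10_000, retry_rate=0.5, usd_per_million_tokens=5)
```

Under these made-up numbers, a run where half of all attempts fail costs twice as much as a clean run. Any increase in the per-token price multiplies through in the same way.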
If pricing changes or usage scales across enterprise environments, cost becomes a constraint. This creates pressure on tools that depend heavily on external APIs, particularly those built as thin orchestration layers.
The Future of Autonomous Pentesting
In the end, the market for autonomous pentesting tools will grow. The tools will get better. The models underlying them will improve, the chains will get longer, and coverage will expand.
But the structural limitations described here are not going away on the timelines being implied. Business logic requires understanding intent. Attack chaining requires strategic reasoning. Accountability requires a human.
The future of penetration testing is not replacement. It is augmentation.
Human pentesters will work faster and cover more surface area because they have better tools. The consultants who learn to use these tools well will accomplish more in less time. Those who do not will be replaced, not by the tools, but by others who know how to use them.
What will not change is the need for someone who can look at a system and ask not just what is broken, but what it was built to do and how that can be abused.
That question has always been the job. It still is.
Get expert-led penetration testing from RedSecLabs and uncover the risks automation alone will miss. Contact us today.
🌐 www.redseclabs.com 📞 +44 20 3996 1505