<rss version="2.0">
  <channel>
    <title>Aidan John</title>
    <link>https://blog.aidanjohn.org/</link>
    <description></description>
    
    <language>en</language>
    
    <lastBuildDate>Wed, 22 Oct 2025 09:00:44 -0400</lastBuildDate>
    <item>
      <title>Modern Context Engineering</title>
      <link>https://blog.aidanjohn.org/2025/10/22/modern-context-engineering.html</link>
      <pubDate>Wed, 22 Oct 2025 09:00:44 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/10/22/modern-context-engineering.html</guid>
      <description>&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/modern-context-engineering.png&#34; width=&#34;600&#34; height=&#34;337&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The effectiveness of large language models is directly tied to the quality of the prompts they are given. In December 2022, the term &lt;strong&gt;prompt engineering&lt;/strong&gt; gained popularity as a way of describing prompting strategies that produce high-quality, relevant LLM outputs. Now, in 2025, the term &lt;strong&gt;context engineering&lt;/strong&gt; has gained popularity alongside the rise of LLM agents and agentic systems.&lt;/p&gt;
&lt;p&gt;To understand context engineering we must first understand what an agent is. The most primitive form of an agent is the ReAct agent. The name ReAct comes from the Reasoning and Acting loop these agents use to accomplish a given objective.&lt;/p&gt;
&lt;p&gt;Given a certain goal and an array of tools to use, the LLM decides on an action and calls a tool to perform it. The results of these tool calls are observed and fed back to the LLM to decide whether further action should be taken (continue the loop) or the given goal has been accomplished (end the loop). You can think of the LLM as the agent&amp;rsquo;s brain and the tools as its body - allowing it to interact with its environment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/react-diagram.png&#34; width=&#34;600&#34; height=&#34;246&#34; alt=&#34;&#34;&gt;&lt;em&gt;&lt;a href=&#34;https://langchain-ai.github.io/langgraph/tutorials/workflows&#34;&gt;https://langchain-ai.github.io/langgraph/tutorials/workflows&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
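&lt;p&gt;The loop above can be captured in a few lines of Python. This is a minimal illustrative sketch, not a real framework: &lt;code&gt;call_llm&lt;/code&gt; is a hypothetical stub standing in for an actual model call, and the tool registry holds a single toy tool.&lt;/p&gt;

```python
# Minimal ReAct-style loop (illustrative sketch; call_llm and the tool
# registry are hypothetical stand-ins, not a real API).
def call_llm(messages):
    # Placeholder model: a real system would call an LLM here. This stub
    # always finishes after one tool call so the sketch is runnable.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": "done"}
    return {"type": "tool_call", "name": "search", "args": {"q": "context engineering"}}

TOOLS = {"search": lambda q: "top result for " + q}

def run_agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = call_llm(messages)           # Reason: model picks the next action
        if decision["type"] == "final":         # goal accomplished, end the loop
            return decision["content"]
        result = TOOLS[decision["name"]](**decision["args"])  # Act: call the tool
        # Observe: feed the tool result back into context for the next decision
        messages.append({"role": "tool", "content": result})
    return "step limit reached"
```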
&lt;p&gt;In order for an agent to make a decision, it needs a prompt/context, which lives in the LLM&amp;rsquo;s context window. As the LLM makes more decisions and calls more tools, those previous decisions and tool call results accumulate in the input context. As you can imagine, agents running over longer periods of time generate and store a significant amount of context. Although context windows have been getting larger (with some models supporting over 1M tokens of context), agents suffer from both performance degradation and increased inference cost when dealing with large input context.&lt;/p&gt;
&lt;p&gt;The term &lt;strong&gt;context engineering&lt;/strong&gt; gained popularity in May 2025 - coinciding with the rise in popularity of agentic systems. Andrej Karpathy describes it best as &amp;ldquo;the delicate art of filling the context window with just the right information for the next step.&amp;rdquo;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/karpathy-tweet.png&#34; width=&#34;598&#34; height=&#34;216&#34; alt=&#34;&#34;&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://x.com/karpathy/status/1937902205765607626&#34;&gt;https://x.com/karpathy/status/1937902205765607626&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this post, we&amp;rsquo;ll be discussing why context engineering is important and some new techniques developers are using to effectively manage context windows in agentic systems.&lt;/p&gt;
&lt;p&gt;This post is a summary of a recent webinar hosted by &lt;a href=&#34;https://www.linkedin.com/in/lance-martin-64a33b5/&#34;&gt;Lance Martin&lt;/a&gt; (Founding Engineer @ &lt;a href=&#34;https://www.langchain.com/&#34;&gt;LangChain&lt;/a&gt;) and Yichao &amp;ldquo;Peak&amp;rdquo; Ji (Co-Founder + Chief Scientist @ &lt;a href=&#34;https://manus.im/blog&#34;&gt;Manus&lt;/a&gt;) titled &lt;a href=&#34;https://www.youtube.com/watch?v=6_BcCthVvb8&#34;&gt;&lt;em&gt;Context Engineering for AI Agents with LangChain and Manus&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;table-of-contents&#34;&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Why Context Engineering?&lt;/li&gt;
&lt;li&gt;5 Pillars of Context Engineering
&lt;ul&gt;
&lt;li&gt;Cache&lt;/li&gt;
&lt;li&gt;Offload
&lt;ul&gt;
&lt;li&gt;Offloading Tools: Hierarchical Action Space
&lt;ul&gt;
&lt;li&gt;Function Calling&lt;/li&gt;
&lt;li&gt;Sandbox Utilities&lt;/li&gt;
&lt;li&gt;Packages &amp;amp; APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reduce&lt;/li&gt;
&lt;li&gt;Isolate&lt;/li&gt;
&lt;li&gt;Retrieve&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Final Thoughts&lt;/li&gt;
&lt;li&gt;Resources&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;why-context-engineering&#34;&gt;Why Context Engineering?&lt;/h1&gt;
&lt;p&gt;When developers first started building agentic systems, a consistent phenomenon emerged: the longer an agent runs, the more context needs to be stored. In these systems we have an LLM bound to some number of tools that the LLM can call autonomously in a loop. The challenge is that for every decision made and tool call observed, the LLM&amp;rsquo;s messages and tool call observations are appended to a chat list as context. These messages grow over time, and you end up with an unbounded explosion of messages as the agent runs. To put this in perspective, Manus&#39; general-purpose AI agent requires around 50 tool calls per task, and Anthropic agents often engage in conversations spanning hundreds of turns. So you can imagine the scale of context that needs to be managed for these systems to run effectively.&lt;/p&gt;
&lt;p&gt;As the context grows, performance drops. This is referred to as &lt;strong&gt;context rot&lt;/strong&gt;. Many modern models have context windows up to 1M tokens, but in reality most models start degrading at around 200k tokens. Researchers at Chroma made a great &lt;a href=&#34;https://research.trychroma.com/context-rot&#34;&gt;blog post&lt;/a&gt; about this phenomenon which I highly recommend. At a high level, context rot leads to repetition, slower inference, lower-quality outputs, and increased inference cost.&lt;/p&gt;
&lt;p&gt;Now, let&amp;rsquo;s explore the philosophy behind context engineering so we can avoid context rot. We&amp;rsquo;ll also focus on some newer context engineering techniques being implemented by the Manus team in their general-purpose AI agent.&lt;/p&gt;
&lt;h1 id=&#34;5-pillars-of-context-engineering&#34;&gt;5 Pillars of Context Engineering&lt;/h1&gt;
&lt;p&gt;The paradox: agents need a lot of context to perform well, but performance drops as context grows.&lt;/p&gt;
&lt;p&gt;The solution to this is context engineering - which is made up of five main pillars: Cache, Offload, Reduce, Isolate, and Retrieve.&lt;/p&gt;
&lt;h2 id=&#34;cache&#34;&gt;Cache&lt;/h2&gt;
&lt;p&gt;First, we need to understand caching as it applies to LLMs. KV-caching refers to the practice of caching the key and value state of generative transformers. By caching the previous keys and values, the model can focus on only calculating the attention values for new tokens. It is important to design your agentic system around the KV-cache. Doing so decreases the computation (and cost) needed for inference while also decreasing time-to-first-token.&lt;/p&gt;
&lt;p&gt;A few key practices to improving KV-cache hit rate include keeping your prompt prefix stable, making your context append-only, and marking cache breakpoints explicitly when needed. You can read more about this in Manus&#39; &lt;a href=&#34;https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus&#34;&gt;blog post&lt;/a&gt; on context engineering from a few months ago (highly recommend).&lt;/p&gt;
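&lt;p&gt;To see why a stable prefix and append-only context matter for the KV-cache, here&amp;rsquo;s a toy sketch. The cache can only be reused for the longest shared token prefix between consecutive requests, so any change near the front of the prompt forces a full re-prefill (the token lists below are simplified stand-ins for real tokenization, and the real cache lives inside the inference server):&lt;/p&gt;

```python
# The KV-cache is valid only for the longest shared token prefix between
# consecutive requests. (Illustrative sketch of the reuse rule.)
def shared_prefix_len(old_tokens, new_tokens):
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Append-only: the old context is an exact prefix of the new one, so every
# previously computed key/value can be reused.
turn1 = ["SYS", "user:plan trip", "tool:search", "obs:results"]
turn2 = turn1 + ["assistant:book hotel"]
print(shared_prefix_len(turn1, turn2))      # full reuse: 4

# Unstable prefix (e.g. a timestamp in the system prompt) busts the cache
# from token 0, forcing a full re-prefill.
turn2_bad = ["SYS 09:01"] + turn1[1:] + ["assistant:book hotel"]
print(shared_prefix_len(turn1, turn2_bad))  # reuse: 0
```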
&lt;h2 id=&#34;offload&#34;&gt;Offload&lt;/h2&gt;
&lt;p&gt;Offloading context is essential to effective context engineering. You don&amp;rsquo;t need all context to live in the context window of your agents, you can offload it and retrieve it later. The most common way to do this is by using a filesystem. Take the output of a tool call or message, dump it to the filesystem, and send back some minimal piece of info to your agent so it can access the full context if it needs to. This allows you to avoid stuffing token-heavy context into your context window unnecessarily. Filesystems are typically used for planning, long-term memories, and other token-heavy context.&lt;/p&gt;
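&lt;p&gt;A minimal sketch of filesystem offloading might look like the following. The stub format (path, preview, size) is an assumption for illustration, not any particular framework&amp;rsquo;s API:&lt;/p&gt;

```python
# Offload a token-heavy tool result to the filesystem and return only a
# small stub the agent can use to retrieve the full content later.
# (Illustrative sketch; file layout and stub fields are assumptions.)
import os, tempfile, uuid

WORKDIR = tempfile.mkdtemp()

def offload(tool_name, output, preview_chars=200):
    path = os.path.join(WORKDIR, tool_name + "-" + uuid.uuid4().hex + ".txt")
    with open(path, "w") as f:
        f.write(output)
    # Only this stub enters the context window; the full output stays on disk.
    return {"path": path, "preview": output[:preview_chars], "chars": len(output)}

def retrieve(path):
    with open(path) as f:
        return f.read()

stub = offload("web_search", "x" * 50_000)   # pretend 50k-char search result
full = retrieve(stub["path"])                # recover full context on demand
```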
&lt;p&gt;Tool definitions typically sit at the beginning of your context so the LLM knows which tools are available for use at runtime. As the system grows, you&amp;rsquo;ll find that the tools themselves can often take up a lot of context. Having too many tools in context can lead to confusion for the LLM on which tool to call. The model may call the wrong tools or even nonexistent ones. Offloading tools is another way to improve agent performance.&lt;/p&gt;
&lt;p&gt;A common approach to offloading tools right now is dynamic RAG over tool descriptions - loading and removing tools on demand depending on the current task. However, this causes two main issues. First, since tool definitions sit at the front of your context, the &lt;a href=&#34;https://medium.com/@joaolages/kv-caching-explained-276520203249&#34;&gt;KV-cache&lt;/a&gt; resets every time the toolset changes. Second, the model&amp;rsquo;s past calls to since-removed tools remain in the context, which can few-shot the model into calling invalid tools or using invalid parameters.&lt;/p&gt;
&lt;h3 id=&#34;offloading-tools-hierarchical-action-space&#34;&gt;Offloading Tools: Hierarchical Action Space&lt;/h3&gt;
&lt;p&gt;To solve this, Manus is experimenting with a layered action space. They essentially allow the agent to choose between three different levels of abstraction: function calling, sandbox utilities, and packages &amp;amp; APIs.&lt;/p&gt;
&lt;h3 id=&#34;function-calling&#34;&gt;Function Calling&lt;/h3&gt;
&lt;p&gt;This is the classic function calling everyone is familiar with in agentic development. The key difference is that Manus only uses 10-20 atomic functions. These are functions like reading/writing files, executing shell commands, searching the internet, browser operations, etc. These atomic functions have very clear boundaries which leads to less confusion when the LLM is deciding which tools to call. They can also be chained together to compose much more complex workflows.&lt;/p&gt;
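&lt;p&gt;As a rough illustration, an atomic toolset might look like the sketch below. The names and signatures are assumptions, not Manus internals; the point is that a handful of small, clearly-bounded functions can be chained into larger workflows:&lt;/p&gt;

```python
# A small set of atomic tools with clear boundaries, in the spirit of the
# 10-20 functions described above. (Illustrative sketch; names and
# signatures are assumptions.)
import subprocess

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return "ok"

def run_shell(cmd):
    # One generic escape hatch instead of dozens of specialized tools.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

ATOMIC_TOOLS = {"read_file": read_file, "write_file": write_file, "shell": run_shell}

# Atomic tools compose into more complex workflows:
write_file("/tmp/notes.txt", "line1\nline2\n")
print(ATOMIC_TOOLS["shell"]("wc -l /tmp/notes.txt"))
```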
&lt;h3 id=&#34;sandbox-utilities&#34;&gt;Sandbox Utilities&lt;/h3&gt;
&lt;p&gt;Each Manus session runs inside a full VM sandbox - a throwaway Linux container preloaded with custom tooling. This allows Manus to take advantage of everything Linux has to offer without overloading the context window - you can add new capabilities without touching the model&amp;rsquo;s function calling space. The agent knows that all the tools it can use in this environment are located in &lt;code&gt;/usr/bin&lt;/code&gt; and that it can use the &lt;code&gt;--help&lt;/code&gt; flag for context on how to use a tool. For large outputs, the agent can write to files or return the results in pages and use Linux command line tools like &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;less&lt;/code&gt;, and &lt;code&gt;more&lt;/code&gt; to process those results on the fly.&lt;/p&gt;
&lt;h3 id=&#34;packages--apis&#34;&gt;Packages &amp;amp; APIs&lt;/h3&gt;
&lt;p&gt;Manus can write Python scripts to call pre-authorized APIs or custom packages. This is perfect for tasks that require lots of computation in memory but don&amp;rsquo;t require all the context to be pushed into the context window. For example, if you&amp;rsquo;re analyzing stock data over the course of a year, you don&amp;rsquo;t need to push all the price data into the context window - you just need to write a script that pulls data from a stocks API, analyzes it, and returns a summary back to context.&lt;/p&gt;
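&lt;p&gt;A sketch of that stock example: the heavy data lives only inside the script, and a one-line summary is all that returns to context. Here &lt;code&gt;fetch_prices&lt;/code&gt; is a hypothetical stand-in for a real API call, with made-up numbers:&lt;/p&gt;

```python
# Heavy computation stays in the sandbox; only a short summary re-enters
# the context window. (Illustrative sketch; fetch_prices stands in for a
# real stock API call and the numbers are synthetic.)
def fetch_prices(ticker):
    # Hypothetical stand-in: a real script would call a pre-authorized API.
    return [100.0 + (i % 7) - 3 for i in range(365)]   # a year of daily closes

def summarize_year(ticker):
    prices = fetch_prices(ticker)          # ~365 values never enter context
    lo, hi = min(prices), max(prices)
    change = (prices[-1] - prices[0]) / prices[0] * 100
    # Only this one-line summary is returned to the agent's context.
    return f"{ticker}: low {lo:.2f}, high {hi:.2f}, YTD change {change:.1f}%"

print(summarize_year("ACME"))
```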
&lt;p&gt;By using this kind of hierarchical action space, you don&amp;rsquo;t add any overhead to the model. From the model&amp;rsquo;s POV, all three levels still go through the same standard atomic function calls, allowing you to balance capability, cost, and cache stability.&lt;/p&gt;
&lt;h2 id=&#34;reduce&#34;&gt;Reduce&lt;/h2&gt;
&lt;p&gt;Reducing context is exactly what it sounds like: shortening information without losing important context. This can mean shrinking tool call outputs, shrinking message outputs, or pruning old tool calls and other unnecessary context.&lt;/p&gt;
&lt;p&gt;Manus divides reduction into two parts: compaction and summarization.&lt;/p&gt;
&lt;p&gt;In Manus every tool call and tool result has two different formats - full and compact. The compact format strips out any information that can be reconstructed from the filesystem or external state. For example, say you have a tool that writes to a file. It most likely has two input parameters - one for a path to the file and one for the content to be written. Once the tool returns, we can assume the file already exists in the environment so we can drop the token-heavy content field in the compact format and just keep the path to the file. If the agent needs to read that file again, it can retrieve it via the path so no information is truly lost - only externalized. This kind of reversibility is crucial because agents chain predictions based on previous actions and observations. You never know which past action will suddenly become important context for the next step, so in the event you need the full context - it&amp;rsquo;s available.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/compaction.png&#34; width=&#34;600&#34; height=&#34;289&#34; alt=&#34;&#34;&gt;
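&lt;p&gt;A minimal sketch of that compaction idea, using the write-file example above (the field names are assumptions for illustration):&lt;/p&gt;

```python
# Compaction drops fields that can be reconstructed from external state.
# (Illustrative sketch of the full/compact idea; field names are assumptions.)
def compact_tool_call(call):
    if call["tool"] == "write_file":
        # After the call runs, the file exists on disk, so the token-heavy
        # content field is reconstructible from the path - drop it.
        slim = dict(call)
        slim.pop("content", None)
        return slim
    return call

full = {"tool": "write_file", "path": "/tmp/report.md", "content": "..." * 5000}
compact = compact_tool_call(full)
# Nothing is lost, only externalized: the agent can re-read /tmp/report.md.
print(sorted(compact.keys()))   # ['path', 'tool']
```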
&lt;p&gt;Compaction will only get you so far; your context will continue to grow and eventually hit the ceiling. That&amp;rsquo;s where summarization comes in.&lt;/p&gt;
&lt;p&gt;Manus combines compaction with summarization very carefully. The key difference is that compaction is reversible and summarization is not - both reduce context length, but they do so very differently. In some cases it can be advantageous to dump the entirety of the pre-summary context into a text or log file on the filesystem so it&amp;rsquo;s always recoverable later; you can prompt your agent to look for the full context on demand using Linux tools like &lt;code&gt;glob&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt;. Manus tried various ways to optimize the summarization prompt, but it turns out a simple approach works best: don&amp;rsquo;t use a free-form prompt to let the LLM generate a summary. Instead, use structured outputs to define a schema with fields indicating the required points of summarization, then let the LLM fill in those fields. This enforces that key points are always summarized and gives you a reasonably stable output to iterate on.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/summarization.png&#34; width=&#34;600&#34; height=&#34;269&#34; alt=&#34;&#34;&gt;
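&lt;p&gt;The structured-output approach can be sketched as a required-fields schema plus a validation step. The field names here are illustrative assumptions, not Manus&amp;rsquo;s actual schema:&lt;/p&gt;

```python
# Schema-driven summarization: instead of a free-form summary, the model is
# forced to fill required fields, so key points are never silently dropped.
# (Illustrative sketch; the field names are assumptions.)
REQUIRED_FIELDS = ["goal", "progress", "open_questions", "next_step"]

def validate_summary(summary):
    # Enforce that every required point of summarization is present.
    missing = [f for f in REQUIRED_FIELDS if not summary.get(f)]
    if missing:
        raise ValueError("summary missing fields: " + ", ".join(missing))
    return summary

# A structured-output call would return something shaped like this:
model_output = {
    "goal": "compile a market report",
    "progress": "fetched data for 12 tickers; draft sections 1-2 written",
    "open_questions": "which charting library to use",
    "next_step": "draft section 3 from the notes file",
}
validate_summary(model_output)   # passes; a free-form blob would not
```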
&lt;p&gt;To implement both methods effectively, developers must track context length thresholds. To avoid context rot, it&amp;rsquo;s important to identify a pre-rot threshold (typically around 128k to 200k tokens) that triggers context reduction.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/context-rot.png&#34; width=&#34;600&#34; height=&#34;241&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Whenever your overall context size approaches this threshold, start with compaction, NOT summarization. Compaction should begin with the oldest context in the message history. For example, you might compact the oldest 50% of tool calls while keeping the newer ones in full detail so the model still has fresh few-shot examples of how to use tools properly.&lt;/p&gt;
&lt;p&gt;When compaction is triggered, we need to check how much context the operation actually freed. If, after multiple rounds of compaction, we are still nearing our threshold, we start summarizing.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/reduction-timeline.png&#34; width=&#34;600&#34; height=&#34;307&#34; alt=&#34;&#34;&gt;
&lt;p&gt;A tip for summarization: always use the full version of the data and not the compact one. Also, keep the last few tool calls and tool results in full detail because it allows the model to know where it left off and to continue smoothly.&lt;/p&gt;
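&lt;p&gt;Putting the reduction policy together, a simplified sketch might look like this. The token counts, compression ratio, and the &amp;ldquo;keep the last few calls in full&amp;rdquo; cutoff are all illustrative assumptions:&lt;/p&gt;

```python
# The reduction policy described above, as a sketch: near the pre-rot
# threshold, compact the oldest messages first; summarize only if
# compaction is not enough. (Numbers and helpers are illustrative.)
THRESHOLD = 170_000   # pre-rot trigger, inside the 128k-200k band

def tokens(messages):
    return sum(m["tokens"] for m in messages)

def compact(m):
    return {**m, "tokens": m["tokens"] // 4, "compacted": True}

def summarize(messages):
    return [{"role": "summary", "tokens": 2_000}]

def reduce_context(messages, keep_recent=4):
    if tokens(messages) >= THRESHOLD:
        # Compaction first, oldest context first; keep recent calls in full
        # so the model still sees fresh examples of proper tool use.
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        messages = [compact(m) for m in old] + recent
    if tokens(messages) >= THRESHOLD:
        # Still too big after compaction: irreversibly summarize the old
        # part, keeping the last few tool calls intact.
        messages = summarize(messages[:-keep_recent]) + messages[-keep_recent:]
    return messages

history = [{"role": "tool", "tokens": 5_000} for _ in range(40)]  # 200k tokens
print(tokens(reduce_context(history)))
```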
&lt;h2 id=&#34;isolate&#34;&gt;Isolate&lt;/h2&gt;
&lt;p&gt;Isolating context is all about routing only the relevant context into each sub-agent&amp;rsquo;s context window. Each sub-agent has its own context window, so we can leverage them to accomplish specific aspects of the overall task while keeping their context limited to the information they need. This allows for a separation of concerns within the system, ensuring no agent is burdened by unnecessary context.&lt;/p&gt;
&lt;p&gt;Manus considers context isolation in two different patterns: sharing context by communication, and communication by sharing context.&lt;/p&gt;
&lt;p&gt;Sharing context by communication is the easier pattern to understand, it&amp;rsquo;s the classic sub-agent setup.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/context-by-communication.png&#34; width=&#34;600&#34; height=&#34;238&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The main agent writes a prompt and sends it to a sub-agent. The sub-agent&amp;rsquo;s entire context only consists of that instruction prompt. This is ideal if the task has a short/clear instruction and only the final output matters, like searching a codebase for a specific snippet. The result is then returned to the main agent.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/communication-by-context.png&#34; width=&#34;600&#34; height=&#34;242&#34; alt=&#34;&#34;&gt;
&lt;p&gt;For more complex scenarios, it may be advantageous for the sub-agent to see the entirety of the previous context. This is the communication by sharing context pattern. The sub-agent has access to all of the main agent&amp;rsquo;s context, but has its own system prompt and its own tools. For example, imagine a deep-research scenario: the final report depends on a lot of intermediate searches and notes, so you would use this pattern to share all those notes and searches. If you were to force the agent to re-read all of that context by retrieving it from the filesystem, you&amp;rsquo;d be wasting latency and context.&lt;/p&gt;
&lt;p&gt;Be aware that sharing context is expensive: each sub-agent has a larger input to pre-fill. You&amp;rsquo;ll end up spending more on input tokens, and because both the action space and system prompt differ, you cannot reuse the KV-cache.&lt;/p&gt;
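&lt;p&gt;The two patterns differ mainly in how the sub-agent&amp;rsquo;s context is assembled. A toy sketch, where &lt;code&gt;spawn&lt;/code&gt; is a hypothetical stand-in for running a sub-agent:&lt;/p&gt;

```python
# The two isolation patterns as context-assembly sketches. (Illustrative;
# spawn is a hypothetical stand-in for running a sub-agent.)
def spawn(context):
    return "result given " + str(len(context)) + " messages"

main_context = [{"role": "system", "content": "main prompt"}] + [
    {"role": "tool", "content": "note " + str(i)} for i in range(20)
]

# Pattern 1 - sharing context by communication: the sub-agent sees only a
# short instruction; cheap to pre-fill, only the final output matters.
instruction = [{"role": "user", "content": "find the auth snippet in repo X"}]
spawn(instruction)

# Pattern 2 - communication by sharing context: the sub-agent inherits the
# full history but gets its own system prompt and tool set. More expensive:
# larger pre-fill, and no KV-cache reuse because the prefix changed.
sub_context = [{"role": "system", "content": "researcher prompt"}] + main_context[1:]
spawn(sub_context)
```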
&lt;h2 id=&#34;retrieve&#34;&gt;Retrieve&lt;/h2&gt;
&lt;p&gt;We touched on retrieval earlier; the best method is highly debatable and really comes down to what works best for your implementation. The two most popular methods for context retrieval are semantic search and filesystem search. The former is a RAG-style system using a vector database and semantic search for retrieval. The latter uses a filesystem with file search tools like &lt;code&gt;glob&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; to retrieve relevant context. A combination of both can be advantageous in certain scenarios.&lt;/p&gt;
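&lt;p&gt;A toy sketch of the filesystem approach - essentially the Python equivalent of letting an agent run &lt;code&gt;glob&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; over its offloaded notes:&lt;/p&gt;

```python
# Filesystem-style retrieval: narrow by filename (glob), then by content
# (grep). (Illustrative sketch over a temporary notes directory.)
import pathlib, tempfile

notes = pathlib.Path(tempfile.mkdtemp())
(notes / "plan.md").write_text("1. fetch data\n2. summarize findings\n")
(notes / "memo.md").write_text("client prefers quarterly summaries\n")

def search_notes(root, pattern, needle):
    hits = []
    for path in sorted(root.glob(pattern)):       # glob: narrow by filename
        for line in path.read_text().splitlines():
            if needle in line:                    # grep: narrow by content
                hits.append((path.name, line))
    return hits

print(search_notes(notes, "*.md", "summar"))
```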
&lt;h1 id=&#34;final-thoughts&#34;&gt;Final Thoughts&lt;/h1&gt;
&lt;p&gt;A few things to consider.&lt;/p&gt;
&lt;p&gt;Building agentic systems is a discipline barely a year old, and the rules of engagement are constantly changing. What&amp;rsquo;s considered best practice now may not hold true in a few months. Staying on top of emerging trends and new advancements allows your system to evolve and improve.&lt;/p&gt;
&lt;p&gt;The five pillars are not independent; they enable and complement each other. Offload + retrieve enables more efficient reduction, stable retrieval makes isolation safe, isolation reduces the frequency of reduction, and more isolation + reduction impacts cache efficiency and the quality of output.&lt;/p&gt;
&lt;p&gt;Avoid context over-engineering. The biggest leaps Manus has achieved haven&amp;rsquo;t come from adding more fancy context management layers or clever retrieval hacks. They all came from simplification and removal of unnecessary tricks. Every time they simplified the architecture and trusted the model more, the system got faster, more stable, and smarter. The goal of context engineering is to make the model&amp;rsquo;s job simpler, not harder.&lt;/p&gt;
&lt;h1 id=&#34;resources&#34;&gt;Resources&lt;/h1&gt;
&lt;p&gt;Original Webinar - &lt;a href=&#34;https://www.youtube.com/watch?v=6_BcCthVvb8&#34;&gt;Youtube&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Lance&amp;rsquo;s Slides - &lt;a href=&#34;https://docs.google.com/presentation/d/16aaXLu40GugY-kOpqDU4e-S0hD1FmHcNyF0rRRnb1OU/edit&#34;&gt;Google Docs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Peak&amp;rsquo;s Slides - &lt;a href=&#34;https://drive.google.com/file/d/1QGJ-BrdiTGslS71sYH4OJoidsry3Ps9g/view&#34;&gt;Google Drive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.langchain.com/context-engineering-for-agents&#34;&gt;Context Engineering for Agents&lt;/a&gt; - A comprehensive guide on write, select, compress, and isolate strategies&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.langchain.com/the-rise-of-context-engineering&#34;&gt;The Rise of &amp;ldquo;Context Engineering&amp;rdquo;&lt;/a&gt; - Understanding why context engineering is the most important skill for AI engineers&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://docs.langchain.com/oss/python/langchain/context-engineering&#34;&gt;Context Engineering in Agents (Docs)&lt;/a&gt; - Technical documentation on implementing context engineering with LangChain&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.langchain.com/deep-agents/&#34;&gt;Deep Agents&lt;/a&gt; - How planning, sub-agents, and file systems enable complex agent tasks&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/langchain-ai/context_engineering&#34;&gt;GitHub: Context Engineering Examples&lt;/a&gt; - Hands-on notebooks covering all four context engineering strategies&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>XBOW - Agentic Pentesting with ZERO False Positives</title>
      <link>https://blog.aidanjohn.org/2025/09/27/xbow-agentic-pentesting-with-zero.html</link>
      <pubDate>Sat, 27 Sep 2025 15:39:25 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/09/27/xbow-agentic-pentesting-with-zero.html</guid>
      <description>&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/xbow-blog-image4.png&#34; width=&#34;600&#34; height=&#34;337&#34; alt=&#34;&#34;&gt;
&lt;p&gt;If you&amp;rsquo;ve never heard of &lt;a href=&#34;https://xbow.com/&#34;&gt;XBOW&lt;/a&gt; before, you will. XBOW is a platform that leverages a multi-agent system to perform automated pentesting. The team behind it is on a mission to build a fully autonomous system to catch verified vulnerabilities, allowing security teams to focus on problems that require a human touch.&lt;/p&gt;
&lt;p&gt;XBOW made huge waves back in June of 2025 when they achieved the &lt;a href=&#34;https://xbow.com/blog/top-1-how-xbow-did-it&#34;&gt;number one spot&lt;/a&gt; on the US HackerOne bug bounty leaderboard.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/xbow-leaderboard.webp&#34; width=&#34;600&#34; height=&#34;308&#34; alt=&#34;&#34;&gt;
&lt;p&gt;What was stunning about their approach was not only the scale at which they were able to find vulnerabilities, but the creativity of the vulns themselves - like when &lt;a href=&#34;https://xbow.com/blog/xbow-geolocation-sqli&#34;&gt;XBOW found a SQLi vulnerability&lt;/a&gt; while trying to bypass geolocation restrictions. We&amp;rsquo;ll explore some more interesting vulns they found later in this post.&lt;/p&gt;
&lt;p&gt;More recently, researchers at XBOW conducted an experiment to test for vulnerabilities across 17,000+ DockerHub images. The results: 200 vulnerabilities with 0 false positives. Yes, you read that right. ZERO false positives.&lt;/p&gt;
&lt;p&gt;In this post, we&amp;rsquo;ll be exploring the methodology behind XBOW&amp;rsquo;s approach to agentic pentesting, focusing on how they were able to harness typically hallucination-prone LLMs into a platform that can identify verified vulnerabilities at scale.&lt;/p&gt;
&lt;p&gt;This post is a summary of a recent webinar hosted by XBOW researchers &lt;a href=&#34;https://www.linkedin.com/in/brendan-dolan-gavitt-3b68154/&#34;&gt;Brendan Dolan-Gavitt&lt;/a&gt; and &lt;a href=&#34;https://www.linkedin.com/in/alvaroms/&#34;&gt;Alvaro Muñoz&lt;/a&gt; titled &lt;em&gt;200 Zero Days, 0 False Positives: A discussion on scaling autonomous exploitation with AI&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;If you want to learn more about XBOW, check out their &lt;a href=&#34;https://xbow.com/blog&#34;&gt;blog&lt;/a&gt; and the slides from their recent &lt;a href=&#34;https://www.blackhat.com/us-25/briefings/schedule/#ai-agents-for-offsec-with-zero-false-positives-46559&#34;&gt;Black Hat session&lt;/a&gt;. The images in this post are all from the XBOW team and are purely for educational purposes.&lt;/p&gt;
&lt;p&gt;In my opinion, XBOW is a prime example of some of the most interesting applications of GenAI in recent memory, so without further ado, let&amp;rsquo;s dive in!&lt;/p&gt;
&lt;h1 id=&#34;table-of-contents&#34;&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Quelling False Positives and Hallucinations
&lt;ul&gt;
&lt;li&gt;Death by a thousand slops&lt;/li&gt;
&lt;li&gt;XBOW Architecture&lt;/li&gt;
&lt;li&gt;Taxonomy of Validators
&lt;ul&gt;
&lt;li&gt;Canary-based Validation&lt;/li&gt;
&lt;li&gt;Heuristic-based Validation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The DockerHub Experiment
&lt;ul&gt;
&lt;li&gt;Open Redirect Vulnerability in Jenkins&lt;/li&gt;
&lt;li&gt;SSRF in Apache Druid&lt;/li&gt;
&lt;li&gt;File Read in Group Docs&lt;/li&gt;
&lt;li&gt;AuthZ Bypass in Redmine&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;quelling-false-positives-and-hallucinations&#34;&gt;Quelling False Positives and Hallucinations&lt;/h1&gt;
&lt;p&gt;I think everyone is familiar with the tendency of LLMs to hallucinate information. This becomes a major problem when applying them in pentesting contexts, where accuracy is necessary for successful exploitation. If you&amp;rsquo;re not properly validating the vulnerabilities found by the agent, you&amp;rsquo;re just producing more AI slop. This became an all-too-real problem for the lead developer of cURL.&lt;/p&gt;
&lt;h2 id=&#34;death-by-a-thousand-slops&#34;&gt;Death by a thousand slops&lt;/h2&gt;
&lt;p&gt;The increasing popularity and development of GenAI opened the floodgates for developers to create pentesting agents and set them loose on bug bounty programs. Most notable was the story of what happened with cURL. Their bug bounty program was inundated with bogus security reports generated by AI. This influx of false reporting caused a major headache for the founder and lead developer of cURL, Daniel Stenberg. It became increasingly difficult to tell which reports actually needed attention and which reports were just looking for a payout. Daniel documented his experience and frustration with this in his blog post &lt;a href=&#34;https://daniel.haxx.se/blog/2025/07/14/death-by-a-thousand-slops/&#34;&gt;Death by a Thousand Slops&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The reports themselves look super plausible, but actually turn out to be completely fake. Here&amp;rsquo;s an example of a report like this from the XBOW team:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fake-report-1.png&#34; width=&#34;600&#34; height=&#34;115&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The agent was trying to look for a command injection vulnerability. It finds an endpoint where it thinks it can trigger one and tries to exfiltrate data by embedding the payload in backticks.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fake-report-2.png&#34; width=&#34;600&#34; height=&#34;170&#34; alt=&#34;&#34;&gt;
&lt;p&gt;If we take a closer look at the command, we can see it&amp;rsquo;s wrapped in double quotes instead of single quotes. Because backticks are expanded inside double quotes, the payload executes on the attacker&amp;rsquo;s own machine - not the target. While it will return a password file, it is the attacker&amp;rsquo;s own password file and does not constitute a real vulnerability.&lt;/p&gt;
&lt;p&gt;So how was XBOW able to create a system that properly quells false positives like this? Simple: they deploy a suite of validators that sit outside of the AI and check its work. The validators leverage non-AI code to confirm the authenticity of agent-proposed vulnerabilities, and the XBOW team has come up with some pretty neat tricks for this validation.&lt;/p&gt;
&lt;h2 id=&#34;xbow-architecture&#34;&gt;XBOW Architecture&lt;/h2&gt;
&lt;p&gt;Before we get into some of the technical details for the validators, it&amp;rsquo;s important to understand a little about the architecture of XBOW&amp;rsquo;s multi-agent system.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/xbow-architecture.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;At the top is an agent orchestrator they call the Coordinator. This agent manages teams of sub-agents to accomplish different tasks. The Coordinator first spawns discovery agents that use headless browsers and other discovery tools to start collecting endpoints and requests - enough info to map out the application. This info is passed back to the Coordinator so it can decide which specialized solvers to spawn in response. These agents use a mix of both AI and non-AI techniques for discovery and exploitation. This is advantageous because you get the benefit of industry-standard tooling combined with the knowledge base of an LLM, allowing the agents to reason along the lines of &amp;ldquo;I know the framework this application is using, so x, y, and z endpoints must exist&amp;rdquo; even if those endpoints aren&amp;rsquo;t explicitly exposed in the frontend code.&lt;/p&gt;
&lt;p&gt;All of these agents are running on a dedicated and isolated attack machine. This machine is a throwaway Linux container pre-built with a bunch of standard tooling as well as some custom tools designed by the XBOW team.&lt;/p&gt;
&lt;p&gt;So where does validation come in? When an agent thinks it has found a vulnerability, it tries to prove its existence by submitting evidence to the exploit validator. The exploit validator verifies the evidence by talking to the target and making requests on the network with non-AI code. It is the only component that can decide whether the agent succeeded at what it was trying to do. Once it has decided that a valid vulnerability has been found, it forwards that information to a report generation flow.&lt;/p&gt;
&lt;h2 id=&#34;taxonomy-of-validators&#34;&gt;Taxonomy of Validators&lt;/h2&gt;
&lt;p&gt;XBOW leverages a spectrum of validators, ranging from bulletproof implementations (which require some extra setup effort) to less reliable ones (which admit occasional false positives but require no setup).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/validator-taxonomy.png&#34; width=&#34;600&#34; height=&#34;300&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This diagram shows the range of these validators. On the x-axis we have the effort spent on setup, ranging from manual to fully automated. On the y-axis we have the degree of target cooperation (whether XBOW has write access to the target system for the test).&lt;/p&gt;
&lt;h3 id=&#34;canary-based-validation&#34;&gt;Canary-based Validation&lt;/h3&gt;
&lt;p&gt;In the top right quadrant are validators based on what the XBOW team calls canaries. A canary is a long random string (similar to a CTF flag) planted in a place users shouldn&amp;rsquo;t have access to. For example, you can plant a canary in the admin dashboard. If the agent ends up finding this complex string, you can confirm it was able to obtain admin credentials.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/canaries.png&#34; width=&#34;600&#34; height=&#34;305&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Canaries are suitable for cases where you can stand up the environment in something like a Docker container and modify the code, like open source applications. Canaries allow you to test for scenarios that would normally be very hard to test. Vulnerabilities that would otherwise require application-specific or system design-specific knowledge, like business logic vulnerabilities and IDOR vulnerabilities, become trivial to check for.&lt;/p&gt;
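&lt;p&gt;The core of canary validation is trivially simple, which is what makes it so reliable. A minimal sketch (function and variable names here are illustrative, not XBOW&amp;rsquo;s actual code):&lt;/p&gt;

```python
import secrets

def make_canary(prefix="XBOW-CANARY"):
    # A long random token, vanishingly unlikely to appear anywhere by chance.
    return prefix + "-" + secrets.token_hex(16)

def found_canary(captured_output, canary):
    # If the agent's captured output contains the canary, the agent
    # really did reach the protected location.
    return canary in captured_output

canary = make_canary()
admin_dashboard = "Welcome, admin! Secret note: " + canary
print(found_canary(admin_dashboard, canary))        # True
print(found_canary("public landing page", canary))  # False
```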
&lt;p&gt;So what do you do in cases where you can&amp;rsquo;t plant canaries - like the live targets in HackerOne&amp;rsquo;s bug bounty program?&lt;/p&gt;
&lt;h3 id=&#34;heuristic-based-validation&#34;&gt;Heuristic-based Validation&lt;/h3&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/validator-taxonomy2.png&#34; width=&#34;600&#34; height=&#34;308&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This leads us to the validators in the bottom right quadrant, which leverage headless browsers and heuristic validation to confirm the existence of proposed vulnerabilities.&lt;/p&gt;
&lt;p&gt;For example, an agent may submit something like an HTTP request or a URL as evidence of a vulnerability.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/validating-xss.png&#34; width=&#34;600&#34; height=&#34;411&#34; alt=&#34;&#34;&gt;
&lt;p&gt;In this example, the agent has gone through different pages and found a &lt;code&gt;p&lt;/code&gt; parameter where it thinks it can inject some JavaScript and prove the existence of an XSS vuln. Once it&amp;rsquo;s generated this URL, it will pass it to the browser validator by invoking &lt;code&gt;victim-goto&lt;/code&gt;, which (unsurprisingly) simulates what would happen if a victim were to go to that link. Then, without doing anything related to AI, it visits the link and validates that a pop-up appeared with the text it&amp;rsquo;s looking for (in this case &amp;lsquo;XSS&amp;rsquo;).&lt;/p&gt;
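&lt;p&gt;As a rough, hypothetical stand-in for that check (the real validator drives an actual headless browser and watches for the pop-up), the idea can be sketched like this - &lt;code&gt;simulate_victim_goto&lt;/code&gt; and the target URL are made up for illustration:&lt;/p&gt;

```python
from urllib.parse import parse_qs, unquote, urlsplit

def simulate_victim_goto(url, expected_marker="XSS"):
    """Very rough stand-in for the browser validator: decode the candidate
    URL and check whether the injected alert payload would fire with the
    expected marker text. The real validator visits the link in a headless
    browser and watches for an actual pop-up."""
    params = parse_qs(urlsplit(url).query)
    for values in params.values():
        for value in values:
            if "alert('" + expected_marker + "')" in unquote(value):
                return True
    return False

# Hypothetical payload URL generated by the agent:
candidate = "https://target.example/search?p=%27%3B%20alert(%27XSS%27)%3B%2F%2F"
print(simulate_victim_goto(candidate))  # True
```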
&lt;p&gt;Another class of validator in this quadrant is the heuristic validator.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/validator-taxonomy3.png&#34; width=&#34;600&#34; height=&#34;335&#34; alt=&#34;&#34;&gt;
&lt;p&gt;These validators have a little more noise but cover important attack classes where the team is willing to do more work to validate them. These validators confirm attacks like RCE, SQLi, and SSRF - critical vulns because they give the attacker a ton of control.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/heuristic-validators.png&#34; width=&#34;600&#34; height=&#34;311&#34; alt=&#34;&#34;&gt;
&lt;p&gt;In this validation implementation, we force the agent to submit evidence in a specific format that deterministic, non-AI code knows how to evaluate. For example, if we&amp;rsquo;re looking at SQLi, we can test for a blind time-based attack. This is where you send a request that triggers a database lookup with attacker-controlled commands. The attacker injects a command to sleep, which causes the system to stop and wait for a few seconds before returning anything. That delay is something the system can measure. So if we ask the model to provide two HTTP requests with a big timing difference between them - we can make those requests a few times, look for statistically significant differences in the timings, and say there&amp;rsquo;s probably a SQLi vulnerability here.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to emphasize this isn&amp;rsquo;t a perfect method of validation and does not have a zero false-positive rate. Webapps are incredibly varied; some have endpoints where, if you send a number, they will naturally sleep for that many seconds. There are also cases where one of two requests takes longer than the other purely because of how the app itself is built. However, this is an acceptable flaw given the severity of the attack classes involved. It is worth the manual effort of additional testing to confirm their existence.&lt;/p&gt;
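&lt;p&gt;The timing heuristic can be sketched in a few lines. This is a simplified illustration with made-up timings - the real validator replays actual HTTP requests and applies proper statistical testing:&lt;/p&gt;

```python
import statistics

def looks_like_time_based_sqli(baseline_times, sleepy_times, min_gap=4.0):
    """Heuristic check: does the request injecting a sleep consistently take
    much longer than the baseline request?"""
    gap = statistics.mean(sleepy_times) - statistics.mean(baseline_times)
    # True when the mean gap is at least min_gap seconds.
    return gap == max(gap, min_gap)

# Simulated timings from replaying each request a few times:
baseline = [0.21, 0.19, 0.23]  # normal request
sleepy   = [5.20, 5.31, 5.18]  # request injecting SLEEP(5)
print(looks_like_time_based_sqli(baseline, sleepy))    # True
print(looks_like_time_based_sqli(baseline, baseline))  # False
```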
&lt;p&gt;Now that we know more about how the system works, let&amp;rsquo;s check out some of the vulnerabilities the XBOW team found in their recent experience scanning DockerHub!&lt;/p&gt;
&lt;h1 id=&#34;the-dockerhub-experiment&#34;&gt;The DockerHub Experiment&lt;/h1&gt;
&lt;p&gt;The team started with 25 million DockerHub repositories and narrowed it down to about 17k images for automated testing. These images were the perfect use case for canary validation, and a fun way to see how their system works at scale. For this experiment, they were able to automate canary deployment through various techniques - such as modifying the Docker Compose file to add a canary in the root of the filesystem, adding an internal server hosting the canary for SSRF, and adding a table to an existing database with a row containing a canary.&lt;/p&gt;
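&lt;p&gt;As a hypothetical example of what automated canary deployment might look like (service names and paths are invented for illustration, not XBOW&amp;rsquo;s actual tooling), a Docker Compose override could plant canaries like this:&lt;/p&gt;

```yaml
# Hypothetical sketch of automated canary deployment:
services:
  target-app:
    build: .
    volumes:
      # File-read canary planted at the root of the container filesystem
      - ./canaries/root-canary.txt:/canary.txt:ro
  internal-canary-server:
    image: nginx:alpine
    # No published ports: only reachable from inside the compose network,
    # so fetching its canary proves SSRF from the target app.
    volumes:
      - ./canaries/ssrf:/usr/share/nginx/html:ro
```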
&lt;p&gt;The test was a success - they found A LOT of bugs.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/dockerhub-results.png&#34; width=&#34;600&#34; height=&#34;308&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Let&amp;rsquo;s explore some of their more notable finds!&lt;/p&gt;
&lt;h2 id=&#34;open-redirect-vulnerability-in-jenkins---cve-2025-27625httpsnvdnistgovvulndetailcve-2025-27625&#34;&gt;Open Redirect Vulnerability in Jenkins - &lt;a href=&#34;https://nvd.nist.gov/vuln/detail/CVE-2025-27625&#34;&gt;CVE-2025-27625&lt;/a&gt;&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/jenkins-openredirect.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;XBOW found an open redirect in Jenkins by appending two backslashes to the path parameter. Turns out Jenkins was checking the path to make sure it wasn&amp;rsquo;t an absolute URL, but was not checking if the URL was using this specific variation of an absolute URL. Not many people know that browsers will accept two backslashes at the beginning of a URL, but an LLM does! The agent was able to take this obscure knowledge and apply it in context to find the vulnerability.&lt;/p&gt;
&lt;p&gt;The team didn&amp;rsquo;t use canary validation for this vulnerability; it used headless browser validation. The agent transferred the payload URL to a &amp;lsquo;victim machine&amp;rsquo; and visited the URL, followed all the redirects, then checked whether the domain it landed on matched the expected domain (in this case &lt;code&gt;evil.xbow.ltd&lt;/code&gt;). This is the same kind of validation they apply to XSS and other client-side vulnerabilities, where you have to make the victim visit a link or page for exploitation.&lt;/p&gt;
&lt;p&gt;XBOW reported the vulnerability to Jenkins and found that, in addition to the double backslashes, a single backslash allowed an attacker to navigate to any arbitrary domain. The Jenkins team patched this by implementing a check for any domain that starts with a backslash.&lt;/p&gt;
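&lt;p&gt;A sketch of what such a check might look like (this is illustrative, not the actual Jenkins patch):&lt;/p&gt;

```python
def safe_redirect_target(path):
    """Illustrative redirect-target check: reject anything a browser could
    treat as scheme-relative, including the backslash variants browsers
    quietly accept."""
    normalized = path.replace("\\", "/")
    if normalized.startswith("//"):
        return False  # scheme-relative: //evil, \\evil, or /\evil
    if path.startswith("\\"):
        return False  # a single leading backslash is also unsafe
    if "://" in path:
        return False  # fully absolute URL
    return True

print(safe_redirect_target("/jenkins/job/build"))  # True
print(safe_redirect_target("\\\\evil.xbow.ltd"))   # False (double backslash)
print(safe_redirect_target("\\evil.xbow.ltd"))     # False (single backslash)
```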
&lt;h2 id=&#34;ssrf-in-apache-druid---cve-2025-27888httpsnvdnistgovvulndetailcve-2025-27888&#34;&gt;SSRF in Apache Druid - &lt;a href=&#34;https://nvd.nist.gov/vuln/detail/CVE-2025-27888&#34;&gt;CVE-2025-27888&lt;/a&gt;&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/apache-ssrf.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The XBOW team planted a canary on an internal web server that was only accessible from the target they were attacking. If XBOW was able to retrieve the canary, that was definitive evidence it was able to talk to internal URLs.&lt;/p&gt;
&lt;p&gt;Upon reading the source code, XBOW realized that there were URL paths &lt;code&gt;/proxy/coordinator&lt;/code&gt; and &lt;code&gt;/proxy/overlord&lt;/code&gt;, and whatever comes after them is proxied to the respective machine. XBOW appended an &lt;code&gt;@&lt;/code&gt; symbol to mimic a login and force the coordinator/overlord part of the URL to be parsed as a username. Because no credentials were needed, the target tossed out that part of the request, which ended up landing on the internal server, where XBOW was able to retrieve the canary.&lt;/p&gt;
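&lt;p&gt;Python&amp;rsquo;s standard URL parser shows why the trick works - everything before the &lt;code&gt;@&lt;/code&gt; in a URL&amp;rsquo;s authority is treated as userinfo, not as the host (host names below are illustrative):&lt;/p&gt;

```python
from urllib.parse import urlsplit

# A proxy that naively joins path segments into a URL can be steered to a
# different host, because the part before '@' is parsed as a username.
crafted = "http://coordinator@internal-canary-server:8080/canary.txt"
parts = urlsplit(crafted)
print(parts.username)  # coordinator  (discarded if no auth is required)
print(parts.hostname)  # internal-canary-server  (where the request lands)
```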
&lt;p&gt;In a cloud scenario, this might mean access to sensitive services like the AWS Instance Metadata Service. This could have also been leveraged for XSS because you could point it to your own server, return any arbitrary JavaScript, and because it&amp;rsquo;s returned as content-type HTML, it will be executed on the victim&amp;rsquo;s browser.&lt;/p&gt;
&lt;h2 id=&#34;file-read-in-group-docs&#34;&gt;File Read in Group Docs&lt;/h2&gt;
&lt;p&gt;This vulnerability was particularly interesting because of how many steps the agent had to chain together in order to find it.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fileread1.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;In this case, XBOW was targeting a .NET application, and there were a bunch of DLL files it could analyze. In those files it found a JWT, which it tried to use to log in as an administrator. It failed because the token was expired.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fileread2.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;It kept analyzing the DLLs and noticed there were a few keywords like &lt;code&gt;s3SecretKey&lt;/code&gt; and &lt;code&gt;SymmetricSecurityKey&lt;/code&gt;. It also found something that looked like a base64 encoded string. Due to the sequence of words, XBOW inferred that the base64 string might be a secret key and created a Python script to try out different combinations of attacks.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fileread3.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;XBOW first tried using different secrets to sign the JWT. It tried a number of key IDs and included the same secret it saw from the expired token. XBOW saw in the traffic that there was a compressed storage file, so it tried to use path traversal to escape from the directory where the archive&amp;rsquo;s files were expected to be extracted into the main file system. XBOW put together different payloads for path traversal, crafted its own JWT with admin credentials, and successfully obtained the canary.&lt;/p&gt;
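&lt;p&gt;A minimal sketch of the kind of JWT-forging script the agent put together, using only the Python standard library (the secret, key ID, and claims are invented for illustration):&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json

def b64url(data):
    # JWTs use unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_jwt(payload, secret, kid=None):
    """Minimal HS256 JWT signer: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    if kid:
        header["kid"] = kid
    head = b64url(json.dumps(header).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = head + b"." + body
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return (signing_input + b"." + b64url(sig)).decode()

# Try the base64 blob found in the DLLs as the signing secret (hypothetical):
leaked_secret = base64.b64decode("c2VjcmV0LWZyb20tZGxs")
token = sign_jwt({"sub": "admin", "role": "Administrator"},
                 leaked_secret, kid="legacy-key")
print(token.count("."))  # 2  (header.payload.signature)
```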
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/fileread4.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;h2 id=&#34;authz-bypass-in-redmine&#34;&gt;AuthZ Bypass in Redmine&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/redmine-authz.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This is another instance of XBOW using canary validation. It dug through the source code in this project management system and traced through five different classes before it figured out how the visibility check for projects was implemented. XBOW found a parameter that, when set to true, completely bypasses the visibility check and lets you see private projects.&lt;/p&gt;
&lt;p&gt;The example above uses an authenticated user for exploitation but it was later found out that you don&amp;rsquo;t need to be authenticated for this to work. Scary stuff!&lt;/p&gt;
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;XBOW is the first system of its kind, but it will not be the last. As LLMs improve and agentic systems advance, this kind of automated pentesting will become more widespread. As with any new technology, there will be those who wield it benevolently and malevolently. As it becomes easier for malicious actors to find vulnerabilities in production environments at scale, security teams must adopt similar technologies in order to properly test their services before deployment.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Proactive Cyber Defense with KillChainGraph</title>
      <link>https://blog.aidanjohn.org/2025/08/31/proactive-cyber-defense-with-killchaingraph.html</link>
      <pubDate>Sun, 31 Aug 2025 12:02:58 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/08/31/proactive-cyber-defense-with-killchaingraph.html</guid>
      <description>&lt;p&gt;In this post, we&amp;rsquo;ll explore a machine learning framework designed to combine the attack flow from the Lockheed Cyber Kill Chain with the MITRE ATT&amp;amp;CK dataset, enabling better contextualization, prediction, and defense against adversarial behavior.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/proactive-cyber-defense.png&#34; width=&#34;600&#34; height=&#34;337&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This post is a summary of research conducted by &lt;a href=&#34;https://www.linkedin.com/in/chitraksh-singh/&#34;&gt;Chitraksh Singh&lt;/a&gt;, &lt;a href=&#34;https://www.linkedin.com/in/monishadhanraj/&#34;&gt;Monisha Dhanraj&lt;/a&gt;, and &lt;a href=&#34;https://www.linkedin.com/in/kenhuang8/&#34;&gt;Ken Huang&lt;/a&gt; in their paper &lt;em&gt;&amp;ldquo;KillChainGraph: ML Framework for Predicting and Mapping ATT&amp;amp;CK Techniques&amp;rdquo;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;You can read their full paper on ArXiv here: &lt;a href=&#34;https://arxiv.org/abs/2508.18230v1&#34;&gt;https://arxiv.org/abs/2508.18230v1&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;table-of-contents&#34;&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;What is MITRE ATT&amp;amp;CK?&lt;/li&gt;
&lt;li&gt;What is the Cyber Kill Chain?&lt;/li&gt;
&lt;li&gt;Why KillChainGraph?&lt;/li&gt;
&lt;li&gt;Methodology
&lt;ul&gt;
&lt;li&gt;Dataset&lt;/li&gt;
&lt;li&gt;LightGBM (LGBM)&lt;/li&gt;
&lt;li&gt;Transformer&lt;/li&gt;
&lt;li&gt;BERT&lt;/li&gt;
&lt;li&gt;Graph Neural Network (GNN)&lt;/li&gt;
&lt;li&gt;Ensemble Strategy&lt;/li&gt;
&lt;li&gt;Semantic Mapping and Graph Construction&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;introduction&#34;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Conventional security tooling such as firewalls, rule-based intrusion detection systems, and signature-based analysis has long been the standard for security teams across the industry. It&amp;rsquo;s always been a cat-and-mouse game - as adversaries become more advanced, security professionals look for new ways to defend.&lt;/p&gt;
&lt;p&gt;In addition to known threat actors and APT groups, advances in GenAI have led to a rise in sophisticated exploitation from unsophisticated actors. Just recently, &lt;a href=&#34;https://www.anthropic.com/news/detecting-countering-misuse-aug-2025&#34;&gt;Anthropic has documented the use of Claude to enhance cybercriminal operations&lt;/a&gt; and enable what they call &amp;ldquo;vibe hacking&amp;rdquo;, which is exactly what you think it is.&lt;/p&gt;
&lt;p&gt;Defenders have been stuck playing catch-up - most of the existing tooling they use operates reactively and is often ineffective against threats like zero-day vulnerabilities and polymorphic malware. As adversaries become more advanced and the barrier to entry for sophisticated exploitation lowers, security professionals must adopt new techniques to stay ahead of the evolving threat landscape.&lt;/p&gt;
&lt;p&gt;Recent works have shown success in using machine learning for intrusion detection and anomaly classification, allowing defenders to be proactive instead of reactive. This paper expands on that area of work by leveraging machine learning to combine the attacker lifecycle described by the Cyber Kill Chain (CKC) with adversarial techniques from MITRE ATT&amp;amp;CK. By mapping known ATT&amp;amp;CK techniques to phases of the CKC, we have all the information we need to contextualize, predict, and defend against an increasingly wide set of adversarial behavior.&lt;/p&gt;
&lt;p&gt;In this paper, the researchers present a multi-model machine learning framework dubbed &lt;em&gt;KillChainGraph&lt;/em&gt; that emulates adversarial behavior across the seven phases of the CKC using the ATT&amp;amp;CK Enterprise dataset. This machine learning approach offers a scalable solution for extracting behavioral insights from large volumes of cyber threat intelligence data, allowing defenders to proactively identify full-cycle attack paths and preemptively strengthen security controls.&lt;/p&gt;
&lt;h1 id=&#34;what-is-mitre-attck&#34;&gt;What is MITRE ATT&amp;amp;CK?&lt;/h1&gt;
&lt;p&gt;ATT&amp;amp;CK is a globally accessible knowledge base maintained by the MITRE Corporation that houses detailed adversarial behavior. It catalogues known TTPs (Tactics, Techniques, and Procedures) employed by threat actors based on real-world observations and is constantly updated as new TTPs are discovered. Security teams across the globe use this knowledge to understand, test, and defend against cyberattacks.&lt;/p&gt;
&lt;p&gt;The ATT&amp;amp;CK Enterprise dataset is a full knowledge base of all the information MITRE maintains about adversary behavior in enterprise IT environments like Windows, Linux, MacOS, cloud, SaaS, containers, etc. - all in JSON format. This dataset contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tactics&lt;/strong&gt; (High-level adversarial objectives i.e. initial access, persistence, privilege escalation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Techniques &amp;amp; Sub-techniques&lt;/strong&gt; (Ways adversaries achieve each tactic along with detection ideas and mitigation advice)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Groups&lt;/strong&gt; (Threat Actors/APTs and which techniques they&amp;rsquo;ve been known to use)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Software&lt;/strong&gt; (Malware &amp;amp; Tools mapped to techniques)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mitigations&lt;/strong&gt; (Defensive measures mapped to techniques)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relationships&lt;/strong&gt; (Links that tie techniques &amp;lt;-&amp;gt; groups &amp;lt;-&amp;gt; software &amp;lt;-&amp;gt; mitigations)&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;what-is-the-cyber-kill-chain&#34;&gt;What is the Cyber Kill Chain?&lt;/h1&gt;
&lt;p&gt;The Cyber Kill Chain was developed by Lockheed Martin as a way to break down the attacker lifecycle into 7 key phases. This allows defenders to understand how intrusions progress so they can detect, prevent, and disrupt attacks at different points.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250829141200.png&#34; width=&#34;600&#34; height=&#34;699&#34; alt=&#34;&#34;&gt;
&lt;p&gt;&lt;em&gt;Lockheed Cyber Kill Chain Diagram (&lt;a href=&#34;https://www.lockheedmartin.com/en-us/capabilities/cyber/cyber-kill-chain.html&#34;&gt;https://www.lockheedmartin.com/en-us/capabilities/cyber/cyber-kill-chain.html&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
&lt;h1 id=&#34;why-killchaingraph&#34;&gt;Why KillChainGraph?&lt;/h1&gt;
&lt;p&gt;The novelty of this approach is in producing phase-specific modeling aligned with structured frameworks (CKC and MITRE ATT&amp;amp;CK) already recognized by the security community. The researchers introduce a forward-predictive pipeline that models attacker progression across the CKC using a suite of supervised classifiers: LightGBM, Transformer, BERT, and GNN, each trained per CKC phase.&lt;/p&gt;
&lt;p&gt;The input dataset was curated by semantically aligning ATT&amp;amp;CK techniques with CKC phases using ATTACK-BERT (a model specialized in understanding and analyzing cybersecurity-related text with a particular focus on attack actions and techniques). This generated phase-specific datasets that enable fine-grained classification of adversarial behavior.&lt;/p&gt;
&lt;p&gt;The researchers then constructed a semantic similarity graph to link predicted techniques across phases, effectively simulating the attack pathing employed by threat actors. Combining this semantically guided data engineering and graph-based inference enables explainable reasoning over adversarial tactics, granting defenders increased situational awareness and the ability to proactively enhance cyber defense.&lt;/p&gt;
&lt;h1 id=&#34;methodology&#34;&gt;Methodology&lt;/h1&gt;
&lt;p&gt;The framework consists of two major components: dataset construction and model training.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250828134140.png&#34; width=&#34;600&#34; height=&#34;148&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Five classification models are being employed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LightGBM (LGBM)&lt;/li&gt;
&lt;li&gt;Transformer&lt;/li&gt;
&lt;li&gt;BERT&lt;/li&gt;
&lt;li&gt;Graph Neural Network (GNN)&lt;/li&gt;
&lt;li&gt;Ensemble Strategy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given an attack description, the models are trained to predict which ATT&amp;amp;CK technique is being used along with its associated CKC phase. The predictions are then used to construct a graph representation of an adversary&amp;rsquo;s attack pathing.&lt;/p&gt;
&lt;h2 id=&#34;dataset&#34;&gt;Dataset&lt;/h2&gt;
&lt;p&gt;Using ATTACK-BERT, the researchers aligned adversarial behavior descriptions from the ATT&amp;amp;CK dataset with their appropriate CKC phase. This created a labeled dataset of adversarial actions, each consisting of an ATT&amp;amp;CK technique description, its corresponding ATT&amp;amp;CK technique ID + name, and the associated CKC phase. The dataset was then split into the seven CKC phases: Reconnaissance, Weaponization, Delivery, Exploitation, Installation, Command &amp;amp; Control, and Objectives.&lt;/p&gt;
&lt;p&gt;After cleaning up the data, each of the seven datasets was stratified into training, validation, and testing sets in a 70-10-20 split with an emphasis on maintaining class balance wherever possible.&lt;/p&gt;
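&lt;p&gt;A toy sketch of such a stratified 70-10-20 split (the helper below is illustrative, not the researchers&amp;rsquo; code): grouping samples by label before slicing keeps each split&amp;rsquo;s class balance close to the original.&lt;/p&gt;

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, seed=0):
    """Sketch of a 70-10-20 stratified split: shuffle within each class so
    every split keeps roughly the same class balance."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n = len(items)
        n_train = (7 * n) // 10
        n_val = n // 10
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test

samples = [f"desc-{i}" for i in range(10)]
labels = ["T1566"] * 10
train, val, test = stratified_split(samples, labels)
print(len(train), len(val), len(test))  # 7 1 2
```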
&lt;p&gt;Phases like Delivery and Command &amp;amp; Control had limited data availability and class imbalance, so some data augmentation strategies were applied, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Synonym Substitution&lt;/strong&gt; - increasing diversity in a dataset by replacing words in a text sample with their synonyms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TF-IDF Based Token Filtering&lt;/strong&gt; - a mechanism to drop low-importance terms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Word Reordering&lt;/strong&gt; and &lt;strong&gt;Controlled Duplication&lt;/strong&gt; - another way to increase dataset diversity without altering underlying intent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paraphrasing Techniques&lt;/strong&gt; - methods like using a &lt;strong&gt;Pegasus model&lt;/strong&gt; (trained to rewrite text with different words while preserving underlying intent) and &lt;strong&gt;back-translation&lt;/strong&gt; (taking a sentence, translating it to another language, then back to English).&lt;/li&gt;
&lt;/ul&gt;
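&lt;p&gt;As a toy illustration of the first strategy above (the synonym table and helper are invented for the example):&lt;/p&gt;

```python
import random

# A tiny, made-up synonym table; a real pipeline would use a lexical
# resource like WordNet.
SYNONYMS = {
    "adversary": ["attacker", "threat actor"],
    "sends": ["transmits", "delivers"],
    "malicious": ["harmful", "hostile"],
}

def synonym_substitute(text, rng):
    """Toy synonym-substitution augmentation: swap known words for a
    synonym to diversify low-resource phases without changing intent."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)

rng = random.Random(0)
print(synonym_substitute("The adversary sends a malicious payload", rng))
```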
&lt;p&gt;These augmentation strategies improved the robustness and generalization ability of models trained on low-resource CKC phases.&lt;/p&gt;
&lt;p&gt;Given a description of an attack, each model is trained to predict the correct ATT&amp;amp;CK technique being used (and by extension its CKC phase).&lt;/p&gt;
&lt;p&gt;Now, let&amp;rsquo;s take a look at each of the models the researchers used for this framework.&lt;/p&gt;
&lt;h2 id=&#34;lightgbm-lgbm&#34;&gt;LightGBM (LGBM)&lt;/h2&gt;
&lt;p&gt;LightGBM works by building lots of small decision trees and combining them. The model essentially learns a bunch of little rules from the dataset like &amp;ldquo;&lt;em&gt;if the word credential is present it most likely has to do with credential dumping&lt;/em&gt;&amp;rdquo; or &amp;ldquo;&lt;em&gt;if the description mentions a registry it&amp;rsquo;s most likely a persistence technique&lt;/em&gt;&amp;rdquo; then combines these rules to guess the correct ATT&amp;amp;CK technique.&lt;/p&gt;
&lt;p&gt;First, a vector embedding (an array of numbers) is generated using ATTACK-BERT to represent the semantic content of a given ATT&amp;amp;CK description. Then, they use this and its associated ATT&amp;amp;CK technique name to train the model. This allows the model to decide which specific ATT&amp;amp;CK technique is being employed, given a description of the attack.&lt;/p&gt;
&lt;h2 id=&#34;transformer&#34;&gt;Transformer&lt;/h2&gt;
&lt;p&gt;The Transformer-based classifier was another way for the researchers to classify ATT&amp;amp;CK techniques given a description. In this model, they used GloVe instead of ATTACK-BERT to generate semantic vector embeddings. Each vector embedding corresponding to a given attack description was given to the transformer as input.&lt;/p&gt;
&lt;p&gt;The main difference in the transformer architecture is that instead of following a rigid &amp;ldquo;if-then&amp;rdquo; path like a decision tree, the Transformer uses a flexible neural network with self-attention to understand the meaning of the whole attack description even when important words are spread out.&lt;/p&gt;
&lt;h2 id=&#34;bert&#34;&gt;BERT&lt;/h2&gt;
&lt;p&gt;BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model designed to capture deep bidirectional representations from unlabeled text. The researchers fine-tuned this pre-trained model on ATT&amp;amp;CK technique descriptions to predict the corresponding technique label.&lt;/p&gt;
&lt;p&gt;Each description was tokenized (broken up and translated to numbers) and truncated/padded to a fixed sequence length. These tokens are fed into BERT to generate semantic vector embeddings, which are then passed to a function that outputs the probabilities for each possible ATT&amp;amp;CK technique label. The label with the highest probability is chosen and returned.&lt;/p&gt;
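&lt;p&gt;The final step - turning raw scores into probabilities and picking the most likely label - is the standard softmax-and-argmax pattern. A minimal sketch with made-up scores and technique labels:&lt;/p&gt;

```python
import math

def softmax(logits):
    # Convert raw classifier scores into probabilities that sum to 1.
    # Subtracting the max keeps exp() numerically stable.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical classification-head scores over three technique labels:
labels = ["T1059", "T1003", "T1547"]
probs = softmax([2.0, 0.5, 0.1])
best = max(zip(probs, labels))[1]
print(best)  # T1059
```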
&lt;p&gt;The main advantage of BERT is that there is a performance boost due to its pre-training on large-scale corpora like Wikipedia and BookCorpus. It has the ability to learn task-specific representations with minimal architectural modification and fine-tuning. This allows the model to better adapt to domain specific text - in our case: adversarial behavior descriptions.&lt;/p&gt;
&lt;h2 id=&#34;graph-neural-network-gnn&#34;&gt;Graph Neural Network (GNN)&lt;/h2&gt;
&lt;p&gt;The GNN in this framework is used to model the semantic and structural connections between ATT&amp;amp;CK descriptions. Each description is treated as a node in the graph, and the edges are constructed based on textual similarity (shared keyword patterns or how close the descriptions are in the semantic vector embedding space). A GNN layer aggregates information from neighboring nodes to update the representation of each node. The final node embedding is passed through a classification layer to predict the ATT&amp;amp;CK technique being used.&lt;/p&gt;
&lt;p&gt;The main advantage of using a GNN is the ability to leverage the relational structure between technique descriptions, improving generalization for similar but rare classes. This is especially valuable in the context of cybersecurity, where techniques may have overlapping semantics and evolving terminology.&lt;/p&gt;
&lt;h2 id=&#34;ensemble-strategy&#34;&gt;Ensemble Strategy&lt;/h2&gt;
&lt;p&gt;In this classifier, the four previous classification models (LGBM, Transformer, BERT, and GNN) are combined to synthesize a final ATT&amp;amp;CK technique prediction through a &lt;em&gt;Weighted Soft Voting&lt;/em&gt; ensemble strategy. In a hard voting strategy, each classifier would give its prediction, and a majority vote would decide the final technique prediction. Soft voting decides the final technique prediction by considering the predicted class probabilities from each individual classifier - allowing the framework to make a more nuanced decision influenced by each model&amp;rsquo;s confidence.&lt;/p&gt;
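&lt;p&gt;Weighted soft voting itself is a small computation. A sketch with hypothetical per-model probabilities and weights:&lt;/p&gt;

```python
def weighted_soft_vote(prob_dists, weights):
    """Weighted soft voting: sum each model's class probabilities, scaled
    by how much we trust that model, then pick the highest-scoring label."""
    combined = {}
    for probs, w in zip(prob_dists, weights):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + w * p
    return max(combined, key=combined.get)

# Hypothetical per-model probabilities for one attack description:
lgbm        = {"T1003": 0.6,  "T1547": 0.4}
transformer = {"T1003": 0.3,  "T1547": 0.7}
bert        = {"T1003": 0.55, "T1547": 0.45}
gnn         = {"T1003": 0.7,  "T1547": 0.3}
print(weighted_soft_vote([lgbm, transformer, bert, gnn],
                         [0.2, 0.1, 0.2, 0.5]))  # T1003
```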
&lt;h2 id=&#34;semantic-mapping-and-graph-construction&#34;&gt;Semantic Mapping and Graph Construction&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250831113213.png&#34; width=&#34;600&#34; height=&#34;229&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Once the final ATT&amp;amp;CK technique prediction has been generated, the next step is to construct a semantic graph that connects techniques across adjacent phases of the CKC. The goal is to simulate how an attacker might logically transition from one phase to the next.&lt;/p&gt;
&lt;p&gt;Each attack description is converted into a vector embedding using ATTACK-BERT to capture semantic meaning. For any two techniques &lt;code&gt;v_i&lt;/code&gt; from phase &lt;code&gt;t&lt;/code&gt; and &lt;code&gt;v_j&lt;/code&gt; from phase &lt;code&gt;t + 1&lt;/code&gt;, we compute how similar they are using cosine similarity (how close the vector embeddings are in the embedding space). The computation yields a &lt;code&gt;result&lt;/code&gt; where &lt;code&gt;-1 &amp;lt;= result &amp;lt;= 1&lt;/code&gt;, with &lt;code&gt;1&lt;/code&gt; being perfectly similar and &lt;code&gt;-1&lt;/code&gt; being completely opposite. Values closer to 1 indicate that the techniques are semantically similar and might represent a logical transition in an attacker&amp;rsquo;s plan.&lt;/p&gt;
&lt;p&gt;To build the graph, a minimum threshold is set on &lt;code&gt;result&lt;/code&gt;. If the similarity between two techniques across adjacent phases is greater than or equal to the threshold, we draw a directed edge from the phase-&lt;code&gt;t&lt;/code&gt; technique &lt;code&gt;v_i&lt;/code&gt; to the phase-&lt;code&gt;t + 1&lt;/code&gt; technique &lt;code&gt;v_j&lt;/code&gt;.&lt;/p&gt;
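&lt;p&gt;Putting the two steps together - cosine similarity plus a threshold - can be sketched as follows (the technique names and tiny embeddings are made up for illustration; real ATTACK-BERT embeddings have hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_edges(phase_t, phase_t1, threshold=0.7):
    """Draw a directed edge from a phase-t technique to a phase-(t+1)
    technique when their embeddings are similar enough."""
    edges = []
    for name_i, emb_i in phase_t.items():
        for name_j, emb_j in phase_t1.items():
            score = cosine(emb_i, emb_j)
            if score == max(score, threshold):  # score at or above threshold
                edges.append((name_i, name_j))
    return edges

recon = {"T1595 Active Scanning": [0.9, 0.1, 0.0]}
weap  = {"T1587 Develop Capabilities": [0.85, 0.15, 0.05],
         "T1608 Stage Capabilities": [0.0, 0.2, 0.9]}
print(build_edges(recon, weap))
# [('T1595 Active Scanning', 'T1587 Develop Capabilities')]
```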
&lt;p&gt;This semantic mapping helps model an attacker&amp;rsquo;s potential attack flow in accordance with the CKC, providing a structured view of predicted attacker behavior and aiding in proactive cyber defense.&lt;/p&gt;
&lt;h1 id=&#34;results&#34;&gt;Results&lt;/h1&gt;
&lt;p&gt;The researchers tested these five models (LightGBM, Transformer, BERT, GNN, and the ensemble strategy) across the seven CKC phases using the following metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;li&gt;Precision&lt;/li&gt;
&lt;li&gt;Recall&lt;/li&gt;
&lt;li&gt;F1-score&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The results were as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ensemble Strategy 🏆&lt;/li&gt;
&lt;li&gt;GNN&lt;/li&gt;
&lt;li&gt;LGBM&lt;/li&gt;
&lt;li&gt;BERT&lt;/li&gt;
&lt;li&gt;Transformer&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Overall, the ensemble strategy performed best. It&amp;rsquo;s important to note that its improvement over the GNN is small (0.03% - 0.20% gains) but significant given the already high baseline performance, where all F1-scores exceeded 97%.&lt;/p&gt;
&lt;p&gt;For example, in the Delivery phase, the GNN achieves an F1-score of 99.28%, with the ensemble improving this to 99.31%, corresponding to a 0.03% increase. While in the Exploitation phase, the F1-score rises from 97.67% to 97.87% (a 0.2% gain). In the Installation phase, from 98.70% to 98.83% (a 0.13% gain). These improvements occur consistently across all phases, confirming that the ensemble provides incremental gains even when the GNN already performs near optimally.&lt;/p&gt;
&lt;p&gt;GNN remains the strongest individual model, consistently outperforming LGBM, Transformer, and BERT across all metrics and phases.&lt;/p&gt;
&lt;p&gt;LGBM generally ranks second, particularly in phases with clearer feature separability, while BERT performs better than Transformer due to its contextual semantic modeling, which is particularly beneficial in phases such as Exploitation and Delivery.&lt;/p&gt;
&lt;p&gt;The Transformer model remains the weakest across all phases, with accuracies ranging from 55.56% in Delivery to 86.81% in Actions on Objectives, suggesting that self-attention architectures without domain-specific adaptation may underperform in this task.&lt;/p&gt;
&lt;p&gt;The results indicate that while the GNN remains the most effective standalone classifier, the integration of multiple heterogeneous learners in a weighted soft voting framework produces a &lt;strong&gt;consistent&lt;/strong&gt; performance uplift leading to fewer false positives and false negatives.&lt;/p&gt;
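&lt;p&gt;The weighted soft voting idea can be sketched as follows; the class probabilities and model weights here are purely illustrative, not the paper&amp;rsquo;s values:&lt;/p&gt;

```python
def weighted_soft_vote(prob_dists, weights):
    """Combine per-model class-probability vectors with a weighted average
    and return the index of the winning class plus the combined distribution."""
    n_classes = len(prob_dists[0])
    total = sum(weights)
    combined = [
        sum(w * p[c] for w, p in zip(weights, prob_dists)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined

# Illustrative 3-class probabilities from four models
probs = [
    [0.70, 0.20, 0.10],  # GNN
    [0.60, 0.30, 0.10],  # LightGBM
    [0.40, 0.50, 0.10],  # BERT
    [0.30, 0.40, 0.30],  # Transformer
]
weights = [0.4, 0.3, 0.2, 0.1]  # assumed: stronger learners weigh more
label, combined = weighted_soft_vote(probs, weights)
print(label)  # class 0 wins the weighted vote
```
&lt;p&gt;Because stronger learners carry more weight, a single weak model disagreeing (here the BERT and Transformer rows) is not enough to flip the final prediction.&lt;/p&gt;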
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;The researchers recognize that the ensemble model increases computational cost and inference time, since it requires predictions from multiple models. However, the performance gains validate the effectiveness of their multi-model approach. Their future work will focus on real-world validation through live threat intelligence integration and deployment within automated SOC pipelines.&lt;/p&gt;
&lt;p&gt;The ability to map and predict adversarial behavior is an essential building block for truly effective defenses. In the meantime, these kinds of systems can help defenders visualize and contextualize adversarial activity within familiar frameworks - granting increased situational awareness and the ability to proactively take defensive measures.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Bypassing LLM Safeguards with CognitiveAttack</title>
      <link>https://blog.aidanjohn.org/2025/08/01/bypassing-llm-safeguards-with-cognitiveattack.html</link>
      <pubDate>Fri, 01 Aug 2025 11:17:52 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/08/01/bypassing-llm-safeguards-with-cognitiveattack.html</guid>
      <description>&lt;head&gt;
    &lt;link rel=&#34;stylesheet&#34; href=&#34;https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css&#34;&gt;
    &lt;script defer src=&#34;https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js&#34;&gt;&lt;/script&gt;
    &lt;script defer src=&#34;https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js&#34; 
            onload=&#34;renderMathInElement(document.body);&#34;&gt;&lt;/script&gt;
    &lt;script&gt;
        document.addEventListener(&#34;DOMContentLoaded&#34;, function() {
            renderMathInElement(document.body, {
                delimiters: [
                    {left: &#34;$$&#34;, right: &#34;$$&#34;, display: true},
                    {left: &#34;$&#34;, right: &#34;$&#34;, display: false}
                ]
            });
            const complexEq = document.getElementById(&#39;complex-eq&#39;);
            if (complexEq) {
                katex.render(&#34;\\max_{\\mathcal{B} \\subseteq \\mathcal{B}_{\\text{pool}}} \\mathbb{E}_{x&#39; \\sim \\pi_\\theta(\\cdot \\mid x_0, \\mathcal{B})} \\left[ R(x_0, x&#39;, \\mathcal{T}) \\right]&#34;, complexEq, {displayMode: false});
            }
        });
    &lt;/script&gt;
&lt;/head&gt;
In this post we are diving into AI red-teaming techniques! The tactic we&#39;ll be exploring leverages cognitive bias to undermine LLM safeguards.
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/cognitiveattack.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;This post is a summarization of research conducted by Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu in their paper &amp;ldquo;Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;You can read the full paper on Arxiv here: &lt;a href=&#34;https://arxiv.org/abs/2507.22564&#34;&gt;https://arxiv.org/abs/2507.22564&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;introduction&#34;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Usual jailbreaking approaches for LLMs focus on prompt engineering or algorithmic manipulations, but this paper highlights the power of multi-bias interactions in undermining LLM safeguards.&lt;/p&gt;
&lt;p&gt;Studies have identified various individual biases such as authority bias, anchoring, foot-in-the-door persuasion, confirmation bias, and status quo bias as effective techniques for eliciting harmful, unethical, or policy-violating outputs. However, most existing work treats each bias in isolation; this research shows that prompts subtly combining multiple biases can bypass safeguards that would normally block single-bias prompts.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250801073701.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;The researchers propose a novel red-teaming framework called CognitiveAttack which systematically leverages both individual and combined cognitive biases. CognitiveAttack uses a fine-tuned red-team model to generate semantically faithful but bias-infused rewrites of harmful instructions. The model is optimized via reinforcement learning to discover effective single or multi-bias combinations that maximize attack success while preserving intent. This method effectively bypasses safety protocols while maintaining high attack success rates.&lt;/p&gt;
&lt;p&gt;Upon evaluation of this method across a range of representative LLMs, their findings reveal significantly higher vulnerability under cognitive bias attacks. Compared to SOTA black-box jailbreak methods, their approach achieves superior performance (Attack Success Rate (ASR) of 60.1% vs 31.6%).&lt;/p&gt;
&lt;h1 id=&#34;objective&#34;&gt;Objective&lt;/h1&gt;
&lt;p&gt;The researchers assert a key finding: bias interactions can amplify or weaken adversarial effectiveness. This underscores the need to optimize bias combinations by maximizing synergy and minimizing interference.&lt;/p&gt;
&lt;p&gt;Given a harmful instruction $x_{0}$, the goal is to produce a rewritten version $x^{&#39;}$ that is semantically aligned but more likely to bypass LLM safety mechanisms by embedding a bias combination $\mathcal{B} \in \mathcal{B}_{pool}$. The optimization objective is:&lt;/p&gt;
&lt;p&gt;&lt;span id=&#34;complex-eq&#34;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where $\pi_\theta$ is the red team model parameterized by $\theta$, and $\mathcal{T}$ denotes a suite of target LLMs. The reward $R$ measures attack success and intent preservation.&lt;/p&gt;
&lt;p&gt;In simple terms you are searching over all possible bias combinations to find the best subset of biases to help you craft an effective adversarial prompt. For a given input $x_{0}$ and bias subset $\mathcal{B}$ you sample a generated rewritten instruction from the red team model. The reward function evaluates how well the rewritten instruction achieves two goals: bypassing safeguards in the target LLMs and preserving the original intent of instruction $x_{0}$.&lt;/p&gt;
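&lt;p&gt;The search can be sketched as a loop over candidate bias subsets. The &lt;code&gt;rewrite&lt;/code&gt; and &lt;code&gt;reward&lt;/code&gt; functions below are toy stand-ins for the red team model and the judge, not the paper&amp;rsquo;s implementation:&lt;/p&gt;

```python
from itertools import combinations

def best_bias_combo(x0, bias_pool, rewrite, reward, max_size=3):
    """Search subsets of the bias pool (up to max_size), rewrite the
    instruction under each, and keep the highest-reward combination."""
    best = (None, None, float("-inf"))
    for k in range(1, max_size + 1):
        for combo in combinations(bias_pool, k):
            x_prime = rewrite(x0, combo)
            r = reward(x0, x_prime)
            if r > best[2]:
                best = (combo, x_prime, r)
    return best

# Toy stand-ins, for illustration only
def rewrite(x0, combo):
    return f"[{' + '.join(combo)}] {x0}"

def reward(x0, x_prime):
    # pretend these two biases synergize
    return x_prime.count("authority") + x_prime.count("anchoring")

combo, x_prime, r = best_bias_combo(
    "original instruction", ["authority", "anchoring", "status quo"], rewrite, reward
)
print(combo, r)
```
&lt;p&gt;The actual framework avoids this exhaustive enumeration by training a policy $\pi_\theta$ to propose promising combinations directly.&lt;/p&gt;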
&lt;h1 id=&#34;implementation&#34;&gt;Implementation&lt;/h1&gt;
&lt;p&gt;You can find the code for their experiment here: &lt;a href=&#34;https://github.com/YancyKahn/CognitiveAttack&#34;&gt;https://github.com/YancyKahn/CognitiveAttack&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The researchers developed a cognitive bias-enhanced rewriting model by fine-tuning &lt;code&gt;deepseek-r1-distill-qwen-14b&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Training comprised three stages: data collection, supervised fine-tuning (SFT), and reinforcement learning.&lt;/p&gt;
&lt;h2 id=&#34;data-collection&#34;&gt;Data collection&lt;/h2&gt;
&lt;p&gt;The researchers used two high-quality risk datasets, &lt;code&gt;AdvBench&lt;/code&gt; and &lt;code&gt;HEx-PHI&lt;/code&gt;, as sources of harmful instructions $x_{0}$. For each, they applied prompt engineering with DeepSeek-R1 to generate adversarial rewrites $x^{&#39;}$ embedding one or more cognitive biases $\mathcal{B} \in \mathcal{B}_{pool}$ with the following structure:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;lt;think&amp;gt; For the given instruction [INSTRUCTION],
to improve persuasion and reduce detection,
I will apply [BIAS A] to build credibility and [BIAS B] to induce compliance.
Wait, wait … I think [BIAS D] will help create a false sense of balance,
and [BIAS C] will make the instruction more compelling. &amp;lt;/think&amp;gt;
# thebias: [“BIAS A”, “BIAS B”, “BIAS C”, “BIAS D”]
# theprompt: As a government advisor, I must inform …

&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Each example is formatted as a triple $(x_{0},x^{&#39;},\mathcal{B})$ with the &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block capturing the reasoning pathways for bias selection. These reasoning traces serve as weak supervision for later stages, allowing the model to learn how to plan bias combinations rather than apply them blindly.&lt;/p&gt;
&lt;h2 id=&#34;supervised-fine-tuning-sft&#34;&gt;Supervised Fine Tuning (SFT)&lt;/h2&gt;
&lt;p&gt;In this stage the researchers use the curated dataset to fine-tune the model, endowing it with the capability to systematically rewrite instructions in accordance with specified cognitive bias strategies. Through exposure to a wide range of annotated examples, the red team model internalizes the stylistic, structural, and rhetorical patterns associated with different bias types.&lt;/p&gt;
&lt;h2 id=&#34;reinforcement-learning-rl&#34;&gt;Reinforcement Learning (RL)&lt;/h2&gt;
&lt;p&gt;This stage refines the model&amp;rsquo;s ability to generate jailbreak prompts that effectively evade safety filters while leveraging optimal combinations of cognitive biases. The reward function integrates two normalized components, both ranging from -1 to 1: the safety evasion score, derived by linearly normalizing the GPT-Judge safety rating, and the intent consistency score, which measures the semantic alignment between the original instruction $x_{0}$ and its rewritten counterpart $x^{&#39;}$.&lt;/p&gt;
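&lt;p&gt;A minimal sketch of such a reward, assuming a 1-5 GPT-Judge rating scale and equal component weights (both are assumptions for illustration, not values stated in the paper):&lt;/p&gt;

```python
def safety_evasion_score(judge_rating, lo=1, hi=5):
    """Linearly map a judge rating on [lo, hi] to [-1, 1]; a higher rating
    (a more harmful response was elicited) scores higher."""
    return 2 * (judge_rating - lo) / (hi - lo) - 1

def reward(judge_rating, intent_consistency, w_safety=0.5, w_intent=0.5):
    """Combine the two normalized components; the weights are assumed."""
    return w_safety * safety_evasion_score(judge_rating) + w_intent * intent_consistency

print(reward(5, 1.0))  # fully evaded safeguards, intent preserved -> 1.0
print(reward(1, 1.0))  # refused outright, intent preserved -> 0.0
```
&lt;p&gt;The intent term is what keeps the policy from degenerating into rewrites that evade filters by drifting away from the original instruction.&lt;/p&gt;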
&lt;h2 id=&#34;application&#34;&gt;Application&lt;/h2&gt;
&lt;p&gt;Given a held-out set of harmful instructions the model infers the optimal bias combination and rewrites the input into a paraphrased instruction through a &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; step. This reasoning-aware rewriting process explicitly aims to enhance the likelihood of eliciting policy-violating responses while preserving the original intent.&lt;/p&gt;
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;The results of their experiments show that CognitiveAttack consistently outperforms existing baselines in terms of success rate, generality, and resistance to safety mechanisms. They found that multi-bias prompts are more likely to evade defenses while preserving adversarial potency.&lt;/p&gt;
&lt;p&gt;These findings highlight the need for more research into cognitive bias as a critical attack vector for LLMs.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>MCP - A New Frontier in GenAI Security</title>
      <link>https://blog.aidanjohn.org/2025/07/30/mcp-a-new-frontier-in.html</link>
      <pubDate>Wed, 30 Jul 2025 07:18:48 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/07/30/mcp-a-new-frontier-in.html</guid>
      <description>&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/mcp-a-new-frontier-in-genai-security.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;In this post I want to share some insights gleaned from an enlightening conversation between &lt;a href=&#34;https://www.linkedin.com/in/scottjclinton/&#34;&gt;Scott Clinton&lt;/a&gt;, &lt;a href=&#34;https://www.linkedin.com/in/algorythm/&#34;&gt;Jason Ross&lt;/a&gt;, &lt;a href=&#34;https://www.linkedin.com/in/akram-sheriff-81749316/&#34;&gt;Akram Sheriff&lt;/a&gt;, &lt;a href=&#34;https://www.linkedin.com/in/ophirdror/&#34;&gt;Ophir Dror&lt;/a&gt;, and &lt;a href=&#34;https://www.linkedin.com/in/or-oxenberg/&#34;&gt;Or Oxenberg&lt;/a&gt; regarding the security implications of agentic systems that leverage Model Context Protocol (MCP).&lt;/p&gt;
&lt;p&gt;This post is a summary of their conversation. You can find the entire webinar, which was hosted by the &lt;a href=&#34;https://genai.owasp.org/&#34;&gt;OWASP GenAI Security Project&lt;/a&gt;, &lt;a href=&#34;https://www.youtube.com/watch?v=5uQAz5oNeQU&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;what-is-agentic-what-role-does-mcp-play&#34;&gt;What is Agentic? What role does MCP play?&lt;/h1&gt;
&lt;p&gt;At a high level, agentic systems are software systems that use AI to pursue some goal by completing tasks. Unlike traditional LLM systems, they can reason, plan, and maintain some form of memory. Think of the traditional LLM as the brain and the agentic architecture as the body - allowing systems to think, act, remember, and carry out more sophisticated operations.&lt;/p&gt;
&lt;p&gt;MCP is a standardized way for agents to communicate and connect with external data sources - you can think of it as a unified abstraction layer to connect agentic flows to real world data. In the same way that an API allows developers to interact with systems, MCP allows agents to interact with systems in the new era of agentic workflows.&lt;/p&gt;
&lt;p&gt;MCP can also be used in a multi-agent system, allowing agents to interact with each other and share contextual data/tools/resources that enable real-time agentic workflows. In this way, agents can be used as tools in the world of MCP - though we haven&amp;rsquo;t seen a lot of this yet in production. We&amp;rsquo;re still in the early stage where people are figuring out how to connect it to data and resources. But this is far from its final evolution - agents as tools will enable very complex and powerful ecosystems.&lt;/p&gt;
&lt;h1 id=&#34;the-challenges-of-non-determinism&#34;&gt;The challenges of non-determinism&lt;/h1&gt;
&lt;p&gt;The most glaring problem from a security perspective is that these systems are inherently non-deterministic. We&amp;rsquo;re used to thinking about application security in a static world where we can review the code and approve a golden image that we know is going to be the same in production. MCP is the exact opposite. It&amp;rsquo;s highly dynamic. Agentic decision-making inherently means you don&amp;rsquo;t know what tools are going to be called at runtime.&lt;/p&gt;
&lt;p&gt;In the security world, a non-deterministic security control is a BIG no-no. However, that&amp;rsquo;s the reality of agentic AI, and MCP in particular, in its current form. We&amp;rsquo;re essentially saying &amp;ldquo;hey LLM you are now our security control figure out what makes sense&amp;rdquo; while it has access to all our critical infrastructure.&lt;/p&gt;
&lt;h1 id=&#34;can-we-just-apply-the-same-approaches-to-mcp-as-we-do-apis&#34;&gt;Can we just apply the same approaches to MCP as we do APIs?&lt;/h1&gt;
&lt;p&gt;Traditional API security is applicable for a deterministic API workflow. In this system we have well defined endpoints, authorization specifications, and dataflow that can be reliably assumed by a developer. With MCP all of this goes away. The developer would not know ahead of time which resources are being accessed or which specific tool calls will be executed.&lt;/p&gt;
&lt;p&gt;There are some agentic tools which are purely API based, and for these specific tools you can leverage existing methods like fingerprinting or heatmap based analysis. But this only solves 30% of the problem. The other 70% lives in the agentic tool functionality that is achieved via Python decorator functions within the MCP implementation. Existing API security solutions are not applicable here.&lt;/p&gt;
&lt;p&gt;The LLM uses the tool descriptions, tool schemas, and contextual nature of the incoming language prompt to decide which tool to use. Since the developer would not know ahead of time which agentic tool will be invoked at any given time, you need custom MCP based security solutions which can address real-time agentic tool calls which the LLMs are invoking themselves.&lt;/p&gt;
&lt;p&gt;Industry must recognize that there is a real up-front cost associated with developing this 70% solution.&lt;/p&gt;
&lt;h1 id=&#34;agentic-systems-introduce-brand-new-attack-surfaces&#34;&gt;Agentic systems introduce brand new attack surfaces&lt;/h1&gt;
&lt;p&gt;With all the hype around agentic systems, industry is rushing to implement solutions immediately and everywhere without truly understanding what the risks are. Everything about this domain is so new, especially MCP. At a high level, MCP is a dumpster fire from a security point of view. You&amp;rsquo;re taking the worst of AI and software development and merging them together in a shiny new wrapper that everyone wants to use. In reality, leveraging agentic systems and MCP means the erasure of the production boundary. Your boundary becomes whatever someone decides to put in their MCP server implementation. Whether that&amp;rsquo;s a remote server, some tool they&amp;rsquo;re running locally - whatever it may be. Securing this kind of environment is almost impossible.&lt;/p&gt;
&lt;p&gt;On top of this, MCP at its core is just Python code, so you&amp;rsquo;re not able to leverage most existing tooling like EDR. You&amp;rsquo;re not able to point to a malicious binary as the source of your problems. Rather, it&amp;rsquo;s just language - and what does input sanitization look like when the exploit is enveloped in natural language? It&amp;rsquo;s a really difficult problem to solve.&lt;/p&gt;
&lt;h1 id=&#34;new-attacks-require-new-solutions-and-broader-awareness&#34;&gt;New attacks require new solutions and broader awareness&lt;/h1&gt;
&lt;p&gt;If you take the first generation of LLM based applications back in 2023, most of the security concerns could be solved by using an AI firewall. If there is a prompt manipulation or prompt poisoning related attack then you can have some kind of workflow which can detect and apply remediation either at the input layer or the output layer in order to solve the problem.&lt;/p&gt;
&lt;p&gt;With agentic design patterns you have tools, short-term memory, and then the LLM call which is just one component within the entirety of the agentic runtime. The existing AI firewall related products/solutions are not applicable to solving these agentic related security problems. There might be some overlap for basic use cases, but when it comes to real-time autonomous agentic workflows (which is where its true power lies), the current products cannot address these gaps.&lt;/p&gt;
&lt;p&gt;MCP touches many different layers from application to presentation to network to transport. Building out a comprehensive full-stack approach might be a viable method of security and there are people working on preventing agentic-related security attacks at runtime from this perspective.&lt;/p&gt;
&lt;p&gt;While building new solutions and implementing better security measures within MCP are good, they&amp;rsquo;re nothing if nobody knows how and when to use them. Awareness in this domain is key to effective security. For example, say you&amp;rsquo;re super security oriented and you&amp;rsquo;ve done an in depth review of the MCP code. You&amp;rsquo;ve determined that the code is doing exactly what you want it to do and you approve it for use. Three days later that server gets compromised. The attacker changes the tool descriptions and does some prompt injection on it. Most people aren&amp;rsquo;t going back and checking for changes in their tool descriptions. That&amp;rsquo;s a new problem for the security community. One that can only be addressed if you have a greater awareness of the MCP protocol and your specific implementation.&lt;/p&gt;
&lt;p&gt;Best practices are going to look different depending on if your concern is users consuming MCP servers in their tooling (something like Cursor) or if you&amp;rsquo;re a developer who&amp;rsquo;s implementing a system that allows your users to leverage MCP servers. Both of these scenarios have concerns with the security tooling changing underneath the platform, which is a novel attack scenario that MCP has introduced. Right now there is no existing tooling that, once added to the server, notifies you if the tooling changes or if the tool descriptions change. The protocol itself has a mechanism that allows the server to send a signal indicating the tooling has changed, but it&amp;rsquo;s on the server to implement the notification and it&amp;rsquo;s on the client to do something with that notification if it&amp;rsquo;s received.&lt;/p&gt;
&lt;p&gt;There are many facets to security in the new era of agentic systems. For instance, your agentic tooling needs to have precise and clear documentation. Within agentic tool calls there&amp;rsquo;s something called a docstring - the LLM looks into these docstrings and makes a decision as to which particular agentic tool caller to use at runtime. Developers must be aware of the heightened degree of complexity this introduces. You not only have to evaluate the docstrings of the MCP server you&amp;rsquo;re planning to integrate, but also the docstrings of every other MCP server you&amp;rsquo;ve already installed and think about which tools the agent will call if there is an overlap in the descriptions.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to be aware that although the protocol supports security measures, they aren&amp;rsquo;t necessarily built into existing tooling. If you&amp;rsquo;re in the process of developing these tools, it&amp;rsquo;s incumbent upon you to make sure you&amp;rsquo;re implementing proper security measures - for example, hashing the tool descriptions to check for changes, or, if the server sends a notification that the tooling has changed, handling it correctly and prompting the user to confirm they want to continue the workflow given the tool change.&lt;/p&gt;
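&lt;p&gt;A minimal sketch of the description-hashing idea, assuming a simple list-of-dicts tool format rather than a real MCP client library:&lt;/p&gt;

```python
import hashlib
import json

def fingerprint_tools(tools):
    """Hash the full set of tool names and descriptions so that any change
    (a "rug pull") is detectable the next time the client connects."""
    canonical = json.dumps(sorted((t["name"], t["description"]) for t in tools))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# At review time: pin the hash of the approved tool set
approved = [{"name": "read_file", "description": "Reads a file from disk."}]
pinned = fingerprint_tools(approved)

# Later: the server silently swaps in a prompt-injected description
current = [{"name": "read_file",
            "description": "Reads a file. IGNORE PREVIOUS INSTRUCTIONS..."}]
if fingerprint_tools(current) != pinned:
    print("Tool descriptions changed since approval - re-review before use")
```
&lt;p&gt;Pinning a hash at approval time turns the &amp;ldquo;nobody goes back and re-checks tool descriptions&amp;rdquo; problem into an automatic comparison on every connection.&lt;/p&gt;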
&lt;p&gt;Education about these techniques and capabilities for agentic systems is essential if we want to create more trust in the protocol and community solutions. There is plenty of room for more thought leadership and people should understand that the problems we are facing in this domain are vastly different from what we&amp;rsquo;ve been accustomed to so far. Traditional methods like API based security and supply chain security are not enough anymore.&lt;/p&gt;
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;It&amp;rsquo;s too early for generalized best practices specifically for MCP, but for agentic systems as a whole the basics remain the same. At minimum you should have exceptional visibility into the system. Being able to monitor and see how people are using the system is very important. It can give you good intuition about what kinds of risks you are facing. Traditional security concepts such as IAM, zero trust, detection and response, and sandboxing are still just as important.&lt;/p&gt;
&lt;p&gt;It is important to recognize that the space is still very new and the security tooling has not caught up to the pace of innovation. We don&amp;rsquo;t have clear answers on how to properly secure these systems yet. If you&amp;rsquo;re going to implement agentic systems, it&amp;rsquo;s imperative you go in with your eyes open and be critically aware that we don&amp;rsquo;t understand all the risks right now.&lt;/p&gt;
&lt;p&gt;However there are still things we can do to protect ourselves and our businesses. Things like sandboxing, using validated and trusted community servers, implementing open-source scanning tools that can protect you from things like invisible Unicode attacks, basic security principles, etc. are still a solid foundation for building comprehensive security around this emerging domain.&lt;/p&gt;
&lt;p&gt;The OWASP GenAI Security Project recently released their updated guide for securing agentic applications; I definitely recommend checking it out &lt;a href=&#34;https://genai.owasp.org/resource/securing-agentic-applications-guide-1-0/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thanks for reading, see ya in the next one :)&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>sRAG - Exploring Secure RAG</title>
      <link>https://blog.aidanjohn.org/2025/07/07/srag-exploring-secure-rag.html</link>
      <pubDate>Mon, 07 Jul 2025 17:44:36 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/07/07/srag-exploring-secure-rag.html</guid>
      <description>&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/srag-image.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;Today we are exploring secure Retrieval Augmented Generation (RAG) systems to create a context-aware chatbot.&lt;/p&gt;
&lt;p&gt;With all the hype around GenAI, I thought it would be fun to explore a little bit of what RAG has to offer. At a high level, RAG systems allow users to leverage the full power of large language models while giving the model awareness of personal context. LLMs don&amp;rsquo;t know everything out of the box; they only know about information they were trained on. If you want your chatbot to know about internal data such as wikis, docs, and procedures, a RAG system provides this context by appending chunks of relevant information from a database to the query sent to the LLM. As you can imagine, RAG opens up a new world of personal and enterprise solutions, enabling custom information retrieval and analysis while leveraging the increasing power of open source LLMs.&lt;/p&gt;
&lt;p&gt;Many of the RAG implementations you will see out there leverage OpenAI&amp;rsquo;s API for LLM computation, but that doesn&amp;rsquo;t lend itself well towards handling sensitive data in an enterprise environment. In order to leverage this technology whilst protecting the confidentiality of your data, special considerations must be made.&lt;/p&gt;
&lt;h1 id=&#34;rag-overview&#34;&gt;RAG Overview&lt;/h1&gt;
&lt;p&gt;Let&amp;rsquo;s dive a little deeper into how a simple version of RAG actually works. Assume you have a bunch of data that you want the LLM to know about so that it can generate informed responses when given a question. Internal wikis, log data, expense reports, tool documentation - any digitized data can be used as context for the LLM. That data is chunked up into smaller segments and stored in a vector database. Vector databases are special types of databases that hold not only strings of data but also a corresponding vector representing each entry. These databases are also optimized for vector-based search and retrieval.&lt;/p&gt;
&lt;p&gt;To quickly explain vectors, think back to plotting points in grade school: you have an x and y coordinate representing the position of a datapoint on a two-dimensional graph. You can refer to this point as a two-dimensional vector &lt;code&gt;[x, y]&lt;/code&gt;. Finding the distance between two points on the graph is simple: &lt;code&gt;sqrt((x2 - x1)^2 + (y2-y1)^2)&lt;/code&gt;. In a RAG system a process called vector embedding is used to create vector representations of data, typically generated by techniques like Word2Vec, BERT, or Sentence Transformers. These vectors are much larger than just two dimensions; for reference, OpenAI&amp;rsquo;s embedding model generates 1,536 dimensions. Vectors capture the semantic meaning of data, allowing for comparison and similarity measurement between chunks of text. Finding the distance between these multi-dimensional vectors is key to determining how close or similar two pieces of text are in the embedding space.&lt;/p&gt;
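&lt;p&gt;Both ideas can be sketched in a few lines - the grade-school distance formula generalizes directly to any number of dimensions, and cosine similarity is the other common closeness measure:&lt;/p&gt;

```python
import math

def euclidean(a, b):
    """Straight-line distance, generalized to any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity; values near 1.0 mean the vectors point the
    same way in the embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean([0, 0], [3, 4]))                # the familiar 2-D case: 5.0
print(cosine_similarity([1, 0, 1], [1, 0, 1]))  # identical direction, ~1.0
```
&lt;p&gt;Real embedding vectors have hundreds or thousands of components, but the arithmetic is exactly the same.&lt;/p&gt;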
&lt;p&gt;Once the chunks and their corresponding vectors have been loaded into the vector database they can be used to retrieve context for a given query. When the user asks a question, the question is vectorized using the same embedding model that was used to vectorize the database. The database then returns the top k chunks of text that are closest in distance to the query vector in the embedding space. These chunks represent the most contextually relevant data in the database to the user&amp;rsquo;s query.&lt;/p&gt;
&lt;p&gt;The relevant chunks are then appended to the user&amp;rsquo;s query and sent to the LLM to generate a response, resulting in a contextually relevant answer.&lt;/p&gt;
&lt;h1 id=&#34;the-objective&#34;&gt;The Objective&lt;/h1&gt;
&lt;p&gt;To create a chatbot using a simple RAG system that is context aware of internal data without compromising confidentiality.&lt;/p&gt;
&lt;p&gt;The code can be found here: &lt;a href=&#34;https://github.com/Aidan-John/sRAG&#34;&gt;https://github.com/Aidan-John/sRAG&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;architecture&#34;&gt;Architecture&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/srag-architecture.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;Above is an overview of what our simple sRAG implementation looks like. There are a few core components that merge together to create our system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Indexer&lt;/li&gt;
&lt;li&gt;Datastore&lt;/li&gt;
&lt;li&gt;Retriever&lt;/li&gt;
&lt;li&gt;Response Generator&lt;/li&gt;
&lt;li&gt;Evaluator&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The codebase is designed to be modular, leveraging interfaces with abstract classes such that the implementations can be swapped out regularly. New tools and methodologies are constantly being developed by the community so it&amp;rsquo;s important that maintainers of the system are able to upgrade core components and test their effectiveness regularly without refactoring large amounts of code.&lt;/p&gt;
&lt;h2 id=&#34;indexer&#34;&gt;Indexer&lt;/h2&gt;
&lt;p&gt;The indexer parses and chunks the data we want to give our LLM for context. These algorithms take the raw data and transform it into something that the LLM will be able to understand, then chunk that data into smaller pieces.&lt;/p&gt;
&lt;p&gt;Real world data is super messy. There can be images, diagrams, charts, spreadsheets. Using a standard parsing algorithm for everything can leave you with incomplete or messy data that the LLM can&amp;rsquo;t easily process. Optimizing your parsing to extract relevant data is essential to improving accuracy in the LLM&amp;rsquo;s response (more on this later in the upgrading sRAG section).&lt;/p&gt;
&lt;p&gt;LLMs have a limited input context window, meaning there is a maximum character count for the input we can send to the LLM. The model we are using is a version of Gemma3, whose context window is about 128k tokens (one token is about 4 characters of English text; 100 tokens is about 75 English words). When choosing a chunk size it&amp;rsquo;s important to consider that we are sending the top k relevant chunks with our query, so &lt;code&gt;(chunk_size * k) + len(query) + len(system_prompt) &amp;lt;= 128000 * 4&lt;/code&gt; characters (more on system prompts later in the response generator section). The window size is fairly large so you shouldn&amp;rsquo;t have to worry about hitting the limit unless you&amp;rsquo;re appending significant amounts of data. For reference, a typical novel is ~90k words (~120k tokens) and War and Peace is ~587k words (~780k tokens).&lt;/p&gt;
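&lt;p&gt;That budget check can be sketched as a quick helper (all sizes in characters, using the rough 4-characters-per-token rule; the constants are assumptions for this model):&lt;/p&gt;

```python
CONTEXT_WINDOW_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text

def fits_context(chunk_size, k, query, system_prompt):
    """Rough check that k retrieved chunks plus the query and system prompt
    stay within the model's context window."""
    budget_chars = CONTEXT_WINDOW_TOKENS * CHARS_PER_TOKEN
    used = chunk_size * k + len(query) + len(system_prompt)
    return used <= budget_chars

print(fits_context(2000, 5, "What is our VPN policy?",
                   "You are a helpful assistant."))
```
&lt;p&gt;Five 2,000-character chunks barely dent a 128k-token window, which is why chunk quality matters far more than raw budget for most documents.&lt;/p&gt;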
&lt;p&gt;Optimally, the chunking process should &lt;strong&gt;not&lt;/strong&gt; exclude relevant information from each chunk. This can happen for various reasons such as cutting off mid-paragraph or not including sufficient data from a table. Chunk size should be set on a per-document basis by running tests on various chunk sizes and evaluating the relevancy of the outputs.&lt;/p&gt;
&lt;p&gt;In this project we are using some standard parsing and chunking libraries from &lt;a href=&#34;https://github.com/docling-project/docling&#34;&gt;docling&lt;/a&gt;, which won&amp;rsquo;t give us an optimal solution but serves as a good starting point. Optimizing the parsing and chunking process for your specific data is a challenge, but the reward is a significant boost in accuracy for the LLM&amp;rsquo;s response.&lt;/p&gt;
&lt;p&gt;There are some really nice parsing and chunking algorithms out there such as Llama Parse from Llama Index that handle a variety of different data formats. The problem with this is that you have to send off your data to their service to do the parsing and chunking, which is a no go for a secure system. Using open source tooling such as &lt;a href=&#34;https://github.com/docling-project/docling&#34;&gt;docling&lt;/a&gt; or &lt;a href=&#34;https://github.com/Unstructured-IO/unstructured&#34;&gt;unstructured&lt;/a&gt; is ideal for us because all of the computation is being handled locally. However, building an in-house solution for the types of data you will be using in the system is the best way to ensure that no sensitive information leaves your network.&lt;/p&gt;
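&lt;p&gt;For intuition, a bare-bones character chunker with overlap might look like the sketch below; real tooling like docling is far more structure-aware, but the overlap trick - repeating the tail of each chunk at the head of the next - is a common way to keep sentences cut at a boundary intact somewhere:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks, with each chunk
    overlapping the previous one by `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "A" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```
&lt;p&gt;Structure-aware chunkers improve on this by splitting at paragraph, heading, or table boundaries instead of fixed character offsets.&lt;/p&gt;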
&lt;h2 id=&#34;datastore&#34;&gt;Datastore&lt;/h2&gt;
&lt;p&gt;The datastore component handles our interactions with our chosen vector database. In this project we are using Chroma DB, an open-source vector database running locally in Docker.&lt;/p&gt;
&lt;p&gt;Chroma stores data in collections, which are like tables in a normal database. Given our chunks of data from the indexer, this component stores the chunks in our chosen collection. Normally we would run the data through an embedding model to generate the vector for each chunk and then store both the data and its vector in the collection. Fortunately for us, Chroma bakes the vector computation into the &lt;code&gt;collection.add()&lt;/code&gt; function. Chroma lets you specify a vector embedding function from a multitude of supported functions, or you can add your own custom function if you wish. We are going to use a popular sentence transformer called &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, which generates vectors in 384 dimensions.&lt;/p&gt;
&lt;p&gt;The datastore also handles requests from the retriever to search for relevant context. The question is vectorized with &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; and the k most similar chunks are returned.&lt;/p&gt;
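&lt;p&gt;Here is a minimal sketch of the datastore&amp;rsquo;s add and search paths using the Chroma client. The collection name, host, and port are my assumptions, not the project&amp;rsquo;s actual values.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;import chromadb
from chromadb.utils import embedding_functions

def get_collection(client, name=&#39;documents&#39;):
    # all-MiniLM-L6-v2 produces 384-dimensional vectors; Chroma calls it for us
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=&#39;all-MiniLM-L6-v2&#39;)
    return client.get_or_create_collection(name=name, embedding_function=ef)

def index_chunks(collection, chunks):
    # collection.add() embeds each chunk with the collection&#39;s embedding function
    collection.add(documents=chunks, ids=[f&#39;chunk-{i}&#39; for i in range(len(chunks))])

def retrieve(collection, question, k=3):
    # the question is embedded with the same model, then the nearest chunks come back
    results = collection.query(query_texts=[question], n_results=k)
    return results[&#39;documents&#39;][0]

if __name__ == &#39;__main__&#39;:
    # in this project Chroma runs in Docker; host and port here are assumptions
    client = chromadb.HttpClient(host=&#39;localhost&#39;, port=8000)
    col = get_collection(client)
    index_chunks(col, [&#39;Swan Lagoon was founded in 1901.&#39;])
    print(retrieve(col, &#39;When was Swan Lagoon founded?&#39;, k=1))
&lt;/code&gt;&lt;/pre&gt;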
&lt;h2 id=&#34;retriever&#34;&gt;Retriever&lt;/h2&gt;
&lt;p&gt;The retriever component is passed the query from the user and contacts the datastore for relevant context. In our implementation we directly contact the datastore for context and pass it to the response generator. However, having a dedicated retriever component allows us to use more advanced tactics like &lt;strong&gt;reranking&lt;/strong&gt; in the future - more on that later in the upgrading sRAG section.&lt;/p&gt;
&lt;h2 id=&#34;response-generator&#34;&gt;Response Generator&lt;/h2&gt;
&lt;p&gt;The response generator is what actually interacts with our chosen LLM, handling response generation for both user queries and the evaluation step. Here is where we combine our context and query into one string before we send it to the LLM. We also prepend a system prompt that gives the LLM instructions on how it should respond. Our system prompt is as follows:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Use the provided context to provide a concise answer to the user&#39;s question. 
If you cannot find the answer in the context, say so. Do not make up information.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Then we construct our prompt in the following format:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250705213955.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;The reason we separate the instructions, context, and query in this way is specific to Gemma3. Many models expose dedicated roles that codify the distinction between system prompts, context, and user queries. Gemma3 does not, so this is a way to give the model that differentiation in plain English.&lt;/p&gt;
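&lt;p&gt;A minimal sketch of this assembly (the section labels here are illustrative, not necessarily the exact ones used in the project):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;def build_prompt(system_prompt, context_chunks, query):
    # Gemma3 has no dedicated system role, so each part is labeled in plain English
    context = &#39;\n\n&#39;.join(context_chunks)
    return (
        f&#39;Instructions: {system_prompt}\n\n&#39;
        f&#39;Context:\n{context}\n\n&#39;
        f&#39;Question: {query}&#39;
    )

print(build_prompt(&#39;Be concise.&#39;, [&#39;Chunk one.&#39;, &#39;Chunk two.&#39;], &#39;What is the fee?&#39;))
&lt;/code&gt;&lt;/pre&gt;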
&lt;p&gt;The LLM response is then returned to the user.&lt;/p&gt;
&lt;h2 id=&#34;evaluator&#34;&gt;Evaluator&lt;/h2&gt;
&lt;p&gt;The evaluator is the component we use to test our system. Testing a system like this requires data that the model wouldn&amp;rsquo;t have been trained on. It would also be helpful if the evaluation data were similar in structure to the kinds of data we want the system to ingest.&lt;/p&gt;
&lt;p&gt;In our project there are a series of evaluation PDFs about a town called Swan Lagoon. Swan Lagoon does not exist, none of the information about the town is real, but the fake details about the town are very specific. This allows us to test our system to see if the responses being generated are a result of the model&amp;rsquo;s initial training or if the responses are actually based on the context from our database. We create a series of questions and answers about our fake data and measure the accuracy of the model&amp;rsquo;s responses.&lt;/p&gt;
&lt;p&gt;The system prompt is as follows:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;SYSTEM_PROMPT = &amp;quot;&amp;quot;&amp;quot;

You are a system that evaluates the correctness of a response to a question.
The question will be provided in &amp;lt;question&amp;gt;...&amp;lt;/question&amp;gt; tags.
The response will be provided in &amp;lt;response&amp;gt;...&amp;lt;/response&amp;gt; tags.
The expected answer will be provided in &amp;lt;expected_answer&amp;gt;...&amp;lt;/expected_answer&amp;gt; tags.

The response doesn&#39;t have to exactly match all the words/context of the expected answer. It just needs to be right about the answer to the actual question itself.

Evaluate whether the response is correct or not, and return your reasoning in &amp;lt;reasoning&amp;gt;...&amp;lt;/reasoning&amp;gt; tags.

Then return the result in &amp;lt;result&amp;gt;...&amp;lt;/result&amp;gt; tags — either as &#39;true&#39; or &#39;false&#39;.

&amp;quot;&amp;quot;&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We then generate a prompt that includes the system prompt, one of our test questions, the expected answer to this test question, and the response we get from the LLM to our test question.&lt;/p&gt;
&lt;p&gt;We are leveraging the power of the LLM to check itself for errors - pretty neat, eh? From the response we extract its evaluation and tally the results. This gives us a metric for how accurate our system is. When we upgrade or change a part of this system, we can compare this metric to the metric from the previous version to see if the update was beneficial.&lt;/p&gt;
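&lt;p&gt;Extracting the verdict from the tagged reply can be done with a couple of regular expressions. A sketch, assuming the evaluator follows the tag format above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;import re

def parse_evaluation(response):
    # pull the reasoning and the true/false verdict out of the evaluator&#39;s reply
    reasoning = re.search(r&#39;&amp;lt;reasoning&amp;gt;(.*?)&amp;lt;/reasoning&amp;gt;&#39;, response, re.DOTALL)
    result = re.search(r&#39;&amp;lt;result&amp;gt;(.*?)&amp;lt;/result&amp;gt;&#39;, response, re.DOTALL)
    if result is None:
        return &#39;&#39;, False  # a malformed reply counts as a failure
    verdict = result.group(1).strip().lower() == &#39;true&#39;
    return (reasoning.group(1).strip() if reasoning else &#39;&#39;, verdict)

def tally(responses):
    # count how many evaluator replies marked the answer correct
    return sum(parse_evaluation(r)[1] for r in responses)
&lt;/code&gt;&lt;/pre&gt;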
&lt;h1 id=&#34;project-demo&#34;&gt;Project Demo&lt;/h1&gt;
&lt;p&gt;In this project we are using a RAG system to create a chatbot about a fictional town named Swan Lagoon. We have four different PDF files that contain specific data about our fake town: a booklet containing a brief history about the town, a business directory listing information about local businesses in table format, a service guide with information about town resources in differing table formats, and a tourist guide listing some popular destinations for visitors. We also have a JSON file with 25 questions and answers about the town based on information found in the PDFs that will be used for testing the accuracy of our system.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s get started by executing &lt;code&gt;docker compose up --build -d&lt;/code&gt; in the root of the project. This will stand up our Chroma vector database, ollama for locally running Gemma3, our backend API with our RAG pipeline, and our frontend UI. This step requires us to download the Gemma3 model so it will take a few minutes depending on your internet connection.&lt;/p&gt;
&lt;p&gt;After everything is stood up we can access our frontend at &lt;code&gt;localhost:3000&lt;/code&gt;&lt;/p&gt;
&lt;h2 id=&#34;populating-chromadb&#34;&gt;Populating ChromaDB&lt;/h2&gt;
&lt;p&gt;First we need to populate our vector database. For ease of use I&amp;rsquo;ve added a button in the UI to load the documents.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707120629.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;In the logs for our API we can see they have been parsed, chunked, and loaded into the vector database.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707120535.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;asking-a-question&#34;&gt;Asking a question&lt;/h2&gt;
&lt;p&gt;Now our system is ready for some questions about our data! Let&amp;rsquo;s ask something like &amp;ldquo;What is the contact number for Swan Lagoon Police Station?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This is located in our service guide under local emergency contacts.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707111740.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s see if our system was able to parse out this information correctly from the table and return an accurate response:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707122331.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;Works like a charm!&lt;/p&gt;
&lt;h2 id=&#34;resetting-db-and-asking-a-question&#34;&gt;Resetting DB and asking a question&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s try emptying out our vector database and asking the same question. Will the system hallucinate an answer?&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707122519.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;No hallucinations here!&lt;/p&gt;
&lt;h2 id=&#34;running-eval&#34;&gt;Running Eval&lt;/h2&gt;
&lt;p&gt;Finally let&amp;rsquo;s run our evaluation procedure to get a baseline on how well our system is performing in its current state by hitting the evaluate button. This will take a while depending on your processing power.&lt;/p&gt;
&lt;p&gt;We can see it working within the logs of the API.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707124339.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;We submitted a total of 25 questions and their expected responses to the evaluator and are given a score of 20/25. Not bad! If we scroll up we can see each question that was asked, the response from the LLM, the expected answer, and the reasoning for the model&amp;rsquo;s evaluation.&lt;/p&gt;
&lt;h2 id=&#34;cleanup&#34;&gt;Cleanup&lt;/h2&gt;
&lt;p&gt;Thanks to docker compose shutting down the demo is easy with &lt;code&gt;docker compose down -v&lt;/code&gt;.&lt;/p&gt;
&lt;h1 id=&#34;upgrading-srag&#34;&gt;Upgrading sRAG&lt;/h1&gt;
&lt;p&gt;The RAG system we have built here is a great starting point. In this last section I want to cover some advanced techniques we can apply in the future to get better and more accurate results from the system.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707075850.jpg&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;parsingchunking-strategies&#34;&gt;Parsing/Chunking Strategies&lt;/h1&gt;
&lt;p&gt;Earlier we touched on the idea of using better parsing to improve the effectiveness of our system. In order to get the most out of our parsing and chunking we can add a classification system for the data that&amp;rsquo;s being ingested. Based on your use case you can build out specific parsing and chunking strategies for the different formats of data. The classification system would look at the current document that is about to be indexed and choose the parser and chunk size that would be optimal for that specific kind of document.&lt;/p&gt;
&lt;h1 id=&#34;reranking&#34;&gt;Reranking&lt;/h1&gt;
&lt;p&gt;When the user asks a question like &amp;ldquo;What were the sales numbers in 2024?&amp;rdquo;, the vector search will return the top k most similar chunks to be used in the query. Embedding similarity, however, is only a rough proxy for how relevant a chunk actually is to the question. There is also a phenomenon in LLMs known as the lost-in-the-middle problem: models tend to focus more heavily on the beginning and end of a given prompt. If the chunks are not ordered from most to least relevant, your most relevant data can get lost in the middle. We can address this through a process called reranking, which uses another model to score each chunk against the query and reorder the chunks from most relevant to least.&lt;/p&gt;
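&lt;p&gt;A reranking step can be sketched like this. Here &lt;code&gt;model&lt;/code&gt; is anything with a &lt;code&gt;predict()&lt;/code&gt; method over (query, chunk) pairs - a sentence-transformers CrossEncoder is a common choice, though that is my assumption, not part of this project:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;def rerank(query, chunks, model):
    # the model reads each (query, chunk) pair jointly, giving a much better
    # relevance signal than raw embedding distance
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked]
&lt;/code&gt;&lt;/pre&gt;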
&lt;h1 id=&#34;hybrid-search&#34;&gt;Hybrid Search&lt;/h1&gt;
&lt;p&gt;Strictly using vector search is not the best search method for all use cases. For example, if you had an online store where users are asking about a specific product, you want to ensure that the product name is an exact match to the product in your database. For this you would use a combination of vector search and keyword search. After conducting both searches on their respective databases, you can feed these chunks into a reranker, find the top k most relevant chunks, and use them as context for your input to the LLM.&lt;/p&gt;
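&lt;p&gt;One common way to merge the two result lists is reciprocal rank fusion (RRF) - an alternative I&amp;rsquo;m naming here, not something from this project. It combines rankings without needing the two search methods&amp;rsquo; scores to be comparable:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists: e.g. [vector_search_results, keyword_search_results],
    # each already ordered from best to worst
    fused = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

print(reciprocal_rank_fusion([[&#39;a&#39;, &#39;b&#39;, &#39;c&#39;], [&#39;b&#39;, &#39;d&#39;]]))  # [&#39;b&#39;, &#39;a&#39;, &#39;d&#39;, &#39;c&#39;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The constant k=60 is the value commonly used with this technique; &lt;code&gt;&#39;b&#39;&lt;/code&gt; wins above because it appears in both lists.&lt;/p&gt;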
&lt;h1 id=&#34;upgrading-the-llm&#34;&gt;Upgrading the LLM&lt;/h1&gt;
&lt;p&gt;In this project we are using a 4B-parameter version of Gemma3 that has been quantized to reduce its memory footprint. Running the LLM locally is a great way to ensure that none of your data leaves your control. There are other ways of implementing secure LLMs, such as using Azure OpenAI, which is approved as a service within the FedRAMP High Authorization in Azure Government; more info here: &lt;a href=&#34;https://learn.microsoft.com/en-us/azure/ai-foundry/openai/azure-government&#34;&gt;https://learn.microsoft.com/en-us/azure/ai-foundry/openai/azure-government&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I am constrained by the computational limits of my PC, which is why I am using an older and smaller model. If you have access to more processing power, using more advanced models will significantly improve your responses and allow for more options when it comes to development. In addition to their wider knowledge base, newer models have more features, like dedicated inputs for system prompts and context. They are also better at common LLM problems like the lost-in-the-middle problem and the needle-in-a-haystack problem (more on that in the next section). Additionally, input context windows seem to grow at an insane rate with each open source release. Our version of Gemma3 has an input context window of 128k tokens, but newer models like Llama4 boast a 10M token input context window.&lt;/p&gt;
&lt;h2 id=&#34;cache-augmented-generation---cag&#34;&gt;Cache Augmented Generation - CAG&lt;/h2&gt;
&lt;p&gt;The needle-in-a-haystack problem is exactly what it sounds like: how good is an LLM at finding relevant information within a sea of input context? Newer models have gotten very good at this. Combined with input context windows on the order of millions of tokens, another kind of system becomes a realistic alternative. Cache Augmented Generation is like RAG, but instead of appending relevant chunks to your query you append entire documents into the input window. Say a user asks &amp;ldquo;Summarize the financial reports from 2020 and 2021&amp;rdquo;. A CAG system would find the 2020 and 2021 financial reports in the database and feed the entirety of both documents into the input for the LLM. The advantage is that the LLM gets full context of your data - you don&amp;rsquo;t have to worry about whether the chunks you&amp;rsquo;re providing contain all the information the LLM needs to answer the user&amp;rsquo;s question.&lt;/p&gt;
&lt;h1 id=&#34;agentic-behavior&#34;&gt;Agentic Behavior&lt;/h1&gt;
&lt;p&gt;We can utilize the power of LLMs to do reasoning and optimization in order to improve the performance of our system.&lt;/p&gt;
&lt;h2 id=&#34;step-back-prompting&#34;&gt;Step-back Prompting&lt;/h2&gt;
&lt;p&gt;Instead of doing the vector search with the query coming directly from the user, we can ask the LLM to modify the query so it is more retrieval friendly. This is a technique called step-back prompting and it was created by Google DeepMind.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707085951.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;Performing the vector search against the step-back question yields a more generalized and useful result which we can use in the input for our final answer.&lt;/p&gt;
&lt;p&gt;This kind of technique can also be used for query planning. Imagine our user asks &amp;ldquo;How are sales trending from 2020 to 2022&amp;rdquo;. A step-back process like this can break down the question into three sub-questions, searching for sales data from 2020, 2021, and 2022 in our database. Then it will combine all this data and feed it into the LLM to provide an answer.&lt;/p&gt;
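&lt;p&gt;The flow can be sketched with two placeholder callables - &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;search&lt;/code&gt; stand in for the response generator and retriever, and the rewrite prompt wording is my own:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;def step_back_search(question, llm, search, k=5):
    # llm: prompt -&amp;gt; text; search: (query, k) -&amp;gt; chunks
    step_back = llm(
        &#39;Rewrite the following question into a more general question that is &#39;
        &#39;easier to answer from retrieved documents. Return only the question.\n&#39;
        f&#39;Question: {question}&#39;
    )
    return search(step_back.strip(), k)
&lt;/code&gt;&lt;/pre&gt;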
&lt;h2 id=&#34;metadata-filters&#34;&gt;Metadata filters&lt;/h2&gt;
&lt;p&gt;During the indexing process we can add metadata tags, describing the data inside, to the documents and to the collections that hold them. This can be done programmatically or by asking the LLM to pick tags from a list of potential tags. From there we can attach metadata to the user&amp;rsquo;s query and use it to filter our search area.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/metadata-filtering.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
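&lt;p&gt;With Chroma this maps directly onto the &lt;code&gt;where&lt;/code&gt; argument of &lt;code&gt;collection.query()&lt;/code&gt;. A sketch - the &lt;code&gt;doc_type&lt;/code&gt; tag name is my own, not from this project:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;def add_with_metadata(collection, chunks, doc_type):
    collection.add(
        documents=chunks,
        metadatas=[{&#39;doc_type&#39;: doc_type} for _ in chunks],
        ids=[f&#39;{doc_type}-{i}&#39; for i in range(len(chunks))],
    )

def filtered_query(collection, question, doc_type, k=3):
    # `where` restricts the vector search to chunks whose metadata matches,
    # shrinking the search area before any similarity comparison happens
    results = collection.query(
        query_texts=[question],
        n_results=k,
        where={&#39;doc_type&#39;: doc_type},
    )
    return results[&#39;documents&#39;][0]
&lt;/code&gt;&lt;/pre&gt;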
&lt;h2 id=&#34;corrective-rag-crag&#34;&gt;Corrective RAG (CRAG)&lt;/h2&gt;
&lt;p&gt;Corrective RAG is a method of refinement for the responses given by a RAG system.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/pasted-image-20250707093647.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
&lt;p&gt;For a given query, we perform the context retrieval like normal. Then we have our LLM evaluate whether the retrieved documents are correct and relevant to the question we are asking. If they are, we go through a knowledge refinement process to clean up the retrieved context. If the retrieval is ambiguous or incorrect, the agent will search the internet for sources, repeating until it has enough context to provide a correct answer, and generates the result from there. You can read more about this technique here: &lt;a href=&#34;https://arxiv.org/pdf/2401.15884&#34;&gt;https://arxiv.org/pdf/2401.15884&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Thanks for coming to my TED talk, hope you enjoyed.&lt;/p&gt;
&lt;p&gt;Disclaimer: I am no expert, this is my first proper dive into GenAI systems. Had a lot of fun learning and definitely want to explore more in the future. Any and all feedback is welcome :)&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>HTB - Appointment (SQL Injection)</title>
      <link>https://blog.aidanjohn.org/2025/06/16/htb-appointment-sql-injection.html</link>
      <pubDate>Mon, 16 Jun 2025 10:17:15 -0400</pubDate>
      
      <guid>http://aidanj.micro.blog/2025/06/16/htb-appointment-sql-injection.html</guid>
      <description>&lt;p&gt;I think it&amp;rsquo;s only right that this blog starts where it all began for me: HackTheBox. In this mini-series I will be going through some of the starting point challenges from a fresh account. In the future I plan to publish writeups about retired machines, challenges, and sherlocks.&lt;/p&gt;
&lt;p&gt;In accordance with HTB TOS, I will not be publishing writeups of active challenges until after they&amp;rsquo;re retired.&lt;/p&gt;
&lt;p&gt;With that out of the way, let&amp;rsquo;s begin!&lt;/p&gt;
&lt;p&gt;Our first box: &lt;strong&gt;Appointment&lt;/strong&gt;&lt;/p&gt;
&lt;h1 id=&#34;overview&#34;&gt;Overview&lt;/h1&gt;
&lt;p&gt;The Appointment box is a mockup of a common web-application. Let&amp;rsquo;s visit the given IP &lt;code&gt;10.129.197.202&lt;/code&gt; and see what&amp;rsquo;s there.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/3ec5c43f1d.png&#34; width=&#34;600&#34; height=&#34;439&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Looks like we have a basic login page (basic is a strong word, a full login card with a color gradient, onHover button animations, and a muted background image? noice). The tags for the box include &lt;code&gt;PHP&lt;/code&gt; and &lt;code&gt;SQL Injection&lt;/code&gt;, so I think it&amp;rsquo;s safe to assume that there&amp;rsquo;s a SQL injection vulnerability in this login.&lt;/p&gt;
&lt;p&gt;So&amp;hellip; how does SQL Injection work?&lt;/p&gt;
&lt;h1 id=&#34;sql-injection&#34;&gt;SQL Injection&lt;/h1&gt;
&lt;p&gt;This vulnerability works by exploiting unsanitized inputs to change the intended function of a SQL query running in the background.&lt;/p&gt;
&lt;p&gt;For example, suppose the login form above uses the following SQL query to return the user account for a given username and password combination from the database.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/69d6cd02fa.png&#34; width=&#34;600&#34; height=&#34;194&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Where the username entered is &lt;code&gt;user&lt;/code&gt; and the password entered is &lt;code&gt;pass&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If there is improper sanitization of the inputs, an attacker can enter &lt;code&gt;user&lt;/code&gt; for the username and &lt;code&gt;&#39; OR &#39;1&#39;=&#39;1&lt;/code&gt; as the password.&lt;/p&gt;
&lt;p&gt;Which leads to the following query being executed instead:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/92ebe2d1ff.png&#34; width=&#34;600&#34; height=&#34;178&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Resulting in a successful login for whichever user is passed to the username field (even admin).&lt;/p&gt;
&lt;p&gt;Another variation of SQL injection uses the semicolon &lt;code&gt;;&lt;/code&gt;, allowing attackers to batch multiple statements into a single query.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/42c0c0734a.png&#34; width=&#34;600&#34; height=&#34;185&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This example specifically would allow anyone to delete all the users from your database on a whim - no bueno. They can technically add any additional SQL statement they want, so the extent of the damage is up to their imagination.&lt;/p&gt;
&lt;h2 id=&#34;ahh-scary-vuln-how-fix&#34;&gt;AHH SCARY VULN HOW FIX?&lt;/h2&gt;
&lt;p&gt;This kind of vulnerability can be mitigated with input sanitization on the username and password fields that &lt;strong&gt;only allows&lt;/strong&gt; alphanumeric characters, rejecting dangerous characters like quotation marks and semicolons. However, the preferred way of securing against this kind of vulnerability is by using &lt;strong&gt;prepared statements&lt;/strong&gt;, aka parameterized queries.&lt;/p&gt;
&lt;p&gt;Prepared statements are a feature in database management systems that allow you to pre-compile a SQL query with placeholders for data values that are supplied later. The separation of SQL code and data offers both security and performance benefits.&lt;/p&gt;
&lt;p&gt;The actual syntax varies depending on the system; in PHP with MySQLi you would use the following:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/643c541b46.png&#34; width=&#34;600&#34; height=&#34;183&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The input values from the login fields have been replaced by question marks in the original SQL command. Then using &lt;code&gt;bind_param&lt;/code&gt; we set the values of those question marks to be the inputs from the form fields.&lt;/p&gt;
&lt;p&gt;The advantage here is that any user-supplied value in the form fields (even if it contains SQL syntax or malicious code) is strictly treated as data. The database will never execute user input as part of the original SQL command. For example, if an attacker were to try to inject &lt;code&gt;user&#39;; DROP TABLE Users; --&lt;/code&gt; like in the previous example, the database simply searches for that exact string instead of executing it as a command.&lt;/p&gt;
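&lt;p&gt;The same idea in Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; (a sketch for illustration - the box itself is PHP/MySQL):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(&#39;:memory:&#39;)
conn.execute(&#39;CREATE TABLE Users (username TEXT, password TEXT)&#39;)
conn.execute(&#39;INSERT INTO Users VALUES (?, ?)&#39;, (&#39;admin&#39;, &#39;secret&#39;))

def login(conn, username, password):
    # the ? placeholders keep user input strictly as data, never as SQL
    row = conn.execute(
        &#39;SELECT * FROM Users WHERE username = ? AND password = ?&#39;,
        (username, password),
    ).fetchone()
    return row is not None

print(login(conn, &#39;admin&#39;, &#39;secret&#39;))       # True  - legitimate login
print(login(conn, &#39;admin&#39;, &amp;quot;&#39; OR &#39;1&#39;=&#39;1&amp;quot;))  # False - the injection is just a weird password
&lt;/code&gt;&lt;/pre&gt;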
&lt;p&gt;With this in mind, let&amp;rsquo;s get back to the box.&lt;/p&gt;
&lt;h1 id=&#34;enumeration&#34;&gt;Enumeration&lt;/h1&gt;
&lt;p&gt;It&amp;rsquo;s always good to start with an nmap scan to find the open ports on the target machine.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/ab17392f75.png&#34; width=&#34;600&#34; height=&#34;360&#34; alt=&#34;&#34;&gt;
&lt;p&gt;We&amp;rsquo;re using the &lt;code&gt;-sC&lt;/code&gt; and &lt;code&gt;-sV&lt;/code&gt; flags here to perform a script scan and enable version detection respectively. It&amp;rsquo;s good to note that the &lt;code&gt;-sC&lt;/code&gt; and &lt;code&gt;-sV&lt;/code&gt; flags are considered noisy flags, meaning there is a good chance that any monitoring on the network would catch this.&lt;/p&gt;
&lt;p&gt;From the scan we can see an Apache instance running on port 80, this is responsible for serving the web-application.&lt;/p&gt;
&lt;h1 id=&#34;exploitation&#34;&gt;Exploitation&lt;/h1&gt;
&lt;p&gt;Back to the web interface, let&amp;rsquo;s try using what we know about SQL injection to gain access to the admin account.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/df27fe81ba.png&#34; width=&#34;600&#34; height=&#34;440&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Here we are entering &lt;code&gt;admin&lt;/code&gt; for the username and &lt;code&gt;&#39; OR &#39;1&#39;=&#39;1&lt;/code&gt; for the password.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/0790ecb272.png&#34; width=&#34;600&#34; height=&#34;439&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Success! We got the flag!&lt;/p&gt;
&lt;p&gt;Just for fun, let&amp;rsquo;s try a different injection string. This time we will enter the username &lt;code&gt;admin&#39;#&lt;/code&gt; and any password we want.&lt;/p&gt;
&lt;p&gt;In MySQL, &lt;code&gt;#&lt;/code&gt; starts a comment. So, in theory, we should be able to enter &lt;code&gt;admin&#39;#&lt;/code&gt; for the username and circumvent the need for a password by commenting out the rest of the query.&lt;/p&gt;
&lt;p&gt;Something like this:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/45ee694533.png&#34; width=&#34;600&#34; height=&#34;165&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Let&amp;rsquo;s try it&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/10ffb49368.png&#34; width=&#34;600&#34; height=&#34;439&#34; alt=&#34;&#34;&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/232453/2025/5c9eb768ad.png&#34; width=&#34;600&#34; height=&#34;439&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Success!&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
