Deep Research Brings Deeper Harm

Deep Research (DR) agents extend large language models with planning, browsing, and drafting loops so that the system can autonomously retrieve evidence and write research-style reports. While these capabilities enable complex analysis, they also reopen safety risks that alignment techniques mitigate for standalone models. Recursive planning, web access, and structured synthesis give adversaries more surface to pursue harmful objectives.

We examine these risks through WebThinker driven by QwQ-32B. A misuse query that triggers a refusal in the base model instead produces a detailed step-by-step report once routed through the DR agent. The agent decomposes the task, gathers open-source material, and responds in an academic tone, illustrating that current jailbreak defences do not transfer to research agents.

To analyse this behaviour we develop two agent-specific jailbreaks. Plan Injection modifies the internally generated research plan, removing safety caveats and inserting explicit retrieval targets. Intent Hijack rewrites forbidden prompts as instructional or academic scenarios so that the agent treats them as legitimate tasks. We also introduce DeepREJECT, a metric that evaluates whether the final report answers the malicious goal and conveys high-value knowledge.

Experiments across six backbones on StrongREJECT and SciSafeEval show that DR agents easily bypass refusals, and the proposed attacks further increase both the number of generated reports and their harmfulness. These observations highlight the need for alignment approaches that operate over the entire DR workflow rather than only the underlying language model.

Contributions.

We design two jailbreak strategies tailored to DR agents—Plan Injection and Intent Hijack—that exploit planning and research-oriented prompting.
We evaluate six DR configurations on StrongREJECT and SciSafeEval, releasing code and data after peer review to support reproducible safety assessments.
We propose DeepREJECT to quantify harmful knowledge in long-form agent outputs and show that DR agents deliver more dangerous content than their base models.

2. Method

WebThinker follows a “think-search-draft” pipeline: the model produces a search plan, executes web queries step by step, and then edits the final report. Our attacks interfere with the stages that determine how information is gathered and summarised so that the evaluation reflects realistic DR usage.

Plan Injection

DR agents rely on the initial plan to guide retrieval. We collect the auto-generated plan and replace it with an alternative drafted by another LLM. The new plan removes safety disclaimers and adds explicit search targets such as chemical ratios, acquisition routes, or regulatory references. Once injected, the agent executes the modified plan and produces reports with higher information density and closer alignment to the harmful objective.

Intent Hijack

Many DR agents are tuned for academic or professional queries. For each forbidden prompt we craft an “educational” rewrite that preserves the underlying intent but presents the task as part of training for law-enforcement officers, security analysts, or university students. This framing reduces refusal behaviour while keeping the harmful requirement intact, allowing us to study how easily the agent is convinced to co-operate.

DeepREJECT Metric

Long-form reports require metrics beyond refusal words. DeepREJECT scores each output along three dimensions: whether the agent produced a report, whether the content fulfils the malicious objective, and whether high-value knowledge is provided. We further weight the result by the intrinsic risk of the question so that high-stakes prompts yield higher scores when the agent delivers actionable detail.

3. Experimental Results

We deploy WebThinker with six backbone models covering reasoning-oriented LRMs, large general-purpose models, and RL-tuned agents. Each configuration answers 313 StrongREJECT prompts and 789 high-risk SciSafeEval queries. For every run we measure the number of generated reports, the LLM-as-Judge score, and DeepREJECT.

Baseline agents already bypass refusals. Standalone models refuse almost all forbidden prompts, whereas the DR agents generate hundreds of reports with DeepREJECT above 2.0, even without adversarial intervention. Simply enabling the agent scaffold defeats the refusal behaviour seen at the LLM level.

Plan Injection increases harmful detail. Injected plans raise DeepREJECT by roughly 0.2–0.3 and produce reports with precise operational instructions such as mixing ratios and procurement paths. Report counts grow modestly, but the added specificity shows that plan tampering directly amplifies the content risk.

Intent Hijack maximises report coverage. Rewriting queries as academic or professional assignments reduces refusals to near zero. DeepREJECT may drop slightly because the framing looks legitimate, yet LLM-Judge and manual checks confirm that the resulting reports still address the harmful goals.

Biosecurity benchmarks show the same pattern. On SciSafeEval, the base LLMs almost never respond, but DR agents produce 500–700 medical reports. Plan Injection pushes DeepREJECT to 2.35 on QwQ-32B, while Intent Hijack yields up to 690 reports and the highest judge scores, underscoring the fragility of biosecurity safeguards.

DeepREJECT captures the increased risk. StrongREJECT only reflects refusal behaviour and therefore shows little difference between LLMs and DR agents. DeepREJECT, however, aligns with the qualitative findings: DR agents produce more structured, actionable knowledge, and the metric quantifies that jump in risk.

4. Insights

The study reveals broader lessons about deploying DR agents safely.

Deep Research Brings Deeper Harm

Even without explicit jailbreak prompts, DR agents combine planning, search, and drafting to produce detailed responses. Alignment that only targets the base language model is therefore insufficient once the model is embedded in a research workflow.

Intent Hijack Is the Stealthiest Failure Mode

Academic reframing consistently bypasses safety prompts: the agent continues to reason about the malicious task under the guise of training or policy discussion. Because the agent justifies each step internally, the attack persists across the entire session.

Biosecurity Queries Highlight Dual-Use Danger

In biosecurity settings, professional phrasing rarely triggers refusals, yet the agent still retrieves dosage guidance and genetic modification pathways. These dual-use cases demand evaluation frameworks that consider both scientific accuracy and misuse potential.

Defense Requires Stage-Wise Guardrails

Mitigation should cover the entire agent loop: refuse before planning, validate or regenerate suspicious plans, filter retrieved content, and review long-form reports prior to release. Without these checkpoints, tailored jailbreaks remain low-effort.

Deep Research Brings Deeper Harm

Authors

Affiliations

Correspondence

1. Introduction

2. Method

Plan Injection

Intent Hijack

DeepREJECT Metric

3. Experimental Results

4. Insights

Deep Research Brings Deeper Harm

Intent Hijack Is the Stealthiest Failure Mode

Biosecurity Queries Highlight Dual-Use Danger

Defense Requires Stage-Wise Guardrails

BibTeX

Acknowledgement