* feat(memory): add memory.token_counting config to avoid tiktoken network dependency (#3429)
Add a `memory.token_counting` option (`tiktoken` | `char`) so deployments in
network-restricted environments can opt out of tiktoken entirely. In `char`
mode the memory-injection budget uses a network-free character-based estimate
and never triggers the BPE download from openaipublic.blob.core.windows.net,
which could otherwise block for tens of minutes (see #3402).
Also harden the default `tiktoken` path:
- cache an in-flight LOADING sentinel so concurrent callers fall back
immediately instead of spawning more blocking get_encoding threads when the
first load is still running (e.g. under the 5s startup warm-up timeout);
- cache failures with a timestamp and retry after a cooldown so a transient
network outage self-heals back to accurate counting without a restart;
- skip startup warm-up entirely in char mode.
The new config is surfaced via the memory config API and config.example.yaml
(config_version bumped). Default remains `tiktoken`, so existing deployments
are unaffected.
* fix(memory): use CJK-aware char token estimate and address review feedback
- Replace the flat len(text)//4 fallback with a CJK-aware estimate so
Chinese/Japanese/Korean memory content does not over-fill the injection budget
- Document the internal tiktoken retry cooldown and char-mode escape hatch
- Sync CLAUDE.md / config.example.yaml / MEMORY_IMPROVEMENTS.md wording
- Fix MemoryConfigResponse mocks/assertions and add CJK estimate tests
`client.py` imported the private `_build_middlewares` from `agent.py` across a
module boundary and called it as public API. Because the `_` name signals
"module-private, no external callers", any future rename or signature change
silently breaks the embedded `DeerFlowClient` path — and the test suite even
monkeypatched `deerflow.client._build_middlewares`, baking the leak in.
`DeerFlowClient` is a lead-agent variant that genuinely needs the lead agent's
full middleware composition, so make the dependency honest: promote the helper
to a documented public entry point `build_middlewares` and update every in-repo
caller. Found during #3341 review; #3341 already removed one such leak
(`_assemble_deferred` -> public `assemble_deferred_tools`) and left this one out
of scope on purpose.
- agent.py: rename def + both internal call sites; expand the docstring into a
public-entry-point contract and document the previously-undocumented
model_name / app_config / deferred_setup params
- client.py: import + call site now use the public name (removes the last
cross-module private import)
- scripts/tool-error-degradation-detection.sh: update its import + call site
- tests (5 files): update monkeypatch/patch targets and direct calls
- docs (backend/CLAUDE.md, plan_mode_usage.md, middlewares.mdx): sync the live
references that describe the symbol as current API
Pure mechanical rename, no behavior change. Historical design docs (rfc,
superpowers spec) intentionally keep the old name as point-in-time records.
Closes#3431
* feat(subagents): extend deferred MCP tool loading to subagents (#3341)
Subagents now reuse the lead agent's deferred-tool path: when
tool_search.enabled, MCP tool schemas are withheld from the model and
surfaced by name in <available-deferred-tools>, fetched on demand via the
generated tool_search helper. DeferredToolFilterMiddleware deterministically
rewrites request.tools to hide the deferred schemas (the prompt section is
discovery only, not enforcement).
Consolidates the assembly into deerflow.tools.builtins.tool_search, now the
single home for both assemble_deferred_tools (centralized fail-closed guard,
replacing the lead-only private _assemble_deferred) and the relocated
get_deferred_tools_prompt_section. Shared by every build path: lead agent,
embedded client, and subagent executor.
tool_search is appended after the subagent's name-level tool policy and is
treated as infrastructure: its catalog is built from the already
policy-filtered list, so it can never surface a tool the policy denied.
Follow-up to #3370. Fixes#3341.
* test(subagents): assert the real middleware builder emits a working deferred filter (#3341)
The existing recipe test hand-constructs DeferredToolFilterMiddleware, so it
cannot catch a regression in how build_subagent_runtime_middlewares (the call
executor._create_agent actually makes) wires the deferred setup into the
filter. Add a test that sources the filter from the real builder given a real
setup and runs it through a graph: a wrong catalog hash would silently stop
promotion, a dropped filter would stop hiding — both now caught.
Running the full real middleware stack is intentionally avoided (the other
runtime middlewares need sandbox/thread infra to execute, which would make the
test flaky); their attachment + ordering before Safety stays locked in
test_tool_error_handling_middleware.py.
* test(subagents): keep executor tests config-free in CI
* chore: trigger ci
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs(spec): MiniMax integration for generation skills + new music skill
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(plan): MiniMax generation providers implementation plan
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(skills): add importlib loader + FakeResp for skill tests
* test(skills): register loaded module in sys.modules; raise requests.HTTPError in FakeResp
* feat(image-generation): add MiniMax provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(image-generation): guard unknown provider, derive ref MIME, strengthen tests
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(video-generation): add MiniMax provider with async poll/download
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(video-generation): surface base_resp errors while polling; add timeout test
* feat(podcast-generation): add MiniMax t2a_v2 provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(podcast-generation): restore TTS credential guard; add volcengine + voice tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(music-generation): new MiniMax music skill via skill-creator
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(music-generation): treat empty lyrics as absent; test no-audio-data path
* refactor(skills): add request timeouts to MiniMax network calls
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(models): strip inconsistent user-message names for MiniMax chat
DeerFlow middlewares tag user messages with provenance names (user-input, summary, loop_warning); langchain serializes them into the OpenAI-compatible payload and MiniMax rejects mismatched user-message names with "user name must be consistent (2013)". PatchedChatMiniMax now drops the per-message name from user-role messages. Point the config.example MiniMax models at PatchedChatMiniMax so they also get reasoning_content mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(image-generation): MiniMax sends JSON prompt field, guard 1500-char limit
MiniMax image-01 takes one text string capped at 1500 chars, but the skill was sending the whole structured JSON. The MiniMax provider now extracts the JSON `prompt` field (relying on prompt_optimizer to expand it) and fails fast with a clear error before calling the API when that field exceeds 1500 chars. Authoring stays provider-agnostic; Gemini still receives the full JSON.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(podcast-generation): per-provider TTS concurrency and retry/backoff
Each TTS provider owns its concurrency internally — MiniMax runs single-threaded to reduce rate-limit failures, Volcengine keeps 4 workers — with automatic retry and backoff on transient HTTP and base_resp errors. No caller-facing concurrency knob.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(skills): address Copilot review comments on generation skills
- video: add raise_for_status + timeout to the Gemini download/POST/poll calls so non-2xx responses surface as clear HTTP errors instead of JSON/KeyError or hangs
- video: check the task Fail status before the generic base_resp check so the failure keeps its task_id context
- video/image: create the output file parent directory before writing (matching music-generation) so nested output paths do not raise FileNotFoundError
- music: require a non-empty prompt and fail fast with ValueError instead of sending an empty prompt to the API
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(scripts): reclaim dev ports across worktrees in make stop/dev
All deer-flow worktrees (main checkout + linked worktrees) hardcode the same dev ports (8001/3000/2026), so a service started from any worktree must be reclaimable from another. stop_all now resolves the set of worktree roots (DEERFLOW_ROOTS) and treats a process as deer-flow-owned when its open files live under any of them. It also force-kills survivors on 2026 alongside 8001/3000, fixing `make dev` aborting on the nginx port preflight when a prior nginx lingered on 2026.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(view-image): hide the injected image-context message from the UI
ViewImageMiddleware injects a HumanMessage (text + base64 images) so the vision model can see viewed images, but it was the only internal injector that set neither hide_from_ui nor a hidden name, so it leaked into the chat UI (and IM channels) as a user bubble reading "Here are the images you've viewed:". Mark it with additional_kwargs={"hide_from_ui": True}, matching todo/dynamic_context injections, which the frontend isHiddenFromUIMessage and the channel sender already honor. The model still receives the full content.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(minimax): mark M2.7 models as text-only (no vision)
MiniMax M2.7 / M2.7-highspeed do not support vision; only M3 does. The
provider config asserted vision support for M2.7 in four places.
- config.example.yaml: 4 M2.7 entries -> supports_vision: false
- backend/docs/CONFIGURATION.md: M2.7 + highspeed -> supports_vision: false
- wizard: add LLMProvider.model_vision_overrides + extra_config_for() so
selecting an M2.7 model writes supports_vision: false while M3 (default)
keeps vision; wire it through setup_wizard.py
- tests: M2.7-highspeed fixture -> supports_vision=False; add
test_minimax_vision_is_per_model
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(middleware): externalize oversized tool output into sandbox for non-mounted sandboxes
ToolOutputBudgetMiddleware persisted oversized tool results to the host
filesystem and returned a /mnt/user-data/outputs virtual path. For sandboxes
that do not use thread-data mounts (e.g. remote AIO sandbox), that virtual
path does not exist inside the sandbox, so the model's read_file tool could
not read it back and reported 'file not found'.
Branch on SandboxProvider.uses_thread_data_mounts:
- Mounted sandboxes (local Docker, AIO + LocalContainerBackend) keep the
original host-disk path; the host outputs dir is bind-mounted to the same
virtual path inside the sandbox, so behavior is unchanged.
- Non-mounted (remote) sandboxes externalize into the sandbox itself via
execute_command('mkdir -p ...') + write_file + 'test -s' validation. The
validation step is required because AIO sandbox execute_command returns
'Error: ...' as a string on failure instead of raising, so a silent mkdir
failure would otherwise leak through.
Any failure (rejected subdir, mkdir/write/validate error) falls back to the
existing inline head+tail truncation, so an unreadable path is never returned
to the model.
The sandbox resolver reads the sandbox_id that SandboxMiddleware already
writes into runtime.state['sandbox']; it never calls provider.acquire(),
keeping the tool-call hot path free of blocking I/O. Tools that do not use a
sandbox (web_search, MCP, ...) resolve to None and fall through to inline
truncation, which is the safe behavior for them.
Fixes#3416
* fix(middleware): address Copilot review feedback on sandbox externalization
- Make get_sandbox_provider() lookup best-effort in _budget_content: only
query when outputs_path or sandbox is available, and fall back to inline
truncation if provider initialization raises rather than propagating
the error. A resolved sandbox instance is sufficient on its own to take
the non-mounted externalization branch.
- Strict-match the sandbox post-write validation echo
(check.strip() == 'OK') to avoid false positives if execute_command
ever surfaces unrelated stdout/stderr containing 'OK' as a substring.
Refs: #3417
* test: fix flaky tests relying on /nonexistent/... path under container root
Two tests in this module (test_returns_none_on_invalid_path and
test_fallback_when_disk_write_fails) used paths like
'/nonexistent/impossible/path' to trigger _externalize's OSError
fallback. These paths are creatable when the test process runs as root
inside the CI container: os.makedirs(..., exist_ok=True) successfully
creates the entire chain under /, so the OSError branch is never hit
and the tests fail. Reproducible on main independently of this PR.
Switch to '/dev/null/cannot-mkdir-here'. /dev/null is a character
device on both Linux and macOS, so os.makedirs always fails with
NotADirectoryError regardless of privileges, reliably exercising the
OSError fallback.
* fix(tool-output-budget): only consult sandbox provider when a sandbox is resolved
The previous revision called get_sandbox_provider() whenever externalization
was triggered, including on the legacy host-disk path. Environments without
a configured sandbox -- in particular CI runners without a config.yaml --
would raise FileNotFoundError there, get caught, and silently fall back to
inline truncation. That defeated the host-disk externalization path that
predates this PR and was the root cause of the regressing legacy tests.
Restructure the branching so the provider is only consulted when a sandbox
has actually been resolved for the current tool call:
- sandbox resolved + provider.uses_thread_data_mounts: host-disk write
(bind-mounted into the sandbox, equivalent to a sandbox-side write).
- sandbox resolved + non-mounted provider: sandbox write (#3416).
- no sandbox + outputs_path: host-disk write
(legacy / non-sandbox tools, no provider call at all).
- otherwise: inline fallback.
No test changes; the legacy externalization tests are provider-agnostic by
construction and now pass without monkeypatching.
Refs: #3416
* test(tool-output-budget): assert legacy path does not call sandbox provider
Lock in the contract introduced by d6e2d25b: when no sandbox is resolved
for a tool call, _budget_content must externalize to the host outputs
directory without consulting get_sandbox_provider(). Regressing this would
re-break legacy / non-sandbox tools in environments without a configured
sandbox (e.g. CI without config.yaml), which is the failure mode #3416's
fix avoids.
The test injects a get_sandbox_provider that raises on call, so any
future refactor that moves the provider lookup out of the sandbox-only
branch will fail loudly.
Refs: #3416
* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)
DynamicContextMiddleware.abefore_agent() called _inject() synchronously
on the asyncio event loop. The first time memory is injected (second
request), _inject() → format_memory_for_injection() → _count_tokens()
→ tiktoken.get_encoding("cl100k_base") needs to download the BPE data
from openaipublic.blob.core.windows.net. In network-restricted
environments this download blocks until the OS TCP timeout (~26 min),
starving ALL concurrent handlers including /api/v1/auth/me.
Fix:
- abefore_agent now uses asyncio.to_thread(self._inject, state) so
file I/O and tiktoken never block the event loop.
- Extract _get_tiktoken_encoding() with a module-level cache so
tiktoken.get_encoding() is called at most once per encoding name.
- Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
the cache via asyncio.to_thread so the first request never triggers a
cold download.
- _count_tokens falls back to len(text) // 4 on any encoding failure.
Tests:
- tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
hit/miss, fallback paths, warm-up helper.
- tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
Blockbuster gate verifies abefore_agent does not block the event
loop; async/sync parity check.
Fixes#3402
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint error
* fix(memory): use future annotations to avoid NameError when tiktoken is absent
Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime. Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.
Addresses Copilot review comment on PR #3411.
* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests
Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap. If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min). The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.
Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.
Addresses review feedback from xg-gh-25 on PR #3411.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(subagent): structured subagent_status field over text parsing
Closes#3146.
## Why
The frontend used to derive subtask card state by string-matching the
leading text of the `task` tool's result. That contract surface was
fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases
where new backend wording (`Task cancelled by user.`,
`Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware`
exception wrappers) silently broke the card lifecycle. The frontend
fallback kept growing more prefixes; any future rewording would break
it again.
## Design
1. **Backend → frontend contract**: `ToolMessage.additional_kwargs`
carries `subagent_status` (one of `completed | failed | cancelled |
timed_out | polling_timed_out`) and an optional `subagent_error`
blob. The frontend prefers it over parsing `content`.
2. **Centralised stamping, not 8 sprinkled stamps**: rather than have
each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:`
paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware`
stamps the field after every task-tool call. Adding a new return
path in `task_tool.py` cannot now skip the stamp.
3. **Cross-language contract fixture**: the prefix→status mapping is
the one piece both sides must agree on. The shared fixture at
`contracts/subagent_status_contract.json` lists every backend return
string, the expected status, and what the error substring should
contain. Backend test (`backend/tests/test_subagent_status_contract.py`)
and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`)
both load that fixture and assert the same cases. A wording drift on
either side fails the matching language's test.
4. **Round-trip serialisation pinned**: the round-trip test asserts
`ToolMessage.model_dump_json()` → `model_validate_json()` preserves
`additional_kwargs.subagent_status`. Catches the case where a future
LangChain or Pydantic upgrade silently strips unknown kwargs.
5. **Frontend status collapse documented**: the backend has five status
values, the frontend card has three (`completed | failed |
in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all
collapse to `failed` with the original status preserved in `error`.
`parseSubtaskResult` returns `in_progress` for unknown values so a
backend that ships a new enum variant before the frontend upgrades
degrades to the legacy prefix fallback instead of getting pinned.
## Changes
Backend:
- `deerflow.subagents.status_contract` — new module exporting
`SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`,
`SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and
`make_subagent_additional_kwargs(status, error)`.
- `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status`
helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call`
stamp on the success path; `_build_error_message` stamps on the
wrapper path (carrying `ExcClass: detail` into `subagent_error`).
Non-task tools are untouched.
- New tests: `test_subagent_status_contract.py` (19 cases from the
shared fixture + status-enum / blank-error / unknown-status
rejection) and `test_tool_error_handling_subagent_stamp.py`
(middleware integration: terminal-content stamps, non-terminal
doesn't, non-task tools untouched, async path mirrors sync,
existing additional_kwargs survive, JSON round-trip preserved).
Frontend:
- `parseSubtaskResult(text, additionalKwargs?)` — prefers the
structured stamp; falls back to the legacy prefix matcher for
historical threads / unknown future status values.
- `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse.
- `message-list.tsx` passes `message.additional_kwargs` through.
- `subtask-result.test.ts` adds a structured-status block + a
fixture-driven contract block; legacy prefix tests stay green for
the fallback path.
Contract:
- `contracts/subagent_status_contract.json` — single source of truth
both languages load. Whitespace variants, varied N for polling
timeouts, the 3 pre-execution `Error:` returns task_tool produces,
and the middleware wrapper shape are all in there.
## Test plan
- `make lint` clean (backend + frontend).
- `pytest tests/test_subagent_status_contract.py
tests/test_tool_error_handling_subagent_stamp.py` → 37 passed.
- `pnpm test --run` → 103 passed (was 76, +27 new).
## Migration / fallback retirement
The text-prefix fallback stays in place until backend telemetry shows
the frontend never hits it for newly produced messages. At that point
a follow-up PR can drop the prefix branches and keep only the
structured-status branch.
Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior prefix-only fix), #3146 (this issue).
* fix(subtask): back-fill result/error from text when structured status present
Three follow-ups on the PR #3154 review:
1. `readStructuredStatus` no longer short-circuits the prefix parse.
The backend currently stamps only the `subagent_status` enum value;
the human-facing `result` body and wrapped-error message still live
in `ToolMessage.content`. Dropping the text parse meant successful
tasks rendered empty completed pills and wrapped failures lost their
diagnostic. Now both shapes get composed: structured status wins,
`result`/`error` come from text when both sides agree, and a lying
success body under a `failed` stamp is dropped instead of leaking.
2. Replace the ESM-incompatible `__dirname` fixture lookup in
subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`.
The frontend package is `"type": "module"`, so the previous path
would have thrown at runtime if anything ever changed under the
contract directory.
3. Drop the `$schema` reference from contracts/subagent_status_contract.json
pointing at a file that doesn't exist in the tree.
Three new tests cover the structured + text composition: completed
back-fills the success body, failed back-fills the wrapper text, and
unrecognised content under a `failed` stamp stays empty rather than
echoing noise.
* fix(summarization): tag summary LLM calls nostream to stop phantom stream messages (#2503)
The SummarizationMiddleware runs its summary LLM call inside a before_model
hook. Without a nostream tag the summary tokens were captured by LangGraph's
messages-tuple stream callback and broadcast to the frontend as a phantom AI
message.
Generate a dedicated summary model copy tagged with "nostream" (merged on top
of any existing tags such as "middleware:summarize" so RunJournal attribution
is preserved) and override _create_summary / _acreate_summary to invoke it
directly. This avoids temporarily swapping the shared self.model, which would
otherwise leak the RunnableBinding across concurrent runs and break parent
logic that inspects the raw model (profile / _get_ls_params).
Add regression tests covering nostream tagging, concurrent-run isolation, raw
model preservation, and existing-tag merge.
* fix(summarization): address nostream review feedback
* fix(#3189): prevent write_file streaming timeout on long reports
Adds a layered defense against StreamChunkTimeoutError caused by oversized
single-shot write_file tool calls:
- factory: default stream_chunk_timeout to 240s for OpenAI-compatible
clients (overridable via ModelConfig.stream_chunk_timeout in config.yaml)
- sandbox/tools: server-side 80 KB length guard on non-append write_file
calls (configurable via DEERFLOW_WRITE_FILE_MAX_BYTES env var, 0 disables);
rejects oversized payloads with a structured error pointing the model at
str_replace or append=True
- middleware: classify StreamChunkTimeoutError as transient but cap retries
at 1 via per-exception _RETRY_BUDGET_OVERRIDES (same-payload retry on a
chunk-gap timeout buffers the same way upstream; full 3-attempt loop
would stack 6-12 min of dead air)
- middleware: surface an actionable user-facing message for stream-drop
exceptions instead of leaking the raw langchain stack
- prompts: add a routing-style File Editing Workflow hint to both lead_agent
and general_purpose subagent prompts, pointing the model at str_replace
for incremental edits (mirrors Claude Code's Edit / Codex's apply_patch)
- tests: behavioural coverage for size guard, retry budget override,
stream-drop user message, factory default injection
Refs #3189
* fix(#3189): drop stream_chunk_timeout for non-OpenAI providers
Address CR feedback on PR #3195:
- factory: pop `stream_chunk_timeout` from kwargs for any model_use_path other than `langchain_openai:ChatOpenAI` instead of returning early. `ModelConfig.stream_chunk_timeout` is part of the shared schema, so a user-supplied value on a non-OpenAI provider would otherwise be forwarded to its constructor and raise `TypeError: unexpected keyword argument`.
- factory: rewrite docstring to describe the actual `exclude_none=True` behaviour (explicit null is excluded and falls back to the default) instead of the misleading "None falling out via exclude_none=True keeps its value".
- tests: add regression coverage asserting the kwarg is stripped before reaching a non-OpenAI provider's constructor.
Refs: bytedance#3189
* fix(#3189): restrict stream-drop user copy to StreamChunkTimeoutError only
Per CR on #3195: narrow _STREAM_DROP_EXCEPTIONS to StreamChunkTimeoutError. Generic httpx RemoteProtocolError / ReadError fall back to the standard 'temporarily unavailable' copy, since they routinely fire on transient network blips where the 'split the output' guidance is misleading. Retry/backoff classification is unchanged — both remain transient/retriable. Tests updated to reflect new copy, plus a symmetric regression test for ReadError.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Follow-up to #3342 (deferred MCP tool loading). Maintainability cleanup plus
hardening of malformed/empty tool_search queries; no change to the deferral
mechanism or search ranking.
- Add deerflow/tools/mcp_metadata.py as the single source of truth for the
"deerflow_mcp" tag (MCP_TOOL_METADATA_KEY + tag_mcp_tool + public
is_mcp_tool). Removes the duplicated magic string and the private,
cross-module _is_mcp_tool import.
- tool_search.search: never raise on model-generated input. Extract
_compile_catalog_regex (shared compile-with-literal-fallback); return empty
for empty/whitespace queries and a bare "+" instead of matching everything
or raising IndexError.
- DeferredToolSetup: document the empty-vs-populated invariant.
- build_deferred_tool_setup: comment the two distinct empty-return branches.
- _assemble_deferred: add return type, rename local to deferred_setup, build
the final list with an explicit append.
- Tests: use tag_mcp_tool instead of per-file tag helpers; cover empty and
bare-"+" queries.
* feat(tool-search): add hash-scoped promoted state to ThreadState
* feat(tool-search): add immutable DeferredToolCatalog with stable hash
* feat(tool-search): add build_deferred_tool_setup + Command-writing tool_search
* refactor(tool-search): replace deferred-tool ContextVar with closures + graph state (#3272)
Build the deferred catalog + tool_search tool per agent from the policy-filtered
tool list (after skill allowed-tools), pass deferred_names + catalog_hash
explicitly to DeferredToolFilterMiddleware and the prompt, and record promotions
in ThreadState.promoted (scoped by catalog_hash) via a Command-returning
tool_search. Removes DeferredToolRegistry and the _registry_var ContextVar so
deferral no longer depends on build/execute sharing an async context. MCP tools
are tagged with metadata[deerflow_mcp]; client.py assembles deferral the same way.
Catalog is built AFTER tool-policy filtering (no policy-excluded tool can leak via
tool_search) and assembly is fail-closed. Migrate tests off the deleted registry
APIs; delete the obsolete ContextVar-based #2884 regression (re-covered by
state-based tests in a follow-up).
* test(tool-search): lock tool_search promotion into next model turn via graph state
* test(tool-search): cross-context, policy-leak, fail-closed, #2884 isolation regressions
* test(tool-search): align real-LLM e2e with closure-based deferred setup
* docs: update DeferredToolFilterMiddleware description for closure+state design
* style(tests): drop unused import in test_deferred_setup (ruff)
* test(tool-search): harden merge_promoted + replace tautological catalog test
From independent code review:
- merge_promoted: use existing.get("catalog_hash") so a forward-incompatible
or externally-injected persisted promoted dict triggers a replace instead of
a KeyError crash; add regression test for the malformed-existing case.
- test_deferred_catalog: replace the `== [] or True` tautology (a test that
could never fail) with a deterministic invalid-regex->literal-fallback check
(positive match on calc + negative empty match).
- DeferredToolCatalog: comment why frozen-without-slots is required for the
cached_property hash/names fields (adding slots=True would break them).
* fix(tool-search): read tool_search.enabled from self._app_config in client
DeerFlowClient._ensure_agent called get_app_config() directly to read
tool_search.enabled, but the client already resolves and stores its config as
self._app_config at construction (and uses it everywhere else). The bare call
re-resolves config from disk at agent-build time, which raises FileNotFoundError
in environments without a config.yaml (CI) — test_client.py's fixture only
patches get_app_config during __init__, so the later call hit the real loader.
Use self._app_config, matching the rest of the client.
* test(tool-search): lock tool_search post-policy append ordering
tool_search is appended after skill-allowlist filtering, so the allowlist
can no longer deny it by name. Lock the intended contract: it only appears
when allowed MCP tools survive the filter, and its catalog (derived from the
already policy-filtered list) can never expose a denied tool. Addresses the
ordering observation from the Copilot review on #3342.
UploadsMiddleware defines only the sync `before_agent` hook. LangChain wires a
sync-only hook as `RunnableCallable(before_agent, None)`, and LangGraph's
`ainvoke` runs it directly on the event loop when `afunc is None` — so the
per-message uploads-directory scan (`exists`/`iterdir`/`stat` plus reading
sibling `.md` outlines) blocks the asyncio event loop on every message that has
an uploads directory.
Add `abefore_agent` that offloads the scan to a worker thread via
`run_in_executor`; it copies the current context, preserving the `user_id`
contextvar read by `get_effective_user_id()`.
Add a runtime anchor under `tests/blocking_io/` that drives the real
`create_agent` graph via `ainvoke` under the strict Blockbuster gate, so a
regression back onto the event loop fails CI. Update blocking-IO docs.
* feat(agent): add ToolOutputBudgetMiddleware for oversized tool output protection
Closes#3289. Adds a unified middleware that enforces per-result budgets
on ALL tool outputs (MCP, sandbox, community, custom), preventing
oversized external tool results from blowing the model context window.
Design informed by claude-code (persistToolResult), hermes-agent
(tool_result_storage), and pi (OutputAccumulator) — the three most
mature implementations in production coding-agent frameworks.
Key features:
- Disk externalization: oversized outputs written to thread-local
.tool-results/ directory, replaced with compact preview + file
reference. Model can read full output via read_file with offset/limit.
- Fallback truncation: head+tail truncation when disk is unavailable
(no thread_data, write failure), ensuring the context is always
protected.
- read_file exemption: prevents persist-read-persist infinite loops
(independently discovered by claude-code, hermes-agent, and pi).
- Per-tool threshold overrides via config.
- Line-boundary-aware truncation (no partial lines in previews).
- Multimodal content passthrough (images/structured blocks skip budget).
- Historical ToolMessage patching in wrap_model_call for checkpoint
recovery scenarios.
Related: #3222 (design RFC), #1844 (comprehensive context management),
#3137 (write_file args compaction), #1677 (sandbox tool truncation).
* test: add MCP content_and_artifact format coverage
Add 5 tests for MCP tool output format (list of content blocks):
- text content blocks are extracted and budgeted
- multiple text blocks are joined and budgeted
- image content blocks are skipped (multimodal passthrough)
- mixed text+image blocks are skipped
- small text blocks pass through unchanged
Total test count: 59 (was 54).
* fix(agent): address Codex review findings for ToolOutputBudgetMiddleware
Three issues identified by Codex code review, all fixed:
1. `enabled` config field was unused — middleware now checks
`config.enabled` and skips all processing when disabled.
2. `_build_fallback` could exceed `fallback_max_chars` — the marker
text itself (~139 chars) was not deducted from the budget. Now
pre-computes marker overhead and falls back to hard slice when
max_chars is smaller than the marker.
3. Sync file I/O in async path — `awrap_tool_call` now delegates
`_patch_result` to `asyncio.to_thread` to avoid blocking the
event loop during disk writes.
Tests updated to use realistic fallback_max_chars values (500+)
that can accommodate the marker overhead, plus two new tests:
- `test_result_never_exceeds_max_chars` (parametric across sizes)
- `test_very_small_max_chars_does_not_crash`
* fix(agent): address Copilot review — path traversal, async perf, shared config
1. Path traversal defense: sanitize tool_name via _sanitize_tool_name()
(strips separators, .., absolute paths), validate storage_subdir is
relative, and verify resolved filepath stays inside storage_dir.
2. Async hot-path optimization: add _needs_budget() cheap check before
asyncio.to_thread offload — small outputs (99% of calls) skip the
thread overhead entirely.
3. Replace shared module-level _DEFAULT_CONFIG with _default_config()
factory to prevent cross-instance mutation of mutable fields.
12 new tests: TestSanitizeToolName (5), TestExternalizePathTraversal (3),
TestNeedsBudget (4).
* fix(agent): correct preview hint to match read_file actual API
read_file uses start_line/end_line (1-indexed line numbers), not
offset/limit. The previous wording was copied from hermes-agent
which has a different read_file interface.
* perf(agent): hoist hot-path imports, add model-call pre-scan (review #3303)
Address maintainer review feedback:
1. Hoist inline imports to module level — `import asyncio` (was in
awrap_tool_call hot path) and `from dataclasses import replace`
(was in _patch_result) now live at module top.
2. Add a cheap pre-scan to _patch_model_messages so the historical
message list is not rebuilt on every model call when nothing is
oversized (the common case once results are budgeted at tool-call
time). Also adds the same _needs_budget gate to the sync
wrap_tool_call for symmetry with awrap_tool_call.
The pre-scan is refactored into per-tool-aware helpers
(_effective_trigger / _tool_message_over_budget) that mirror the exact
trigger conditions in _budget_content — including tool_overrides — so
the fast-path can never produce a false negative (silently skipping
budgeting for a tool with a low per-tool threshold).
7 new regression tests lock the per-tool-override-through-pre-scan path
and the model-call early return.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(agents): preserve todos state across node updates
ThreadState.todos had no reducer, so any downstream node returning a
partial state without todos was implicitly setting it to None, which
LangGraph then used to overwrite the previously streamed value. This
caused the to-do list to render correctly during streaming but vanish
once streaming completed.
Add a merge_todos reducer that keeps the last non-None value, mirroring
the merge_artifacts pattern already used in the same file. An explicit
empty list is still respected so that 'user cleared todos' works.
Tests: 10 new unit tests in tests/test_thread_state_reducers.py covering
merge_todos plus regression coverage for merge_artifacts and
merge_viewed_images. All 69 thread-related tests pass locally.
Closes#3123
* test(agents): add annotation binding regression guard
Address Copilot review feedback on #3123:
- Add TestThreadStateAnnotations asserting that ThreadState.todos is
Annotated with merge_todos. Without this guard, reverting the
Annotated[list | None, merge_todos] binding would silently regress
#3123 while all existing reducer unit tests continue to pass.
- Align test imports to 'from deerflow.agents.thread_state import ...'
matching the rest of the backend test suite.
* fix(runtime): suppress tool execution when provider safety-terminates with tool_calls
When a provider stops generation for safety reasons (OpenAI/Moonshot
finish_reason=content_filter, Anthropic stop_reason=refusal, Gemini
finish_reason=SAFETY/BLOCKLIST/PROHIBITED_CONTENT/SPII/RECITATION/
IMAGE_SAFETY/...), the response may still carry truncated tool_calls.
LangChain's tool router treats any non-empty tool_calls as executable,
so partial arguments (e.g. write_file with a half-finished markdown)
get dispatched and the agent loops on retry.
Add SafetyFinishReasonMiddleware at after_model: detect safety
termination via a pluggable detector registry, clear both structured
tool_calls and raw additional_kwargs.tool_calls / function_call,
preserve response_metadata.finish_reason for downstream observers,
stamp additional_kwargs.safety_termination for traces, append a
user-facing explanation to message content (list-aware for thinking
blocks), and emit a safety_termination custom stream event so SSE
consumers can reconcile any "tool starting..." UI.
Default detectors cover OpenAI-compatible content_filter, Anthropic
refusal, and Gemini safety enums (text + image). Custom providers are
added via reflection (same pattern as guardrails). Wired into both
lead-agent and subagent runtimes.
Closes#3028
* fix(runtime): persist safety_termination as a middleware audit event
Address review on #3035: the SSE custom event is great for live
consumers but invisible to post-run audit. RunEventStore should carry
its own row so operators can answer "which runs were safety-suppressed
today?" from a single SQL query without joining the message body.
Worker now exposes the run-scoped RunJournal via
runtime.context["__run_journal"] (sentinel key, internal channel).
SafetyFinishReasonMiddleware calls the previously-unused
RunJournal.record_middleware, which emits
event_type = "middleware:safety_termination"
category = "middleware"
content = {name, hook, action, changes={
detector, reason_field, reason_value,
suppressed_tool_call_count,
suppressed_tool_call_names,
suppressed_tool_call_ids,
message_id, extras}}
Tool *arguments* are deliberately excluded — those are the very content
the provider filtered and persisting them would defeat the purpose of
the safety filter (per review note in #3035).
Graceful skips when journal is absent (subagent runtime, unit tests,
no-event-store local dev). Journal exceptions never propagate into the
agent loop.
Refs #3028
* fix(runtime): satisfy ruff format + address Copilot review
- ruff format on safety_finish_reason_config.py and e2e demo (CI lint
failed on ruff format --check; backend Makefile lint target runs
ruff check AND ruff format --check).
- Docstring on SafetyFinishReasonConfig now says resolve_variable to
match the actual loader used in from_config (the wording was
resolve_class previously; behavior is unchanged — resolve_variable
mirrors how guardrails.provider is loaded).
- Switch the AIMessage type check in SafetyFinishReasonMiddleware._apply
from getattr(last, "type") == "ai" to isinstance(last, AIMessage),
matching TokenUsageMiddleware / TodoMiddleware / ViewImageMiddleware
/ SummarizationMiddleware which are the dominant pattern.
Refs #3028
* fix(tracing): propagate session_id and user_id into Langfuse traces
Adds Langfuse v4 reserved trace attributes (langfuse_session_id,
langfuse_user_id, langfuse_trace_name, langfuse_tags) to
RunnableConfig.metadata inside the run worker, so the langchain
CallbackHandler can lift them onto the root trace.
- New deerflow.tracing.metadata.build_langfuse_trace_metadata() returns
the reserved keys when Langfuse is in the enabled providers, else {}.
- worker.run_agent merges them with setdefault so caller-supplied keys
win, allowing per-request overrides from upstream metadata.
- session_id mirrors the LangGraph thread_id; user_id reads
get_effective_user_id() (falls back to "default" in no-auth mode).
- trace_name defaults to "lead-agent"; tags carry env and model name
when DEER_FLOW_ENV (or ENVIRONMENT) and a model name are present.
Closes#2930
* fix(tracing): attach Langfuse callback at graph root so metadata propagates
The first commit injected ``langfuse_session_id`` / ``langfuse_user_id`` /
``langfuse_trace_name`` / ``langfuse_tags`` into ``RunnableConfig.metadata``,
but on ``main`` the Langfuse callback is attached at *model* level
(``models/factory.py``). LangChain still threads ``parent_run_id`` through
the contextvar, so the handler sees the model as a nested observation and
``__on_llm_action`` strips the ``langfuse_*`` keys
(``keep_langfuse_trace_attributes=False``). The trace's top-level
``sessionId`` / ``userId`` therefore stayed empty in deer-flow's LangGraph
runtime — confirmed live against a real Langfuse instance.
This commit moves the callback to the **graph invocation root** so the
handler fires ``on_chain_start(parent_run_id=None)`` and runs the
``propagate_attributes`` path that actually lifts ``session_id`` /
``user_id`` onto the trace:
- ``models/factory.py``: add ``attach_tracing`` keyword (default ``True``)
so standalone callers (``MemoryUpdater``, etc.) keep their direct
model-level tracing.
- ``agents/lead_agent/agent.py``: call ``build_tracing_callbacks()`` once
inside ``_make_lead_agent`` and append the result to
``config["callbacks"]``; the four in-graph ``create_chat_model`` sites
(bootstrap, default agent, sync + async summarization) pass
``attach_tracing=False`` to avoid duplicate spans.
- ``agents/middlewares/title_middleware.py``: same ``attach_tracing=False``
for the title-generation model, since it inherits the graph's
RunnableConfig via ``_get_runnable_config``.
Test updates:
- ``tests/test_lead_agent_model_resolution.py`` and
``tests/test_title_middleware_core_logic.py``: extend the fake
``create_chat_model`` signatures / mock assertions to accept the new
``attach_tracing`` kwarg.
- ``tests/test_worker_langfuse_metadata.py``: switch the no-user fallback
test from direct ContextVar mutation to ``monkeypatch.setattr`` on
``get_effective_user_id`` to avoid pollution across the langfuse OTel
global tracer provider.
- ``tests/conftest.py``: add an autouse fixture that resets
``deerflow.config.title_config._title_config`` to its pristine default
after every test. Any test that loads the real ``config.yaml`` (via
``get_app_config()``) calls ``load_title_config_from_dict`` and mutates
the module-level singleton, which previously poisoned the
title-middleware suite when run after, e.g., the new
``test_worker_langfuse_metadata.py`` cases. The fixture is independent
of this PR's main change but unblocks the cross-file test run.
Live verification (same Langfuse instance as before):
- Drove ``worker.run_agent`` against the real ``make_lead_agent`` +
``gpt-4o-mini`` for three distinct ``user_context`` identities
(``fancy-engineer``, ``alice-pm``, ``bob-designer``).
- Each run produced one ``lead-agent`` trace whose top-level
``sessionId`` / ``userId`` / ``tags`` carry the expected values, e.g.
``session=e2e-2930-8f347c-alice-pm user=alice-pm name='lead-agent'
tags=['model:gpt-4o-mini']``.
Refs #2930.
* fix(tracing): extend root-callback + metadata injection to the embedded client
Addresses Copilot review on PR #2944.
Commit 2 disabled model-level tracing for ``TitleMiddleware`` and
``_create_summarization_middleware`` because ``_make_lead_agent`` now
attaches the tracing callbacks at the graph invocation root. But the
embedded ``DeerFlowClient`` does not call ``_make_lead_agent`` — it
calls ``_build_middlewares`` directly and never appends the tracing
handlers to its ``RunnableConfig``. So under the embedded path,
title-generation and summarization LLM calls were left untraced —
a regression introduced by this PR.
This commit mirrors the gateway worker's injection in
``DeerFlowClient.stream``:
- Append ``build_tracing_callbacks()`` to ``config["callbacks"]`` so
the Langfuse handler sees ``on_chain_start(parent_run_id=None)`` at
the graph root and runs the ``propagate_attributes`` path.
- Merge ``build_langfuse_trace_metadata(...)`` into
``config["metadata"]`` with ``setdefault`` so caller-supplied keys
still win.
- ``_ensure_agent`` now creates its main model with
``attach_tracing=False`` to avoid duplicate spans now that the
callback lives at the graph root.
Docs:
- ``backend/CLAUDE.md`` Tracing section rewritten to describe the
graph-root attachment model (replacing the inaccurate
"at model-creation time" wording).
- ``README.md`` Langfuse section now lists both injection points
(worker + client) instead of only the worker path.
Tests:
- ``tests/test_client_langfuse_metadata.py`` (new, 3 cases):
callbacks + metadata are injected when Langfuse is enabled,
caller-supplied metadata overrides win via ``setdefault``, and the
injection is inert when Langfuse is disabled.
Live verification on the real Langfuse instance:
=== user=fancy-client ===
id=cbd22847.. session=client-2930-6b9491-fancy-client user=fancy-client name='lead-agent'
=== user=alice-client ===
id=b4f6f576.. session=client-2930-6b9491-alice-client user=alice-client name='lead-agent'
Refs #2930.
* refactor(tracing): address maintainer review on PR #2944
Addresses @WillemJiang's 5 comments.
1. Duplicated metadata-injection code between worker.py and client.py
New ``deerflow.tracing.inject_langfuse_metadata(config, ...)`` helper
takes the 10-line build + merge + setdefault logic that was duplicated
in ``runtime/runs/worker.py`` and ``client.py``. Both callers now share
a single source of truth, so the two paths cannot drift.
2. Direct private-attribute mutation in conftest.py and tests
Added public ``reset_tracing_config()`` / ``reset_title_config()``
functions. ``tests/conftest.py`` and every test that previously did
``tracing_module._tracing_config = None`` or
``title_module._title_config = TitleConfig()`` now goes through the
public API. A future internal rename will surface as an ImportError
instead of a silent no-op.
3. client.py reading os.environ directly
``DeerFlowClient.__init__`` grows an optional ``environment`` parameter
so programmatic callers can pass the deployment label explicitly.
``stream()`` consults ``self._environment`` first and only falls back
to ``DEER_FLOW_ENV`` / ``ENVIRONMENT`` env vars when nothing was
passed in. Backwards compatible — env-var behaviour preserved for
callers that opt to keep using it.
4. build_tracing_callbacks() cached on hot path
Not implemented. Inspected the langfuse v4 ``langchain.CallbackHandler``
constructor: it only resolves the module-level singleton client via
``get_client()`` and initialises a few dicts (no I/O, no env parsing
at construction time). The build is essentially free. Caching would
trade a non-measurable speedup for two real risks: handler instances
carry per-run state internally (``_run_states``, ``_root_run_states``,
``last_trace_id``), and tracing config can be reloaded by env-var
changes between runs. Will revisit if profiling ever shows it as
a hot spot.
5. attach_tracing=False easy to forget at new in-graph call sites
- Module docstring at the top of ``lead_agent/agent.py`` documents
the invariant ("every in-graph ``create_chat_model`` MUST pass
``attach_tracing=False``") and enumerates the current sites.
- New regression test
``test_make_lead_agent_attaches_tracing_callbacks_at_graph_root`` in
``tests/test_lead_agent_model_resolution.py`` locks both halves of
the invariant: ``config["callbacks"]`` carries the tracing handler
after ``_make_lead_agent``, AND every ``create_chat_model`` call
captured by the test passes ``attach_tracing=False``. A future
in-graph site that forgets the flag will fail this test.
Lint clean. Full touched-suite bundle: 246 passed.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(loop-detection): defer warn injection to wrap_model_call
The warn branch in LoopDetectionMiddleware injected a HumanMessage
into state from after_model. The tools node had not yet produced
ToolMessage responses to the previous AIMessage(tool_calls=...), so
the new HumanMessage landed *between* the assistant's tool_calls and
their responses. OpenAI/Moonshot reject the next request with
"tool_call_ids did not have response messages" because their
validators require tool_calls to be followed immediately by tool
messages.
Detection now runs in after_model as before, but only enqueues the
warning into a per-thread list. Injection happens in wrap_model_call,
where every prior ToolMessage is already present in request.messages.
The warning is appended at the end as HumanMessage(name="loop_warning")
— pairing intact, AIMessage semantics untouched, no SystemMessage
issues for Anthropic.
Closes#2029, addresses #2255#2293#2304#2511.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(channels): remove loop warning display filter
* feat(loop-detection): scope pending warnings by run
* docs(loop-detection): update docs
* test(loop-detection): assert deferred warnings are queued
* fix(loop-detection): cap transient warning state
* docs: update docs
* add async awrap_model_call test coverage
* docs(loop-detection): document transient warnings
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(trace):memory 中文 in trace is unicode
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(middleware): Prevent todo completion reminder IMMessage leak (#2892)
* make format
* fix(middleware): Clear stale todo reminder counts (#2892)
* add size guard for _completion_reminder_counts and add a integration test
* fix(memory): isolate queued memory updates by agent
* fix(memory): include user in queue identity
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Fix the lint error
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat: real-time subagent token usage display in header and per-turn
Backend:
- Persist subagent token usage to AIMessage.usage_metadata via
TokenUsageMiddleware, so accumulateUsage() naturally includes
subagent tokens without frontend state management
- Cache subagent usage by tool_call_id in task_tool, write back
to the dispatching AIMessage on next model response
- Emit subagent token usage on all terminal task events
(task_completed, task_failed, task_cancelled, task_timed_out)
- Report subagent usage to parent RunJournal for API totals
- Search backward from ToolMessage to find dispatching AIMessage
for correct multi-tool-call attribution
Frontend:
- Remove subagentUsage state, custom event handling, and prop
threading — subagent tokens are now embedded in message metadata
- Simplify selectHeaderTokenUsage (no subagentUsage parameter)
- Per-turn inline badges show turn-specific usage via message
accumulation
- Remove isLoading guard from MessageTokenUsageList for dynamic
updates during streaming
* fix: prevent header token double counting from baseline reset race
onFinish, onError, and thread-switch useEffect all reset
pendingUsageBaselineMessageIdsRef to an empty Set. If
thread.isLoading is still true on the next render, all messages
pass the getMessagesAfterBaseline filter and their tokens are
added to backendUsage (which already includes them), causing
the header to display up to 2× the actual token count.
Capture current message IDs instead of using an empty Set so
that getMessagesAfterBaseline correctly returns no pending
messages even if thread.isLoading lags behind the stream end.
* fix: write back subagent tokens for all concurrent task tool calls
TokenUsageMiddleware only processed messages[-2], so when a
single model response dispatched multiple task tool calls only
the last ToolMessage had its cached subagent usage written back
to the dispatch AIMessage.usage_metadata. Earlier tasks' usage
stayed in _subagent_usage_cache indefinitely (leak) and never
appeared in the per-turn inline token display.
Walk backward through all consecutive ToolMessages before the
new AIMessage, and accumulate updates targeting the same
dispatch message into one state update so overlapping writes
don't clobber each other.
* fix: clean up subagent usage cache entry on task cancellation
When a task_tool invocation is cancelled via CancelledError, any
cached subagent usage entry leaked because the TokenUsageMiddleware
writeback path never fires after cancellation. Pop the cache entry
before re-raising to prevent unbounded growth of the module-level
_subagent_usage_cache dict.
* fix: address token usage review feedback
* fix: handle missing config for subagent usage cache
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* feat(middleware): inject dynamic context via DynamicContextMiddleware
Move memory and current date out of the system prompt and into a
dedicated <system-reminder> HumanMessage injected once per session
(frozen-snapshot pattern) via a new DynamicContextMiddleware.
This keeps the system prompt byte-exact across all users and sessions,
enabling maximum Anthropic/Bedrock prefix-cache reuse.
Key design decisions:
- ID-swap technique: reminder takes the first HumanMessage's ID
(replacing it in-place via add_messages), original content gets a
derived `{id}__user` ID (appended after). Preserves correct ordering.
- hide_from_ui: True on reminder messages so frontend filters them out.
- Midnight crossing: date-update reminder injected before the current
turn's HumanMessage when the conversation spans midnight.
- INFO-level logging for production diagnostics.
Also adds prompt-caching breakpoint budget enforcement tests and
updates ClaudeChatModel docs to reference the new pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(token-usage): log input/output token detail breakdown in middleware
Extend the LLM token usage log line to include input_token_details and
output_token_details (cache_creation, cache_read, reasoning, audio, etc.)
when present. Adds tests covering Anthropic cache detail logging from
both usage_metadata and response_metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: fix nginx
* fix(middleware): always inject date; gate memory on injection_enabled
Date injection is now unconditional — it is part of the static system
prompt replacement and should always be present. Memory injection
remains gated by `memory.injection_enabled` in the app config.
Previously the entire DynamicContextMiddleware was skipped when
injection_enabled was False, which also suppressed the date.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): format files and correct test assertions for token usage middleware
- ruff format dynamic_context_middleware.py and test_claude_provider_prompt_caching.py
- Remove unused pytest import from test_dynamic_context_middleware.py
- Fix two tests that asserted response_metadata fallback logic that
doesn't exist: replace with tests that match actual middleware behavior
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(middleware): address Copilot review comments on DynamicContextMiddleware
- Use additional_kwargs flag for reminder detection instead of content
substring matching, so user messages containing '<system-reminder>'
are not mistakenly treated as injected reminders
- Generate stable UUID when original HumanMessage.id is None to prevent
ambiguous 'None__user' derived IDs and message collisions
- Downgrade per-turn no-op log to DEBUG; keep actual injection events at INFO
- Add two new tests: missing-id UUID fallback and user-text false-positive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Make loop detection configurable
Expose LoopDetectionMiddleware thresholds through config.yaml while preserving existing defaults and allowing the middleware to be disabled.
Refs bytedance/deer-flow#2517
* feat(loop-detection): add per-tool tool_freq_overrides to Phase 1
Adds ToolFreqOverride model and tool_freq_overrides field to
LoopDetectionConfig, wires it through LoopDetectionMiddleware, and
documents the option in config.example.yaml.
Resolves the gap flagged in the #2586 review: without per-tool overrides,
users hit by #2510/#2511 (RNA-seq workflows exceeding the bash hard limit)
had no way to raise thresholds for one tool without loosening the global
limit for every tool.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs(loop-detection): document tool_freq_overrides in LoopDetectionMiddleware docstring
Add the missing Args entry for tool_freq_overrides, explaining the
(warn, hard_limit) tuple structure and how per-tool thresholds supersede
the global tool_freq_warn / tool_freq_hard_limit for named tools.
Also run ruff format on the three files flagged by the lint check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(loop-detection): validate LoopDetectionMiddleware __init__ params eagerly
Raise clear ValueError at construction time instead of crashing at
unpack-time inside _track_and_check when bad values are passed:
- tool_freq_overrides: must be 2-tuples of positive ints with hard_limit >= warn
- scalar thresholds: warn_threshold, hard_limit, tool_freq_warn,
tool_freq_hard_limit must be >= 1 and hard limits must >= their warn pairs
- window_size, max_tracked_threads must be >= 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(test): isolate credential loader directory-path test from real ~/.claude
The test didn't monkeypatch HOME, so on any machine with real Claude Code
credentials at ~/.claude/.credentials.json the function fell through to
those credentials and the assertion failed. Adding HOME redirect ensures
the default credential path doesn't exist during the test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style(test): add blank lines after import pytest in TestInitValidation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(loop-detection): collapse dual validation to LoopDetectionConfig
Modifications
- LoopDetectionMiddleware.__init__: stripped of all ValueError raises;
becomes a plain field-assignment constructor.
- LoopDetectionMiddleware.from_config: classmethod that builds the
middleware from a Pydantic-validated LoopDetectionConfig and handles
the ToolFreqOverride -> tuple[int, int] conversion.
- agents/factory.py: SDK construction routed through
LoopDetectionMiddleware.from_config(LoopDetectionConfig()) so the
defaults path is Pydantic-validated too.
- agents/lead_agent/agent.py: uses from_config instead of unpacking
config fields by hand.
- tests/test_loop_detection_middleware.py: deleted TestInitValidation
(16 methods exercising the removed __init__ checks); added
TestFromConfig (4 tests: scalar field mapping, override tuple
conversion, empty overrides, behavioral smoke test).
Result: one validation layer (Pydantic), zero duplication, no __new__
hacks. Both production construction sites flow through LoopDetectionConfig.
Test results
make test -> 2977 passed, 18 skipped, 0 failed (137s)
make format -> All checks passed; 411 files left unchanged
* feat(agents): make loop_detection configurable in create_deerflow_agent
Adds a `loop_detection: bool | AgentMiddleware = True` field to
RuntimeFeatures, mirroring the existing pattern used by `sandbox`,
`memory`, and `vision`. SDK users can now disable LoopDetectionMiddleware
or replace it with a custom instance built from their own
LoopDetectionConfig — e.g.
`LoopDetectionMiddleware.from_config(my_cfg)` — instead of being stuck
with the hardcoded defaults previously installed by the SDK factory.
The lead-agent path (which already reads AppConfig.loop_detection) is
unchanged, and the default `True` preserves prior always-on behavior for
all existing callers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: knight0940 <631532668@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Amorend <142649913+knight0940@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* feat(agent): add update_agent tool for in-chat custom-agent self-updates (#2616)
Custom agents had no built-in way to persist updates to their own SOUL.md /
config.yaml from a normal chat — `setup_agent` was only bound during the
bootstrap flow, so when the user asked the agent to refine its description
or personality, the agent would shell out via bash/write_file and the edits
landed in a temporary sandbox/tool workspace instead of
`{base_dir}/agents/{agent_name}/`.
Changes:
- New `update_agent` builtin tool with partial-update semantics (only the
fields you pass are written) and atomic temp-file + os.replace writes so
a failed update never corrupts existing SOUL.md / config.yaml.
- Lead agent now binds `update_agent` in the non-bootstrap path whenever
`agent_name` is set in the runtime context. Default agent (no
agent_name) and bootstrap flow are unchanged.
- New `<self_update>` system-prompt section is injected for custom agents,
instructing them to use `update_agent` — and explicitly NOT bash /
write_file — to persist self-updates.
- Tests: 11 new cases in `tests/test_update_agent_tool.py` covering
validation (missing/invalid agent_name, unknown agent, no fields),
partial updates (soul-only, description-only, skills=[] vs omitted),
no-op detection, atomic-write safety, and AgentConfig round-tripping;
plus 2 new cases in `tests/test_lead_agent_prompt.py` covering the
self-update prompt section.
- Docs: updated backend/CLAUDE.md builtin tools list and tools.mdx
(en/zh) with the new tool description.
* feat(agent): isolate custom agents per user
Store custom agent definitions under the effective user, keep legacy agents readable until migration, and cover API/tool/migration behavior with tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: consistent write/delete targets & add --user-id to migration
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(loop-detection): keep tool-call pairing on warn injection (#2724)
* make format
* fix(loop-detection): avoid IMMessage leak to downstream consumer
* fix(channels): filter loop warning text from IM replies
* refactor: thread app config through lead prompt
* fix: honor explicit app config across runtime paths
* style: format subagent executor tests
* fix: thread resolved app config and guard subagents-only fallback
Address two PR review findings:
1. _create_summarization_middleware passed the original (possibly None)
app_config into create_chat_model, forcing the model factory back to
ambient get_app_config() and risking config drift between the
middleware's resolved view and the model's view. Pass the resolved
AppConfig instance through end-to-end.
2. get_available_subagent_names accepted Any-typed config and forwarded
it to is_host_bash_allowed, which reads ``.sandbox``. A
SubagentsAppConfig (also accepted upstream as a sum-type input) has
no ``.sandbox`` attribute and would be silently treated as "no
sandbox configured", incorrectly disabling the bash subagent. Guard
on hasattr and fall back to ambient lookup otherwise.
Adds regression tests for both paths.
* chore: simplify hasattr guard and tighten regression tests
- Collapse if/else into ternary in get_available_subagent_names; hasattr(None, ...) is False so the explicit None check was redundant.
- Drop comments that narrate the change rather than explain non-obvious WHY (test names already convey intent).
- Replace stringly-typed sentinel "no-arg" in regression test with direct args tuple comparison.
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
* fix(subagents): use model override for tools and middleware
* fix(config): resolve effective subagent model
* fix(subagents): defer app config loading
* fix(subagents): fully defer config.yaml load in executor __init__
The previous attempt only relocated the explicit get_app_config() call,
but left resolve_subagent_model_name(...) running eagerly in __init__.
That helper has its own internal get_app_config() fallback, which still
fired when both app_config and parent_model were None and
config.model == "inherit" — exactly the path unit tests hit, breaking
21 tests in CI with FileNotFoundError: config.yaml.
Skip the eager resolve in __init__ when it would require loading the
config file, and defer to _create_agent (which already has the
app_config or get_app_config() fallback).
* fix(memory): replace short-lived asyncio.run() with persistent event loop to prevent zombie httpx connections
The memory updater used asyncio.run() inside daemon threads, creating
and destroying short-lived event loops on every update. Langchain
providers (e.g. langchain-anthropic) cache httpx AsyncClient instances
globally via @lru_cache, so SSL connections created on a loop that is
subsequently destroyed become zombie connections in the shared pool.
When the main agent's lead run later reuses one of these connections,
httpx/anyio triggers RuntimeError: Event loop is closed during
connection cleanup.
Replace the ThreadPoolExecutor + asyncio.run() pattern with a
_MemoryLoopRunner that maintains a single persistent event loop in a
daemon thread for the process lifetime. Since the loop never closes,
connections bound to it never become invalid. The _run_async_update_sync
function now submits coroutines to this persistent loop via
run_coroutine_threadsafe instead of creating throwaway loops.
* update the code to address the review comments
* Fix the review comments of 2615
P1 — user_id forwarded through sync path: Added user_id parameter to _prepare_update_prompt, _finalize_update, and _do_update_memory_sync, and forwarded it to get_memory_data(agent_name, user_id=user_id) and
save(..., user_id=user_id). The update_memory() entry point now passes user_id through both the executor.submit path and the direct call path. Added TestUserIdForwarding with two regression tests (sync + async)
verifying get_memory_data and save receive the correct user_id.
P2 — aupdate_memory() delegates to sync: Replaced the model.ainvoke() call with asyncio.to_thread(self._do_update_memory_sync, ...). This eliminates the unsafe async provider client path entirely — all memory
updater entry points now use the isolated sync model.invoke() path. Updated the test from asserting ainvoke is awaited to asserting invoke is called and ainvoke is not.
Nit — duplicate comment removed: Removed the duplicated # Matches sentences... comment on line 230.
* Chore(test): update the code of test_memory_updater
---------
Co-authored-by: rayhpeng <rayhpeng@gmail.com>
* refactor: thread app_config through middleware factories
Continues the incremental config-refactor sequence (#2611 root, #2612 lead
path) one layer deeper into the middleware factories. Two ambient lookups
inside _build_runtime_middlewares are eliminated and the LLMErrorHandling
band-aid removed:
- _build_runtime_middlewares / build_lead_runtime_middlewares /
build_subagent_runtime_middlewares now require app_config: AppConfig.
- get_guardrails_config() inside the factory is replaced with
app_config.guardrails (semantically identical — same default-factory
GuardrailsConfig — verified by direct equality check).
- LLMErrorHandlingMiddleware.__init__ now requires app_config and reads
circuit_breaker fields directly. The class-level
circuit_failure_threshold / circuit_recovery_timeout_sec defaults are
removed along with the try/except (FileNotFoundError, RuntimeError):
pass band-aid — the let-it-crash invariant the rest of the refactor
enforces.
Caller chain (already-resolved app_config sources):
- _build_middlewares in lead_agent/agent.py: reorder so
resolved_app_config = app_config or get_app_config() is computed BEFORE
build_lead_runtime_middlewares is called, then passed as kwarg.
- SubagentExecutor: optional app_config parameter (mirrors the lead-agent
pattern); _create_agent does the same `or get_app_config()` fallback at
agent-build time, so task_tool callers don't need to plumb app_config
through yet (typed-context plumbing for tool runtimes is a separate
refactor).
Tests:
- test_llm_error_handling_middleware: _make_app_config helper using
AppConfig(sandbox=SandboxConfig(use="test")) — same minimal-config
pattern conftest already uses. Three direct LLMErrorHandlingMiddleware()
calls each followed by post-construction circuit_breaker mutation fold
cleanly into _build_middleware(circuit_failure_threshold=...,
circuit_recovery_timeout_sec=...).
Verification:
- tests/test_llm_error_handling_middleware.py — 14 passed
- tests/test_subagent_executor.py — 28 passed
- tests/test_tool_error_handling_middleware.py — 6 passed
- tests/test_task_tool_core_logic.py — 18 passed (verifies task_tool
unchanged behavior)
- Full suite: 2697 passed, 3 skipped. The single intermittent failure in
tests/test_client_e2e.py::test_tool_call_produces_events is pre-existing
LLM flakiness (the test asserts the model decided to call a tool;
reproduces 1/3 on unchanged main as well).
* fix: address middleware app config review comments
* fix: satisfy app config annotation lint
* test: cover explicit app config middleware wiring
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>