* fix(runtime): harden JSONL async I/O and DB put_batch thread validation (#2816)
- JsonlRunEventStore: offload all file I/O to asyncio.to_thread() so the
event loop is never blocked; add per-thread asyncio.Lock to serialise
concurrent puts and prevent interleaved JSONL lines
- Split _ensure_seq_loaded into a sync _compute_max_seq (runs in thread)
and an async wrapper; seq counter is recovered from disk on fresh store init
- DbRunEventStore.put_batch: raise ValueError when events span multiple
thread_ids (previously silently assumed same thread)
- Add test_jsonl_event_store_async_io.py: 12 tests covering lock reuse,
concurrent seq monotonicity, disk recovery, and mixed-thread batch rejection
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address Copilot review comments
- delete_by_thread: pop _write_locks after releasing the lock to prevent
unbounded growth when threads are repeatedly created and deleted
- tests: add regression guard asserting asyncio.to_thread is called for
_write_record in put(); assert _write_locks entry removed on delete
* fix(lint): move patch import to local scope to fix ruff I001
* fix(lint): apply ruff check+format fixes to test file
* fix(runtime): address review feedback for JSONL async I/O hardening (#2816)
Use setdefault for atomic lock init in _get_write_lock; pop _write_locks
inside the held lock scope in delete_by_thread; update test docstring
and assert lock entry also cleared on delete.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: rayhpeng <rayhpeng@gmail.com>
* fix(mcp): skip session pooling for HTTP/SSE transports to avoid anyio RuntimeError (#3203)
HTTP/SSE transports use anyio.TaskGroup internally for streamable
connections. These task groups have cancel scopes bound to the async task
that created them, so closing a pooled session from a different task
raises RuntimeError. Restrict session pooling to stdio transports only.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs: clarify MCP pooling applies only to stdio tools
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2dd9881d-54c6-45fd-90bc-154a09e29841
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* feat(tests): add Blockbuster runtime gate for event-loop blocking IO
Adds a strict runtime gate that fails CI when sync blocking IO calls run
on the asyncio event loop thread through DeerFlow business code.
Components:
- backend/tests/support/detectors/blocking_io_runtime.py — Blockbuster
context scoped to `app.*` and `deerflow.*` so test infrastructure,
pytest internals, and third-party libraries stay silent.
- backend/tests/blocking_io/conftest.py — pytest_runtest_protocol
hookwrapper that wraps every item (setup + call + teardown) with the
strict context. Respects `@pytest.mark.allow_blocking_io` opt-out.
- backend/tests/blocking_io/test_skills_load.py — regression anchor for
the #1917 fix (asyncio.to_thread offload around
LocalSkillStorage.load_skills).
- backend/tests/blocking_io/test_sqlite_lifespan.py — regression anchor
for the #1912 fix (asyncio.to_thread offload around
ensure_sqlite_parent_dir).
- backend/tests/blocking_io/test_gate_smoke.py — meta-test asserting the
gate actually catches unoffloaded blocking IO and that the
`@pytest.mark.allow_blocking_io` opt-out works.
- backend/Makefile — `make test-blocking-io` target.
- .github/workflows/backend-blocking-io-tests.yml — hard-fail PR gate on
ubuntu-latest. Windows matrix deferred to follow-up.
Dependencies:
- blockbuster>=1.5.26,<1.6 added to dev group.
Coverage boundary (called out in PR body): the gate only catches blocking
IO on code paths the test suite actually exercises. Static AST inventory
(separate, informational) is the complementary coverage tool. Three blind
spot categories — untested paths, mocked-away paths, env-mismatched paths
— are documented in the PR description.
Findings surfaced while authoring this PR:
- resolve_sqlite_conn_str in runtime/store/_sqlite_utils.py:19 does sync
Path.resolve() -> os.path.abspath on the lifespan loop thread, ahead of
the #1912 fix. Not addressed here; tracked as follow-up.
Tests: 4 passed locally (`make test-blocking-io`).
Lint/format: clean (`ruff check` and `ruff format --check`).
* fix(tests): scope Blockbuster gate to blocking-io suite
* fix(tests): harden Blockbuster runtime gate
* test(blocking-io): add project rule extension point
* test(blocking-io): address review cleanup
* fix(sandbox): add group/other read permissions to uploaded files for Docker sandbox (#3127)
When using AIO sandbox with LocalContainerBackend, uploaded files are
created with 0o600 (owner-only) permissions by the gateway process
running as root. The sandbox process inside the Docker container runs
as a non-root user and cannot read these bind-mounted files, causing
a "Permission denied" error on read_file.
Add `needs_upload_permission_adjustment` attribute to SandboxProvider
(default True) to indicate that uploaded files need chmod adjustment.
LocalSandboxProvider opts out (same user). A new `_make_file_sandbox_readable`
function adds S_IRGRP | S_IROTH bits after files are written, changing
permissions from 0o600 to 0o644 so the sandbox can read the uploads.
fixes#3127
* fix(uploads): unconditionally adjust file permissions for sandbox access
The conditional check meant uploaded files retained 0o600
permissions in some Docker sandbox configurations, preventing the
sandbox process (UID 1000) from reading them. Always add group/other
read bits so every sandbox setup can access uploaded content. Also add
read bits to the sync-path writable helper as defense in depth.
* fix(agents): preserve todos state across node updates
ThreadState.todos had no reducer, so any downstream node returning a
partial state without todos was implicitly setting it to None, which
LangGraph then used to overwrite the previously streamed value. This
caused the to-do list to render correctly during streaming but vanish
once streaming completed.
Add a merge_todos reducer that keeps the last non-None value, mirroring
the merge_artifacts pattern already used in the same file. An explicit
empty list is still respected so that 'user cleared todos' works.
Tests: 10 new unit tests in tests/test_thread_state_reducers.py covering
merge_todos plus regression coverage for merge_artifacts and
merge_viewed_images. All 69 thread-related tests pass locally.
Closes#3123
* test(agents): add annotation binding regression guard
Address Copilot review feedback on #3123:
- Add TestThreadStateAnnotations asserting that ThreadState.todos is
Annotated with merge_todos. Without this guard, reverting the
Annotated[list | None, merge_todos] binding would silently regress
#3123 while all existing reducer unit tests continue to pass.
- Align test imports to 'from deerflow.agents.thread_state import ...'
matching the rest of the backend test suite.
* fix runtime run creation persistence atomicity
* fix run creation cancellation rollback
* fix run manager test cleanup await
* clarify run creation rollback on cancellation
* document new run persistence rollback boundary
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(runtime): suppress tool execution when provider safety-terminates with tool_calls
When a provider stops generation for safety reasons (OpenAI/Moonshot
finish_reason=content_filter, Anthropic stop_reason=refusal, Gemini
finish_reason=SAFETY/BLOCKLIST/PROHIBITED_CONTENT/SPII/RECITATION/
IMAGE_SAFETY/...), the response may still carry truncated tool_calls.
LangChain's tool router treats any non-empty tool_calls as executable,
so partial arguments (e.g. write_file with a half-finished markdown)
get dispatched and the agent loops on retry.
Add SafetyFinishReasonMiddleware at after_model: detect safety
termination via a pluggable detector registry, clear both structured
tool_calls and raw additional_kwargs.tool_calls / function_call,
preserve response_metadata.finish_reason for downstream observers,
stamp additional_kwargs.safety_termination for traces, append a
user-facing explanation to message content (list-aware for thinking
blocks), and emit a safety_termination custom stream event so SSE
consumers can reconcile any "tool starting..." UI.
Default detectors cover OpenAI-compatible content_filter, Anthropic
refusal, and Gemini safety enums (text + image). Custom providers are
added via reflection (same pattern as guardrails). Wired into both
lead-agent and subagent runtimes.
Closes#3028
* fix(runtime): persist safety_termination as a middleware audit event
Address review on #3035: the SSE custom event is great for live
consumers but invisible to post-run audit. RunEventStore should carry
its own row so operators can answer "which runs were safety-suppressed
today?" from a single SQL query without joining the message body.
Worker now exposes the run-scoped RunJournal via
runtime.context["__run_journal"] (sentinel key, internal channel).
SafetyFinishReasonMiddleware calls the previously-unused
RunJournal.record_middleware, which emits
event_type = "middleware:safety_termination"
category = "middleware"
content = {name, hook, action, changes={
detector, reason_field, reason_value,
suppressed_tool_call_count,
suppressed_tool_call_names,
suppressed_tool_call_ids,
message_id, extras}}
Tool *arguments* are deliberately excluded — those are the very content
the provider filtered and persisting them would defeat the purpose of
the safety filter (per review note in #3035).
Graceful skips when journal is absent (subagent runtime, unit tests,
no-event-store local dev). Journal exceptions never propagate into the
agent loop.
Refs #3028
* fix(runtime): satisfy ruff format + address Copilot review
- ruff format on safety_finish_reason_config.py and e2e demo (CI lint
failed on ruff format --check; backend Makefile lint target runs
ruff check AND ruff format --check).
- Docstring on SafetyFinishReasonConfig now says resolve_variable to
match the actual loader used in from_config (the wording was
resolve_class previously; behavior is unchanged — resolve_variable
mirrors how guardrails.provider is loaded).
- Switch the AIMessage type check in SafetyFinishReasonMiddleware._apply
from getattr(last, "type") == "ai" to isinstance(last, AIMessage),
matching TokenUsageMiddleware / TodoMiddleware / ViewImageMiddleware
/ SummarizationMiddleware which are the dominant pattern.
Refs #3028
* fix(mcp): persist MCP sessions across tool calls for stateful servers
MCP tools loaded via langchain-mcp-adapters created a new session on
every call, causing stateful servers like Playwright to lose browser
state (pages, forms) between consecutive tool invocations within the
same thread.
Add MCPSessionPool that maintains persistent sessions scoped by
(server_name, thread_id). Tool calls within the same thread now reuse
the same MCP session, preserving server-side state. Sessions are evicted
in LRU order (max 256) and cleaned up on cache invalidation.
Fixes#3054
* fix(sandbox): add group/other read permissions to uploaded files for Docker sandbox (#3127)
When using AIO sandbox with LocalContainerBackend, uploaded files are
created with 0o600 (owner-only) permissions by the gateway process
running as root. The sandbox process inside the Docker container runs
as a non-root user and cannot read these bind-mounted files, causing
a "Permission denied" error on read_file.
Add `needs_upload_permission_adjustment` attribute to SandboxProvider
(default True) to indicate that uploaded files need chmod adjustment.
LocalSandboxProvider opts out (same user). A new `_make_file_sandbox_readable`
function adds S_IRGRP | S_IROTH bits after files are written, changing
permissions from 0o600 to 0o644 so the sandbox can read the uploads.
* fix(mcp): address review comments on session pool and tools
- _extract_thread_id: return "default" instead of stringifying None
when get_config() returns no thread_id
- call_with_persistent_session: fix **arguments annotation from
dict[str,Any] to Any
- Replace private _convert_call_tool_result import with a local
implementation that handles all MCP content block types
- _make_session_pool_tool: accept tool_interceptors and apply the
configured interceptor chain on every call (preserving OAuth and
custom interceptors)
- MCPSessionPool: replace asyncio.Lock with threading.Lock; restructure
get/close methods to never await while holding the lock; add
close_all_sync() that closes sessions on their owning event loops
- reset_mcp_tools_cache: use pool.close_all_sync() instead of
asyncio.run-in-thread to close sessions deterministically
- test: add test_session_pool_tool_sync_wrapper_path_is_safe covering
tool invocation via the sync wrapper (tool.func) path
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/9e7f9e7f-1d2b-464a-b3b7-7f1649b74122
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* fix(mcp): extract SESSION_CLOSE_TIMEOUT to class constant
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/9e7f9e7f-1d2b-464a-b3b7-7f1649b74122
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* Potential fix for pull request finding 'Empty except'
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(task-tool): unwrap callback manager when locating usage recorder
`config["callbacks"]` may arrive as a `BaseCallbackManager` (e.g. the
`AsyncCallbackManager` LangChain hands to async tool runs), not just a plain
list. The previous `for cb in callbacks` loop raised
`TypeError: 'AsyncCallbackManager' object is not iterable`, which
`ToolErrorHandlingMiddleware` then converted into a failed `task` ToolMessage
even though the subagent had completed internally — Ultra mode lost subagent
results and the lead agent fell back to redoing the work.
Unwrap `BaseCallbackManager.handlers` before searching for the recorder.
Refs: bytedance/deer-flow#3107 (BUG-002)
* fix(frontend): treat any task tool error as a terminal subtask failure
The subtask card status machine matched only three English prefixes (`Task
Succeeded. Result:`, `Task failed.`, `Task timed out`). Anything else fell
through to `in_progress`, so a `task` tool error wrapped by
`ToolErrorHandlingMiddleware` (`Error: Tool 'task' failed ...`) left the card
spinning forever even after the run had ended.
Extract the prefix logic into `parseSubtaskResult` and recognise any leading
`Error:` token as a terminal failure. The extracted function is unit-tested
against the legacy prefixes plus the `AsyncCallbackManager` regression
captured in the upstream issue.
Refs: bytedance/deer-flow#3107 (BUG-007)
* fix(frontend): exclude hidden, reasoning, and tool payloads from chat export
`formatThreadAsMarkdown` / `formatThreadAsJSON` iterated raw messages without
running the UI-level `isHiddenFromUIMessage` filter. Exported transcripts
therefore included `hide_from_ui` system reminders, memory injections,
provider `reasoning_content`, tool calls, and tool result messages — content
that is intentionally hidden in the chat view.
Filter the export to the user-visible transcript by default and gate
reasoning / tool calls / tool messages / hidden messages behind explicit
`ExportOptions` flags so a future debug export can opt back in without
forking the formatter.
Refs: bytedance/deer-flow#3107 (BUG-006)
* fix(gateway): route get_config through get_app_config for mtime hot reload
`get_config(request)` returned the `app.state.config` snapshot captured at
startup. The worker / lead-agent path then threaded that frozen `AppConfig`
through `RunContext` and `agent_factory`, so per-run fields edited in
`config.yaml` (notably `max_tokens`) were ignored until the gateway process
was restarted — even though `get_app_config()` already does mtime-based
reload at the bottom layer.
Route the request dependency through `get_app_config()` directly. Runtime
`ContextVar` overrides (`push_current_app_config`) and test-injected
singletons (`set_app_config`) keep working; `app.state.config` is now only
read at startup for one-shot bootstrap (logging level, IM channels,
`langgraph_runtime` engines).
`tests/test_gateway_deps_config.py` encoded the old snapshot contract and is
removed; `tests/test_gateway_config_freshness.py` replaces it with mtime,
ContextVar, and `set_app_config` coverage. `test_skills_custom_router.py` and
`test_uploads_router.py` now inject test configs via FastAPI
`dependency_overrides[get_config]` instead of mutating `app.state.config`.
Document the hot-reload boundary in `backend/CLAUDE.md` so reviewers know
which fields are picked up on the next request vs. which still require a
restart (`database`, `checkpointer`, `run_events`, `stream_bridge`,
`sandbox.use`, `log_level`, `channels.*`).
Refs: bytedance/deer-flow#3107 (BUG-001)
* fix(gateway): broaden get_config 503 to any config-load failure
Address review feedback on the previous commit:
1. Narrow exception catch removed. The old contract returned 503 whenever
`app.state.config is None`. The first cut only mapped
`FileNotFoundError`, leaving `PermissionError`, YAML parse errors, and
pydantic `ValidationError` to bubble up as 500. At the request boundary
we treat any inability to materialise the config as "configuration not
available" (503) and log the original exception so the operator still
has the stack.
2. Removed the unused `request: Request` parameter and the matching
`# noqa: ARG001`. FastAPI's `Depends()` does not require the dependency
to accept `Request`; the only call site uses the no-arg form.
3. `backend/CLAUDE.md` boundary now lists the *reason* each field is
restart-required (engine binding, singleton caching, one-shot
`apply_logging_level`, etc.), not just the field name, so reviewers do
not have to reverse-engineer the boundary themselves.
Tests parametrise four exception classes (`FileNotFoundError`,
`PermissionError`, `ValueError`, `RuntimeError`) and assert 503 for each.
Refs: bytedance/deer-flow#3107 (BUG-001)
* fix(task-tool): defend _find_usage_recorder against non-list callbacks
Address review feedback. The previous commit handled the two common shapes
LangChain hands to async tool runs — a plain `list[BaseCallbackHandler]` and
a `BaseCallbackManager` subclass — but iterated any other shape directly,
which would still raise `TypeError` if e.g. a single handler instance leaked
through without a list wrapper.
Treat any non-list, non-manager `config["callbacks"]` value as "no recorder"
rather than crash. Docstring now lists all four shapes explicitly. New tests
cover the single-handler-object case, `runtime is None`, `callbacks is None`,
and `runtime.config` being a non-dict — all required to be silent no-ops.
Refs: bytedance/deer-flow#3107 (BUG-002)
* fix(frontend): drop dead identity ternary and add opt-in export tests
Address review feedback on the previous export commit:
1. Removed the no-op `typeof msg.content === "string" ? msg.content : msg.content`
expression in `formatThreadAsJSON`. Both branches returned the same value;
the message content now flows through unchanged whether it is a string or
the rich `MessageContent[]` shape (LangChain JSON-serialises the array
structure correctly already).
2. Expanded the JSDoc on `ExportOptions` to make it clearer that the four
flags are not currently wired to any UI control — callers wanting a debug
export must build the options object explicitly. The default behaviour
continues to match the explicit prescription in
bytedance/deer-flow#3107 BUG-006.
3. Added opt-in coverage. The previous tests only exercised the
`options = {}` default path; the new cases verify each flag flips the
corresponding payload back into the export so a future debug-export
surface does not silently break the contract.
Refs: bytedance/deer-flow#3107 (BUG-006)
* fix(frontend): export subtask prefix constants and document fallback intent
Address review feedback on the previous BUG-007 commit:
1. `SUCCESS_PREFIX`, `FAILURE_PREFIX`, `TIMEOUT_PREFIX`, and the
`ERROR_WRAPPER_PATTERN` regex are now exported. The JSDoc explicitly
pins them as part of the backend↔frontend contract defined in
`task_tool.py` and `tool_error_handling_middleware.py`, so any future
structured-status migration (e.g. backend writing
`additional_kwargs.subagent_status` instead of leading text) can
reference these from one canonical place rather than redefine them.
2. The `in_progress` fallback now carries a docstring explaining the
deliberate choice — LangChain only ever emits a `ToolMessage` once the
tool itself has returned, so unrecognised content means the contract
has drifted and "still running" is the right operator signal (eagerly
marking it terminal-failed would mask the drift).
No behaviour change; this is documentation and an API export.
Refs: bytedance/deer-flow#3107 (BUG-007)
* fix(gateway): drop app.state.config snapshot and freeze run_events_config
Address @ShenAC-SAC's BUG-001 review on #3131. The previous cut still
stored an ``AppConfig`` snapshot on ``app.state.config`` for startup
bootstrap. Two follow-on hazards from that:
1. Future code touching the gateway lifespan could accidentally start
reading ``app.state.config`` again, silently regressing the request
hot path back to a stale snapshot.
2. ``get_run_context()`` paired a freshly-reloaded ``AppConfig`` with the
startup-bound ``event_store`` and a *live* ``run_events_config``
field — so an operator who edited ``run_events.backend`` mid-flight
would have produced a run context whose ``event_store`` and
``run_events_config`` referred to different backends.
Clean approach (aligned with the direction in PR #3128):
- ``lifespan()`` keeps a local ``startup_config`` variable and passes it
explicitly into ``langgraph_runtime(app, startup_config)`` and into
``start_channel_service``. No ``app.state.config`` attribute is set at
any point.
- ``langgraph_runtime`` now accepts ``startup_config`` as a required
parameter, removing the ``getattr(app.state, "config", None)`` lookup
and the "config not initialised" runtime error.
- The matching ``run_events_config`` is frozen onto ``app.state`` next
to ``run_event_store`` so ``get_run_context`` reads the two from the
same startup-time source. ``app_config`` continues to be resolved
live via ``get_app_config()``.
- ``backend/CLAUDE.md`` boundary explanation updated to spell out the
``startup_config`` / ``get_app_config()`` split.
New regression test ``test_run_context_app_config_reflects_yaml_edit``
exercises the worker-feeding path: it asserts that ``ctx.app_config``
follows a mid-flight ``config.yaml`` edit while
``ctx.run_events_config`` stays frozen to the startup snapshot the
event store was built from.
Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review
* fix(frontend): parse Task cancelled and polling timed out as terminal
Address @ShenAC-SAC's BUG-007 review on #3131. `task_tool.py` actually
emits five terminal strings:
- `Task Succeeded. Result: …`
- `Task failed. …`
- `Task timed out. …`
- `Task cancelled by user.` ← previously matched none
- `Task polling timed out after N minutes …` ← previously matched none
The previous cut handled three; the last two fell through to the
"unknown content" branch and pushed the subtask card back to
`in_progress` even though the backend had already reached a terminal
state. Add explicit matches plus regression tests for both. The
`in_progress` fallback is now reserved for genuinely unrecognised
output (i.e. contract drift), as documented.
Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review
* fix(frontend): sanitize JSON export content via the Markdown content path
Address @ShenAC-SAC's BUG-006 review and the Copilot inline comment on
#3131. The previous cut filtered hidden/tool messages out of the JSON
export but still serialised `msg.content` verbatim, so:
- inline `<think>…</think>` wrappers stayed in the exported `content`
even with `includeReasoning: false`,
- content-array thinking blocks leaked the `thinking` field,
- `<uploaded_files>…</uploaded_files>` markers leaked the workspace
paths a user uploaded files to.
JSON now goes through the same sanitiser the Markdown path uses
(`extractContentFromMessage` + `stripUploadedFilesTag`). Reasoning and
tool_calls remain gated behind their `ExportOptions` flags. AI / human
rows that sanitise to empty content with no opted-in reasoning or tool
calls are dropped so the JSON matches the Markdown path's `continue`
on empty assistant fragments.
New regression tests cover the three leak shapes the reviewer called
out plus the empty-content-drop case.
Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review
* test(gateway): align lifespan stub with langgraph_runtime two-arg signature
Codex round-3 review of c0bc7a06 flagged this: changing
`langgraph_runtime` to require `startup_config` as a second positional
argument broke the one-arg stub `_noop_langgraph_runtime(_app)` in
`test_gateway_lifespan_shutdown.py`, which is patched into
`app.gateway.app.langgraph_runtime` by the lifespan shutdown bounded-timeout
regression. Lifespan would then call the stub with two args and raise
`TypeError` before the bounded-shutdown assertion ran.
Update the stub to match the new signature. The shutdown test itself is
unaffected — it only cares about the channel `stop_channel_service` hang
path.
Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review
* fix(frontend): strip every known backend marker in export, not just uploads
Codex round-3 review of 258ca800 and the matching maintainer feedback on
PR #3131 made the same point: the JSON export now ran the
Markdown-side sanitiser, but that sanitiser only stripped
`<uploaded_files>`. The full set of payloads middleware embeds inside
message `content` is larger:
- `<uploaded_files>` — `UploadsMiddleware`
- `<system-reminder>` — `DynamicContextMiddleware`
- `<memory>` — `DynamicContextMiddleware` (nested inside system-reminder)
- `<current_date>` — `DynamicContextMiddleware`
The primary protection is still `isHiddenFromUIMessage`: the
`<system-reminder>` HumanMessage is marked `hide_from_ui: true` and never
reaches the formatter. This commit adds the second line of defence so a
regression that drops the `hide_from_ui` flag — or any future middleware
that injects the same tag vocabulary into a visible HumanMessage —
cannot leak the payload into the export file.
Concrete changes:
- New `INTERNAL_MARKER_TAGS` constant + `stripInternalMarkers(content)`
helper in `core/messages/utils.ts`. The constant doubles as
documentation for the backend↔frontend contract.
- `formatMessageContent` in `export.ts` now calls `stripInternalMarkers`
instead of `stripUploadedFilesTag`. UI render paths
(`message-list-item.tsx`) keep using the narrower function so a user
legitimately typing `<memory>` in a meta-discussion is preserved.
- The "drop empty rows" guard in `buildJSONMessage` switched from
`=== undefined` to truthy `!` checks. Codex spotted the asymmetry: when
`extractReasoningContentFromMessage` returned the empty string (which it
legitimately can), the JSON path emitted `{reasoning: ""}` while the
Markdown path's `!reasoning` `continue` correctly dropped the row.
New regression tests cover the defence-in-depth strip with a
`<system-reminder><memory><current_date>` payload deliberately *not*
marked `hide_from_ui`; tool-message sanitization under
`includeToolMessages: true`; the mixed-content-array case
(`thinking + text + image_url`); and the opted-in empty-reasoning drop.
Live verification on a real Ultra-mode thread that uploaded a PDF
(`曾鑫民-薪资交易流水.pdf`): backend state's first HumanMessage carries the
`<uploaded_files>` block (with `/mnt/user-data/uploads/...` paths) as part
of a content-array. The Markdown and JSON export blobs both come back
free of `<uploaded_files>`, `<system-reminder>`, `<current_date>`,
`tool_calls`, and reasoning — while preserving the user's `这是什么 ?`
prompt and the assistant's visible answer.
Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review
* test(frontend): cover trim, varied N, and pre-execution Error: prefixes
Codex round-3 review of 50e2c257 flagged three coverage gaps in the
subtask-status parser:
1. `Task cancelled by user.` and `Task polling timed out` previously had
no whitespace-trim coverage — the original trim test only exercised
the success prefix. Streaming chunks can arrive with leading/trailing
newlines; the regex needed an explicit assertion.
2. The polling-timeout case was tested only at one `N` (15 minutes). The
backend interpolates the live `timeout_seconds // 60` value, so the
matcher must hold for any positive integer. Now we run the case for
1, 5, and 60 minutes.
3. `task_tool.py` also emits three `Error:` strings for pre-execution
failures — unknown subagent type, host-bash disabled, and "task
disappeared from background tasks". They are intentionally handled by
`ERROR_WRAPPER_PATTERN` rather than dedicated prefixes (the wrapper
already produces the right terminal-failed shape) but had no test
coverage proving that wiring. Codex was right that a refactor splitting
one of them off into its own prefix would silently break things.
The JSDoc on the constants block now spells the three pre-execution
errors out so the relationship between `task_tool.py` returns and the
prefix vocabulary is explicit.
No production code change beyond the docstring — this commit is pure
coverage hardening for the contract that already exists.
Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review
* fix(runtime): bound write_file execution-failure observations
* fix(runtime): preserve write_file error prefixes
* test(runtime): trim write_file prefix assertions
* refactor(runtime): drop redundant exception suffix for permission/directory write errors
Address Copilot review on #3133: the PermissionError and IsADirectoryError
branches now return self-contained, non-redundant messages (e.g.
"Error: Permission denied writing to file: /mnt/...") via direct
truncation, instead of going through _format_write_file_error which
appended a duplicate ": PermissionError: permission denied" suffix.
OSError, SandboxError and the generic Exception branches keep the
unified "Failed to write file '{path}': {ExceptionType}: {detail}"
format so the model still sees a stable, machine-readable error class.
Removes the now-unused message= parameter from _format_write_file_error,
keeping a single code path. Truncation contract (<= 2000 chars) and
host-path sanitization unchanged.
* fix(runtime): handle write_file sandbox init errors
Initialize the requested path before sandbox setup so early sandbox failures can still return a bounded write_file error.
Add a regression test for sandbox initialization failures.
* style(test): format sandbox security tests
* fix(tracing): propagate session_id and user_id into Langfuse traces
Adds Langfuse v4 reserved trace attributes (langfuse_session_id,
langfuse_user_id, langfuse_trace_name, langfuse_tags) to
RunnableConfig.metadata inside the run worker, so the langchain
CallbackHandler can lift them onto the root trace.
- New deerflow.tracing.metadata.build_langfuse_trace_metadata() returns
the reserved keys when Langfuse is in the enabled providers, else {}.
- worker.run_agent merges them with setdefault so caller-supplied keys
win, allowing per-request overrides from upstream metadata.
- session_id mirrors the LangGraph thread_id; user_id reads
get_effective_user_id() (falls back to "default" in no-auth mode).
- trace_name defaults to "lead-agent"; tags carry env and model name
when DEER_FLOW_ENV (or ENVIRONMENT) and a model name are present.
Closes#2930
* fix(tracing): attach Langfuse callback at graph root so metadata propagates
The first commit injected ``langfuse_session_id`` / ``langfuse_user_id`` /
``langfuse_trace_name`` / ``langfuse_tags`` into ``RunnableConfig.metadata``,
but on ``main`` the Langfuse callback is attached at *model* level
(``models/factory.py``). LangChain still threads ``parent_run_id`` through
the contextvar, so the handler sees the model as a nested observation and
``__on_llm_action`` strips the ``langfuse_*`` keys
(``keep_langfuse_trace_attributes=False``). The trace's top-level
``sessionId`` / ``userId`` therefore stayed empty in deer-flow's LangGraph
runtime — confirmed live against a real Langfuse instance.
This commit moves the callback to the **graph invocation root** so the
handler fires ``on_chain_start(parent_run_id=None)`` and runs the
``propagate_attributes`` path that actually lifts ``session_id`` /
``user_id`` onto the trace:
- ``models/factory.py``: add ``attach_tracing`` keyword (default ``True``)
so standalone callers (``MemoryUpdater``, etc.) keep their direct
model-level tracing.
- ``agents/lead_agent/agent.py``: call ``build_tracing_callbacks()`` once
inside ``_make_lead_agent`` and append the result to
``config["callbacks"]``; the four in-graph ``create_chat_model`` sites
(bootstrap, default agent, sync + async summarization) pass
``attach_tracing=False`` to avoid duplicate spans.
- ``agents/middlewares/title_middleware.py``: same ``attach_tracing=False``
for the title-generation model, since it inherits the graph's
RunnableConfig via ``_get_runnable_config``.
Test updates:
- ``tests/test_lead_agent_model_resolution.py`` and
``tests/test_title_middleware_core_logic.py``: extend the fake
``create_chat_model`` signatures / mock assertions to accept the new
``attach_tracing`` kwarg.
- ``tests/test_worker_langfuse_metadata.py``: switch the no-user fallback
test from direct ContextVar mutation to ``monkeypatch.setattr`` on
``get_effective_user_id`` to avoid pollution across the langfuse OTel
global tracer provider.
- ``tests/conftest.py``: add an autouse fixture that resets
``deerflow.config.title_config._title_config`` to its pristine default
after every test. Any test that loads the real ``config.yaml`` (via
``get_app_config()``) calls ``load_title_config_from_dict`` and mutates
the module-level singleton, which previously poisoned the
title-middleware suite when run after, e.g., the new
``test_worker_langfuse_metadata.py`` cases. The fixture is independent
of this PR's main change but unblocks the cross-file test run.
Live verification (same Langfuse instance as before):
- Drove ``worker.run_agent`` against the real ``make_lead_agent`` +
``gpt-4o-mini`` for three distinct ``user_context`` identities
(``fancy-engineer``, ``alice-pm``, ``bob-designer``).
- Each run produced one ``lead-agent`` trace whose top-level
``sessionId`` / ``userId`` / ``tags`` carry the expected values, e.g.
``session=e2e-2930-8f347c-alice-pm user=alice-pm name='lead-agent'
tags=['model:gpt-4o-mini']``.
Refs #2930.
* fix(tracing): extend root-callback + metadata injection to the embedded client
Addresses Copilot review on PR #2944.
Commit 2 disabled model-level tracing for ``TitleMiddleware`` and
``_create_summarization_middleware`` because ``_make_lead_agent`` now
attaches the tracing callbacks at the graph invocation root. But the
embedded ``DeerFlowClient`` does not call ``_make_lead_agent`` — it
calls ``_build_middlewares`` directly and never appends the tracing
handlers to its ``RunnableConfig``. So under the embedded path,
title-generation and summarization LLM calls were left untraced —
a regression introduced by this PR.
This commit mirrors the gateway worker's injection in
``DeerFlowClient.stream``:
- Append ``build_tracing_callbacks()`` to ``config["callbacks"]`` so
the Langfuse handler sees ``on_chain_start(parent_run_id=None)`` at
the graph root and runs the ``propagate_attributes`` path.
- Merge ``build_langfuse_trace_metadata(...)`` into
``config["metadata"]`` with ``setdefault`` so caller-supplied keys
still win.
- ``_ensure_agent`` now creates its main model with
``attach_tracing=False`` to avoid duplicate spans now that the
callback lives at the graph root.
Docs:
- ``backend/CLAUDE.md`` Tracing section rewritten to describe the
graph-root attachment model (replacing the inaccurate
"at model-creation time" wording).
- ``README.md`` Langfuse section now lists both injection points
(worker + client) instead of only the worker path.
Tests:
- ``tests/test_client_langfuse_metadata.py`` (new, 3 cases):
callbacks + metadata are injected when Langfuse is enabled,
caller-supplied metadata overrides win via ``setdefault``, and the
injection is inert when Langfuse is disabled.
Live verification on the real Langfuse instance:
=== user=fancy-client ===
id=cbd22847.. session=client-2930-6b9491-fancy-client user=fancy-client name='lead-agent'
=== user=alice-client ===
id=b4f6f576.. session=client-2930-6b9491-alice-client user=alice-client name='lead-agent'
Refs #2930.
* refactor(tracing): address maintainer review on PR #2944
Addresses @WillemJiang's 5 comments.
1. Duplicated metadata-injection code between worker.py and client.py
New ``deerflow.tracing.inject_langfuse_metadata(config, ...)`` helper
takes the 10-line build + merge + setdefault logic that was duplicated
in ``runtime/runs/worker.py`` and ``client.py``. Both callers now share
a single source of truth, so the two paths cannot drift.
2. Direct private-attribute mutation in conftest.py and tests
Added public ``reset_tracing_config()`` / ``reset_title_config()``
functions. ``tests/conftest.py`` and every test that previously did
``tracing_module._tracing_config = None`` or
``title_module._title_config = TitleConfig()`` now goes through the
public API. A future internal rename will surface as an ImportError
instead of a silent no-op.
3. client.py reading os.environ directly
``DeerFlowClient.__init__`` grows an optional ``environment`` parameter
so programmatic callers can pass the deployment label explicitly.
``stream()`` consults ``self._environment`` first and only falls back
to ``DEER_FLOW_ENV`` / ``ENVIRONMENT`` env vars when nothing was
passed in. Backwards compatible — env-var behaviour preserved for
callers that opt to keep using it.
4. build_tracing_callbacks() cached on hot path
Not implemented. Inspected the langfuse v4 ``langchain.CallbackHandler``
constructor: it only resolves the module-level singleton client via
``get_client()`` and initialises a few dicts (no I/O, no env parsing
at construction time). The build is essentially free. Caching would
trade a non-measurable speedup for two real risks: handler instances
carry per-run state internally (``_run_states``, ``_root_run_states``,
``last_trace_id``), and tracing config can be reloaded by env-var
changes between runs. Will revisit if profiling ever shows it as
a hot spot.
5. attach_tracing=False easy to forget at new in-graph call sites
- Module docstring at the top of ``lead_agent/agent.py`` documents
the invariant ("every in-graph ``create_chat_model`` MUST pass
``attach_tracing=False``") and enumerates the current sites.
- New regression test
``test_make_lead_agent_attaches_tracing_callbacks_at_graph_root`` in
``tests/test_lead_agent_model_resolution.py`` locks both halves of
the invariant: ``config["callbacks"]`` carries the tracing handler
after ``_make_lead_agent``, AND every ``create_chat_model`` call
captured by the test passes ``attach_tracing=False``. A future
in-graph site that forgets the flag will fail this test.
Lint clean. Full touched-suite bundle: 246 passed.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
SQLAlchemy's DateTime(timezone=True) is a no-op on SQLite (the backend
has no native tz type), so values round-tripped through the DB come
back as naive datetimes. The four SQL _row_to_dict helpers were calling
.isoformat() directly on those naive values, shipping timezone-less
strings like "2026-05-20T06:10:22.970977" out of the API. The browser's
new Date(...) then parses them as local time, shifting recent threads
in /threads/search by the local UTC offset (about 8h in Asia/Shanghai).
Route the four call sites through coerce_iso() instead — it already
normalizes naive values as UTC and emits "+00:00" so the wire format
always carries tz. No data migration is needed; existing SQLite rows
read back via the corrected serializer.
PostgreSQL deployments are unaffected because timestamptz preserves
tzinfo end-to-end.
Closes#3120
* feat(trace):LangGraph -> lead_agent and set user custom agent name to run_name
* feat(trace):follow github copilot suggest
* feat(trace):Refactor run_name resolution and improve test coverage
* fix(loop-detection): defer warn injection to wrap_model_call
The warn branch in LoopDetectionMiddleware injected a HumanMessage
into state from after_model. The tools node had not yet produced
ToolMessage responses to the previous AIMessage(tool_calls=...), so
the new HumanMessage landed *between* the assistant's tool_calls and
their responses. OpenAI/Moonshot reject the next request with
"tool_call_ids did not have response messages" because their
validators require tool_calls to be followed immediately by tool
messages.
Detection now runs in after_model as before, but only enqueues the
warning into a per-thread list. Injection happens in wrap_model_call,
where every prior ToolMessage is already present in request.messages.
The warning is appended at the end as HumanMessage(name="loop_warning")
— pairing intact, AIMessage semantics untouched, no SystemMessage
issues for Anthropic.
Closes#2029, addresses #2255#2293#2304#2511.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(channels): remove loop warning display filter
* feat(loop-detection): scope pending warnings by run
* docs(loop-detection): update docs
* test(loop-detection): assert deferred warnings are queued
* fix(loop-detection): cap transient warning state
* docs: update docs
* add async awrap_model_call test coverage
* docs(loop-detection): document transient warnings
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
When the poll loop's safety-net timeout fires (poll_count > max_poll_count),
the background subagent task was abandoned without cancellation or cleanup,
leaving a stale entry in _background_tasks indefinitely.
The original code had a comment promising "the cleanup will happen when the
executor completes", but run_task() in executor.py never calls
cleanup_background_task after reaching a terminal state -- the promise was
never implemented.
This change mirrors the asyncio.CancelledError path: signal cooperative
cancellation via request_cancel_background_task and schedule
_deferred_cleanup_subagent_task to remove the entry once the background
thread reaches a terminal state.
Direct cleanup at poll-timeout time would introduce a race: run_task() could
remove the entry while the poll loop is still mid-iteration, causing a
spurious "Task disappeared" error. The deferred approach avoids this by
waiting for terminal state before removal.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix env resolution in MCP config lists
* fix:unset env variable and consistent function
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(trace):memory 中文 in trace is unicode
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
A second cancel() call on an interrupted run returned False, causing the
cancel and stream_existing_run router endpoints to raise 409 on double-stop.
Fix: return True inside the lock when record.status == RunStatus.interrupted.
This covers both the POST /cancel and POST /join endpoints without any
re-fetch or extra get() call — the idempotency lives at the source.
Also fixes stream_existing_run (the LangGraph SDK stop-button path), which
had the identical cancel() → 409 pattern and was not covered by the
original PR. Both endpoints share the fix automatically.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): hydrate run history from RunStore and persist cancellation status
fix:
- Make RunManager.get() async and hydrate from RunStore when in-memory record is missing
- Merge store rows into list_by_thread() with in-memory precedence for active runs
- Persist interrupted status to RunStore in cancel() and create_or_reject(interrupt|rollback)
- Extract _persist_status() to reuse the best-effort store update pattern
- Await run_mgr.get() in all gateway endpoints
- Return 409 with distinct message for store-only runs not active on current worker
Closes#2812, Closes#2813
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(harness): consistent sort and guarded hydration in RunManager
fix:
- list_by_thread() now sorts by created_at desc (newest first) even when
no RunStore is configured, matching the store-backed code path
- guard _record_from_store() call sites in get() and list_by_thread()
with best-effort error handling so a single malformed store row cannot
turn read paths into 500s
test:
- update test_list_by_thread assertion to expect newest-first order
- seed MemoryRunStore via public put() API instead of writing to _runs
* fix(harness): guard store-only runs from streaming and fix get() TOCTOU
Add RunRecord.store_only flag set by _record_from_store so callers can
distinguish hydrated history from live in-memory runs. join_run and
stream_existing_run (action=None) now return 409 instead of hanging
forever on an empty MemoryStreamBridge channel.
Re-check _runs under lock after the store await in RunManager.get() so a
concurrent create() that lands between the two checks returns the
authoritative in-memory record rather than a stale store-hydrated copy.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): reorder bridge fetch in join_run and make list_by_thread limit explicit
Move get_stream_bridge() after the store_only guard in join_run so a
missing bridge cannot produce 503 for historical runs before the 409
guard fires.
Add limit parameter to RunManager.list_by_thread (default 100, matching
the store's page size) and pass it explicitly to the store call.
Update docstring to document the limit instead of claiming all runs are
returned.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): cap list_by_thread result to limit after merge
Apply [:limit] to all return paths in list_by_thread so the method
consistently returns at most limit records regardless of how many
in-memory runs exist, making the limit parameter a true upper bound
on the response size rather than just a store-query hint.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix `list_by_thread` docstring
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(runtime): add update_model_name to RunStore to prevent SQL integrity errors
RunManager.update_model_name() was calling _persist_to_store() which uses
RunStore.put(), but RunRepository.put() is insert-only. This caused integrity
errors when updating model_name for existing runs in SQL-backed stores.
fix:
- Add abstract update_model_name method to RunStore base class
- Implement update_model_name in MemoryRunStore
- Implement update_model_name in RunRepository with proper normalization
- Add _persist_model_name helper in RunManager
- Update RunManager.update_model_name to use the new method
test:
- Add tests for update_model_name functionality
- Add integration tests for RunManager with SQL-backed store
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(runtime): handle NULL status/on_disconnect in _record_from_store
`dict.get(key, default)` only uses the default when the key is absent,
so a SQL row with an explicit NULL status would pass `None` to
`RunStatus(None)` and raise, breaking hydration for otherwise valid rows.
Switch to `row.get(...) or fallback` so both missing and NULL values
get a safe default. Add tests for get() and list_by_thread() with a
NULL status row to prevent regression.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(runs): address PR review feedback on store consistency changes
- Fix list_by_thread limit semantics: pass store_limit = max(0, limit - len(memory_records)) to store so newer store records are not crowded out by in-memory records
- Remove dead code: cancelled guard after raise is always True, simplify to if wait and record.task
- Document _record_from_store NULL fallback policy (status→pending, on_disconnect→cancel) in docstring
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Guard subagent terminal state transitions
* fix: publish subagent terminal status last
* Fix subagent timeout test to avoid blocking event loop
* Fix subagent timeout test tracking
* Refine subagent terminal state handling
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(runs): restore historical runs from persistent store after gateway restart
RunManager.list_by_thread() and get() only queried the in-memory _runs
dict, returning empty results after a restart even when PostgreSQL had
the records. Add store fallback to both read paths and a new async
aget() for the API endpoint, keeping sync get() for internal callers
that need live task/abort_event state.
Fixes#2984
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(runs): scope run store fallback reads by user id
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): clarify ordering expectation and mock store filters
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): make user filter fallback assertions explicit
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): verify user-isolated fallback behavior with memory store
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* update the code with feedback from issue-2984
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
The moderation model's response was silently falling through to a
conservative block when LLMs wrapped structured output in markdown
code fences, added prose around the JSON, returned case-variant
decisions (e.g. "Allow"), or included nested braces in the reason
field. The greedy `\{.*\}` regex also over-matched on nested braces.
- Rewrite _extract_json_object() with markdown fence stripping and
brace-balanced string-aware extraction
- Normalize decision field to lowercase for case-insensitive matching
- Distinguish "model unavailable" from "unparseable output" in fallback
- Strengthen system prompt to explicitly forbid code fences and prose
- Add 15 tests covering all reported scenarios
Fixes#2985
* fix(sandbox): uphold /mnt/user-data contract at Sandbox API boundary (#2873)
LocalSandboxProvider used a process-wide singleton with no /mnt/user-data
mapping, forcing every caller to translate virtual paths via tools.py
before invoking the public Sandbox API. AIO already exposes /mnt/user-data
natively (per-thread bind mounts), so the same code path behaved
differently across implementations — and direct callers like
uploads.py:282 / feishu.py:389 only worked thanks to the
`uses_thread_data_mounts` workaround flag.
Switch the provider to a dual-track cache: keep the `"local"` singleton
for legacy acquire(None) callers (backward-compat for existing tests and
scripts), and create a per-thread LocalSandbox with id `"local:{tid}"`
for acquire(thread_id). Each per-thread instance carries PathMapping
entries for /mnt/user-data, its three subdirs, and /mnt/acp-workspace,
mirroring how AioSandboxProvider mounts those paths into its container.
is_local_sandbox() now recognises both id formats. `_agent_written_paths`
becomes per-thread (it was a process-wide set that leaked across
threads — a latent isolation bug also fixed by this change).
Verified via TDD: a new contract test suite hits the public Sandbox API
directly (write/read/list/exec/glob/grep/update + per-thread isolation +
lifecycle). 3212 backend tests still pass, ruff is clean.
* fix(sandbox): address Copilot review on #2881
Three follow-ups from Copilot's review of the LocalSandboxProvider refactor:
1. Synchronisation: ``acquire`` / ``get`` / ``reset`` mutated the cache without
any lock, so concurrent acquire of the same ``thread_id`` could create two
``LocalSandbox`` instances and lose one's ``_agent_written_paths`` state.
Add a provider-wide ``threading.Lock`` (matching ``AioSandboxProvider``) and
build per-thread mappings outside the lock to avoid holding it during the
``ensure_thread_dirs`` filesystem touch.
2. Memory bound: ``_thread_sandboxes`` grew monotonically. Replace the plain
dict with an ``OrderedDict`` LRU capped at
``DEFAULT_MAX_CACHED_THREAD_SANDBOXES`` (256, configurable per provider
instance). ``get`` promotes touched threads to the MRU end so an active
thread isn't evicted under load. Eviction is graceful: the next ``acquire``
rebuilds a fresh sandbox; only ``_agent_written_paths`` (reverse-resolve
hint) is lost.
3. Docs: update ``CLAUDE.md`` to reflect the new per-thread architecture, the
LRU cap, and that ``is_local_sandbox`` recognises both id formats.
New regression tests:
- Concurrent ``acquire("alpha")`` from 8 threads yields a single instance
(slow-init injection forces the race window wide open).
- Concurrent ``acquire`` of distinct thread_ids yields distinct instances.
- The cache evicts the least-recently-used thread once the cap is exceeded.
- ``get`` promotes recency so a polled thread survives a later acquire-storm.
* fix(middleware): Prevent todo completion reminder IMMessage leak (#2892)
* make format
* fix(middleware): Clear stale todo reminder counts (#2892)
* add size guard for _completion_reminder_counts and add a integration test
* fix(memory): isolate queued memory updates by agent
* fix(memory): include user in queue identity
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Fix the lint error
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat: real-time subagent token usage display in header and per-turn
Backend:
- Persist subagent token usage to AIMessage.usage_metadata via
TokenUsageMiddleware, so accumulateUsage() naturally includes
subagent tokens without frontend state management
- Cache subagent usage by tool_call_id in task_tool, write back
to the dispatching AIMessage on next model response
- Emit subagent token usage on all terminal task events
(task_completed, task_failed, task_cancelled, task_timed_out)
- Report subagent usage to parent RunJournal for API totals
- Search backward from ToolMessage to find dispatching AIMessage
for correct multi-tool-call attribution
Frontend:
- Remove subagentUsage state, custom event handling, and prop
threading — subagent tokens are now embedded in message metadata
- Simplify selectHeaderTokenUsage (no subagentUsage parameter)
- Per-turn inline badges show turn-specific usage via message
accumulation
- Remove isLoading guard from MessageTokenUsageList for dynamic
updates during streaming
* fix: prevent header token double counting from baseline reset race
onFinish, onError, and thread-switch useEffect all reset
pendingUsageBaselineMessageIdsRef to an empty Set. If
thread.isLoading is still true on the next render, all messages
pass the getMessagesAfterBaseline filter and their tokens are
added to backendUsage (which already includes them), causing
the header to display up to 2× the actual token count.
Capture current message IDs instead of using an empty Set so
that getMessagesAfterBaseline correctly returns no pending
messages even if thread.isLoading lags behind the stream end.
* fix: write back subagent tokens for all concurrent task tool calls
TokenUsageMiddleware only processed messages[-2], so when a
single model response dispatched multiple task tool calls only
the last ToolMessage had its cached subagent usage written back
to the dispatch AIMessage.usage_metadata. Earlier tasks' usage
stayed in _subagent_usage_cache indefinitely (leak) and never
appeared in the per-turn inline token display.
Walk backward through all consecutive ToolMessages before the
new AIMessage, and accumulate updates targeting the same
dispatch message into one state update so overlapping writes
don't clobber each other.
* fix: clean up subagent usage cache entry on task cancellation
When a task_tool invocation is cancelled via CancelledError, any
cached subagent usage entry leaked because the TokenUsageMiddleware
writeback path never fires after cancellation. Pop the cache entry
before re-raising to prevent unbounded growth of the module-level
_subagent_usage_cache dict.
* fix: address token usage review feedback
* fix: handle missing config for subagent usage cache
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(tools): preserve tool_search promotions across re-entrant get_available_tools
Closes#2884.
``get_available_tools`` used to unconditionally call
``reset_deferred_registry()`` and rebuild a fresh ``DeferredToolRegistry``
on every invocation. That works for the first call of a request (the
ContextVar starts at its default of ``None``), but any RE-ENTRANT call
during the same async context — e.g. ``task_tool`` building a subagent's
toolset, or a custom middleware that rebuilds tools mid-run — wiped any
``tool_search`` promotions the parent agent had already made. The
``DeferredToolFilterMiddleware`` would then re-hide those tools from the
next model call, leaving the agent able to see a tool's name (via the
prior ``tool_search`` result that's still in conversation history) but
unable to invoke it.
Fix: when the ContextVar already holds a registry, reuse it instead of
rebuilding. Fresh requests still get a fresh registry because each new
graph run starts in a new asyncio task with the ContextVar at ``None``.
## Verification
- Unit-level reproduction (``test_get_available_tools_resets_registry_wiping_promotion``):
promote a tool in the registry, call ``get_available_tools`` again, assert
the promotion is preserved. Fails on main, passes on this branch.
- Graph-execution reproduction (two tests): drive a real
``langchain.agents.create_agent`` graph with the real
``DeferredToolFilterMiddleware`` through two model turns, including one
that issues a re-entrant ``get_available_tools`` call to simulate the
task_tool subagent path.
- Real-LLM end-to-end (``test_deferred_tool_promotion_real_llm.py``,
opt-in via ``ONEAPI_E2E=1``): drives the same flow against a real
OpenAI-compatible model (verified on GPT-5.4-mini through the one-api
gateway), watches the model call the promoted ``fake_calculator``
through the deferred-filter middleware, and asserts the right arithmetic
result. Passes against the fixed branch.
- Companion update to ``test_tool_deduplication.py``: dropped the
``@patch("deerflow.tools.tools.reset_deferred_registry")`` decorators
because the symbol is no longer imported there.
- Test fixtures in the new files patch ``deerflow.tools.tools.get_app_config``
with a minimal ``model_construct``-ed ``AppConfig`` instead of calling
the real loader, so they never trigger ``_apply_singleton_configs`` and
never leak ``_memory_config``/``_title_config``/… mutations into the
rest of the suite.
Full backend suite: 3208 passed / 14 skipped / 0 failed. ruff check + format clean.
* fix(tools): address Copilot review on #2885
- tools.py: rewrite the reuse-path comment to spell out (a) why we don't
reconcile the registry against the current ``mcp_tools`` snapshot — the
MCP cache doesn't refresh mid-graph-run, the lead agent's ``ToolNode``
is already bound to the previous tool set anyway, and ``promote()``
drops the entry so a naive re-sync misclassifies promotions as new
tools — and (b) why the log uses ``max(0, …)`` to avoid negative
counts when the cache shrinks between snapshots.
- Replace direct ``ts_mod._registry_var.set(None)`` in test fixtures with
the public ``reset_deferred_registry()`` helper so tests don't couple
to module internals.
- Correct the docstring path in ``test_deferred_tool_registry_promotion.py``
to match the actual monkeypatch target (``deerflow.mcp.cache.get_cached_mcp_tools``).
- Rename
``test_get_available_tools_resets_registry_wiping_promotion`` to
``test_get_available_tools_preserves_promotions_across_reentrant_calls``
so the test name describes the contract being asserted, not the bug it
originally reproduced.
Full backend suite: 3208 passed / 14 skipped. Real-LLM e2e: 1 passed.
* perf(harness): push thread metadata filters into SQL
Replace Python-side metadata filtering (5x overfetch + in-memory match)
with database-side json_extract predicates so LIMIT/OFFSET pagination
is exact regardless of match density.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): add dialect-aware JsonMatch compiler for type-safe metadata SQL filters
Replace SQLAlchemy JSON index/comparator APIs with a custom JsonMatch
ColumnElement that compiles to json_type/json_extract on SQLite and
jsonb_typeof/->>/-> on PostgreSQL. Tighten key validation regex to
single-segment identifiers, handle None/bool/numeric value types with
json_type-based discrimination, and strengthen test coverage for edge
cases and discriminability.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): address Copilot review comments on JSON metadata filters
- Use json_typeof instead of jsonb_typeof in PostgreSQL compiler; the
metadata_json column is JSON not JSONB so jsonb_typeof would error at
runtime on any PostgreSQL backend
- Align _is_safe_json_key with json_match's _KEY_CHARSET_RE so keys
containing hyphens or leading digits are not silently skipped
- Add thread_id as secondary ORDER BY in search() to make pagination
deterministic when updated_at values collide; remove asyncio.sleep
from the pagination regression test
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): address remaining review comments on metadata SQL filters
- Remove _is_safe_json_key() and reuse json_match ValueError to avoid
validator drift (Copilot #3217603895, #3217411616)
- Raise ValueError when all metadata keys are rejected so callers never
get silent unfiltered results (WillemJiang)
- Fix integer precision: split int/float branches, bind int as Integer()
with INTEGER/BIGINT CAST instead of float() coercion (Copilot #3217603972)
- Fix jsonb_typeof -> json_typeof on JSON column (Copilot #3217411579)
- Replace manual _cleanup() calls with async yield fixture so teardown
always runs (Copilot #3217604019)
- Remove asyncio.sleep(0.01) pagination ordering; use thread_id secondary
sort instead (Copilot #3217411636)
- Add type annotations to _bind/_build_clause/_compile_* and remove EOL
comments from _Dialect fields (coding.mdc)
- Expand test coverage: boolean/null/mixed-type/large-int precision,
partial unsafe-key skip with caplog assertion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address third-round Copilot review comments on JsonMatch
- Reject unsupported value types (list, dict, ...) in JsonMatch.__init__
with TypeError so inherit_cache=True never receives an unhashable value
and callers get an explicit error instead of silent str() coercion
(Copilot #3217933201)
- Upgrade int bindparam from Integer() to BigInteger() to align with
BIGINT CAST and avoid overflow on large integers (Copilot #3217933252)
- Catch TypeError alongside ValueError in search() so non-string metadata
keys are warned and skipped rather than raising unexpectedly
(Copilot #3217933300)
- Add three tests: json_match rejects unsupported value types, search()
warns and raises on non-string key, search() warns and raises on
unsupported value type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address fourth-round Copilot review comments on JsonMatch
- Add CASE WHEN guard for PostgreSQL integer matching: json_typeof returns
'number' for both ints and floats; wrap CAST in CASE with regex guard
'^-?[0-9]+$' so float rows never trigger CAST error (Copilot #3218413860)
- Validate isinstance(key, str) before regex match in JsonMatch.__init__
so non-string keys raise ValueError consistently instead of TypeError
from re.match (Copilot #3218413900)
- Include exception message in metadata filter skip warning so callers
can distinguish invalid key from unsupported value type (Copilot #3218413924)
- Update tests: assert CASE WHEN guard in PG int compilation, cover
non-string key ValueError in test_json_match_rejects_unsafe_key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): align ThreadMetaStore.search() signature with sql.py implementation
Use `dict[str, Any]` for `metadata` and `list[dict[str, Any]]` as return
type in base class and MemoryThreadMetaStore to resolve an LSP signature
mismatch; also correct a test docstring that cited the wrong exception type.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): surface InvalidMetadataFilterError as HTTP 400 in search endpoint
Replace bare ValueError with a domain-specific InvalidMetadataFilterError
(subclass of ValueError) so the Gateway handler can catch it and return
HTTP 400 instead of letting it bubble up as a 500.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): sanitize metadata keys in log output to prevent log injection
Use ascii() instead of %r to escape control characters in client-supplied
metadata keys before logging, preventing multiline/forged log entries.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(harness): validate metadata filters at API boundary and dedupe key/value rules
- Add Pydantic ``field_validator`` on ``ThreadSearchRequest.metadata`` so
unsafe keys / unsupported value types are rejected with HTTP 422 from
both SQL and memory backends (closes Copilot review 3218830849).
- Export ``validate_metadata_filter_key`` / ``validate_metadata_filter_value``
(and ``ALLOWED_FILTER_VALUE_TYPES``) from ``json_compat`` and have
``JsonMatch.__init__`` reuse them — the Gateway-side validator and the
SQL-side ``JsonMatch`` constructor now share one admission rule and
cannot drift.
- Format ``InvalidMetadataFilterError`` rejected-keys list as a
comma-separated plain string instead of a Python list repr so the
surfaced HTTP 400 detail is readable (closes Copilot review 3218830899).
- Update router tests to cover both 422 boundary paths plus the 400
defense-in-depth path when a backend still raises the error.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(harness): harden JsonMatch compile-time key validation against __init__ bypass
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix: address review feedback on metadata filter SQL push-down
- Add signed 64-bit range check to validate_metadata_filter_value; give
out-of-range ints a distinct TypeError message.
- Replace assert guards in _compile_sqlite/_compile_pg with explicit
if/raise so they survive python -O optimisation.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(agents): make update_agent honor runtime.context user_id like setup_agent
PR #2784 hardened setup_agent to prefer runtime.context["user_id"] (set by
inject_authenticated_user_context from the auth-validated request) over the
contextvar, so an agent created during the bootstrap flow always lands under
users/<auth_uid>/agents/<name>. update_agent was left calling
get_effective_user_id() unconditionally — the same class of bug that produced
issues #2782 / #2862 still applies whenever the contextvar is not available
on the executing task (background work, future cross-process drivers,
checkpoint resume on a different task). In that regime update_agent silently
routes writes to users/default/agents/<name>, corrupting the shared default
bucket and losing the user's edit.
Extract the resolution policy into a shared resolve_runtime_user_id helper
on deerflow.runtime.user_context and route both setup_agent and update_agent
through it so the two halves of the lifecycle stay in lockstep.
Add load-bearing end-to-end tests that drive a real langchain.agents
create_agent graph with a fake LLM, exercising the full pipeline:
HTTP wire format
-> app.gateway.services.start_run config-assembly
-> deerflow.runtime.runs.worker._build_runtime_context
-> langchain.agents create_agent graph
-> ToolNode dispatch (sync + async + sub-graph + ContextThreadPoolExecutor)
-> setup_agent / update_agent
The negative-control tests intentionally land in users/default/ to prove the
positive tests are actually load-bearing rather than vacuously passing.
The new test_update_agent_e2e_user_isolation suite included a test that
failed against main and now passes after this fix.
* style: ruff format on new e2e tests
* test(e2e): real-server HTTP test driving setup_agent through the full ASGI stack
Adds tests/test_setup_agent_http_e2e_real_server.py — a single load-bearing
test that drives the entire FastAPI gateway through starlette.testclient.
TestClient with no mocks above the LLM:
- lifespan boots (config, sqlite engine, LangGraph runtime, channels)
- POST /api/v1/auth/register (real password hash, real sqlite write,
issues access_token + csrf_token cookies)
- POST /api/threads (real thread_meta + checkpoint creation)
- POST /api/threads/{id}/runs/stream with the exact wire shape the React
frontend sends (assistant_id + input + config + context with
agent_name/is_bootstrap)
- AuthMiddleware -> CSRFMiddleware -> require_permission ->
start_run -> inject_authenticated_user_context ->
asyncio.create_task(run_agent) -> worker._build_runtime_context ->
Runtime injection -> ToolNode dispatch -> real setup_agent
- Asserts SOUL.md is under users/<authenticated_uid>/agents/<name>/
and NOT under users/default/agents/<name>/.
DEER_FLOW_HOME and the sqlite path are redirected into tmp_path so the test
never touches the real .deer-flow directory or developer database. The only
patch above the LLM boundary is replacing create_chat_model with a fake that
emits a single setup_agent tool_call.
This is the "真实验证" answer: it reproduces what curl-against-uvicorn would
do, minus the network socket layer.
* test: address Copilot review on user-isolation e2e tests
- Drop "currently expected to FAIL" wording from update_agent e2e docstring
and header (Copilot review): the fix is in this PR, the test pins the
corrected behaviour rather than driving a future change.
- Rephrase the assertion failure messages from "BUG:" to "REGRESSION:" to
match the test's role on the fixed branch.
- Bound _drain_stream with a wall-clock timeout, a max-bytes cap, and an
early break on the "event: end" SSE frame (Copilot review). Stops the
test from hanging on a stuck run or runaway heartbeat loop.
- Replace the misleading "patch both module aliases" comment with an
explanation of why patching lead_agent.agent.create_chat_model is the
only correct target (Copilot review): lead_agent rebinds the symbol
into its own namespace at import time, so patching deerflow.models is
too late.
* test(refactor): address WillemJiang review on user-isolation e2e tests
- Extract the duplicated FakeToolCallingModel (and a
build_single_tool_call_model helper) into tests/_agent_e2e_helpers.py.
All three e2e files now import from the shared module instead of
redefining the shim locally.
- Convert the manual p.start() / p.stop() try/finally blocks in
test_update_agent_e2e_user_isolation.py to contextlib.ExitStack so
patch lifecycle is Pythonic and exception-safe.
- Lift the isolated_app fixture's private-attribute resets into a
named _reset_process_singletons helper with a comment block
explaining why each singleton has to be invalidated for true e2e
isolation, and why raising=False is intentional. Makes the
fragility visible and the intent self-documenting rather than
leaving the resets inline as opaque monkeypatch calls.
Net change: -59 lines (143 -> 84) across the three test files, with
every assertion intact. Full suite remains 69 passed / lint clean.
* test(e2e): make real-server test self-supply its config
CI's actions/checkout only ships config.example.yaml (the real config.yaml
is gitignored), so the production config-discovery search
(./config.yaml -> ../config.yaml -> $DEER_FLOW_CONFIG_PATH) finds nothing
and the test fails at lifespan boot with FileNotFoundError. The dev-machine
run passed only because a local config.yaml happened to exist.
Write a minimal AppConfig-valid yaml into tmp_path and pin
DEER_FLOW_CONFIG_PATH to it. The yaml carries just what the schema requires
(a single fake-test-model entry, LocalSandboxProvider, sqlite database).
The LLM never gets instantiated because the test patches create_chat_model
on the lead agent module, so the api_key/base_url stay placeholders.
Verified by hiding the local config.yaml to mirror the CI checkout — the
test now passes in both environments.
* feat(run): propagate model_name from gateway request context to persistence layer
Pass model_name through the full run creation pipeline — from
RunCreateRequest.context in the gateway, through RunManager, to the
RunStore interface and SQL persistence. This enables client-specified
model selection to be recorded per-run in the database.
* feat(run): add model allowlist validation and effective model name capture
- Validate model_name against allowlist in gateway services.py using
get_app_config().get_model_config()
- Truncate model_name to 128 chars to match DB column constraint
- In worker.py, capture effective model name from agent.metadata after
agent creation and persist if resolved differently than requested
* feat(run): add defense-in-depth model_name normalization and round-trip persistence tests
- Add _normalize_model_name() to RunRepository for whitespace stripping
and 128-char truncation before DB writes.
- Add round-trip unit tests for model_name creation and default None
in test_run_manager.py.
* fix(run): coerce non-string model_name values before strip/truncate in _normalize_model_name
* fix(gateway): add runtime type guard for model_name coercion in gateway services
Add isinstance check and str() coercion before calling .strip() to prevent
AttributeError when non-string types (int, None, etc.) flow through the
gateway. Paired with SQL integration test for end-to-end model_name
persistence across gateway → langgraph → persistence layer.
* fix(run): drop Alembic migration for model_name (no-op) and expose public update method on RunManager
- Drop a1b2c3d4e5f6 migration: model_name already exists in RunRow schema
and is auto-created via Base.metadata.create_all() at startup
- Add update_model_name() public method to RunManager to replace the private
_persist_to_store call in worker.py, preserving internal locking/persistence
* fix(subagents): consolidate system_prompt and skills into single SystemMessage
Some LLM APIs (vLLM, Xinference, Chinese LLM providers) reject multiple
system messages with \”System message must be at the beginning.\” The
subagent executor was sending separate SystemMessages for the configured
system_prompt and each loaded skill, which caused failures when calling
task tool with sub-agents.
Merge system_prompt and all skill content into one SystemMessage in the
initial state, and pass system_prompt=None to create_agent() so the
factory doesn't prepend a second one.
Fixes#2693
* fix(subagents): update SubagentConfig.system_prompt to str | None and add astream regression test
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2ee03a26-e19b-4106-abc5-c76a2906383b
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* fixed the lint error
* fix the lint error in the backend
* fix the unit test error of test_subagent_executor
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Fix local sandbox singleton reset on provider lifecycle
* Fix local sandbox singleton reset on provider reset
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: make tool argument behavior discoverable
The write_file tool already supported append=false by default with append=true for end-of-file writes, but the parsed docstring did not describe append in the model-facing schema. This records the overwrite default and append path in the tool description, adds resilient schema regression coverage, and keeps backend sandbox docs aligned.
The regression now also checks that every public parameter in the existing tool schema test matrix has a description. Enabling docstring parsing on setup_agent and update_agent fills the two existing gaps with their existing Args docs instead of duplicating descriptions elsewhere.
Constraint: Issue #2831 asks for a small docstring/schema discoverability fix without changing runtime file-writing behavior
Rejected: Changing write_file defaults | would alter existing overwrite semantics and broaden the fix beyond schema discoverability
Rejected: Exact phrase assertions | too brittle for future docstring rewording while testing the same behavior
Confidence: high
Scope-risk: narrow
Directive: Keep model-facing tool parameters documented through parsed docstrings or equivalent schema descriptions
Tested: cd backend && uv run pytest tests/test_setup_agent_tool.py tests/test_update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py tests/test_sandbox_tools_security.py::test_str_replace_and_append_on_same_path_should_preserve_both_updates -q
Tested: cd backend && uv run ruff check packages/harness/deerflow/sandbox/tools.py packages/harness/deerflow/tools/builtins/setup_agent_tool.py packages/harness/deerflow/tools/builtins/update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py
Not-tested: Full backend test suite
Co-authored-by: OmX <omx@oh-my-codex.dev>
* Fix the lint error
---------
Co-authored-by: OmX <omx@oh-my-codex.dev>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: bucket subagent token usage into RunRow.subagent_tokens
Add caller-bucketed token tracking to RunJournal so subagent and
middleware LLM calls are written to the correct RunRow columns instead
of all falling into lead_agent_tokens (default 0).
- RunJournal: accumulate _lead_agent_tokens / _subagent_tokens /
_middleware_tokens in on_llm_end, deduped by langchain run_id.
Add record_external_llm_usage_records() for external sources
(respects track_token_usage flag). Return caller buckets from
get_completion_data().
- SubagentTokenCollector: new lightweight callback handler that
collects LLM usage within subagent execution.
- SubagentExecutor: wire collector into subagent run_config and sync
records to SubagentResult on every chunk (timeout/cancel safe).
- SubagentResult: add token_usage_records and usage_reported fields.
- task_tool: report subagent usage to parent RunJournal on every
terminal status (COMPLETED/FAILED/CANCELLED/TIMED_OUT), including
the CancelledError path, guarded against double-reporting.
No DB migration needed — RunRow columns already exist.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix: address token usage review feedback
* Address review follow-ups
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`make dev` ran `uv sync` unconditionally on every restart, wiping any
optional extras the user had installed manually with
`uv sync --all-packages --extra postgres`. The Docker image-build path
already solved this via the `UV_EXTRAS` build-arg in backend/Dockerfile;
the local serve.sh path and the docker-compose-dev startup command
were the remaining outliers.
`scripts/serve.sh` now resolves extras before `uv sync`:
1. honors `UV_EXTRAS` (parity with backend/Dockerfile and
docker/docker-compose.yaml — no new convention introduced);
2. falls back to parsing config.yaml — `database.backend: postgres`
or legacy `checkpointer.type: postgres` auto-pins
`--extra postgres`, so the common case needs zero extra config.
3. detector stderr is no longer suppressed, so whitelist warnings or
crashes surface to the dev terminal (review feedback).
Detection lives in `scripts/detect_uv_extras.py` (stdlib-only — has to
run before the venv exists). Extra names are validated against
`^[A-Za-z][A-Za-z0-9_-]*$` so a stray shell metacharacter in `.env`
cannot reach `uv sync` downstream (defense in depth).
`docker/docker-compose-dev.yaml`'s startup command is now extracted to
`docker/dev-entrypoint.sh` (review feedback — the inline command had
grown to a ~350-char one-liner). The script:
- parses comma/whitespace-separated UV_EXTRAS, applying the same
`^[A-Za-z][A-Za-z0-9_-]*$` whitelist as the local detector;
- emits one `--extra X` flag per token, so `UV_EXTRAS=postgres,ollama`
works in Docker dev too (harmonized with local — review feedback);
- calls `uv sync --all-packages` (PR #2584) so workspace member
extras (deerflow-harness's postgres extra) are installed;
- keeps the existing self-heal `(uv sync || (recreate venv && retry))`
branch;
- exposes `--print-extras` for dry-run testing.
The compose file mounts the script read-only at runtime, so script
edits take effect on `make docker-restart` without an image rebuild.
The `--no-sync` alternative (a separate suggestion in the issue thread)
was considered but rejected for dev paths because it would drop the
self-heal branch and the auto-pickup of new pyproject deps. `--no-sync`
is already in use for the production CMD (`backend/Dockerfile:101`)
where it's appropriate.
Updates the asyncpg-missing error message to include the
`--all-packages` flag (matching #2584) plus the persistent install flow,
and expands `config.example.yaml` so all three install paths
(local / docker dev / docker image build) are documented with their
multi-extra capabilities.
Tests:
- `tests/test_detect_uv_extras.py` (21 tests) — local-path env parsing,
YAML edge cases, env-vs-config precedence, whitelist rejection of
shell metacharacters.
- `tests/test_dev_entrypoint.py` (15 tests) — docker-path validation
via `--print-extras`, multi-extra parsing, metacharacter abort.
- `tests/test_persistence_scaffold.py` (22 tests, unchanged) — passes
with the merged `--all-packages --extra postgres` error message.
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* feat(middleware): inject dynamic context via DynamicContextMiddleware
Move memory and current date out of the system prompt and into a
dedicated <system-reminder> HumanMessage injected once per session
(frozen-snapshot pattern) via a new DynamicContextMiddleware.
This keeps the system prompt byte-exact across all users and sessions,
enabling maximum Anthropic/Bedrock prefix-cache reuse.
Key design decisions:
- ID-swap technique: reminder takes the first HumanMessage's ID
(replacing it in-place via add_messages), original content gets a
derived `{id}__user` ID (appended after). Preserves correct ordering.
- hide_from_ui: True on reminder messages so frontend filters them out.
- Midnight crossing: date-update reminder injected before the current
turn's HumanMessage when the conversation spans midnight.
- INFO-level logging for production diagnostics.
Also adds prompt-caching breakpoint budget enforcement tests and
updates ClaudeChatModel docs to reference the new pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(token-usage): log input/output token detail breakdown in middleware
Extend the LLM token usage log line to include input_token_details and
output_token_details (cache_creation, cache_read, reasoning, audio, etc.)
when present. Adds tests covering Anthropic cache detail logging from
both usage_metadata and response_metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: fix nginx
* fix(middleware): always inject date; gate memory on injection_enabled
Date injection is now unconditional — it is part of the static system
prompt replacement and should always be present. Memory injection
remains gated by `memory.injection_enabled` in the app config.
Previously the entire DynamicContextMiddleware was skipped when
injection_enabled was False, which also suppressed the date.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): format files and correct test assertions for token usage middleware
- ruff format dynamic_context_middleware.py and test_claude_provider_prompt_caching.py
- Remove unused pytest import from test_dynamic_context_middleware.py
- Fix two tests that asserted response_metadata fallback logic that
doesn't exist: replace with tests that match actual middleware behavior
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(middleware): address Copilot review comments on DynamicContextMiddleware
- Use additional_kwargs flag for reminder detection instead of content
substring matching, so user messages containing '<system-reminder>'
are not mistakenly treated as injected reminders
- Generate stable UUID when original HumanMessage.id is None to prevent
ambiguous 'None__user' derived IDs and message collisions
- Downgrade per-turn no-op log to DEBUG; keep actual injection events at INFO
- Add two new tests: missing-id UUID fallback and user-text false-positive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(task): remove max_turns parameter from task tool interface
Subagents should always use their configured max_turns value. Exposing
this parameter allowed callers to override the admin-configured limit,
which is undesirable. The value is now exclusively driven by subagent
config (per-agent overrides and global defaults in config.yaml).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(tools): introduce Runtime type alias to eliminate Pydantic serialization warning
Add deerflow/tools/types.py with:
Runtime = ToolRuntime[dict[str, Any], ThreadState]
Replace every runtime: ToolRuntime[ContextT, ThreadState] and
runtime: ToolRuntime[dict[str, Any], ThreadState] annotation in
sandbox/tools.py, present_file_tool.py, task_tool.py, view_image_tool.py,
and skill_manage_tool.py with the new Runtime alias.
The unbound ContextT TypeVar (default None) caused
PydanticSerializationUnexpectedValue warnings on every tool call because
LangChain's BaseTool._parse_input calls model_dump() on the auto-generated
args_schema while DeerFlow passes a dict as runtime context.
Binding the context to dict[str, Any] aligns Pydantic's serialization
expectations with reality and removes the noise from all run modes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(tools): extend Runtime alias to setup_agent and update_agent tools
Replace bare ToolRuntime annotations in setup_agent_tool.py and
update_agent_tool.py with the shared Runtime alias introduced in the
previous commit, and add both tools to the Pydantic serialization
warning regression test (13 cases total).
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(tools): loosen Pydantic warning filter to avoid version-specific format
Replace the brittle "field_name='context'" substring check with a looser
"context" match so the assertion stays valid if Pydantic changes its
internal warning format across versions.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(tools): simplify warning filter and clean up docstring
Remove the "context" substring condition from the Pydantic warning
filter — asserting that no PydanticSerializationUnexpectedValue fires
at all is both simpler and more comprehensive, since the test payload
contains only the tool's own args plus runtime.
Also update the module docstring to remove the version-specific warning
format example that was inconsistent with the looser filter.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* Make loop detection configurable
Expose LoopDetectionMiddleware thresholds through config.yaml while preserving existing defaults and allowing the middleware to be disabled.
Refs bytedance/deer-flow#2517
* feat(loop-detection): add per-tool tool_freq_overrides to Phase 1
Adds ToolFreqOverride model and tool_freq_overrides field to
LoopDetectionConfig, wires it through LoopDetectionMiddleware, and
documents the option in config.example.yaml.
Resolves the gap flagged in the #2586 review: without per-tool overrides,
users hit by #2510/#2511 (RNA-seq workflows exceeding the bash hard limit)
had no way to raise thresholds for one tool without loosening the global
limit for every tool.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs(loop-detection): document tool_freq_overrides in LoopDetectionMiddleware docstring
Add the missing Args entry for tool_freq_overrides, explaining the
(warn, hard_limit) tuple structure and how per-tool thresholds supersede
the global tool_freq_warn / tool_freq_hard_limit for named tools.
Also run ruff format on the three files flagged by the lint check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(loop-detection): validate LoopDetectionMiddleware __init__ params eagerly
Raise clear ValueError at construction time instead of crashing at
unpack-time inside _track_and_check when bad values are passed:
- tool_freq_overrides: must be 2-tuples of positive ints with hard_limit >= warn
- scalar thresholds: warn_threshold, hard_limit, tool_freq_warn,
tool_freq_hard_limit must be >= 1 and hard limits must >= their warn pairs
- window_size, max_tracked_threads must be >= 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(test): isolate credential loader directory-path test from real ~/.claude
The test didn't monkeypatch HOME, so on any machine with real Claude Code
credentials at ~/.claude/.credentials.json the function fell through to
those credentials and the assertion failed. Adding HOME redirect ensures
the default credential path doesn't exist during the test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style(test): add blank lines after import pytest in TestInitValidation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(loop-detection): collapse dual validation to LoopDetectionConfig
Modifications
- LoopDetectionMiddleware.__init__: stripped of all ValueError raises;
becomes a plain field-assignment constructor.
- LoopDetectionMiddleware.from_config: classmethod that builds the
middleware from a Pydantic-validated LoopDetectionConfig and handles
the ToolFreqOverride -> tuple[int, int] conversion.
- agents/factory.py: SDK construction routed through
LoopDetectionMiddleware.from_config(LoopDetectionConfig()) so the
defaults path is Pydantic-validated too.
- agents/lead_agent/agent.py: uses from_config instead of unpacking
config fields by hand.
- tests/test_loop_detection_middleware.py: deleted TestInitValidation
(16 methods exercising the removed __init__ checks); added
TestFromConfig (4 tests: scalar field mapping, override tuple
conversion, empty overrides, behavioral smoke test).
Result: one validation layer (Pydantic), zero duplication, no __new__
hacks. Both production construction sites flow through LoopDetectionConfig.
Test results
make test -> 2977 passed, 18 skipped, 0 failed (137s)
make format -> All checks passed; 411 files left unchanged
* feat(agents): make loop_detection configurable in create_deerflow_agent
Adds a `loop_detection: bool | AgentMiddleware = True` field to
RuntimeFeatures, mirroring the existing pattern used by `sandbox`,
`memory`, and `vision`. SDK users can now disable LoopDetectionMiddleware
or replace it with a custom instance built from their own
LoopDetectionConfig — e.g.
`LoopDetectionMiddleware.from_config(my_cfg)` — instead of being stuck
with the hardcoded defaults previously installed by the SDK factory.
The lead-agent path (which already reads AppConfig.loop_detection) is
unchanged, and the default `True` preserves prior always-on behavior for
all existing callers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: knight0940 <631532668@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Amorend <142649913+knight0940@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* feat(agent): add update_agent tool for in-chat custom-agent self-updates (#2616)
Custom agents had no built-in way to persist updates to their own SOUL.md /
config.yaml from a normal chat — `setup_agent` was only bound during the
bootstrap flow, so when the user asked the agent to refine its description
or personality, the agent would shell out via bash/write_file and the edits
landed in a temporary sandbox/tool workspace instead of
`{base_dir}/agents/{agent_name}/`.
Changes:
- New `update_agent` builtin tool with partial-update semantics (only the
fields you pass are written) and atomic temp-file + os.replace writes so
a failed update never corrupts existing SOUL.md / config.yaml.
- Lead agent now binds `update_agent` in the non-bootstrap path whenever
`agent_name` is set in the runtime context. Default agent (no
agent_name) and bootstrap flow are unchanged.
- New `<self_update>` system-prompt section is injected for custom agents,
instructing them to use `update_agent` — and explicitly NOT bash /
write_file — to persist self-updates.
- Tests: 11 new cases in `tests/test_update_agent_tool.py` covering
validation (missing/invalid agent_name, unknown agent, no fields),
partial updates (soul-only, description-only, skills=[] vs omitted),
no-op detection, atomic-write safety, and AgentConfig round-tripping;
plus 2 new cases in `tests/test_lead_agent_prompt.py` covering the
self-update prompt section.
- Docs: updated backend/CLAUDE.md builtin tools list and tools.mdx
(en/zh) with the new tool description.
* feat(agent): isolate custom agents per user
Store custom agent definitions under the effective user, keep legacy agents readable until migration, and cover API/tool/migration behavior with tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: consistent write/delete targets & add --user-id to migration
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(loop-detection): keep tool-call pairing on warn injection (#2724)
* make format
* fix(loop-detection): avoid IMMessage leak to downstream consumer
* fix(channels): filter loop warning text from IM replies
* fix(harness): restore legacy skills path fallback (#2694)
* fix(format): make format
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat(community): add Serper web search provider
Add a new community search provider backed by the Serper Google Search
API (https://serper.dev). Serper returns real-time Google results via a
simple JSON API and requires only an API key — no extra Python package.
Changes:
- backend/packages/harness/deerflow/community/serper/__init__.py
- backend/packages/harness/deerflow/community/serper/tools.py
Implements web_search_tool using httpx (already a project dependency).
API key is read from config.yaml `api_key` field or SERPER_API_KEY env var.
Follows the same interface / output shape as the existing ddg_search provider.
Exposes max_results parameter (default 5) with config override logic.
- backend/tests/test_serper_tools.py
Unit tests covering API key resolution, config overrides, HTTP errors,
empty results, and parameter passing.
- config.example.yaml: add commented-out Serper example alongside other providers
- .env.example: add SERPER_API_KEY placeholder
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix the lint error
* Fix the lint error
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(gateway): return ISO 8601 timestamps from threads endpoints (#2594)
ThreadResponse documents created_at / updated_at as ISO timestamps,
matching the LangGraph Platform schema (langgraph_sdk.schema.Thread
exposes them as datetime, JSON-encoded as ISO 8601). The gateway
threads router was instead emitting str(time.time()) — unix-second
floats — breaking frontend new Date() parsing and producing a mixed
ISO/unix wire format that also corrupted the search sort order.
Centralize timestamp generation in deerflow.utils.time:
- now_iso() — datetime.now(UTC).isoformat()
- coerce_iso(x) — heals legacy unix-timestamp strings on read so the
store converges to ISO without a one-shot migration
threads.py: replace 6 time.time() call sites with now_iso(); wrap all
read paths and Phase-2 checkpoint metadata with coerce_iso(); _store_upsert
opportunistically heals legacy created_at on update; drop unused time import.
thread_runs.py: reuse now_iso() instead of a private duplicate _now_iso(),
preventing future drift between the two timestamp call sites.
Tests: 9 unit tests for the helper; 5 integration tests pinning the ISO
contract for create/get/patch/search and the legacy-healing path on the
internal store upsert. Full suite: 2144 passed, 15 skipped, 0 failed.
Closes#2594
* fix(gateway): coerce checkpoint metadata timestamps to ISO on read
After the merge with main, three additional read paths in ``threads.py``
were still emitting raw ``str(metadata.get("created_at", ""))`` —
``get_thread_state``, ``update_thread_state``, and ``get_thread_history``.
Same root cause as #2594: when the checkpoint metadata's ``created_at``
is a unix-second float (legacy data, or a checkpoint written by an older
Gateway version), ``str(float)`` produces ``"1777252410.411327"`` and the
frontend's ``new Date(...)`` returns ``Invalid Date``. The fix on the
``/threads/{id}`` GET path was already in place; these three sibling
endpoints needed the same treatment.
All four call sites now flow through ``coerce_iso``, so:
- legacy float metadata heals to ISO on the way out,
- ISO metadata passes through unchanged,
- ``datetime`` instances (which the new ``coerce_iso`` branch handles
explicitly) emit with the ``T`` separator instead of falling through
to the space-separated ``str(datetime)`` form.
Coverage added for the two endpoints not already pinned by the merge:
- ``test_get_thread_state_returns_iso_for_legacy_checkpoint_metadata``
- ``test_get_thread_history_returns_iso_for_legacy_checkpoint_metadata``
Both pre-seed a checkpoint whose metadata carries the literal float
from the issue body and assert the wire format is ISO.
* refactor: thread app config through lead prompt
* fix: honor explicit app config across runtime paths
* style: format subagent executor tests
* fix: thread resolved app config and guard subagents-only fallback
Address two PR review findings:
1. _create_summarization_middleware passed the original (possibly None)
app_config into create_chat_model, forcing the model factory back to
ambient get_app_config() and risking config drift between the
middleware's resolved view and the model's view. Pass the resolved
AppConfig instance through end-to-end.
2. get_available_subagent_names accepted Any-typed config and forwarded
it to is_host_bash_allowed, which reads ``.sandbox``. A
SubagentsAppConfig (also accepted upstream as a sum-type input) has
no ``.sandbox`` attribute and would be silently treated as "no
sandbox configured", incorrectly disabling the bash subagent. Guard
on hasattr and fall back to ambient lookup otherwise.
Adds regression tests for both paths.
* chore: simplify hasattr guard and tighten regression tests
- Collapse if/else into ternary in get_available_subagent_names; hasattr(None, ...) is False so the explicit None check was redundant.
- Drop comments that narrate the change rather than explain non-obvious WHY (test names already convey intent).
- Replace stringly-typed sentinel "no-arg" in regression test with direct args tuple comparison.
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
* fix(sandbox): pass no_change_timeout to exec_command to prevent 120s premature termination
The agent_sandbox library's shell API defaults no_change_timeout to 120
seconds. When AioSandbox.execute_command() called exec_command() without
this parameter, commands producing no output for 120s would return with
NO_CHANGE_TIMEOUT status even though the script was still running.
Pass no_change_timeout=600 to all exec_command calls (matching the
client-level HTTP timeout) so long-running commands are not cut short.
Fixes#2668
* test(sandbox): add assertions for no_change_timeout in execute_command and list_dir
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2f37bc72-0826-4443-a6ba-e5b78c22fb5a
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* fix(subagents): use model override for tools and middleware
* fix(config): resolve effective subagent model
* fix(subagents): defer app config loading
* fix(subagents): fully defer config.yaml load in executor __init__
The previous attempt only relocated the explicit get_app_config() call,
but left resolve_subagent_model_name(...) running eagerly in __init__.
That helper has its own internal get_app_config() fallback, which still
fired when both app_config and parent_model were None and
config.model == "inherit" — exactly the path unit tests hit, breaking
21 tests in CI with FileNotFoundError: config.yaml.
Skip the eager resolve in __init__ when it would require loading the
config file, and defer to _create_agent (which already has the
app_config or get_app_config() fallback).
* fix(harness): resolve runtime paths from project root
* docs(config): update
* fix(config): address runtime path review feedback
* test(config): fix skills path e2e root
* test(config): cover legacy config fallback when project root lacks config files
Verifies that when DEER_FLOW_PROJECT_ROOT is unset and cwd has no
config.yaml/extensions_config.json, AppConfig and ExtensionsConfig fall back
to the legacy backend/repo-root candidates — the backward-compat path
requested in PR #2642 review.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(agents): propagate agent_name into ToolRuntime.context for setup_agent (#2677)
When creating a custom agent via the web UI, SOUL.md was always written
to the global base_dir/SOUL.md instead of agents/<name>/SOUL.md.
Root cause: the bootstrap flow sends agent_name via body.context, but
two layers were broken:
1. services.py only forwarded body.context keys into config["configurable"];
config["context"] was never populated.
2. worker.py constructed the parent Runtime with a hard-coded
{thread_id, run_id} context, ignoring config["context"] entirely.
After the langgraph >= 1.1.9 bump (#98a5b34f), ToolRuntime.context no
longer falls back to configurable, so setup_agent's
runtime.context.get("agent_name") returned None and the tool's silent
agent_name=None -> base_dir fallback kicked in, overwriting the global
SOUL.md.
Fix:
- services.py: extract merge_run_context_overrides() and write the
whitelisted context keys into both configurable (legacy readers) and
context (langgraph 1.1+ ToolRuntime consumers).
- worker.py: extract _build_runtime_context() and merge config["context"]
into the Runtime's context (without letting callers override
thread_id/run_id).
The base_dir fallback in setup_agent_tool.py is left in place because
the IM /bootstrap channel command depends on it. That code path can
be tightened in a follow-up.
Adds regression tests covering both helpers.
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Centralize log level parsing in `logging_level_from_config()` and
application in `apply_logging_level()` within `deerflow.config.app_config`.
- Gateway lifespan applies configured log level on startup
- `debug.py` uses shared helpers instead of local duplicates
- `apply_logging_level()` targets only `deerflow`/`app` logger hierarchies
so third-party library verbosity is not affected; root handler levels
are only lowered (never raised) to allow configured loggers through
without suppressing third-party output; root logger level is not modified
- Config field description updated to clarify scope
- Tests save/restore global logging state to avoid test pollution
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(memory): replace short-lived asyncio.run() with persistent event loop to prevent zombie httpx connections
The memory updater used asyncio.run() inside daemon threads, creating
and destroying short-lived event loops on every update. Langchain
providers (e.g. langchain-anthropic) cache httpx AsyncClient instances
globally via @lru_cache, so SSL connections created on a loop that is
subsequently destroyed become zombie connections in the shared pool.
When the main agent's lead run later reuses one of these connections,
httpx/anyio triggers RuntimeError: Event loop is closed during
connection cleanup.
Replace the ThreadPoolExecutor + asyncio.run() pattern with a
_MemoryLoopRunner that maintains a single persistent event loop in a
daemon thread for the process lifetime. Since the loop never closes,
connections bound to it never become invalid. The _run_async_update_sync
function now submits coroutines to this persistent loop via
run_coroutine_threadsafe instead of creating throwaway loops.
* update the code to address the review comments
* Fix the review comments of 2615
P1 — user_id forwarded through sync path: Added user_id parameter to _prepare_update_prompt, _finalize_update, and _do_update_memory_sync, and forwarded it to get_memory_data(agent_name, user_id=user_id) and
save(..., user_id=user_id). The update_memory() entry point now passes user_id through both the executor.submit path and the direct call path. Added TestUserIdForwarding with two regression tests (sync + async)
verifying get_memory_data and save receive the correct user_id.
P2 — aupdate_memory() delegates to sync: Replaced the model.ainvoke() call with asyncio.to_thread(self._do_update_memory_sync, ...). This eliminates the unsafe async provider client path entirely — all memory
updater entry points now use the isolated sync model.invoke() path. Updated the test from asserting ainvoke is awaited to asserting invoke is called and ainvoke is not.
Nit — duplicate comment removed: Removed the duplicated # Matches sentences... comment on line 230.
* Chore(test): update the code of test_memory_updater
---------
Co-authored-by: rayhpeng <rayhpeng@gmail.com>
* refactor: thread app_config through middleware factories
Continues the incremental config-refactor sequence (#2611 root, #2612 lead
path) one layer deeper into the middleware factories. Two ambient lookups
inside _build_runtime_middlewares are eliminated and the LLMErrorHandling
band-aid removed:
- _build_runtime_middlewares / build_lead_runtime_middlewares /
build_subagent_runtime_middlewares now require app_config: AppConfig.
- get_guardrails_config() inside the factory is replaced with
app_config.guardrails (semantically identical — same default-factory
GuardrailsConfig — verified by direct equality check).
- LLMErrorHandlingMiddleware.__init__ now requires app_config and reads
circuit_breaker fields directly. The class-level
circuit_failure_threshold / circuit_recovery_timeout_sec defaults are
removed along with the try/except (FileNotFoundError, RuntimeError):
pass band-aid — the let-it-crash invariant the rest of the refactor
enforces.
Caller chain (already-resolved app_config sources):
- _build_middlewares in lead_agent/agent.py: reorder so
resolved_app_config = app_config or get_app_config() is computed BEFORE
build_lead_runtime_middlewares is called, then passed as kwarg.
- SubagentExecutor: optional app_config parameter (mirrors the lead-agent
pattern); _create_agent does the same `or get_app_config()` fallback at
agent-build time, so task_tool callers don't need to plumb app_config
through yet (typed-context plumbing for tool runtimes is a separate
refactor).
Tests:
- test_llm_error_handling_middleware: _make_app_config helper using
AppConfig(sandbox=SandboxConfig(use="test")) — same minimal-config
pattern conftest already uses. Three direct LLMErrorHandlingMiddleware()
calls each followed by post-construction circuit_breaker mutation fold
cleanly into _build_middleware(circuit_failure_threshold=...,
circuit_recovery_timeout_sec=...).
Verification:
- tests/test_llm_error_handling_middleware.py — 14 passed
- tests/test_subagent_executor.py — 28 passed
- tests/test_tool_error_handling_middleware.py — 6 passed
- tests/test_task_tool_core_logic.py — 18 passed (verifies task_tool
unchanged behavior)
- Full suite: 2697 passed, 3 skipped. The single intermittent failure in
tests/test_client_e2e.py::test_tool_call_produces_events is pre-existing
LLM flakiness (the test asserts the model decided to call a tool;
reproduces 1/3 on unchanged main as well).
* fix: address middleware app config review comments
* fix: satisfy app config annotation lint
* test: cover explicit app config middleware wiring
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
* feat(models): 适配 MindIE引擎的模型
* test: add unit tests for MindIEChatModel adapter and fix PR review comments
* chore: update uv.lock with pytest-asyncio
* build: add pytest-asyncio to test dependencies
* fix: address PR review comments (lazy import, cache clients, safe newline escape, strict xml regex)
* fix(mindie): preserve string args without JSON quotes in XML tool call serialization
* fix(mindie): preserve string args without JSON quotes in XML tool call serialization
* test_mindie_provider:format
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(mindie): prevent nested tool_call params from leaking into outer args
* fixed by escaping XML entities in _fix_messages and unescaping during parse, with regression tests added.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(sandbox): block host bash traversal escapes
Fixes#2535
* fix(sandbox): harden local bash path guards
* fix(sandbox): avoid bash cd argument false positives
* Fix the lint error
Add function to resolve and validate user data path.
* Fix the lint error
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>