deerflow2/backend/app/gateway
Xinmin Zeng 268fdd6968
fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381)
* fix(gateway): drain in-flight runs before closing checkpointer on shutdown

Chat runs execute in fire-and-forget background asyncio tasks that write
checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's
AsyncExitStack tore down the checkpointer's postgres connection pool while
those run tasks were still mid-graph. langgraph's
AsyncPregelLoop._checkpointer_put_after_previous then ran its
`finally: await checkpointer.aput(...)` against the closed pool, raising
psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task
(not on run_agent's call stack), run_agent's try/except cannot catch it and it
surfaces as "unhandled exception during asyncio.run() shutdown".

Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and
call it from langgraph_runtime BEFORE the AsyncExitStack closes the
checkpointer, so the final checkpoint write lands while the pool is still open.
The drain is bounded by a timeout so a stuck run cannot hang worker shutdown,
and is shielded so a second shutdown signal cannot abandon it mid-drain and
reopen the race.

Closes #3373

* fix(gateway): address review — preserve completed-run status, bound drain persistence

Addresses Copilot review on #3381:

- RunManager.shutdown(): decide run status AFTER the drain. Under the lock it
  now only requests cancellation; after asyncio.wait it marks/persists
  `interrupted` only for runs still pending or ended cancelled. A run that
  completes (e.g. `success`) during the drain window keeps its real terminal
  status instead of being unconditionally overwritten.
- Bound the trailing status persistence within the timeout budget
  (deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a slow
  store backing off under DB pressure cannot push shutdown past the deadline.
- deps: use asyncio.create_task instead of asyncio.ensure_future.
- tests: wait deterministically for the run to be in-flight (poll the first
  checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the
  recovery test double; add regression test asserting a run completing during
  the drain keeps its status (in memory and in the store).

* fix(gateway): address maintainer review — surface failed drain persists, clarify timeout constant

Addresses @WillemJiang review on #3381:

- shutdown(): inspect the gather result of the trailing interrupted-status
  persistence. _persist_status is best-effort (it catches + logs its own
  failure with exc_info and returns False, so it never raises out of the
  gather), but the aggregate result was never checked — a partial failure had
  no shutdown-level visibility. Now any escaped Exception is logged, and any
  False (a persist that did not confirm) is logged with the run_id. Added
  regression test test_shutdown_surfaces_failed_interrupted_persist.
- deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment — state the actual value
  of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the
  lifespan shutdown window. Kept as two separate constants (independent teardown
  steps that may diverge) rather than one shared "must match" value.
- Verified no other test fake needs the shutdown stub: _FakeRunManager in
  test_worker_langfuse_metadata.py is a run_agent() argument (worker path),
  never injected into langgraph_runtime, so it never receives shutdown().
2026-06-07 11:24:30 +08:00
..
auth fix(auth): persist auto-generated JWT secret to survive restarts (#2933) 2026-05-16 09:24:40 +08:00
routers fix upload file size contract (#3408) 2026-06-06 15:12:17 +08:00
__init__.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
app.py fix(stability): resolve P0 blockers from v2.0-m1-rc1 stability audit (#3107) (#3131) 2026-05-21 21:18:10 +08:00
auth_middleware.py feat: implement process-local internal authentication for Gateway and enhance CSRF handling 2026-04-26 22:20:57 +08:00
authz.py fix(security): harden auth system and fix run journal logic bug (#2593) 2026-04-28 11:34:07 +08:00
config.py fix(nginx): defer CORS to gateway allowlist (#2861) 2026-05-11 17:38:37 +08:00
csrf_middleware.py fix(nginx): defer CORS to gateway allowlist (#2861) 2026-05-11 17:38:37 +08:00
deps.py fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381) 2026-06-07 11:24:30 +08:00
internal_auth.py fix(mcp): add auth interceptor with channel user_id and keep header propagation to mcp tools (#3294) 2026-06-03 15:48:19 +08:00
langgraph_auth.py docs: clarify LangGraph compatibility entrypoints (#2914) 2026-05-12 23:15:11 +08:00
pagination.py fix: load paginated run history messages (#3305) 2026-06-01 15:50:39 +08:00
path_utils.py feat(persistence): per-user filesystem isolation, run-scoped APIs, and state/history simplification (#2153) 2026-04-26 11:13:01 +08:00
services.py fix(mcp): add auth interceptor with channel user_id and keep header propagation to mcp tools (#3294) 2026-06-03 15:48:19 +08:00
utils.py feat(persistence): add unified persistence layer with event store, token tracking, and feedback (#1930) 2026-04-26 11:05:47 +08:00