deerflow2/backend/app/gateway/routers
Nan Gao 8c0830aea1
fix(channels): add operational guardrails (#3584)
* fix(channels): add operational guardrails

* make format

* fix(channels): converge with #3582 to avoid merge-order conflicts

Drop this PR's DingTalk INFO-log redaction and hand it to #3582, which
already restructures that handler and will redact the same log there. This
PR no longer touches dingtalk.py, so the two PRs can merge to main in any
order without a conflict.

For WeChat, drop the contested thread_ts priority reorder (review #3) and
keep only what inbound dedupe needs: a server-stable message_id in the
inbound metadata (message_id/msg_id, no client_id per review #6). This is a
single added line inside the metadata dict, a region #3582 never touches, so
it auto-merges regardless of order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address three correctness review findings

1. Connect-code cap was racy (willem #1): _create_state ran delete-expired,
   count, and insert as three separate transactions, so concurrent connect
   POSTs from one owner could each see count < cap and all insert past it. Add
   ChannelConnectionRepository.create_oauth_state_within_cap which does
   delete+count+insert in a single transaction serialized per (owner,
   provider) — Postgres via pg_advisory_xact_lock, SQLite via the write lock
   the leading DELETE takes — and have the router use it.

2. Inbound dedupe key fell back to an empty-string ("") workspace id
   (willem #3): two workspaces delivering without team/guild/aibotid would
   collapse to the same key and drop each other's messages as duplicates.
   _inbound_dedupe_key now fails closed (returns None) when no workspace
   identifier is present.

3. Dedupe key was recorded on receipt and never released on failure
   (ShenAC #1): a transient error (DB blip, Gateway 503) left the key in place
   for the full TTL, so a provider redelivery of the same message_id — exactly
   the retry dedupe should absorb — was silently dropped. _handle_message now
   releases the key in the unexpected-exception branch so redelivery can
   recover, while keeping record-on-receipt so retries during handling are
   still deduped.
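
A minimal sketch of findings 2 and 3 together, not the actual manager code.
The function names, metadata field names (team_id, guild_id, aibotid), TTL
value, and return strings are assumptions for illustration, as is treating a
None key as "process without dedupe".

```python
import time
from typing import Callable, Dict, Optional

DEDUPE_TTL_SECONDS = 300.0  # hypothetical TTL, not taken from the PR

# key -> expiry timestamp (monotonic clock)
_recent: Dict[str, float] = {}


def inbound_dedupe_key(provider: str, metadata: dict) -> Optional[str]:
    """Build a dedupe key, or return None when it cannot be made unambiguous."""
    workspace = (
        metadata.get("team_id") or metadata.get("guild_id") or metadata.get("aibotid")
    )
    message_id = (
        metadata.get("event_id") or metadata.get("message_id") or metadata.get("msg_id")
    )
    # Fail closed: never fall back to "" for the workspace, otherwise two
    # workspaces sharing a message_id would collapse onto the same key.
    if not workspace or not message_id:
        return None
    return f"{provider}:{workspace}:{message_id}"


def handle_message(
    provider: str,
    metadata: dict,
    process: Callable[[dict], None],
    now: Optional[float] = None,
) -> str:
    now = time.monotonic() if now is None else now
    key = inbound_dedupe_key(provider, metadata)
    if key is not None:
        expiry = _recent.get(key)
        if expiry is not None and expiry > now:
            return "duplicate"
        # Record on receipt so retries arriving during handling are deduped.
        _recent[key] = now + DEDUPE_TTL_SECONDS
    try:
        process(metadata)
    except Exception:
        # Release the key so a provider redelivery can recover.
        if key is not None:
            _recent.pop(key, None)
        return "failed"
    return "processed"
```

With this shape, a first delivery that fails leaves no key behind, the
redelivery is processed, and any later copy of the same message_id is
absorbed until the TTL expires.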

Tests: repo cap enforcement incl. concurrent-issuance non-leak; dedupe
fail-closed; dedupe key release-on-failure redelivery recovery.
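
The cap enforcement these tests exercise can be sketched for SQLite as
follows. This is an illustration under assumptions, not the repository code:
the table layout, column names, and function signature are hypothetical, and
it uses an explicit BEGIN IMMEDIATE to take the write lock up front rather
than relying on the leading DELETE to acquire it. The connection is assumed
to be opened with isolation_level=None so transactions are controlled
manually.

```python
import sqlite3
import time
import uuid
from typing import Optional

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS oauth_states (
    state TEXT PRIMARY KEY,
    owner TEXT NOT NULL,
    provider TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""


def create_state_within_cap(
    conn: sqlite3.Connection,
    owner: str,
    provider: str,
    cap: int,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Delete expired, count, and insert in one transaction; None if at cap."""
    now = time.time() if now is None else now
    state = uuid.uuid4().hex
    # Take the write lock before reading, so concurrent writers serialize.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Prune only this owner/provider's expired codes.
        conn.execute(
            "DELETE FROM oauth_states "
            "WHERE owner = ? AND provider = ? AND expires_at <= ?",
            (owner, provider, now),
        )
        (active,) = conn.execute(
            "SELECT COUNT(*) FROM oauth_states WHERE owner = ? AND provider = ?",
            (owner, provider),
        ).fetchone()
        issued = None
        if active < cap:
            conn.execute(
                "INSERT INTO oauth_states (state, owner, provider, expires_at) "
                "VALUES (?, ?, ?, ?)",
                (state, owner, provider, now + ttl_seconds),
            )
            issued = state
        conn.execute("COMMIT")
        return issued
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the count and the insert share one write transaction, two concurrent
callers cannot both observe count < cap and both insert past it.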

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address cleanup/efficiency and test review findings

Efficiency / cleanup:
- Dedupe key set drops client-generated ids (client_msg_id, client_id);
  keep only server-stable event_id/message_id/msg_id, which a provider's own
  redelivery preserves (ShenAC #6). Every provider already emits message_id.
- TTL/overflow pruning of _recent_inbound_events is now O(k): switch to an
  OrderedDict and popitem(last=False) from the front instead of scanning all
  4096 entries on every inbound (willem #4).
- Log "received inbound" only after the dedupe check so a provider retrying N
  times no longer logs N accepts; document that manager dedupe covers the
  agent run/final answer, not provider ack side-effects (willem #5, ShenAC #2).
- Slack drops the redundant `team_id or event.get("team")` fallback the caller
  already resolved (willem #6).
- create_oauth_state_within_cap prunes only this owner/provider's expired codes
  instead of a global DELETE on every connect POST; global cleanup still runs
  on consume_oauth_state (willem #7).
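
A sketch of the O(k) front-pruning idea from the list above, assuming every
entry gets the same TTL so insertion order matches expiry order. The helper
names and the TTL value are hypothetical; only the 4096 bound and the
OrderedDict/popitem(last=False) approach come from this commit message.

```python
from collections import OrderedDict

MAX_ENTRIES = 4096
TTL_SECONDS = 300.0  # hypothetical TTL, not taken from the PR


def prune(recent: "OrderedDict[str, float]", now: float,
          max_entries: int = MAX_ENTRIES) -> int:
    """Pop expired or overflowing entries from the front; return how many."""
    removed = 0
    while recent:
        # Peek at the oldest entry without scanning the whole mapping.
        _, expires_at = next(iter(recent.items()))
        if expires_at > now and len(recent) <= max_entries:
            break  # oldest entry is live and we are within bounds: stop early
        recent.popitem(last=False)
        removed += 1
    return removed


def remember(recent: "OrderedDict[str, float]", key: str, now: float,
             ttl: float = TTL_SECONDS, max_entries: int = MAX_ENTRIES) -> None:
    """Record a key at the back, then prune only what must go from the front."""
    recent.pop(key, None)  # re-insert at the back to keep expiry order
    recent[key] = now + ttl
    prune(recent, now, max_entries)
```

The loop touches only the k entries it removes plus one live entry, instead
of scanning all stored keys on every inbound message.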

Tests:
- Dedupe test uses tmp_path instead of a leaked mkdtemp, uses distinct objects
  per publish, and adds a negative control: a different message_id is still
  processed, catching over-dedupe regressions (willem #8, ShenAC #4).
- Slack HTTP-mode rejection test supplies app_token so the missing-token early
  return can't mask the guard, giving the state assertions teeth (ShenAC #3).
- count_oauth_states test pins that the active row survives, not just the count
  (ShenAC #5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 10:09:46 +08:00
__init__.py feat(gateway): implement LangGraph Platform API in Gateway, replace langgraph-cli (#1403) 2026-03-30 16:02:23 +08:00
agents.py fix(agents): offload blocking filesystem IO in the custom-agent router off the event loop (#3457) 2026-06-09 22:24:53 +08:00
artifacts.py fix(gateway): cap skill artifact preview size (#2963) 2026-05-15 22:15:58 +08:00
assistants_compat.py feat(gateway): implement LangGraph Platform API in Gateway, replace langgraph-cli (#1403) 2026-03-30 16:02:23 +08:00
auth.py fix: align auth-disabled mode and mock history loading (#3471) 2026-06-10 16:11:00 +08:00
channel_connections.py fix(channels): add operational guardrails (#3584) 2026-06-18 10:09:46 +08:00
channels.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
feedback.py feat(persistence):Unified persistence layer with event store, feedback, and rebase cleanup (#2134) 2026-04-26 11:09:55 +08:00
mcp.py fix: add MCP tools cache reset endpoint (#3602) 2026-06-16 23:20:20 +08:00
memory.py feat(memory): add memory.token_counting config to avoid tiktoken network dependency (#3429) (#3465) 2026-06-10 23:26:15 +08:00
models.py refactor: thread release config through lead path (#2612) 2026-04-28 14:53:18 +08:00
runs.py fix(history): strip base64 image data from REST endpoint responses (#3535) 2026-06-13 08:58:19 +08:00
skills.py refactor(skills): Unified skill storage capability (#2613) 2026-05-01 13:23:26 +08:00
suggestions.py make ai follow-up suggestions optional (#3591) 2026-06-15 17:59:25 +08:00
thread_runs.py fix(history): strip base64 image data from REST endpoint responses (#3535) 2026-06-13 08:58:19 +08:00
threads.py fix(history): strip base64 image data from REST endpoint responses (#3535) 2026-06-13 08:58:19 +08:00
uploads.py fix upload file size contract (#3408) 2026-06-06 15:12:17 +08:00