deerflow2/backend/packages/harness/deerflow/config
SHIYAO ZHANG 2da2c686c3 feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727)
* feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload

- Introduce pymupdf4llm as an optional PDF converter with better heading
  detection and table preservation than MarkItDown
- Auto mode: prefer pymupdf4llm when installed; fall back to MarkItDown
  when output is suspiciously sparse (image-based / scanned PDFs)
- Sparsity check uses chars-per-page (< 50 chars/page) rather than an
  absolute threshold, correctly handling both short and long documents
- Large files (> 1 MB) are offloaded to asyncio.to_thread() to avoid
  blocking the event loop (related: #1569)
- Add UploadsConfig with pdf_converter field (auto/pymupdf4llm/markitdown)
- Add pymupdf4llm as optional dependency: pip install deerflow-harness[pymupdf]
- Add 14 unit tests covering sparsity heuristic, routing logic, and async path

* fix(uploads): address Copilot review comments on PDF converter

- Fix docstring: MIN_CHARS_PYMUPDF -> _MIN_CHARS_PER_PAGE (typo)
- Fix file handle leak: wrap pymupdf.open in try/finally to ensure doc.close()
- Fix silent fallback gap: _convert_pdf_with_pymupdf4llm now catches all
  conversion exceptions (not just ImportError), so encrypted/corrupt PDFs
  fall back to MarkItDown instead of propagating
- Tighten type: pdf_converter field changed from str to Literal[auto|pymupdf4llm|markitdown]
- Normalize config value: _get_pdf_converter() strips and lowercases the raw
  config string, warns and falls back to 'auto' on unknown values
2026-04-03 21:59:45 +08:00
..
__init__.py feat(tracing): add optional Langfuse support (#1717) 2026-04-02 13:06:10 +08:00
acp_config.py feat(acp): add env field to ACPAgentConfig for subprocess env injection (#1447) 2026-03-27 20:03:30 +08:00
agents_config.py feat/per agent skill filter (#1650) 2026-04-02 15:02:09 +08:00
app_config.py feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727) 2026-04-03 21:59:45 +08:00
checkpointer_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
extensions_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
guardrails_config.py feat(guardrails): add pre-tool-call authorization middleware with pluggable providers (#1240) 2026-03-23 18:07:33 +08:00
memory_config.py feat(memory): Introduce configurable memory storage abstraction (#1353) 2026-03-27 07:41:06 +08:00
model_config.py feat(codex): support explicit OpenAI Responses API config (#1235) 2026-03-22 20:39:26 +08:00
paths.py fix Windows Docker sandbox path mounting (#1634) 2026-03-31 22:19:27 +08:00
sandbox_config.py feat(sandbox): truncate oversized bash and read_file tool outputs (#1677) 2026-04-02 09:22:41 +08:00
skills_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
stream_bridge_config.py feat(gateway): implement LangGraph Platform API in Gateway, replace langgraph-cli (#1403) 2026-03-30 16:02:23 +08:00
subagents_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
summarization_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
title_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
token_usage_config.py feat: add configurable log level and token usage tracking (#1301) 2026-03-25 08:13:26 +08:00
tool_config.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
tool_search_config.py feat(tools): add tool_search for deferred MCP tool loading (#1176) 2026-03-17 20:43:55 +08:00
tracing_config.py feat(tracing): add optional Langfuse support (#1717) 2026-04-02 13:06:10 +08:00