deerflow2/backend/packages/harness/deerflow
SHIYAO ZHANG ddfc988bef
feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727)
* feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload

- Introduce pymupdf4llm as an optional PDF converter with better heading
  detection and table preservation than MarkItDown
- Auto mode: prefer pymupdf4llm when installed; fall back to MarkItDown
  when output is suspiciously sparse (image-based / scanned PDFs)
- Sparsity check uses chars-per-page (< 50 chars/page) rather than an
  absolute threshold, correctly handling both short and long documents
- Large files (> 1 MB) are converted in a worker thread via
  asyncio.to_thread() to avoid blocking the event loop (related: #1569)
- Add UploadsConfig with pdf_converter field (auto/pymupdf4llm/markitdown)
- Add pymupdf4llm as optional dependency: pip install deerflow-harness[pymupdf]
- Add 14 unit tests covering sparsity heuristic, routing logic, and async path
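
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function names, the `primary`/`fallback` callables, and the convention that `primary` returns `None` when it is unavailable are all assumptions; only the 50 chars/page and 1 MB thresholds come from the commit message.

```python
import asyncio
from typing import Callable, Optional, Tuple

# Thresholds taken from the commit message; the constant names are hypothetical.
MIN_CHARS_PER_PAGE = 50
ASYNC_THRESHOLD_BYTES = 1 * 1024 * 1024

# Hypothetical converter signatures used only for this sketch.
Primary = Callable[[str], Optional[Tuple[str, int]]]  # -> (markdown, page_count) or None
Fallback = Callable[[str], str]                       # -> markdown


def is_sparse(text: str, page_count: int) -> bool:
    """Treat output as suspiciously sparse when it averages < 50 chars per page."""
    if page_count <= 0:
        return True
    return len(text.strip()) / page_count < MIN_CHARS_PER_PAGE


def convert_pdf_auto(path: str, primary: Primary, fallback: Fallback) -> str:
    """Prefer the primary converter; use the fallback if it is missing or sparse."""
    result = primary(path)
    if result is not None:
        text, page_count = result
        if not is_sparse(text, page_count):
            return text
    return fallback(path)


async def convert_pdf(path: str, size_bytes: int,
                      primary: Primary, fallback: Fallback) -> str:
    """Run large conversions in a worker thread so the event loop stays free."""
    if size_bytes > ASYNC_THRESHOLD_BYTES:
        return await asyncio.to_thread(convert_pdf_auto, path, primary, fallback)
    return convert_pdf_auto(path, primary, fallback)
```

Dividing by the page count is what lets a one-page memo with 100 characters pass while a ten-page scan with the same 100 characters is routed to the fallback.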

* fix(uploads): address Copilot review comments on PDF converter

- Fix docstring: MIN_CHARS_PYMUPDF -> _MIN_CHARS_PER_PAGE (typo)
- Fix file handle leak: wrap pymupdf.open in try/finally to ensure doc.close()
- Fix silent fallback gap: _convert_pdf_with_pymupdf4llm now catches all
  conversion exceptions (not just ImportError), so encrypted/corrupt PDFs
  fall back to MarkItDown instead of propagating
- Tighten type: pdf_converter field changed from str to
  Literal["auto", "pymupdf4llm", "markitdown"]
- Normalize config value: _get_pdf_converter() strips and lowercases the raw
  config string, warns and falls back to 'auto' on unknown values
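
A rough sketch of the review fixes listed above, under stated assumptions: `normalize_pdf_converter` and `try_convert` are hypothetical names, and `open_doc`/`render` stand in for the real PDF library calls, which are not reproduced here. It shows the strip/lowercase normalization with a warning on unknown values, and a conversion wrapper that always closes the document and returns `None` on any failure so the caller can fall back.

```python
import logging
from typing import Any, Callable, Literal, Optional, get_args

logger = logging.getLogger(__name__)

PdfConverter = Literal["auto", "pymupdf4llm", "markitdown"]
_VALID_CONVERTERS = get_args(PdfConverter)


def normalize_pdf_converter(raw: object) -> str:
    """Strip and lowercase the raw config value; unknown values become 'auto'."""
    value = str(raw).strip().lower()
    if value not in _VALID_CONVERTERS:
        logger.warning("Unknown pdf_converter %r; falling back to 'auto'", raw)
        return "auto"
    return value


def try_convert(path: str,
                open_doc: Callable[[str], Any],
                render: Callable[[Any], str]) -> Optional[str]:
    """Return Markdown, or None on any failure so the caller can fall back."""
    try:
        doc = open_doc(path)  # may fail on encrypted or corrupt files
    except Exception:
        logger.warning("Could not open %s; falling back", path)
        return None
    try:
        return render(doc)
    except Exception:
        logger.warning("Conversion failed for %s; falling back", path)
        return None
    finally:
        doc.close()  # runs on success and on failure, so no handle leaks
```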
2026-04-03 21:59:45 +08:00
agents feat(uploads): inject document outline into agent context for converted files (#1738) 2026-04-03 20:52:47 +08:00
community feat(sandbox): add built-in grep and glob tools (#1784) 2026-04-03 16:03:06 +08:00
config feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727) 2026-04-03 21:59:45 +08:00
guardrails feat(guardrails): add pre-tool-call authorization middleware with pluggable providers (#1240) 2026-03-23 18:07:33 +08:00
mcp feat(harness): integration ACP agent tool (#1344) 2026-03-26 14:20:18 +08:00
models feat(tracing): add optional Langfuse support (#1717) 2026-04-02 13:06:10 +08:00
reflection refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
runtime fix: guarantee END sentinel delivery when stream bridge queue is full (#1695) 2026-04-03 20:12:30 +08:00
sandbox feat(sandbox): add read-only support for local sandbox path mappings (#1808) 2026-04-03 19:46:22 +08:00
skills fix(skills): support parsing multiline YAML strings in SKILL.md frontmatter (#1703) 2026-04-01 23:08:30 +08:00
subagents fix: surface configured sandbox mounts to agents (#1638) 2026-03-31 22:22:30 +08:00
tools fix ACP mcpServers payload (#1735) 2026-04-03 15:28:56 +08:00
tracing feat(tracing): add optional Langfuse support (#1717) 2026-04-02 13:06:10 +08:00
uploads feat(harness): integration ACP agent tool (#1344) 2026-03-26 14:20:18 +08:00
utils feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727) 2026-04-03 21:59:45 +08:00
__init__.py refactor: split backend into harness (deerflow.*) and app (app.*) (#1131) 2026-03-14 22:55:52 +08:00
client.py feat(client): add `available_skills` parameter to DeerFlowClient (#1779) 2026-04-03 11:22:58 +08:00