* feat(skills): add systematic-literature-review skill for multi-paper SLR workflows
Adds a new skill that produces a structured systematic literature review (SLR)
across multiple academic papers on a topic. Addresses #1862 with a pure skill
approach: no new tools, no architectural changes, no new dependencies.
Skill layout:
- SKILL.md: 4+1 phase workflow (plan, search, extract, synthesize, present)
- scripts/arxiv_search.py: arXiv API client with no additional dependencies
required; it uses requests when installed and falls back to urllib through
a shim modeled after github-deep-research's github_api.py
- templates/{apa,ieee,bibtex}.md: citation format templates selected
dynamically in Phase 4, mirroring podcast-generation's templates/ pattern
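For reviewers unfamiliar with the requests->urllib fallback pattern, a minimal sketch follows. The names `BACKEND` and `http_get` are hypothetical and chosen for illustration; the actual shim in arxiv_search.py may be structured differently.

```python
import urllib.request

try:
    import requests  # optional third-party dependency
    BACKEND = "requests"
except ImportError:
    requests = None
    BACKEND = "urllib"


def http_get(url: str, timeout: float = 30.0) -> str:
    """Return the response body as text, preferring requests when installed."""
    if requests is not None:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    # Standard-library fallback: same contract, no extra dependency.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

Callers only see `http_get`, so the rest of the script does not need to know which backend is active.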
Design notes:
- Multi-paper synthesis uses the existing `task` tool to dispatch extraction
subagents in parallel. SKILL.md's Phase 3 includes a fixed decision table
for batch splitting to respect the runtime's MAX_CONCURRENT_SUBAGENTS = 3
cap, and explicitly tells the agent to strip the "Task Succeeded. Result: "
prefix before parsing subagent JSON output.
- arXiv only, by design. Semantic Scholar and PubMed adapters would push the
scope toward a standalone MCP server (see #933) and are intentionally out
of scope for this skill.
- Coexists with the existing `academic-paper-review` skill: this skill does
breadth-first synthesis across many papers, academic-paper-review does
single-paper peer review. The two are routed via distinct triggers and
can compose (an SLR across many papers, plus a deep review of the one or
two most important).
- Hard upper bound of 50 papers, tied to the Phase 3 concurrency strategy.
Larger surveys degrade in synthesis quality and are better split by
sub-topic.
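The batching and prefix-stripping rules from the first design note can be sketched in a few lines. This is an illustration only: `split_into_batches` and `parse_subagent_output` are hypothetical helper names, and the decision table in SKILL.md may group papers differently from the simple fixed-size split assumed here.

```python
import json

MAX_CONCURRENT_SUBAGENTS = 3  # runtime cap named in this commit message
RESULT_PREFIX = "Task Succeeded. Result: "


def split_into_batches(items, size=MAX_CONCURRENT_SUBAGENTS):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parse_subagent_output(raw: str):
    """Strip the task-tool prefix, if present, then parse the JSON payload."""
    text = raw.lstrip()
    if text.startswith(RESULT_PREFIX):
        text = text[len(RESULT_PREFIX):]
    return json.loads(text)
```

Each batch would be dispatched as one round of parallel `task` calls, and every returned string passed through `parse_subagent_output` before synthesis.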
The BibTeX template explicitly uses @misc rather than @article for arXiv
preprints; using @article is a common mistake when generating BibTeX for
arXiv papers.
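To make the @misc convention concrete, here is a small formatter sketch. The function name `arxiv_bibtex` and the exact field set (eprint, archivePrefix, primaryClass) are assumptions for illustration and are not copied from templates/bibtex.md; the identifier in the usage note is a placeholder, not a real paper.

```python
def arxiv_bibtex(key, authors, title, year, arxiv_id, primary_class):
    """Render an arXiv preprint as a BibTeX @misc entry."""
    fields = [
        ("author", " and ".join(authors)),
        ("title", title),
        ("year", str(year)),
        ("eprint", arxiv_id),
        ("archivePrefix", "arXiv"),
        ("primaryClass", primary_class),
    ]
    # One "name = {value}" line per field, comma-separated.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"
```

For example, `arxiv_bibtex("doe2024", ["Doe, Jane"], "A Title", 2024, "2404.00001", "cs.CV")` yields an entry that begins with `@misc{doe2024,` and carries the identifier in the `eprint` field.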
arxiv_search.py was smoke-tested end-to-end against the live arXiv API with
two query shapes (relevance sort, submittedDate sort with category filter);
all returned JSON fields parse correctly (id normalization, Atom namespace
handling, URL encoding for multi-word queries).
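The Atom namespace handling mentioned above can be illustrated as follows. This is a sketch using the standard library's ElementTree, not the script's actual parser; `parse_entries` is a hypothetical name and `SAMPLE_FEED` is a made-up placeholder feed rather than real API output.

```python
import xml.etree.ElementTree as ET

# The arXiv API returns Atom; every element lives in this namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Placeholder
      title</title>
    <published>2024-05-01T00:00:00Z</published>
    <author><name>A. Author</name></author>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Extract id, title, published date and authors from an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            # Titles may be wrapped across lines; collapse the whitespace.
            "title": " ".join(title.split()),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
            "authors": [
                a.findtext("atom:name", default="", namespaces=ATOM_NS)
                for a in entry.findall("atom:author", ATOM_NS)
            ],
        })
    return entries
```

Without the namespace map, `findall("entry")` would match nothing, which is the failure mode the smoke test was guarding against.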
* fix(skills): prevent LLM from saving intermediate search results to file
Adds an explicit "do not save" instruction at the end of Phase 2.
Observed during Test 1 with DeepSeek: the model saved search results
to a markdown file before proceeding to Phase 3, wasting 2-3 tool call
rounds and increasing the risk of hitting the graph recursion limit.
The search JSON should stay in context for Phase 3, not be persisted.
* fix(skills): use relevance+start-date instead of submittedDate sorting
Test 2 revealed that arXiv's submittedDate sorting returns the most
recently submitted papers in the category regardless of query relevance.
Searching "diffusion models" with sortBy=submittedDate in cs.CV returned
papers on spatial memory, Navier-Stokes, and photon-counting CT — none
about diffusion models. The LLM then retried with 4 different queries,
wasting tool calls and approaching the recursion limit.
Fix: always sort by relevance; when the user wants "recent" papers,
combine relevance sorting with --start-date to constrain the time window.
Also add an explicit "run the search exactly once" instruction to prevent
the retry loop.
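A sketch of the relevance-plus-date-window request described above. It assumes the start date is translated into a `submittedDate:[... TO ...]` range in `search_query`, as documented for the arXiv API; how arxiv_search.py implements --start-date internally is not shown here, and `build_search_url` is a hypothetical name.

```python
from datetime import date
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def build_search_url(field_query, start_date=None, end_date=None, max_results=20):
    """Build a query URL that always sorts by relevance.

    `field_query` is assumed to be already in arXiv field syntax,
    for example 'all:diffusion'.
    """
    search_query = field_query
    if start_date is not None:
        end = end_date or date.today()
        # Constrain the time window instead of sorting by submittedDate.
        window = f"submittedDate:[{start_date:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
        search_query = f"{search_query} AND {window}"
    params = {
        "search_query": search_query,
        "sortBy": "relevance",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"
```

The date range narrows the candidate set while relevance ranking decides the order, so "recent" requests no longer return whatever was submitted last in the category.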
* fix(skills): wrap multi-word arXiv queries in double quotes for phrase matching
Without quotes, `all:diffusion model` is parsed by arXiv's Lucene as
`all:diffusion OR model`, pulling in unrelated papers from physics
(thermal diffusion) and other fields. Wrapping in double quotes forces
phrase matching: `all:"diffusion model"`.
Also fixes date filtering: the previous bug caused 2011 papers to appear
in results despite --start-date 2024-04-09, because the unquoted query
words were OR'd with the date constraint.
Verified: "diffusion models" --category cs.CV --start-date 2024-04-09
now returns only relevant diffusion model papers published on or after
2024-04-09.
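The quoting rule can be sketched as a small helper. `format_query_term` is a hypothetical name, and the handling of pre-existing quotes is an assumption rather than the script's confirmed behavior.

```python
def format_query_term(query: str, field: str = "all") -> str:
    """Wrap multi-word queries in double quotes so they match as a phrase."""
    # Drop any quotes the caller supplied so they are not doubled.
    words = query.replace('"', " ").split()
    if not words:
        raise ValueError("query must contain at least one word")
    if len(words) == 1:
        return f"{field}:{words[0]}"
    return f'{field}:"{" ".join(words)}"'
```

With this, `format_query_term("diffusion model")` produces `all:"diffusion model"`, while a single keyword is passed through unquoted.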
* fix(skills): add query phrasing guide and enforce subagent delegation
Two fixes from Test 2 observations with DeepSeek:
1. Query phrasing: add a table showing good vs bad query examples.
The script wraps multi-word queries in double quotes for phrase
matching, so long queries like "diffusion models in computer vision"
return 0 results. Guide the LLM to use 2-3 core keywords + --category
instead.
2. Subagent enforcement: DeepSeek was extracting metadata inline via
python -c scripts instead of using the task tool. Strengthen Phase 3
to explicitly name the task tool, say "do not extract metadata
yourself", and explain why (token budget, isolation). This is more
direct than the previous natural-language-only approach while still
providing the reasoning behind the constraint.
* fix(skills): strengthen search keyword guidance and subagent enforcement
Address two issues found during end-to-end testing with DeepSeek:
1. Search retry: LLM passed full topic descriptions as queries (e.g.
"diffusion models in computer vision"), which returned 0 results due
to exact phrase matching and triggered retries. Added explicit
instruction to extract 2-3 core keywords before searching.
2. Subagent bypass: LLM used python -c to extract metadata instead of
dispatching via task tool. Added explicit prohibition list (python -c,
bash scripts, inline extraction) with ❌ markers for clarity.
* fix(skills): address Copilot review feedback on SLR skill
- Fix legacy arXiv ID parsing: preserve the archive prefix for papers with
old-style identifiers, used until March 2007 (e.g. hep-th/9901001 instead
of just 9901001)
- Fix phase count: "four phases" -> "five phases"
- Add subagent_enabled prerequisite note to SKILL.md Notes section
- Remove PR-specific references ("PR 1") from ieee.md and bibtex.md
templates, replace with workflow-scoped wording
- Fix script header: "stdlib only" -> "no additional dependencies
required", fix relative path to github_api.py reference
- Remove reference to non-existent docs/enhancement/ path in header
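The legacy ID fix in the first bullet above can be illustrated with a normalizer that accepts both identifier schemes. This is a sketch, not the code in arxiv_search.py: `normalize_arxiv_id` is a hypothetical name, stripping the version suffix is an assumption, and 2404.12345 in the usage note is a placeholder identifier.

```python
import re

_ARXIV_ID = re.compile(
    r"(?:https?://arxiv\.org/abs/)?"          # optional abs-page URL prefix
    r"(?P<id>"
    r"[a-z\-]+(?:\.[A-Z]{2})?/\d{7}"          # old style: archive/YYMMNNN
    r"|\d{4}\.\d{4,5}"                        # new style: YYMM.NNNN(N)
    r")"
    r"(?:v\d+)?"                              # optional version suffix
)


def normalize_arxiv_id(raw: str) -> str:
    """Return the bare arXiv ID, keeping the archive prefix of old-style IDs."""
    match = _ARXIV_ID.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized arXiv identifier: {raw!r}")
    return match.group("id")
```

Here `normalize_arxiv_id("http://arxiv.org/abs/hep-th/9901001v1")` keeps the `hep-th/` prefix, and `normalize_arxiv_id("http://arxiv.org/abs/2404.12345v2")` reduces to the new-style ID without the version.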
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>