initiative/.claude/skills/bundle-bestiary/SKILL.md
Lukas c343fd3cd0 Add bundled-bestiary mechanism for shipping creatures with the app
D&D creatures listed in data/bestiary/dnd-bundled.json are now merged into
the search index and pre-loaded into creatureMap, so they appear alongside
5etools creatures with no "Load source" step. Source codes are derived from
the JSON itself (each creature carries source + sourceDisplayName), so adding
a new book is a pure data change. Bundled sources are excluded from
getAllSourceCodes() so bulk-import skips them, and they never appear in the
source manager (which only lists cached sources).

Includes a reference extractor (scripts/extract-great-labors.py) for the
5.5e revised stat-block format and a /bundle-bestiary skill that future
agents can follow to add monsters from other PDF books.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:49:34 +02:00


name: bundle-bestiary
description: Bundle creatures from a third-party PDF into the app's D&D bestiary so they appear in search alongside 5etools creatures, with no "Load source" step. Use when the user asks to add monsters from a PDF book / adventure / supplement to the bundled bestiary.

Instructions

Add the creatures from a PDF to data/bestiary/dnd-bundled.json so they appear in the D&D search index and render as normal stat blocks. Bundled creatures bypass the fetch/cache flow — they're shipped in the JS bundle and pre-loaded into creatureMap on startup.

How the bundling works

  • data/bestiary/dnd-bundled.json is an array of normalized Creature objects (the same shape produced by bestiary-adapter.ts for 5etools creatures).
  • apps/web/src/adapters/dnd-bundled-adapter.ts static-imports the JSON and derives:
    • loadBundledDndCreatures() — full stat blocks for the in-memory creature map
    • loadBundledDndIndexEntries() — compact summaries for the search index
    • getBundledDndSources() — source code → display name map, derived from the JSON itself (each creature carries its own source + sourceDisplayName)
  • bestiary-index-adapter.ts merges the bundled entries into the search index and excludes bundled sources from getAllSourceCodes() (so bulk-import skips them).
  • use-bestiary.ts merges bundled full creatures into creatureMap on init/refresh.

This means adding a new bundled book is purely a data change: append creatures to dnd-bundled.json with the new source's code and display name. No adapter or index code needs editing.
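
To illustrate why no adapter edit is needed, here is a minimal Python sketch of deriving the source map from the creature list alone. It only mirrors the idea behind getBundledDndSources(); the real adapter is TypeScript, and the function name derive_sources and the XYZ sample entries are hypothetical.

```python
def derive_sources(creatures):
    """Map each source code to its display name, in first-seen order."""
    sources = {}
    for creature in creatures:
        # Each creature carries its own source + sourceDisplayName,
        # so the map needs no separate registry of books.
        sources.setdefault(creature["source"], creature["sourceDisplayName"])
    return sources


# Hypothetical, heavily trimmed entries; real Creature objects have many more fields.
sample = [
    {"id": "tgl:anarch-boar", "source": "TGL", "sourceDisplayName": "The Great Labors"},
    {"id": "xyz:example-beast", "source": "XYZ", "sourceDisplayName": "Example Book"},
    {"id": "xyz:other-beast", "source": "XYZ", "sourceDisplayName": "Example Book"},
]

print(derive_sources(sample))
```

Appending the two XYZ entries is enough for the new book to show up in the derived map, which is the property the bundled flow relies on.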

Step 1 — Confirm scope and source code

Ask the user (don't guess):

  1. PDF path and the page range containing the stat blocks. Many PDFs have hundreds of pages; only a slice has the bestiary.
  2. Source code abbreviation — short uppercase letters, e.g., TGL for The Great Labors. Used in creature IDs and the index.
  3. Display name — the human-readable book title shown in the source column.
  4. Edition / system — confirm this is D&D (5e or 5.5e). Bundled creatures show in both 5e and 5.5e modes (the bestiary index only differentiates pf2e vs not). PF2e isn't currently supported by the bundled flow — if requested, this would need a parallel pf2e-bundled-adapter.ts.
  5. Licensing — verify the user has the right to bundle the book's content. Don't make assumptions.

Step 2 — Inspect the PDF

Check that the PyPDF2 package is importable:

python3 -c "from PyPDF2 import PdfReader; print('ok')"

If the import fails, check ~/Nextcloud/dnd/D&D/PROMPT_prep.md, where the user has pdftotext-equivalent tooling configured.

Then dump and skim the target pages to learn the stat-block format:

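python3 - <<'EOF'
import os
from PyPDF2 import PdfReader

PDF_PATH = 'PATH/TO/PDF'  # replace with the PDF path from Step 1
START, END = 1, 1         # replace with the 1-based, inclusive page range from Step 1

reader = PdfReader(os.path.expanduser(PDF_PATH))
for i in range(START - 1, END):  # pages are 0-indexed in PyPDF2
    print(f"\n===PAGE {i + 1}===\n{reader.pages[i].extract_text()}")
EOF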

Compare the layout against what the existing extractor (scripts/extract-great-labors.py) assumes, which is the 5.5e/2024 revised format:

  • <Name> line, then
  • <Size> <Type>(optional subtype), <Alignment>, then
  • AC X Initiative ±Y (Z), then
  • HP N (NdN + N), then
  • Speed X ft., …, then
  • A MOD SAVE MOD SAVE MOD SAVE header followed by two ability-score rows, then
  • Optional meta lines: Skills, Saving Throws, Resistances, Immunities, Vulnerabilities, Senses, Languages, then
  • Challenge X (NN XP; PB +N), then
  • Section blocks: Traits / Actions / Bonus Actions / Reactions / Legendary Actions, each containing entries shaped like Name. body....

If the PDF format matches, adapt the existing extractor. If it's a different format (5e 2014 with STR DEX CON … column layout, an older publisher's layout, a homebrew layout), expect to rework the parser substantially.
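
A quick way to test whether a page matches the revised format is to run a few line-level regexes over the dumped text. The sketch below is an illustration written for this skill, not the code in scripts/extract-great-labors.py. The regex names, the parse_core_lines helper, and the sample lines are hypothetical, and storing cr as a string is an assumption.

```python
import re

# Accept ASCII hyphen and the Unicode minus sign, since PDFs use both.
AC_RE = re.compile(r"^AC (?P<ac>\d+)\s+Initiative (?P<init>[+\-−]\d+) \((?P<score>\d+)\)$")
HP_RE = re.compile(r"^HP (?P<avg>\d+) \((?P<formula>\d+d\d+(?: ?[+\-−] ?\d+)?)\)$")
CR_RE = re.compile(r"^Challenge (?P<cr>\d+(?:/\d+)?) \((?P<xp>[\d,]+) XP; PB \+(?P<pb>\d+)\)$")


def parse_core_lines(lines):
    """Pick AC, HP and Challenge out of stat-block lines; ignore everything else."""
    out = {}
    for raw in lines:
        line = raw.strip()
        if m := AC_RE.match(line):
            out["ac"] = int(m["ac"])
        elif m := HP_RE.match(line):
            out["hp"] = {"average": int(m["avg"]), "formula": m["formula"]}
        elif m := CR_RE.match(line):
            out["cr"] = m["cr"]  # kept as a string so fractional CRs like 1/2 survive
    return out


# Invented sample lines in the revised layout, not taken from any book.
sample_lines = [
    "Example Beast",
    "AC 15 Initiative +2 (12)",
    "HP 45 (6d10 + 12)",
    "Challenge 3 (700 XP; PB +2)",
]
print(parse_core_lines(sample_lines))
```

If most stat blocks on a page yield all three keys, the existing extractor is probably a good starting point. If none do, the book likely uses a different layout.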

Step 3 — Adapt or extend the extractor

Copy scripts/extract-great-labors.py to a new script per book (e.g., scripts/extract-<book-slug>.py) and update:

  • SOURCE_CODE, SOURCE_DISPLAY, PAGE_START, PAGE_END constants.
  • The output path (data/bestiary/dnd-bundled.json). Don't overwrite — merge. The simplest pattern: read the existing file, drop any entries with the same source, then append the new ones.
  • The PROSE_TAIL_PATTERNS list — every book has its own running headers (<PageNumber>APPENDIX B … MONSTERS-style), section-header phrases, and quote-attribution dashes. Run the extractor, audit the output (see Step 4), and add curated trim patterns for any prose tails that bleed in.
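
The read, drop, append merge described above can be sketched as follows. merge_bundled and merge_into_file are hypothetical helper names, and the two-space JSON indentation is an assumption; match whatever formatting the existing file uses.

```python
import json
from pathlib import Path


def merge_bundled(existing, new_creatures, source_code):
    """Drop every entry from source_code, then append the freshly extracted ones."""
    kept = [c for c in existing if c.get("source") != source_code]
    return kept + list(new_creatures)


def merge_into_file(path, new_creatures, source_code):
    """Apply merge_bundled to the JSON file at path and write the result back."""
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_bundled(existing, new_creatures, source_code)
    path.write_text(json.dumps(merged, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return merged
```

Because old entries for the same source are dropped first, re-running an extractor replaces that book's creatures instead of duplicating them, and other books in the file are left untouched.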

Run it:

python3 scripts/extract-<book-slug>.py PATH/TO/PDF

Step 4 — Audit the output

PyPDF2 text extraction is messy. Always audit the output before declaring the job done:

python3 - <<'EOF'
import json, re
data = json.load(open('data/bestiary/dnd-bundled.json'))
new = [c for c in data if c['source'] == 'XXX']  # replace XXX with your code
for c in new:
    print(f"{c['name']}: CR {c['cr']}, AC {c['ac']}, HP {c['hp']['average']} ({c['hp']['formula']})")
    abs_ = c['abilities']
    print(f"  STR {abs_['str']} DEX {abs_['dex']} CON {abs_['con']} INT {abs_['int']} WIS {abs_['wis']} CHA {abs_['cha']}, PP {c['passive']}")
# Then audit bodies for prose-tail bleed and weird splits.
for c in new:
    for sec in ('traits', 'actions', 'bonusActions', 'reactions'):
        for e in c.get(sec, []):
            body = e['segments'][0]['value']
            issues = []
            if len(body) > 600: issues.append(f"long({len(body)})")
            if re.search(r'\.[A-Z][a-z]', body): issues.append("dot-Capital")
            if 'APPENDIX' in body: issues.append("APPENDIX")
            if re.search(r'—\s*[A-Z]\w+,\s', body): issues.append("attribution")
            if issues:
                print(f"  {c['name']} [{sec}] {e['name']}: {', '.join(issues)}")
                print(f"    ...{body[-200:]}")
EOF

Common PDF extraction problems to fix in the parser:

  • PDF kerning quirks: multi-digit values rendered with spaces (e.g., "Passive Perception 1 1" → 11, "Wis 81 1" with no space before negative). The existing parser handles most; check for new ones.
  • Smushed section headers: lines like ...plants.Actions where the section header for the next block was concatenated. Handle via SECTION_HEADER_SMUSH_RE preprocessing.
  • Cross-page prose bleed: text from the next page's flavor prose absorbed into the last entry's body. Catch via PROSE_TAIL_PATTERNS — add curated phrases observed in this specific book.
  • Sibling-entry inline smush: damage.Ram. Melee Attack Roll: … where two entries got concatenated. Already handled by the mid-line entry boundary regex in the existing parser.
  • Title-cased false positives: words like Bloodied., Restrained., Frightened. at sentence ends would otherwise match the entry-name pattern. Filtered via NAME_FALSE_POSITIVES — add to it if the new book uses condition names you haven't seen yet.
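
Three of the fixes above can be sketched in a few lines of Python. These are illustrative stand-ins, not the definitions used in scripts/extract-great-labors.py. All names, the sample pattern, and the sample strings are hypothetical.

```python
import re

SECTION_NAMES = ("Traits", "Actions", "Bonus Actions", "Reactions", "Legendary Actions")

# A section header glued to the end of the previous sentence, e.g. "...plants.Actions".
# Longer names come first so "Bonus Actions" is tried before "Actions".
EXAMPLE_SMUSH_RE = re.compile(
    r"(?<=[a-z)]\.)("
    + "|".join(sorted(map(re.escape, SECTION_NAMES), key=len, reverse=True))
    + r")$"
)

# Digits split by kerning, e.g. "Passive Perception 1 1".
EXAMPLE_SPLIT_DIGITS_RE = re.compile(r"(Passive Perception) (\d(?: \d)+)\b")


def split_smushed_header(line):
    """Return [body, header] for a smushed line, or [line] if nothing is glued on."""
    m = EXAMPLE_SMUSH_RE.search(line)
    if not m:
        return [line]
    return [line[: m.start()], m.group(1)]


def fix_passive_perception(line):
    """Rejoin a Passive Perception value whose digits were separated by spaces."""
    return EXAMPLE_SPLIT_DIGITS_RE.sub(
        lambda m: f"{m.group(1)} {m.group(2).replace(' ', '')}", line
    )


def trim_prose_tail(body, tail_patterns):
    """Cut body at the earliest match of any curated tail pattern."""
    cut = len(body)
    for pattern in tail_patterns:
        m = re.search(pattern, body)
        if m:
            cut = min(cut, m.start())
    return body[:cut].rstrip()


print(split_smushed_header("It ignores difficult terrain caused by plants.Actions"))
print(fix_passive_perception("Senses darkvision 60 ft., Passive Perception 1 1"))
print(trim_prose_tail(
    "The target takes 5 damage. 212APPENDIX B MONSTERS Some flavor text",
    [r"\d+APPENDIX [A-Z]\b"],
))
```

The tail patterns are deliberately passed in rather than hard-coded, since each book needs its own curated list.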

Step 5 — Verify in the app

pnpm check

Then start the dev server and search for one of the new creatures by name:

pnpm --filter web dev

Confirm in the browser:

  1. Search finds the creature with the right book name as the source label.
  2. Clicking it shows the full stat block immediately — no "Load source" prompt.
  3. The source manager UI does not list the bundled book (it only shows cached sources).
  4. Bulk import skips the bundled book.

Notes for future agents

  • No need to edit dnd-bundled-adapter.ts or bestiary-index-adapter.ts when adding a new book — the adapter derives source codes from the JSON.
  • data/bestiary/index.json is regenerated from 5etools and should not be edited to add bundled entries. The merge happens at runtime in bestiary-index-adapter.ts.
  • Each bundled creature must have:
    • A unique id like <sourcecode>:<slug> (e.g., tgl:anarch-boar).
    • source field matching the source code (e.g., "TGL").
    • sourceDisplayName field matching the book's display name (e.g., "The Great Labors").
    • All the required Creature fields from packages/domain/src/creature-types.ts.
  • The script approach is preferred over hand-editing JSON for more than 5 creatures. For one or two creatures, hand-editing the JSON is reasonable; just match an existing entry's shape exactly.
  • After any change to dnd-bundled.json, run pnpm typecheck — the static import in the adapter will catch shape mismatches at compile time.
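
The id and source requirements above can also be checked with a short script before running the typecheck. This is a minimal sketch: check_bundled is a hypothetical helper, and both the lowercase, hyphenated slug format and the rule that the id prefix is the lowercased source code are inferred from the tgl:anarch-boar example rather than confirmed by the codebase. It does not check the remaining required Creature fields.

```python
import re

# Assumed id shape: <lowercase source code>:<lowercase-hyphenated-slug>
ID_RE = re.compile(r"^[a-z0-9]+:[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_bundled(creatures):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen = set()
    for i, creature in enumerate(creatures):
        cid = creature.get("id", "")
        label = cid or f"entry #{i}"
        if not ID_RE.match(cid):
            problems.append(f"{label}: id is not <sourcecode>:<slug>")
        elif cid.split(":", 1)[0] != str(creature.get("source", "")).lower():
            problems.append(f"{label}: id prefix does not match source")
        if cid in seen:
            problems.append(f"{label}: duplicate id")
        seen.add(cid)
        for field in ("source", "sourceDisplayName"):
            if not creature.get(field):
                problems.append(f"{label}: missing {field}")
    return problems


# Trimmed, partly hypothetical entries used only to exercise the checks.
entries = [
    {"id": "tgl:anarch-boar", "source": "TGL", "sourceDisplayName": "The Great Labors"},
    {"id": "tgl:anarch-boar", "source": "TGL", "sourceDisplayName": "The Great Labors"},
    {"id": "xyz:example-beast", "source": "TGL", "sourceDisplayName": ""},
]
for problem in check_bundled(entries):
    print(problem)
```

To run it against the real file, load data/bestiary/dnd-bundled.json with json.load and pass the resulting list to check_bundled.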