c343fd3cd0
D&D creatures listed in data/bestiary/dnd-bundled.json are now merged into the search index and pre-loaded into creatureMap, so they appear alongside 5etools creatures with no "Load source" step. Source codes are derived from the JSON itself (each creature carries source + sourceDisplayName), so adding a new book is a pure data change. Bundled sources are excluded from getAllSourceCodes() so bulk-import skips them, and they never appear in the source manager (which only lists cached sources). Includes a reference extractor (scripts/extract-great-labors.py) for the 5.5e revised stat-block format and a /bundle-bestiary skill that future agents can follow to add monsters from other PDF books. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
name: bundle-bestiary
description: Bundle creatures from a third-party PDF into the app's D&D bestiary so they appear in search alongside 5etools creatures, with no "Load source" step. Use when the user asks to add monsters from a PDF book / adventure / supplement to the bundled bestiary.
---

## Instructions

Add the creatures from a PDF to `data/bestiary/dnd-bundled.json` so they appear in the D&D search index and render as normal stat blocks. Bundled creatures bypass the fetch/cache flow: they ship in the JS bundle and are pre-loaded into `creatureMap` on startup.

### How the bundling works

- `data/bestiary/dnd-bundled.json` is an array of normalized `Creature` objects (the same shape produced by `bestiary-adapter.ts` for 5etools creatures).
- `apps/web/src/adapters/dnd-bundled-adapter.ts` static-imports the JSON and provides:
  - `loadBundledDndCreatures()`: full stat blocks for the in-memory creature map
  - `loadBundledDndIndexEntries()`: compact summaries for the search index
  - `getBundledDndSources()`: source code → display name map, **derived from the JSON itself** (each creature carries its own `source` + `sourceDisplayName`)
- `bestiary-index-adapter.ts` merges the bundled entries into the search index and excludes bundled sources from `getAllSourceCodes()` (so bulk-import skips them).
- `use-bestiary.ts` merges bundled full creatures into `creatureMap` on init/refresh.

This means **adding a new bundled book is purely a data change**: append creatures to `dnd-bundled.json` with the new source's code and display name. No adapter or index code needs editing.

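To see why a new book needs no adapter changes, here is a minimal Python sketch of the same derivation: it builds the source code → display name map purely from the creature entries. It is an illustration only, not the TypeScript adapter's code, and the function name `bundled_sources` is hypothetical.

```python
import json
from pathlib import Path


def bundled_sources(creatures):
    """Map each source code to its display name, rejecting conflicting names."""
    sources = {}
    for creature in creatures:
        code = creature["source"]
        name = creature["sourceDisplayName"]
        # setdefault keeps the first name seen; a different later name is an error.
        if sources.setdefault(code, name) != name:
            raise ValueError(
                f"{code} has two display names: {sources[code]!r} and {name!r}"
            )
    return sources


if __name__ == "__main__":
    # Path taken from this document; the check is skipped if the file is absent.
    path = Path("data/bestiary/dnd-bundled.json")
    if path.exists():
        print(bundled_sources(json.loads(path.read_text(encoding="utf-8"))))
```

Running it from the repo root is also a quick way to spot a typo in a `sourceDisplayName` before the app does.
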
### Step 1 — Confirm scope and source code

Ask the user (don't guess):

1. **PDF path** and the **page range** containing the stat blocks. Many PDFs run to hundreds of pages, and only a slice holds the bestiary.
2. **Source code abbreviation**: short uppercase letters, e.g., `TGL` for *The Great Labors*. Used in creature IDs and the index.
3. **Display name**: the human-readable book title shown in the source column.
4. **Edition / system**: confirm this is D&D (5e or 5.5e). Bundled creatures show in both 5e and 5.5e modes (the bestiary index only distinguishes PF2e from everything else). PF2e isn't currently supported by the bundled flow; if requested, it would need a parallel `pf2e-bundled-adapter.ts`.
5. **Licensing**: verify the user has the right to bundle the book's content. Don't make assumptions.

### Step 2 — Inspect the PDF

Check that Python's PyPDF2 is available:

```bash
python3 -c "from PyPDF2 import PdfReader; print('ok')"
```

If it isn't, check `~/Nextcloud/dnd/D&D/PROMPT_prep.md`, where the user has `pdftotext`-equivalent tooling configured.

Then dump and skim the target pages to learn the stat-block format:

```bash
python3 - <<'EOF'
import os
from PyPDF2 import PdfReader

# Replace PATH/TO/PDF, START, and END (1-based, inclusive page numbers) before running.
r = PdfReader(os.path.expanduser('PATH/TO/PDF'))
for i in range(START - 1, END):
    print(f"\n===PAGE {i+1}===\n{r.pages[i].extract_text()}")
EOF
```

Look for the layout. The existing extractor (`scripts/extract-great-labors.py`) assumes the 5.5e/2024 revised format:

- `<Name>` line, then
- `<Size> <Type>(optional subtype), <Alignment>`, then
- `AC X Initiative ±Y (Z)`, then
- `HP N (NdN + N)`, then
- `Speed X ft., …`, then
- A `MOD SAVE MOD SAVE MOD SAVE` header followed by two ability-score rows, then
- Optional meta lines: `Skills`, `Saving Throws`, `Resistances`, `Immunities`, `Vulnerabilities`, `Senses`, `Languages`, then
- `Challenge X (NN XP; PB +N)`, then
- Section blocks: `Traits` / `Actions` / `Bonus Actions` / `Reactions` / `Legendary Actions`, each containing entries shaped like `Name. body...`.

If the PDF format matches, adapt the existing extractor. If it's a different format (5e 2014 with `STR DEX CON …` column layout, an older publisher's layout, a homebrew layout), expect to rework the parser more substantially.

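As a quick format check before committing to the existing extractor, the header lines listed above can be matched with a few regular expressions. The sketch below is an assumption-laden illustration, not code from `scripts/extract-great-labors.py`: the pattern names, `parse_header`, the output keys, and the sample stat block are all hypothetical.

```python
import re

# Hypothetical patterns for three of the header lines in the 5.5e/2024 layout.
# The sign class also accepts the Unicode minus and en dash that PDFs often emit.
AC_RE = re.compile(r"^AC (\d+)\s+Initiative ([+\-−–]\d+) \((\d+)\)$")
HP_RE = re.compile(r"^HP (\d+) \((\d+d\d+(?: [+\-−–] \d+)?)\)$")
CR_RE = re.compile(r"^Challenge ([\d/]+) \(([\d,]+) XP; PB \+(\d+)\)$")


def parse_header(lines):
    """Collect AC, HP, and Challenge fields; lines that match nothing are skipped."""
    out = {}
    for line in lines:
        line = line.strip()
        if m := AC_RE.match(line):
            out["ac"] = int(m.group(1))
            out["initiative"] = m.group(2)
        elif m := HP_RE.match(line):
            out["hp"] = {"average": int(m.group(1)), "formula": m.group(2)}
        elif m := CR_RE.match(line):
            out["cr"] = m.group(1)  # kept as text so fractional CRs like 1/2 survive
            out["xp"] = int(m.group(2).replace(",", ""))
            out["pb"] = int(m.group(3))
    return out


# Made-up stat block header used only to exercise the patterns.
sample = [
    "Example Beast",
    "Large Beast, Unaligned",
    "AC 15 Initiative +2 (12)",
    "HP 45 (6d10 + 12)",
    "Speed 40 ft.",
    "Challenge 3 (700 XP; PB +2)",
]

if __name__ == "__main__":
    print(parse_header(sample))
```

If most creatures in the page dump yield all three fields, the book is probably close enough to the revised format to adapt the extractor rather than rewrite it.
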
### Step 3 — Adapt or extend the extractor

Copy `scripts/extract-great-labors.py` to a new script per book (e.g., `scripts/extract-<book-slug>.py`) and update:

- `SOURCE_CODE`, `SOURCE_DISPLAY`, `PAGE_START`, `PAGE_END` constants.
- The output path (`data/bestiary/dnd-bundled.json`). **Don't overwrite; merge.** The simplest pattern: read the existing file, drop any entries with the same `source`, then append the new ones.
- The `PROSE_TAIL_PATTERNS` list. Every book has its own running headers (`<PageNumber>APPENDIX B … MONSTERS`-style), section-header phrases, and quote-attribution dashes. Run the extractor, audit the output (see Step 4), and add curated trim patterns for any prose tails that bleed in.

Run it:

```bash
python3 scripts/extract-<book-slug>.py PATH/TO/PDF
```

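The merge-instead-of-overwrite pattern described above can be sketched as follows. This is a minimal illustration under assumptions: the helper names `merge_bundle` and `write_bundle` are hypothetical, and the two-space JSON indentation is a guess at the file's existing style, so match whatever the file actually uses.

```python
import json
from pathlib import Path


def merge_bundle(existing, new_creatures, source_code):
    """Drop existing entries for source_code, then append the fresh extraction."""
    kept = [c for c in existing if c.get("source") != source_code]
    return kept + list(new_creatures)


def write_bundle(path, new_creatures, source_code):
    """Read the bundle at path (if present), merge, and write it back."""
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_bundle(existing, new_creatures, source_code)
    # indent=2 is an assumption about the file's formatting.
    text = json.dumps(merged, indent=2, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return merged
```

Because entries for the same `source` are dropped first, re-running an extractor replaces that book's creatures rather than duplicating them, and other books in the file are left alone.
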
### Step 4 — Audit the output

PyPDF2 text extraction is messy. Always audit the output before calling the job done:

```bash
python3 - <<'EOF'
import json, re

with open('data/bestiary/dnd-bundled.json', encoding='utf-8') as f:
    data = json.load(f)
new = [c for c in data if c['source'] == 'XXX']  # replace XXX with your code

# First pass: print the headline numbers to compare against the PDF.
for c in new:
    print(f"{c['name']}: CR {c['cr']}, AC {c['ac']}, HP {c['hp']['average']} ({c['hp']['formula']})")
    abs_ = c['abilities']
    print(f"  STR {abs_['str']} DEX {abs_['dex']} CON {abs_['con']} INT {abs_['int']} WIS {abs_['wis']} CHA {abs_['cha']}, PP {c['passive']}")

# Then audit bodies for prose-tail bleed and weird splits.
for c in new:
    for sec in ('traits', 'actions', 'bonusActions', 'reactions'):
        for e in c.get(sec, []):
            body = e['segments'][0]['value']
            issues = []
            if len(body) > 600: issues.append(f"long({len(body)})")
            if re.search(r'\.[A-Z][a-z]', body): issues.append("dot-Capital")
            if 'APPENDIX' in body: issues.append("APPENDIX")
            if re.search(r'—\s*[A-Z]\w+,\s', body): issues.append("attribution")
            if issues:
                print(f"  {c['name']} [{sec}] {e['name']}: {', '.join(issues)}")
                print(f"    ...{body[-200:]}")
EOF
```

Common PDF extraction problems to fix in the parser:

- **PDF kerning quirks**: multi-digit values rendered with spaces (e.g., "Passive Perception 1 1" → 11, "Wis 8–1 –1" with no space before the negative modifier). The existing parser handles most; check for new ones.
- **Smushed section headers**: lines like `...plants.Actions` where the section header for the next block was concatenated. Handle via `SECTION_HEADER_SMUSH_RE` preprocessing.
- **Cross-page prose bleed**: text from the next page's flavor prose absorbed into the last entry's body. Catch via `PROSE_TAIL_PATTERNS` by adding curated phrases observed in this specific book.
- **Sibling-entry inline smush**: `damage.Ram. Melee Attack Roll: …` where two entries got concatenated. Already handled by the mid-line entry boundary regex in the existing parser.
- **Title-cased false positives**: words like `Bloodied.`, `Restrained.`, `Frightened.` at sentence ends would otherwise match the entry-name pattern. Filtered via `NAME_FALSE_POSITIVES`; add to it if the new book uses condition names you haven't seen yet.

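Three of the fixes above can be illustrated with small stand-alone helpers. These are hypothetical stand-ins written for this note, not the definitions in `scripts/extract-great-labors.py`; the `EXAMPLE_` names, the patterns, and the sample strings are all assumptions, so read the real `SECTION_HEADER_SMUSH_RE` and `PROSE_TAIL_PATTERNS` before changing them.

```python
import re

SECTION_NAMES = ("Traits", "Actions", "Bonus Actions", "Reactions", "Legendary Actions")

# Sentence-ending punctuation glued directly to a section name at end of line.
EXAMPLE_SMUSH_RE = re.compile(r"([.!?)])(" + "|".join(SECTION_NAMES) + r")$")

# Single digits separated by spaces after "Passive Perception".
EXAMPLE_PASSIVE_RE = re.compile(r"(Passive Perception )(\d(?: \d)+)\b")

# One curated tail pattern: a page number fused to a running "APPENDIX" header.
EXAMPLE_TAIL_PATTERNS = [re.compile(r"\s*\d+\s*APPENDIX\b.*$", re.DOTALL)]


def split_smushed_headers(lines):
    """Move a section header glued to the end of a line onto its own line."""
    out = []
    for line in lines:
        m = EXAMPLE_SMUSH_RE.search(line)
        if m:
            out.append(line[: m.start(2)])
            out.append(m.group(2))
        else:
            out.append(line)
    return out


def fix_split_digits(line):
    """Rejoin a passive Perception value that kerning split into single digits."""
    return EXAMPLE_PASSIVE_RE.sub(
        lambda m: m.group(1) + m.group(2).replace(" ", ""), line
    )


def trim_prose_tail(body):
    """Cut the body at the earliest match of any curated tail pattern."""
    cut = len(body)
    for pattern in EXAMPLE_TAIL_PATTERNS:
        m = pattern.search(body)
        if m:
            cut = min(cut, m.start())
    return body[:cut].rstrip()
```

Keeping each fix in its own small function makes it easy to test a newly observed quirk in isolation before wiring it into the per-book extractor.
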
### Step 5 — Verify in the app

Run the project checks:

```bash
pnpm check
```

Then start the dev server and search for one of the new creatures by name:

```bash
pnpm --filter web dev
```

Confirm in the browser:

1. Search finds the creature with the right book name as the source label.
2. Clicking it shows the full stat block immediately, with **no "Load source" prompt**.
3. The source manager UI does **not** list the bundled book (it only shows cached sources).
4. Bulk import skips the bundled book.

### Notes for future agents

- **No need to edit `dnd-bundled-adapter.ts` or `bestiary-index-adapter.ts`** when adding a new book, because the adapter derives source codes from the JSON.
- `data/bestiary/index.json` is regenerated from 5etools and should **not** be edited to add bundled entries. The merge happens at runtime in `bestiary-index-adapter.ts`.
- Each bundled creature must have:
  - A unique `id` like `<sourcecode>:<slug>` (e.g., `tgl:anarch-boar`).
  - `source` field matching the source code (e.g., `"TGL"`).
  - `sourceDisplayName` field matching the book's display name (e.g., `"The Great Labors"`).
  - All the required `Creature` fields from `packages/domain/src/creature-types.ts`.
- The script approach is preferred over hand-editing JSON for >5 creatures. For one or two creatures, hand-editing the JSON is reasonable; just match an existing entry's shape exactly.
- After any change to `dnd-bundled.json`, run `pnpm typecheck`; the static import in the adapter will catch shape mismatches at compile time.
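
The per-creature requirements above lend themselves to a quick pre-commit check. The sketch below is hypothetical: `slugify`, `make_id`, and `check_bundle` are illustrative names, and the slug rule (lowercase, alphanumeric runs joined by hyphens) is inferred from the single example `tgl:anarch-boar`, so confirm it against existing entries. It does not replace `pnpm typecheck`, which covers the full `Creature` shape.

```python
import re

# Assumed id shape: lowercase source code, a colon, then a hyphenated slug.
ID_RE = re.compile(r"^[a-z0-9]+:[a-z0-9]+(?:-[a-z0-9]+)*$")


def slugify(name):
    """Lowercase the name and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def make_id(source_code, name):
    """Build an id such as tgl:anarch-boar from a source code and creature name."""
    return f"{source_code.lower()}:{slugify(name)}"


def check_bundle(creatures):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    seen = set()
    for c in creatures:
        cid = c.get("id", "")
        if not ID_RE.match(cid):
            problems.append(f"{cid!r}: malformed id")
        elif cid.split(":", 1)[0] != str(c.get("source", "")).lower():
            problems.append(f"{cid!r}: id prefix does not match source {c.get('source')!r}")
        if cid in seen:
            problems.append(f"{cid!r}: duplicate id")
        seen.add(cid)
        if not c.get("sourceDisplayName"):
            problems.append(f"{cid!r}: missing sourceDisplayName")
    return problems
```

Calling `check_bundle` on the parsed contents of `data/bestiary/dnd-bundled.json` and printing the result gives a fast sanity pass before starting the dev server.
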