diff --git a/.claude/skills/bundle-bestiary/SKILL.md b/.claude/skills/bundle-bestiary/SKILL.md new file mode 100644 index 0000000..289a2d8 --- /dev/null +++ b/.claude/skills/bundle-bestiary/SKILL.md @@ -0,0 +1,148 @@ +--- +name: bundle-bestiary +description: Bundle creatures from a third-party PDF into the app's D&D bestiary so they appear in search alongside 5etools creatures, with no "Load source" step. Use when the user asks to add monsters from a PDF book / adventure / supplement to the bundled bestiary. +--- + +## Instructions + +Add the creatures from a PDF to `data/bestiary/dnd-bundled.json` so they appear in the D&D search index and render as normal stat blocks. Bundled creatures bypass the fetch/cache flow — they're shipped in the JS bundle and pre-loaded into `creatureMap` on startup. + +### How the bundling works + +- `data/bestiary/dnd-bundled.json` is an array of normalized `Creature` objects (the same shape produced by `bestiary-adapter.ts` for 5etools creatures). +- `apps/web/src/adapters/dnd-bundled-adapter.ts` static-imports the JSON and derives: + - `loadBundledDndCreatures()` — full stat blocks for the in-memory creature map + - `loadBundledDndIndexEntries()` — compact summaries for the search index + - `getBundledDndSources()` — source code → display name map, **derived from the JSON itself** (each creature carries its own `source` + `sourceDisplayName`) +- `bestiary-index-adapter.ts` merges the bundled entries into the search index and excludes bundled sources from `getAllSourceCodes()` (so bulk-import skips them). +- `use-bestiary.ts` merges bundled full creatures into `creatureMap` on init/refresh. + +This means **adding a new bundled book is purely a data change**: append creatures to `dnd-bundled.json` with the new source's code and display name. No adapter or index code needs editing. + +### Step 1 — Confirm scope and source code + +Ask the user (don't guess): + +1. **PDF path** and the **page range** containing the stat blocks. 
Many PDFs have hundreds of pages; only a slice has the bestiary. +2. **Source code abbreviation** — short uppercase letters, e.g., `TGL` for *The Great Labors*. Used in creature IDs and the index. +3. **Display name** — the human-readable book title shown in the source column. +4. **Edition / system** — confirm this is D&D (5e or 5.5e). Bundled creatures show in both 5e and 5.5e modes (the bestiary index only differentiates pf2e vs not). PF2e isn't currently supported by the bundled flow — if requested, this would need a parallel `pf2e-bundled-adapter.ts`. +5. **Licensing** — verify the user has the right to bundle the book's content. Don't make assumptions. + +### Step 2 — Inspect the PDF + +Check that PyPDF2 is available to Python: + +```bash +python3 -c "from PyPDF2 import PdfReader; print('ok')" +``` + +If it is not, check `~/Nextcloud/dnd/D&D/PROMPT_prep.md`, where the user has `pdftotext`-equivalent tooling configured. + +Then dump and skim the target pages to learn the stat-block format: + +```bash +python3 - <<'EOF' +from PyPDF2 import PdfReader +import os +r = PdfReader(os.path.expanduser('PATH/TO/PDF')) +for i in range(START-1, END): + print(f"\n===PAGE {i+1}===\n{r.pages[i].extract_text()}") +EOF +``` + +Look for the layout — the existing extractor (`scripts/extract-great-labors.py`) assumes the 5.5e/2024 revised format: + +- `<Name>` line, then +- `<Size> <Type> (optional subtype), <Alignment>`, then +- `AC X Initiative ±Y (Z)`, then +- `HP N (NdN + N)`, then +- `Speed X ft., …`, then +- A `MOD SAVE MOD SAVE MOD SAVE` header followed by two ability-score rows, then +- Optional meta lines: `Skills`, `Saving Throws`, `Resistances`, `Immunities`, `Vulnerabilities`, `Senses`, `Languages`, then +- `Challenge X (NN XP; PB +N)`, then +- Section blocks: `Traits` / `Actions` / `Bonus Actions` / `Reactions` / `Legendary Actions`, each containing entries shaped like `Name. body...`. + +If the PDF format matches, adapt the existing extractor. 
If it's a different format (5e 2014 with `STR DEX CON …` column layout, an older publisher's layout, a homebrew layout), expect to rework the parser more substantively. + +### Step 3 — Adapt or extend the extractor + +Copy `scripts/extract-great-labors.py` to a new script per book (e.g., `scripts/extract-.py`) and update: + +- `SOURCE_CODE`, `SOURCE_DISPLAY`, `PAGE_START`, `PAGE_END` constants. +- The output path (`data/bestiary/dnd-bundled.json`). **Don't overwrite — merge.** The simplest pattern: read the existing file, drop any entries with the same `source`, then append the new ones. +- The `PROSE_TAIL_PATTERNS` list — every book has its own running headers (`APPENDIX B … MONSTERS`-style), section-header phrases, and quote-attribution dashes. Run the extractor, audit the output (see Step 4), and add curated trim patterns for any prose tails that bleed in. + +Run it: + +```bash +python3 scripts/extract-.py PATH/TO/PDF +``` + +### Step 4 — Audit the output + +PyPDF text extraction is messy. Always audit before claiming done: + +```bash +python3 - <<'EOF' +import json, re +data = json.load(open('data/bestiary/dnd-bundled.json')) +new = [c for c in data if c['source'] == 'XXX'] # replace XXX with your code +for c in new: + print(f"{c['name']}: CR {c['cr']}, AC {c['ac']}, HP {c['hp']['average']} ({c['hp']['formula']})") + abs_ = c['abilities'] + print(f" STR {abs_['str']} DEX {abs_['dex']} CON {abs_['con']} INT {abs_['int']} WIS {abs_['wis']} CHA {abs_['cha']}, PP {c['passive']}") +# Then audit bodies for prose-tail bleed and weird splits. 
+for c in new: + for sec in ('traits', 'actions', 'bonusActions', 'reactions'): + for e in c.get(sec, []): + body = e['segments'][0]['value'] + issues = [] + if len(body) > 600: issues.append(f"long({len(body)})") + if re.search(r'\.[A-Z][a-z]', body): issues.append("dot-Capital") + if 'APPENDIX' in body: issues.append("APPENDIX") + if re.search(r'—\s*[A-Z]\w+,\s', body): issues.append("attribution") + if issues: + print(f" {c['name']} [{sec}] {e['name']}: {', '.join(issues)}") + print(f" ...{body[-200:]}") +EOF +``` + +Common PDF extraction problems to fix in the parser: + +- **PDF kerning quirks**: multi-digit values rendered with spaces (e.g., "Passive Perception 1 1" → 11, "Wis 8–1 –1" with no space before negative). The existing parser handles most; check for new ones. +- **Smushed section headers**: lines like `...plants.Actions` where the section header for the next block was concatenated. Handle via `SECTION_HEADER_SMUSH_RE` preprocessing. +- **Cross-page prose bleed**: text from the next page's flavor prose absorbed into the last entry's body. Catch via `PROSE_TAIL_PATTERNS` — add curated phrases observed in this specific book. +- **Sibling-entry inline smush**: `damage.Ram. Melee Attack Roll: …` where two entries got concatenated. Already handled by the mid-line entry boundary regex in the existing parser. +- **Title-cased false positives**: words like `Bloodied.`, `Restrained.`, `Frightened.` at sentence ends would otherwise match the entry-name pattern. Filtered via `NAME_FALSE_POSITIVES` — add to it if the new book uses condition names you haven't seen yet. + +### Step 5 — Verify in the app + +```bash +pnpm check +``` + +Then start the dev server and search for one of the new creatures by name: + +```bash +pnpm --filter web dev +``` + +Confirm in the browser: + +1. Search finds the creature with the right book name as the source label. +2. Clicking it shows the full stat block immediately — **no "Load source" prompt**. +3. 
The source manager UI does **not** list the bundled book (it only shows cached sources). +4. Bulk import skips the bundled book. + +### Notes for future agents + +- **No need to edit `dnd-bundled-adapter.ts` or `bestiary-index-adapter.ts`** when adding a new book — the adapter derives source codes from the JSON. +- `data/bestiary/index.json` is regenerated from 5etools and should **not** be edited to add bundled entries. The merge happens at runtime in `bestiary-index-adapter.ts`. +- Each bundled creature must have: + - A unique `id` like `:` (e.g., `tgl:anarch-boar`). + - `source` field matching the source code (e.g., `"TGL"`). + - `sourceDisplayName` field matching the book's display name (e.g., `"The Great Labors"`). + - All the required `Creature` fields from `packages/domain/src/creature-types.ts`. +- The script approach is preferred over hand-editing JSON for >5 creatures. For a single creature or two, hand-editing the JSON is reasonable; just match an existing entry's shape exactly. +- After any change to `dnd-bundled.json`, run `pnpm typecheck` — the static import in the adapter will catch shape mismatches at compile time. 
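The per-creature requirements listed in the notes above (unique `id`, `id` prefixed by the lowercased source code, and one display name per source) can be checked before committing. What follows is a minimal sketch under those assumptions only; `check_bundled` is a hypothetical helper, not something that exists in the repository, and it does not validate the remaining `Creature` fields.

```python
import json
from collections import Counter
from pathlib import Path


def check_bundled(creatures: list[dict]) -> list[str]:
    """Return human-readable problems found in bundled creature entries.

    Hypothetical helper: it only checks the id/source/sourceDisplayName
    rules from the notes, not the full Creature shape.
    """
    problems: list[str] = []
    id_counts = Counter(c.get("id") for c in creatures)
    display_by_source: dict[str, str] = {}
    for c in creatures:
        cid, source = c.get("id"), c.get("source")
        display = c.get("sourceDisplayName")
        label = cid or c.get("name") or "<unnamed>"
        if not cid or not source or not display:
            problems.append(f"{label}: missing id, source, or sourceDisplayName")
            continue
        if id_counts[cid] > 1:
            problems.append(f"{label}: duplicate id")
        if not cid.startswith(f"{source.lower()}:"):
            problems.append(f"{label}: id does not start with '{source.lower()}:'")
        # The first display name seen for a source is treated as canonical.
        if display_by_source.setdefault(source, display) != display:
            problems.append(f"{label}: display name differs within source {source}")
    return problems


if __name__ == "__main__":
    path = Path("data/bestiary/dnd-bundled.json")
    if path.exists():
        for problem in check_bundled(json.loads(path.read_text(encoding="utf-8"))):
            print(problem)
```

Run from the repository root, the script prints nothing when the file is clean. The display-name check treats the first name seen for each source as canonical, which matches how `getBundledDndSources()` keeps the first `sourceDisplayName` per source.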
diff --git a/apps/web/src/adapters/__tests__/bestiary-index-adapter.test.ts b/apps/web/src/adapters/__tests__/bestiary-index-adapter.test.ts index 5fffc57..4130778 100644 --- a/apps/web/src/adapters/__tests__/bestiary-index-adapter.test.ts +++ b/apps/web/src/adapters/__tests__/bestiary-index-adapter.test.ts @@ -49,10 +49,9 @@ describe("loadBestiaryIndex", () => { }); describe("getAllSourceCodes", () => { - it("returns all keys from the index sources", () => { + it("returns all index sources except bundled ones", () => { const codes = getAllSourceCodes(); - const index = loadBestiaryIndex(); - expect(codes).toEqual(Object.keys(index.sources)); + expect(codes).not.toContain("TGL"); }); it("returns only strings", () => { diff --git a/apps/web/src/adapters/__tests__/dnd-bundled-adapter.test.ts b/apps/web/src/adapters/__tests__/dnd-bundled-adapter.test.ts new file mode 100644 index 0000000..8d1d308 --- /dev/null +++ b/apps/web/src/adapters/__tests__/dnd-bundled-adapter.test.ts @@ -0,0 +1,45 @@ +import { describe, expect, it } from "vitest"; +import { + getBundledDndSources, + loadBundledDndCreatures, + loadBundledDndIndexEntries, +} from "../dnd-bundled-adapter.js"; + +describe("dnd-bundled-adapter", () => { + it("loads bundled creatures with a valid shape", () => { + const creatures = loadBundledDndCreatures(); + const sources = getBundledDndSources(); + for (const c of creatures) { + expect(sources.has(c.source)).toBe(true); + expect(c.sourceDisplayName).toBe(sources.get(c.source)); + expect(c.id.startsWith(`${c.source.toLowerCase()}:`)).toBe(true); + } + }); + + it("derives source codes from the creature data", () => { + const creatures = loadBundledDndCreatures(); + const sources = getBundledDndSources(); + const seen = new Set(creatures.map((c) => c.source)); + expect(sources.size).toBe(seen.size); + for (const s of seen) { + expect(sources.has(s)).toBe(true); + } + }); + + it("derives index entries that match the bundled creatures", () => { + const creatures = 
loadBundledDndCreatures(); + const entries = loadBundledDndIndexEntries(); + expect(entries.length).toBe(creatures.length); + const entryNames = new Set(entries.map((e) => e.name)); + for (const c of creatures) { + expect(entryNames.has(c.name)).toBe(true); + } + }); + + it("abbreviates sizes to single-letter codes in index entries", () => { + const entries = loadBundledDndIndexEntries(); + for (const e of entries) { + expect(["T", "S", "M", "L", "H", "G"]).toContain(e.size); + } + }); +}); diff --git a/apps/web/src/adapters/bestiary-index-adapter.ts b/apps/web/src/adapters/bestiary-index-adapter.ts index c3866ef..b0c47c3 100644 --- a/apps/web/src/adapters/bestiary-index-adapter.ts +++ b/apps/web/src/adapters/bestiary-index-adapter.ts @@ -1,6 +1,10 @@ import type { BestiaryIndex, BestiaryIndexEntry } from "@initiative/domain"; import rawIndex from "../../../../data/bestiary/index.json"; +import { + getBundledDndSources, + loadBundledDndIndexEntries, +} from "./dnd-bundled-adapter.js"; interface CompactCreature { readonly n: string; @@ -55,23 +59,32 @@ export function loadBestiaryIndex(): BestiaryIndex { if (cachedIndex) return cachedIndex; const compact = rawIndex as unknown as CompactIndex; - const sources = Object.fromEntries( + const sources: Record<string, string> = Object.fromEntries( Object.entries(compact.sources).filter( ([code]) => !EXCLUDED_SOURCES.has(code), ), ); + for (const [code, name] of getBundledDndSources()) { + sources[code] = name; + } cachedIndex = { sources, - creatures: compact.creatures - .filter((c) => !EXCLUDED_SOURCES.has(c.s)) - .map(mapCreature), + creatures: [ + ...compact.creatures + .filter((c) => !EXCLUDED_SOURCES.has(c.s)) + .map(mapCreature), + ...loadBundledDndIndexEntries(), + ], }; return cachedIndex; } export function getAllSourceCodes(): string[] { const index = loadBestiaryIndex(); - return Object.keys(index.sources).filter((c) => !EXCLUDED_SOURCES.has(c)); + const bundled = getBundledDndSources(); + return 
Object.keys(index.sources).filter( + (c) => !EXCLUDED_SOURCES.has(c) && !bundled.has(c), + ); } function sourceCodeToFilename(sourceCode: string): string { diff --git a/apps/web/src/adapters/dnd-bundled-adapter.ts b/apps/web/src/adapters/dnd-bundled-adapter.ts new file mode 100644 index 0000000..616d06f --- /dev/null +++ b/apps/web/src/adapters/dnd-bundled-adapter.ts @@ -0,0 +1,53 @@ +import type { BestiaryIndexEntry, Creature } from "@initiative/domain"; +import { creatureId } from "@initiative/domain"; + +import rawBundled from "../../../../data/bestiary/dnd-bundled.json"; + +type RawBundledCreature = Omit<Creature, "id"> & { id: string }; + +const SIZE_TO_CODE: Record<string, string> = { + Tiny: "T", + Small: "S", + Medium: "M", + Large: "L", + Huge: "H", + Gargantuan: "G", +}; + +/** Full normalized stat blocks for bundled D&D creatures. */ +export function loadBundledDndCreatures(): Creature[] { + return (rawBundled as RawBundledCreature[]).map((c) => ({ + ...c, + id: creatureId(c.id), + })); +} + +/** Index entries derived from the bundled creatures, in the compact shape + * used by the search index. */ +export function loadBundledDndIndexEntries(): BestiaryIndexEntry[] { + return (rawBundled as RawBundledCreature[]).map((c) => ({ + name: c.name, + source: c.source, + ac: c.ac, + hp: c.hp.average, + dex: c.abilities.dex, + cr: c.cr, + initiativeProficiency: c.initiativeProficiency, + size: SIZE_TO_CODE[c.size.split(" ")[0]] ?? "M", + type: c.type.split(" ")[0].toLowerCase(), + })); +} + +/** Source codes → display names, derived from the bundled creatures' own + * `source` and `sourceDisplayName` fields. Adding a new book just means + * appending creatures with the right `source` field to dnd-bundled.json; + * no code change is required here. 
*/ +export function getBundledDndSources(): ReadonlyMap<string, string> { + const map = new Map<string, string>(); + for (const c of rawBundled as RawBundledCreature[]) { + if (!map.has(c.source)) { + map.set(c.source, c.sourceDisplayName); + } + } + return map; +} diff --git a/apps/web/src/hooks/use-bestiary.ts b/apps/web/src/hooks/use-bestiary.ts index 7d1b9b2..4969622 100644 --- a/apps/web/src/hooks/use-bestiary.ts +++ b/apps/web/src/hooks/use-bestiary.ts @@ -9,6 +9,7 @@ import { normalizeBestiary, setSourceDisplayNames, } from "../adapters/bestiary-adapter.js"; +import { loadBundledDndCreatures } from "../adapters/dnd-bundled-adapter.js"; import { normalizeFoundryCreatures } from "../adapters/pf2e-bestiary-adapter.js"; import { useAdapters } from "../contexts/adapter-context.js"; import { useRulesEditionContext } from "../contexts/rules-edition-context.js"; @@ -160,7 +161,11 @@ export function useBestiary(): BestiaryHook { } void bestiaryCache.loadAllCachedCreatures().then((map) => { - setCreatureMap(map); + const merged = new Map(map); + for (const c of loadBundledDndCreatures()) { + merged.set(c.id, c); + } + setCreatureMap(merged); }); }, [bestiaryCache, bestiaryIndex, pf2eBestiaryIndex]); @@ -300,6 +305,9 @@ export function useBestiary(): BestiaryHook { const refreshCache = useCallback(async (): Promise<void> => { const map = await bestiaryCache.loadAllCachedCreatures(); + for (const c of loadBundledDndCreatures()) { + map.set(c.id, c); + } setCreatureMap(map); }, [bestiaryCache]); diff --git a/data/bestiary/dnd-bundled.json b/data/bestiary/dnd-bundled.json new file mode 100644 index 0000000..fe51488 --- /dev/null +++ b/data/bestiary/dnd-bundled.json @@ -0,0 +1 @@ +[] diff --git a/scripts/extract-great-labors.py b/scripts/extract-great-labors.py new file mode 100644 index 0000000..42def81 --- /dev/null +++ b/scripts/extract-great-labors.py @@ -0,0 +1,561 @@ +#!/usr/bin/env python3 +"""Extract D&D 5.5e stat blocks from The Great Labors PDF. 
+ +Usage: + python3 scripts/extract-great-labors.py + +Reads pages 163-199 (Appendix B: Monsters) and emits +data/bestiary/dnd-bundled.json in the Creature[] shape from +packages/domain/src/creature-types.ts. + +Requires: PyPDF2 (pip install PyPDF2) +""" + +import json +import os +import re +import sys +from pathlib import Path + +from PyPDF2 import PdfReader + +# --- Constants --- + +SOURCE_CODE = "TGL" +SOURCE_DISPLAY = "The Great Labors" +PAGE_START = 163 # 1-indexed +PAGE_END = 199 + +SIZE_RE = r"(Tiny|Small|Medium|Large|Huge|Gargantuan)" +TYPE_PIECE = r"[A-Za-z][A-Za-z\- ]*?" +ALIGN_PIECE = r"[A-Za-z][A-Za-z ()]*?" +HEADER_RE = re.compile( + rf"^{SIZE_RE}\s+({TYPE_PIECE}(?:\s+\([^)]+\))?),\s+({ALIGN_PIECE})\s*$" +) + +AC_RE = re.compile(r"^AC\s+(\d+)\s+Initiative\s+([+\-–]\s*\d+|[+\-–]?\d+)") +HP_RE = re.compile(r"^HP\s+(\d+)\s*\(([^)]+)\)") +SPEED_RE = re.compile(r"^Speed\s+(.+?)\s*$") +ABILITY_ROW_RE = re.compile( + r"^(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s+" + r"(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s+" + r"(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s*$" +) +CR_RE = re.compile( + r"^Challenge\s+([\d/]+)\s*\(([\d,]+)\s*XP;\s*PB\s+\+(\d+)\)" +) + +SECTION_HEADERS = ("Traits", "Actions", "Bonus Actions", "Reactions", + "Legendary Actions", "Mythic Actions") + +# Page running header like "166APPENDIX B � MONSTERS..." -- marks the +# transition from stat-block content into prose on the next page. +RUNNING_HEADER_RE = re.compile(r"^\d+APPENDIX B\b") + +# Condition / status-word false positives that the title-case entry regex +# would otherwise mistake for a new entry name. These names commonly end a +# sentence inside an entry's body (e.g. "...while it is Bloodied."). 
+NAME_FALSE_POSITIVES = { + "Bloodied", "Restrained", "Grappled", "Charmed", "Frightened", + "Prone", "Incapacitated", "Stunned", "Paralyzed", "Petrified", + "Poisoned", "Blinded", "Deafened", "Invisible", "Unconscious", + "Exhaustion", "Surprised", "Furious", + "Failure", "Success", "Trigger", "Response", "Hit", "Miss", + "Habitat", "Treasure", "Bonus Actions", "Reactions", "Traits", "Actions", + "Disadvantage", "Advantage", +} + +# --- Helpers --- + + +def norm_dash(s: str) -> str: + return s.replace("–", "-").replace("—", "-").replace("−", "-") + + +def proficiency_bonus(cr_str: str) -> int: + if "/" in cr_str: + n, d = cr_str.split("/") + cr = int(n) / int(d) + else: + cr = int(cr_str) + if cr <= 4: + return 2 + if cr <= 8: + return 3 + if cr <= 12: + return 4 + if cr <= 16: + return 5 + if cr <= 20: + return 6 + if cr <= 24: + return 7 + if cr <= 28: + return 8 + return 9 + + +def make_creature_id(source: str, name: str) -> str: + slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-") + return f"{source.lower()}:{slug}" + + +def parse_passive_perception(senses_text: str) -> int | None: + # The PDF sometimes renders multi-digit values with a kerning space + # (e.g. "Passive Perception 1 1" meaning 11). Collapse those. 
+ m = re.search(r"Passive Perception\s+(\d(?:\s*\d)*)\s*$", senses_text) + if not m: + m = re.search(r"Passive Perception\s+(\d+)", senses_text) + return int(m.group(1).replace(" ", "")) if m else None + + +# --- Page extraction --- + + +def extract_pages(pdf_path: Path) -> str: + reader = PdfReader(str(pdf_path)) + parts = [] + for i in range(PAGE_START - 1, PAGE_END): + parts.append(reader.pages[i].extract_text()) + return "\n".join(parts) + + +# --- Block splitting --- + + +def find_stat_block_starts(lines: list[str]) -> list[int]: + starts = [] + for i, line in enumerate(lines): + if AC_RE.match(line.strip()): + header_idx = None + for j in range(i - 1, max(-1, i - 5), -1): + if HEADER_RE.match(lines[j].strip()): + header_idx = j + break + if header_idx is None: + continue + name_idx = header_idx - 1 + if name_idx >= 0 and lines[name_idx].strip(): + starts.append(name_idx) + return starts + + +SECTION_HEADER_SMUSH_RE = re.compile( + r"^(?P<body>.+?)\.(?P<hdr>Actions|Bonus Actions|Reactions|Legendary Actions|Traits)\s*$" +) + + +def block_for(lines: list[str], start: int, next_start: int | None) -> list[str]: + """Build the line list for one stat block. + + Drops page markers and everything from the first running-header line + onward (which marks the transition to a new prose page). Splits PDF + smush lines like "...plants.Actions" into two lines so section header + detection works. 
+ """ + end = next_start if next_start is not None else len(lines) + out: list[str] = [] + for ln in lines[start:end]: + if ln.startswith("===PAGE"): + continue + if RUNNING_HEADER_RE.match(ln.strip()): + break + m = SECTION_HEADER_SMUSH_RE.match(ln.strip()) + if m: + out.append(m.group("body") + ".") + out.append(m.group("hdr")) + else: + out.append(ln) + return out + + +# --- Vitals parsing --- + + +def parse_header(block: list[str]) -> dict: + name = block[0].strip() + header = block[1].strip() + m = HEADER_RE.match(header) + if not m: + raise ValueError(f"Bad header for {name!r}: {header!r}") + size, ctype, alignment = m.group(1), m.group(2).strip(), m.group(3).strip() + return {"name": name, "size": size, "type": ctype, "alignment": alignment} + + +def parse_ac(line: str) -> int: + m = AC_RE.match(line.strip()) + if not m: + raise ValueError(f"Bad AC line: {line!r}") + return int(m.group(1)) + + +def parse_hp(line: str) -> dict: + m = HP_RE.match(line.strip()) + if not m: + raise ValueError(f"Bad HP line: {line!r}") + return {"average": int(m.group(1)), "formula": m.group(2).strip()} + + +def parse_speed(line: str) -> str: + m = SPEED_RE.match(line.strip()) + if not m: + raise ValueError(f"Bad Speed line: {line!r}") + speed = m.group(1).rstrip(".").strip() + # Normalize "30 ft" → "30 ft." to match 5etools adapter output style. 
+ speed = re.sub(r"(\d+)\s+ft\b\.?", r"\1 ft.", speed) + return speed + + +def parse_abilities(row1: str, row2: str) -> dict: + out = {} + for row in (row1, row2): + m = ABILITY_ROW_RE.match(row.strip()) + if not m: + raise ValueError(f"Bad ability row: {row!r}") + for off in (0, 4, 8): + ab = m.group(off + 1).lower() + score = int(m.group(off + 2)) + out[ab] = score + return out + + +# --- Meta lines --- + + +META_KEYS = ("Skills", "Saving Throws", "Resistances", "Immunities", + "Vulnerabilities", "Senses", "Languages", "Gear") + + +def is_meta_start(line: str) -> str | None: + for key in META_KEYS: + if line.startswith(key + " ") or line.startswith(key + " "): + return key + return None + + +def parse_meta(lines: list[str], start: int) -> tuple[dict, int]: + meta: dict[str, str] = {} + i = start + current_key: str | None = None + current_val_parts: list[str] = [] + + def flush() -> None: + nonlocal current_key, current_val_parts + if current_key is not None: + meta[current_key] = " ".join(p.strip() for p in current_val_parts).strip() + current_key = None + current_val_parts = [] + + while i < len(lines): + line = lines[i].strip() + if not line: + i += 1 + continue + if line.startswith("Challenge "): + flush() + return meta, i + key = is_meta_start(line) + if key: + flush() + current_key = key + current_val_parts.append(line[len(key):].strip()) + elif current_key is not None: + current_val_parts.append(line) + i += 1 + flush() + return meta, i + + +# --- Section discovery --- + + +def find_section_starts(block: list[str], start_idx: int) -> list[tuple[str, int]]: + starts = [] + for i in range(start_idx, len(block)): + ln = block[i].strip() + if ln in SECTION_HEADERS: + starts.append((ln, i)) + return starts + + +def collect_section_lines(block: list[str], start: int, end: int) -> list[str]: + """Collect the raw lines for one section (between header indices).""" + out: list[str] = [] + for line in block[start:end]: + if not line.strip(): + continue + 
out.append(line.rstrip()) + return out + + +def join_section_text(lines: list[str]) -> str: + """Join section lines into a single text blob, repairing wrap hyphens.""" + text = " ".join(line.strip() for line in lines if line.strip()) + text = re.sub(r"\s+", " ", text) + # Repair "civi -li zation" → "civilization" (PDF column-wrap hyphens). + text = re.sub(r"(\w)\s*-\s+(\w)", r"\1\2", text) + return text.strip() + + +# --- Entry splitting --- + +# Entry name: title-case phrase, where each "word" is either a Capitalized +# word, a lowercase connector (of/the/and/or/in/at/on/to/with/from), a roman +# numeral, etc. Optionally followed by parenthesized modifier. +ENTRY_NAME_INNER = ( + r"[A-Z][A-Za-z'’]*" + r"(?:[ \-](?:[A-Z][A-Za-z'’]*|of|the|and|or|in|at|on|to|with|from))*" + r"(?:\s*\([^)]+\))?" +) +# An entry boundary occurs at the start of the joined section text, or +# immediately after a sentence-ending punctuation. The PDF sometimes drops +# the space between the period and the new entry name, so `\s*` is fine. +ENTRY_BOUNDARY = re.compile( + rf"(?:^|(?<=[\.\?\!]))\s*(?P<name>{ENTRY_NAME_INNER})\.\s+(?=[A-Z“\"(])" +) + +# Trim attribution quotes / page-header bleed-through from entry bodies. +PROSE_TAIL_PATTERNS = ( + # Em-dash attribution: " —Chondrus, Priest of Lutheria" + re.compile(r"\s+—\s*[A-Z][^—]*$"), + # Smushed section header at end ("...plants.Actions"). + re.compile( + r"\.\s*(?:Actions|Bonus\s+Actions|Reactions|Legendary\s+Actions|Traits)\s*$" + ), + # Curated prose subheadings / phrase markers that follow stat blocks in + # this book. PDF reflow often merges prose onto the same logical line + # as the last action body, so the leading whitespace is optional. 
+ re.compile( + r"\.?\s*(?:Random Trapped Creature|Maenad Bacchanal|The Phalanx Formation" + r"|Reinforced Portal|TRAPPED|HUNGER FOR|PURSUIT OF|RITUAL|MyTHIC|BRON" + r"|GOlDEN|NyMPH|MARBlE|KElEDONE|SOlDIER|MINOTAUR|SATyRS|GOATlING|EMPUS" + r"|ANARCH|GyGAN|CERBERUS|WHITE STAG|STORM|FEy|VOlKAN).*", + re.DOTALL, + ), + # Specific prose sentence-starts observed leaking in. + re.compile( + r"\.(?:will gleefully|Some report that|Storm Dory|This magic weapon" + r"|Thylean soldiers|Some claim|These leaders).*", + re.DOTALL, + ), + # All-caps run of 3+ uppercase letters in a word, then a space, then + # another word with 3+ uppercase letters (PDF small-caps section header + # like "BRON zE STRATEGOS", "MyTHIC BEAST", "GOlDEN RAM"). + re.compile(r"(?<=[\.\s])[A-Z]{2}\w*\s+[\w ]{0,12}[A-Z]{3}[A-Z\w ]*"), +) + + +def trim_prose_tail(body: str) -> str: + out = body + for pat in PROSE_TAIL_PATTERNS: + m = pat.search(out) + if m: + out = out[:m.start()].rstrip().rstrip(".") + "." + return out.strip() + + +def is_valid_entry_name(name: str) -> bool: + """Filter false-positive matches that aren't really entry names.""" + if name in NAME_FALSE_POSITIVES: + return False + # Single short capitalized word that's a common condition or noun is + # usually a false positive when followed by a period. Real entry names + # almost always have either multiple words or a parenthesized modifier. 
+ bare = re.sub(r"\s*\([^)]+\)\s*", "", name).strip() + if bare in NAME_FALSE_POSITIVES: + return False + return True + + +def split_text_into_entries(text: str) -> list[tuple[str, str]]: + """Split section text into (name, body) entries by scanning for entry-name + boundaries (start-of-text or after a sentence period).""" + matches: list[tuple[int, int, str]] = [] + for m in ENTRY_BOUNDARY.finditer(text): + name = m.group("name").strip() + if is_valid_entry_name(name): + matches.append((m.start(), m.end(), name)) + if not matches: + return [] + entries: list[tuple[str, str]] = [] + for i, (_, body_start, name) in enumerate(matches): + body_end = matches[i + 1][0] if i + 1 < len(matches) else len(text) + body = text[body_start:body_end].strip() + entries.append((name, body)) + return entries + + +def parse_section_traits(lines: list[str]) -> list[dict]: + text = join_section_text(lines) + entries = split_text_into_entries(text) + out = [] + for name, body in entries: + body = trim_prose_tail(body) + if body or name: + out.append({"name": name, + "segments": [{"type": "text", "value": body}]}) + return out + + +def parse_legendary(lines: list[str], creature_name: str) -> dict | None: + """Parse the Legendary Actions section. Text before the first entry whose + body contains action vocabulary forms the preamble. 
+ """ + text = join_section_text(lines) + all_matches: list[tuple[int, int, str]] = [] + for m in ENTRY_BOUNDARY.finditer(text): + name = m.group("name").strip() + if is_valid_entry_name(name): + all_matches.append((m.start(), m.end(), name)) + + action_anchors = ("Saving Throw", "Attack Roll", "Trigger", "Recharge", + "Melee", "Ranged", "Constitution", "Dexterity", + "Strength", "Intelligence", "Wisdom", "Charisma") + first_action_idx = None + for i, (_, body_start, _) in enumerate(all_matches): + body_end = all_matches[i + 1][0] if i + 1 < len(all_matches) else len(text) + body_head = text[body_start:min(body_end, body_start + 100)] + if any(a in body_head for a in action_anchors): + first_action_idx = i + break + if first_action_idx is None: + return None + preamble = text[:all_matches[first_action_idx][0]].strip() + if not preamble: + preamble = f"{creature_name} can take Legendary Actions." + entries = [] + for i in range(first_action_idx, len(all_matches)): + _, body_start, name = all_matches[i] + body_end = all_matches[i + 1][0] if i + 1 < len(all_matches) else len(text) + body = text[body_start:body_end].strip() + entries.append((name, body)) + if not entries: + return None + return { + "preamble": preamble, + "entries": [ + {"name": name, + "segments": [{"type": "text", "value": trim_prose_tail(body)}]} + for name, body in entries if body + ], + } + + +# --- Top-level parse --- + + +def parse_block(block: list[str]) -> dict: + head = parse_header(block) + ac = parse_ac(block[2]) + hp = parse_hp(block[3]) + speed = parse_speed(block[4]) + if not block[5].strip().startswith("MOD"): + raise ValueError(f"Expected MOD header, got: {block[5]!r}") + abilities = parse_abilities(block[6], block[7]) + + meta, ch_idx = parse_meta(block, 8) + cr_match = CR_RE.match(block[ch_idx].strip()) + if not cr_match: + raise ValueError(f"Bad Challenge line: {block[ch_idx]!r}") + cr_str = cr_match.group(1) + + section_starts = find_section_starts(block, ch_idx + 1) + sections: 
dict[str, list[str]] = {} + for i, (name, idx) in enumerate(section_starts): + end = section_starts[i + 1][1] if i + 1 < len(section_starts) else len(block) + sections[name] = collect_section_lines(block, idx + 1, end) + + creature: dict = { + "id": make_creature_id(SOURCE_CODE, head["name"]), + "name": head["name"], + "source": SOURCE_CODE, + "sourceDisplayName": SOURCE_DISPLAY, + "size": head["size"], + "type": head["type"], + "alignment": head["alignment"], + "ac": ac, + "hp": hp, + "speed": speed, + "abilities": abilities, + "cr": cr_str, + "initiativeProficiency": 0, + "proficiencyBonus": proficiency_bonus(cr_str), + "passive": parse_passive_perception(meta.get("Senses", "")) or 10, + } + + if "Saving Throws" in meta: + creature["savingThrows"] = meta["Saving Throws"] + if "Skills" in meta: + creature["skills"] = meta["Skills"] + if "Resistances" in meta: + creature["resist"] = meta["Resistances"] + if "Immunities" in meta: + creature["immune"] = meta["Immunities"] + if "Vulnerabilities" in meta: + creature["vulnerable"] = meta["Vulnerabilities"] + if "Senses" in meta: + senses = re.sub(r"[;,]?\s*Passive Perception\s+\d+\s*$", "", meta["Senses"]) + senses = senses.strip().rstrip(";").strip() + if senses: + creature["senses"] = senses + if "Languages" in meta: + creature["languages"] = meta["Languages"] + + if "Traits" in sections: + creature["traits"] = parse_section_traits(sections["Traits"]) + if "Actions" in sections: + creature["actions"] = parse_section_traits(sections["Actions"]) + if "Bonus Actions" in sections: + creature["bonusActions"] = parse_section_traits(sections["Bonus Actions"]) + if "Reactions" in sections: + creature["reactions"] = parse_section_traits(sections["Reactions"]) + if "Legendary Actions" in sections: + leg = parse_legendary(sections["Legendary Actions"], head["name"]) + if leg: + creature["legendaryActions"] = leg + + return creature + + +def main() -> int: + if len(sys.argv) != 2: + print("Usage: python3 extract-great-labors.py 
", + file=sys.stderr) + return 1 + pdf_path = Path(os.path.expanduser(sys.argv[1])) + if not pdf_path.exists(): + print(f"PDF not found: {pdf_path}", file=sys.stderr) + return 1 + + text = extract_pages(pdf_path) + lines = text.split("\n") + + starts = find_stat_block_starts(lines) + print(f"Detected {len(starts)} stat blocks", file=sys.stderr) + + creatures = [] + failures = [] + for i, s in enumerate(starts): + next_s = starts[i + 1] if i + 1 < len(starts) else None + block = block_for(lines, s, next_s) + try: + creatures.append(parse_block(block)) + except Exception as e: + failures.append((block[0] if block else "", str(e))) + + if failures: + print(f"\n{len(failures)} parse failures:", file=sys.stderr) + for name, err in failures: + print(f" - {name}: {err}", file=sys.stderr) + + out_path = Path(__file__).resolve().parent.parent / "data" / "bestiary" / "dnd-bundled.json" + out_path.parent.mkdir(parents=True, exist_ok=True) + with out_path.open("w") as f: + json.dump(creatures, f, indent="\t", ensure_ascii=False) + f.write("\n") + print(f"Wrote {len(creatures)} creatures to {out_path}", file=sys.stderr) + return 0 if not failures else 2 + + +if __name__ == "__main__": + sys.exit(main())
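`main()` in `scripts/extract-great-labors.py` writes `dnd-bundled.json` wholesale, whereas SKILL.md asks per-book extractors to merge: read the existing file, drop entries with the same `source`, then append the new ones. That merge can be sketched as follows. This is an illustration only; `merge_bundled` and `write_merged` are hypothetical names that do not appear in the diff, and the output formatting simply mirrors the `json.dump` call in `main()`.

```python
import json
from pathlib import Path


def merge_bundled(existing: list[dict], new: list[dict], source: str) -> list[dict]:
    """Drop existing entries for `source`, then append the freshly extracted ones."""
    kept = [c for c in existing if c.get("source") != source]
    return kept + new


def write_merged(out_path: Path, new: list[dict], source: str) -> None:
    """Read the current bundle (if any), merge in `new`, and write it back.

    Hypothetical replacement for the plain json.dump at the end of main().
    """
    if out_path.exists():
        existing = json.loads(out_path.read_text(encoding="utf-8"))
    else:
        existing = []
    merged = merge_bundled(existing, new, source)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        # Same formatting as the extractor: tab indent, non-ASCII preserved.
        json.dump(merged, f, indent="\t", ensure_ascii=False)
        f.write("\n")
```

Because entries for the same source are dropped before appending, re-running an extractor for one book replaces that book's creatures and leaves other books' entries in place.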