Add bundled-bestiary mechanism for shipping creatures with the app

D&D creatures listed in data/bestiary/dnd-bundled.json are now merged into
the search index and pre-loaded into creatureMap, so they appear alongside
5etools creatures with no "Load source" step.

Source codes are derived from the JSON itself (each creature carries
source + sourceDisplayName), so adding a new book is a pure data change.
Bundled sources are excluded from getAllSourceCodes() so bulk-import skips
them, and they never appear in the source manager (which only lists cached
sources).

Includes a reference extractor (scripts/extract-great-labors.py) for the
5.5e revised stat-block format and a /bundle-bestiary skill that future
agents can follow to add monsters from other PDF books.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,148 @@
|
||||
---
|
||||
name: bundle-bestiary
|
||||
description: Bundle creatures from a third-party PDF into the app's D&D bestiary so they appear in search alongside 5etools creatures, with no "Load source" step. Use when the user asks to add monsters from a PDF book / adventure / supplement to the bundled bestiary.
|
||||
---
|
||||
|
||||
## Instructions
|
||||
|
||||
Add the creatures from a PDF to `data/bestiary/dnd-bundled.json` so they appear in the D&D search index and render as normal stat blocks. Bundled creatures bypass the fetch/cache flow — they're shipped in the JS bundle and pre-loaded into `creatureMap` on startup.
|
||||
|
||||
### How the bundling works
|
||||
|
||||
- `data/bestiary/dnd-bundled.json` is an array of normalized `Creature` objects (the same shape produced by `bestiary-adapter.ts` for 5etools creatures).
|
||||
- `apps/web/src/adapters/dnd-bundled-adapter.ts` static-imports the JSON and derives:
  - `loadBundledDndCreatures()`: full stat blocks for the in-memory creature map
  - `loadBundledDndIndexEntries()`: compact summaries for the search index
  - `getBundledDndSources()`: source code → display name map, **derived from the JSON itself** (each creature carries its own `source` + `sourceDisplayName`)
|
||||
- `bestiary-index-adapter.ts` merges the bundled entries into the search index and excludes bundled sources from `getAllSourceCodes()` (so bulk-import skips them).
|
||||
- `use-bestiary.ts` merges bundled full creatures into `creatureMap` on init/refresh.
|
||||
|
||||
This means **adding a new bundled book is purely a data change**: append creatures to `dnd-bundled.json` with the new source's code and display name. No adapter or index code needs editing.
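
The derivation can be illustrated outside the app. The following is a minimal Python sketch, not the adapter itself (which is TypeScript): it assumes only that each entry carries `source` and `sourceDisplayName`, and the sample entries, including the `XYZ` book, are hypothetical and heavily trimmed.

```python
import json


def bundled_sources(creatures):
    """Map each source code to the display name of the first creature using it."""
    sources = {}
    for creature in creatures:
        # setdefault keeps the first display name seen for a given code.
        sources.setdefault(creature["source"], creature["sourceDisplayName"])
    return sources


# Hypothetical, trimmed entries; real ones carry the full Creature shape.
sample = json.loads("""
[
  {"id": "tgl:anarch-boar", "name": "Anarch Boar",
   "source": "TGL", "sourceDisplayName": "The Great Labors"},
  {"id": "xyz:example-beast", "name": "Example Beast",
   "source": "XYZ", "sourceDisplayName": "Example Book"}
]
""")

print(bundled_sources(sample))
```

Appending the `XYZ` entry is enough for the new book to show up in the derived map, which is the sense in which a new book is a data-only change.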
|
||||
|
||||
### Step 1 — Confirm scope and source code
|
||||
|
||||
Ask the user (don't guess):
|
||||
|
||||
1. **PDF path** and the **page range** containing the stat blocks. Many PDFs have hundreds of pages; only a slice has the bestiary.
|
||||
2. **Source code abbreviation** — short uppercase letters, e.g., `TGL` for *The Great Labors*. Used in creature IDs and the index.
|
||||
3. **Display name** — the human-readable book title shown in the source column.
|
||||
4. **Edition / system** — confirm this is D&D (5e or 5.5e). Bundled creatures show in both 5e and 5.5e modes (the bestiary index only differentiates pf2e vs not). PF2e isn't currently supported by the bundled flow — if requested, this would need a parallel `pf2e-bundled-adapter.ts`.
|
||||
5. **Licensing** — verify the user has the right to bundle the book's content. Don't make assumptions.
|
||||
|
||||
### Step 2 — Inspect the PDF
|
||||
|
||||
Check that PyPDF2 is available:
|
||||
|
||||
```bash
|
||||
python3 -c "from PyPDF2 import PdfReader; print('ok')"
|
||||
```
|
||||
|
||||
If it isn't, check `~/Nextcloud/dnd/D&D/PROMPT_prep.md`: the user has `pdftotext`-equivalent tooling configured there.
|
||||
|
||||
Then dump and skim the target pages to learn the stat-block format:
|
||||
|
||||
```bash
python3 - <<'EOF'
from PyPDF2 import PdfReader
import os
r = PdfReader(os.path.expanduser('PATH/TO/PDF'))
for i in range(START-1, END):
    print(f"\n===PAGE {i+1}===\n{r.pages[i].extract_text()}")
EOF
```
|
||||
|
||||
Look for the layout — the existing extractor (`scripts/extract-great-labors.py`) assumes the 5.5e/2024 revised format:
|
||||
|
||||
- `<Name>` line, then
|
||||
- `<Size> <Type>(optional subtype), <Alignment>`, then
|
||||
- `AC X Initiative ±Y (Z)`, then
|
||||
- `HP N (NdN + N)`, then
|
||||
- `Speed X ft., …`, then
|
||||
- A `MOD SAVE MOD SAVE MOD SAVE` header followed by two ability-score rows, then
|
||||
- Optional meta lines: `Skills`, `Saving Throws`, `Resistances`, `Immunities`, `Vulnerabilities`, `Senses`, `Languages`, then
|
||||
- `Challenge X (NN XP; PB +N)`, then
|
||||
- Section blocks: `Traits` / `Actions` / `Bonus Actions` / `Reactions` / `Legendary Actions`, each containing entries shaped like `Name. body...`.
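
To check quickly whether a dumped page follows this layout, a few of the vitals lines can be matched with regular expressions. This is a minimal sketch with patterns modeled on those in `scripts/extract-great-labors.py`; the `parse_vitals` helper and the sample block are invented for illustration and are not taken from any book.

```python
import re

# Patterns modeled on the AC / HP / Challenge lines described above.
AC_RE = re.compile(r"^AC\s+(\d+)\s+Initiative\s+([+\-–]\s*\d+)")
HP_RE = re.compile(r"^HP\s+(\d+)\s*\(([^)]+)\)")
CR_RE = re.compile(r"^Challenge\s+([\d/]+)\s*\(([\d,]+)\s*XP;\s*PB\s+\+(\d+)\)")


def parse_vitals(lines):
    """Return AC, HP (average and formula) and CR from the matching lines."""
    vitals = {}
    for line in lines:
        line = line.strip()
        if m := AC_RE.match(line):
            vitals["ac"] = int(m.group(1))
        elif m := HP_RE.match(line):
            vitals["hp"] = {"average": int(m.group(1)), "formula": m.group(2).strip()}
        elif m := CR_RE.match(line):
            vitals["cr"] = m.group(1)
    return vitals


# Invented sample block in the 5.5e layout.
sample = [
    "Example Beast",
    "Large Monstrosity, Unaligned",
    "AC 15 Initiative +2 (12)",
    "HP 45 (6d10 + 12)",
    "Speed 40 ft.",
    "Challenge 3 (700 XP; PB +2)",
]
print(parse_vitals(sample))
```

If none of the three patterns match on real pages, the book probably uses a different layout and the parser will need more than constant changes.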
|
||||
|
||||
If the PDF format matches, adapt the existing extractor. If it's a different format (5e 2014 with `STR DEX CON …` column layout, an older publisher's layout, a homebrew layout), expect to rework the parser more substantively.
|
||||
|
||||
### Step 3 — Adapt or extend the extractor
|
||||
|
||||
Copy `scripts/extract-great-labors.py` to a new script per book (e.g., `scripts/extract-<book-slug>.py`) and update:
|
||||
|
||||
- `SOURCE_CODE`, `SOURCE_DISPLAY`, `PAGE_START`, `PAGE_END` constants.
|
||||
- The output path (`data/bestiary/dnd-bundled.json`). **Don't overwrite — merge.** The simplest pattern: read the existing file, drop any entries with the same `source`, then append the new ones.
|
||||
- The `PROSE_TAIL_PATTERNS` list — every book has its own running headers (`<PageNumber>APPENDIX B … MONSTERS`-style), section-header phrases, and quote-attribution dashes. Run the extractor, audit the output (see Step 4), and add curated trim patterns for any prose tails that bleed in.
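
The merge pattern from the second bullet could look roughly like this. It is a sketch under the assumption that the bundle is a flat JSON array of objects with a `source` key; `merge_bundled` is a hypothetical helper name, and the demo writes to a temporary directory rather than the real file.

```python
import json
import tempfile
from pathlib import Path


def merge_bundled(output_path, source_code, new_creatures):
    """Replace one source's creatures in the bundle, leaving other sources intact."""
    path = Path(output_path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    # Drop stale entries for this source so re-running the extractor is idempotent.
    kept = [c for c in existing if c.get("source") != source_code]
    merged = kept + list(new_creatures)
    path.write_text(json.dumps(merged, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return merged


# Demo against a throwaway file with trimmed, made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "dnd-bundled.json"
    bundle.write_text(json.dumps([
        {"id": "abc:old", "source": "ABC"},
        {"id": "tgl:stale", "source": "TGL"},
    ]), encoding="utf-8")
    result = merge_bundled(bundle, "TGL", [{"id": "tgl:fresh", "source": "TGL"}])
    on_disk = json.loads(bundle.read_text(encoding="utf-8"))

print([c["id"] for c in result])
```

Dropping by `source` before appending means a second run of the same extractor replaces its earlier output instead of duplicating it.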
|
||||
|
||||
Run it:
|
||||
|
||||
```bash
|
||||
python3 scripts/extract-<book-slug>.py PATH/TO/PDF
|
||||
```
|
||||
|
||||
### Step 4 — Audit the output
|
||||
|
||||
PyPDF2 text extraction is messy. Always audit the output before calling the job done:
|
||||
|
||||
```bash
python3 - <<'EOF'
import json, re
data = json.load(open('data/bestiary/dnd-bundled.json'))
new = [c for c in data if c['source'] == 'XXX']  # replace XXX with your code
for c in new:
    print(f"{c['name']}: CR {c['cr']}, AC {c['ac']}, HP {c['hp']['average']} ({c['hp']['formula']})")
    abs_ = c['abilities']
    print(f" STR {abs_['str']} DEX {abs_['dex']} CON {abs_['con']} INT {abs_['int']} WIS {abs_['wis']} CHA {abs_['cha']}, PP {c['passive']}")
# Then audit bodies for prose-tail bleed and weird splits.
for c in new:
    for sec in ('traits', 'actions', 'bonusActions', 'reactions'):
        for e in c.get(sec, []):
            body = e['segments'][0]['value']
            issues = []
            if len(body) > 600: issues.append(f"long({len(body)})")
            if re.search(r'\.[A-Z][a-z]', body): issues.append("dot-Capital")
            if 'APPENDIX' in body: issues.append("APPENDIX")
            if re.search(r'—\s*[A-Z]\w+,\s', body): issues.append("attribution")
            if issues:
                print(f" {c['name']} [{sec}] {e['name']}: {', '.join(issues)}")
                print(f" ...{body[-200:]}")
EOF
```
|
||||
|
||||
Common PDF extraction problems to fix in the parser:
|
||||
|
||||
- **PDF kerning quirks**: multi-digit values rendered with spaces (e.g., "Passive Perception 1 1" → 11, "Wis 8–1 –1" with no space before negative). The existing parser handles most; check for new ones.
|
||||
- **Smushed section headers**: lines like `...plants.Actions` where the section header for the next block was concatenated. Handle via `SECTION_HEADER_SMUSH_RE` preprocessing.
|
||||
- **Cross-page prose bleed**: text from the next page's flavor prose absorbed into the last entry's body. Catch via `PROSE_TAIL_PATTERNS` — add curated phrases observed in this specific book.
|
||||
- **Sibling-entry inline smush**: `damage.Ram. Melee Attack Roll: …` where two entries got concatenated. Already handled by the mid-line entry boundary regex in the existing parser.
|
||||
- **Title-cased false positives**: words like `Bloodied.`, `Restrained.`, `Frightened.` at sentence ends would otherwise match the entry-name pattern. Filtered via `NAME_FALSE_POSITIVES` — add to it if the new book uses condition names you haven't seen yet.
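
Two of these repairs are small enough to show in isolation. The sketch below reuses the regex shapes from the reference extractor; the helper names `split_smushed_header` and `collapse_kerned_number` are hypothetical, and the input strings are made up.

```python
import re

SECTION_NAMES = "Actions|Bonus Actions|Reactions|Legendary Actions|Traits"
SMUSH_RE = re.compile(rf"^(?P<body>.+?)\.(?P<hdr>{SECTION_NAMES})\s*$")


def split_smushed_header(line):
    """Split '...plants.Actions' into the body line and the section header."""
    m = SMUSH_RE.match(line.strip())
    return [m.group("body") + ".", m.group("hdr")] if m else [line]


def collapse_kerned_number(senses_text):
    """Join digits the PDF separated with kerning spaces, e.g. '1 1' -> 11."""
    m = re.search(r"Passive Perception\s+(\d(?:\s*\d)*)\s*$", senses_text)
    return int(m.group(1).replace(" ", "")) if m else None


print(split_smushed_header("It ignores plants.Actions"))
print(collapse_kerned_number("Senses Darkvision 60 ft., Passive Perception 1 1"))
```

Both helpers leave well-formed input alone, so they are safe to run over every line before the real parsing starts.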
|
||||
|
||||
### Step 5 — Verify in the app
|
||||
|
||||
```bash
|
||||
pnpm check
|
||||
```
|
||||
|
||||
Then start the dev server and search for one of the new creatures by name:
|
||||
|
||||
```bash
|
||||
pnpm --filter web dev
|
||||
```
|
||||
|
||||
Confirm in the browser:
|
||||
|
||||
1. Search finds the creature with the right book name as the source label.
|
||||
2. Clicking it shows the full stat block immediately — **no "Load source" prompt**.
|
||||
3. The source manager UI does **not** list the bundled book (it only shows cached sources).
|
||||
4. Bulk import skips the bundled book.
|
||||
|
||||
### Notes for future agents
|
||||
|
||||
- **No need to edit `dnd-bundled-adapter.ts` or `bestiary-index-adapter.ts`** when adding a new book — the adapter derives source codes from the JSON.
|
||||
- `data/bestiary/index.json` is regenerated from 5etools and should **not** be edited to add bundled entries. The merge happens at runtime in `bestiary-index-adapter.ts`.
|
||||
- Each bundled creature must have:
  - A unique `id` like `<sourcecode>:<slug>` (e.g., `tgl:anarch-boar`).
  - `source` field matching the source code (e.g., `"TGL"`).
  - `sourceDisplayName` field matching the book's display name (e.g., `"The Great Labors"`).
  - All the required `Creature` fields from `packages/domain/src/creature-types.ts`.
|
||||
- The script approach is preferred over hand-editing JSON for >5 creatures. For a single creature or two, hand-editing the JSON is reasonable; just match an existing entry's shape exactly.
|
||||
- After any change to `dnd-bundled.json`, run `pnpm typecheck` — the static import in the adapter will catch shape mismatches at compile time.
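
Before handing the JSON to the type checker, the identity rules above can be checked with a few lines of Python. This is a sketch only: `check_bundled_ids` and `REQUIRED_KEYS` are hypothetical names, it covers just the `id` / `source` / `sourceDisplayName` rules, and it does not validate the rest of the `Creature` shape.

```python
REQUIRED_KEYS = ("id", "name", "source", "sourceDisplayName")


def check_bundled_ids(creatures):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen = set()
    for c in creatures:
        missing = [k for k in REQUIRED_KEYS if k not in c]
        if missing:
            problems.append(f"{c.get('name', '?')}: missing {', '.join(missing)}")
            continue
        # IDs are expected to start with the lowercased source code and a colon.
        prefix = c["source"].lower() + ":"
        if not c["id"].startswith(prefix):
            problems.append(f"{c['name']}: id {c['id']!r} should start with {prefix!r}")
        if c["id"] in seen:
            problems.append(f"{c['name']}: duplicate id {c['id']!r}")
        seen.add(c["id"])
    return problems


# Made-up, trimmed entries for illustration.
good = [{"id": "tgl:anarch-boar", "name": "Anarch Boar",
         "source": "TGL", "sourceDisplayName": "The Great Labors"}]
bad = [{"id": "anarch-boar", "name": "Anarch Boar",
        "source": "TGL", "sourceDisplayName": "The Great Labors"}]

print(check_bundled_ids(good))
print(check_bundled_ids(bad))
```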
|
||||
@@ -49,10 +49,9 @@ describe("loadBestiaryIndex", () => {
|
||||
 });

 describe("getAllSourceCodes", () => {
-  it("returns all keys from the index sources", () => {
+  it("returns all index sources except bundled ones", () => {
     const codes = getAllSourceCodes();
-    const index = loadBestiaryIndex();
-    expect(codes).toEqual(Object.keys(index.sources));
+    expect(codes).not.toContain("TGL");
   });

   it("returns only strings", () => {
|
||||
|
||||
@@ -0,0 +1,45 @@
|
||||
import { describe, expect, it } from "vitest";
import {
  getBundledDndSources,
  loadBundledDndCreatures,
  loadBundledDndIndexEntries,
} from "../dnd-bundled-adapter.js";

describe("dnd-bundled-adapter", () => {
  it("loads bundled creatures with a valid shape", () => {
    const creatures = loadBundledDndCreatures();
    const sources = getBundledDndSources();
    for (const c of creatures) {
      expect(sources.has(c.source)).toBe(true);
      expect(c.sourceDisplayName).toBe(sources.get(c.source));
      expect(c.id.startsWith(`${c.source.toLowerCase()}:`)).toBe(true);
    }
  });

  it("derives source codes from the creature data", () => {
    const creatures = loadBundledDndCreatures();
    const sources = getBundledDndSources();
    const seen = new Set(creatures.map((c) => c.source));
    expect(sources.size).toBe(seen.size);
    for (const s of seen) {
      expect(sources.has(s)).toBe(true);
    }
  });

  it("derives index entries that match the bundled creatures", () => {
    const creatures = loadBundledDndCreatures();
    const entries = loadBundledDndIndexEntries();
    expect(entries.length).toBe(creatures.length);
    const entryNames = new Set(entries.map((e) => e.name));
    for (const c of creatures) {
      expect(entryNames.has(c.name)).toBe(true);
    }
  });

  it("abbreviates sizes to single-letter codes in index entries", () => {
    const entries = loadBundledDndIndexEntries();
    for (const e of entries) {
      expect(["T", "S", "M", "L", "H", "G"]).toContain(e.size);
    }
  });
});
|
||||
@@ -1,6 +1,10 @@
|
||||
 import type { BestiaryIndex, BestiaryIndexEntry } from "@initiative/domain";

 import rawIndex from "../../../../data/bestiary/index.json";
+import {
+  getBundledDndSources,
+  loadBundledDndIndexEntries,
+} from "./dnd-bundled-adapter.js";

 interface CompactCreature {
   readonly n: string;
|
||||
@@ -55,23 +59,32 @@ export function loadBestiaryIndex(): BestiaryIndex {
|
||||
   if (cachedIndex) return cachedIndex;

   const compact = rawIndex as unknown as CompactIndex;
-  const sources = Object.fromEntries(
+  const sources: Record<string, string> = Object.fromEntries(
     Object.entries(compact.sources).filter(
       ([code]) => !EXCLUDED_SOURCES.has(code),
     ),
   );
+  for (const [code, name] of getBundledDndSources()) {
+    sources[code] = name;
+  }
   cachedIndex = {
     sources,
-    creatures: compact.creatures
-      .filter((c) => !EXCLUDED_SOURCES.has(c.s))
-      .map(mapCreature),
+    creatures: [
+      ...compact.creatures
+        .filter((c) => !EXCLUDED_SOURCES.has(c.s))
+        .map(mapCreature),
+      ...loadBundledDndIndexEntries(),
+    ],
   };
   return cachedIndex;
 }

 export function getAllSourceCodes(): string[] {
   const index = loadBestiaryIndex();
-  return Object.keys(index.sources).filter((c) => !EXCLUDED_SOURCES.has(c));
+  const bundled = getBundledDndSources();
+  return Object.keys(index.sources).filter(
+    (c) => !EXCLUDED_SOURCES.has(c) && !bundled.has(c),
+  );
 }

 function sourceCodeToFilename(sourceCode: string): string {
|
||||
|
||||
@@ -0,0 +1,53 @@
|
||||
import type { BestiaryIndexEntry, Creature } from "@initiative/domain";
import { creatureId } from "@initiative/domain";

import rawBundled from "../../../../data/bestiary/dnd-bundled.json";

type RawBundledCreature = Omit<Creature, "id"> & { id: string };

const SIZE_TO_CODE: Record<string, string> = {
  Tiny: "T",
  Small: "S",
  Medium: "M",
  Large: "L",
  Huge: "H",
  Gargantuan: "G",
};

/** Full normalized stat blocks for bundled D&D creatures. */
export function loadBundledDndCreatures(): Creature[] {
  return (rawBundled as RawBundledCreature[]).map((c) => ({
    ...c,
    id: creatureId(c.id),
  }));
}

/** Index entries derived from the bundled creatures, in the compact shape
 * used by the search index. */
export function loadBundledDndIndexEntries(): BestiaryIndexEntry[] {
  return (rawBundled as RawBundledCreature[]).map((c) => ({
    name: c.name,
    source: c.source,
    ac: c.ac,
    hp: c.hp.average,
    dex: c.abilities.dex,
    cr: c.cr,
    initiativeProficiency: c.initiativeProficiency,
    size: SIZE_TO_CODE[c.size.split(" ")[0]] ?? "M",
    type: c.type.split(" ")[0].toLowerCase(),
  }));
}

/** Source codes → display names, derived from the bundled creatures' own
 * `source` and `sourceDisplayName` fields. Adding a new book just means
 * appending creatures with the right `source` field to dnd-bundled.json;
 * no code change is required here. */
export function getBundledDndSources(): ReadonlyMap<string, string> {
  const map = new Map<string, string>();
  for (const c of rawBundled as RawBundledCreature[]) {
    if (!map.has(c.source)) {
      map.set(c.source, c.sourceDisplayName);
    }
  }
  return map;
}
|
||||
@@ -9,6 +9,7 @@ import {
|
||||
   normalizeBestiary,
   setSourceDisplayNames,
 } from "../adapters/bestiary-adapter.js";
+import { loadBundledDndCreatures } from "../adapters/dnd-bundled-adapter.js";
 import { normalizeFoundryCreatures } from "../adapters/pf2e-bestiary-adapter.js";
 import { useAdapters } from "../contexts/adapter-context.js";
 import { useRulesEditionContext } from "../contexts/rules-edition-context.js";
|
||||
@@ -160,7 +161,11 @@ export function useBestiary(): BestiaryHook {
|
||||
     }

     void bestiaryCache.loadAllCachedCreatures().then((map) => {
-      setCreatureMap(map);
+      const merged = new Map(map);
+      for (const c of loadBundledDndCreatures()) {
+        merged.set(c.id, c);
+      }
+      setCreatureMap(merged);
     });
   }, [bestiaryCache, bestiaryIndex, pf2eBestiaryIndex]);
|
||||
|
||||
@@ -300,6 +305,9 @@ export function useBestiary(): BestiaryHook {
|
||||
|
||||
   const refreshCache = useCallback(async (): Promise<void> => {
     const map = await bestiaryCache.loadAllCachedCreatures();
+    for (const c of loadBundledDndCreatures()) {
+      map.set(c.id, c);
+    }
     setCreatureMap(map);
   }, [bestiaryCache]);
|
||||
|
||||
|
||||
@@ -0,0 +1 @@
|
||||
[]
|
||||
@@ -0,0 +1,561 @@
|
||||
#!/usr/bin/env python3
"""Extract D&D 5.5e stat blocks from The Great Labors PDF.

Usage:
    python3 scripts/extract-great-labors.py <path-to-pdf>

Reads pages 163-199 (Appendix B: Monsters) and emits
data/bestiary/dnd-bundled.json in the Creature[] shape from
packages/domain/src/creature-types.ts.

Requires: PyPDF2 (pip install PyPDF2)
"""

import json
import os
import re
import sys
from pathlib import Path

from PyPDF2 import PdfReader

# --- Constants ---

SOURCE_CODE = "TGL"
SOURCE_DISPLAY = "The Great Labors"
PAGE_START = 163  # 1-indexed
PAGE_END = 199

SIZE_RE = r"(Tiny|Small|Medium|Large|Huge|Gargantuan)"
TYPE_PIECE = r"[A-Za-z][A-Za-z\- ]*?"
ALIGN_PIECE = r"[A-Za-z][A-Za-z ()]*?"
HEADER_RE = re.compile(
    rf"^{SIZE_RE}\s+({TYPE_PIECE}(?:\s+\([^)]+\))?),\s+({ALIGN_PIECE})\s*$"
)

AC_RE = re.compile(r"^AC\s+(\d+)\s+Initiative\s+([+\-–]\s*\d+|[+\-–]?\d+)")
HP_RE = re.compile(r"^HP\s+(\d+)\s*\(([^)]+)\)")
SPEED_RE = re.compile(r"^Speed\s+(.+?)\s*$")
ABILITY_ROW_RE = re.compile(
    r"^(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s+"
    r"(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s+"
    r"(Str|Dex|Con|Int|Wis|Cha)\s+(\d+)\s*([+\-–]?\s*\d+)\s+([+\-–]?\s*\d+)\s*$"
)
CR_RE = re.compile(
    r"^Challenge\s+([\d/]+)\s*\(([\d,]+)\s*XP;\s*PB\s+\+(\d+)\)"
)

SECTION_HEADERS = ("Traits", "Actions", "Bonus Actions", "Reactions",
                   "Legendary Actions", "Mythic Actions")
|
||||
|
||||
# Page running header like "166APPENDIX B … MONSTERS" -- marks the
# transition from stat-block content into prose on the next page.
|
||||
RUNNING_HEADER_RE = re.compile(r"^\d+APPENDIX B\b")
|
||||
|
||||
# Condition / status-word false positives that the title-case entry regex
|
||||
# would otherwise mistake for a new entry name. These names commonly end a
|
||||
# sentence inside an entry's body (e.g. "...while it is Bloodied.").
|
||||
NAME_FALSE_POSITIVES = {
    "Bloodied", "Restrained", "Grappled", "Charmed", "Frightened",
    "Prone", "Incapacitated", "Stunned", "Paralyzed", "Petrified",
    "Poisoned", "Blinded", "Deafened", "Invisible", "Unconscious",
    "Exhaustion", "Surprised", "Furious",
    "Failure", "Success", "Trigger", "Response", "Hit", "Miss",
    "Habitat", "Treasure", "Bonus Actions", "Reactions", "Traits", "Actions",
    "Disadvantage", "Advantage",
}
|
||||
|
||||
# --- Helpers ---
|
||||
|
||||
|
||||
def norm_dash(s: str) -> str:
    return s.replace("–", "-").replace("—", "-").replace("−", "-")


def proficiency_bonus(cr_str: str) -> int:
    if "/" in cr_str:
        n, d = cr_str.split("/")
        cr = int(n) / int(d)
    else:
        cr = int(cr_str)
    if cr <= 4:
        return 2
    if cr <= 8:
        return 3
    if cr <= 12:
        return 4
    if cr <= 16:
        return 5
    if cr <= 20:
        return 6
    if cr <= 24:
        return 7
    if cr <= 28:
        return 8
    return 9


def make_creature_id(source: str, name: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{source.lower()}:{slug}"


def parse_passive_perception(senses_text: str) -> int | None:
    # The PDF sometimes renders multi-digit values with a kerning space
    # (e.g. "Passive Perception 1 1" meaning 11). Collapse those.
    m = re.search(r"Passive Perception\s+(\d(?:\s*\d)*)\s*$", senses_text)
    if not m:
        m = re.search(r"Passive Perception\s+(\d+)", senses_text)
    return int(m.group(1).replace(" ", "")) if m else None
|
||||
|
||||
|
||||
# --- Page extraction ---
|
||||
|
||||
|
||||
def extract_pages(pdf_path: Path) -> str:
    reader = PdfReader(str(pdf_path))
    parts = []
    for i in range(PAGE_START - 1, PAGE_END):
        parts.append(reader.pages[i].extract_text())
    return "\n".join(parts)
|
||||
|
||||
|
||||
# --- Block splitting ---
|
||||
|
||||
|
||||
def find_stat_block_starts(lines: list[str]) -> list[int]:
    starts = []
    for i, line in enumerate(lines):
        if AC_RE.match(line.strip()):
            header_idx = None
            for j in range(i - 1, max(-1, i - 5), -1):
                if HEADER_RE.match(lines[j].strip()):
                    header_idx = j
                    break
            if header_idx is None:
                continue
            name_idx = header_idx - 1
            if name_idx >= 0 and lines[name_idx].strip():
                starts.append(name_idx)
    return starts


SECTION_HEADER_SMUSH_RE = re.compile(
    r"^(?P<body>.+?)\.(?P<hdr>Actions|Bonus Actions|Reactions|Legendary Actions|Traits)\s*$"
)


def block_for(lines: list[str], start: int, next_start: int | None) -> list[str]:
    """Build the line list for one stat block.

    Drops page markers and everything from the first running-header line
    onward (which marks the transition to a new prose page). Splits PDF
    smush lines like "...plants.Actions" into two lines so section header
    detection works.
    """
    end = next_start if next_start is not None else len(lines)
    out: list[str] = []
    for ln in lines[start:end]:
        if ln.startswith("===PAGE"):
            continue
        if RUNNING_HEADER_RE.match(ln.strip()):
            break
        m = SECTION_HEADER_SMUSH_RE.match(ln.strip())
        if m:
            out.append(m.group("body") + ".")
            out.append(m.group("hdr"))
        else:
            out.append(ln)
    return out
|
||||
|
||||
|
||||
# --- Vitals parsing ---
|
||||
|
||||
|
||||
def parse_header(block: list[str]) -> dict:
    name = block[0].strip()
    header = block[1].strip()
    m = HEADER_RE.match(header)
    if not m:
        raise ValueError(f"Bad header for {name!r}: {header!r}")
    size, ctype, alignment = m.group(1), m.group(2).strip(), m.group(3).strip()
    return {"name": name, "size": size, "type": ctype, "alignment": alignment}


def parse_ac(line: str) -> int:
    m = AC_RE.match(line.strip())
    if not m:
        raise ValueError(f"Bad AC line: {line!r}")
    return int(m.group(1))


def parse_hp(line: str) -> dict:
    m = HP_RE.match(line.strip())
    if not m:
        raise ValueError(f"Bad HP line: {line!r}")
    return {"average": int(m.group(1)), "formula": m.group(2).strip()}


def parse_speed(line: str) -> str:
    m = SPEED_RE.match(line.strip())
    if not m:
        raise ValueError(f"Bad Speed line: {line!r}")
    speed = m.group(1).rstrip(".").strip()
    # Normalize "30 ft" → "30 ft." to match 5etools adapter output style.
    speed = re.sub(r"(\d+)\s+ft\b\.?", r"\1 ft.", speed)
    return speed


def parse_abilities(row1: str, row2: str) -> dict:
    out = {}
    for row in (row1, row2):
        m = ABILITY_ROW_RE.match(row.strip())
        if not m:
            raise ValueError(f"Bad ability row: {row!r}")
        for off in (0, 4, 8):
            ab = m.group(off + 1).lower()
            score = int(m.group(off + 2))
            out[ab] = score
    return out
|
||||
|
||||
|
||||
# --- Meta lines ---
|
||||
|
||||
|
||||
META_KEYS = ("Skills", "Saving Throws", "Resistances", "Immunities",
             "Vulnerabilities", "Senses", "Languages", "Gear")


def is_meta_start(line: str) -> str | None:
    for key in META_KEYS:
        if line.startswith(key + " ") or line.startswith(key + " "):
            return key
    return None


def parse_meta(lines: list[str], start: int) -> tuple[dict, int]:
    meta: dict[str, str] = {}
    i = start
    current_key: str | None = None
    current_val_parts: list[str] = []

    def flush() -> None:
        nonlocal current_key, current_val_parts
        if current_key is not None:
            meta[current_key] = " ".join(p.strip() for p in current_val_parts).strip()
        current_key = None
        current_val_parts = []

    while i < len(lines):
        line = lines[i].strip()
        if not line:
            i += 1
            continue
        if line.startswith("Challenge "):
            flush()
            return meta, i
        key = is_meta_start(line)
        if key:
            flush()
            current_key = key
            current_val_parts.append(line[len(key):].strip())
        elif current_key is not None:
            current_val_parts.append(line)
        i += 1
    flush()
    return meta, i
|
||||
|
||||
|
||||
# --- Section discovery ---
|
||||
|
||||
|
||||
def find_section_starts(block: list[str], start_idx: int) -> list[tuple[str, int]]:
    starts = []
    for i in range(start_idx, len(block)):
        ln = block[i].strip()
        if ln in SECTION_HEADERS:
            starts.append((ln, i))
    return starts


def collect_section_lines(block: list[str], start: int, end: int) -> list[str]:
    """Collect the raw lines for one section (between header indices)."""
    out: list[str] = []
    for line in block[start:end]:
        if not line.strip():
            continue
        out.append(line.rstrip())
    return out


def join_section_text(lines: list[str]) -> str:
    """Join section lines into a single text blob, repairing wrap hyphens."""
    text = " ".join(line.strip() for line in lines if line.strip())
    text = re.sub(r"\s+", " ", text)
    # Repair "civi -li zation" → "civilization" (PDF column-wrap hyphens).
    text = re.sub(r"(\w)\s*-\s+(\w)", r"\1\2", text)
    return text.strip()
|
||||
|
||||
|
||||
# --- Entry splitting ---
|
||||
|
||||
# Entry name: title-case phrase, where each "word" is either a Capitalized
# word, a lowercase connector (of/the/and/or/in/at/on/to/with/from), a roman
# numeral, etc. Optionally followed by parenthesized modifier.
ENTRY_NAME_INNER = (
    r"[A-Z][A-Za-z'’]*"
    r"(?:[ \-](?:[A-Z][A-Za-z'’]*|of|the|and|or|in|at|on|to|with|from))*"
    r"(?:\s*\([^)]+\))?"
)
# An entry boundary occurs at the start of the joined section text, or
# immediately after a sentence-ending punctuation. The PDF sometimes drops
# the space between the period and the new entry name, so `\s*` is fine.
ENTRY_BOUNDARY = re.compile(
    rf"(?:^|(?<=[\.\?\!]))\s*(?P<name>{ENTRY_NAME_INNER})\.\s+(?=[A-Z“\"(])"
)

# Trim attribution quotes / page-header bleed-through from entry bodies.
PROSE_TAIL_PATTERNS = (
    # Em-dash attribution: " —Chondrus, Priest of Lutheria"
    re.compile(r"\s+—\s*[A-Z][^—]*$"),
    # Smushed section header at end ("...plants.Actions").
    re.compile(
        r"\.\s*(?:Actions|Bonus\s+Actions|Reactions|Legendary\s+Actions|Traits)\s*$"
    ),
    # Curated prose subheadings / phrase markers that follow stat blocks in
    # this book. PDF reflow often merges prose onto the same logical line
    # as the last action body, so the leading whitespace is optional.
    re.compile(
        r"\.?\s*(?:Random Trapped Creature|Maenad Bacchanal|The Phalanx Formation"
        r"|Reinforced Portal|TRAPPED|HUNGER FOR|PURSUIT OF|RITUAL|MyTHIC|BRON"
        r"|GOlDEN|NyMPH|MARBlE|KElEDONE|SOlDIER|MINOTAUR|SATyRS|GOATlING|EMPUS"
        r"|ANARCH|GyGAN|CERBERUS|WHITE STAG|STORM|FEy|VOlKAN).*",
        re.DOTALL,
    ),
    # Specific prose sentence-starts observed leaking in.
    re.compile(
        r"\.(?:will gleefully|Some report that|Storm Dory|This magic weapon"
        r"|Thylean soldiers|Some claim|These leaders).*",
        re.DOTALL,
    ),
    # Generic small-caps header: after a period or space, a word that starts
    # with two or more uppercase letters, then whitespace, then (within about
    # 12 characters) a run of three or more uppercase letters. Headers with
    # an early lowercase letter ("MyTHIC BEAST", "GOlDEN RAM") are caught by
    # the curated list above instead.
    re.compile(r"(?<=[\.\s])[A-Z]{2}\w*\s+[\w ]{0,12}[A-Z]{3}[A-Z\w ]*"),
)
|
||||
|
||||
|
||||
def trim_prose_tail(body: str) -> str:
    out = body
    for pat in PROSE_TAIL_PATTERNS:
        m = pat.search(out)
        if m:
            out = out[:m.start()].rstrip().rstrip(".") + "."
    return out.strip()


def is_valid_entry_name(name: str) -> bool:
    """Filter false-positive matches that aren't really entry names."""
    if name in NAME_FALSE_POSITIVES:
        return False
    # Single short capitalized word that's a common condition or noun is
    # usually a false positive when followed by a period. Real entry names
    # almost always have either multiple words or a parenthesized modifier.
    bare = re.sub(r"\s*\([^)]+\)\s*", "", name).strip()
    if bare in NAME_FALSE_POSITIVES:
        return False
    return True


def split_text_into_entries(text: str) -> list[tuple[str, str]]:
    """Split section text into (name, body) entries by scanning for entry-name
    boundaries (start-of-text or after a sentence period)."""
    matches: list[tuple[int, int, str]] = []
    for m in ENTRY_BOUNDARY.finditer(text):
        name = m.group("name").strip()
        if is_valid_entry_name(name):
            matches.append((m.start(), m.end(), name))
    if not matches:
        return []
    entries: list[tuple[str, str]] = []
    for i, (_, body_start, name) in enumerate(matches):
        body_end = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        body = text[body_start:body_end].strip()
        entries.append((name, body))
    return entries


def parse_section_traits(lines: list[str]) -> list[dict]:
    text = join_section_text(lines)
    entries = split_text_into_entries(text)
    out = []
    for name, body in entries:
        body = trim_prose_tail(body)
        if body or name:
            out.append({"name": name,
                        "segments": [{"type": "text", "value": body}]})
    return out
|
||||
|
||||
|
||||
def parse_legendary(lines: list[str], creature_name: str) -> dict | None:
    """Parse the Legendary Actions section. Text before the first entry whose
    body contains action vocabulary forms the preamble.
    """
    text = join_section_text(lines)
    all_matches: list[tuple[int, int, str]] = []
    for m in ENTRY_BOUNDARY.finditer(text):
        name = m.group("name").strip()
        if is_valid_entry_name(name):
            all_matches.append((m.start(), m.end(), name))

    action_anchors = ("Saving Throw", "Attack Roll", "Trigger", "Recharge",
                      "Melee", "Ranged", "Constitution", "Dexterity",
                      "Strength", "Intelligence", "Wisdom", "Charisma")
    first_action_idx = None
    for i, (_, body_start, _) in enumerate(all_matches):
        body_end = all_matches[i + 1][0] if i + 1 < len(all_matches) else len(text)
        body_head = text[body_start:min(body_end, body_start + 100)]
        if any(a in body_head for a in action_anchors):
            first_action_idx = i
            break
    if first_action_idx is None:
        return None
    preamble = text[:all_matches[first_action_idx][0]].strip()
    if not preamble:
        preamble = f"{creature_name} can take Legendary Actions."
    entries = []
    for i in range(first_action_idx, len(all_matches)):
        _, body_start, name = all_matches[i]
        body_end = all_matches[i + 1][0] if i + 1 < len(all_matches) else len(text)
        body = text[body_start:body_end].strip()
        entries.append((name, body))
    if not entries:
        return None
    return {
        "preamble": preamble,
        "entries": [
            {"name": name,
             "segments": [{"type": "text", "value": trim_prose_tail(body)}]}
            for name, body in entries if body
        ],
    }
|
||||
|
||||
|
||||
# --- Top-level parse ---


def parse_block(block: list[str]) -> dict:
    """Parse one stat block (a list of lines) into a normalized creature dict.

    Assumes a fixed layout: AC, HP, and Speed on lines 2-4, the MOD header on
    line 5, and the two ability rows on lines 6-7. Raises ValueError when the
    MOD header or the Challenge line is not where it is expected.
    """
    head = parse_header(block)
    ac = parse_ac(block[2])
    hp = parse_hp(block[3])
    speed = parse_speed(block[4])
    if not block[5].strip().startswith("MOD"):
        raise ValueError(f"Expected MOD header, got: {block[5]!r}")
    abilities = parse_abilities(block[6], block[7])

    meta, ch_idx = parse_meta(block, 8)
    cr_match = CR_RE.match(block[ch_idx].strip())
    if not cr_match:
        raise ValueError(f"Bad Challenge line: {block[ch_idx]!r}")
    cr_str = cr_match.group(1)

    # Each section runs from the line after its heading to the next heading.
    section_starts = find_section_starts(block, ch_idx + 1)
    sections: dict[str, list[str]] = {}
    for i, (name, idx) in enumerate(section_starts):
        end = section_starts[i + 1][1] if i + 1 < len(section_starts) else len(block)
        sections[name] = collect_section_lines(block, idx + 1, end)

    creature: dict = {
        "id": make_creature_id(SOURCE_CODE, head["name"]),
        "name": head["name"],
        "source": SOURCE_CODE,
        "sourceDisplayName": SOURCE_DISPLAY,
        "size": head["size"],
        "type": head["type"],
        "alignment": head["alignment"],
        "ac": ac,
        "hp": hp,
        "speed": speed,
        "abilities": abilities,
        "cr": cr_str,
        "initiativeProficiency": 0,
        "proficiencyBonus": proficiency_bonus(cr_str),
        "passive": parse_passive_perception(meta.get("Senses", "")) or 10,
    }

    if "Saving Throws" in meta:
        creature["savingThrows"] = meta["Saving Throws"]
    if "Skills" in meta:
        creature["skills"] = meta["Skills"]
    if "Resistances" in meta:
        creature["resist"] = meta["Resistances"]
    if "Immunities" in meta:
        creature["immune"] = meta["Immunities"]
    if "Vulnerabilities" in meta:
        creature["vulnerable"] = meta["Vulnerabilities"]
    if "Senses" in meta:
        # Passive Perception is stored separately, so drop it from the senses text.
        senses = re.sub(r"[;,]?\s*Passive Perception\s+\d+\s*$", "", meta["Senses"])
        senses = senses.strip().rstrip(";").strip()
        if senses:
            creature["senses"] = senses
    if "Languages" in meta:
        creature["languages"] = meta["Languages"]

    if "Traits" in sections:
        creature["traits"] = parse_section_traits(sections["Traits"])
    if "Actions" in sections:
        creature["actions"] = parse_section_traits(sections["Actions"])
    if "Bonus Actions" in sections:
        creature["bonusActions"] = parse_section_traits(sections["Bonus Actions"])
    if "Reactions" in sections:
        creature["reactions"] = parse_section_traits(sections["Reactions"])
    if "Legendary Actions" in sections:
        leg = parse_legendary(sections["Legendary Actions"], head["name"])
        if leg:
            creature["legendaryActions"] = leg

    return creature


def main() -> int:
    """Extract stat blocks from the PDF named on the command line and write
    them to data/bestiary/dnd-bundled.json.

    Returns 0 on success, 1 on a usage or missing-file error, and 2 if any
    block failed to parse (the blocks that did parse are still written).
    """
    if len(sys.argv) != 2:
        print("Usage: python3 extract-great-labors.py <path-to-pdf>",
              file=sys.stderr)
        return 1
    pdf_path = Path(os.path.expanduser(sys.argv[1]))
    if not pdf_path.exists():
        print(f"PDF not found: {pdf_path}", file=sys.stderr)
        return 1

    text = extract_pages(pdf_path)
    lines = text.split("\n")

    starts = find_stat_block_starts(lines)
    print(f"Detected {len(starts)} stat blocks", file=sys.stderr)

    creatures = []
    failures = []
    for i, s in enumerate(starts):
        next_s = starts[i + 1] if i + 1 < len(starts) else None
        block = block_for(lines, s, next_s)
        try:
            creatures.append(parse_block(block))
        except Exception as e:
            failures.append((block[0] if block else "<empty>", str(e)))

    if failures:
        print(f"\n{len(failures)} parse failures:", file=sys.stderr)
        for name, err in failures:
            print(f"  - {name}: {err}", file=sys.stderr)

    out_path = Path(__file__).resolve().parent.parent / "data" / "bestiary" / "dnd-bundled.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False emits raw non-ASCII characters, so pin the encoding
    # instead of relying on the platform default.
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(creatures, f, indent="\t", ensure_ascii=False)
        f.write("\n")
    print(f"Wrote {len(creatures)} creatures to {out_path}", file=sys.stderr)
    return 0 if not failures else 2


if __name__ == "__main__":
    sys.exit(main())
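
Review note: the boundary-slicing approach in `split_text_into_entries` can be tried in isolation with a minimal sketch like the one below. `SKETCH_BOUNDARY` and `SKETCH_FALSE_POSITIVES` are hypothetical stand-ins; the script's real `ENTRY_BOUNDARY` pattern and `NAME_FALSE_POSITIVES` set are defined elsewhere and may well differ. The sketch only illustrates the idea that each body runs from the end of one accepted name match to the start of the next.

```python
import re

# Hypothetical stand-ins for illustration only; the script's actual
# ENTRY_BOUNDARY and NAME_FALSE_POSITIVES are not reproduced here.
SKETCH_BOUNDARY = re.compile(
    r"(?:^|(?<=\.\s))"                        # start of text, or right after ". "
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # one or more Title Case words
    r"(?: \([^)]+\))?)"                       # optional "(Recharge 5-6)" modifier
    r"\.\s+"                                  # the period that ends the name
)
SKETCH_FALSE_POSITIVES = {"Prone"}


def split_entries_sketch(text: str) -> list[tuple[str, str]]:
    """Slice text into (name, body) pairs; a body runs to the next accepted name."""
    matches = []
    for m in SKETCH_BOUNDARY.finditer(text):
        name = m.group("name")
        # Compare the name without its parenthesized modifier, as the script does.
        bare = re.sub(r"\s*\([^)]+\)\s*", "", name).strip()
        if bare not in SKETCH_FALSE_POSITIVES:
            matches.append((m.start(), m.end(), name))
    entries = []
    for i, (_, body_start, name) in enumerate(matches):
        body_end = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        entries.append((name, text[body_start:body_end].strip()))
    return entries


sample = (
    "Pack Tactics. The wolf has Advantage on attack rolls. "
    "Bite (Recharge 5-6). The target falls. Prone. It stays down."
)
for name, body in split_entries_sketch(sample):
    print(f"{name!r}: {body!r}")
```

Because rejected names are dropped before slicing, a false positive such as "Prone." stays inside the preceding entry's body rather than starting a new entry. Text before the first accepted name is discarded, which matches the behavior of the function in this commit.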