Knowledge-Base on

Upgrading My PDF Converter to IBM's Docling

Sat, 02 May 2026 00:00:00 +0000

When My Own Tool Couldn’t Handle My Work

The error message was easy to dismiss: RapidOCR returned empty result!. It appeared twice in the terminal, then silence — a blank .md file where a 40-page Oracle HCM implementation guide should have been. The PDF had come straight from Oracle’s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.

That was one category of failure. The other was quieter. For documents that did convert, I started noticing the tables were wrong — not corrupted, just structurally dissolved. An eligibility matrix that should have had six clearly labeled columns came back as a run of loosely connected text. Useful for nothing.

I had built this tool to serve my Oracle work. Then my Oracle work showed me exactly where it fell short.

The Problem with PyMuPDF4LLM

If you’ve followed this series, you know that PyMuPDF4LLM was a solid choice when I first built the converter. It handled text-based PDFs cleanly, installed without friction, and required almost no configuration. For research papers and simple documentation, it worked well.

But Oracle HCM documentation is a different category of document. Oracle’s guides are dense with tables: configuration reference grids, eligibility matrices, step-and-action setup tables. These are not decorative — they carry most of the meaning. When PyMuPDF4LLM dissolved those tables into unstructured text, it was silently degrading the most important parts of the document.

The image-based PDF problem was a hard wall. If a document was captured as page images rather than extractable text, the converter returned nothing. No partial output, no warning — just empty files.

Discovering Docling

IBM Research Zurich’s AI for Knowledge team open-sourced Docling in July 2024. The project has a specific focus: turning complex documents into structured, AI-ready output. In April 2025, IBM donated it to the Linux Foundation AI & Data, and it now powers data ingestion for Red Hat Enterprise Linux AI. As of this writing it has over 24,000 GitHub stars.

What makes Docling different is that it treats document conversion as a computer vision problem, not just a text extraction problem.

Layout analysis: Docling uses an RT-DETR-derived model trained on DocLayNet — IBM’s human-annotated dataset of real-world documents — to detect and classify every region on the page: tables, figures, headers, footers, section titles, body text. It knows the structure before it extracts any content.

Table reconstruction: This is where Docling earns its place for Oracle documentation. It uses a vision transformer called TableFormer that predicts row/column structure and header roles directly from the page image. The result is a proper Markdown table, not a stream of cell values.

Image-based PDFs: For documents stored as page images, Docling integrates OCR into its pipeline natively. The same converter handles text-based and image-based PDFs without any changes on your end.

The Switch

The API change was minimal. The old code:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown(pdf_path)

The new code:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert(pdf_path)
md_text = result.document.export_to_markdown()

Three lines instead of one, but the extra structure pays dividends: DocumentConverter can be initialized once and reused across an entire batch, which matters when processing a folder of 50 Oracle guides.
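To make the reuse concrete, here is a minimal sketch of a batch loop built around a single converter instance. It is an illustration, not the actual pdf2md code: the function names `make_docling_to_markdown` and `convert_folder` are hypothetical, and the empty-output warning is my own addition. Only `DocumentConverter`, `convert`, and `export_to_markdown` come from the snippet above.

```python
from pathlib import Path
from typing import Callable


def make_docling_to_markdown() -> Callable[[Path], str]:
    """Build one DocumentConverter and return a function that reuses it."""
    # Imported lazily so the rest of this sketch runs without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # one instance, shared by every file

    def to_markdown(pdf_path: Path) -> str:
        result = converter.convert(pdf_path)
        return result.document.export_to_markdown()

    return to_markdown


def convert_folder(src: Path, dst: Path, to_markdown: Callable[[Path], str]) -> list:
    """Convert every PDF in src, writing one .md file per PDF into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src.glob("*.pdf")):
        md_text = to_markdown(pdf_path)
        if not md_text.strip():
            # Surface empty conversions instead of silently writing blank files.
            print(f"warning: no text extracted from {pdf_path.name}")
            continue
        out_path = dst / (pdf_path.stem + ".md")
        out_path.write_text(md_text, encoding="utf-8")
        written.append(out_path)
    return written
```

A call might look like `convert_folder(Path("~/oracle-pdfs").expanduser(), Path("output"), make_docling_to_markdown())`. Passing the conversion function in as an argument also makes it easy to swap in a stub while testing the loop.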

A note on startup: The first time you run Docling, it downloads its ML models from Hugging Face. You will see this:

Loading weights: 100%|██████████| 770/770 [00:00<00:00, 1656.35it/s]

This is normal. The models are cached locally after the first download, so subsequent runs start without downloading anything. If you see a warning about HF_TOKEN, that is also expected. Docling works without one, but setting a token removes the rate-limit warning:

echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.zshrc

What Changed in Practice

Oracle documentation: Tables that previously collapsed into text now render as proper Markdown tables. A 6-column configuration reference comes back with headers intact and every row correctly aligned.

AI books: My knowledge base includes dense technical books on LLM engineering and machine learning. These have complex layouts — sidebars, multi-column sections, figures with captions. Docling’s layout model handles these significantly better than PyMuPDF4LLM’s heuristic approach.

Image-based PDFs: Documents that previously produced empty output now convert cleanly. The two-step workaround (ocrmypdf → pdf2md) is no longer necessary for most cases.

Two Other Improvements

While I was updating the engine, I added two things that were overdue:

DOCX support. The converter now handles Word documents using pandoc as a backend. The same pdf2md command works for both file types. This matters for Oracle support exports and study notes from my reMarkable.

Batch manifest. When processing a large folder, the converter now writes a manifest file tracking which files have been converted and their checksums. Re-running on the same folder skips files that haven’t changed. A --force flag overrides this when you need a fresh conversion.

pdf2md --batch ~/oracle-pdfs/ # skips already-converted
pdf2md --batch ~/oracle-pdfs/ --force # reconverts everything
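The skip logic behind those two commands can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the converter's real code: the manifest file name, the JSON layout (file name mapped to SHA-256 digest), and the function names are all hypothetical.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = ".pdf2md-manifest.json"  # hypothetical file name


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def load_manifest(folder: Path) -> dict:
    """Read the manifest, or return an empty one if none exists yet."""
    manifest_path = folder / MANIFEST_NAME
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def files_to_convert(folder: Path, force: bool = False) -> list:
    """List PDFs whose checksum differs from the manifest (all of them if force)."""
    manifest = load_manifest(folder)
    return [
        pdf for pdf in sorted(folder.glob("*.pdf"))
        if force or manifest.get(pdf.name) != sha256_of(pdf)
    ]


def record_converted(folder: Path, converted: list) -> None:
    """Store the current checksum of each converted file in the manifest."""
    manifest = load_manifest(folder)
    for pdf in converted:
        manifest[pdf.name] = sha256_of(pdf)
    (folder / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Comparing checksums rather than modification times means a file that was merely copied or touched is not reconverted, while a file whose bytes changed always is.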

What’s Next

The web UI — which I added in the last post — has also been updated to use Docling. Drag a PDF onto it, click Convert, and the same deep-learning pipeline runs behind the scenes.

The next thing I want to add is direct output to the Obsidian inbox. Right now the flow is: convert → download ZIP → move to vault. A toggle that sends output directly to ~/projects/obsidian-vault/00-inbox/ would cut that manual step entirely.

The tool is doing what I originally wanted: converting my Oracle documentation and AI library into clean, searchable Markdown. Docling is what makes that reliable for the documents that actually matter.

Front Matter Is the Schema of Your Knowledge Base

Sun, 19 Apr 2026 00:00:00 +0000

There is a Dataview query I run at least once a week:

TABLE date, author, genre
FROM "30-books"
WHERE contains(tags, "non-fiction") AND status = "finished"
SORT date DESC

It gives me a table of every non-fiction book I have finished, with its completion date, author, and genre, in about 200 milliseconds. When I want to find what I read on a specific topic, I do not dig through folders or search my memory. I run the query.

That query only works because every note in that folder has structured front matter. Without it, Dataview has nothing to read, and the query returns zero results. I would be back to scrolling through files, reading titles, hoping I named things consistently.

That is not a trivial difference. It is the difference between a note-taking app and a knowledge base.
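For readers who do not use Dataview, the same selection can be written as a plain filter over parsed metadata. The sketch below is only an analogy for what the query does, not how Dataview is implemented. The sample notes and the function name `finished_non_fiction` are made up for illustration.

```python
from datetime import date

# Hypothetical front matter, already parsed into dicts (one per note).
notes = [
    {"title": "Thinking, Fast and Slow", "date": date(2026, 3, 12),
     "author": "Daniel Kahneman", "tags": ["non-fiction", "psychology"],
     "status": "finished"},
    {"title": "Example Novel", "date": date(2026, 2, 1),
     "author": "A. Writer", "tags": ["fiction"], "status": "finished"},
    {"title": "Example Handbook", "date": date(2026, 4, 2),
     "author": "B. Writer", "tags": ["non-fiction"], "status": "reading"},
    {"title": "Untagged Note"},  # a note with no front matter at all
]


def finished_non_fiction(notes: list) -> list:
    """Mirror the query: WHERE tag and status match, SORT date DESC."""
    matches = [
        n for n in notes
        if "non-fiction" in n.get("tags", []) and n.get("status") == "finished"
    ]
    return sorted(matches, key=lambda n: n.get("date", date.min), reverse=True)


for note in finished_non_fiction(notes):
    print(note["date"], note["author"], note["title"])
```

The note without front matter is never an error. It simply cannot match, which is exactly how an unstructured note disappears from query results.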

The Unstructured Vault Problem

Most people start Obsidian the same way: create a folder structure, drop notes in, link a few things. It feels organized at first. Folders give the illusion of structure.

The problem is that folders are physical storage, not logical structure. A note about a book you finished sits in 30-books/. That tells you where the file lives. It tells you nothing about when you read it, whether you finished it, who wrote it, what genre it is, or whether it connects to three other books you read on the same topic in a different folder.

Worse, that knowledge is invisible to anything that tries to read your vault programmatically. Dataview cannot query it. A PAI skill cannot filter for it. An AI context loader cannot select it by relevance. The information exists, but it is locked inside prose — retrievable only by a human reading the file.

When your vault grows past a few hundred notes, that model collapses.

What Front Matter Actually Is

Front matter is a YAML block at the top of a markdown file, delimited by triple dashes. It holds structured key-value pairs that describe the note — not the content itself, but metadata about it.

It is not magic and it is not complicated. It is a schema.

A minimal front matter block for a knowledge base note might look like this:

---
title: "Thinking, Fast and Slow"
date: 2026-03-12
tags:
 - non-fiction
 - psychology
 - behavioral-economics
status: finished
author: Daniel Kahneman
rating: 5
---

Three fields do most of the work: tags (what domain and type is this), date (when), and a status or type field (where in its lifecycle). Everything else is optional until a specific query demands it.
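To show how little machinery a tool needs to read this block, here is a deliberately simplified parser. It handles only the shapes used in the example above (scalar `key: value` pairs and one-level lists) and leaves every value as a string. It is not a full YAML parser, and real tools would normally use a YAML library instead. The function name `parse_front_matter` is my own.

```python
def parse_front_matter(text: str) -> dict:
    """Parse a simple front matter block: scalars and one-level lists only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    meta = {}
    current_list_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if stripped.startswith("- ") and current_list_key is not None:
            # List item belonging to the most recent "key:" line.
            meta[current_list_key].append(stripped[2:].strip().strip('"'))
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            value = value.strip().strip('"')
            if value:
                meta[key.strip()] = value  # scalar, kept as a string
                current_list_key = None
            else:
                current_list_key = key.strip()
                meta[current_list_key] = []  # list items follow
    return {}  # block was never closed, so treat it as absent


note = """---
title: "Thinking, Fast and Slow"
date: 2026-03-12
tags:
 - non-fiction
 - psychology
status: finished
---
Really good book.
"""

meta = parse_front_matter(note)
print(meta["title"], meta["tags"], meta["status"])
```

Everything after the closing delimiter is ignored, which is the point: the metadata can be read without touching the prose.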

What It Unlocks

Dataview queries. Once your notes have consistent front matter, Dataview turns your vault into a queryable database. You can build a live table of unresolved issues, a list of certification notes by module, a filtered view of blog drafts not yet published. The query language is simple. The payoff is immediate.

Cross-domain filtering. My vault spans four domains: career notes, AI governance certification notes, PAI infrastructure documentation, and blog post drafts. Without front matter, navigating across those domains means folder-hopping. With front matter, I can query across all four simultaneously — surface everything tagged behavioral-economics regardless of where it lives, or find all notes with status: in-progress across every section at once. The folder structure stays for physical organization. Front matter handles the logical layer.

AI context loading. This is the one that changed how I think about it. PAI does not load my entire vault into context when I ask a question about something I have read. It loads notes that match specific criteria: the right tags, the right domain, the right status. That selection mechanism is front matter. Without structured metadata, the system gets everything or nothing. With it, loading can be precise.

Before and After: The Same Note

Without front matter:

# Thinking, Fast and Slow
Really good book. Kahneman breaks down how we make decisions — System 1
is fast and intuitive, System 2 is slow and deliberate. The section on
cognitive biases was the most useful part. Finished it in March. Would
recommend to anyone interested in decision-making or behavioral economics.

This is a fine note. It has the information. But Dataview cannot surface it in a query. PAI cannot identify it as a finished book on behavioral economics. Six months from now, I will not remember I wrote it unless I happen to search the right words.

With front matter:

---
title: "Thinking, Fast and Slow"
date: 2026-03-12
tags:
 - non-fiction
 - psychology
 - behavioral-economics
 - decision-making
status: finished
author: Daniel Kahneman
rating: 5
---

Now the note is queryable. PAI surfaces it automatically when I ask about books on decision-making. Dataview includes it in my Q1 reading table. I can filter for all five-star books across my entire reading folder. The content of the note is identical — only the schema changed.

The Architecture Argument

A relational database without a schema is just a collection of text files. An Obsidian vault without front matter is nearly the same thing — a sophisticated folder system with backlinks and a graph view, but still fundamentally unqueryable by anything that needs to select notes by attribute.

Front matter gives your vault a schema. Folders give it a physical address. You need both, but the schema is what makes a vault a knowledge base. Without it, you are building a library where every book is correctly shelved but nothing has a catalog entry. Finding anything specific means walking the stacks and reading spines.

Where to Start

Do not design an elaborate front matter schema before you have written a hundred notes. That is premature optimization and it will not survive contact with actual usage.

Start with three fields: tags, date, and status. Add type if your notes serve different purposes (reference, log, draft, fix-doc). Add domain-specific fields only when a query demands them.

The schema should be pulled from how you actually search, not pushed from how you think you might want to search someday. Write the notes, run queries against three fields, and let the gaps tell you what to add next. The vault teaches you what it needs — if you have given it enough structure to communicate.

I Turned 50 Cybersecurity Books Into a Searchable Brain

Sat, 21 Mar 2026 00:00:00 +0000

The Problem With Security Books

I have a lot of cybersecurity books. PDFs from Humble Bundles, O’Reilly downloads, books I’ve bought and never finished, reference material I collected “just in case.” As in most people’s collections, they lived in a folder I rarely opened.

The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I’d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.

That’s not a knowledge base. That’s a graveyard.

So I built something better: a local semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages — with the book and chapter — that answer them. The whole thing runs locally on my machine.

Here’s how I built it, and why it’s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.

What Semantic Search Actually Means

Traditional search is keyword matching. You type “SQL injection” and it finds documents containing those exact words.

Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent meaning in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask “how to bypass database input validation” and you’ll surface the same SQL injection content, even though you never typed “SQL injection.”

This matters enormously for a security knowledge base. Security concepts have dozens of names. “Privilege escalation,” “privesc,” “root access,” “vertical privilege abuse”: these all describe the same territory. Semantic search surfaces passages that use any of them.
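The "similar concepts cluster together" idea comes down to a distance measure between vectors. Cosine similarity is a common choice, and the sketch below computes it on toy 3-dimensional vectors. The numbers are invented for illustration. Real embeddings have far more dimensions and come from the embedding model, not from hand-picked values.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up numbers, for illustration only).
query = [0.9, 0.1, 0.0]           # "how to bypass database input validation"
sqli_passage = [0.8, 0.2, 0.1]    # a passage about SQL injection
crypto_passage = [0.0, 0.2, 0.9]  # a passage about cryptography

print(round(cosine_similarity(query, sqli_passage), 3))
print(round(cosine_similarity(query, crypto_passage), 3))
```

The query scores close to 1 against the SQL injection passage and close to 0 against the cryptography passage, even though no keywords are compared anywhere in the code.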

The Stack

PostgreSQL 17 — the database
pgvector 0.8.2 — vector similarity search extension for Postgres
OpenAI text-embedding-3-small — converts text chunks to 1536-dimensional vectors
CyberSecKB.ts — a custom Bun/TypeScript CLI I built to tie it all together

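Storage and search run locally. The only external calls go to OpenAI’s embedding API: once per chunk at ingest time, plus one small call per search to embed the query text with the same model so it can be compared against the stored vectors.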

The Pipeline: From PDF to Searchable Knowledge

Step 1: Convert PDFs to Markdown

Raw PDFs are terrible for text processing. I convert everything to Markdown first using a pdf2md Python tool:

cd ~/projects/pdf-to-markdown
source venv/bin/activate

# Text-based PDFs (most books):
python pdf2md input/mybook.pdf

# Image-based or scanned PDFs (use OCR first):
ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
python pdf2md /tmp/ocr.pdf output/mybook.md

# Move to library:
mv output/mybook.md ~/projects/cybersecurity-library/books/

Step 2: Ingest into the Database

TOOL=~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts

# Single book with topics tagged:
bun $TOOL ingest \
 --file ~/projects/cybersecurity-library/books/mybook.md \
 --title "My Book Title" \
 --topics web,network,linux

# Or load everything at once:
bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/

The ingest process:

Reads the Markdown file
Splits it into ~800-token chunks, preserving chapter headings
Sends chunks to OpenAI’s embedding API in batches
Stores chunks + their vector embeddings in PostgreSQL
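The chunking step is the part most worth sketching. The version below is a rough approximation under stated assumptions, not the CyberSecKB.ts implementation (which is TypeScript): it counts whitespace-separated words as a stand-in for tokens, and it tags each chunk with the nearest heading above it. The function name `chunk_markdown` and the dict layout are hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_markdown(text: str, max_tokens: int = 800) -> list:
    """Split Markdown into chunks of at most max_tokens, tagged with their heading.

    Token counts are approximated by whitespace-separated words; a real
    tokenizer would give different numbers.
    """
    chunks = []
    heading = ""
    words = []

    def flush() -> None:
        # Emit the words collected so far under the current heading.
        if words:
            chunks.append({"heading": heading, "text": " ".join(words)})
            words.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # never let a chunk span two sections
            heading = match.group(1).strip()
            continue
        for word in line.split():
            words.append(word)
            if len(words) >= max_tokens:
                flush()
    flush()
    return chunks
```

Keeping the heading with each chunk is what lets a search result say which chapter a passage came from, instead of returning an anonymous block of text.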

Step 3: Search

# Plain English query:
bun $TOOL search "how do attackers bypass WAF rules for SQL injection"

# Filter by topic:
bun $TOOL search "privilege escalation" --topics linux --limit 5

# Check what's in the KB:
bun $TOOL list
bun $TOOL stats

What It Looks Like in Practice

Here’s a real query. I asked:

bun $TOOL search "SQL injection bypass techniques" --limit 3

Result:

━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
The `;` metacharacter in a SQL statement is used similarly to how it's used
in command injection to combine multiple queries on the same line...
━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
If user input is used without prior validation, and it is concatenated
directly into a SQL query, a user can inject different data...
━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
Input taken from cookies, input forms, and URL variables is used to build
SQL statements that are passed back to the database...

Each result shows the similarity score, book title, chapter, and a preview. I can immediately tell which book to go deeper in.

Another query — privilege escalation:

bun $TOOL search "privilege escalation linux" --limit 3

━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
Most systems are built using the least privilege concept — users are
purposefully given the least privileges they need to perform their work...
━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
do proper checking of file creation in the upper filesystem area...
━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
On Linux, vertical escalation allows attackers to have root privileges
that enable them to modify systems and programs...

This is the power of the system: I asked about a concept, not a keyword, and got specific, sourced, actionable passages. Across the two queries, the results came from three different books.
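For the curious, a search like the ones above can be expressed as a single SQL statement. The sketch below builds that statement in Python, assuming the displayed percentage is cosine similarity (1 minus pgvector's `<=>` cosine distance). The table name `chunks`, its columns, and the function name `build_search_query` are hypothetical, and the `%s` placeholders follow the psycopg convention. The real tool is written in TypeScript and may look quite different.

```python
from typing import Optional


def build_search_query(query_embedding: list, topics: Optional[list] = None,
                       limit: int = 5) -> tuple:
    """Return (sql, params) for a cosine-similarity search over chunk embeddings.

    Table and column names are hypothetical.
    """
    # pgvector accepts a text literal such as "[0.1,0.2]" cast to vector.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = (
        "SELECT book_title, chapter, content, "
        "1 - (embedding <=> %s::vector) AS similarity "
        "FROM chunks"
    )
    params = [vec]
    if topics:
        # Add the filter only when topics were given, so an untyped NULL
        # is never bound where an array is expected.
        sql += " WHERE topics && %s::text[]"
        params.append(topics)
    sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
    params.extend([vec, limit])
    return sql, params


sql, params = build_search_query([0.1, 0.2], topics=["linux"], limit=3)
print(sql)
```

Ordering by distance ascending is the same as ordering by similarity descending, so the closest passages come back first.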

The Current State of the KB

After the initial batch ingest:

50 books indexed
11,757 chunks stored and embedded
Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more

Some of what’s in there:

Practical Malware Analysis (620 chunks)
Cybersecurity Threats, Malware Trends and Strategies (552 chunks)
Cybersecurity Attack and Defense Strategies (460 chunks)
Security Chaos Engineering (387 chunks)
Hardware Hacking Handbook (378 chunks)
Modern Data Protection (338 chunks)

Why This Fits Into PAI

This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that amplifies what I can do with AI, rather than using AI one prompt at a time.

The Security KB is a perfect example. It’s not about asking ChatGPT “explain SQL injection.” It’s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.

When I’m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.

Building It With Claude Code

The entire CyberSecKB tool was built using Claude Code through PAI. The process:

Described what I wanted: ingest markdown books, chunk by section, embed with OpenAI, store in pgvector
Claude Code scaffolded the TypeScript CLI
We hit a few real-world issues along the way:
- The OpenAI project key needed embedding model access enabled separately
- Batch size of 2048 hit the 300k token/request limit — tuned down to 200
- The 1M tokens/minute rate limit required adding a 15-second delay between batches
- A SQL type error in the search function when no topics filter was passed

Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.
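The batch-size and delay fixes are easy to check with arithmetic. At roughly 800 tokens per chunk, 2048 chunks is about 1.6M tokens in one request, far over the 300k limit, while 200 chunks is about 160k. With a 15-second pause there are at most four requests per minute, or about 640k tokens, which stays under 1M. A minimal sketch of that loop follows. The `embed_batch` callable stands in for the real API call and is an assumption, as are the function names.

```python
import time
from typing import Callable, Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    chunks: list,
    embed_batch: Callable[[list], list],
    batch_size: int = 200,
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Embed chunks batch by batch, pausing between requests."""
    vectors = []
    for index, batch in enumerate(batched(chunks, batch_size)):
        if index > 0:
            sleep(delay_seconds)  # stay under the tokens-per-minute limit
        vectors.extend(embed_batch(batch))
    return vectors
```

The `sleep` parameter exists so the loop can be tested without actually waiting. In normal use it defaults to `time.sleep`.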

What’s Next

A few things I want to add:

Tag all books with proper topics — the batch ingest skipped topic assignment; I’ll tag each book so --topics web or --topics linux filters actually work
Tier 1 topic files — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context
AI Security KB integration — the AI Security research KB shares the same database; queries cross both domains automatically

The knowledge base is live. The friction is gone. Now the books actually get used.

Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. All storage and search run locally; the only external calls are to the embedding API, for chunks at ingest time and for the query text at search time.