<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Knowledge-Base on</title><link>https://augmentedresilience.com/tags/knowledge-base/</link><description>Recent content in Knowledge-Base on</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 02 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://augmentedresilience.com/tags/knowledge-base/index.xml" rel="self" type="application/rss+xml"/><item><title>Upgrading My PDF Converter to IBM's Docling</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/upgrading-my-pdf-converter-to-ibm-docling/</link><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/upgrading-my-pdf-converter-to-ibm-docling/</guid><description>&lt;h2 id="when-my-own-tool-couldnt-handle-my-work">When My Own Tool Couldn&amp;rsquo;t Handle My Work&lt;/h2>
&lt;p>The error message was easy to dismiss: &lt;code>RapidOCR returned empty result!&lt;/code>. It appeared twice in the terminal, then silence — a blank .md file where a 40-page Oracle HCM implementation guide should have been. The PDF had come straight from Oracle&amp;rsquo;s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.&lt;/p></description><content>&lt;h2 id="when-my-own-tool-couldnt-handle-my-work">When My Own Tool Couldn&amp;rsquo;t Handle My Work&lt;/h2>
&lt;p>The error message was easy to dismiss: &lt;code>RapidOCR returned empty result!&lt;/code>. It appeared twice in the terminal, then silence — a blank .md file where a 40-page Oracle HCM implementation guide should have been. The PDF had come straight from Oracle&amp;rsquo;s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.&lt;/p>
&lt;p>That was one category of failure. The other was quieter. For documents that did convert, I started noticing the tables were wrong — not corrupted, just structurally dissolved. An eligibility matrix that should have had six clearly labeled columns came back as a run of loosely connected text. Useful for nothing.&lt;/p>
&lt;p>I had built this tool to serve my Oracle work. Then my Oracle work showed me exactly where it fell short.&lt;/p>
&lt;hr>
&lt;h2 id="the-problem-with-pymupdf4llm">The Problem with PyMuPDF4LLM&lt;/h2>
&lt;p>If you&amp;rsquo;ve followed this series, you know that PyMuPDF4LLM was a solid choice when I first &lt;a href="https://augmentedresilience.com/posts/when-your-pdf-workflow-breaks-building-a-markdown-converter-with-claude-code/" target="_blank" rel="noopener noreferrer">built the converter&lt;/a>. It handled text-based PDFs cleanly, installed without friction, and required almost no configuration. For research papers and simple documentation, it worked well.&lt;/p>
&lt;p>But Oracle HCM documentation is a different category of document. Oracle&amp;rsquo;s guides are dense with tables: configuration reference grids, eligibility matrices, step-and-action setup tables. These are not decorative — they carry most of the meaning. When PyMuPDF4LLM dissolved those tables into unstructured text, it was silently degrading the most important parts of the document.&lt;/p>
&lt;p>The image-based PDF problem was a hard wall. If a document was stored as page images rather than extractable text, the converter returned nothing usable: no partial output and no clear failure, just empty files.&lt;/p>
&lt;hr>
&lt;h2 id="discovering-docling">Discovering Docling&lt;/h2>
&lt;p>IBM Research Zurich&amp;rsquo;s AI for Knowledge team open-sourced &lt;a href="https://github.com/docling-project/docling" target="_blank" rel="noopener noreferrer">Docling&lt;/a>
in July 2024. The project has a specific focus: turning complex documents into structured, AI-ready output. In April 2025, IBM donated it to the Linux Foundation AI &amp;amp; Data, and it now powers data ingestion for Red Hat Enterprise Linux AI. As of this writing it has over 24,000 GitHub stars.&lt;/p>
&lt;p>What makes Docling different is that it treats document conversion as a computer vision problem, not just a text extraction problem.&lt;/p>
&lt;p>&lt;strong>Layout analysis:&lt;/strong> Docling uses an RT-DETR-derived model trained on DocLayNet — IBM&amp;rsquo;s human-annotated dataset of real-world documents — to detect and classify every region on the page: tables, figures, headers, footers, section titles, body text. It knows the structure before it extracts any content.&lt;/p>
&lt;p>&lt;strong>Table reconstruction:&lt;/strong> This is where Docling earns its place for Oracle documentation. It uses a vision transformer called TableFormer that predicts row/column structure and header roles directly from the page image. The result is a proper Markdown table, not a stream of cell values.&lt;/p>
&lt;p>&lt;strong>Image-based PDFs:&lt;/strong> For documents stored as page images, Docling integrates OCR into its pipeline natively. The same converter handles text-based and image-based PDFs without any changes on your end.&lt;/p>
&lt;hr>
&lt;h2 id="the-switch">The Switch&lt;/h2>
&lt;p>The API change was minimal. The old code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> pymupdf4llm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>md_text &lt;span style="color:#f92672">=&lt;/span> pymupdf4llm&lt;span style="color:#f92672">.&lt;/span>to_markdown(pdf_path)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The new code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> docling.document_converter &lt;span style="color:#f92672">import&lt;/span> DocumentConverter
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>converter &lt;span style="color:#f92672">=&lt;/span> DocumentConverter()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>result &lt;span style="color:#f92672">=&lt;/span> converter&lt;span style="color:#f92672">.&lt;/span>convert(pdf_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>md_text &lt;span style="color:#f92672">=&lt;/span> result&lt;span style="color:#f92672">.&lt;/span>document&lt;span style="color:#f92672">.&lt;/span>export_to_markdown()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Three lines instead of one, but the extra structure pays dividends: &lt;code>DocumentConverter&lt;/code> can be initialized once and reused across an entire batch, which matters when processing a folder of 50 Oracle guides.&lt;/p>
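&lt;p>A minimal sketch of that reuse, assuming a flat folder of PDFs. The helper name &lt;code>convert_folder&lt;/code> and the folder names are hypothetical and not part of Docling or of my converter. The helper accepts any object that exposes the same &lt;code>convert()&lt;/code> call shown above, so the models load once and every file shares them:&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">from pathlib import Path


def convert_folder(converter, src_dir, out_dir):
    """Convert every PDF in src_dir to Markdown with one shared converter."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        # Same calls as the single-file snippet; the converter is never rebuilt.
        result = converter.convert(pdf_path)
        md_path = out / (pdf_path.stem + ".md")
        md_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(md_path)
    return written
&lt;/code>&lt;/pre>&lt;p>With Docling installed, the call would be &lt;code>convert_folder(DocumentConverter(), "oracle-pdfs", "markdown-out")&lt;/code>, where both folder names are placeholders.&lt;/p>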
&lt;p>&lt;strong>A note on startup:&lt;/strong> The first time you run Docling, it downloads its ML models from Hugging Face. You will see this:&lt;/p>
&lt;pre tabindex="0">&lt;code>Loading weights: 100%|██████████| 770/770 [00:00&amp;lt;00:00, 1656.35it/s]
&lt;/code>&lt;/pre>&lt;p>This is normal. The models are cached locally after the first download, so subsequent runs skip the download and start much faster. If you see a warning about &lt;code>HF_TOKEN&lt;/code>, that is also expected. Docling works without one, but setting a token removes the rate-limit warning:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-zsh" data-lang="zsh">&lt;span style="display:flex;">&lt;span>echo &lt;span style="color:#e6db74">&amp;#39;export HF_TOKEN=&amp;#34;hf_your_token_here&amp;#34;&amp;#39;&lt;/span> &amp;gt;&amp;gt; ~/.zshrc
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="what-changed-in-practice">What Changed in Practice&lt;/h2>
&lt;p>&lt;strong>Oracle documentation:&lt;/strong> Tables that previously collapsed into text now render as proper Markdown tables. A 6-column configuration reference comes back with headers intact and every row correctly aligned.&lt;/p>
&lt;p>&lt;strong>AI books:&lt;/strong> My knowledge base includes dense technical books on LLM engineering and machine learning. These have complex layouts — sidebars, multi-column sections, figures with captions. Docling&amp;rsquo;s layout model handles these significantly better than PyMuPDF4LLM&amp;rsquo;s heuristic approach.&lt;/p>
&lt;p>&lt;strong>Image-based PDFs:&lt;/strong> Documents that previously produced empty output now convert cleanly. The two-step workaround (ocrmypdf → pdf2md) is no longer necessary for most cases.&lt;/p>
&lt;hr>
&lt;h2 id="two-other-improvements">Two Other Improvements&lt;/h2>
&lt;p>While I was updating the engine, I added two things that were overdue:&lt;/p>
&lt;p>&lt;strong>DOCX support.&lt;/strong> The converter now handles Word documents using pandoc as a backend. The same &lt;code>pdf2md&lt;/code> command works for both file types. This matters for Oracle support exports and study notes from my reMarkable.&lt;/p>
&lt;p>&lt;strong>Batch manifest.&lt;/strong> When processing a large folder, the converter now writes a manifest file tracking which files have been converted and their checksums. Re-running on the same folder skips files that haven&amp;rsquo;t changed. A &lt;code>--force&lt;/code> flag overrides this when you need a fresh conversion.&lt;/p>
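&lt;p>The skip logic can be sketched in a few lines. This is an illustration under assumptions, not the converter&amp;rsquo;s actual code: the JSON manifest format, the SHA-256 hash, and the function names &lt;code>files_to_convert&lt;/code> and &lt;code>record_converted&lt;/code> are all hypothetical choices.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash the file contents, so a touched timestamp alone changes nothing."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_manifest(manifest_path):
    """Read the name-to-checksum mapping, or start empty on the first run."""
    manifest_file = Path(manifest_path)
    return json.loads(manifest_file.read_text()) if manifest_file.exists() else {}


def files_to_convert(folder, manifest_path, force=False):
    """Return PDFs whose checksum differs from the one in the manifest."""
    seen = load_manifest(manifest_path)
    pending = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if force or seen.get(pdf.name) != sha256_of(pdf):
            pending.append(pdf)
    return pending


def record_converted(pdf, manifest_path):
    """Store the checksum once a file has converted successfully."""
    seen = load_manifest(manifest_path)
    seen[Path(pdf).name] = sha256_of(pdf)
    Path(manifest_path).write_text(json.dumps(seen, indent=2, sort_keys=True))
&lt;/code>&lt;/pre>&lt;p>Comparing content hashes rather than modification times means a file that was only copied or re-downloaded is still skipped, while a genuinely edited file is picked up again. From the command line, the two modes are:&lt;/p>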
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pdf2md --batch ~/oracle-pdfs/ &lt;span style="color:#75715e"># skips already-converted&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pdf2md --batch ~/oracle-pdfs/ --force &lt;span style="color:#75715e"># reconverts everything&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>The web UI — which I added in the &lt;a href="https://augmentedresilience.com/posts/adding-a-web-ui-to-my-pdf-to-markdown-converter/" target="_blank" rel="noopener noreferrer">last post&lt;/a>
— has also been updated to use Docling. Drag a PDF onto it, click Convert, and the same deep-learning pipeline runs behind the scenes.&lt;/p>
&lt;p>The next thing I want to add is direct output to the Obsidian inbox. Right now the flow is: convert → download ZIP → move to vault. A toggle that sends output directly to &lt;code>~/projects/obsidian-vault/00-inbox/&lt;/code> would cut that manual step entirely.&lt;/p>
&lt;p>The tool is doing what I originally wanted: converting my Oracle documentation and AI library into clean, searchable Markdown. Docling is what makes that reliable for the documents that actually matter.&lt;/p></content></item><item><title>Front Matter Is the Schema of Your Knowledge Base</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/front-matter-is-the-schema-of-your-knowledge-base/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/front-matter-is-the-schema-of-your-knowledge-base/</guid><description>&lt;h1 id="front-matter-is-the-schema-of-your-knowledge-base">Front Matter Is the Schema of Your Knowledge Base&lt;/h1>
&lt;p>There is a Dataview query I run at least once a week:&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-dataview" data-lang="dataview">TABLE date, author, genre
FROM &amp;#34;30-books&amp;#34;
WHERE contains(tags, &amp;#34;non-fiction&amp;#34;) AND status = &amp;#34;finished&amp;#34;
SORT date DESC
&lt;/code>&lt;/pre>&lt;p>It gives me a table of every non-fiction book I have finished, when I completed it, and who wrote it — in about 200 milliseconds. When I want to find what I read on a specific topic, I do not dig through folders or search my memory. I run the query.&lt;/p></description><content>&lt;h1 id="front-matter-is-the-schema-of-your-knowledge-base">Front Matter Is the Schema of Your Knowledge Base&lt;/h1>
&lt;p>There is a Dataview query I run at least once a week:&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-dataview" data-lang="dataview">TABLE date, author, genre
FROM &amp;#34;30-books&amp;#34;
WHERE contains(tags, &amp;#34;non-fiction&amp;#34;) AND status = &amp;#34;finished&amp;#34;
SORT date DESC
&lt;/code>&lt;/pre>&lt;p>It gives me a table of every non-fiction book I have finished, when I completed it, and who wrote it — in about 200 milliseconds. When I want to find what I read on a specific topic, I do not dig through folders or search my memory. I run the query.&lt;/p>
&lt;p>That query only works because every note in that folder has structured front matter. Without it, Dataview has nothing to read, and the query returns zero results. I would be back to scrolling through files, reading titles, hoping I named things consistently.&lt;/p>
&lt;p>That is not a trivial difference. It is the difference between a note-taking app and a knowledge base.&lt;/p>
&lt;hr>
&lt;h2 id="the-unstructured-vault-problem">The Unstructured Vault Problem&lt;/h2>
&lt;p>Most people start Obsidian the same way: create a folder structure, drop notes in, link a few things. It feels organized at first. Folders give the illusion of structure.&lt;/p>
&lt;p>The problem is that folders are physical storage, not logical structure. A note about a book you finished sits in &lt;code>30-books/&lt;/code>. That tells you where the file lives. It tells you nothing about when you read it, whether you finished it, who wrote it, what genre it is, or whether it connects to three other books you read on the same topic in a different folder.&lt;/p>
&lt;p>Worse, that knowledge is invisible to anything that tries to read your vault programmatically. Dataview cannot query it. A PAI (Personal AI Infrastructure) skill cannot filter for it. An AI context loader cannot select it by relevance. The information exists, but it is locked inside prose, retrievable only by a human reading the file.&lt;/p>
&lt;p>When your vault grows past a few hundred notes, that model collapses.&lt;/p>
&lt;hr>
&lt;h2 id="what-front-matter-actually-is">What Front Matter Actually Is&lt;/h2>
&lt;p>Front matter is a YAML block at the top of a markdown file, delimited by triple dashes. It holds structured key-value pairs that describe the note — not the content itself, but metadata about it.&lt;/p>
&lt;p>It is not magic and it is not complicated. It is a schema.&lt;/p>
&lt;p>A minimal front matter block for a knowledge base note might look like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>---
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">title&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Thinking, Fast and Slow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">date&lt;/span>: &lt;span style="color:#e6db74">2026-03-12&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">tags&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">non-fiction&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">psychology&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">behavioral-economics&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">status&lt;/span>: &lt;span style="color:#ae81ff">finished&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">author&lt;/span>: &lt;span style="color:#ae81ff">Daniel Kahneman&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">rating&lt;/span>: &lt;span style="color:#ae81ff">5&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>---
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Three fields do most of the work: &lt;code>tags&lt;/code> (what domain and type is this), &lt;code>date&lt;/code> (when), and a &lt;code>status&lt;/code> or &lt;code>type&lt;/code> field (where in its lifecycle). Everything else is optional until a specific query demands it.&lt;/p>
&lt;hr>
&lt;h2 id="what-it-unlocks">What It Unlocks&lt;/h2>
&lt;p>&lt;strong>Dataview queries.&lt;/strong> Once your notes have consistent front matter, Dataview turns your vault into a queryable database. You can build a live table of unresolved issues, a list of certification notes by module, a filtered view of blog drafts not yet published. The query language is simple. The payoff is immediate.&lt;/p>
&lt;p>&lt;strong>Cross-domain filtering.&lt;/strong> My vault spans four domains: career notes, AI governance certification notes, PAI infrastructure documentation, and blog post drafts. Without front matter, navigating across those domains means folder-hopping. With front matter, I can query across all four simultaneously — surface everything tagged &lt;code>behavioral-economics&lt;/code> regardless of where it lives, or find all notes with &lt;code>status: in-progress&lt;/code> across every section at once. The folder structure stays for physical organization. Front matter handles the logical layer.&lt;/p>
&lt;p>&lt;strong>AI context loading.&lt;/strong> This is the one that changed how I think about it. PAI does not load my entire vault into context when I ask a question about something I have read. It loads notes that match specific criteria: the right tags, the right domain, the right status. That selection mechanism is front matter. Without structured metadata, the system gets everything or nothing. With it, loading can be precise.&lt;/p>
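&lt;p>To make the selection idea concrete, here is a small sketch of filtering notes by their front matter. It is not how PAI or Dataview is implemented. The function names are hypothetical, and the parser handles only the flat subset used in this post (scalar values and dash-prefixed lists); a real vault should use a proper YAML library.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">def parse_front_matter(text):
    """Parse a flat front matter block: scalars and dash-prefixed lists only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break
        if stripped.startswith("- ") and current_key is not None:
            # A list item belongs to the most recent key, such as "tags".
            if not isinstance(meta[current_key], list):
                meta[current_key] = []
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            meta[current_key] = value.strip().strip('"')
    return meta


def select_notes(notes, tag, status):
    """Keep only the notes whose metadata carries both the tag and the status."""
    chosen = []
    for name, text in notes.items():
        meta = parse_front_matter(text)
        tags = meta.get("tags")
        if isinstance(tags, list) and tag in tags and meta.get("status") == status:
            chosen.append(name)
    return sorted(chosen)
&lt;/code>&lt;/pre>&lt;p>A note with no front matter parses to an empty dictionary, so it can never match a filter. That is the &amp;ldquo;everything or nothing&amp;rdquo; problem in miniature.&lt;/p>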
&lt;hr>
&lt;h2 id="before-and-after-the-same-note">Before and After: The Same Note&lt;/h2>
&lt;p>&lt;strong>Without front matter:&lt;/strong>&lt;/p>
&lt;pre tabindex="0">&lt;code># Thinking, Fast and Slow
Really good book. Kahneman breaks down how we make decisions — System 1
is fast and intuitive, System 2 is slow and deliberate. The section on
cognitive biases was the most useful part. Finished it in March. Would
recommend to anyone interested in decision-making or behavioral economics.
&lt;/code>&lt;/pre>&lt;p>This is a fine note. It has the information. But Dataview cannot surface it in a query. PAI cannot identify it as a finished book on behavioral economics. Six months from now, I will not remember I wrote it unless I happen to search the right words.&lt;/p>
&lt;p>&lt;strong>With front matter:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>---
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">title&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Thinking, Fast and Slow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">date&lt;/span>: &lt;span style="color:#e6db74">2026-03-12&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">tags&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">non-fiction&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">psychology&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">behavioral-economics&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> - &lt;span style="color:#ae81ff">decision-making&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">status&lt;/span>: &lt;span style="color:#ae81ff">finished&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">author&lt;/span>: &lt;span style="color:#ae81ff">Daniel Kahneman&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">rating&lt;/span>: &lt;span style="color:#ae81ff">5&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>---
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now the note is queryable. PAI surfaces it automatically when I ask about books on decision-making. Dataview includes it in my Q1 reading table. I can filter for all five-star books across my entire reading folder. The content of the note is identical — only the schema changed.&lt;/p>
&lt;hr>
&lt;h2 id="the-architecture-argument">The Architecture Argument&lt;/h2>
&lt;p>A relational database without a schema is just a collection of text files. An Obsidian vault without front matter is nearly the same thing — a sophisticated folder system with backlinks and a graph view, but still fundamentally unqueryable by anything that needs to select notes by attribute.&lt;/p>
&lt;p>Front matter gives your vault a schema. Folders give it a physical address. You need both, but the schema is what makes a vault a knowledge base. Without it, you are building a library where every book is correctly shelved but nothing has a catalog entry. Finding anything specific means walking the stacks and reading spines.&lt;/p>
&lt;hr>
&lt;h2 id="where-to-start">Where to Start&lt;/h2>
&lt;p>Do not design an elaborate front matter schema before you have written a hundred notes. That is premature optimization and it will not survive contact with actual usage.&lt;/p>
&lt;p>Start with three fields: &lt;code>tags&lt;/code>, &lt;code>date&lt;/code>, and &lt;code>status&lt;/code>. Add &lt;code>type&lt;/code> if your notes serve different purposes (reference, log, draft, fix-doc). Add domain-specific fields only when a query demands them.&lt;/p>
&lt;p>The schema should be pulled from how you actually search, not pushed from how you think you might want to search someday. Write the notes, run queries against three fields, and let the gaps tell you what to add next. The vault teaches you what it needs — if you have given it enough structure to communicate.&lt;/p></content></item><item><title>I Turned 50 Cybersecurity Books Into a Searchable Brain</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</guid><description>&lt;h2 id="the-problem-with-security-books">The Problem With Security Books&lt;/h2>
&lt;p>I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; As with most people&amp;rsquo;s collections, they lived in a folder I rarely opened.&lt;/p>
&lt;p>The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p></description><content>&lt;h2 id="the-problem-with-security-books">The Problem With Security Books&lt;/h2>
&lt;p>I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; As with most people&amp;rsquo;s collections, they lived in a folder I rarely opened.&lt;/p>
&lt;p>The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p>
&lt;p>That&amp;rsquo;s not a knowledge base. That&amp;rsquo;s a graveyard.&lt;/p>
&lt;p>So I built something better: a semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages that answer them, each with its book and chapter. The database and the search itself run on my machine.&lt;/p>
&lt;p>Here&amp;rsquo;s how I built it, and why it&amp;rsquo;s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.&lt;/p>
&lt;hr>
&lt;h2 id="what-semantic-search-actually-means">What Semantic Search Actually Means&lt;/h2>
&lt;p>Traditional search is keyword matching. You type &amp;ldquo;SQL injection&amp;rdquo; and it finds documents containing those exact words.&lt;/p>
&lt;p>Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent &lt;em>meaning&lt;/em> in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask &amp;ldquo;how to bypass database input validation&amp;rdquo; and you&amp;rsquo;ll surface the same SQL injection content, even though you never typed &amp;ldquo;SQL injection.&amp;rdquo;&lt;/p>
&lt;p>This matters enormously for a security knowledge base. Security concepts go by many names. &amp;ldquo;Privilege escalation,&amp;rdquo; &amp;ldquo;privesc,&amp;rdquo; &amp;ldquo;root access,&amp;rdquo; and &amp;ldquo;vertical privilege abuse&amp;rdquo; all point at the same idea, and semantic search can surface passages that use any of them.&lt;/p>
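&lt;p>A toy example shows the mechanics. The three-dimensional vectors below are invented by hand purely for illustration; real embeddings come from a model and have far more dimensions. Cosine similarity is one common way to compare them:&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors standing in for embedded passages (illustration only).
docs = {
    "SQL injection payloads": [0.9, 0.1, 0.0],
    "Linux privilege escalation": [0.1, 0.9, 0.1],
    "Wireless packet capture": [0.0, 0.2, 0.9],
}
# Stands in for the embedded query "bypass database input validation".
query = [0.8, 0.2, 0.1]

ranked = sorted(docs, key=lambda title: cosine_similarity(query, docs[title]), reverse=True)
print(ranked[0])  # prints: SQL injection payloads
&lt;/code>&lt;/pre>&lt;p>The query shares no words with the winning title. It ranks first only because its vector points in nearly the same direction.&lt;/p>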
&lt;hr>
&lt;h2 id="the-stack">The Stack&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>PostgreSQL 17&lt;/strong> — the database&lt;/li>
&lt;li>&lt;strong>pgvector 0.8.2&lt;/strong> — vector similarity search extension for Postgres&lt;/li>
&lt;li>&lt;strong>OpenAI text-embedding-3-small&lt;/strong> — converts text chunks to 1536-dimensional vectors&lt;/li>
&lt;li>&lt;strong>CyberSecKB.ts&lt;/strong> — a custom Bun/TypeScript CLI I built to tie it all together&lt;/li>
&lt;/ul>
&lt;p>Storage and search run locally. The only external calls go to OpenAI&amp;rsquo;s embedding API: each chunk is embedded once at ingest time, and each search sends only the short query text to be embedded with the same model.&lt;/p>
&lt;hr>
&lt;h2 id="the-pipeline-from-pdf-to-searchable-knowledge">The Pipeline: From PDF to Searchable Knowledge&lt;/h2>
&lt;h3 id="step-1-convert-pdfs-to-markdown">Step 1: Convert PDFs to Markdown&lt;/h3>
&lt;p>Raw PDFs are terrible for text processing. I convert everything to Markdown first using a &lt;code>pdf2md&lt;/code> Python tool:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd ~/projects/pdf-to-markdown
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source venv/bin/activate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Text-based PDFs (most books):&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md input/mybook.pdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Image-based or scanned PDFs (use OCR first):&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md /tmp/ocr.pdf output/mybook.md
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Move to library:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>mv output/mybook.md ~/projects/cybersecurity-library/books/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="step-2-ingest-into-the-database">Step 2: Ingest into the Database&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>TOOL&lt;span style="color:#f92672">=&lt;/span>~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Single book with topics tagged:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL ingest &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --file ~/projects/cybersecurity-library/books/mybook.md &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --title &lt;span style="color:#e6db74">&amp;#34;My Book Title&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --topics web,network,linux
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Or load everything at once:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The ingest process:&lt;/p>
&lt;ol>
&lt;li>Reads the Markdown file&lt;/li>
&lt;li>Splits it into ~800-token chunks, preserving chapter headings&lt;/li>
&lt;li>Sends chunks to OpenAI&amp;rsquo;s embedding API in batches&lt;/li>
&lt;li>Stores chunks + their vector embeddings in PostgreSQL&lt;/li>
&lt;/ol>
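&lt;p>Step 2 is the part most worth seeing in code. The sketch below is a Python illustration, not the TypeScript inside &lt;code>CyberSecKB.ts&lt;/code>. The function name &lt;code>chunk_markdown&lt;/code> is hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer would count differently.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">def chunk_markdown(text, max_tokens=800):
    """Split Markdown into chunks, tagging each with the heading above it.

    Tokens are approximated by whitespace-separated words (an assumption
    made for illustration only).
    """
    chunks = []
    heading, words = "", []

    def flush():
        # Emit the words collected so far under the current heading.
        if words:
            chunks.append({"heading": heading, "text": " ".join(words)})
            words.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            # A new heading closes the current chunk so chapters never mix.
            # Simplification: a "#" comment inside a code block also counts.
            flush()
            heading = line.lstrip("#").strip()
            continue
        for word in line.split():
            words.append(word)
            if len(words) >= max_tokens:
                flush()
    flush()
    return chunks
&lt;/code>&lt;/pre>&lt;p>Carrying the heading on every chunk is what lets a search result name the chapter it came from, even when the matching passage sits deep inside that chapter.&lt;/p>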
&lt;h3 id="step-3-search">Step 3: Search&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Plain English query:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL search &lt;span style="color:#e6db74">&amp;#34;how do attackers bypass WAF rules for SQL injection&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Filter by topic:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL search &lt;span style="color:#e6db74">&amp;#34;privilege escalation&amp;#34;&lt;/span> --topics linux --limit &lt;span style="color:#ae81ff">5&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Check what&amp;#39;s in the KB:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL list
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL stats
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="what-it-looks-like-in-practice">What It Looks Like in Practice&lt;/h2>
&lt;p>Here&amp;rsquo;s a real query. I asked:&lt;/p>
&lt;pre tabindex="0">&lt;code>bun $TOOL search &amp;#34;SQL injection bypass techniques&amp;#34; --limit 3
&lt;/code>&lt;/pre>&lt;p>Result:&lt;/p>
&lt;pre tabindex="0">&lt;code>━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
The `;` metacharacter in a SQL statement is used similarly to how it&amp;#39;s used
in command injection to combine multiple queries on the same line...
━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
If user input is used without prior validation, and it is concatenated
directly into a SQL query, a user can inject different data...
━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
Input taken from cookies, input forms, and URL variables is used to build
SQL statements that are passed back to the database...
&lt;/code>&lt;/pre>&lt;p>Each result shows the similarity score, book title, chapter, and a preview of the passage, so I can tell immediately which book and chapter to open.&lt;/p>
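&lt;p>A minimal sketch of how a score like this can be produced, assuming it is the cosine similarity between the query embedding and a chunk embedding, shown as a percentage. The post does not spell out the exact formula, and the function names here are illustrative rather than the tool&amp;rsquo;s real API.&lt;/p>

```typescript
// Hypothetical sketch: cosine similarity between two embedding vectors,
// formatted as a percentage like the scores in the search output.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, i) => {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Turn a similarity in the range 0..1 into a one-decimal percentage string.
function formatScore(similarity: number): string {
  return `${(similarity * 100).toFixed(1)}%`;
}

console.log(formatScore(cosineSimilarity([1, 0], [1, 0]))); // 100.0%
```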
&lt;p>Another query — privilege escalation:&lt;/p>
&lt;pre tabindex="0">&lt;code>bun $TOOL search &amp;#34;privilege escalation linux&amp;#34; --limit 3
&lt;/code>&lt;/pre>&lt;pre tabindex="0">&lt;code>━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
Most systems are built using the least privilege concept — users are
purposefully given the least privileges they need to perform their work...
━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
do proper checking of file creation in the upper filesystem area...
━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
On Linux, vertical escalation allows attackers to have root privileges
that enable them to modify systems and programs...
&lt;/code>&lt;/pre>&lt;p>This is the power of the system: I asked about a concept, not a keyword, and got three specific, sourced, actionable passages from two different books.&lt;/p>
&lt;hr>
&lt;h2 id="the-current-state-of-the-kb">The Current State of the KB&lt;/h2>
&lt;p>After the initial batch ingest:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>50 books&lt;/strong> indexed&lt;/li>
&lt;li>&lt;strong>11,757 chunks&lt;/strong> stored and embedded&lt;/li>
&lt;li>Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more&lt;/li>
&lt;/ul>
&lt;p>Some of what&amp;rsquo;s in there:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Practical Malware Analysis&lt;/em> (620 chunks)&lt;/li>
&lt;li>&lt;em>Cybersecurity Threats, Malware Trends and Strategies&lt;/em> (552 chunks)&lt;/li>
&lt;li>&lt;em>Cybersecurity Attack and Defense Strategies&lt;/em> (460 chunks)&lt;/li>
&lt;li>&lt;em>Security Chaos Engineering&lt;/em> (387 chunks)&lt;/li>
&lt;li>&lt;em>Hardware Hacking Handbook&lt;/em> (378 chunks)&lt;/li>
&lt;li>&lt;em>Modern Data Protection&lt;/em> (338 chunks)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="why-this-fits-into-pai">Why This Fits Into PAI&lt;/h2>
&lt;p>This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that &lt;em>amplifies&lt;/em> what I can do with AI, rather than using AI one prompt at a time.&lt;/p>
&lt;p>The Security KB is a perfect example. It&amp;rsquo;s not about asking ChatGPT &amp;ldquo;explain SQL injection.&amp;rdquo; It&amp;rsquo;s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.&lt;/p>
&lt;p>When I&amp;rsquo;m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.&lt;/p>
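&lt;p>The search-then-prompt step can be sketched roughly as follows. This is an illustration, not Luna&amp;rsquo;s actual code: the &lt;code>SearchResult&lt;/code> shape and &lt;code>buildGroundedPrompt&lt;/code> are hypothetical names, and the instruction wording is a placeholder.&lt;/p>

```typescript
// Hypothetical sketch: turn KB search hits into a prompt that is grounded
// in the retrieved passages and keeps their sources visible.
interface SearchResult {
  score: number;   // similarity in the range 0..1
  book: string;    // book title
  chapter: string; // chapter heading
  text: string;    // passage text
}

function buildGroundedPrompt(question: string, results: SearchResult[]): string {
  // Number each passage so the answer can cite it.
  const context = results
    .map((r, i) => `[${i + 1}] ${r.book} → ${r.chapter}\n${r.text}`)
    .join("\n\n");

  return [
    "Answer using only the numbered passages below and cite them by number.",
    "",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

&lt;p>Keeping the book and chapter next to each passage is what makes the answer traceable back to the library.&lt;/p>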
&lt;hr>
&lt;h2 id="building-it-with-claude-code">Building It With Claude Code&lt;/h2>
&lt;p>The entire CyberSecKB tool was built using Claude Code through PAI. The process:&lt;/p>
&lt;ol>
&lt;li>Described what I wanted: ingest markdown books, chunk by section, embed with OpenAI, store in pgvector&lt;/li>
&lt;li>Claude Code scaffolded the TypeScript CLI&lt;/li>
&lt;li>We hit a few real-world issues along the way:
&lt;ul>
&lt;li>The OpenAI project key needed embedding model access enabled separately&lt;/li>
&lt;li>A batch size of 2048 chunks exceeded the 300k tokens-per-request limit, so it was tuned down to 200&lt;/li>
&lt;li>The 1M tokens/minute rate limit required adding a 15-second delay between batches&lt;/li>
&lt;li>A SQL type error in the search function when no topics filter was passed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.&lt;/p>
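&lt;p>The batch-size and delay fixes can be sanity-checked with some quick arithmetic. The sketch below assumes every chunk is close to the ~800-token target; the helper names are hypothetical, and the limits are the ones quoted in the list above.&lt;/p>

```typescript
// Hypothetical sketch: check a batch size and delay against the API limits.
// Assumes every chunk is about 800 tokens.
const TOKENS_PER_CHUNK = 800;
const MAX_TOKENS_PER_REQUEST = 300_000;
const MAX_TOKENS_PER_MINUTE = 1_000_000;

function tokensPerRequest(batchSize: number): number {
  return batchSize * TOKENS_PER_CHUNK;
}

// Upper bound on throughput: treats each request as instant, so only the
// delay between batches limits the rate.
function tokensPerMinute(batchSize: number, delaySeconds: number): number {
  return tokensPerRequest(batchSize) * (60 / delaySeconds);
}

function fitsLimits(batchSize: number, delaySeconds: number): boolean {
  const requestOk = MAX_TOKENS_PER_REQUEST >= tokensPerRequest(batchSize);
  const minuteOk =
    MAX_TOKENS_PER_MINUTE >= tokensPerMinute(batchSize, delaySeconds);
  return requestOk ? minuteOk : false;
}

// Split a list of chunk texts into consecutive batches of at most `size`.
function toBatches(items: string[], size: number): string[][] {
  const batches: string[][] = [];
  for (let start = 0; items.length > start; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

console.log(tokensPerRequest(2048));    // 1638400, over the 300k request limit
console.log(tokensPerRequest(200));     // 160000, under the request limit
console.log(tokensPerMinute(200, 15));  // 640000, under the 1M per-minute limit
```

&lt;p>Under these assumptions, 200 chunks per request with a 15-second pause leaves headroom on both limits, which matches the settings that ended up working.&lt;/p>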
&lt;hr>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>A few things I want to add:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tag all books with proper topics&lt;/strong> — the batch ingest skipped topic assignment; I&amp;rsquo;ll tag each book so &lt;code>--topics web&lt;/code> or &lt;code>--topics linux&lt;/code> filters actually work&lt;/li>
&lt;li>&lt;strong>Tier 1 topic files&lt;/strong> — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context&lt;/li>
&lt;li>&lt;strong>AI Security KB integration&lt;/strong>: the AI Security research KB shares the same database, so a single query can span both domains&lt;/li>
&lt;/ul>
&lt;p>The knowledge base is live. The friction is gone. Now the books actually get used.&lt;/p>
&lt;hr>
&lt;p>&lt;em>Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. All processing runs locally except the OpenAI embedding API calls, which happen at ingest time and when a search query is embedded.&lt;/em>&lt;/p></content></item></channel></rss>