<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Embeddings on</title><link>https://augmentedresilience.com/tags/embeddings/</link><description>Recent content in Embeddings on</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 21 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://augmentedresilience.com/tags/embeddings/index.xml" rel="self" type="application/rss+xml"/><item><title>I Turned 50 Cybersecurity Books Into a Searchable Brain</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</guid><description>&lt;h2 id="the-problem-with-security-books">The Problem With Security Books&lt;/h2>
&lt;p>I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; Like most people&amp;rsquo;s collections, they lived in a folder I rarely opened.&lt;/p>
&lt;p>The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p></description><content>&lt;h2 id="the-problem-with-security-books">The Problem With Security Books&lt;/h2>
&lt;p>I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; Like most people&amp;rsquo;s collections, they lived in a folder I rarely opened.&lt;/p>
&lt;p>The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p>
&lt;p>That&amp;rsquo;s not a knowledge base. That&amp;rsquo;s a graveyard.&lt;/p>
&lt;p>So I built something better: a local semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages — with the book and chapter — that answer them. The whole thing runs locally on my machine.&lt;/p>
&lt;p>Here&amp;rsquo;s how I built it, and why it&amp;rsquo;s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.&lt;/p>
&lt;hr>
&lt;h2 id="what-semantic-search-actually-means">What Semantic Search Actually Means&lt;/h2>
&lt;p>Traditional search is keyword matching. You type &amp;ldquo;SQL injection&amp;rdquo; and it finds documents containing those exact words.&lt;/p>
&lt;p>Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent &lt;em>meaning&lt;/em> in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask &amp;ldquo;how to bypass database input validation&amp;rdquo; and you&amp;rsquo;ll surface the same SQL injection content, even though you never typed &amp;ldquo;SQL injection.&amp;rdquo;&lt;/p>
&lt;p>This matters enormously for a security knowledge base, because security concepts go by many names. &amp;ldquo;Privilege escalation,&amp;rdquo; &amp;ldquo;privesc,&amp;rdquo; &amp;ldquo;root access,&amp;rdquo; &amp;ldquo;vertical privilege abuse&amp;rdquo;: all of these point at the same idea, and semantic search surfaces passages that use any of them.&lt;/p>
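&lt;p>To make &amp;ldquo;similar concepts cluster together&amp;rdquo; concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are invented for illustration (real &lt;code>text-embedding-3-small&lt;/code> vectors have 1536 dimensions), and the function names are assumptions, not taken from CyberSecKB.ts.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-typescript" data-lang="typescript">// Toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.
// The numbers are made up for illustration; they are not real model output.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine similarity: the dot product divided by the product of the lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const sqlInjection = [0.9, 0.1, 0.2];     // hypothetical "SQL injection" chunk
const bypassValidation = [0.8, 0.2, 0.3]; // hypothetical "bypass input validation" query
const hardwareHacking = [0.1, 0.9, 0.1];  // hypothetical unrelated chunk

console.log(cosineSimilarity(sqlInjection, bypassValidation).toFixed(3));
console.log(cosineSimilarity(sqlInjection, hardwareHacking).toFixed(3));
&lt;/code>&lt;/pre>&lt;p>The first pair scores far higher than the second because the two vectors point in nearly the same direction, even though the underlying phrases share no keywords.&lt;/p>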
&lt;hr>
&lt;h2 id="the-stack">The Stack&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>PostgreSQL 17&lt;/strong> — the database&lt;/li>
&lt;li>&lt;strong>pgvector 0.8.2&lt;/strong> — vector similarity search extension for Postgres&lt;/li>
&lt;li>&lt;strong>OpenAI text-embedding-3-small&lt;/strong> — converts text chunks to 1536-dimensional vectors&lt;/li>
&lt;li>&lt;strong>CyberSecKB.ts&lt;/strong> — a custom Bun/TypeScript CLI I built to tie it all together&lt;/li>
&lt;/ul>
&lt;p>Everything runs locally. The only external calls go to OpenAI&amp;rsquo;s embedding API: the bulk happens once, at ingest time, and each search then needs one small call to embed the query with the same model.&lt;/p>
&lt;hr>
&lt;h2 id="the-pipeline-from-pdf-to-searchable-knowledge">The Pipeline: From PDF to Searchable Knowledge&lt;/h2>
&lt;h3 id="step-1-convert-pdfs-to-markdown">Step 1: Convert PDFs to Markdown&lt;/h3>
&lt;p>Raw PDFs are terrible for text processing. I convert everything to Markdown first using a &lt;code>pdf2md&lt;/code> Python tool:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd ~/projects/pdf-to-markdown
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source venv/bin/activate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Text-based PDFs (most books):&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md input/mybook.pdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Image-based or scanned PDFs (use OCR first):&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md /tmp/ocr.pdf output/mybook.md
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Move to library:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>mv output/mybook.md ~/projects/cybersecurity-library/books/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="step-2-ingest-into-the-database">Step 2: Ingest into the Database&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>TOOL&lt;span style="color:#f92672">=&lt;/span>~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Single book with topics tagged:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL ingest &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --file ~/projects/cybersecurity-library/books/mybook.md &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --title &lt;span style="color:#e6db74">&amp;#34;My Book Title&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --topics web,network,linux
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Or load everything at once:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The ingest process:&lt;/p>
&lt;ol>
&lt;li>Reads the Markdown file&lt;/li>
&lt;li>Splits it into ~800-token chunks, preserving chapter headings&lt;/li>
&lt;li>Sends chunks to OpenAI&amp;rsquo;s embedding API in batches&lt;/li>
&lt;li>Stores chunks + their vector embeddings in PostgreSQL&lt;/li>
&lt;/ol>
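&lt;p>The chunking step in that list could look roughly like the sketch below. It is written under assumptions and is not the actual CyberSecKB.ts code: token counts are approximated as four characters per token, only the first three heading levels count as chapter boundaries, and the &lt;code>Chunk&lt;/code> shape and &lt;code>chunkMarkdown&lt;/code> name are hypothetical.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-typescript" data-lang="typescript">interface Chunk {
  chapter: string;
  text: string;
}

// Rough heuristic (an assumption): about 4 characters per token for English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function chunkMarkdown(markdown: string, maxTokens = 800): Chunk[] {
  const chunks: Chunk[] = [];
  let chapter = "Untitled";
  let buffer: string[] = [];

  // Emit whatever is buffered as one chunk tagged with the current chapter.
  const flush = (): void => {
    const text = buffer.join("\n\n").trim();
    if (text.length > 0) chunks.push({ chapter, text });
    buffer = [];
  };

  // Blank lines separate paragraph-level blocks.
  for (const block of markdown.split(/\n\s*\n/)) {
    const lines = block.split("\n");
    const heading = lines[0].match(/^#{1,3}\s+(.+)$/);
    if (heading !== null) {
      flush(); // never let a chunk straddle two chapters
      chapter = heading[1].trim();
      lines.shift();
    }
    const body = lines.join("\n").trim();
    if (body.length === 0) continue;

    // Start a new chunk when adding this block would exceed the budget.
    // A single block larger than the budget still becomes its own chunk.
    const candidate = [...buffer, body].join("\n\n");
    if (estimateTokens(candidate) > maxTokens) flush();
    buffer.push(body);
  }
  flush();
  return chunks;
}
&lt;/code>&lt;/pre>&lt;p>Carrying the chapter name on every chunk is what lets a search result point back to a book and chapter rather than an anonymous passage.&lt;/p>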
&lt;h3 id="step-3-search">Step 3: Search&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Plain English query:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL search &lt;span style="color:#e6db74">&amp;#34;how do attackers bypass WAF rules for SQL injection&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Filter by topic:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL search &lt;span style="color:#e6db74">&amp;#34;privilege escalation&amp;#34;&lt;/span> --topics linux --limit &lt;span style="color:#ae81ff">5&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Check what&amp;#39;s in the KB:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL list
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bun $TOOL stats
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="what-it-looks-like-in-practice">What It Looks Like in Practice&lt;/h2>
&lt;p>Here&amp;rsquo;s a real query. I asked:&lt;/p>
&lt;pre tabindex="0">&lt;code>bun $TOOL search &amp;#34;SQL injection bypass techniques&amp;#34; --limit 3
&lt;/code>&lt;/pre>&lt;p>Result:&lt;/p>
&lt;pre tabindex="0">&lt;code>━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
The `;` metacharacter in a SQL statement is used similarly to how it&amp;#39;s used
in command injection to combine multiple queries on the same line...
━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
If user input is used without prior validation, and it is concatenated
directly into a SQL query, a user can inject different data...
━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
Input taken from cookies, input forms, and URL variables is used to build
SQL statements that are passed back to the database...
&lt;/code>&lt;/pre>&lt;p>Each result shows the similarity score, book title, chapter, and a preview of the passage, so I can immediately tell which book to dig into.&lt;/p>
&lt;p>Another query — privilege escalation:&lt;/p>
&lt;pre tabindex="0">&lt;code>bun $TOOL search &amp;#34;privilege escalation linux&amp;#34; --limit 3
&lt;/code>&lt;/pre>&lt;pre tabindex="0">&lt;code>━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
Most systems are built using the least privilege concept — users are
purposefully given the least privileges they need to perform their work...
━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
do proper checking of file creation in the upper filesystem area...
━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
On Linux, vertical escalation allows attackers to have root privileges
that enable them to modify systems and programs...
&lt;/code>&lt;/pre>&lt;p>This is the power of the system: I asked about a concept, not a keyword, and got specific, sourced, actionable results from two different books.&lt;/p>
&lt;hr>
&lt;h2 id="the-current-state-of-the-kb">The Current State of the KB&lt;/h2>
&lt;p>After the initial batch ingest:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>50 books&lt;/strong> indexed&lt;/li>
&lt;li>&lt;strong>11,757 chunks&lt;/strong> stored and embedded&lt;/li>
&lt;li>Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more&lt;/li>
&lt;/ul>
&lt;p>Some of what&amp;rsquo;s in there:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Practical Malware Analysis&lt;/em> (620 chunks)&lt;/li>
&lt;li>&lt;em>Cybersecurity Threats, Malware Trends and Strategies&lt;/em> (552 chunks)&lt;/li>
&lt;li>&lt;em>Cybersecurity Attack and Defense Strategies&lt;/em> (460 chunks)&lt;/li>
&lt;li>&lt;em>Security Chaos Engineering&lt;/em> (387 chunks)&lt;/li>
&lt;li>&lt;em>Hardware Hacking Handbook&lt;/em> (378 chunks)&lt;/li>
&lt;li>&lt;em>Modern Data Protection&lt;/em> (338 chunks)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="why-this-fits-into-pai">Why This Fits Into PAI&lt;/h2>
&lt;p>This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that &lt;em>amplifies&lt;/em> what I can do with AI, rather than using AI one prompt at a time.&lt;/p>
&lt;p>The Security KB is a perfect example. It&amp;rsquo;s not about asking ChatGPT &amp;ldquo;explain SQL injection.&amp;rdquo; It&amp;rsquo;s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.&lt;/p>
&lt;p>When I&amp;rsquo;m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.&lt;/p>
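&lt;p>The &amp;ldquo;search the KB, pull context into the prompt&amp;rdquo; step can be sketched as a small formatting function. This is an illustration under assumptions, not how Luna actually does it: the &lt;code>Hit&lt;/code> fields, the &lt;code>buildGroundedPrompt&lt;/code> name, the 0.5 score cutoff, and the instruction wording are all hypothetical.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-typescript" data-lang="typescript">// Hypothetical shape of one search hit; the field names are assumptions.
interface Hit {
  score: number; // similarity in the range 0..1
  book: string;
  chapter: string;
  text: string;
}

// Turn search hits into a prompt that asks for an answer grounded in them.
function buildGroundedPrompt(question: string, hits: Hit[], minScore = 0.5): string {
  const sources = hits
    .filter((hit) => hit.score >= minScore) // drop weak matches
    .map((hit, i) => `[${i + 1}] ${hit.book} / ${hit.chapter}\n${hit.text}`);

  return [
    "Answer using only the numbered sources below and cite them by number.",
    "If the sources do not cover the question, say so.",
    "",
    ...sources,
    "",
    `Question: ${question}`,
  ].join("\n");
}
&lt;/code>&lt;/pre>&lt;p>Numbering the sources and keeping book and chapter next to each passage means any answer can be traced back to the library.&lt;/p>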
&lt;hr>
&lt;h2 id="building-it-with-claude-code">Building It With Claude Code&lt;/h2>
&lt;p>The entire CyberSecKB tool was built using Claude Code through PAI. The process:&lt;/p>
&lt;ol>
&lt;li>Described what I wanted: ingest markdown books, chunk by section, embed with OpenAI, store in pgvector&lt;/li>
&lt;li>Claude Code scaffolded the TypeScript CLI&lt;/li>
&lt;li>We hit a few real-world issues along the way:
&lt;ul>
&lt;li>The OpenAI project key needed embedding model access enabled separately&lt;/li>
&lt;li>Batch size of 2048 hit the 300k token/request limit — tuned down to 200&lt;/li>
&lt;li>The 1M tokens/minute rate limit required adding a 15-second delay between batches&lt;/li>
&lt;li>A SQL type error in the search function when no topics filter was passed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.&lt;/p>
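&lt;p>The batch-size and delay numbers above can be sanity-checked with some quick arithmetic. The sketch below assumes every chunk is close to the ~800-token target and ignores request latency, so the per-minute figure is an upper bound. &lt;code>toBatches&lt;/code> and &lt;code>runBatches&lt;/code> are illustrative helpers, not functions from the real tool.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-typescript" data-lang="typescript">const TOKENS_PER_CHUNK = 800; // approximate chunk size from the ingest step
const BATCH_SIZE = 200;       // chunks per embedding request
const DELAY_SECONDS = 15;     // pause between requests
const TOTAL_CHUNKS = 11757;   // chunks in the KB after the initial ingest

const tokensPerRequest = TOKENS_PER_CHUNK * BATCH_SIZE;        // 160,000 (cap: 300k)
const requestsPerMinute = 60 / DELAY_SECONDS;                  // 4
const tokensPerMinute = tokensPerRequest * requestsPerMinute;  // 640,000 (cap: 1M)
const requestsNeeded = Math.ceil(TOTAL_CHUNKS / BATCH_SIZE);   // 59
const originalBatchTokens = TOKENS_PER_CHUNK * 2048;           // 1,638,400 (over the cap)

// Split an array into consecutive batches of at most `size` items.
function toBatches(items: string[], size: number): string[][] {
  const batches: string[][] = [];
  for (let start = 0; items.length > start; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Process batches one at a time with a fixed pause after each.
// `handle` stands in for one embedding request (not shown here).
async function runBatches(
  items: string[],
  handle: (batch: string[]) => unknown,
  delayMs = DELAY_SECONDS * 1000,
) {
  for (const batch of toBatches(items, BATCH_SIZE)) {
    await handle(batch);
    await sleep(delayMs);
  }
}
&lt;/code>&lt;/pre>&lt;p>At 200 chunks per request, each call carries about 160k tokens, and four calls a minute stay near 640k tokens, both under their limits. A batch of 2048 chunks would be roughly 1.6M tokens, far over the 300k per-request cap.&lt;/p>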
&lt;hr>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>A few things I want to add:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tag all books with proper topics&lt;/strong> — the batch ingest skipped topic assignment; I&amp;rsquo;ll tag each book so &lt;code>--topics web&lt;/code> or &lt;code>--topics linux&lt;/code> filters actually work&lt;/li>
&lt;li>&lt;strong>Tier 1 topic files&lt;/strong> — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context&lt;/li>
&lt;li>&lt;strong>AI Security KB integration&lt;/strong> — the AI Security research KB shares the same database; queries cross both domains automatically&lt;/li>
&lt;/ul>
&lt;p>The knowledge base is live. The friction is gone. Now the books actually get used.&lt;/p>
&lt;hr>
&lt;p>&lt;em>Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. All processing runs locally except the calls to OpenAI&amp;rsquo;s embedding API.&lt;/em>&lt;/p></content></item></channel></rss>