
AI Writes Code Fast. Here's Why Security-First Thinking Matters More Than Ever.

Sun, 19 Apr 2026 00:00:00 +0000

Speed Is the Feature. Unexamined Speed Is the Liability.

One of the most impressive things about using AI to write code is how fast it moves. You describe an endpoint, a data model, a feature — and within seconds you have working code. Not a sketch. Not pseudocode. Actual, runnable implementation.

That speed is real, and it is genuinely useful. I use it every day.

But here is something I noticed the longer I worked with AI-generated code: it writes to the happy path. AI is extraordinarily good at making code that works when everything goes as expected. It is considerably less reliable at writing code that stays safe when things go wrong — when someone sends unexpected input, when a dependency is compromised, when a secret accidentally surfaces in a log, when a logged-in user tries to access someone else’s data.

This is not a criticism of AI models. It is a structural observation about how code generation works. AI learns from what code looks like, not from what happens to systems that run it. The attack surface is invisible at generation time.

Security-first thinking is the discipline that fills that gap. And building it into how you work with AI is one of the highest-leverage things you can do as a developer.

What AI Gets Wrong By Default

Before we get to principles, it helps to see the failure modes concretely. These are patterns I started noticing after reviewing AI-generated code more carefully.

Hardcoded secrets. Ask AI to write a function that connects to a database or calls an external API, and it will often produce something like this:

db = connect(host="prod-db.company.com", user="admin", password="Sup3rS3cr3t!")
client = OpenAI(api_key="sk-proj-abc123...")

The code works. It will also put your credentials in git history forever. A secret committed to a repository — even briefly, even to a private repo — must be treated as compromised. Git history is permanent. The only remediation is rotation.
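A minimal sketch of the alternative, assuming secrets are supplied through environment variables. The variable names (DB_HOST, DB_USER, DB_PASSWORD) and the helper names are hypothetical, chosen only for illustration:

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def require_secret(name: str) -> str:
    """Return the value of an environment variable, failing loudly if unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail at startup rather than falling back to a hardcoded default.
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


def load_db_settings() -> dict:
    # Variable names are illustrative, not taken from any real deployment.
    return {
        "host": require_secret("DB_HOST"),
        "user": require_secret("DB_USER"),
        "password": require_secret("DB_PASSWORD"),
    }
```

Nothing sensitive appears in the source file, so nothing sensitive lands in git history. A missing variable stops the process immediately instead of surfacing later as a confusing connection error.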

Missing input validation. AI tends to write code that trusts request data. An endpoint that receives req.body.quantity will often use it directly, skipping the check for whether it is a positive integer, whether it is within expected bounds, whether it contains what the code assumes it contains.
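As a rough illustration of the missing check, here is a sketch in Python rather than the Node style of the example above. The function name and the bounds (1 to 1000) are assumptions for the example, not values from a real endpoint:

```python
def parse_quantity(raw: object, *, minimum: int = 1, maximum: int = 1000) -> int:
    """Validate an untrusted quantity value and return it as an int.

    The default bounds are illustrative only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        raise ValueError("quantity must be a whole number")
    if isinstance(raw, int):
        value = raw
    elif isinstance(raw, str) and raw.isascii() and raw.isdigit() and len(raw) <= 9:
        # Only plain ASCII digits of a sane length are converted.
        value = int(raw)
    else:
        raise ValueError("quantity must be a whole number")
    if not minimum <= value <= maximum:
        raise ValueError(f"quantity must be between {minimum} and {maximum}")
    return value
```

The handler calls parse_quantity on the raw request value and turns a ValueError into a 400 response, so the rest of the code only ever sees a bounded integer.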

Fetch-then-check authorization. This one is subtle. AI commonly writes authorization logic like this:

const order = await db.orders.findById(req.params.id);
if (order.userId !== req.user.id) return res.status(403).send();
return res.json(order);

That looks right. But it fetches the record first and checks ownership second, and if findById returns null for an unknown ID, the second line throws before any check runs. A better pattern is to include the user ID in the query itself, so that records the user does not own simply return null, and you return a 404, not a 403. Returning 403 confirms to an attacker that the resource exists. It is a small thing that compounds at scale.
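A sketch of the query-level ownership check, using Python and an in-memory SQLite database so it runs anywhere. The table layout, column names, and function names are assumptions for the example:

```python
import sqlite3
from typing import Optional


def get_order_for_user(conn: sqlite3.Connection, order_id: int, user_id: int) -> Optional[dict]:
    """Fetch an order only if it belongs to user_id; otherwise return None."""
    row = conn.execute(
        "SELECT id, user_id, item FROM orders WHERE id = ? AND user_id = ?",
        (order_id, user_id),
    ).fetchone()
    if row is None:
        # Missing and not-owned look identical; the caller answers 404 for both.
        return None
    return {"id": row[0], "user_id": row[1], "item": row[2]}


def demo_connection() -> sqlite3.Connection:
    """Build a tiny demo database with two orders owned by different users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10, "book"), (2, 20, "lamp")],
    )
    return conn
```

With the demo data, user 10 asking for order 2 gets None, exactly as if order 2 did not exist. The record belonging to someone else is never loaded into application memory at all.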

Weak cryptography by familiarity. Ask AI to hash a password and it might reach for SHA-256. SHA-256 is a solid hash function — for data integrity. It is the wrong tool for passwords because it is fast. Fast means a GPU can test billions of candidate passwords per second against a leaked hash. bcrypt and Argon2 are deliberately slow. That slowness is the security property.
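To show what "deliberately slow" looks like in code, here is a sketch that stays inside Python's standard library. Note the substitution: bcrypt and Argon2 need third-party packages, so this example uses scrypt, another deliberately slow, memory-hard function, purely to stay self-contained. It assumes a Python build linked against OpenSSL 1.1 or later, and the work factors are illustrative:

```python
import hashlib
import hmac
import os

# Illustrative work factors; tune them for your own hardware and latency budget.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for a new password using a fresh random salt."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    return hmac.compare_digest(candidate, expected)
```

Each guess now costs an attacker real memory and time, and the per-user salt means two users with the same password get different digests.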

No rate limiting on authentication. AI generates login endpoints without the defensive scaffolding that should always accompany them: rate limiting per IP, account lockout after repeated failures, uniform response times to prevent user enumeration.
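A minimal sketch of the per-IP piece, as an in-memory sliding-window limiter. The class name and the limits (5 attempts per 60 seconds) are assumptions for illustration. A real deployment would more likely keep this state in a shared store, and this sketch does not cover account lockout or uniform response times:

```python
import time
from collections import defaultdict, deque
from typing import Callable


class LoginRateLimiter:
    """Allow at most max_attempts per key within a sliding time window."""

    def __init__(
        self,
        max_attempts: int = 5,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._attempts: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record an attempt for key and report whether it may proceed."""
        now = self._clock()
        attempts = self._attempts[key]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True
```

The login handler calls allow(client_ip) before checking credentials and returns a 429 when it comes back False.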

None of these are exotic edge cases. They are the everyday failure modes that show up in security audits and breach post-mortems, again and again.

The Mindset Shift: From Checklist to First Principles

Here is where I want to push back against the usual framing. Security is often presented as a checklist — OWASP Top 10, compliance requirements, “use HTTPS.” Checklists have their place, but they do not produce secure code. They produce code that passes the checklist.

Security-first thinking is different. It is a way of looking at code that asks one question at every boundary: who or what is being trusted here, and has that trust been earned?

Every class of vulnerability — SQL injection, XSS, CSRF, IDOR, broken authentication, insecure deserialization — reduces to a single failure: untrusted data crossing a trust boundary with the authority of trusted data.

SQL injection happens when user input is trusted to be SQL-safe before it reaches the database. XSS happens when user-generated content is trusted to be display-safe before it enters the DOM. IDOR happens when a request parameter is trusted to represent a resource the current user is allowed to access.

Once you see security through this lens — trust boundaries and their enforcement — you stop asking “did I remember the checklist item?” and start asking “where are the trust boundaries in this code, and what enforces them?” That question applies to every system, every language, every framework. It does not go stale.

This is what I mean by security-first thinking as a mindset rather than a procedure. It is not about knowing rules. It is about developing the habit of seeing trust assumptions in code.

The Eight Principles

When I worked through this with Luna recently — building a security knowledge base from 50 cybersecurity books — we distilled the trust boundary framework down to eight irreducible properties that AI-generated code must satisfy.

I want to walk through each one because they are not independent rules. They are facets of the same underlying principle.

1. Input is untrusted by default. Every value arriving from outside your process — HTTP body, query parameters, headers, cookies, file uploads — is hostile until validated. Client-side validation is UX. Server-side validation is security. They cannot substitute for each other.

2. Output is encoded for its context. A string displayed in HTML needs HTML encoding. A value interpolated into SQL needs parameterization. A value passed to a shell command needs to go through an argument array, not string concatenation. The encoding is not about the string itself — it is about the context it will be interpreted in.

3. Secrets never live in code. Passwords, API keys, tokens, certificates — these belong in environment variables or a secrets manager, never in source files. This is absolute. Git history is permanent.

4. Authentication is verified, not assumed. Every protected route, every protected function, checks identity on that request. Being logged in at some prior point does not carry forward.

5. Authorization is checked at the data layer. Being authenticated is not the same as being authorized to access a specific resource. Ownership checks belong in the query itself, not as a post-fetch condition.

6. Least privilege is the default. Database connections, service accounts, IAM roles, file operations — each gets only the access its specific function requires. Nothing more. A compromised least-privilege component fails safely. A compromised over-privileged one does not.

7. Cryptography is never homegrown. Cipher and hash choices go to battle-tested libraries: bcrypt or Argon2 for passwords, AES-GCM for encryption, HMAC-SHA256 for message authentication. The failure modes of hand-rolled crypto are subtle and catastrophic.

8. Dependencies are trust extensions. Every package you add is code you are trusting. Its transitive dependencies are code you are trusting. Supply chain attacks are real. Lock files, audits, and version pinning are not paranoia — they are basic hygiene.

These eight properties are not a checklist to run through at the end. They are filters to apply while generating code. The question during every code-writing session is: which of these are relevant to what I am building right now?
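Principle 2 is the easiest one to show in code, because the same untrusted string needs different handling in each context. A sketch using only the Python standard library, with function names and the users table invented for the example:

```python
import html
import sqlite3
import subprocess
import sys


def render_comment(comment: str) -> str:
    """HTML context: escape markup characters before embedding in a page."""
    return f"<p>{html.escape(comment)}</p>"


def find_user_ids(conn: sqlite3.Connection, name: str) -> list:
    """SQL context: pass the value as a bound parameter, never via string formatting."""
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


def echo_argument(value: str) -> str:
    """Shell context: use an argument list so no shell ever parses the value."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")
```

In each function the input is unchanged. What changes is how it is handed to the interpreter on the other side of the boundary.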

Baking It Into Your AI Workflow

Knowing principles is not the same as applying them consistently. The reason I started building infrastructure around this is that consistency requires more than intention — it requires systems.

Here is what I built, and it came directly from the work of turning 50 cybersecurity books into a searchable knowledge base.

Tier 1 topic files. Each of the eight principles above now lives in a concise reference file — 8 to 12KB each — distilled from the source material. input-validation.md, secrets-management.md, authentication.md, authorization.md, cryptography.md, and the others. Each file is dense but scannable: concrete patterns, code examples, anti-patterns, quick checklists.

Engineer PREFERENCES. I wired those topic files into the Engineer agent through a PREFERENCES file. The trigger table maps code categories to topic files: anything involving SQL gets xss-injection-prevention.md loaded. Anything involving authentication loads authentication.md. Anything involving secrets loads secrets-management.md. The agent loads the relevant files silently before writing code and runs the associated checklist before declaring work done.

The result is that security context is present in every relevant code-generation session without me having to ask for it. The principles are not something I have to remember to invoke. They are part of the workflow infrastructure.

This is what I keep coming back to with PAI: the value is not in any single AI interaction. It is in building infrastructure that makes the right thing the default thing. Security-first code used to require conscious effort and specialized knowledge in the moment. Now it is loaded context — always present, reliably applied.

Security and AI Are Not in Tension

I want to end on this because it is easy to come away from a security conversation feeling like the message is “AI is dangerous, slow down, be careful.” That is not the message.

AI writing code fast is a genuine capability improvement. Being able to go from specification to working implementation in minutes changes what is possible for a small team or a solo developer. That is real.

Security-first thinking is not a constraint on that capability. It is the thing that makes the output of that capability trustworthy. Code that moves fast and ships vulnerable systems is not actually faster in any meaningful sense — it is accumulating debt that will be paid with interest.

The combination is the point. AI that knows how to think about trust boundaries, that loads security context automatically, that applies principles rather than just patterns — that is a force multiplier, not a liability. You get the speed and you get the rigor.

Building toward that combination is one of the more interesting engineering problems I have worked on. The knowledge base, the topic files, the wiring into the workflow — it is all aimed at the same thing: making security-first thinking something that happens automatically, not something that requires a specialist in the room.

The specialist knowledge exists. We built it into 77KB of reference material drawn from 50 books. Now it is always in the room.

Part of the PAI series on building infrastructure that makes AI more useful, more reliable, and more trustworthy. The security knowledge base and topic files referenced in this post were built using Claude Code, PostgreSQL, pgvector, and source material from O’Reilly’s security catalog.

I Turned 50 Cybersecurity Books Into a Searchable Brain

Sat, 21 Mar 2026 00:00:00 +0000

The Problem With Security Books

I have a lot of cybersecurity books. PDFs from Humble Bundles, O’Reilly downloads, books I’ve bought and never finished, reference material I collected “just in case.” Like most people’s collections, mine lived in a folder I rarely opened.

The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I’d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.

That’s not a knowledge base. That’s a graveyard.

So I built something better: a semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages, with the book and chapter, that answer them. The database and the search itself run locally on my machine.

Here’s how I built it, and why it’s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.

What Semantic Search Actually Means

Traditional search is keyword matching. You type “SQL injection” and it finds documents containing those exact words.

Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent meaning in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask “how to bypass database input validation” and you’ll surface the same SQL injection content, even though you never typed “SQL injection.”

This matters enormously for a security knowledge base. Security concepts have dozens of names. “Privilege escalation,” “privesc,” “root access,” “vertical privilege abuse”: these overlap heavily or mean the same thing. Semantic search surfaces passages about all of them, whichever term you type.
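The "similar concepts cluster together" idea comes down to comparing vectors by angle. A toy sketch with made-up three-dimensional vectors (real embeddings from text-embedding-3-small have 1536 dimensions, and the numbers below are invented for illustration):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: list[float], docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every document against the query and sort best-first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy vectors standing in for real embeddings.
TOY_DOCS = {
    "sql injection chapter": [0.8, 0.2, 0.0],
    "privilege escalation chapter": [0.1, 0.9, 0.1],
}
TOY_QUERY = [0.9, 0.1, 0.0]  # pretend embedding of "bypass database input validation"
```

Calling rank(TOY_QUERY, TOY_DOCS) puts the SQL injection chapter first even though the query shares no keywords with its label. pgvector does the same comparison inside the database.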

The Stack

PostgreSQL 17 — the database
pgvector 0.8.2 — vector similarity search extension for Postgres
OpenAI text-embedding-3-small — converts text chunks to 1536-dimensional vectors
CyberSecKB.ts — a custom Bun/TypeScript CLI I built to tie it all together

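Storage and search run locally. The only external calls are to OpenAI’s embedding API: each book chunk is embedded once at ingest time, and each search sends only the short query text to be embedded so it can be compared against the stored vectors.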

The Pipeline: From PDF to Searchable Knowledge

Step 1: Convert PDFs to Markdown

Raw PDFs are terrible for text processing. I convert everything to Markdown first using a pdf2md Python tool:

cd ~/projects/pdf-to-markdown
source venv/bin/activate

# Text-based PDFs (most books):
python pdf2md input/mybook.pdf

# Image-based or scanned PDFs (use OCR first):
ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
python pdf2md /tmp/ocr.pdf output/mybook.md

# Move to library:
mv output/mybook.md ~/projects/cybersecurity-library/books/

Step 2: Ingest into the Database

TOOL=~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts

# Single book with topics tagged:
bun $TOOL ingest \
 --file ~/projects/cybersecurity-library/books/mybook.md \
 --title "My Book Title" \
 --topics web,network,linux

# Or load everything at once:
bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/

The ingest process:

Reads the Markdown file
Splits it into ~800-token chunks, preserving chapter headings
Sends chunks to OpenAI’s embedding API in batches
Stores chunks + their vector embeddings in PostgreSQL
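The chunking step can be sketched roughly as follows. This is not the CyberSecKB.ts implementation (which is TypeScript). It is a Python illustration that assumes Markdown "#" headings mark chapters and approximates tokens by counting whitespace-separated words:

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_markdown(text: str, max_tokens: int = 800) -> list[dict]:
    """Split Markdown into chunks of at most ~max_tokens words, tagged with their heading."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []
    count = 0

    def flush() -> None:
        nonlocal buffer, count
        if buffer:
            chunks.append({"heading": heading, "text": "\n".join(buffer)})
        buffer, count = [], 0

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk under the previous heading first
            heading = match.group(1).strip()
            continue
        words = len(line.split())  # crude token estimate, an assumption of this sketch
        if words == 0:
            continue
        if buffer and count + words > max_tokens:
            flush()
        buffer.append(line)
        count += words
    flush()
    return chunks
```

Every chunk carries the heading it appeared under, which is what lets search results show "book → chapter". A single line longer than max_tokens still becomes one oversized chunk here, so a real tool would split within lines too.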

Step 3: Search

# Plain English query:
bun $TOOL search "how do attackers bypass WAF rules for SQL injection"

# Filter by topic:
bun $TOOL search "privilege escalation" --topics linux --limit 5

# Check what's in the KB:
bun $TOOL list
bun $TOOL stats

What It Looks Like in Practice

Here’s a real query. I asked:

bun $TOOL search "SQL injection bypass techniques" --limit 3

Result:

━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
The `;` metacharacter in a SQL statement is used similarly to how it's used
in command injection to combine multiple queries on the same line...
━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
If user input is used without prior validation, and it is concatenated
directly into a SQL query, a user can inject different data...
━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
Input taken from cookies, input forms, and URL variables is used to build
SQL statements that are passed back to the database...

Each result shows the similarity score, book title, chapter, and a preview. I can immediately tell which book to go deeper in.

Another query — privilege escalation:

bun $TOOL search "privilege escalation linux" --limit 3

━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
Most systems are built using the least privilege concept — users are
purposefully given the least privileges they need to perform their work...
━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
do proper checking of file creation in the upper filesystem area...
━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
On Linux, vertical escalation allows attackers to have root privileges
that enable them to modify systems and programs...

This is the power of the system: I asked about a concept, not a keyword, and got specific, sourced, actionable results from two different books.

The Current State of the KB

After the initial batch ingest:

50 books indexed
11,757 chunks stored and embedded
Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more

Some of what’s in there:

Practical Malware Analysis (620 chunks)
Cybersecurity Threats, Malware Trends and Strategies (552 chunks)
Cybersecurity Attack and Defense Strategies (460 chunks)
Security Chaos Engineering (387 chunks)
Hardware Hacking Handbook (378 chunks)
Modern Data Protection (338 chunks)

Why This Fits Into PAI

This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that amplifies what I can do with AI, rather than using AI one prompt at a time.

The Security KB is a perfect example. It’s not about asking ChatGPT “explain SQL injection.” It’s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.

When I’m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.

Building It With Claude Code

The entire CyberSecKB tool was built using Claude Code through PAI. The process:

Described what I wanted: ingest markdown books, chunk by section, embed with OpenAI, store in pgvector
Claude Code scaffolded the TypeScript CLI
We hit a few real-world issues along the way:
- The OpenAI project key needed embedding model access enabled separately
- Batch size of 2048 hit the 300k token/request limit — tuned down to 200
- The 1M tokens/minute rate limit required adding a 15-second delay between batches
- A SQL type error in the search function when no topics filter was passed

Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.
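The batch-size and delay numbers above follow from simple arithmetic. A sketch of that calculation, assuming every chunk is close to the ~800-token target (real chunks vary, so treat the results as estimates):

```python
import math


def plan_ingest(
    total_chunks: int, tokens_per_chunk: int, batch_size: int, delay_seconds: float
) -> dict:
    """Estimate request size and throughput for a batched embedding run."""
    tokens_per_request = batch_size * tokens_per_chunk
    batches = math.ceil(total_chunks / batch_size)
    # Upper bound: ignores the time each request itself takes.
    tokens_per_minute = tokens_per_request * (60 / delay_seconds)
    return {
        "tokens_per_request": tokens_per_request,
        "batches": batches,
        "tokens_per_minute": tokens_per_minute,
        "min_total_wait_seconds": (batches - 1) * delay_seconds,
    }
```

With 11,757 chunks, a batch size of 200 gives about 160,000 tokens per request, under the 300k limit, where 2048 would give about 1.6 million. A 15-second delay caps throughput at roughly 640,000 tokens per minute, under the 1M limit, across 59 batches.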

What’s Next

A few things I want to add:

Tag all books with proper topics — the batch ingest skipped topic assignment; I’ll tag each book so --topics web or --topics linux filters actually work
Tier 1 topic files — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context
AI Security KB integration — the AI Security research KB shares the same database; queries cross both domains automatically

The knowledge base is live. The friction is gone. Now the books actually get used.

Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. Storage and search run locally; the only external calls are to the embedding API, for book chunks at ingest time and for the query text at search time.

Building an AI Conference Directory That Populates Itself

Sat, 14 Mar 2026 00:00:00 +0000

The Problem: AI Conferences Are Everywhere and Nowhere

If you’ve ever tried to find a comprehensive list of upcoming AI conferences, you know the pain. There’s no single source. AAAI has their page. NeurIPS has theirs. ICML posts deadlines on OpenReview. Half the emerging summits only exist on LinkedIn event pages or buried in Reddit threads.

I wanted a simple, searchable directory of AI conferences — one site where I could see what’s coming up, filter by topic, and get the key details. But I didn’t want to manually curate it. I’ve seen too many “awesome lists” on GitHub that are lovingly maintained for three months and then abandoned.

What I wanted was a system that populates itself.

So I built one. And with Claude Code running through my PAI system, the whole pipeline — from search to database to website — came together over a few focused sessions.

Here’s the full story.

The Architecture: Three Layers, Zero Manual Data Entry

The final system has three layers, each handling a distinct responsibility:

SearXNG (search engine)
→ conference_tracker.py (discovery)
→ Airtable (database)
→ fetch-events.mjs (build-time fetch)
→ React + Vite site on Netlify

Each layer is independently useful, loosely coupled, and replaceable. Let’s walk through them.

Layer 1: The Tracker — Finding Conferences Automatically

The foundation is a Python script called conference_tracker.py. Its job is simple: search the web for AI conferences and store what it finds.

Search: SearXNG Instead of Google

Rather than hitting the Google API (with its quotas and billing), I use SearXNG, an open-source, self-hosted meta-search engine. It aggregates results from Google, Bing, DuckDuckGo, and others without requiring API keys or per-query billing, though the upstream engines can still throttle a busy instance.

The tracker runs a curated list of search queries defined in config.yaml:

search_queries:
 - "AI conference 2026"
 - "artificial intelligence conference 2026"
 - "machine learning conference 2026"
 - "NeurIPS 2026"
 - "ICML 2026"
 - "AAAI 2026"
 - "AI summit 2026"
 - "deep learning conference 2026"
 - "computer vision conference 2026 CVPR"
 - "natural language processing conference 2026"

Each query returns up to 10 results. The tracker extracts the title, URL, and snippet from each result, deduplicates against what’s already in the database, and stores new finds.

Storage: Airtable as the Source of Truth

Why Airtable? Because it’s a real database with an API, but it also has a spreadsheet-like UI for manual review. When you’re building a pipeline that discovers data automatically, you want a way to eyeball the results and clean up noise — and Airtable is perfect for that.

The tracker writes five fields per record: title, websiteUrl, description, Source Query, and Date Found. That’s it. Just the raw discovery data. The structured details come later.

The deduplication is URL-based — normalized and lowercased. If we’ve already stored neurips.cc/2026, we don’t store it again even if it appears in a different search query.

from datetime import datetime, timezone

def extract_conference_info(result, source_query):
    # Truncate free-text fields so oversized search snippets stay manageable.
    return {
        "title": result["title"][:200],
        "websiteUrl": result["url"],
        "description": result["snippet"][:1000],
        "Source Query": source_query,
        "Date Found": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
    }
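The URL-based deduplication described above might look something like this. The post only says URLs are normalized and lowercased, so the specific rules here (dropping the scheme, a leading "www.", trailing slashes, and fragments) are assumptions, and the function names are hypothetical:

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a lowercase host + path key for duplicate detection."""
    url = url.strip().lower()
    if "://" not in url:
        url = "https://" + url  # lets bare "neurips.cc/2026" parse like a full URL
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query  # keep the query so distinct pages stay distinct
    return key


def filter_new(results: list[dict], known_urls: list[str]) -> list[dict]:
    """Return only results whose normalized URL has not been seen before."""
    seen = {normalize_url(u) for u in known_urls}
    fresh = []
    for result in results:
        key = normalize_url(result["url"])
        if key in seen:
            continue
        seen.add(key)
        fresh.append(result)
    return fresh
```

One trade-off: lowercasing the whole URL treats paths as case-insensitive, which can occasionally merge two genuinely different pages. For conference homepages that seems an acceptable risk.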

After one run, we had 87 unique conference records. The real stuff — NeurIPS, ICML, CVPR, AAAI — alongside smaller but interesting events like the Quantum AI and NLP Conference, Deep Learning Indaba, and the Wharton Human-AI Research summit.

Layer 2: The Website — React + Vite on Netlify

The directory itself is a React app built with Vite and deployed on Netlify. It’s a single-page app with search, tag filtering, and individual event pages.

The key architectural decision: data is fetched at build time, not runtime. A prebuild script (fetch-events.mjs) pulls conference data from the database and writes it to a data.ts file that Vite bundles into the site. This means:

No API keys exposed in the browser
No CORS issues
Instant page loads (data is already in the bundle)
The site works even if Airtable is temporarily down

The prebuild hook in package.json makes this automatic:

{
  "scripts": {
    "fetch-events": "bun scripts/fetch-events.mjs",
    "prebuild": "bun scripts/fetch-events.mjs",
    "build": "vite build"
  }
}

Every time Netlify builds the site, it automatically fetches the latest data from Airtable. Fresh data on every deploy.

The Middleman Problem: Cutting Google Sheets

Here’s where the story gets interesting.

The original pipeline had an extra step: Airtable → Google Sheets → website. The fetch-events.mjs script was pulling from a published Google Sheet CSV. Why? Because when I first prototyped the site, I started with a spreadsheet. It was quick and easy.

But once the conference tracker was writing directly to Airtable, Google Sheets became a middleman with no purpose. Data had to be synced from Airtable to Sheets (manually or via Zapier), and that sync was another thing that could break.

The fix was straightforward: teach fetch-events.mjs to talk directly to the Airtable API.

Airtable’s REST API

The Airtable API is clean. A single GET request returns records as JSON:

const url = new URL(`https://api.airtable.com/v0/${baseId}/${tableId}`);
const resp = await fetch(url.toString(), {
  headers: { Authorization: `Bearer ${pat}` },
});
const data = await resp.json();
// data.records = [{ id, fields: { title, date, ... } }]

The one gotcha: Airtable paginates at 100 records. You need to follow the offset token:

async function fetchFromAirtable(pat, baseId, tableId) {
  const allRecords = [];
  let offset = null;

  do {
    const url = new URL(`https://api.airtable.com/v0/${baseId}/${tableId}`);
    if (offset) url.searchParams.set('offset', offset);

    const resp = await fetch(url.toString(), {
      headers: { Authorization: `Bearer ${pat}` },
    });
    // Fail loudly on HTTP errors so the caller can catch and fall back,
    // instead of spreading an undefined data.records.
    if (!resp.ok) throw new Error(`Airtable request failed: ${resp.status}`);
    const data = await resp.json();
    allRecords.push(...data.records);
    offset = data.offset || null;
  } while (offset);

  return allRecords;
}

Graceful Fallback

I kept the Google Sheets path as a fallback. The main() function uses a priority chain:

Airtable — if AIRTABLE_PAT, AIRTABLE_BASE_ID, AIRTABLE_TABLE_ID are set
Google Sheets — if GOOGLE_SHEET_CSV_URL is set
Fallback events — hardcoded sample data so the build never fails

This means you can’t break the site by misconfiguring a data source. The build always succeeds.
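The priority chain itself is only a few lines of logic. The real script is JavaScript; this is a Python sketch of the same decision, using the environment variable names listed above. The function name and return labels are hypothetical:

```python
import os
from typing import Mapping

AIRTABLE_KEYS = ("AIRTABLE_PAT", "AIRTABLE_BASE_ID", "AIRTABLE_TABLE_ID")


def choose_data_source(env: Mapping[str, str] = os.environ) -> str:
    """Pick the first fully configured data source, falling back to sample data."""
    # Airtable needs all three settings; a partial configuration does not count.
    if all(env.get(key) for key in AIRTABLE_KEYS):
        return "airtable"
    if env.get("GOOGLE_SHEET_CSV_URL"):
        return "google_sheets"
    return "fallback"
```

One design note: because the build never fails, a misconfigured deploy can quietly ship the sample data. Logging which source was chosen in the build output makes that easy to spot.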

Layer 3: The Enrichment — AI-Powered Data Extraction

This is where things got really interesting.

After cutting Google Sheets, I had 87 conference records in Airtable. But they only had three useful fields: title, description, and URL. No dates. No locations. No tags. The site worked, but every event card was sparse — no way to filter by date or location, no tags to browse by topic.

Filling in 87 records by hand? No thanks.

The Idea: Visit Each URL and Ask AI to Extract the Data

The approach: for each conference record, fetch its web page, extract the text content, and use AI inference to pull out structured fields like date, location, organizer, and tags.

I built an enrichment script — enrich_conferences.py — that sits alongside the tracker in the same project.

Step 1: Fetch and Clean the Page

Each conference URL gets fetched with requests, then cleaned with BeautifulSoup. Navigation, footers, scripts, and styling get stripped, leaving just the text content:

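import requests
from bs4 import BeautifulSoup

# Assumed value: the post does not show how headers is defined.
headers = {"User-Agent": "conference-enricher/0.1"}

def fetch_page_text(url, timeout=15):
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # surface HTTP errors such as 403 to the caller
    soup = BeautifulSoup(resp.text, "html.parser")

    # Drop non-content elements before extracting text.
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)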

Step 2: AI Extraction via PAI Inference

The cleaned text gets sent to Claude (via PAI’s Inference tool) with a structured extraction prompt. The prompt is specific about what to extract and what format to use:

Given text from a conference web page, extract these fields as JSON:
{
"date": "human-readable date like 'May 5-6, 2026'",
"endDate": "ISO end date like '2026-05-06'",
"location": "City, State/Country",
"venue": "venue name",
"price": "ticket price or 'Free'",
"organizer": "organizing body",
"tags": "comma-separated topic tags (max 4)"
}

One critical addition: if the page is a list of conferences (like “Top 10 AI Conferences of 2026”), the AI returns {"is_list_page": true} and the script skips it. This was essential: about 14% of our URLs (12 of 87) were aggregator pages, not individual conference pages.

Step 3: Write Back to Airtable

Non-empty extracted fields get PATCHed back to Airtable. The script only writes fields that actually exist in the table schema — a lesson learned the hard way when venue and imageUrl threw 422 errors because those columns hadn’t been created yet.

def build_patch_fields(extracted, allowed_fields):
    # Aggregator pages carry no single event's data, so skip them entirely.
    if extracted.get("is_list_page"):
        return None
    patch = {}
    for key in ["date", "endDate", "location", "venue", "price", "organizer", "tags"]:
        # Only write columns that exist in the Airtable schema.
        if key not in allowed_fields:
            continue
        val = extracted.get(key, "")
        if isinstance(val, str) and val.strip():
            patch[key] = val.strip()
    return patch if patch else None

The Results

Running the enrichment script across all 87 records:

Outcome	Count
Records enriched	48
List/aggregator pages (correctly skipped)	12
No extractable fields (social media, OpenReview, etc.)	11
Errors (timeouts, HTTP 403s)	16

After enrichment:

Field	Records populated
Date	42
Location	41
Tags	47
Organizer	27
Price	4

From zero structured data to a directory where most events have dates, locations, and topic tags — without opening a single conference website manually.

Some highlights from the extraction:

NeurIPS 2026: December 6-12, Sydney, Australia — Deep Learning, Research, Algorithms, LLMs
CVPR 2026: June 3-7, Denver, CO — Computer Vision, Deep Learning, Research
ICML 2026: July 6-11, Seoul, South Korea — LLMs, Computer Vision, NLP, Robotics
AI Council 2026: May 12-14, San Francisco, CA — Generative AI, ML Ops, AI Safety
MIDL 2026: July 8-10, Taipei — Deep Learning, Healthcare AI, Computer Vision

The Pipeline Today

Here’s what the full system looks like now:

SearXNG (self-hosted search)
→ conference_tracker.py (Python: discovers conferences)
→ Airtable (source of truth: 87 records)
→ enrich_conferences.py (Python: AI-powered field extraction)
→ Airtable (now with dates, locations, tags)
→ fetch-events.mjs (JavaScript run with Bun: build-time data fetch)
→ data.ts (bundled into the site)
→ React + Vite app on Netlify

The tracker discovers. The enricher structures. The fetcher delivers. The site displays. Each piece runs independently and can be re-run at any time.

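The enrichment script is safe to re-run: it only processes records where the date field is empty, so records that already have a date are left alone. New and previously failed records get picked up, along with any record that was skipped as a list page or enriched without a date.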

What I’d Do Differently (And What’s Next)

The Timeout Problem

The 16 errored records are a mix of HTTP 403s and hits on the 25-second inference timeout. The fast tier (Haiku) is quick but occasionally chokes on pages with dense, complex content. A retry mechanism using the standard tier (Sonnet) for the timed-out records would likely catch most of them.

Missing Table Columns

The venue and imageUrl fields don’t exist in the Airtable table yet. The enrichment script extracts venue names beautifully (The Venetian for Ai4, COEX Convention Center for ICML, Dongguk University for AAAI Summer), but the data gets dropped because the columns aren’t there. A quick table schema update in the Airtable UI fixes this.

Scheduled Runs

Right now, both the tracker and enricher are manual. The natural next step is scheduling — run the tracker daily to discover new conferences, the enricher on new records, and trigger a Netlify deploy afterward. The Netlify build hook is already configured; it just needs a cron job or GitHub Action to call it.

Data Quality

Some records are noise — Reddit discussion threads, Amazon Science blog posts, Twitter/X profiles. A quality filter (either rule-based on URL patterns or AI-powered) would clean the dataset before enrichment runs.
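The rule-based variant of that filter could start as small as this. The host list and path rule below are illustrative guesses based on the noise sources named above, not a tested blocklist:

```python
from urllib.parse import urlsplit

# Illustrative rules only; extend after reviewing real tracker output.
NOISE_HOSTS = {"reddit.com", "twitter.com", "x.com"}
NOISE_PATH_PREFIXES = {"amazon.science": "/blog"}


def is_probable_noise(url: str) -> bool:
    """Return True when a URL matches a known non-conference pattern."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    # Match the host itself or any subdomain of it (old.reddit.com, etc.).
    if any(host == noisy or host.endswith("." + noisy) for noisy in NOISE_HOSTS):
        return True
    prefix = NOISE_PATH_PREFIXES.get(host)
    return prefix is not None and parts.path.startswith(prefix)
```

Running this before enrichment would avoid spending inference calls on pages that can never yield a date or location.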

Lessons Learned

1. Eliminate Middlemen Early

Google Sheets added zero value once Airtable was in the picture. But it lingered because it was the “original” approach. Every extra hop in a pipeline is a thing that can break, a thing that needs syncing, and a thing that slows you down. Cut it.

2. Build-Time Data Fetching Is Underrated

Pulling data at build time instead of runtime means no API keys in the browser, no loading spinners, and no CORS headaches. For data that changes daily (not per-second), this is the right architecture.

3. AI Extraction Beats Manual Curation

Using AI to extract structured data from unstructured web pages isn’t perfect — we got 48 out of 87 records enriched, not 87 out of 87. But it took 20 minutes of runtime versus what would have been hours of manual work. And the script is re-runnable. Improvement is incremental.

4. Detect Your Data’s Shape Before Writing

The Airtable 422 errors on venue were entirely preventable. The enrichment script now probes the table schema at startup and only writes to fields that exist. Defensive coding at system boundaries saves debugging time.
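
The filtering half of that idea fits in a few lines. This sketch assumes the set of existing field names has already been fetched at startup (how the script probes the schema is not shown here), and `filter_to_existing_fields` is a hypothetical name.

```python
def filter_to_existing_fields(
    updates: dict, table_fields: set[str]
) -> tuple[dict, list[str]]:
    """Split an update into writable fields and the names that had to be dropped."""
    writable = {k: v for k, v in updates.items() if k in table_fields}
    dropped = sorted(k for k in updates if k not in table_fields)
    return writable, dropped
```

Returning the dropped names as well makes it easy to log what was skipped, which is how a missing venue column gets noticed instead of silently losing data.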

5. List Page Detection Is Essential for Web Scraping Pipelines

When you’re scraping URLs from search results, a significant percentage will be aggregator pages (“Top 10 Best AI Conferences”) rather than individual event pages. If you don’t detect and skip these, you’ll corrupt your dataset with merged data from multiple events. The is_list_page flag in the AI extraction prompt was one of the highest-value additions to the whole pipeline.
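
On the consuming side, the flag only helps if the script actually honors it. A minimal sketch, assuming the model is asked to reply with a JSON object that includes an `is_list_page` boolean (the exact response format is an assumption, and `parse_extraction` is a hypothetical helper):

```python
import json
from typing import Optional


def parse_extraction(raw_response: str) -> Optional[dict]:
    """Return extracted fields, or None for aggregator pages and unparseable replies."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("is_list_page"):
        return None  # skip list pages so events are never merged together
    data.pop("is_list_page", None)  # the flag is control data, not a table field
    return data
```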

The Bigger Picture

This project is a miniature version of a pattern I keep coming back to: systems that compound.

The tracker runs once and discovers 87 conferences. The enricher runs once and structures 48 of them. The next time the tracker runs, it discovers only new conferences (deduplication handles the rest). The next time the enricher runs, it only processes records it hasn’t touched yet.

Every run makes the dataset better without redoing previous work. That’s the whole point of building infrastructure instead of doing things manually — you invest upfront so the system improves over time with minimal additional effort.

Working with Claude through PAI made each layer come together faster than I expected. The tracker, the Airtable integration, the Google Sheets elimination, the enrichment script — each was a focused session where the AI handled the implementation details while I focused on architecture decisions.

That’s the augmented part of Augmented Resilience. Not replacing the thinking — amplifying it.

When Your PDF Workflow Breaks - Building a Markdown Converter with Claude Code

Wed, 18 Feb 2026 00:00:00 +0000

The Problem: PDFs Are Knowledge Prisons

You know that feeling when you download a brilliant research paper, only to realize you can’t easily feed it into your AI workflow? Or when you want to add documentation to your knowledge base, but it’s locked in a format that doesn’t play well with version control or LLM tools?

Yeah, I was there last week.

I had just downloaded a fascinating 1.3MB research paper on Generative Engine Optimization and wanted to process it with my AI tools. But PDFs are terrible for this. They’re designed for printing, not for processing. What I needed was Markdown—clean, portable, AI-friendly Markdown.

So I built a converter. And with Claude Code as my copilot through the PAI (Personal AI Infrastructure) system, the whole thing took less than 30 minutes.

Here’s how it went down.

Why Markdown is Better Than PDF for LLMs

Before diving into the build, let’s answer the obvious question: why bother converting? Can’t LLMs just read PDFs directly?

Technically, yes. But the results are significantly worse, and the reasons are fundamental to how PDFs work.

PDFs Are Layout-First, Not Structure-First

PDFs were designed to describe where things appear on a page, not what they mean. As Steven Howard explains in Why PDFs Fail Under LLM Parsing:

“Table cells with wrapped text insert hard line breaks that fragment token continuity and break logical row recognition. Headers and footers simply add noise to the context when used with LLMs. Sentences are split with arbitrary CR/LFs making it very difficult to find paragraph boundaries.”

This architectural mismatch — a format designed for printing being fed into a system designed for understanding — causes cascading problems downstream.

The Token Efficiency Problem

Every token your LLM processes costs money and consumes context window space. PDF extraction wastes both.

According to analysis from MarkdownConverters, Markdown uses up to 70% fewer tokens than extracted PDF text for the same content. The culprit: PDF extraction introduces formatting artifacts, metadata noise, headers/footers, and encoding remnants that all consume tokens without adding semantic value.

To put that in practical terms: a PDF that would use 10,000 tokens might only need 3,000 tokens when properly converted to Markdown. At scale, this compounds dramatically.
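
The arithmetic behind that example, as a tiny sketch (the function name is hypothetical, and 0.70 is the best-case "up to" figure quoted above, not a guarantee):

```python
def tokens_after_conversion(pdf_tokens: int, reduction: float = 0.70) -> int:
    """Estimate Markdown token count from a fractional reduction (0.70 = 70% fewer)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(pdf_tokens * (1.0 - reduction))
```

With the default reduction, `tokens_after_conversion(10_000)` gives 3,000, which is the figure used in the paragraph above.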

The RAG Performance Problem

If you’re building Retrieval Augmented Generation (RAG) systems — using documents as a knowledge base for AI — document format directly impacts answer quality.

The research here is compelling:

Academic validation: A 2024 paper on arXiv (Revolutionizing RAG with Enhanced PDF Structure Recognition) found that “the low accuracy of PDF parsing significantly impacts the effectiveness of professional knowledge-based QA.”
Industry validation: NVIDIA’s technical blog documents how their NeMo Retriever pipeline converts extracted content to Markdown specifically because it “preserves row/column relationships in an LLM-native format, significantly reducing numeric hallucination”, and reduces incorrect answers by 50%. (NVIDIA: Approaches to PDF Data Extraction for Information Retrieval)
Chunking quality: Analysis from Towards Data Science shows that Markdown’s heading structure (#, ##, ###) produces semantically meaningful chunks, while PDF-based chunking relies on arbitrary page breaks and heuristics.
Retrieval failure rates: Unstructured.io’s research on contextual chunking — tested across 5,563 question-answer pairs — showed an 84% reduction in retrieval failure rates when using structure-aware chunking (the kind Markdown enables natively).
Real-world outcomes: The 2025 Semrush AI Index, cited by Webex Developers Blog, found that 72% of top AI-indexed articles used Markdown or Markdown-like structures, achieving 34% higher retrieval accuracy across ChatGPT, Perplexity, and Gemini.
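
To make the chunking point concrete, here is a minimal sketch of heading-based chunking. It is my own illustration, not code from any of the sources above, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX headings: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks: list[dict] = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk unless it is the empty initial one.
            if current["heading"] is not None or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] is not None or current["lines"]:
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Each chunk carries its heading with it, so a retriever gets a labeled section instead of an arbitrary page-sized slice.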

The Bottom Line

Metric	Impact
Token reduction	Up to 70% fewer tokens vs PDF extraction
Incorrect answers in RAG	50% reduction (NVIDIA NeMo)
Retrieval failure rates	84% reduction (Unstructured.io)
Retrieval accuracy	34% higher (Semrush AI Index 2025)

Markdown isn’t just more convenient — it’s meaningfully better for AI. Converting your document libraries is one of the highest-ROI steps you can take before building any LLM-powered workflow.

The First Failure: When Bleeding-Edge Python Bites Back

I’m running Python 3.14.2—the latest release, barely a few weeks old. Modern, shiny, cutting-edge. Perfect, right?

Not quite.

My first instinct was to use marker-pdf, a high-performance converter optimized for scientific papers and books. It looked perfect on paper (pun intended). But when I tried to install it:

Building wheel for Pillow (pyproject.toml): finished with status 'error'

Ugh.

Turns out, marker-pdf depends on Pillow (the Python imaging library), and Pillow hasn’t built binary wheels for Python 3.14 yet. I could have downgraded Python. I could have fought with source compilation. But why?

This is where working with Claude Code really shines. Instead of going down a rabbit hole trying to force marker-pdf to work, Claude suggested pivoting to PyMuPDF4LLM—a mature, actively maintained library specifically designed for AI/LLM workflows.

And it just worked.

The Solution: PyMuPDF4LLM

PyMuPDF4LLM turned out to be exactly what I needed:

Works flawlessly with Python 3.14 (no compilation errors)
Fast and accurate conversion
Built specifically for feeding documents into LLMs
Clean, simple API
Actively maintained by the PyMuPDF team

The installation was literally:

pip install pymupdf4llm

Five seconds later, I was ready to go.

Building the Tool: First Principles Thinking

As someone new to the CLI world, I’ve been learning to think through project structure from first principles. Where should this live? How should it be organized?

With Claude’s guidance, I chose /Users/dsa/projects/pdf-to-markdown/ for a few key reasons:

Separation of Concerns: Tool projects should be separate from my main workspace
Discoverability: Clear, descriptive naming means I’ll find it again in 6 months
Reusability: This structure works both as a CLI tool AND as a library I could import later

The project structure ended up simple but complete:

pdf-to-markdown/
├── README.md # Documentation
├── venv/ # Isolated Python environment
├── input/ # Test PDFs
├── output/ # Generated markdown
├── pdf2md # CLI wrapper script
└── requirements.txt # Dependencies

The Code: A Simple but Powerful CLI

I wanted a tool I could actually use—something with a clean command-line interface that handles the common cases elegantly. Working with Claude through PAI, we created a Python script that does exactly that:

#!/usr/bin/env python3
"""
PDF to Markdown Converter
A simple CLI tool to convert PDF files to Markdown using PyMuPDF4LLM
"""

import sys
import os
from pathlib import Path
from typing import Optional

import pymupdf4llm
import pymupdf
from tqdm import tqdm


def convert_pdf_to_markdown(pdf_path: str, output_path: Optional[str] = None) -> str:
    """Convert a PDF file to Markdown format."""

    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")

    # Get page count for progress bar
    doc = pymupdf.open(pdf_path)
    page_count = doc.page_count
    doc.close()

    print(f"Converting: {pdf_path}")
    with tqdm(total=page_count, unit="page", desc="Processing", colour="blue") as bar:
        # to_markdown() converts the whole document in one call, so the bar
        # jumps to 100% when it returns rather than advancing page by page.
        md_text = pymupdf4llm.to_markdown(pdf_path, page_chunks=False)
        bar.n = page_count
        bar.refresh()

    # Default: same location and name as the PDF, with a .md suffix
    if output_path is None:
        output_path = Path(pdf_path).with_suffix('.md')

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(md_text)

    print(f"✓ Done: {output_path} ({len(md_text):,} characters)")
    return str(output_path)


def batch_convert(input_dir: str, output_dir: Optional[str] = None) -> None:
    """Convert all PDFs in a directory to Markdown."""
    input_path = Path(input_dir)
    if not input_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {input_dir}")

    pdfs = sorted(input_path.glob("*.pdf"))
    if not pdfs:
        print(f"No PDF files found in: {input_dir}")
        sys.exit(0)

    # Default output folder is a sibling "output/" next to the input folder
    if output_dir:
        output_dir = Path(output_dir)
    else:
        output_dir = input_path.parent / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    total = len(pdfs)
    succeeded = 0
    failed = 0

    print(f"\nBatch mode: {total} PDF(s) found in '{input_dir}'")
    print(f"Output folder: {output_dir}\n")

    for i, pdf_path in enumerate(pdfs, start=1):
        print(f"[{i}/{total}] {pdf_path.name}")
        output_path = output_dir / pdf_path.with_suffix('.md').name
        try:
            convert_pdf_to_markdown(str(pdf_path), str(output_path))
            succeeded += 1
        except Exception as e:
            # One bad PDF should not abort the rest of the batch
            print(f"  ✗ Failed: {e}")
            failed += 1
        print()

    print("─" * 40)
    print(f"Batch complete: {succeeded} converted, {failed} failed")
    print(f"Output folder: {output_dir}")


def main():
    """Main CLI entry point"""
    args = sys.argv[1:]

    if not args:
        print("Usage:")
        print("  pdf2md <input.pdf> [output.md]  # Convert a single PDF")
        print("  pdf2md --batch <folder/>  # Convert all PDFs in a folder")
        print("  pdf2md --batch <folder/> --output <out_folder/>  # Batch with custom output dir")
        print("\nExamples:")
        print("  pdf2md document.pdf  # Creates document.md")
        print("  pdf2md document.pdf custom.md  # Creates custom.md")
        print("  pdf2md --batch input/  # Converts all PDFs in input/")
        print("  pdf2md --batch ~/documents/pdfs/ --output ~/knowledge-base/docs/")
        sys.exit(1)

    if args[0] == "--batch":
        # Guard against a missing folder argument instead of raising IndexError
        if len(args) < 2:
            print("Error: --batch requires a folder argument")
            sys.exit(1)
        input_dir = args[1]
        output_dir = None
        if "--output" in args:
            idx = args.index("--output")
            if idx + 1 >= len(args):
                print("Error: --output requires a folder argument")
                sys.exit(1)
            output_dir = args[idx + 1]
        batch_convert(input_dir, output_dir)
    else:
        pdf_path = args[0]
        output_path = args[1] if len(args) > 1 else None
        convert_pdf_to_markdown(pdf_path, output_path)


if __name__ == "__main__":
    main()

What I love about this code:

Smart defaults: If you don’t specify an output path, it just replaces .pdf with .md
Progress bars: tqdm gives you a blue progress bar with page count
Batch mode: --batch processes an entire folder at once, with optional --output target
Helpful errors: Clear messages when things go wrong
Flexible usage: Works with relative paths, absolute paths, custom output names

Make it executable:

chmod +x pdf2md

And now it’s a proper command-line tool.

The Moment of Truth: Testing with Real Data

Theory is great. But does it actually work?

I grabbed that 1.3MB research paper on Generative Engine Optimization and ran:

python pdf2md input/test.pdf output/test.md

The output:

Converting: input/test.pdf
Processing: 100%|████████████████| 12/12 [00:02<00:00, 5.8 pages/s]
✓ Done: output/test.md (73,463 characters)

1.3MB PDF → 74KB of clean Markdown in seconds.

I opened the output file, and there it was—perfectly formatted markdown:

## **GEO: Generative Engine Optimization**

Pranjal Aggarwal [∗]
Indian Institute of Technology Delhi
New Delhi, India
pranjal2041@gmail.com

Ashwin Kalyan
Independent
Seattle, USA
asaavashwin@gmail.com
...

Headers, formatting, structure—all preserved. No manual cleanup needed.

Success.

What This Unlocks

Now that I have PDFs converting to Markdown reliably, a whole world of possibilities opens up:

AI Workflows

Feed research papers and documentation directly into Claude or other LLMs
Build RAG (Retrieval Augmented Generation) pipelines backed by your document library
Process technical documentation at scale without losing structure

Knowledge Management

Import PDFs into your Obsidian vault automatically
Version control document content (because it’s now plain text in git)
Full-text search across your entire converted document library

Automation Ideas

Watch folder that auto-converts any dropped PDFs
Batch process entire directories of reports, papers, or manuals
Feed converted markdown directly into a vector database
API wrapper to convert PDFs via HTTP requests

Lessons Learned (Especially for CLI Beginners)

1. Virtual Environments Are Non-Negotiable

Every Python project should live in its own virtual environment. Always:

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

This keeps dependencies isolated and projects reproducible.

2. Bleeding-Edge Isn’t Always Better

Python 3.14 is awesome, but running a brand-new interpreter means some dependencies have not caught up yet. Sometimes mature tooling (like PyMuPDF) that “just works” beats the library you picked first. Don’t be afraid to pivot when something doesn’t work.

3. Test With Real Data

I didn’t test with “hello.pdf” containing two sentences. I tested with a 1.3MB research paper. Real data reveals real issues (or in this case, confirms it works beautifully).

4. Document As You Build

Writing the README alongside the code made the project immediately understandable. Future-me will thank present-me.

5. Claude Code + PAI = Superpowers

Working with Claude through the PAI infrastructure meant I had a senior developer helping me think through:

Project structure (first principles)
Library selection (when to pivot)
Code organization (clean, maintainable)
Real-world usage patterns

This wasn’t just coding faster—it was learning better patterns while building.

Usage Examples

Basic Conversion

# Activate environment first (always!)
source venv/bin/activate

# Convert a PDF
python pdf2md document.pdf

# Custom output name
python pdf2md research.pdf my-notes.md

# Full paths
python pdf2md ~/Downloads/paper.pdf ~/Documents/notes.md

Batch Processing

Convert an entire folder of PDFs:

source venv/bin/activate

# Convert all PDFs in a folder (output goes to output/ by default)
python pdf2md --batch ~/documents/pdfs/

# Convert to a specific knowledge base directory
python pdf2md --batch ~/documents/pdfs/ --output ~/knowledge-base/docs/

Add to PATH (Optional)

To use pdf2md from anywhere:

# Add to ~/.zshrc
export PATH="/Users/dsa/projects/pdf-to-markdown:$PATH"

# Then run from anywhere
pdf2md ~/Downloads/paper.pdf ~/Documents/paper.md

What’s Next?

This tool works great as-is, but there are some exciting enhancements on the roadmap:

Immediate Improvements

Better layout analysis: Install pymupdf_layout for improved structure detection on complex documents
Recursive batch mode: Process nested folder structures, not just flat directories
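
Recursive batch mode is mostly a path-planning problem, so here is a sketch of just that part. It is not implemented in the tool yet; `plan_recursive_batch` is a hypothetical helper that mirrors the input folder structure under the output folder.

```python
from pathlib import Path


def plan_recursive_batch(input_dir: str, output_dir: str) -> list[tuple[Path, Path]]:
    """Pair every PDF under input_dir with a mirrored .md path under output_dir."""
    src_root = Path(input_dir)
    dst_root = Path(output_dir)
    pairs = []
    # rglob walks nested folders; glob("*.pdf") only sees the top level.
    for pdf in sorted(src_root.rglob("*.pdf")):
        relative = pdf.relative_to(src_root)
        pairs.append((pdf, dst_root / relative.with_suffix(".md")))
    return pairs
```

Each pair could then be handed to `convert_pdf_to_markdown` after creating the destination's parent folder with `mkdir(parents=True, exist_ok=True)`. Mirroring the folder structure also keeps two PDFs with the same name in different subfolders from overwriting each other.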

Future Integrations

RAG pipeline: Auto-feed converted markdown into a vector database
Obsidian plugin: Detect PDFs in vault and convert automatically
FastAPI wrapper: Create an HTTP API for web apps to use
Electron/Tauri app: Build a desktop GUI for non-technical users

The Bigger Picture: Why This Matters

This project is tiny—roughly 100 lines of Python, 30 minutes of work. But it represents something bigger:

The ability to build tools that solve your actual problems.

I had a workflow friction (PDFs don’t work well with AI tools). I built a solution. Now that friction is gone, and I can focus on higher-level work.

And the data is clear: converting your document library to Markdown isn’t a nice-to-have. It’s a multiplier on every AI workflow that follows. Up to 70% fewer tokens consumed. 84% fewer retrieval failures. 50% fewer incorrect answers. These aren’t marginal improvements—they’re transformational.

Working with Claude Code through PAI accelerated all of this. It’s like having a patient senior developer sitting next to you, suggesting better approaches, catching errors before they happen, and explaining why certain patterns work.

Resources

PyMuPDF4LLM Docs: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/
PyMuPDF GitHub: https://github.com/pymupdf/PyMuPDF

Citations: Markdown vs PDF for LLMs

Why PDFs Fail Under LLM Parsing — Steven Howard, Untethered AI: https://untetheredai.substack.com/p/why-pdfs-fail-under-llm-parsing
PDF vs Markdown for AI: Token Efficiency — MarkdownConverters: https://markdownconverters.com/blog/pdf-vs-markdown-ai-tokens
Revolutionizing RAG with Enhanced PDF Structure Recognition — arXiv:2401.12599 (2024): https://arxiv.org/abs/2401.12599
Approaches to PDF Data Extraction for Information Retrieval — NVIDIA Technical Blog: https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/
Improved RAG Document Processing With Markdown — Dr. Leon Eversberg, Towards Data Science: https://medium.com/data-science/improved-rag-document-processing-with-markdown-426a2e0dd82b
Contextual Chunking: Boost Your RAG Retrieval Accuracy — Unstructured.io: https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy
Boosting AI Performance: The Power of LLM-Friendly Content in Markdown — Webex Developers Blog: https://developer.webex.com/blog/boosting-ai-performance-the-power-of-llm-friendly-content-in-markdown

Happy converting!