<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm on</title><link>https://augmentedresilience.com/tags/llm/</link><description>Recent content in Llm on</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 31 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://augmentedresilience.com/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>Your AI Isn't Going Off the Rails. It Never Had Any.</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/your-ai-isnt-going-off-the-rails.-it-never-had-any/</link><pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/your-ai-isnt-going-off-the-rails.-it-never-had-any/</guid><description>&lt;h1 id="your-ai-isnt-going-off-the-rails-it-never-had-any">Your AI Isn&amp;rsquo;t Going Off the Rails. It Never Had Any.&lt;/h1>
&lt;p>&lt;img src="https://augmentedresilience.com/images/ar-cover-off-the-rails-v2.png" alt="Image Description">&lt;/p>
&lt;p>The most common complaint I hear from people using AI is some version of the same sentence: &amp;ldquo;It goes off the rails.&amp;rdquo; It drifts. It forgets what I told it. It invents things. It has no direction.&lt;/p>
&lt;p>I understand the frustration, but the phrasing hides the actual problem. Going off the rails implies there were rails to begin with. There weren&amp;rsquo;t. A model in a blank chat window has no memory of who you are, no rules about how it should behave, and no defined process for doing the work. It is not drifting away from a plan. There was never a plan for it to drift from.&lt;/p></description><content>&lt;h1 id="your-ai-isnt-going-off-the-rails-it-never-had-any">Your AI Isn&amp;rsquo;t Going Off the Rails. It Never Had Any.&lt;/h1>
&lt;p>&lt;img src="https://augmentedresilience.com/images/ar-cover-off-the-rails-v2.png" alt="Image Description">&lt;/p>
&lt;p>The most common complaint I hear from people using AI is some version of the same sentence: &amp;ldquo;It goes off the rails.&amp;rdquo; It drifts. It forgets what I told it. It invents things. It has no direction.&lt;/p>
&lt;p>I understand the frustration, but the phrasing hides the actual problem. Going off the rails implies there were rails to begin with. There weren&amp;rsquo;t. A model in a blank chat window has no memory of who you are, no rules about how it should behave, and no defined process for doing the work. It is not drifting away from a plan. There was never a plan for it to drift from.&lt;/p>
&lt;p>So when people tell me their AI lacks direction, what they are really describing is a missing system. The fix is almost never a better prompt. It is more structure. And the amount of structure you give the AI is the single biggest variable in whether it behaves like a reliable partner or a clever stranger who resets every morning.&lt;/p>
&lt;p>The clearest way I have found to explain this is as three tiers. Each one solves a bigger slice of the &amp;ldquo;no direction&amp;rdquo; problem than the last.&lt;/p>
&lt;hr>
&lt;h2 id="tier-1-claude-chat--the-conversation">Tier 1: Claude Chat — The Conversation&lt;/h2>
&lt;p>This is where almost everyone starts, and where most people stay. You open a chat window, you type, it responds. Each conversation is mostly a blank slate.&lt;/p>
&lt;p>The defining trait of this tier is amnesia. A new chat forgets everything. Whatever context you want the model to have, you provide manually, in the prompt, every single time. The direction comes entirely from you. The model cannot touch your files, run anything, or reach your systems. It talks, and that is all it does.&lt;/p>
&lt;p>This is genuinely useful. For a quick question, a brainstorm, a first draft, or thinking out loud, a chat window is fast and frictionless. But it is also exactly why it feels directionless on anything bigger. Nothing constrains it. There are no rules, no persistent goal beyond your last message. If your prompt is vague, the output is vague. It is a brilliant intern with total amnesia, and you are re-explaining the entire job every morning.&lt;/p>
&lt;p>People at this tier blame the model. The model is rarely the issue. The issue is that nothing is holding it on a track, because there is no track.&lt;/p>
&lt;hr>
&lt;h2 id="tier-2-claude-code--the-operator">Tier 2: Claude Code — The Operator&lt;/h2>
&lt;p>The second tier is a different kind of tool entirely. Claude Code is an agent that lives in your terminal. It does not just talk about work — it does the work. It reads and writes real files, runs commands, searches the web, and operates on your actual environment instead of an imagined one.&lt;/p>
&lt;p>Two things change the moment you move here.&lt;/p>
&lt;p>First, it gets a working memory of your project. A &lt;code>CLAUDE.md&lt;/code> file holds persistent instructions for that codebase or project, so the model arrives already knowing the conventions, the goals, and the rules you have written down. You stop re-explaining the project on every session.&lt;/p>
&lt;p>Second, and more importantly, it works in a loop: act, observe the result, correct. It writes a file and sees whether the change worked. It runs a command and reads the actual output. That feedback loop is what kills the drift. The model is not imagining what might happen — it is looking at what did happen and adjusting. Operating on real artifacts instead of guesses is most of the discipline.&lt;/p>
&lt;p>The honest limitation: this discipline is per-project and largely manual. You set up each project&amp;rsquo;s instructions yourself. The memory does not follow you from one project to the next, and there is no consistent persona or process spanning everything you do. It is a sharp, capable operator — but one you have to brief fresh for every new job.&lt;/p>
&lt;hr>
&lt;h2 id="tier-3-ai-infrastructure">Tier 3: AI Infrastructure&lt;/h2>
&lt;p>The third tier is the one that actually fixes &amp;ldquo;no direction&amp;rdquo; at the root, because it stops treating each session as a fresh start.&lt;/p>
&lt;p>AI Infrastructure is a system that wraps Claude Code and gives it a permanent identity, a rule set, a knowledge base, and a defined process. The jump from Tier 2 to Tier 3 is the difference between hiring a contractor and building an operations department. This post itself is a small example: it was written inside an AI Infrastructure that already knew my blog&amp;rsquo;s voice, my formatting conventions, and where the file should be saved, without my having to say any of it.&lt;/p>
&lt;p>Three things work together here, and they are what make drift structurally hard.&lt;/p>
&lt;p>The first is persistent identity and memory. The infrastructure does not forget who I am, what I work on, or how I want things done. When I correct it, the correction sticks across every future session, not just the current chat. The knowledge lives in files I own, not inside a conversation that disappears when I close the tab.&lt;/p>
&lt;p>The second is a defined process. Every non-trivial request gets classified and routed through a structured sequence: understand the request, plan the approach, do the work, verify it against explicit criteria. The model cannot freewheel, because a process governs the response before it starts. That is the literal opposite of going off the rails — the rails are built in.&lt;/p>
&lt;p>The third is context routing. Instead of me pasting the right background into every prompt, the system pulls the relevant knowledge automatically based on what I am doing. The model arrives oriented, every time.&lt;/p>
&lt;p>None of this makes the underlying model smarter. It makes the environment around the model disciplined. That is the whole trick.&lt;/p>
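&lt;p>The defined process described above (understand, plan, do, verify) can be sketched in a few lines of Python. This is only an illustration of the shape, not how any particular infrastructure implements it: &lt;code>Stages&lt;/code>, &lt;code>handle&lt;/code>, and the toy stage functions are hypothetical names, and a real system would call a model where the lambdas are.&lt;/p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Hypothetical stage functions; a real system would call a model here."""
    understand: Callable[[str], str]
    plan: Callable[[str], list[str]]
    do: Callable[[list[str]], str]


def handle(request: str, stages: Stages,
           criteria: dict[str, Callable[[str], bool]]) -> str:
    """Route one request through a fixed sequence and verify the result."""
    goal = stages.understand(request)   # 1. understand the request
    steps = stages.plan(goal)           # 2. plan the approach
    result = stages.do(steps)           # 3. do the work
    # 4. verify against explicit, named criteria
    failed = [name for name, check in criteria.items() if not check(result)]
    if failed:
        raise ValueError(f"verification failed: {failed}")
    return result


# Toy stages so the sketch runs end to end.
toy = Stages(
    understand=lambda text: text.strip().lower(),
    plan=lambda goal: [f"draft: {goal}", "review"],
    do=lambda steps: " | ".join(steps),
)
criteria = {"mentions review": lambda out: "review" in out}

print(handle("  Write a summary ", toy, criteria))
# prints: draft: write a summary | review
```

&lt;p>The point of the sketch is the order of operations: the criteria exist before any work starts, so a result that misses them is rejected instead of quietly returned.&lt;/p>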
&lt;hr>
&lt;h2 id="the-same-symptom-mapped-to-the-fix">The Same Symptom, Mapped to the Fix&lt;/h2>
&lt;p>When someone describes their AI as directionless, the specific complaint usually tells you exactly which tier they are stuck on and what would move them up.&lt;/p>
&lt;p>If it forgets what you told it, you are in a chat window and you need persistent instructions — that is the move to Claude Code.&lt;/p>
&lt;p>If it hallucinates instead of using your real data, you need to let it read your actual files — again, the move to an operator that touches your environment.&lt;/p>
&lt;p>If you keep re-explaining your preferences across projects, you have outgrown per-project memory and need persistent identity — the move to infrastructure.&lt;/p>
&lt;p>And if it has no consistent process from one task to the next, you need a defined algorithm governing how every request gets handled — the same move.&lt;/p>
&lt;p>Notice the pattern. Every one of these is solved by adding structure, not by writing a cleverer sentence into the prompt box.&lt;/p>
&lt;hr>
&lt;h2 id="the-real-shift">The Real Shift&lt;/h2>
&lt;p>Direction is not something you nag the AI for inside each prompt. It is something you build into the system once.&lt;/p>
&lt;p>That reframing is the entire jump from chatting to infrastructure. A chat window is equally capable on day one and day three hundred, because nothing accumulates. An operator gets more useful per project, as long as you keep briefing it. An infrastructure compounds — every rule you add, every preference it learns, every process you refine makes the next session start further ahead than the last.&lt;/p>
&lt;p>If your AI feels like it has no direction, it is not malfunctioning. It is doing exactly what an unstructured system does. The rails were never the model&amp;rsquo;s job to build. They are yours.&lt;/p></content></item><item><title>When Your PDF Workflow Breaks - Building a Markdown Converter with Claude Code</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/building-a-pdf-to-markdown-converter-with-claude-code/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/building-a-pdf-to-markdown-converter-with-claude-code/</guid><description>&lt;h2 id="the-problem-pdfs-are-knowledge-prisons">The Problem: PDFs Are Knowledge Prisons&lt;/h2>
&lt;p>You know that feeling when you download a brilliant research paper, only to realize you can&amp;rsquo;t easily feed it into your AI workflow? Or when you want to add documentation to your knowledge base, but it&amp;rsquo;s locked in a format that doesn&amp;rsquo;t play well with version control or LLM tools?&lt;/p>
&lt;p>Yeah, I was there last week.&lt;/p>
&lt;p>I had just downloaded a fascinating 1.3MB research paper on Generative Engine Optimization and wanted to process it with my AI tools. But PDFs are terrible for this. They&amp;rsquo;re designed for &lt;em>printing&lt;/em>, not for &lt;em>processing&lt;/em>. What I needed was Markdown—clean, portable, AI-friendly Markdown.&lt;/p></description><content>&lt;h2 id="the-problem-pdfs-are-knowledge-prisons">The Problem: PDFs Are Knowledge Prisons&lt;/h2>
&lt;p>You know that feeling when you download a brilliant research paper, only to realize you can&amp;rsquo;t easily feed it into your AI workflow? Or when you want to add documentation to your knowledge base, but it&amp;rsquo;s locked in a format that doesn&amp;rsquo;t play well with version control or LLM tools?&lt;/p>
&lt;p>Yeah, I was there last week.&lt;/p>
&lt;p>I had just downloaded a fascinating 1.3MB research paper on Generative Engine Optimization and wanted to process it with my AI tools. But PDFs are terrible for this. They&amp;rsquo;re designed for &lt;em>printing&lt;/em>, not for &lt;em>processing&lt;/em>. What I needed was Markdown—clean, portable, AI-friendly Markdown.&lt;/p>
&lt;p>So I built a converter. And with Claude Code as my copilot through the PAI (Personal AI Infrastructure) system, the whole thing took less than 30 minutes.&lt;/p>
&lt;p>Here&amp;rsquo;s how it went down.&lt;/p>
&lt;hr>
&lt;h2 id="why-markdown-is-better-than-pdf-for-llms">Why Markdown is Better Than PDF for LLMs&lt;/h2>
&lt;p>Before diving into the build, let&amp;rsquo;s answer the obvious question: &lt;em>why bother converting?&lt;/em> Can&amp;rsquo;t LLMs just read PDFs directly?&lt;/p>
&lt;p>Technically, yes. But the results are significantly worse, and the reasons are fundamental to how PDFs work.&lt;/p>
&lt;h3 id="pdfs-are-layout-first-not-structure-first">PDFs Are Layout-First, Not Structure-First&lt;/h3>
&lt;p>PDFs were designed to describe &lt;em>where things appear on a page&lt;/em>, not &lt;em>what they mean&lt;/em>. As Steven Howard explains in &lt;a href="https://untetheredai.substack.com/p/why-pdfs-fail-under-llm-parsing" target="_blank" rel="noopener noreferrer">Why PDFs Fail Under LLM Parsing&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Table cells with wrapped text insert hard line breaks that fragment token continuity and break logical row recognition. Headers and footers simply add noise to the context when used with LLMs. Sentences are split with arbitrary CR/LFs making it very difficult to find paragraph boundaries.&amp;rdquo;&lt;/p>&lt;/blockquote>
&lt;p>This architectural mismatch — a format designed for printing being fed into a system designed for understanding — causes cascading problems downstream.&lt;/p>
&lt;h3 id="the-token-efficiency-problem">The Token Efficiency Problem&lt;/h3>
&lt;p>Every token your LLM processes costs money and consumes context window space. PDF extraction wastes both.&lt;/p>
&lt;p>According to analysis from &lt;a href="https://markdownconverters.com/blog/pdf-vs-markdown-ai-tokens" target="_blank" rel="noopener noreferrer">MarkdownConverters&lt;/a>, &lt;strong>Markdown uses up to 70% fewer tokens than extracted PDF text&lt;/strong> for the same content. The culprit: PDF extraction introduces formatting artifacts, metadata noise, headers/footers, and encoding remnants that all consume tokens without adding semantic value.&lt;/p>
&lt;p>To put that in practical terms: at the upper end of that range, a PDF extraction that uses 10,000 tokens might need only 3,000 tokens once properly converted to Markdown. At scale, this compounds dramatically.&lt;/p>
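&lt;p>The arithmetic behind that example is simple enough to check in a few lines of Python. This is a rough sketch, not a tokenizer: it assumes the best-case 70% reduction quoted above, the function names are hypothetical, and the 500-document library is a made-up number used only to show how the savings scale.&lt;/p>

```python
def tokens_after_conversion(pdf_tokens: int, reduction: float) -> int:
    """Tokens left after removing `reduction` (0.0 to 1.0) of the PDF token count."""
    if reduction > 1.0 or 0.0 > reduction:
        raise ValueError("reduction must be between 0.0 and 1.0")
    return round(pdf_tokens * (1.0 - reduction))


def tokens_saved(pdf_tokens: int, reduction: float, documents: int = 1) -> int:
    """Total tokens saved across `documents` files of the same size."""
    per_document = pdf_tokens - tokens_after_conversion(pdf_tokens, reduction)
    return per_document * documents


print(tokens_after_conversion(10_000, 0.70))      # 3000
print(tokens_saved(10_000, 0.70, documents=500))  # 3500000
```

&lt;p>Because 70% is an upper bound, real documents will land somewhere below these figures. Measuring a few of your own files with your model's tokenizer is the only way to get a dependable number.&lt;/p>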
&lt;h3 id="the-rag-performance-problem">The RAG Performance Problem&lt;/h3>
&lt;p>If you&amp;rsquo;re building Retrieval Augmented Generation (RAG) systems — using documents as a knowledge base for AI — document format directly impacts answer quality.&lt;/p>
&lt;p>The research here is compelling:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Academic validation&lt;/strong>: A 2024 paper on arXiv (&lt;a href="https://arxiv.org/abs/2401.12599" target="_blank" rel="noopener noreferrer">Revolutionizing RAG with Enhanced PDF Structure Recognition&lt;/a>) found that &amp;ldquo;the low accuracy of PDF parsing significantly impacts the effectiveness of professional knowledge-based QA.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Industry validation&lt;/strong>: NVIDIA&amp;rsquo;s technical blog documents how their NeMo Retriever pipeline converts extracted content to Markdown specifically because it &amp;ldquo;preserves row/column relationships in an LLM-native format, significantly reducing numeric hallucination&amp;rdquo; and &lt;strong>reduces incorrect answers by 50%&lt;/strong>. (&lt;a href="https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/" target="_blank" rel="noopener noreferrer">NVIDIA: Approaches to PDF Data Extraction for Information Retrieval&lt;/a>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chunking quality&lt;/strong>: Analysis from &lt;a href="https://medium.com/data-science/improved-rag-document-processing-with-markdown-426a2e0dd82b" target="_blank" rel="noopener noreferrer">Towards Data Science&lt;/a>
shows that Markdown&amp;rsquo;s heading structure (&lt;code>#&lt;/code>, &lt;code>##&lt;/code>, &lt;code>###&lt;/code>) produces semantically meaningful chunks, while PDF-based chunking relies on arbitrary page breaks and heuristics.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Retrieval failure rates&lt;/strong>: Unstructured.io&amp;rsquo;s &lt;a href="https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy" target="_blank" rel="noopener noreferrer">research on contextual chunking&lt;/a>
— tested across 5,563 question-answer pairs — showed an &lt;strong>84% reduction in retrieval failure rates&lt;/strong> when using structure-aware chunking (the kind Markdown enables natively).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Real-world outcomes&lt;/strong>: The 2025 Semrush AI Index, cited by the &lt;a href="https://developer.webex.com/blog/boosting-ai-performance-the-power-of-llm-friendly-content-in-markdown" target="_blank" rel="noopener noreferrer">Webex Developers Blog&lt;/a>, found that 72% of top AI-indexed articles used Markdown or Markdown-like structures, achieving &lt;strong>34% higher retrieval accuracy&lt;/strong> across ChatGPT, Perplexity, and Gemini.&lt;/p>
&lt;/li>
&lt;/ul>
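&lt;p>The chunking point in the list above is easy to see in code. Here is a minimal Python sketch of heading-based chunking, assuming ATX-style headings (&lt;code>#&lt;/code> through &lt;code>######&lt;/code>). The name &lt;code>chunk_by_headings&lt;/code> is hypothetical, and this is not the implementation used by any of the tools cited here. It also does not skip &lt;code>#&lt;/code> lines inside fenced code blocks, which a production chunker would need to handle.&lt;/p>

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks = []
    heading, level, body = None, 0, []

    def flush():
        # Store the chunk collected so far, skipping an empty preamble.
        text = "\n".join(body).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, level, body = match.group(2).strip(), len(match.group(1)), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Title\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
for chunk in chunk_by_headings(sample):
    print(chunk["level"], chunk["heading"], "->", chunk["text"])
```

&lt;p>Each chunk carries its own heading, so a retriever knows what a passage is about without guessing from page position. Extracted PDF text has no equivalent marker to split on.&lt;/p>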
&lt;h3 id="the-bottom-line">The Bottom Line&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Impact&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Token reduction&lt;/td>
&lt;td>Up to 70% fewer tokens vs PDF extraction&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Incorrect answers in RAG&lt;/td>
&lt;td>50% reduction (NVIDIA NeMo)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Retrieval failure rates&lt;/td>
&lt;td>84% reduction (Unstructured.io)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Retrieval accuracy&lt;/td>
&lt;td>34% higher (Semrush AI Index 2025)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Markdown isn&amp;rsquo;t just more convenient — it&amp;rsquo;s meaningfully better for AI. Converting your document libraries is one of the highest-ROI steps you can take before building any LLM-powered workflow.&lt;/p>
&lt;hr>
&lt;h2 id="the-first-failure-when-bleeding-edge-python-bites-back">The First Failure: When Bleeding-Edge Python Bites Back&lt;/h2>
&lt;p>I&amp;rsquo;m running Python 3.14.2—the latest release, barely a few weeks old. Modern, shiny, cutting-edge. Perfect, right?&lt;/p>
&lt;p>Not quite.&lt;/p>
&lt;p>My first instinct was to use &lt;code>marker-pdf&lt;/code>, a high-performance converter optimized for scientific papers and books. It looked perfect on paper (pun intended). But when I tried to install it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Building wheel for Pillow (pyproject.toml): finished with status &amp;#39;error&amp;#39;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Ugh.&lt;/p>
&lt;p>Turns out, &lt;code>marker-pdf&lt;/code> depends on Pillow (the Python imaging library), and the Pillow version pip resolved had no prebuilt wheel for Python 3.14 yet, so pip tried to compile it from source and failed. I could have downgraded Python. I could have fought with source compilation. But why?&lt;/p>
&lt;p>&lt;strong>This is where working with Claude Code really shines.&lt;/strong> Instead of going down a rabbit hole trying to force marker-pdf to work, Claude suggested pivoting to &lt;strong>PyMuPDF4LLM&lt;/strong>—a mature, actively maintained library specifically designed for AI/LLM workflows.&lt;/p>
&lt;p>And it just worked.&lt;/p>
&lt;hr>
&lt;h2 id="the-solution-pymupdf4llm">The Solution: PyMuPDF4LLM&lt;/h2>
&lt;p>PyMuPDF4LLM turned out to be exactly what I needed:&lt;/p>
&lt;ul>
&lt;li>Works flawlessly with Python 3.14 (no compilation errors)&lt;/li>
&lt;li>Fast and accurate conversion&lt;/li>
&lt;li>Built specifically for feeding documents into LLMs&lt;/li>
&lt;li>Clean, simple API&lt;/li>
&lt;li>Actively maintained by the PyMuPDF team&lt;/li>
&lt;/ul>
&lt;p>The installation was literally:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pip install pymupdf4llm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Five seconds later, I was ready to go.&lt;/p>
&lt;hr>
&lt;h2 id="building-the-tool-first-principles-thinking">Building the Tool: First Principles Thinking&lt;/h2>
&lt;p>As someone new to the CLI world, I&amp;rsquo;ve been learning to think through project structure from first principles. Where should this live? How should it be organized?&lt;/p>
&lt;p>With Claude&amp;rsquo;s guidance, I chose &lt;code>/Users/dsa/projects/pdf-to-markdown/&lt;/code> for a few key reasons:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Separation of Concerns:&lt;/strong> Tool projects should be separate from my main workspace&lt;/li>
&lt;li>&lt;strong>Discoverability:&lt;/strong> Clear, descriptive naming means I&amp;rsquo;ll find it again in 6 months&lt;/li>
&lt;li>&lt;strong>Reusability:&lt;/strong> This structure works both as a CLI tool AND as a library I could import later&lt;/li>
&lt;/ol>
&lt;p>The project structure ended up simple but complete:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>pdf-to-markdown/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── README.md          # Documentation
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── venv/              # Isolated Python environment
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── input/             # Test PDFs
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── output/            # Generated markdown
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── pdf2md             # CLI wrapper script
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── requirements.txt   # Dependencies
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="the-code-a-simple-but-powerful-cli">The Code: A Simple but Powerful CLI&lt;/h2>
&lt;p>I wanted a tool I could actually use—something with a clean command-line interface that handles the common cases elegantly. Working with Claude through PAI, we created a Python script that does exactly that:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#!/usr/bin/env python3&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">PDF to Markdown Converter
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">A simple CLI tool to convert PDF files to Markdown using PyMuPDF4LLM
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> sys
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> os
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> pathlib &lt;span style="color:#f92672">import&lt;/span> Path
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> pymupdf4llm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> pymupdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> tqdm &lt;span style="color:#f92672">import&lt;/span> tqdm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">convert_pdf_to_markdown&lt;/span>(pdf_path: str, output_path: str &lt;span style="color:#f92672">|&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;Convert a PDF file to Markdown format.&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> os&lt;span style="color:#f92672">.&lt;/span>path&lt;span style="color:#f92672">.&lt;/span>exists(pdf_path):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> &lt;span style="color:#a6e22e">FileNotFoundError&lt;/span>(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;PDF file not found: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>pdf_path&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#75715e"># Get page count for progress bar&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    doc &lt;span style="color:#f92672">=&lt;/span> pymupdf&lt;span style="color:#f92672">.&lt;/span>open(pdf_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    page_count &lt;span style="color:#f92672">=&lt;/span> doc&lt;span style="color:#f92672">.&lt;/span>page_count
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    doc&lt;span style="color:#f92672">.&lt;/span>close()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Converting: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>pdf_path&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">with&lt;/span> tqdm(total&lt;span style="color:#f92672">=&lt;/span>page_count, unit&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;page&amp;#34;&lt;/span>, desc&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Processing&amp;#34;&lt;/span>, colour&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;blue&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> bar:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#75715e"># to_markdown() converts the whole file in one call, so the bar fills only after it returns&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        md_text &lt;span style="color:#f92672">=&lt;/span> pymupdf4llm&lt;span style="color:#f92672">.&lt;/span>to_markdown(pdf_path, page_chunks&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">False&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        bar&lt;span style="color:#f92672">.&lt;/span>n &lt;span style="color:#f92672">=&lt;/span> page_count
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        bar&lt;span style="color:#f92672">.&lt;/span>refresh()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> output_path &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        output_path &lt;span style="color:#f92672">=&lt;/span> Path(pdf_path)&lt;span style="color:#f92672">.&lt;/span>with_suffix(&lt;span style="color:#e6db74">&amp;#39;.md&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">with&lt;/span> open(output_path, &lt;span style="color:#e6db74">&amp;#39;w&amp;#39;&lt;/span>, encoding&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;utf-8&amp;#39;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        f&lt;span style="color:#f92672">.&lt;/span>write(md_text)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;✓ Done: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>output_path&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> (&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(md_text)&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">,&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> characters)&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">return&lt;/span> str(output_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">batch_convert&lt;/span>(input_dir: str, output_dir: str &lt;span style="color:#f92672">|&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;Convert all PDFs in a directory to Markdown.&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    input_path &lt;span style="color:#f92672">=&lt;/span> Path(input_dir)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> input_path&lt;span style="color:#f92672">.&lt;/span>is_dir():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> &lt;span style="color:#a6e22e">NotADirectoryError&lt;/span>(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Not a directory: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>input_dir&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    pdfs &lt;span style="color:#f92672">=&lt;/span> sorted(input_path&lt;span style="color:#f92672">.&lt;/span>glob(&lt;span style="color:#e6db74">&amp;#34;*.pdf&amp;#34;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> pdfs:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;No PDF files found in: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>input_dir&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        sys&lt;span style="color:#f92672">.&lt;/span>exit(&lt;span style="color:#ae81ff">0&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> output_dir:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        output_dir &lt;span style="color:#f92672">=&lt;/span> Path(output_dir)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        output_dir &lt;span style="color:#f92672">=&lt;/span> input_path&lt;span style="color:#f92672">.&lt;/span>parent &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#e6db74">&amp;#34;output&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    output_dir&lt;span style="color:#f92672">.&lt;/span>mkdir(parents&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>, exist_ok&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    total &lt;span style="color:#f92672">=&lt;/span> len(pdfs)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    succeeded &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    failed &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Batch mode: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>total&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> PDF(s) found in &amp;#39;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>input_dir&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Output folder: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>output_dir&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">for&lt;/span> i, pdf_path &lt;span style="color:#f92672">in&lt;/span> enumerate(pdfs, start&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;[&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>i&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>total&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">] &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>pdf_path&lt;span style="color:#f92672">.&lt;/span>name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> output_path &lt;span style="color:#f92672">=&lt;/span> output_dir &lt;span style="color:#f92672">/&lt;/span> pdf_path&lt;span style="color:#f92672">.&lt;/span>with_suffix(&lt;span style="color:#e6db74">&amp;#39;.md&amp;#39;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> convert_pdf_to_markdown(str(pdf_path), str(output_path))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> succeeded &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34; ✗ Failed: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>e&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> failed &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;─&amp;#34;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">40&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Batch complete: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>succeeded&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> converted, &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>failed&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> failed&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Output folder: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>output_dir&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">main&lt;/span>():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;Main CLI entry point&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    args &lt;span style="color:#f92672">=&lt;/span> sys&lt;span style="color:#f92672">.&lt;/span>argv[&lt;span style="color:#ae81ff">1&lt;/span>:]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> args:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34;Usage:&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md &amp;lt;input.pdf&amp;gt; [output.md] # Convert a single PDF&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md --batch &amp;lt;folder/&amp;gt; # Convert all PDFs in a folder&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md --batch &amp;lt;folder/&amp;gt; --output &amp;lt;out_folder/&amp;gt; # Batch with custom output dir&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Examples:&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md document.pdf # Creates document.md&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md document.pdf custom.md # Creates custom.md&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md --batch input/ # Converts all PDFs in input/&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        print(&lt;span style="color:#e6db74">&amp;#34; pdf2md --batch ~/documents/pdfs/ --output ~/knowledge-base/docs/&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        sys&lt;span style="color:#f92672">.&lt;/span>exit(&lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> args[&lt;span style="color:#ae81ff">0&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;--batch&amp;#34;&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#75715e"># Guard against --batch being passed without a folder&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#66d9ef">if&lt;/span> len(args) &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            print(&lt;span style="color:#e6db74">&amp;#34;Error: --batch requires a folder path&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            sys&lt;span style="color:#f92672">.&lt;/span>exit(&lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        input_dir &lt;span style="color:#f92672">=&lt;/span> args[&lt;span style="color:#ae81ff">1&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        output_dir &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#e6db74">&amp;#34;--output&amp;#34;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> args:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            idx &lt;span style="color:#f92672">=&lt;/span> args&lt;span style="color:#f92672">.&lt;/span>index(&lt;span style="color:#e6db74">&amp;#34;--output&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#75715e"># Guard against --output being the last argument&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#66d9ef">if&lt;/span> idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">&amp;gt;=&lt;/span> len(args):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>                print(&lt;span style="color:#e6db74">&amp;#34;Error: --output requires a folder path&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>                sys&lt;span style="color:#f92672">.&lt;/span>exit(&lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            output_dir &lt;span style="color:#f92672">=&lt;/span> args[idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        batch_convert(input_dir, output_dir)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        pdf_path &lt;span style="color:#f92672">=&lt;/span> args[&lt;span style="color:#ae81ff">0&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        output_path &lt;span style="color:#f92672">=&lt;/span> args[&lt;span style="color:#ae81ff">1&lt;/span>] &lt;span style="color:#66d9ef">if&lt;/span> len(args) &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        convert_pdf_to_markdown(pdf_path, output_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> __name__ &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;__main__&amp;#34;&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    main()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What I love about this code:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Smart defaults:&lt;/strong> If you don&amp;rsquo;t specify an output path, it just replaces &lt;code>.pdf&lt;/code> with &lt;code>.md&lt;/code>&lt;/li>
&lt;li>&lt;strong>Progress bars:&lt;/strong> &lt;code>tqdm&lt;/code> gives you a blue progress bar with page count&lt;/li>
&lt;li>&lt;strong>Batch mode:&lt;/strong> &lt;code>--batch&lt;/code> processes an entire folder at once, with optional &lt;code>--output&lt;/code> target&lt;/li>
&lt;li>&lt;strong>Helpful errors:&lt;/strong> Clear messages when things go wrong&lt;/li>
&lt;li>&lt;strong>Flexible usage:&lt;/strong> Works with relative paths, absolute paths, custom output names&lt;/li>
&lt;/ul>
&lt;p>Make it executable:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>chmod +x pdf2md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>And now it&amp;rsquo;s a proper command-line tool.&lt;/p>
&lt;hr>
&lt;h2 id="the-moment-of-truth-testing-with-real-data">The Moment of Truth: Testing with Real Data&lt;/h2>
&lt;p>Theory is great. But does it actually work?&lt;/p>
&lt;p>I grabbed that 1.3MB research paper on Generative Engine Optimization and ran:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>python pdf2md input/test.pdf output/test.md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The output:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Converting input/test.pdf to Markdown...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Processing: 100%|████████████████| 12/12 [00:02&amp;lt;00:00, 5.8 pages/s]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>✓ Done: output/test.md (73,463 characters)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>1.3MB PDF → 74KB of clean Markdown in seconds.&lt;/strong>&lt;/p>
&lt;p>I opened the output file, and there it was—perfectly formatted markdown:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-markdown" data-lang="markdown">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">## **GEO: Generative Engine Optimization**
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Pranjal Aggarwal [∗]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Indian Institute of Technology Delhi
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>New Delhi, India
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pranjal2041@gmail.com
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Ashwin Kalyan
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Independent
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Seattle, USA
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>asaavashwin@gmail.com
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Headers, formatting, structure—all preserved. No manual cleanup needed.&lt;/p>
&lt;p>Success.&lt;/p>
&lt;hr>
&lt;h2 id="what-this-unlocks">What This Unlocks&lt;/h2>
&lt;p>Now that I have PDFs converting to Markdown reliably, a whole world of possibilities opens up:&lt;/p>
&lt;h3 id="ai-workflows">AI Workflows&lt;/h3>
&lt;ul>
&lt;li>Feed research papers and documentation directly into Claude or other LLMs&lt;/li>
&lt;li>Build RAG (Retrieval Augmented Generation) pipelines backed by your document library&lt;/li>
&lt;li>Process technical documentation at scale without losing structure&lt;/li>
&lt;/ul>
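&lt;p>For the RAG case, one common first step is to split the converted Markdown into chunks at heading boundaries before embedding. The sketch below is a minimal illustration and not part of &lt;code>pdf2md&lt;/code>: the function name &lt;code>split_by_headings&lt;/code> and the sample text are hypothetical, and it assumes ATX-style headings (lines starting with &lt;code>#&lt;/code>) like the one in the sample output above.&lt;/p>

```python
import re

# Matches ATX headings such as "# Title" or "## Section" (one to six hashes).
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_headings(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # A heading closes whatever chunk has been collected so far.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (for example, leading blank lines).
    return [chunk for chunk in chunks if chunk]


# Hypothetical sample input, standing in for a converted document.
sample = "## Intro\nFirst paragraph.\n\n## Method\nSecond paragraph.\n"
chunks = split_by_headings(sample)
for chunk in chunks:
    print(repr(chunk))
```

&lt;p>A line starting with &lt;code>#&lt;/code> inside a fenced code block would also be treated as a heading here, so real documents may need a smarter splitter.&lt;/p>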
&lt;h3 id="knowledge-management">Knowledge Management&lt;/h3>
&lt;ul>
&lt;li>Import PDFs into your Obsidian vault automatically&lt;/li>
&lt;li>Version control document content (because it&amp;rsquo;s now plain text in git)&lt;/li>
&lt;li>Full-text search across your entire converted document library&lt;/li>
&lt;/ul>
&lt;h3 id="automation-ideas">Automation Ideas&lt;/h3>
&lt;ul>
&lt;li>Watch folder that auto-converts any dropped PDFs&lt;/li>
&lt;li>Batch process entire directories of reports, papers, or manuals&lt;/li>
&lt;li>Feed converted markdown directly into a vector database&lt;/li>
&lt;li>API wrapper to convert PDFs via HTTP requests&lt;/li>
&lt;/ul>
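&lt;p>The watch-folder idea can start as a simple polling pass that looks for PDFs with no matching &lt;code>.md&lt;/code> file yet. This is a rough sketch under assumptions: &lt;code>find_unconverted&lt;/code> and the folder names are hypothetical, and it only lists pending files rather than converting them.&lt;/p>

```python
import tempfile
from pathlib import Path


def find_unconverted(watch_dir, output_dir):
    """Return PDFs in watch_dir that have no matching .md file in output_dir."""
    watch_path = Path(watch_dir)
    output_path = Path(output_dir)
    pending = []
    for pdf_path in sorted(watch_path.glob("*.pdf")):
        # Same naming rule as batch mode: report.pdf maps to report.md
        target = output_path / pdf_path.with_suffix(".md").name
        if not target.exists():
            pending.append(pdf_path)
    return pending


# Demo with temporary folders (hypothetical names, empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    out = Path(tmp) / "output"
    inbox.mkdir()
    out.mkdir()
    (inbox / "a.pdf").touch()
    (inbox / "b.pdf").touch()
    (out / "a.md").touch()  # pretend a.pdf was already converted
    pending_names = [p.name for p in find_unconverted(inbox, out)]
    print(pending_names)
```

&lt;p>A real watcher would call this in a loop with a pause between passes and hand each pending file to &lt;code>convert_pdf_to_markdown&lt;/code>.&lt;/p>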
&lt;hr>
&lt;h2 id="lessons-learned-especially-for-cli-beginners">Lessons Learned (Especially for CLI Beginners)&lt;/h2>
&lt;h3 id="1-virtual-environments-are-non-negotiable">1. Virtual Environments Are Non-Negotiable&lt;/h3>
&lt;p>Every Python project should live in its own virtual environment. Always:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>python3 -m venv venv
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source venv/bin/activate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pip install --upgrade pip
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This keeps dependencies isolated and projects reproducible.&lt;/p>
&lt;h3 id="2-bleeding-edge-isnt-always-better">2. Bleeding-Edge Isn&amp;rsquo;t Always Better&lt;/h3>
&lt;p>Python 3.14 is awesome, but sometimes mature tooling (like PyMuPDF) that &amp;ldquo;just works&amp;rdquo; beats bleeding-edge alternatives. Don&amp;rsquo;t be afraid to pivot when something doesn&amp;rsquo;t work.&lt;/p>
&lt;h3 id="3-test-with-real-data">3. Test With Real Data&lt;/h3>
&lt;p>I didn&amp;rsquo;t test with &amp;ldquo;hello.pdf&amp;rdquo; containing two sentences. I tested with a 1.3MB research paper. Real data reveals real issues (or in this case, confirms it works beautifully).&lt;/p>
&lt;h3 id="4-document-as-you-build">4. Document As You Build&lt;/h3>
&lt;p>Writing the README alongside the code made the project immediately understandable. Future-me will thank present-me.&lt;/p>
&lt;h3 id="5-claude-code--pai--superpowers">5. Claude Code + PAI = Superpowers&lt;/h3>
&lt;p>Working with Claude through the PAI infrastructure meant I had a senior developer helping me think through:&lt;/p>
&lt;ul>
&lt;li>Project structure (first principles)&lt;/li>
&lt;li>Library selection (when to pivot)&lt;/li>
&lt;li>Code organization (clean, maintainable)&lt;/li>
&lt;li>Real-world usage patterns&lt;/li>
&lt;/ul>
&lt;p>This wasn&amp;rsquo;t just coding faster—it was learning better patterns while building.&lt;/p>
&lt;hr>
&lt;h2 id="usage-examples">Usage Examples&lt;/h2>
&lt;h3 id="basic-conversion">Basic Conversion&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Activate environment first (always!)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source venv/bin/activate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Convert a PDF&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md document.pdf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Custom output name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md research.pdf my-notes.md
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Full paths&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md ~/Downloads/paper.pdf ~/Documents/notes.md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="batch-processing">Batch Processing&lt;/h3>
&lt;p>Convert an entire folder of PDFs:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>source venv/bin/activate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Convert all PDFs in a folder (by default, output goes to an output/ folder next to the input folder)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md --batch ~/documents/pdfs/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Convert to a specific knowledge base directory&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python pdf2md --batch ~/documents/pdfs/ --output ~/knowledge-base/docs/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="add-to-path-optional">Add to PATH (Optional)&lt;/h3>
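&lt;p>To use &lt;code>pdf2md&lt;/code> from anywhere, add the project folder to your &lt;code>PATH&lt;/code>. This assumes the script is executable and that the Python interpreter it runs under has the project&amp;rsquo;s dependencies installed:&lt;/p>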
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Add to ~/.zshrc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export PATH&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;/Users/dsa/projects/pdf-to-markdown:&lt;/span>$PATH&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Then run from anywhere&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pdf2md ~/Downloads/paper.pdf ~/Documents/paper.md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;p>This tool works great as-is, but there are some exciting enhancements on the roadmap:&lt;/p>
&lt;h3 id="immediate-improvements">Immediate Improvements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Better layout analysis:&lt;/strong> Install &lt;code>pymupdf_layout&lt;/code> for improved structure detection on complex documents&lt;/li>
&lt;li>&lt;strong>Recursive batch mode:&lt;/strong> Process nested folder structures, not just flat directories&lt;/li>
&lt;/ul>
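&lt;p>Recursive batch mode mostly comes down to swapping &lt;code>glob&lt;/code> for &lt;code>rglob&lt;/code> and mirroring the folder structure on the output side. The sketch below only plans the output paths. The name &lt;code>plan_recursive_outputs&lt;/code> and the sample folders are hypothetical, and nothing here is in the current script.&lt;/p>

```python
import tempfile
from pathlib import Path


def plan_recursive_outputs(input_dir, output_dir):
    """Map every PDF under input_dir to a mirrored .md path under output_dir."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    plan = []
    # rglob descends into subfolders, unlike the flat glob used in batch mode.
    for pdf_path in sorted(input_path.rglob("*.pdf")):
        relative = pdf_path.relative_to(input_path)
        plan.append((pdf_path, output_path / relative.with_suffix(".md")))
    return plan


# Demo with a temporary nested folder (hypothetical names, empty files).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "pdfs"
    out_root = Path(tmp) / "out"
    (root / "papers" / "2024").mkdir(parents=True)
    (root / "top.pdf").touch()
    (root / "papers" / "2024" / "geo.pdf").touch()
    planned = [
        target.relative_to(out_root).as_posix()
        for _, target in plan_recursive_outputs(root, out_root)
    ]
    print(planned)
```

&lt;p>Before converting, each target&amp;rsquo;s parent folder would need to exist, for example via &lt;code>target.parent.mkdir(parents=True, exist_ok=True)&lt;/code>, followed by a call to &lt;code>convert_pdf_to_markdown&lt;/code> for each pair.&lt;/p>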
&lt;h3 id="future-integrations">Future Integrations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>RAG pipeline:&lt;/strong> Auto-feed converted markdown into a vector database&lt;/li>
&lt;li>&lt;strong>Obsidian plugin:&lt;/strong> Detect PDFs in vault and convert automatically&lt;/li>
&lt;li>&lt;strong>FastAPI wrapper:&lt;/strong> Create an HTTP API for web apps to use&lt;/li>
&lt;li>&lt;strong>Electron/Tauri app:&lt;/strong> Build a desktop GUI for non-technical users&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-bigger-picture-why-this-matters">The Bigger Picture: Why This Matters&lt;/h2>
&lt;p>This project is tiny—roughly 100 lines of Python, 30 minutes of work. But it represents something bigger:&lt;/p>
&lt;p>&lt;strong>The ability to build tools that solve your actual problems.&lt;/strong>&lt;/p>
&lt;p>I had a workflow friction (PDFs don&amp;rsquo;t work well with AI tools). I built a solution. Now that friction is gone, and I can focus on higher-level work.&lt;/p>
&lt;p>And the data is clear: converting your document library to Markdown isn&amp;rsquo;t a nice-to-have. It&amp;rsquo;s a multiplier on every AI workflow that follows. Up to 70% fewer tokens consumed. 84% fewer retrieval failures. 50% fewer incorrect answers. These aren&amp;rsquo;t marginal improvements—they&amp;rsquo;re transformational.&lt;/p>
&lt;p>Working with Claude Code through PAI accelerated all of this. It&amp;rsquo;s like having a patient senior developer sitting next to you, suggesting better approaches, catching errors before they happen, and explaining &lt;em>why&lt;/em> certain patterns work.&lt;/p>
&lt;hr>
&lt;h2 id="resources">Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>PyMuPDF4LLM Docs:&lt;/strong> &lt;a href="https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/" target="_blank" rel="noopener noreferrer">https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/&lt;/a>
&lt;/li>
&lt;li>&lt;strong>PyMuPDF GitHub:&lt;/strong> &lt;a href="https://github.com/pymupdf/PyMuPDF" target="_blank" rel="noopener noreferrer">https://github.com/pymupdf/PyMuPDF&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="citations-markdown-vs-pdf-for-llms">Citations: Markdown vs PDF for LLMs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Why PDFs Fail Under LLM Parsing&lt;/strong> — Steven Howard, Untethered AI: &lt;a href="https://untetheredai.substack.com/p/why-pdfs-fail-under-llm-parsing" target="_blank" rel="noopener noreferrer">https://untetheredai.substack.com/p/why-pdfs-fail-under-llm-parsing&lt;/a>
&lt;/li>
&lt;li>&lt;strong>PDF vs Markdown for AI: Token Efficiency&lt;/strong> — MarkdownConverters: &lt;a href="https://markdownconverters.com/blog/pdf-vs-markdown-ai-tokens" target="_blank" rel="noopener noreferrer">https://markdownconverters.com/blog/pdf-vs-markdown-ai-tokens&lt;/a>
&lt;/li>
&lt;li>&lt;strong>Revolutionizing RAG with Enhanced PDF Structure Recognition&lt;/strong> — arXiv:2401.12599 (2024): &lt;a href="https://arxiv.org/abs/2401.12599" target="_blank" rel="noopener noreferrer">https://arxiv.org/abs/2401.12599&lt;/a>
&lt;/li>
&lt;li>&lt;strong>Approaches to PDF Data Extraction for Information Retrieval&lt;/strong> — NVIDIA Technical Blog: &lt;a href="https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/" target="_blank" rel="noopener noreferrer">https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/&lt;/a>
&lt;/li>
&lt;li>&lt;strong>Improved RAG Document Processing With Markdown&lt;/strong> — Dr. Leon Eversberg, Towards Data Science: &lt;a href="https://medium.com/data-science/improved-rag-document-processing-with-markdown-426a2e0dd82b" target="_blank" rel="noopener noreferrer">https://medium.com/data-science/improved-rag-document-processing-with-markdown-426a2e0dd82b&lt;/a>
&lt;/li>
&lt;li>&lt;strong>Contextual Chunking: Boost Your RAG Retrieval Accuracy&lt;/strong> — Unstructured.io: &lt;a href="https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy" target="_blank" rel="noopener noreferrer">https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy&lt;/a>
&lt;/li>
&lt;li>&lt;strong>Boosting AI Performance: The Power of LLM-Friendly Content in Markdown&lt;/strong> — Webex Developers Blog: &lt;a href="https://developer.webex.com/blog/boosting-ai-performance-the-power-of-llm-friendly-content-in-markdown" target="_blank" rel="noopener noreferrer">https://developer.webex.com/blog/boosting-ai-performance-the-power-of-llm-friendly-content-in-markdown&lt;/a>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;strong>Happy converting!&lt;/strong>&lt;/p></content></item></channel></rss>