<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ibm on</title><link>https://augmentedresilience.com/tags/ibm/</link><description>Recent content in Ibm on</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 02 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://augmentedresilience.com/tags/ibm/index.xml" rel="self" type="application/rss+xml"/><item><title>Upgrading My PDF Converter to IBM's Docling</title><link>https://augmentedresilience.com/posts/augmented-resilience-posts/upgrading-my-pdf-converter-to-ibm-docling/</link><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><guid>https://augmentedresilience.com/posts/augmented-resilience-posts/upgrading-my-pdf-converter-to-ibm-docling/</guid><description>&lt;h2 id="when-my-own-tool-couldnt-handle-my-work">When My Own Tool Couldn&amp;rsquo;t Handle My Work&lt;/h2>
&lt;p>The error message was easy to dismiss: &lt;code>RapidOCR returned empty result!&lt;/code>. It appeared twice in the terminal, then silence — a blank .md file where a 40-page Oracle HCM implementation guide should have been. The PDF had come straight from Oracle&amp;rsquo;s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.&lt;/p></description><content>&lt;h2 id="when-my-own-tool-couldnt-handle-my-work">When My Own Tool Couldn&amp;rsquo;t Handle My Work&lt;/h2>
&lt;p>The error message was easy to dismiss: &lt;code>RapidOCR returned empty result!&lt;/code>. It appeared twice in the terminal, then silence — a blank .md file where a 40-page Oracle HCM implementation guide should have been. The PDF had come straight from Oracle&amp;rsquo;s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.&lt;/p>
&lt;p>That was one category of failure. The other was quieter. For documents that did convert, I started noticing the tables were wrong — not corrupted, just structurally dissolved. An eligibility matrix that should have had six clearly labeled columns came back as a run of loosely connected text. Useful for nothing.&lt;/p>
&lt;p>I had built this tool to serve my Oracle work. Then my Oracle work showed me exactly where it fell short.&lt;/p>
&lt;hr>
&lt;h2 id="the-problem-with-pymupdf4llm">The Problem with PyMuPDF4LLM&lt;/h2>
&lt;p>If you&amp;rsquo;ve followed this series, you know that PyMuPDF4LLM was a solid choice when I first &lt;a href="https://augmentedresilience.com/posts/when-your-pdf-workflow-breaks-building-a-markdown-converter-with-claude-code/" target="_blank" rel="noopener noreferrer">built the converter&lt;/a>. It handled text-based PDFs cleanly, installed without friction, and required almost no configuration. For research papers and simple documentation, it worked well.&lt;/p>
&lt;p>But Oracle HCM documentation is a different category of document. Oracle&amp;rsquo;s guides are dense with tables: configuration reference grids, eligibility matrices, step-and-action setup tables. These are not decorative — they carry most of the meaning. When PyMuPDF4LLM dissolved those tables into unstructured text, it was silently degrading the most important parts of the document.&lt;/p>
&lt;p>The image-based PDF problem was a hard wall. If a document was captured as page images rather than extractable text, the converter returned nothing: no partial output and no warning beyond an easily missed log line, just empty files.&lt;/p>
&lt;hr>
&lt;h2 id="discovering-docling">Discovering Docling&lt;/h2>
&lt;p>IBM Research Zurich&amp;rsquo;s AI for Knowledge team open-sourced &lt;a href="https://github.com/docling-project/docling" target="_blank" rel="noopener noreferrer">Docling&lt;/a>
in July 2024. The project has a specific focus: turning complex documents into structured, AI-ready output. In April 2025, IBM donated it to the LF AI &amp;amp; Data Foundation, and it now powers data ingestion for Red Hat Enterprise Linux AI. As of this writing, it has over 24,000 GitHub stars.&lt;/p>
&lt;p>What makes Docling different is that it treats document conversion as a computer vision problem, not just a text extraction problem.&lt;/p>
&lt;p>&lt;strong>Layout analysis:&lt;/strong> Docling uses an RT-DETR-derived model trained on DocLayNet — IBM&amp;rsquo;s human-annotated dataset of real-world documents — to detect and classify every region on the page: tables, figures, headers, footers, section titles, body text. It knows the structure before it extracts any content.&lt;/p>
&lt;p>&lt;strong>Table reconstruction:&lt;/strong> This is where Docling earns its place for Oracle documentation. It uses a vision transformer called TableFormer that predicts row/column structure and header roles directly from the page image. The result is a proper Markdown table, not a stream of cell values.&lt;/p>
&lt;p>&lt;strong>Image-based PDFs:&lt;/strong> For documents stored as page images, Docling integrates OCR into its pipeline natively. The same converter handles text-based and image-based PDFs without any changes on your end.&lt;/p>
&lt;hr>
&lt;h2 id="the-switch">The Switch&lt;/h2>
&lt;p>The API change was minimal. The old code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> pymupdf4llm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>md_text &lt;span style="color:#f92672">=&lt;/span> pymupdf4llm&lt;span style="color:#f92672">.&lt;/span>to_markdown(pdf_path)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The new code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> docling.document_converter &lt;span style="color:#f92672">import&lt;/span> DocumentConverter
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>converter &lt;span style="color:#f92672">=&lt;/span> DocumentConverter()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>result &lt;span style="color:#f92672">=&lt;/span> converter&lt;span style="color:#f92672">.&lt;/span>convert(pdf_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>md_text &lt;span style="color:#f92672">=&lt;/span> result&lt;span style="color:#f92672">.&lt;/span>document&lt;span style="color:#f92672">.&lt;/span>export_to_markdown()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Three lines instead of one, but the extra structure pays dividends: &lt;code>DocumentConverter&lt;/code> can be initialized once and reused across an entire batch, which matters when processing a folder of 50 Oracle guides.&lt;/p>
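&lt;p>A minimal sketch of that reuse, assuming a flat folder of PDFs. The helper names &lt;code>make_docling_converter&lt;/code> and &lt;code>convert_folder&lt;/code> are hypothetical and are not part of Docling or of &lt;code>pdf2md&lt;/code>; only &lt;code>DocumentConverter&lt;/code>, &lt;code>convert&lt;/code>, and &lt;code>export_to_markdown&lt;/code> come from the snippet above. The record-and-continue error policy is also an assumption, not necessarily what the real tool does.&lt;/p>

```python
from pathlib import Path
from typing import Callable, Dict


def make_docling_converter() -> Callable[[Path], str]:
    """Build one DocumentConverter and return a function that reuses it."""
    # Imported lazily so the rest of this sketch runs without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # created once, shared by every call below

    def to_markdown(pdf_path: Path) -> str:
        result = converter.convert(pdf_path)
        return result.document.export_to_markdown()

    return to_markdown


def convert_folder(folder: Path, to_markdown: Callable[[Path], str]) -> Dict[str, str]:
    """Convert every PDF in `folder`, writing a .md file next to each one.

    Returns a mapping of file name to "ok" or an error message, so one bad
    PDF does not stop the rest of the batch.
    """
    outcomes: Dict[str, str] = {}
    for pdf_path in sorted(folder.glob("*.pdf")):
        try:
            md_text = to_markdown(pdf_path)
            pdf_path.with_suffix(".md").write_text(md_text, encoding="utf-8")
            outcomes[pdf_path.name] = "ok"
        except Exception as exc:  # assumed policy: record the failure, keep going
            outcomes[pdf_path.name] = f"failed: {exc}"
    return outcomes
```

&lt;p>Calling &lt;code>convert_folder(Path.home() / "oracle-pdfs", make_docling_converter())&lt;/code> would then build the converter once for the whole folder. Passing the conversion function in as a parameter also makes the loop easy to exercise with a stub.&lt;/p>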
&lt;p>&lt;strong>A note on startup:&lt;/strong> The first time you run Docling, it downloads its ML models from Hugging Face. You will see this:&lt;/p>
&lt;pre tabindex="0">&lt;code>Loading weights: 100%|██████████| 770/770 [00:00&amp;lt;00:00, 1656.35it/s]
&lt;/code>&lt;/pre>&lt;p>This is normal. The models are cached locally after the first download, so subsequent runs skip the download. If you see a warning about &lt;code>HF_TOKEN&lt;/code>, that is also expected. Docling works without one, but setting a token removes the rate-limit warning:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-zsh" data-lang="zsh">&lt;span style="display:flex;">&lt;span>echo &lt;span style="color:#e6db74">&amp;#39;export HF_TOKEN=&amp;#34;hf_your_token_here&amp;#34;&amp;#39;&lt;/span> &amp;gt;&amp;gt; ~/.zshrc
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="what-changed-in-practice">What Changed in Practice&lt;/h2>
&lt;p>&lt;strong>Oracle documentation:&lt;/strong> Tables that previously collapsed into text now render as proper Markdown tables. A 6-column configuration reference comes back with headers intact and every row correctly aligned.&lt;/p>
&lt;p>&lt;strong>AI books:&lt;/strong> My knowledge base includes dense technical books on LLM engineering and machine learning. These have complex layouts — sidebars, multi-column sections, figures with captions. Docling&amp;rsquo;s layout model handles these significantly better than PyMuPDF4LLM&amp;rsquo;s heuristic approach.&lt;/p>
&lt;p>&lt;strong>Image-based PDFs:&lt;/strong> Documents that previously produced empty output now convert cleanly. The two-step workaround (ocrmypdf → pdf2md) is no longer necessary in most cases.&lt;/p>
&lt;hr>
&lt;h2 id="two-other-improvements">Two Other Improvements&lt;/h2>
&lt;p>While I was updating the engine, I added two things that were overdue:&lt;/p>
&lt;p>&lt;strong>DOCX support.&lt;/strong> The converter now handles Word documents using pandoc as a backend. The same &lt;code>pdf2md&lt;/code> command works for both file types. This matters for Oracle support exports and study notes from my reMarkable.&lt;/p>
&lt;p>&lt;strong>Batch manifest.&lt;/strong> When processing a large folder, the converter now writes a manifest file tracking which files have been converted and their checksums. Re-running on the same folder skips files that haven&amp;rsquo;t changed. A &lt;code>--force&lt;/code> flag overrides this when you need a fresh conversion.&lt;/p>
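&lt;p>The skip logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the tool&amp;rsquo;s actual code: the manifest file name, its JSON layout (file name mapped to a SHA-256 digest), and every function name here are hypothetical.&lt;/p>

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List

MANIFEST_NAME = ".pdf2md-manifest.json"  # hypothetical file name


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(folder: Path) -> Dict[str, str]:
    """Load the name-to-checksum mapping, or an empty one on first run."""
    manifest_path = folder / MANIFEST_NAME
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def files_to_convert(folder: Path, force: bool = False) -> List[Path]:
    """List PDFs that are new or changed since the manifest was written."""
    manifest = load_manifest(folder)
    pending = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        # A missing entry or a different checksum both mean "convert again".
        if force or manifest.get(pdf_path.name) != sha256_of(pdf_path):
            pending.append(pdf_path)
    return pending


def record_converted(folder: Path, converted: List[Path]) -> None:
    """Store current checksums for files that converted successfully."""
    manifest = load_manifest(folder)
    for pdf_path in converted:
        manifest[pdf_path.name] = sha256_of(pdf_path)
    (folder / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

&lt;p>In this sketch, &lt;code>force=True&lt;/code> plays the role of the &lt;code>--force&lt;/code> flag. Comparing checksums rather than modification times means a file that was merely copied or touched is still skipped.&lt;/p>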
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pdf2md --batch ~/oracle-pdfs/ &lt;span style="color:#75715e"># skips already-converted&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pdf2md --batch ~/oracle-pdfs/ --force &lt;span style="color:#75715e"># reconverts everything&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>The web UI — which I added in the &lt;a href="https://augmentedresilience.com/posts/adding-a-web-ui-to-my-pdf-to-markdown-converter/" target="_blank" rel="noopener noreferrer">last post&lt;/a>
— has also been updated to use Docling. Drag a PDF onto it, click Convert, and the same deep-learning pipeline runs behind the scenes.&lt;/p>
&lt;p>The next thing I want to add is direct output to the Obsidian inbox. Right now the flow is: convert → download ZIP → move to vault. A toggle that sends output directly to &lt;code>~/projects/obsidian-vault/00-inbox/&lt;/code> would cut that manual step entirely.&lt;/p>
&lt;p>The tool is doing what I originally wanted: converting my Oracle documentation and AI library into clean, searchable Markdown. Docling is what makes that reliable for the documents that actually matter.&lt;/p></content></item></channel></rss>