Compressing an Obsidian vault into a single PDF


As mentioned in my other post, "PDF Export for selfhosted outline instance", Obsidian is a great tool for writing documentation.

I want to set aside the cross-device syncing issues for now, since those can likely be solved by either paying for Obsidian Sync or setting up LiveSync (which I haven’t done yet).

For now, how can I create a single PDF from my Obsidian vault?

Cleaning the markdown

Since Obsidian doesn’t have a real API, and I don’t want to rely on a plugin that could break whenever Obsidian updates, I prefer to work directly with the raw Markdown files.

This approach comes with a small compromise: it assumes your Markdown doesn’t rely on too many non-standard features. If you’ve heavily customized Obsidian with dozens of “productivity” plugins, this method might not work for you.

First I pass the Obsidian vault through obsidian-export, a Rust tool that exports the vault to regular Markdown. This resolves [[note]] wikilinks to normal Markdown links.

Since the Markdown files may sit in a nested folder structure and I want to merge all pages into one document, the relative links need to be fixed.

My rule for a page link is that a document at /my-vault/example/subfolder/test.md has the unique relative file path example/subfolder/test.md, which is slugified to example-subfolder-test.
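The rule above can be sketched as a small function. This is only an illustration: relativePathToAnchor is a hypothetical name, not the filenameToAnchor helper used further down, and it assumes the path has already been made relative to the vault root (for example with path.relative). Non-ASCII characters are simply collapsed into dashes here, so two differently named files could in theory end up with the same anchor.

```javascript
// Hypothetical helper (not from the repo): turn a vault-relative path into an anchor id.
function relativePathToAnchor(relativePath) {
    return relativePath
        .replace(/\\/g, "/")          // normalize Windows separators
        .replace(/\.md$/i, "")        // drop the .md extension
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "-")  // every run of other characters becomes one dash
        .replace(/^-+|-+$/g, "");     // trim leading and trailing dashes
}

console.log(relativePathToAnchor("example/subfolder/test.md")); // example-subfolder-test
```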

So for each link in the markdown:

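  • get the referenced file path and compute its path relative to the vault root
  • derive the unique anchor reference from this relative path
  • use the anchor reference as the link target

import path from "node:path";

// Rewrites links to other .md files into in-document anchor links.
// filenameToAnchor (not shown here) returns the anchor for a file, or a falsy value if the file is unknown.
export function localizeFileLinks(content, filePath) {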
    const fileFolder = path.dirname(filePath);

    return content.replace(/\[([^\]]+)\]\(([^)]+)\)/g, (match, text, link) => {
        if (link.endsWith(".md") || link.includes(".md#")) {
            const [targetFile, hash] = link.split("#");

            const absTarget = path.resolve(fileFolder, targetFile);
            const anchor = filenameToAnchor(absTarget);

            if (anchor) {
                return `[${text}](#${anchor}${hash ? "-" + hash : ""})`;
            } else {
                console.log(absTarget + " not found");
            }
        }
        return match;
    });
}

For images we simply get their path relative to the vault root and leave them where they are. This way if I put the merged markdown at the vault root, the image links in the markdown point to the correct location.

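// Rewrites relative image paths so that they are relative to the vault root.
export function localizeImageLinks(content, filePath, rootDir) {
    const fileFolder = path.dirname(filePath);

    return content.replace(/!\[([^\]]*)\]\(([^)]+)\)/g, (match, alt, imgPath) => {
        // leave remote images (https:, data:, //host/...) untouched
        if (/^(?:[a-z][a-z0-9+.-]+:|\/\/)/i.test(imgPath)) return match;

        const absImg = path.resolve(fileFolder, imgPath);
        const relToRoot = path.relative(rootDir, absImg);
        return `![${alt}](${relToRoot})`;
    });
}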

Fixing Headlines

Since nothing prevents you from adding level-1 headers in Obsidian, I decided to normalize the headings so that no level-1 headings remain in a note: if a file contains one, every heading in that file is shifted down by one level.

export function normalizeHeadings(markdown) {
    const hasH1 = /^# .+/m.test(markdown);
    if (!hasH1) return markdown;

    return markdown.replace(/^(#+)(\s+)/gm, (match, hashes, space) => {
        // add one extra '#'
        return hashes + "#" + space;
    });
}
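One caveat of the regex above: it also matches lines that start with # inside fenced code blocks, such as shell comments. The following is a sketch of a fence-aware variant, not the version used in the repo. The name shiftHeadingsOutsideFences is made up for this example, and the fence detection is deliberately simple (it only looks at lines starting with three or more backticks or tildes).

```javascript
// Hypothetical variant (not from the repo): shift headings down one level,
// but leave lines inside fenced code blocks alone.
function shiftHeadingsOutsideFences(markdown) {
    const lines = markdown.split("\n");
    const insideFence = [];
    let fence = null; // marker of the currently open fence, or null outside fences

    for (const line of lines) {
        if (fence === null) {
            const m = line.match(/^ {0,3}(`{3,}|~{3,})/);
            if (m) fence = m[1];
            insideFence.push(m !== null); // the opening fence line counts as "inside"
        } else {
            insideFence.push(true);
            const m = line.match(/^ {0,3}(`{3,}|~{3,})\s*$/);
            // a closing fence uses the same character and is at least as long
            if (m && m[1][0] === fence[0] && m[1].length >= fence.length) fence = null;
        }
    }

    const isHeading = (line, i) => !insideFence[i] && /^#{1,6}\s/.test(line);
    const hasH1 = lines.some((line, i) => isHeading(line, i) && /^#\s/.test(line));
    if (!hasH1) return markdown;

    return lines
        // level-6 headings cannot go deeper, so they stay as they are
        .map((line, i) => (isHeading(line, i) && !line.startsWith("######") ? "#" + line : line))
        .join("\n");
}
```

Called on a note that contains a level-1 heading and a fenced shell snippet, this shifts the real headings and leaves the # comment lines in the snippet as they were.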

This allows me to use the file name as the level-1 headline, together with the unique anchor I mentioned above:

export async function processMarkdownFile(file, rootDir) {
    ...
    
    const headline = path.basename(file).replace(/\.md$/, "");
    const anchor = filenameToAnchor(file);

    let markdown = "";

    //first the anchor to jump to this .md file
    markdown += `<a class="header-anchor" id="${anchor}"></a>\n\n`;

    //the main h1 headline
    markdown += "# " + headline + "\n\n";

    //now the body
    markdown += parsed.body;

    //force page break after
    markdown += "<div style='page-break-after:always'>&nbsp;</div>\n\n";

    return markdown;
}

The above setup works for links to other Markdown files, but not for links to headlines inside those files. With the code so far, a link like

[Other Headline](../other-file.md#other-headline)

currently gets converted to

[Other Headline](#subfolder-other-file-other-headline)

That anchor does not exist, because so far we only create anchors for the level-1 headings.

Anchor links that reference a heading inside the same Markdown file, like

[Headline in same file](#headline)

are a problem as well, because a) nothing guarantees that "headline" is unique across the vault, and b) non-level-1 headlines do not have ids yet.

I fix this in the conversion step from Markdown to HTML with a plugin:

import markdownItAnchor from "markdown-it-anchor";

md.use(markdownItAnchor, {
    level: 2
});

This adds anchors, but they are not unique: ## heading always gets the anchor id heading, no matter which file it is in.

After some tinkering I ended up with this complicated-looking custom markdown-it plugin:

md.use((md) => {
    md.core.ruler.push('heading_parent', function (state) {
        let h1Stack = [];

        state.tokens.forEach((token, index) => {
            if (token.type === 'inline') {

                const match = token.content.match(
                    /<a\s+class="header-anchor"\s+id="([^"]+)"><\/a>/
                );

                if (match) {
                    const lastAnchorId = match[1];
                    h1Stack.push(lastAnchorId);
                }
            }

            if (token.type === 'heading_open') {
                const level = parseInt(token.tag[1]);

                if (level > 1 && h1Stack.length > 0) {
                    // Associate with last <h1> token
                    const parentH1 = h1Stack[h1Stack.length - 1];

                    let title = state.tokens[index + 1].children
                        .filter(t => ['text', 'code_inline'].includes(t.type))
                        .map(t => t.content)
                        .join('');

                    state.tokens[index].attrSet("id", parentH1 + "-" + slugify(title));
                }
            }
        });
    });
});

Let's explain what it does:

  • loop over the markdown document tokens
  • for every token that is one of the level-1 anchors we created, push its id onto an array
  • for every token that is a non-level-1 heading, take the last entry of the level-1 anchor ids (basically the "parent" level-1 header) and combine it with the slugified heading text to get a unique id

This way all headings have unique anchors and the fixed links work!
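The same-file links mentioned earlier still need the file's own anchor as a prefix so that they match the ids generated by the plugin. The post does not show that step, so here is a minimal sketch of how it could look. prefixSameFileAnchors is a hypothetical helper, and it assumes that the hash in the link is already the slugified heading text. It would have to run before localizeFileLinks, otherwise the links that were already rewritten would get prefixed a second time.

```javascript
// Hypothetical helper (not from the repo): make same-file anchor links unique
// by prefixing them with the anchor of the file they live in.
function prefixSameFileAnchors(content, fileAnchor) {
    // (?<!!) skips image syntax; only links whose target starts with # are touched
    return content.replace(/(?<!!)\[([^\]]+)\]\(#([^)]+)\)/g, (match, text, hash) => {
        return `[${text}](#${fileAnchor}-${hash})`;
    });
}

const sample = "See [Headline in same file](#headline) and [Other](../other-file.md#other-headline).";
console.log(prefixSameFileAnchors(sample, "example-subfolder-test"));
```

Only the first link in the sample is changed; the link to the other file is left for localizeFileLinks to handle.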

Adding a PDF Outline

PDFs can have an “outline” that appears as a navigation menu in most PDF viewers. To create this, I need the page number for each headline - but that’s tricky, because I can’t know in advance how much space a Markdown document will take, and therefore how many pages it will occupy.

I decided to parse the final PDF after it has been generated. So first, generate the PDF:

import { pathToFileURL } from "node:url";

export async function htmlToPdf(htmlPath, pdfPath) {
    const browser = await puppeteer.launch({
        headless: true,
        defaultViewport: null,
        executablePath: '/usr/bin/google-chrome',
        args: ['--no-sandbox'],
    });

    try {
        const page = await browser.newPage();

        // pathToFileURL escapes spaces and other special characters in the path
        await page.goto(pathToFileURL(htmlPath).href, { waitUntil: "networkidle0" });

        // Export to PDF
        await page.pdf({
            path: pdfPath,
            format: "A4",
            printBackground: true,
            displayHeaderFooter: false,
            margin: { top: "20mm", bottom: "20mm", left: "20mm", right: "20mm" },
        });
    } finally {
        // close the browser even if rendering fails
        await browser.close();
    }
}

Now get the headlines:

export async function getHeadlines(filePath) {
    const loadingTask = pdfjsLib.getDocument(filePath);
    const pdf = await loadingTask.promise;

    let parsed = [];
    let biggestFontSize = 0;

    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();

        const texts = content.items.map(item => {
            const size = item.transform[0];

            if (size > biggestFontSize) biggestFontSize = size;

            return {
                text: item.str,
                font: item.fontName,
                size: size
            }
        });

        parsed.push({
            pageNr: i,
            texts
        });
    }

    let chapters = [];

    parsed.forEach(page => {
        let headlineText = page.texts.filter(text => text.size === biggestFontSize).map(text => text.text);

        if (headlineText.length) {
            chapters.push({
                pageNr: page.pageNr,
                headline: headlineText.join("")
            })
        }
    });

    return chapters;
}

As you can see, this approach is a bit unconventional. When parsing a PDF, there’s no metadata indicating whether a piece of text is an <h1> headline; you only get positions and dimensions. So I assume that a piece of text is an <h1> if it uses the largest font size in the document. I’m not entirely happy with this method, but for now, I haven’t come up with a better solution.
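One fragile spot in this heuristic is the strict === comparison on font sizes, which are floating-point values taken from the text transform. Below is a sketch of the same selection step with a small tolerance. pickChapters and the tolerance value are my own assumptions for illustration; the function works on the same { pageNr, texts } structure that getHeadlines builds, so it can be tried without a PDF.

```javascript
// Hypothetical variant of the selection step: treat every text item whose size is
// within `tolerance` of the largest size in the document as part of an <h1>.
function pickChapters(pages, tolerance = 0.01) {
    // find the largest font size across all pages
    let largest = 0;
    for (const page of pages) {
        for (const t of page.texts) {
            if (t.size > largest) largest = t.size;
        }
    }

    const chapters = [];
    for (const page of pages) {
        const headline = page.texts
            .filter(t => largest - t.size <= tolerance)
            .map(t => t.text)
            .join("")
            .trim();

        // skip pages where only whitespace matched
        if (headline) chapters.push({ pageNr: page.pageNr, headline });
    }
    return chapters;
}

const demoPages = [
    { pageNr: 1, texts: [{ text: "Intro", size: 24.0000001 }, { text: "body", size: 12 }] },
    { pageNr: 2, texts: [{ text: "more body", size: 12 }] },
    { pageNr: 3, texts: [{ text: "Second", size: 24 }, { text: " ", size: 24 }, { text: "Part", size: 24 }] },
];
console.log(pickChapters(demoPages));
```

With strict equality, page 3 of this demo data would be missed because its size differs from the maximum by a rounding error.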

Once I have this array, I can use it to add the outline:

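import fs from "node:fs/promises";
import { outlinePdfCjs } from "@lillallol/outline-pdf-cjs";

export async function addOutline(filePath) {
    const chapters = await getHeadlines(filePath);

    // one line per chapter: "<page number>||<headline>"
    const outline = chapters.map(chapter => {
        return chapter.pageNr + "||" + chapter.headline;
    }).join("\n");

    // write next to the target file so the rename below stays on one filesystem
    // (renaming from /tmp fails with EXDEV if /tmp is a different filesystem)
    const tmpFile = filePath + ".outlined.pdf";

    await outlinePdfCjs({
        loadPath: filePath,
        savePath: tmpFile,
        outline
    });

    // rename replaces the original file, so no separate unlink is needed
    await fs.rename(tmpFile, filePath);
}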

Check out the repo for the full source code:

obsidian-to-pdf