Compressing an Obsidian vault into a single PDF
As mentioned in my other post, Obsidian is a great tool for writing documentation.
I want to set aside the cross-device syncing issues for now, since those can likely be solved by either paying for Obsidian Sync or setting up LiveSync (which I haven’t done yet).
For now, how can I create a single PDF from my Obsidian vault?
Cleaning the markdown
Since Obsidian doesn’t have a real API, and I don’t want to rely on a plugin that could break whenever Obsidian updates, I prefer to work directly with the raw Markdown files.
This approach comes with a small compromise: it assumes your Markdown doesn’t rely on too many non-standard features. If you’ve heavily customized Obsidian with dozens of “productivity” plugins, this method might not work for you.
First, I pass the Obsidian vault through obsidian-export, a Rust tool that exports the vault to regular Markdown. This resolves [[note]] wikilinks into normal links.
Fixing Links
Since the Markdown files may sit in a nested folder structure and I want to merge all pages into one document, the relative links need to be fixed.
My rule for a page link is that a document at /my-vault/example/subfolder/test.md has the unique relative filepath example/subfolder/test.md, which is slugified to example-subfolder-test.
So for each link in the markdown:
- get the referenced filepath and calculate its relative path from the vault root instead
- using this relative path we can calculate the unique anchor reference
- put anchor reference as link target
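The three steps above can be sketched in isolation. This is a minimal illustration, not the code from the repo: `resolveFromVaultRoot` and `slugifyPath` are hypothetical helpers, all paths are assumed to be vault-relative with "/" as separator, and the slug rule (lowercase, every non-alphanumeric run becomes a dash) is my assumption about what "slugified" means here.

```javascript
// Hypothetical helpers for illustration; not the code from the repo.
// All paths are vault-relative and use "/" as the separator.

// Step 1: resolve a relative link against the folder of the file containing it.
function resolveFromVaultRoot(fileDir, link) {
  const parts = fileDir.split("/").filter(Boolean);
  for (const seg of link.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") parts.pop();
    else parts.push(seg);
  }
  return parts.join("/");
}

// Step 2: turn a vault-relative path into an anchor id (assumed slug rule).
function slugifyPath(relPath) {
  return relPath
    .replace(/\.md$/i, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Step 3: use the anchor as the link target.
// A link inside example/subfolder/test.md that points one folder up:
const target = resolveFromVaultRoot("example/subfolder", "../other.md");
const anchor = slugifyPath(target);
console.log(target);                  // example/other.md
console.log(`[Other](#${anchor})`);   // [Other](#example-other)
```

Note that this slug rule drops non-ASCII characters, so two file names that differ only in such characters would collide; a real vault may need a more careful rule.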
import path from "node:path";

// Rewrite links to other .md files into in-document anchor links.
export function localizeFileLinks(content, filePath) {
  const fileFolder = path.dirname(filePath);
  return content.replace(/\[([^\]]+)\]\(([^)]+)\)/g, (match, text, link) => {
    if (link.endsWith(".md") || link.includes(".md#")) {
      const [targetFile, hash] = link.split("#");
      // Resolve the target relative to the folder of the current file
      const absTarget = path.resolve(fileFolder, targetFile);
      const anchor = filenameToAnchor(absTarget);
      if (anchor) {
        return `[${text}](#${anchor}${hash ? "-" + hash : ""})`;
      } else {
        console.log(absTarget + " not found");
      }
    }
    // Leave external links and unresolved targets untouched
    return match;
  });
}
For images we simply get their path relative to the vault root and leave them where they are. This way if I put the merged markdown at the vault root, the image links in the markdown point to the correct location.
export function localizeImageLinks(content, filePath, rootDir) {
  const fileFolder = path.dirname(filePath);
  return content.replace(/!\[([^\]]*)\]\(([^)]+)\)/g, (match, alt, imgPath) => {
    const absImg = path.resolve(fileFolder, imgPath);
    const relToRoot = path.relative(rootDir, absImg);
    // Re-emit the image with its path relative to the vault root
    return `![${alt}](${relToRoot})`;
  });
}
Fixing Headlines
Since nothing prevents you from adding level-1 headings in Obsidian, I decided to normalize the headings: if a file contains a level-1 heading, every heading in it is shifted down one level, so the highest level left in the body is level 2.
export function normalizeHeadings(markdown) {
  const hasH1 = /^# .+/m.test(markdown);
  if (!hasH1) return markdown;
  return markdown.replace(/^(#+)(\s+)/gm, (match, hashes, space) => {
    // add one extra '#'
    return hashes + "#" + space;
  });
}
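One thing to watch with a plain regex over the whole file: a line such as `# install deps` inside a fenced shell block also starts with `#` and would be shifted too. Below is a hedged variant that skips fenced code. `normalizeHeadingsOutsideFences` is a hypothetical name, and the fence detection is deliberately simple (a line starting with three or more backticks or tildes toggles the state), so treat it as a sketch rather than a full Markdown parser.

```javascript
// Hypothetical variant: shift headings down one level, but leave fenced code alone.
function normalizeHeadingsOutsideFences(markdown) {
  let inFence = false;
  // Mark each line as code or prose so both passes below agree.
  const lines = markdown.split("\n").map((text) => {
    const isFenceMarker = /^\s*(`{3,}|~{3,})/.test(text);
    if (isFenceMarker) inFence = !inFence;
    return { text, isCode: inFence || isFenceMarker };
  });
  // Only shift when the prose actually contains a level-1 heading.
  const hasH1 = lines.some((l) => !l.isCode && /^# \S/.test(l.text));
  if (!hasH1) return markdown;
  return lines
    // Headings of level 6 are left as they are, since there is no level 7.
    .map((l) => (l.isCode ? l.text : l.text.replace(/^(#{1,5})(\s)/, "$1#$2")))
    .join("\n");
}

const input = ["# Title", "", "~~~sh", "# just a shell comment", "~~~", "", "## Sub"].join("\n");
console.log(normalizeHeadingsOutsideFences(input));
```

In the sample, "# Title" becomes "## Title" and "## Sub" becomes "### Sub", while the shell comment inside the fence stays untouched.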
This allows me to use the file name as the level-1 headline, together with the unique anchor I mentioned above:
export async function processMarkdownFile(file, rootDir) {
  ...
  const headline = path.basename(file).replace(/\.md$/, "");
  const anchor = filenameToAnchor(file);
  let markdown = "";
  // first, the anchor to jump to this .md file
  markdown += `<a class="header-anchor" id="${anchor}"></a>\n\n`;
  // the main h1 headline
  markdown += "# " + headline + "\n\n";
  // now the body
  markdown += parsed.body;
  // force a page break after the file
  markdown += "<div style='page-break-after:always'> </div>\n\n";
  return markdown;
}
Fixing Anchor Links
The above setup works for links to other Markdown files, but not for links to headlines inside those files. With the code so far,
[Other Headline](../other-file.md#other-headline)
gets converted to
[Other Headline](#subfolder-other-file-other-headline)
which points to an anchor that does not exist, because we only create anchors for the level-1 headings.
Anchor links that reference a heading inside the same Markdown file, like
[Headline in same file](#headline)
are also a problem: a) nothing guarantees that "headline" is not also a heading somewhere else in the vault, and b) non-level-1 headlines do not even have ids yet.
I fix this in the conversion step from markdown to html with a plugin:
import markdownItAnchor from "markdown-it-anchor";
md.use(markdownItAnchor, {
  level: 2
});
This adds anchors, but it creates non-unique ones (## heading gets the anchor id heading).
After some tinkering, I ended up with this complicated-looking custom markdown-it plugin:
md.use((md) => {
  md.core.ruler.push('heading_parent', function (state) {
    let h1Stack = [];
    state.tokens.forEach((token, index) => {
      if (token.type === 'inline') {
        const match = token.content.match(
          /<a\s+class="header-anchor"\s+id="([^"]+)"><\/a>/
        );
        if (match) {
          const lastAnchorId = match[1];
          h1Stack.push(lastAnchorId);
        }
      }
      if (token.type === 'heading_open') {
        const level = parseInt(token.tag[1], 10);
        if (level > 1 && h1Stack.length > 0) {
          // Associate with last <h1> token
          const parentH1 = h1Stack[h1Stack.length - 1];
          let title = state.tokens[index + 1].children
            .filter(t => ['text', 'code_inline'].includes(t.type))
            .map(t => t.content)
            .join('');
          state.tokens[index].attrSet("id", parentH1 + "-" + slugify(title));
        }
      }
    });
  });
});
Let's explain what it does:
- loop over the markdown document tokens
- for every token that is one of the level-1 anchors we created, push its id into an array
- for every token that is a non-level-1 heading take the last entry of the level-1 anchor ids (basically the "parent" level 1 header) plus the current heading text and create a unique slug for it
This way all headings have unique anchors and the fixed links work!
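The snippets in this post don't show how a same-file link such as [Headline in same file](#headline) picks up the file prefix. One way it might be done is sketched below. `prefixSameFileLinks` is a hypothetical helper, not code from the repo, and it assumes the hash in the link is already in the same slug form that the plugin produces for the heading.

```javascript
// Hypothetical helper: prefix same-file hash links with the anchor of their file.
// Assumes the hash already matches the slug the heading plugin generates.
function prefixSameFileLinks(content, fileAnchor) {
  // (?<!!) skips image syntax; only targets that start with "#" are touched.
  return content.replace(
    /(?<!!)\[([^\]]+)\]\(#([^)]+)\)/g,
    (match, text, hash) => `[${text}](#${fileAnchor}-${hash})`
  );
}

const before = "See [Headline in same file](#headline) and [other](other.md#x).";
console.log(prefixSameFileLinks(before, "example-subfolder-test"));
// See [Headline in same file](#example-subfolder-test-headline) and [other](other.md#x).
```

Such a step would have to run per file, before the files are merged, so that each link is prefixed with the anchor of the file it lives in.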
Adding a PDF Outline
PDFs can have an “outline” that appears as a navigation menu in most PDF viewers. To create this, I need the page number for each headline - but that’s tricky, because I can’t know in advance how much space a Markdown document will take, and therefore how many pages it will occupy.
I decided to parse the final PDF after it has been generated. So first, generate the PDF:
export async function htmlToPdf(htmlPath, pdfPath) {
  const browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
    executablePath: '/usr/bin/google-chrome',
    args: ['--no-sandbox'],
  });
  const page = await browser.newPage();
  // Wait until the network is idle so all images are loaded
  await page.goto("file://" + htmlPath, { waitUntil: "networkidle0" });
  // Export to PDF
  await page.pdf({
    path: pdfPath,
    format: "A4",
    printBackground: true,
    displayHeaderFooter: false,
    margin: { top: "20mm", bottom: "20mm", left: "20mm", right: "20mm" },
  });
  await browser.close();
}
Now get the headlines:
export async function getHeadlines(filePath) {
  const loadingTask = pdfjsLib.getDocument(filePath);
  const pdf = await loadingTask.promise;
  let parsed = [];
  let biggestFontSize = 0;
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const texts = content.items.map(item => {
      // transform[0] is the horizontal scale, used here as the font size
      const size = item.transform[0];
      if (size > biggestFontSize) biggestFontSize = size;
      return {
        text: item.str,
        font: item.fontName,
        size: size
      };
    });
    parsed.push({
      pageNr: i,
      texts
    });
  }
  // Every page that contains text in the biggest font starts a chapter
  let chapters = [];
  parsed.forEach(page => {
    let headlineText = page.texts
      .filter(text => text.size === biggestFontSize)
      .map(text => text.text);
    if (headlineText.length) {
      chapters.push({
        pageNr: page.pageNr,
        headline: headlineText.join("")
      });
    }
  });
  return chapters;
}
As you can see, this approach is a bit unconventional. When parsing a PDF, there’s no metadata indicating whether a piece of text is an <h1> headline; you only get positions and dimensions. So, I make the assumption that a piece of text is an <h1> if it uses the largest font size in the document. I’m not entirely happy with this method, but for now, I haven’t come up with a better solution.
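To make the heuristic easier to reason about, here it is on plain data, without any PDF involved. This is a sketch under assumptions: `pickHeadlines` is a hypothetical function, `pages` mimics the shape of the `parsed` array from getHeadlines, and the small `tolerance` is my addition in case extracted font sizes differ by rounding noise (the original compares with strict equality).

```javascript
// Pure-data sketch of the "largest font size is an <h1>" heuristic.
// `pages` mimics the `parsed` array above; `tolerance` is an added assumption.
function pickHeadlines(pages, tolerance = 0.01) {
  // Find the biggest font size across all pages.
  const biggest = pages.reduce(
    (max, p) => p.texts.reduce((m, t) => Math.max(m, t.size), max),
    0
  );
  return pages
    .map((p) => ({
      pageNr: p.pageNr,
      headline: p.texts
        .filter((t) => Math.abs(t.size - biggest) <= tolerance)
        .map((t) => t.text)
        .join(""),
    }))
    // Pages without any text in the biggest font do not start a chapter.
    .filter((c) => c.headline.length > 0);
}

const pages = [
  { pageNr: 1, texts: [{ text: "Intro", size: 32 }, { text: "Some body text", size: 12 }] },
  { pageNr: 2, texts: [{ text: "More body", size: 12 }] },
  { pageNr: 3, texts: [{ text: "Setup", size: 32.004 }, { text: "Steps", size: 24 }] },
];
console.log(pickHeadlines(pages));
```

With this sample, pages 1 and 3 are reported as chapter starts ("Intro" and "Setup"), and page 2 is skipped. The heuristic would break if anything other than an <h1> is ever rendered in the largest font, for example a large inline emoji or a styled title page.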
Once I have this array, I can use it to add the outline:
import fs from "node:fs/promises";
import { outlinePdfCjs } from "@lillallol/outline-pdf-cjs";

export async function addOutline(filePath) {
  const chapters = await getHeadlines(filePath);
  // One outline entry per line: "<page number>||<title>"
  let outline = chapters.map(chapter => {
    return chapter.pageNr + "||" + chapter.headline;
  }).join("\n");
  let tmpFile = "/tmp/outlined-pdf.pdf";
  await outlinePdfCjs({
    loadPath: filePath,
    savePath: tmpFile,
    outline
  });
  // Replace the original PDF with the outlined one
  await fs.unlink(filePath);
  await fs.rename(tmpFile, filePath);
}
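For reference, this is roughly what the outline string looks like for a small chapters array. `buildOutline` is a hypothetical helper that mirrors the mapping in addOutline; the whitespace cleanup is my addition, on the assumption that a headline containing a line break would otherwise spill into a second outline line.

```javascript
// Hypothetical helper: builds the outline string from a chapters array.
function buildOutline(chapters) {
  return chapters
    .map((chapter) => {
      // Collapse whitespace so a headline never spans two outline lines
      const title = chapter.headline.replace(/\s+/g, " ").trim();
      return chapter.pageNr + "||" + title;
    })
    .join("\n");
}

const sample = [
  { pageNr: 1, headline: "Intro" },
  { pageNr: 3, headline: "Setup\nGuide" },
];
console.log(buildOutline(sample));
// 1||Intro
// 3||Setup Guide
```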
Check out the repo for the full source code.