PDF Export for a self-hosted Outline instance


Intro

TL;DR: I want to export an Outline collection to PDF and will experiment with Obsidian later.

Why Outline?

Outline is a clean and powerful knowledge base solution that fits some needs better than Obsidian. Don’t get me wrong - Obsidian is an excellent tool as well, and both have their strengths.

Why Obsidian is great

Your notes are just markdown files. Even if you stop using Obsidian (or if it stops being free), your data remains readable and accessible in any markdown editor.

It just feels like there is a lot of thought put into the app and it is actively developed. There are many features that just feel right, like the link graph or canvases, and the new "Bases" feature.

Where Obsidian fell short for me

In practice, I found self-managed syncing frustrating.

  • I tried Synology Drive with my Obsidian vault stored on my NAS. While it worked, it wasn’t smooth: I had to install the app on every device, deal with sync jobs getting killed on my phone, or keep an annoying permanent notification running just to maintain sync.
  • Exporting to PDF only works per document and only from the Obsidian client. There is no CLI or API for it, only community plugins that also run in-app and only work as long as they are maintained.
    • Converting the raw markdown files to PDF works for basic content, but Obsidian-specific markdown syntax is not rendered.
  • Obsidian has no official API and is closed source. There are plugins that add a REST API or a socket for sending commands, but they are all unofficial plugins that were developed and then abandoned.

The whole setup felt clunky and distracted from actually using the tool. I might look into setting up LiveSync or Syncthing in the future.

Where Outline works better for me

With Outline, some things are more seamless:

  • Self-hosted on my VPS → I can access it from anywhere, just using a browser.
  • Cross-device → Switching between laptop and phone feels natural.
  • Collaboration → Real-time editing and live collaboration built in.
  • Open Source + API → I can write a script that interacts with the instance

Export to PDF

PDF export is only available in the business and enterprise editions of Outline, and even there only per document. The community edition gives you JSON, HTML and (Outline-flavoured) Markdown.

Under the hood, the PDF export uses Gotenberg, and (I guess) Outline POSTs the HTML there to have it converted to PDF. Gotenberg offers a demo instance, which I used to convert an example document.

curl \
--request POST https://demo.gotenberg.dev/forms/chromium/convert/html \
--header 'Content-Type: multipart/form-data' \
--form files=@index.html \
--form files=@attachments/example.png \
-o my.pdf

You can send the HTML (the file has to be named index.html) together with all attachments such as pictures. The attachments are not allowed to be in a subfolder though, so I would need to modify the HTML first.
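
Modifying the HTML for a flat upload could look roughly like the sketch below. It is not part of my exporter: `flattenAssetPaths` is a hypothetical helper, and the regex approach assumes simple, well-formed `src` and `href` attributes, so a real HTML parser would be more robust.

```javascript
const path = require("node:path");

// Hypothetical helper: rewrite references that point into a subfolder
// (e.g. "attachments/example.png") to the bare file name ("example.png").
function flattenAssetPaths(html) {
    return html.replace(
        /\b(src|href)=(["'])([^"']+)\2/g,
        (match, attr, quote, value) => {
            // Leave absolute URLs, data URIs and in-page anchors untouched.
            if (/^([a-z][a-z0-9+.-]*:|#|\/\/)/i.test(value)) return match;
            // Leave links to other exported documents untouched.
            if (value.toLowerCase().endsWith(".html")) return match;
            return `${attr}=${quote}${path.posix.basename(value)}${quote}`;
        }
    );
}

console.log(flattenAssetPaths('<img src="attachments/example.png">'));
```

Two attachments with the same file name in different folders would collide after flattening, so the names would have to be made unique first.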

The plan

Here is what I want to improve:

  • My exporter should be able to export whole collections or documents with all subdocuments, not just one single document
  • It should include all inline pictures and maybe also attached pdf files
  • Links to other documents should work (scroll to page in pdf)
  • Add Page Numbers

Exporting the collection

This can be done via the Outline API:

  • trigger collections.export with format html
  • poll the file operation via fileOperations.info
  • get the file url using fileOperations.redirect
  • clean up the file using fileOperations.delete

This gives us a zip file with all the html files and attachments
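
The trigger-and-poll part of these steps can be sketched as follows. This is a minimal sketch under assumptions: `apiRequest(method, body)` stands for a helper that POSTs to the instance's API and resolves with the parsed JSON, and the response fields (`data.fileOperation.id`, `data.state`) and the state names are what I expect from the API, so check them against your instance.

```javascript
// Starts an HTML export and waits until the file operation is finished.
// Resolves with the id of the file operation.
async function exportCollection(apiRequest, collectionId, { intervalMs = 2000, maxAttempts = 150 } = {}) {
    const started = await apiRequest("collections.export", {
        id: collectionId,
        format: "html",
    });
    // Assumed response shape: { data: { fileOperation: { id, state, ... } } }
    const operationId = started.data.fileOperation.id;

    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const info = await apiRequest("fileOperations.info", { id: operationId });
        const state = info.data.state; // assumed: "creating", "complete", "error", ...

        if (state === "complete") return operationId;
        if (state === "error" || state === "expired") {
            throw new Error(`Export ${operationId} ended in state "${state}"`);
        }
        // Wait a bit before asking again.
        await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Export ${operationId} did not finish in time`);
}
```

With the returned id, fileOperations.redirect leads to the zip file and fileOperations.delete removes it from the server after the download.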

So exporting the whole collection - check ✅
Include the inline images - check ✅

Get document infos

Documents can be sorted manually in Outline. To get this structure, I request the document tree via the API using collections.documents and add each document to a map, so I can find documents by their path relative to the collection root folder:

async function getDocumentsList(collection_id) {
    const documentStructure = await apiRequest('collections.documents', {
        id: collection_id
    });

    let documentsByPath = new Map();
    let order = 0;

    const addToMap = (item, parentPath) => {
        const itemPath = parentPath + "/" + item.title;
        item.order = order++;
        documentsByPath.set(itemPath, item);

        if (item.children && item.children.length) {
            item.children.forEach(child => {
                addToMap(child, itemPath)
            });
        }
    };

    documentStructure.data.forEach(item => {
        addToMap(item, "")
    });

    return documentsByPath;
}
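
How the map can then be used to pair the exported HTML files with their documents is sketched below. The helper name `matchFilesToDocuments` and the `relativePath` property are made up for this example, and it assumes that each exported file is named after the document title, so "/Parent/Child.html" belongs to the key "/Parent/Child". Titles with characters that are not allowed in file names would need extra handling.

```javascript
// htmlPaths: paths relative to the collection root, with a leading slash.
// documentsByPath: the Map returned by getDocumentsList().
function matchFilesToDocuments(htmlPaths, documentsByPath) {
    const files = [];

    for (const relativePath of htmlPaths) {
        // "/Parent/Child.html" -> "/Parent/Child"
        const key = relativePath.replace(/\.html$/i, "");
        const document = documentsByPath.get(key);

        if (!document) continue; // file name did not match any title
        files.push({ relativePath, document });
    }

    // Restore the order the documents have in Outline.
    return files.sort((a, b) => a.document.order - b.document.order);
}
```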

Converting to PDF

For this I use Puppeteer, which means a Chromium browser in the background does all the heavy lifting for me:

const browser = await puppeteer.launch();
const page = await browser.newPage();

for (const f in files) {
    const file = files[f];

    const pdfPath = path.join(tmp_pdf, "document_" + file.document.order + ".pdf");
    const pdfFolder = path.dirname(pdfPath);
    await mkdirp(pdfFolder);

    file.pdf_path = pdfPath;

    const absoluteHtmlPath = "file://" + file.path;
    await page.goto(absoluteHtmlPath, { waitUntil: "networkidle0" });

    await page.pdf({
        path: file.pdf_path,
        format: "A4",
        printBackground: true,
        tagged: true,
        displayHeaderFooter: true,
    });
}

await browser.close();

After that I merge the pages using pdf-lib:

const pdfDoc = await PDFDocument.create();
pdfDoc.setProducer("pax by CodingKiwi");
pdfDoc.setCreationDate(new Date());

for (const f in files) {
    const file = files[f];

    const fileData = await fs.readFile(file.pdf_path);
    const fileDoc = await PDFDocument.load(fileData);

    const indices = fileDoc.getPageIndices();
    const copiedPages = await pdfDoc.copyPages(fileDoc, indices);
    copiedPages.forEach((page) => {
        pdfDoc.addPage(page)
    });
}

I want the links to work. Currently, linking to another document results in a relative link like this:


<a href="./otherfile.html">Other File</a>
<a href="./../otherfile.html">Other File outside the current folder</a>

The problem with this approach is that each HTML page is rendered separately. Links to other files are "broken" because the other file is not part of the PDF, so Puppeteer renders the link as non-clickable text.

My first idea was merging all HTML documents into one giant HTML file. This felt a bit clunky because each HTML file has its own CSS style definitions, and some HTML files are full width while others are not (this depends on tables inside Outline, for example).

I also would need to move all attachments into the same folder as the merged html for the images to work since the html is no longer at the same location relative to the image sources.

After fiddling around for a bit I ended up with a different approach:

First, I modify the links so they survive the PDF phase:

for (const f in files) {
    const file = files[f];

    ...

    //replace links
    const links = await page.$$("a[href]");

    for (const link of links) {
        const href = await (await link.getProperty("href")).jsonValue();
        if (!href.startsWith(docLinkBase)) continue;

        const filePath = href.replace("file://", "");
        const file = files.find(f => f.path === filePath);

        if (!file) {
            logger.debug("Could not find file for %s", href);
            continue;
        }

        await link.evaluate((el, newHref) => {
            el.setAttribute("href", newHref);
        }, "http://replace-me.com/#" + file.document.id);
    }

    await page.pdf(...);
}
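
The lookup in this loop works because the browser resolves every relative href against the page URL, so the `href` property is already an absolute file:// URL. The same resolution can be reproduced in Node, which also shows one caveat: stripping "file://" by hand leaves percent-encoding such as %20 in place. The helper `resolveLinkedFile` below is a hypothetical alternative and assumes POSIX-style paths.

```javascript
const { fileURLToPath, pathToFileURL } = require("node:url");

// Resolve a relative href from an exported page to an absolute file path.
function resolveLinkedFile(href, currentHtmlPath) {
    const resolved = new URL(href, pathToFileURL(currentHtmlPath));
    resolved.hash = ""; // ignore in-page anchors like "#heading"
    return fileURLToPath(resolved); // also decodes %20 and friends
}

console.log(resolveLinkedFile("./../Other%20File.html", "/tmp/export/Folder/Doc.html"));
```

On a POSIX system this logs `/tmp/export/Other File.html`, which can be compared directly with `file.path`.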

I also keep track of the page on which each document starts in the merged PDF:

let pageCounter = 0;

for (const f in files) {
    const file = files[f];

    ...

    await page.pdf(...);

    file.page = pageCounter;
    file.page_count = await getPageCount(file.pdf_path);
    pageCounter += file.page_count;
}
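
The bookkeeping on its own, as a small worked example: `assignStartPages` is a hypothetical stand-alone version of the counter above and assumes that `page_count` has already been set on every file.

```javascript
// Gives every file the zero-based index of its first page in the merged PDF
// and returns the total number of pages.
function assignStartPages(files) {
    let pageCounter = 0;

    for (const file of files) {
        file.page = pageCounter;
        pageCounter += file.page_count;
    }

    return pageCounter;
}

const example = [{ page_count: 2 }, { page_count: 1 }, { page_count: 3 }];
const total = assignStartPages(example);
console.log(example.map(f => f.page), total);
```

Three documents with 2, 1 and 3 pages start at the indices 0, 2 and 3, for 6 pages in total. The index is zero-based, which is the same convention pdf-lib's `getPage(index)` uses.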

This way I can modify the links in the final pdf using dark arcane pdf magic.

Get the id from the link -> get the file from the id -> get the page number from the file -> modify the link from a URI link to a GoTo Link

const pages = pdfDoc.getPages();
const linkPrefix = "http://replace-me.com/#";

pages.forEach(page => {
    page.node.Annots()?.asArray().forEach((a) => {
        const dict = pdfDoc.context.lookupMaybe(a, PDFDict);
        // annotations without an action or without a URI are not our links
        const link = pdfDoc.context.lookupMaybe(dict?.get(asPDFName("A")), PDFDict);
        const uriEntry = link?.get(asPDFName("URI"));
        if (!uriEntry) return;

        const uri = uriEntry.toString().slice(1, -1); // get the original link, remove parentheses
        if (!uri.startsWith(linkPrefix)) return;

        const id = uri.slice(linkPrefix.length);
        const target = files.find(f => f.document.id === id);
        if (!target) return; // unknown document, leave the link as it is

        // turn the URI action into a GoTo action
        link.set(asPDFName("S"), asPDFName("GoTo"));
        link.delete(asPDFName("URI"));

        // destination: [target page, /Fit]
        const destination = PDFArray.withContext(pdfDoc.context);
        destination.push(pdfDoc.getPage(target.page).ref);
        destination.push(asPDFName("Fit"));
        link.set(asPDFName("D"), destination);
    });
});

Links to other documents - check ✅

Adding page numbers

For this I disable Puppeteer's own header and footer, leave a little margin at the bottom of each page, and draw the page number with pdf-lib while looping over the pages of the merged PDF:

await page.pdf({
   displayHeaderFooter: false,
   margin: {
      bottom: 50
   }
});

...


const { width } = page.getSize();

page.drawText(pageNr + " / " + pages.length, {
    x: width - 50,
    y: 20,
    size: 8
})
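
Written out as a complete loop, the drawing step might look like the sketch below. `drawPageNumbers` is a name made up for this example. It only relies on the `getSize()` and `drawText()` methods of pdf-lib pages and numbers the pages starting at 1.

```javascript
// Draws "current / total" into the bottom right corner of every page.
// `pages` is expected to be the array returned by pdfDoc.getPages().
function drawPageNumbers(pages) {
    pages.forEach((page, index) => {
        const { width } = page.getSize();

        page.drawText(`${index + 1} / ${pages.length}`, {
            x: width - 50, // 50 points from the right edge
            y: 20,         // inside the bottom margin left free by Puppeteer
            size: 8,
        });
    });
}
```

It would be called as `drawPageNumbers(pdfDoc.getPages())` after merging and before saving the document.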
            

Page Numbers - check ✅