Syntax highlighting and CSS support added to wordinserter

I recently added syntax highlighting and support for CSS stylesheets to wordinserter, and the implementation was satisfying enough that I thought I would blog about it.

Wordinserter is a library I maintain that lets you insert HTML documents/snippets into Word documents: It’s primary use case is when you have a WYSIWYG editor in a users browser that outputs HTML and you want to put that HTML into some kind of word document. I guess you could say it inserts things… into word. Doing this with wordinserter is as simple as:

from wordinserter import parse, insert

html = "<h1>Hello there!</h1>"\
       "<p><strong>I'm strong!</strong></p>"
operations = parse(html, parser="html")
insert(operations, document=document, constants=constants)

You can read some more examples and documentation on the Github project here: https://github.com/orf/wordinserter, and you can see comparison images between how Firefox and wordinserter renders particular HTML snippets here: https://rawgit.com/orf/wordinserter/master/Tests/report.html.

Anyway, back to the topic at hand. One of the features our WYSIWYG editor supports is syntax highlighting, and I’ve always wanted to add proper support for this in wordinserter. In HTML code is usually represented by a pre or code tag like so:

<pre>
def test():
    pass

import urllib
urllib.urlopen("https://google.com")
</pre>

The pre/code tag has some unusual properties such as respecting all whitespace included within it, but other than that it’s just a normal tag. Websites (like this one) use various JS libraries or server-side processing to highlight the contents of these tags to make them more visually appealing which usually boils down to sticking a bunch of span tags with CSS classes/inline styles in the right places to highlight the code. For example the snippet below is the highlighted HTML contents of the snippet above:

def
<span class="hljs-function"
  ><span class="hljs-title">test</span><span class="hljs-params">()</span></span
>: pass import urllib urllib.<span class="hljs-function"
  ><span class="hljs-title">urlopen</span
  ><span class="hljs-params"
    >(<span class="hljs-string">"https://google.com"</span>)</span
  ></span
>

So how does wordinserter highlight code?

You can tell wordinserter to insert highlighted code in two ways. The first, and the simplest, is to simply send a <pre> tag with a language attribute like so:

<pre language="python">
def test():
    pass

import urllib
urllib.urlopen("https://google.com")
</pre>

This uses the awesome pygments library under the hood and will highlight the code using that, using magic.

This worked for a while, then some clever chap walked up to me and said “Hey, the syntax highlighting works for this code in the WYSIWYG editor but it doesn’t display correctly in the document”. I looked into it and the problem was a mismatch between using pygments to highlight the code in the document and hljs to do it on the frontend. So the only natural way forward was to unify them both to use a single style, and so I added support for CSS files to wordinserter.

You what?

Yeah. That’s what I thought to myself when I first had the idea. We have a CSS file that styles stuff on the frontend, and we want the highlighted code to be the same on the generated document. The only way forward that I could see would be to send the CSS file along with the hljs highlighted code (the one with all the spans) to wordinserter. It would then see a span tag with hljs-functions, look at the CSS file and see the appropriate style and then apply it. You can use it like so:

from wordinserter import parse, insert

html = "<h1>Hello there!</h1>"\
       "<p><strong>I'm strong!</strong></p>"
operations = parse(html, parser="html", stylesheets=["h1 { color: red; }"])
insert(operations, document=document, constants=constants)

The implementation was actually really simple: https://github.com/orf/wordinserter/blob/master/wordinserter/parsers/html.py#L55-L73

All it does is parse the CSS file using the awesome cssutils library then crudely run through each rule, find all elements that match that rule and copy the CSS rules as inline-styles. Not amazing but it gets the job done and required minimal modifications to any other part of the library. I had to make some big changes later when I figured out that the inheritance of these styles was wonky (parent styles overrode the child styles), but that’s fixed now so it’s all groovy.

In both cases the finished document will look like this: