2 years of blogging

When I first came to University lots of people (like Rob Miles) were trying to get undergraduates to start blogging. On the 6th of March 2012 I registered this domain and started blogging, getting myself added to the awesome Hull Compsci blogs syndicate. That was two years ago and a lot has changed so I thought I would write a summary post.

Open source stuff

My first blog was a WordPress one, but I felt that WordPress was trying to be too much and I just wanted something a bit more lightweight. So, inspired by svbtle I created and released Simple which powers this blog. At least a few other people use it for their blogs as well, which is kind of cool despite me not giving the project much attention recently.

I primarily use GitHub to share all my open source code. Over the last year I've made 379 contributions to both my own repositories and others that I use, and somehow gathered 48 followers. My most popular repository is Simple which has 481 stars and 96(!) forks, followed by my Django Debug toolbar panel with 191 stars.

I've put a few projects on the Python Package Index as well, and a couple seem to be used a fair bit: the Debug toolbar panel had 3690 downloads in the last month and my HtmlToWord library had 3812 (which I find weird because, unlike the other package, I can't locate a single project other than mine that actually uses it).

In the last two years this blog has had 67,184 visitors with a peak of over 1,000 concurrent viewers, which is pretty neat. Simple handled every request speedily with no errors.

The most popular posts are:

  1. Purchasing a £30,000 numberplate for the price of a bus ticket
  2. Breaking out of secured Python environments
  3. Just how slow are Django templates
  4. Opera is really nice
  5. Automatically inline Python function calls
  6. Using Python metaclasses to make awesome Django model field choices
  7. Adding tail call optimization to Python

According to Google Webmaster Tools some of my popular posts have quite high Google rankings for some queries.

Viewer information

I use Google Analytics for my website stats and it also tracks things like browser usage, operating system and screen resolution.

Security stuff

I found 4 "grave" security issues in some software that got issued their own CVE numbers (which is awesome!): osvdb.org/creditees/11331-tom-forbes. Slightly before I started this blog I also presented a research paper at the BlackHat security conference, which was the scariest hour of my life. I haven't really blogged about it but I intend to in the coming weeks.

Conclusion

I'm pretty happy with the last two years and I'm both excited for the future and a bit sad that my time at Hull is coming to a close.

Opera is really nice

I really like the Opera browser.

A couple of months ago I got a bit tired of using Google Chrome: it was just a bit sluggish sometimes and I fancied a change. So I switched to Firefox, which I enjoyed using for a month or so until it too became irksome - it used a hell of a lot of memory and was also often very sluggish (more so than Chrome was). In desperation I decided to give Opera a spin and to my surprise I really liked it.

Opera had always been something I was aware existed but had never looked into. To be honest I didn't really consider it to be a real browser (how could it be with less than 4% market share?) and when one of my business partners told me he used it I felt a feeling of contempt: this guy hasn't seen the light that is Chrome and is stuck using some dated crummy browser. What a chump.

How wrong I was. Opera is fast, clean, responsive and a pleasure to use. Since 2013 Opera has used the Blink layout engine, which is the same one that powers Chrome, and so it inherits some of Chrome's features like a multi-process model, its dev tools, sandboxing and even compatibility with its extensions. Talking of extensions, I was pleasantly surprised at the number available - over 600 are listed on the add-on site, including all of the popular ones (Adblock, LastPass, Reddit enhancement, Ghostery, etc).

It's not all roses though. I installed the Opera Next edition, which is slightly more bleeding edge than the standard Opera release. The only real issue I have had so far is when I have a large number of tabs open that use Flash: that tends to kill the browser, but it promptly restarts and re-opens all my tabs. Oh, and dragging a tab from one window to another is a little bit tricky: I think it has latched onto the new window, but when I release it a new window opens (you just have to hold it there a bit longer than in Firefox or Chrome). Neither of those is particularly critical and I can live with them.

If you are a bit tired of Chrome and want to try something different then I highly recommend Opera. Go give it a try.

Submitting a patch to Python’s lxml library

While working on a system for work I ran into a bug with Python’s lxml library and decided to fix it. I thought I would document how easy the process was, hopefully to encourage others to contribute to open source projects.

Lxml is a “pythonic binding for the libxml2 and libxslt libraries” which put plainly means it’s a Python library that makes calling functions from those native C libraries easy. That is nice because you get all the features and speed of those two very mature libraries wrapped up in a lovely Python API. It’s also hugely popular, with nearly 600,000 downloads in the last month alone from the Python central package server.

One of the features it provides is the ability to diff two portions of HTML together. For example:

from lxml.html.diff import htmldiff
html_1 = "<p>some text here</p>"
html_2 = "<p>some more text</p>"
print(htmldiff(html_1, html_2))
# <p>some <ins>more</ins> text <del>here</del></p>

Any text removed is wrapped in a <del> tag, and any text inserted is wrapped in a <ins> tag. The bug in question is that htmldiff ignored whitespace in HTML input, meaning things like newlines were lost while diffing:

html_1 = "<pre>some text\n new line\n more lines</pre>"
html_2 = "<pre>some text\n new line</pre>"
print(htmldiff(html_1, html_2))
# <pre>some text new line <del>more lines</del></pre>

For the most part whitespace in HTML can be ignored, but in the case of a <pre> tag it cannot – this tag is used for displaying preformatted text which means whitespace is respected and needs to be preserved.

The first step to fixing this was to take a look at the source. Lxml’s source code is all available here on GitHub so I just forked the whole repository and started fiddling (forking means creating a copy of the repository that you can edit). I tracked down the problem quickly enough: when tokenising the input if there was any whitespace it simply set a boolean flag which was used to add a single space after it when outputting the tokens. This made the fix very simple: rather than store a boolean indicating if there was any trailing whitespace I could just use that field to store the actual trailing whitespace, which I did in this commit: https://github.com/lxml/lxml/commit/44e697fa9a8b580326bfaf6ffffeda3220c6c733#diff-c70277a0d76841a33e740a43edf14d99. After writing a few tests and making a few more commits I had fixed the problem and so I was ready to submit the changes to lxml.
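To make the change concrete, here is a minimal sketch of the idea. It is not lxml's actual code: `tokenize_flag`, `tokenize_keep` and the two join helpers are hypothetical names, and the real tokeniser also has to deal with tags. The first pair only records whether whitespace followed a word, the second pair stores the whitespace itself.

```python
import re

# Hypothetical, simplified tokenisers: each token is a word plus what followed it.
WORD_RE = re.compile(r"(\S+)(\s*)")


def tokenize_flag(text):
    """Old behaviour: remember only *whether* whitespace followed a word."""
    return [(word, bool(space)) for word, space in WORD_RE.findall(text)]


def tokenize_keep(text):
    """New behaviour: remember the exact whitespace that followed a word."""
    return WORD_RE.findall(text)


def join_flag(tokens):
    # Any run of trailing whitespace collapses to a single space.
    return "".join(word + (" " if has_space else "") for word, has_space in tokens)


def join_keep(tokens):
    # The original trailing whitespace is written back unchanged.
    return "".join(word + space for word, space in tokens)


text = "some text\n new line\n more lines"
print(repr(join_flag(tokenize_flag(text))))  # 'some text new line more lines'
print(repr(join_keep(tokenize_keep(text))))  # 'some text\n new line\n more lines'
```

With the flag version the newlines are gone by the time the tokens are written out, which is the behaviour that broke <pre> blocks.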

What’s the best way to do this? When using GitHub the best way is to make a pull request – this takes any changes you have made to your fork of the code and makes a nice page listing them. Here is the pull request I made: https://github.com/lxml/lxml/pull/125. One of the project leaders suggested a small change, which I implemented, and the pull request was accepted – the full list of changes made can be found here.

The bug was marked as fixed as of version 3.2.4 and overall it was a pretty painless experience to go from finding the bug to submitting the pull request. GitHub definitely streamlines the whole process with their pull request functionality, otherwise I might have had to make a patch file and submit it somewhere manually. That’s not impossible, but it’s certainly not as simple as clicking a button.

How much code is there in the Python Package Index?

Sometimes Python-related questions pop into my head, like how slow are Django templates or how hard would it be to inline Python function calls, and I usually end up spending a couple of hours trying to answer them. The question that conveniently popped into my head yesterday as I was trying to avoid revising was "How much code is there in the Python Package Index, and how hard is it to count?"

This seemed like a nice distraction from real work, so I set about trying to answer it. I will quickly run through how I made a program to answer this question, then present some statistics I gathered.

Basic outline

The Python Package Index (PyPi for short) is a central index of Python packages published by developers. At the time of writing there are 37,887 published packages, any of which can be downloaded and installed with a single "pip install [packagename]" command.

For each package in the PyPi repository we need to fetch its metadata, download a suitable release, extract it and count the lines of code it contains.

PyPi exposes an XML-RPC API that anyone can use. Retrieving a list of all registered package names is as simple as:

try:
    import xmlrpclib  # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib  # Python 3
client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
all_packages = client.list_packages()

PyPi also exposes a convenient JSON API to retrieve data about a package. You can access this by simply appending "/json" onto the end of any package page, for example https://pypi.python.org/pypi/Django/json retrieves a JSON object describing the Django package. This object contains metadata about the package as well as the latest download URLs.
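As a rough illustration of working with that JSON object, the sketch below picks a download URL from already-parsed metadata, preferring a source distribution. It assumes the object has a "urls" list whose entries carry "packagetype" and "url" keys; `pick_release` is a hypothetical helper name and the sample data is made up rather than a real response.

```python
import json


def pick_release(metadata):
    """Return a download URL from package metadata, preferring a source distribution."""
    files = metadata.get("urls", [])
    for entry in files:
        if entry.get("packagetype") == "sdist":
            return entry["url"]
    # No source distribution: fall back to the first file, or None if there are none.
    return files[0]["url"] if files else None


# Made-up sample data standing in for a real response.
sample = json.loads("""
{
  "info": {"name": "example"},
  "urls": [
    {"packagetype": "bdist_wheel",
     "url": "https://example.invalid/example-1.0-py2.py3-none-any.whl"},
    {"packagetype": "sdist",
     "url": "https://example.invalid/example-1.0.tar.gz"}
  ]
}
""")

print(pick_release(sample))  # https://example.invalid/example-1.0.tar.gz
```

Returning None for packages with no files keeps the "no release was present" case explicit, so the caller can skip the package instead of crashing.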

You can then find a suitable release (source distribution preferred), download it to a temporary file and extract it. Once it's extracted a program like CLOC can be run over the source tree to count the number of lines of code. You can find my attempt at this program here.
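The counting step can be approximated without CLOC. The sketch below is a much cruder stand-in, not the program linked above: instead of unpacking to disk and running CLOC, it reads the .py files straight out of a .tar.gz archive and counts non-blank lines. `count_python_lines` and `make_example_archive` are hypothetical names, and the archive is built in memory so the example needs no network access.

```python
import io
import tarfile


def count_python_lines(archive):
    """Count non-blank lines in the .py files of a .tar.gz archive.

    `archive` is a binary file object. Unlike CLOC this does not separate
    comments from code or recognise any other language.
    """
    total = 0
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            # Only regular files are read, and nothing is written to disk.
            if not (member.isfile() and member.name.endswith(".py")):
                continue
            for line in tar.extractfile(member):
                if line.strip():
                    total += 1
    return total


def make_example_archive():
    """Build a tiny in-memory source distribution to try the counter on."""
    files = {
        "example-1.0/example.py": b"import os\n\n\ndef main():\n    print(os.getcwd())\n",
        "example-1.0/README.txt": b"not python\n",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


print(count_python_lines(make_example_archive()))  # 3
```

Reading members in place sidesteps the problem of archives that try to write outside the extraction directory, at the cost of only handling one archive format.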

I ran the above script and after a couple of hours it had parsed every Python package it could, then I did some ad-hoc analysis on the data.

Some statistics

The script managed to gather information about 36,940 packages. The script could not process the source code for 4,400 of those packages - this could be because no release was present, the download_url pointed to an HTML page rather than a package, or the archive was corrupt/unsafe. This leaves 32,540 packages.

Those 32,540 packages contained 7.4GB of data and had a total monthly download count of 54,340,576. CLOC detected 127,635,341 lines of source code across 807,993 files, and of those 72,631,329 lines were Python across 484,788 files. The average package weighs in at 239kb, contains 2,232 lines of Python code and has been downloaded 1,669 times in the last month.
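For anyone wanting to check the averages, they follow directly from the totals quoted above. This is just the arithmetic redone in a few lines, with variable names of my own choosing:

```python
packages_seen = 36940
packages_failed = 4400
packages = packages_seen - packages_failed

python_lines = 72631329
monthly_downloads = 54340576

# Integer division truncates, which matches the whole-number averages quoted above.
print(packages)                       # 32540
print(python_lines // packages)       # 2232
print(monthly_downloads // packages)  # 1669
```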

Packages can contain more than just Python files. The following graph is a breakdown of the most common languages detected in PyPi packages:

It should be noted that CLOC is not perfect at detecting languages. I highly doubt there is much Pascal code on PyPi, but CLOC may have counted it due to files having a .p extension. It's good for a rough estimate though.

Source distribution is by far the most common Python package format with 32,490 packages. The Wheel format is starting to appear but still has a long way to go with only 318 releases.

GitHub is by far the most popular homepage for packages with over 16,000 references. BitBucket is beating Google Code with double the number of packages and SourceForge is quite rightfully languishing at the near bottom.

This graph plots the percentage of comments across all packages. 1,101 packages contained 50% or more comments, and 14,199 contained less than 15%.

University Presentation

So I did a presentation on Information Security at University today. I think it went rather well, however I couldn't show a couple of the demonstrations due to some SkyDrive files only being available online. That sucked because those were my best demonstrations, but overall I was happy.

A few people asked me to put the website code up, so you can find it here: https://github.com/orf/vulnerable_website (or just view the code online here)

If you haven't used Python before then I really recommend it. To get the website up and running follow these steps:

  • Download Python 2.7.6 here and install
  • Download pip (a Python package installer) from here
  • Extract, and run "python setup.py install" from within the directory. If you get an error complaining that "python" doesn't exist then you need to add C:\python27 to your system path. Give it a google for detailed instructions.
  • Once that's finished just run "pip install flask"
  • Go grab the vulnerable website code and then run "python vulnerable_website.py" and you are ready to roll.