Sometimes Python-related questions pop into my head, like how slow Django templates are or how hard it would be to inline Python function calls, and I usually end up spending a couple of hours trying to answer them. The question that conveniently popped into my head yesterday, as I was trying to avoid revising, was "How much code is there in the Python Package Index, and how hard is it to count?"
This seemed like a nice distraction from real work, so I set about trying to answer it. I will quickly run through how I made a program to answer this question, then present some statistics I gathered.
The Python Package Index (PyPI for short) is a central index of Python packages published by developers. At the time of writing there are 37,887 published packages, any of which can be downloaded and installed with a single "pip install [packagename]" command.
For each package in the PyPI repository we need to fetch its metadata, download and extract a suitable release, and count the lines of code it contains.
PyPI exposes an XML-RPC API that anyone can use. Retrieving a list of all registered package names is as simple as:
import xmlrpclib

client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
all_packages = client.list_packages()
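The snippet above targets Python 2. On Python 3 the same module lives at `xmlrpc.client`, so a rough equivalent might look like the sketch below. The `list_all_packages` helper, its optional `client` argument, and the HTTPS form of the URL are my own assumptions for illustration, and there is no guarantee the index still answers `list_packages` today.

```python
import xmlrpc.client


def list_all_packages(client=None, url="https://pypi.python.org/pypi"):
    """Return every registered package name, sorted for stable output.

    `client` is a hypothetical hook so a stand-in object can be passed
    in for testing; by default a real XML-RPC proxy is created.
    """
    if client is None:
        # Assumption: the endpoint still serves the XML-RPC interface.
        client = xmlrpc.client.ServerProxy(url)
    return sorted(client.list_packages())
```

Accepting a client as an argument keeps the network call out of the way when you only want to check the surrounding logic.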
PyPI also exposes a convenient JSON API to retrieve data about a package. You can access this by appending "/json" to the end of any package page. For example, https://pypi.python.org/pypi/Django/json retrieves a JSON object describing the Django package. This object contains metadata about the package as well as the latest download URLs.
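As a minimal sketch of that lookup, assuming Python 3 and only the standard library: `json_url` builds the address by appending "/json" to the package page, and `fetch_metadata` downloads and decodes it. Both function names and the `PYPI_BASE` constant are hypothetical, chosen just for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the example address in the text.
PYPI_BASE = "https://pypi.python.org/pypi"


def json_url(package_name):
    """Build the JSON API address for a package page."""
    return "{}/{}/json".format(PYPI_BASE, quote(package_name))


def fetch_metadata(package_name, timeout=30):
    """Download and decode the JSON document for one package."""
    with urlopen(json_url(package_name), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `json_url("Django")` gives back the Django address quoted above.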
You can then find a suitable release (source distribution preferred), download it to a temporary file and extract it. Once it's extracted, a program like CLOC can be run over the source tree to count the number of lines of code. You can find my attempt at this program here.
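This is not the program linked above, just a rough sketch of two of the steps. `pick_release` assumes the JSON document has a `urls` list whose entries carry a `packagetype` field (with `sdist` marking a source distribution), and `count_lines` assumes a `cloc` executable is on the PATH. Both function names are made up for illustration, and downloading and extracting are left out.

```python
import shutil
import subprocess


def pick_release(urls):
    """Prefer a source distribution; otherwise take the first file, or None."""
    for entry in urls:
        if entry.get("packagetype") == "sdist":
            return entry
    return urls[0] if urls else None


def count_lines(source_dir):
    """Run cloc over an extracted source tree and return its CSV report."""
    if shutil.which("cloc") is None:
        raise RuntimeError("cloc is not installed")
    result = subprocess.run(
        ["cloc", "--csv", "--quiet", source_dir],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Falling back to the first available file means a package with only a binary release still gets counted, while an empty list signals that there is nothing to download.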
I ran the above script, and after a couple of hours it had processed every Python package it could. I then did some ad-hoc analysis on the data.
The script managed to gather information about 36,940 packages. The script could not process the source code for 4,400 of those packages - this could be because no release was present, the download_url pointed to an HTML page rather than a package, or the archive was corrupt/unsafe. This leaves 32,540 packages.
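What counts as "unsafe" is not spelled out here, so the following is only one plausible reading: an archive whose member names would land outside the extraction directory. The `unsafe_names` helper is hypothetical and checks names only; it does not look at symlinks inside the archive.

```python
import os


def unsafe_names(member_names, dest="."):
    """Return the member names that would be written outside dest."""
    root = os.path.realpath(dest)
    bad = []
    for name in member_names:
        # Resolve "..", absolute paths and the like against the target dir.
        target = os.path.realpath(os.path.join(root, name))
        if target != root and not target.startswith(root + os.sep):
            bad.append(name)
    return bad
```

With the standard library you could feed it `tarfile.TarFile.getnames()` or `zipfile.ZipFile.namelist()` and skip any archive for which the result is not empty.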
Those 32,540 packages contained 7.4 GB of data and had a total monthly download count of 54,340,576. CLOC detected 127,635,341 lines of source code across 807,993 files, and of those 72,631,329 lines were Python across 484,788 files. The average package weighs in at 239 KB, contains 2,232 lines of Python code and has been downloaded 1,669 times in the last month.
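The averages follow directly from the totals. Here is the arithmetic as a small sketch, using only the figures quoted above; the variable names are mine, and the size comes out slightly under 239 KB because 7.4 GB is itself a rounded number.

```python
PACKAGES = 32540
PYTHON_LINES = 72631329
MONTHLY_DOWNLOADS = 54340576
TOTAL_KB = 7.4 * 1024 * 1024  # 7.4 GB expressed in KB (rounded input)

# Integer division truncates, which matches the figures in the text.
avg_python_lines = PYTHON_LINES // PACKAGES
avg_downloads = MONTHLY_DOWNLOADS // PACKAGES
avg_size_kb = TOTAL_KB / PACKAGES

print(avg_python_lines)  # 2232
print(avg_downloads)     # 1669
print(round(avg_size_kb, 1))
```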
It should be noted that CLOC is not perfect at detecting languages. I highly doubt there is much Pascal code on PyPI, but CLOC may have counted it due to files having a .p extension. It's good for a rough estimate though.
Source distribution is by far the most common Python package format with 32,490 packages. The Wheel format is starting to appear but still has a long way to go with only 318 releases.
GitHub is by far the most popular homepage for packages with over 16,000 references. BitBucket is beating Google Code with double the number of packages, and SourceForge is quite rightfully languishing near the bottom.
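One way such a tally could be produced, as a sketch rather than the method actually used here: take each package's homepage URL, reduce it to its host name, and count. `homepage_hosts` is a hypothetical helper, and it treats `www.github.com` and `github.com` as the same host but makes no attempt to merge other subdomains.

```python
from collections import Counter
from urllib.parse import urlparse


def homepage_hosts(home_pages):
    """Count homepage URLs by host, ignoring empty or missing values."""
    counts = Counter()
    for url in home_pages:
        host = urlparse(url or "").netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            counts[host] += 1
    return counts
```

`counts.most_common(10)` then gives the ten most referenced hosts in descending order.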
This graph plots the percentage of comments across all packages. 1,101 packages contained 50% or more comments, and 14,199 contained less than 15%.
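The exact definition of "percentage of comments" is not given, so the sketch below assumes one reasonable choice: comment lines divided by comment plus code lines, with blank lines ignored. CLOC reports blank, comment and code counts separately, which makes this easy to compute; the function name is hypothetical.

```python
def comment_percentage(comment_lines, code_lines):
    """Share of non-blank lines that are comments, as a percentage."""
    total = comment_lines + code_lines
    if total == 0:
        # An empty package has no meaningful ratio; report zero.
        return 0.0
    return 100.0 * comment_lines / total
```

Under this definition a package with as many comment lines as code lines sits exactly at the 50% mark.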