I scanned every package on PyPI and found 57 live AWS keys
After inadvertently finding that InfoSys leaked an AWS key on PyPI, I wanted to know
how many other live AWS keys might be present on the Python Package Index. After scanning every release published to PyPI,
I found 57 valid access keys from organisations like:
Detecting AWS keys is actually fairly simple. A keypair consists of two components: the key ID and the key secret.
The key ID can be detected with the regular expression ((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16})), and the secret key
can be detected with a much more general [a-zA-Z0-9+/]{40}. We can tie these together to find "key IDs close to secret keys, surrounded by quotes"
with this monster regex (regex101):
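The same two-stage idea can be sketched in plain Python — note that the 100-character proximity window below is my own assumption, not the exact distance the combined regex allows:

```python
import re

# Key IDs: a known 4-letter prefix followed by 16 base32-ish characters.
KEY_ID = re.compile(r"(?:ASIA|AKIA|AROA|AIDA)[A-Z0-7]{16}")
# Secrets: 40 characters from the base64 alphabet. Very general, so we
# only trust matches that appear near a key ID.
SECRET = re.compile(r"[a-zA-Z0-9+/]{40}")

def find_candidate_keypairs(text: str):
    """Yield (key_id, secret) pairs where both matched close together."""
    for id_match in KEY_ID.finditer(text):
        # Only look for secret-shaped strings in a window around the key ID.
        window = text[max(0, id_match.start() - 100):id_match.end() + 100]
        for secret_match in SECRET.finditer(window):
            yield id_match.group(0), secret_match.group(0)
```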
The devil is in the details though - ripgrep's -z flag doesn't support searching zip files, so we need to use a custom shell
script to handle this, and there are some nuances around extracting the matches (using the JSON output) etc.
Once these hurdles are overcome, we can combine this with a static dump of PyPI data to run the whole
pipeline in parallel like so:
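As a rough Python stand-in for that pipeline — assuming the dump is just a file of release-archive URLs, one per line, and skipping the zip-extraction step entirely — the fan-out looks something like this (find_candidate_keypairs is the function from the earlier sketch):

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def scan_url(url: str):
    # Download one release file and scan it as text. Real archives would
    # need unzipping first; this sketch glosses over that.
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="ignore")
    return url, list(find_candidate_keypairs(text))

def run_pipeline(dump_path: str, workers: int = 16):
    with open(dump_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    # Scan many releases concurrently, printing any candidate keypairs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, hits in pool.map(scan_url, urls):
            if hits:
                print(url, hits)
```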
This tool runs periodically via GitHub Actions and scans new releases from PyPI, HexPM and RubyGems for AWS keys. If
any keys are found then a report is generated and committed to the repository.
This report contains the keys that have been found, as well as a public link to the keys and other metadata about the
release. Because these keys are committed to a public GitHub repository, GitHub's Secret Scanning
service kicks in and notifies AWS that the keys have been leaked.
This causes AWS to open a support ticket with you to notify you of the leak, and to apply a quarantine policy to the key
that prevents some high-risk actions from being taken (e.g. starting EC2 instances, adjusting permissions, etc.).
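From the account owner's side, that quarantine shows up as a managed policy attached to the affected IAM user. A minimal boto3 sketch for spotting it, assuming AWS's AWSCompromisedKeyQuarantine policy naming:

```python
import boto3

def is_quarantined(user_name: str) -> bool:
    # List the managed policies attached to the user and look for AWS's
    # quarantine policy (e.g. AWSCompromisedKeyQuarantineV2).
    iam = boto3.client("iam")
    attached = iam.list_attached_user_policies(UserName=user_name)
    return any(
        "AWSCompromisedKeyQuarantine" in p["PolicyName"]
        for p in attached["AttachedPolicies"]
    )
```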
Running a Rust tool every half an hour via GitHub Actions isn't exactly a web-scale solution to the problem, but it's
an effective "post-proof-of-concept" that is pretty cost effective ($0!). Rather than use a database to keep track of
packages that have been processed, the GitHub Actions workflow commits a JSON file
keeping a kind of cursor to the last package it has processed. Subsequent invocations fetch the next set of packages
after this cursor.
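In outline, the pattern looks something like this — the file name and JSON shape here are assumptions, not the repo's actual schema:

```python
import json
from pathlib import Path

CURSOR_FILE = Path("cursor.json")

def load_cursor() -> int:
    # The cursor file lives in the repo itself, so each run starts from
    # whatever the previous run committed.
    if CURSOR_FILE.exists():
        return json.loads(CURSOR_FILE.read_text())["last_serial"]
    return 0

def save_cursor(serial: int) -> None:
    CURSOR_FILE.write_text(json.dumps({"last_serial": serial}))

def run_once(fetch_releases_since, process):
    cursor = load_cursor()
    for serial, release in fetch_releases_since(cursor):
        process(release)
        cursor = max(cursor, serial)
    save_cursor(cursor)  # committed back to the repo by the workflow
```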
I quite like this pattern - it's not perfect and does not handle individual package failures well, but it works quite
nicely for periodic tasks like this.
Authors of Python libraries can push different platform-specific files for each version they release.
TensorFlow alone has 16 different downloads per release, for different Windows, Linux and macOS
versions. While usually these downloads all contain exactly the same code, differences are possible.
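You can see this for yourself via PyPI's JSON API, which lists every downloadable file for a given release — the package and version below are just examples:

```python
import json
import urllib.request

def release_files(package: str, version: str) -> list[str]:
    # PyPI's per-release JSON endpoint; "urls" holds one entry per
    # downloadable file (wheels, sdists, etc.).
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [f["filename"] for f in data["urls"]]

# e.g. release_files("tensorflow", "2.11.0") returns one wheel per platform
```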
One example comes from a package version published by Teradata.
The file containing the key, presto-local/debug, does not exist in the source version
and only exists in the platform-specific .egg releases.
It seems that while publishing a release, some files were accidentally bundled in before publishing. This implies the release
process was not fully automated.
There are some packages that attempt to use long-lived IAM keys for "legitimate" uses. I put legitimate in quotes here,
as there are much better ways to allow public access to specific AWS resources that don't involve shared IAM keys.
The majority of these "legitimate" usages involve uploading temporary files to S3.
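In the abstract, the pattern looks like this — an illustrative sketch, not any specific package's code:

```python
import boto3

# Credentials baked directly into the source. The values below are
# placeholder-shaped, not real keys.
S3_KEY_ID = "AKIAXXXXXXXXXXXXXXXX"
S3_SECRET = "x" * 40

def upload_temp_file(path: str, bucket: str, key: str) -> None:
    # Anyone who installs the package can read these credentials and use
    # them for anything the IAM user is allowed to do.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=S3_KEY_ID,
        aws_secret_access_key=S3_SECRET,
    )
    s3.upload_file(path, bucket, key)
```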
Another example in the same vein comes from mootoo.
QuantPanda is a funny example of this. In versions 0.0.22 to 0.0.25 the AWS key was hardcoded.
However, after AWS flagged their key, they switched from hard-coding the key to fetching it from a Gist:
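Reconstructed illustratively (the real Gist URL and code are not reproduced here), the workaround amounts to:

```python
import urllib.request

# Hypothetical placeholder URL: the key lives in a public Gist, so it is
# every bit as leaked as it was when hardcoded.
GIST_URL = "https://gist.githubusercontent.com/<user>/<id>/raw/creds.txt"

def fetch_key() -> tuple[str, str]:
    with urllib.request.urlopen(GIST_URL) as resp:
        key_id, secret = resp.read().decode().splitlines()[:2]
    return key_id, secret
```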
I found a lot more keys than I was expecting. Why? I think there's a confluence of different reasons that have led to the
large number of keys published to PyPI:
1. Testing against AWS is hard. It is often simpler to just test against AWS itself than to set up moto or localstack (see the moto sketch after this list).
2. It's easy to accidentally commit files when publishing releases. Python tooling could add an "are you sure" prompt
if anything you're publishing has not been fully committed to git.
3. Python allows multiple individual downloads for a given release. Combined with point 2, this makes it easy to publish
a "correct" release, make some changes and publish a sneakily incorrect one without realising.
4. Python is used heavily for data science and ML. A lot of the packages I found were in this domain, and perhaps
best practices around AWS key management are not clear to those practitioners?
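For reference, the moto setup that point 1 refers to is genuinely not much code — a minimal sketch, assuming moto 5.x's mock_aws decorator:

```python
import boto3
from moto import mock_aws  # moto >= 5.0; older versions used mock_s3

@mock_aws
def test_upload():
    # All boto3 calls inside the decorator hit moto's in-memory fake,
    # so no real credentials (or real buckets) are needed.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hi")
    body = s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read()
    assert body == b"hi"
```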