Scraping

How To

Step Instructions

This module provides a simplified interface for parsing XML trees with a set of predefined steps. The steps are supplied as a list of dicts with the keys ‘type’ and ‘instruction’; the second may be omitted for steps that take no further instructions.

The following steps are possible:

  • Steps that work for both single elements and lists:
    xpath
    follows the xpath down the tree, returns the first element (node -> node or string)
    prefix
    adds a prefix to the string (string -> string)
    suffix
    appends a suffix to the string (string -> string)
    rmprefix
    removes a prefix if present (string -> string)
    regex
    replaces the string matched by the supplied regex with its first capture group (string -> string)
    last
    splits the string and returns the last element (string -> string)
  • Steps that work for single elements and return a single element:
    follow
    follows the specified link and returns the root node of the resulting document (string -> node)
  • Steps that work for single elements and split them into a list:
    split
    splits the string (string -> stringlist)
    makelist
    turns an element into a list consisting of that element (string -> stringlist, node -> nodelist)
    xpathls
    follows the xpath down the tree, returns all elements (node -> nodelist or stringlist)
  • Steps that work for lists and merge them back into a single element:
    pick
    picks the n-th element from the list (nodelist -> node, stringlist -> string)
    combine
    combines all strings of the list (stringlist -> string)
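
To make the type signatures above more concrete, here is a minimal sketch of how a few of the string -> string steps could behave in plain Python. It is not the module's implementation: the helpers apply_string_step and apply_string_steps are hypothetical, and their behaviour is inferred only from the descriptions above.

```python
import re


def apply_string_step(value, step):
    """Hypothetical helper: apply one string -> string step to a single string."""
    kind = step["type"]
    arg = step.get("instruction")  # may be absent for steps without instructions
    if kind == "prefix":
        return arg + value
    if kind == "suffix":
        return value + arg
    if kind == "rmprefix":
        # Only strip the prefix if the string actually starts with it.
        return value[len(arg):] if value.startswith(arg) else value
    if kind == "regex":
        # Replace the matched text with its first capture group.
        return re.sub(arg, lambda m: m.group(1), value)
    raise ValueError(f"unsupported step type: {kind}")


def apply_string_steps(value, steps):
    """Hypothetical helper: run the steps in order, feeding each result forward."""
    for step in steps:
        value = apply_string_step(value, step)
    return value


result = apply_string_steps(
    "place_box_rivendell",
    [
        {"type": "rmprefix", "instruction": "place_box_"},
        {"type": "prefix", "instruction": "/places/"},
        {"type": "suffix", "instruction": ".html"},
    ],
)
print(result)  # /places/rivendell.html
```

Each step receives the output of the previous one, which is why the order of the list matters.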

Scraping feeds

parse_all() scrapes any well-structured feed of regular elements. Since its arguments may be confusing, let’s look at a simple example. Say we want to scrape all locations from a website that shows 3 entries per page and whose URLs look like this:

        https://bestgallery.tld/newest?start=0
        https://bestgallery.tld/newest?start=3
        https://bestgallery.tld/newest?start=6

We would then supply base_url="https://bestgallery.tld/newest?start={page}", start_page=0 and page_multiplier=3 (since page 0 needs a 0, page 1 needs a 3, and so on).

If the site uses a more unusual URL logic, we can instead supply a function that takes the logical page number (0, 1, 2, …) as input and returns the string that should be inserted into the URL.
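
The URL arithmetic described above can be sketched as follows. This only illustrates the idea and is not parse_all() itself: the helper page_url is hypothetical, and passing the function through the same page_multiplier argument is an assumption, as is the second example URL.

```python
def page_url(base_url, logical_page, page_multiplier=1):
    """Hypothetical helper: build the URL for a logical page number (0, 1, 2, ...).

    page_multiplier is either an int or a function that maps the logical page
    number to the value inserted into the URL (an assumption about how the two
    variants are passed).
    """
    if callable(page_multiplier):
        page = page_multiplier(logical_page)
    else:
        page = logical_page * page_multiplier
    return base_url.format(page=page)


base_url = "https://bestgallery.tld/newest?start={page}"
urls = [page_url(base_url, n, page_multiplier=3) for n in range(3)]
print(urls)

# A made-up site with unusual URL logic: 1-based pages padded to three digits.
padded = page_url("https://bestgallery.tld/page-{page}", 0, lambda n: f"{n + 1:03d}")
print(padded)
```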

Now let’s have a look at the relevant part of our webpage:

<body>
        <div id="cards_area">
                <div class="place_box" id="place_box_rivendell">
                        <div style="background-image('/rivendell.png');"></div>
                        <h3 class="place_name">Rivendell</h3>
                        <span class="place_leader">Leader: Elrond</span>
                </div>
                <div class="place_box" id="place_box_gondolin">
                        <div style="background-image('/tumladen_vale.jpg');"></div>
                        <h3 class="place_name">Gondolin</h3>
                        <span class="place_leader">Leader: Turgon</span>
                </div>
                <div class="place_box" id="place_box_holymountain">
                        <div style="background-image('/oiolosse.png');"></div>
                        <h3 class="place_name">Taniquetil</h3>
                        <span class="place_leader">Leader: Manwë</span>
                </div>
        </div>
</body>

As steps_elements, we supply the steps that produce the list of elements, which is simple enough:

[
        {"type":"xpath","instruction":"//div[@id='cards_area']//div[@class='place_box']"}
]
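
To see which nodes such a step selects, the following sketch runs an equivalent query with Python's standard-library xml.etree.ElementTree against a shortened copy of the markup above. It is an illustration only: the module may rely on a different XML library, and ElementTree only accepts relative paths, so the expression starts with "." here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example markup (only the ids and names are kept).
html = """
<body>
  <div id="cards_area">
    <div class="place_box" id="place_box_rivendell">
      <h3 class="place_name">Rivendell</h3>
    </div>
    <div class="place_box" id="place_box_gondolin">
      <h3 class="place_name">Gondolin</h3>
    </div>
    <div class="place_box" id="place_box_holymountain">
      <h3 class="place_name">Taniquetil</h3>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(html)
# ElementTree supports only a subset of XPath and no absolute paths,
# hence the leading "." in front of the expression.
boxes = root.findall(".//div[@id='cards_area']//div[@class='place_box']")
ids = [box.get("id") for box in boxes]
print(ids)
```

The query yields one node per place box, in document order; each of these nodes is then the starting point for the per-element steps.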

Now, we want to return several pieces of information from each element. As steps_content, we pass:

{
        "identifier":[
                {"type":"xpath","instruction":"./@id"},
                {"type":"rmprefix","instruction":"place_box_"}
        ],
        "image_url":[
                {"type":"xpath","instruction":"./div/@style"},
                {"type":"regex","instruction":"background-image\\('/(.*)'\\);"}
        ],
        "name":[
                {"type":"xpath","instruction":"./h3/text()"}
        ],
        "leader":[
                {"type":"xpath","instruction":"./span/text()"},
                {"type":"regex","instruction":"Leader: (.*)"}
        ]
}

This will iterate through all places and save the corresponding values in one dictionary per place, returned as a list:

[
        {
                "identifier": "rivendell",
                "image_url": "rivendell.png",
                "name": "Rivendell",
                "leader": "Elrond"
        },
        {
                "identifier": "gondolin",
                "image_url": "tumladen_vale.jpg",
                "name": "Gondolin",
                "leader": "Turgon"
        },
        {
                "identifier": "holymountain",
                "image_url": "oiolosse.png",
                "name": "Taniquetil",
                "leader": "Manwë"
        },
]

If we pass the argument stop=42, the parsing will stop after we have found 42 elements. Alternatively (or additionally), we can pass the following as stopif:

{
        "leader":lambda x: x=="Morgoth" or x=="Sauron",
        "image_url":lambda x: x.endswith(".gif")
}

This means that if we parse a place with the leader “Morgoth” or “Sauron”, or a place that has a .gif image, we immediately stop parsing.
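
The stopping rules can be pictured with the following sketch. It is not the library's code: should_stop is a hypothetical helper, the barad_dur entry is made up for illustration, and whether the entry that triggers a condition is still included in the result is not stated here, so the sketch only reports the decision.

```python
def should_stop(entry, found_so_far, stop=None, stopif=None):
    """Hypothetical helper: decide whether parsing should end after `entry`."""
    # Numeric limit: stop once enough elements have been found.
    if stop is not None and found_so_far >= stop:
        return True
    for key, predicate in (stopif or {}).items():
        # A single matching predicate is enough to stop.
        if key in entry and predicate(entry[key]):
            return True
    return False


stopif = {
    "leader": lambda x: x == "Morgoth" or x == "Sauron",
    "image_url": lambda x: x.endswith(".gif"),
}

rivendell = {"leader": "Elrond", "image_url": "rivendell.png"}
barad_dur = {"leader": "Sauron", "image_url": "barad_dur.png"}  # made-up entry

print(should_stop(rivendell, 1, stop=42, stopif=stopif))   # False
print(should_stop(barad_dur, 2, stop=42, stopif=stopif))   # True
print(should_stop(rivendell, 42, stop=42, stopif=stopif))  # True
```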

API Reference