Contents

Overview

Doreah is a useful little toolkit that offers shortcuts and abstractions for common operations.

Installing

Install doreah with the simple command pip install doreah.

Configuration

Each module can be configured with a call to the function config(). However, it is recommended to use a .doreah configuration file in your project’s directory. This way, the correct configuration will be used from the first import.

The .doreah file follows a simple key-value format where the key consists of the module name, a dot and the configuration parameter, e.g.:

logging.verbosity = 2
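
To make the file format concrete, here is a minimal sketch of how such lines could be read into a nested dictionary. It is not doreah's own loader: the function name parse_doreah_config is hypothetical, values are kept as plain strings, and any type conversion doreah may perform is not modelled.

```python
def parse_doreah_config(text):
    """Parse 'module.parameter = value' lines into {module: {parameter: value}}."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split("=", 1))
        # The part before the first dot is the module, the rest is the parameter.
        module, _, parameter = key.partition(".")
        config.setdefault(module, {})[parameter] = value
    return config


parsed = parse_doreah_config("logging.verbosity = 2")
print(parsed)
```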

Modules

Authentication

API Reference

Caching

API Reference

Database

API Reference

Input/Output

API Reference

Logging

API Reference

Persistence

API Reference

PyHP / Python Hypertext Processor

How To

PyHP Syntax

Any strict (XML-compliant) HTML document is a valid PyHP document. Server-side processing is done with <pyhp> nodes, whose meaning is determined by their arguments. Nodes without any arguments are code blocks: they may contain arbitrary Python code. Indentation can be chosen to match the position in the XML tree, but must then be consistent within the block. The first and last line (those containing the <pyhp> and </pyhp> tags) must not contain code.

The following pyhp nodes are available:

echo
The supplied expression will be evaluated and returned as string
<pyhp echo="len(stuff)" />
if
Everything within this node is only sent to the client if the condition evaluates to true
<pyhp if="me['rank'] == 1 or not me.isintouchwithreality()">I'm the best!</pyhp>
for loop
Everything within this node will be evaluated for each element in the iterable / mapping
<pyhp for="city" in="patriarchs" separator=" | ">The current patriarch of <pyhp echo="city" /> is <pyhp echo="patriarchs[city]" />. </pyhp>
assignment
Assigns to a variable
<pyhp save="complicated_db_call(somestuff)[len(somelist)]['info']['important']" as="importantinfo" />
include
Includes another pyhp file at this location.
<pyhp include="sidebar.pyhp" />

You can also access variables inside arguments of regular html nodes with curly braces:

<a href="{site.url}"><pyhp echo="site.name" /></a>
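
The curly-brace substitution can be pictured with a short Python sketch. This only illustrates the idea and is not how doreah implements it: the interpolate function and the SimpleNamespace stand-in for site are assumptions made for the example.

```python
import re
from types import SimpleNamespace


def interpolate(attribute_value, variables):
    """Replace each {expression} with the string value of the evaluated expression."""
    def evaluate(match):
        # eval mirrors the idea of evaluating a Python expression;
        # it must never be used on untrusted input.
        return str(eval(match.group(1), {}, variables))
    return re.sub(r"\{([^{}]+)\}", evaluate, attribute_value)


# Hypothetical variable that a template would have access to.
site = SimpleNamespace(url="https://example.org", name="Example")
result = interpolate("{site.url}/about", {"site": site})
print(result)
```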

API Reference

Regular

API Reference

Scraping

How To

Step Instructions

This module provides a simplified interface to parse XML trees with a set of predefined steps. These need to be supplied as a list of dicts with the keys type and instruction, although the second may be omitted for steps that do not need any further instructions.

The following steps are possible:

  • Steps that work for both single elements and lists:
    xpath
    follows the xpath down the tree, returns first element (node -> node or string)
    prefix
    adds a prefix to the string (string -> string)
    suffix
    appends a suffix to the string (string -> string)
    rmprefix
    removes a prefix if present (string -> string)
    regex
    replaces the string matched by the supplied regex with its first capture group (string -> string)
    last
    splits the string and returns last element (string -> string)
  • Steps that work for single elements and return a single element:
    follow
    follows the specified link and returns the root node of the resulting document (string -> node)
  • Steps that work for single elements and split them into a list:
    split
    splits the string (string -> stringlist)
    makelist
    turns an element into a list consisting of that element (string -> stringlist, node -> nodelist)
    xpathls
    follows the xpath down the tree, returns all elements (node -> nodelist or stringlist)
  • Steps that work for lists and merge them back into a single element:
    pick
    picks the n-th element from the list (nodelist -> node, stringlist -> string)
    combine
    combines all strings of the list (stringlist -> string)
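
To illustrate how a list of steps transforms a value, here is a minimal sketch that applies a few of the string steps in order. It uses the type and instruction keys seen in the examples further down, covers only prefix, suffix, rmprefix and regex, and is an illustration rather than doreah's implementation. The names apply_step and run_steps are hypothetical.

```python
import re


def apply_step(value, step):
    """Apply one string step to a single string and return the result."""
    kind = step["type"]
    arg = step.get("instruction")
    if kind == "prefix":
        return arg + value
    if kind == "suffix":
        return value + arg
    if kind == "rmprefix":
        # Only strip the prefix if it is actually present.
        return value[len(arg):] if value.startswith(arg) else value
    if kind == "regex":
        # Replace the matched text with its first capture group.
        return re.sub(arg, lambda m: m.group(1), value)
    raise ValueError(f"unsupported step type: {kind}")


def run_steps(value, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        value = apply_step(value, step)
    return value


identifier = run_steps(
    "place_box_rivendell",
    [
        {"type": "rmprefix", "instruction": "place_box_"},
        {"type": "prefix", "instruction": "id:"},
    ],
)
leader = run_steps("Leader: Elrond", [{"type": "regex", "instruction": "Leader: (.*)"}])
print(identifier, leader)
```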

Scraping feeds

parse_all() is a function to scrape any well-structured feed of regular elements. Since its arguments may be confusing, let’s look at a simple example. Say we want to scrape all locations of a website that shows 3 entries per page and its URLs look like this:

https://bestgallery.tld/newest?start=0
https://bestgallery.tld/newest?start=3
https://bestgallery.tld/newest?start=6

We would then supply base_url="https://bestgallery.tld/newest?start={page}", start_page=0 and page_multiplier=3 (since page 0 needs a 0, page 1 needs a 3 and so on).

If our page has a weird URL logic, we can simply supply a function instead that takes the logical page number (0, 1, 2, …) as input and returns the string that should be inserted into the URL.
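
The mapping from logical page number to URL can be sketched as follows. This is an assumption about the behaviour described above, not doreah's code: page_url is a hypothetical helper, and passing the function in place of page_multiplier is a guess at where the callable would go.

```python
def page_url(base_url, logical_page, start_page=0, page_multiplier=1):
    """Build the URL for a logical page number (0, 1, 2, ...)."""
    if callable(page_multiplier):
        # Assumed: a function receives the logical page number directly.
        value = page_multiplier(logical_page)
    else:
        value = (start_page + logical_page) * page_multiplier
    return base_url.format(page=value)


base = "https://bestgallery.tld/newest?start={page}"
urls = [page_url(base, n, page_multiplier=3) for n in range(3)]

# A site with unusual URL logic, e.g. letters instead of numbers.
letter_url = page_url(base, 2, page_multiplier=lambda n: "abc"[n])
print(urls, letter_url)
```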

Now let’s have a look at the relevant part of our webpage:

<body>
        <div id="cards_area">
                <div class="place_box" id="place_box_rivendell">
                        <div style="background-image('/rivendell.png');"></div>
                        <h3 class="place_name">Rivendell</h3>
                        <span class="place_leader">Leader: Elrond</span>
                </div>
                <div class="place_box" id="place_box_gondolin">
                        <div style="background-image('/tumladen_vale.jpg');"></div>
                        <h3 class="place_name">Gondolin</h3>
                        <span class="place_leader">Leader: Turgon</span>
                </div>
                <div class="place_box" id="place_box_holymountain">
                        <div style="background-image('/oiolosse.png');"></div>
                        <h3 class="place_name">Taniquetil</h3>
                        <span class="place_leader">Leader: Manwë</span>
                </div>
        </div>
</body>

As steps_elements we need to supply the steps to acquire a list of elements - simple enough:

[
        {"type":"xpath","instruction":"//div[@id='cards_area']//div[@class='place_box']"}
]

Now, we want to return several pieces of information from each element. As steps_content, we pass:

{
        "identifier":[
                {"type":"xpath","instruction":"./@id"},
                {"type":"rmprefix","instruction":"place_box_"}
        ],
        "image_url":[
                {"type":"xpath","instruction":"./div/@style"},
                {"type":"regex","instruction":"background-image\\('/(.*)'\\);"}
        ],
        "name":[
                {"type":"xpath","instruction":"./h3/text()"}
        ],
        "leader":[
                {"type":"xpath","instruction":"./span/text()"},
                {"type":"regex","instruction":"Leader: (.*)"}
        ]
}

This will iterate through all places and save the corresponding values in one dictionary per place, returned as a list:

[
        {
                "identifier": "rivendell",
                "image_url": "rivendell.png",
                "name": "Rivendell",
                "leader": "Elrond"
        },
        {
                "identifier": "gondolin",
                "image_url": "tumladen_vale.jpg",
                "name": "Gondolin",
                "leader": "Turgon"
        },
        {
                "identifier": "holymountain",
                "image_url": "oiolosse.png",
                "name": "Taniquetil",
                "leader": "Manwë"
        },
]
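
For comparison, the same extraction can be reproduced by hand with Python's standard library. This sketch does not call doreah at all. It parses a shortened copy of the sample page (two of the three places) with xml.etree.ElementTree and applies the equivalent of each step manually. Note that the literal parentheses in the style attribute are escaped in the regular expression.

```python
import re
import xml.etree.ElementTree as ET

PAGE = """<body>
  <div id="cards_area">
    <div class="place_box" id="place_box_rivendell">
      <div style="background-image('/rivendell.png');"></div>
      <h3 class="place_name">Rivendell</h3>
      <span class="place_leader">Leader: Elrond</span>
    </div>
    <div class="place_box" id="place_box_gondolin">
      <div style="background-image('/tumladen_vale.jpg');"></div>
      <h3 class="place_name">Gondolin</h3>
      <span class="place_leader">Leader: Turgon</span>
    </div>
  </div>
</body>"""

root = ET.fromstring(PAGE)
cards_area = root.find("div[@id='cards_area']")

places = []
for box in cards_area.findall("div[@class='place_box']"):
    element_id = box.get("id")
    style = box.find("div").get("style")
    places.append({
        # xpath ./@id, then rmprefix "place_box_"
        "identifier": element_id[len("place_box_"):],
        # xpath ./div/@style, then keep the first capture group
        "image_url": re.search(r"background-image\('/(.*)'\);", style).group(1),
        # xpath ./h3/text()
        "name": box.find("h3").text,
        # xpath ./span/text(), then keep the first capture group
        "leader": re.search(r"Leader: (.*)", box.find("span").text).group(1),
    })

print(places)
```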

If we pass the argument stop=42, the parsing will stop after we have found 42 elements. Alternatively (or additionally), we can pass as stopif the following:

{
        "leader":lambda x: x=="Morgoth" or x=="Sauron",
        "image_url":lambda x: x.endswith(".gif")
}

This means that if we parse a place with the leader “Morgoth” or “Sauron”, or a place that has a .gif image, we immediately stop parsing.
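
The stop condition can be read as "stop as soon as any of the supplied functions returns true for its field". A minimal sketch of that check, under the assumption that each condition is applied to the already parsed value of the same key (should_stop is a hypothetical name, not part of doreah):

```python
def should_stop(entry, stopif):
    """Return True if any stop condition matches its field in the parsed entry."""
    return any(
        key in entry and condition(entry[key])
        for key, condition in stopif.items()
    )


stopif = {
    "leader": lambda x: x == "Morgoth" or x == "Sauron",
    "image_url": lambda x: x.endswith(".gif"),
}

keep_going = should_stop({"leader": "Elrond", "image_url": "rivendell.png"}, stopif)
stop_leader = should_stop({"leader": "Sauron", "image_url": "barad_dur.png"}, stopif)
stop_image = should_stop({"leader": "Turgon", "image_url": "gondolin.gif"}, stopif)
print(keep_going, stop_leader, stop_image)
```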

API Reference

Settings

API Reference

Timing

API Reference

TSV

API Reference