Real World Data

Warning

Big fat “unfinished” warning: structa is still very much incomplete at this time and there are plenty of rough edges (like not showing CSV column titles).

If you run into unfinished stuff, do check the issues first as I may have a ticket for that already. If you run into genuinely “implemented but broken” stuff, please do file an issue; it’s these things I’m most interested in at this stage.

Pre-requisites

You’ll need the following to start this tutorial:

  • A structa installation; see Installation for more information on this.

  • A Python 3 installation; given that structa requires this to run at all, if you’ve got structa installed, you’ve got this too. However, it’ll help enormously if Python is in your system’s “PATH” so that you can run Python scripts at the command line.

  • The scipy library must be installed for the scripts we’re going to be using to generate data. On Debian/Ubuntu systems you can run the following:

    $ sudo apt install python3-scipy
    

    On Windows, or if you’re running in a virtual environment, you should run the following:

    $ pip install scipy
    
  • Some basic command line knowledge. In particular, it’ll help if you’re familiar with shell redirection and piping (note: while that link is on askubuntu.com the contents are equally applicable to the vast majority of UNIX shells, and even to Windows’ cmd!)

“Real World” Data

For this tutorial, we’ll use a custom-made dataset which will allow us to tweak things and see what’s going on under structa’s hood a bit more easily.

The following script generates a fairly sizeable JSON file (~11MB) apparently recording various air quality readings from places which bear absolutely no resemblance whatsoever to my adoptive city (ahem):

air-quality.py
import sys
import json
import random
import datetime as dt
from scipy.stats import skewnorm

readings = {
    # stat:  (min, max),
    'O3':    (0, 50),
    'NO':    (0, 200),
    'NO2':   (0, 100),
    'PM10':  (0, 100),
    'PM2.5': (0, 100),
}

locations = {
    # location: {stat: (skew, scale), ...}
    'Mancford Peccadillo': {
        'O3':    (0,  1),
        'NO':    (5,  1),
        'NO2':   (0,  1),
        'PM10':  (10, 3),
        'PM2.5': (10, 1),
    },
    'Mancford Shartson': {
        'O3':    (-10, 1),
        'NO':    (10,  1),
        'NO2':   (0,   1),
    },
    'Salport': {
        'NO':    (10,  1),
        'NO2':   (-10, 1/2),
        'PM10':  (5,   1/2),
        'PM2.5': (5,   1/2),
    },
    'Prestchester': {
        'O3':    (1,  1),
        'NO':    (5,  1/2),
        'NO2':   (0,  1),
        'PM10':  (5,  1/2),
        'PM2.5': (10, 1/2),
    },
    'Blackshire': {
        'O3':    (-10, 1),
        'NO':    (50,  1/2),
        'NO2':   (10,  1/2),
        'PM10':  (10,  1/2),
        'PM2.5': (10,  1/2),
    },
    'St. Wigpools': {
        'O3':    (0,  1),
        'NO':    (10, 1),
        'NO2':   (5,  3/4),
        'PM10':  (5,  1/2),
        'PM2.5': (5,  1/2),
    },
}

def skewfunc(min, max, a=0, scale=1):
    # Return a generator of skew-normal distributed random values, mapped
    # (approximately) onto the min..max range and adjusted by scale
    s = skewnorm(a)
    real_min = s.ppf(0.0001)
    real_max = s.ppf(0.9999)
    real_range = real_max - real_min
    res_range = max - min
    def skewrand():
        return min + res_range * scale * (s.rvs() - real_min) / real_range
    return skewrand

generators = {
    # Map each location to {reading: generator}; the single-element tuples
    # below just unpack the (min, max) and (skew, scale) pairs in-place
    location: {
        reading: skewfunc(read_min, read_max, skew, scale)
        for reading, params in loc_readings.items()
        for read_min, read_max in (readings[reading],)
        for skew, scale in (params,)
    }
    for location, loc_readings in locations.items()
}

timestamps = [
    dt.datetime(2020, 1, 1) + dt.timedelta(hours=n)
    for n in range(10000)
]

data = {
    location: {
        'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
        'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
        'lat': random.random() + 53.0,
        'long': random.random() - 3.0,
        'alt': random.randint(5, 100),
        'readings': {
            reading: {
                timestamp.isoformat(): loc_gen()
                for timestamp in timestamps
            }
            for reading, loc_gen in loc_gens.items()
        }
    }
    for location, loc_gens in generators.items()
}

json.dump(data, sys.stdout)
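
If you’re curious how these generators behave, here’s a tiny sketch you can try (assuming the definitions above are pasted into a Python session, or saved as a module you can import); it builds the generator for Blackshire’s NO readings and draws a few samples:

# Blackshire's NO readings: range 0..200, skew 50, scale 1/2 (from the tables above)
gen = skewfunc(0, 200, 50, 1/2)
# Draw a handful of samples; exact values differ on every run
print([round(gen(), 1) for _ in range(5)])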

If you run the script it will output JSON on stdout, which you can redirect to a file (or pipe straight into structa, but given the script takes a while to run you may wish to capture the output to a file for experimentation purposes). Passing the output to structa should produce something like this:

$ python3 air-quality.py > air-quality.json
$ structa air-quality.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt': int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}

Note

You may notice that the output of structa looks rather similar to the end of the air-quality.py script, where the “data” variable that is ultimately dumped is constructed. This neatly illustrates the purpose of structa: to summarize repeating structures in a mass of hierarchical data.

Looking at this output we can see that the data consists of a mapping (or Javascript “object”) at the top level, keyed by strings in the range “Blackshire” to “St. Wigpools” (when sorted).

Under these keys are more mappings which have six keys (which structa has displayed in alphabetical order for ease of reading); a sketch of one entry’s shape follows the list:

  • alt which maps to an integer in some range (in the example above 31 to 85, but this will likely be different for you)

  • euid which maps to a string which always starts with “GB” and is followed by several numerals

  • lat which maps to a floating point value around 53

  • long which maps to another floating point roughly around -2

  • ukid which maps to a string always starting with UKA00 followed by several numerals

  • And finally, readings which maps to another dictionary keyed by strings (the reading names) …

  • … each of which maps to a further dictionary keyed by timestamps in string format, which map to floating point values
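
Putting these bullets together, one top-level entry of the generated data looks roughly like this (the values below are made up purely for illustration; yours will differ):

# Illustrative shape of one top-level entry (values are invented)
entry = {
    'euid': 'GB1234A',
    'ukid': 'UKA00456',
    'lat': 53.5,
    'long': -2.5,
    'alt': 42,
    'readings': {
        'NO': {'2020-01-01T00:00:00': 12.3, '2020-01-01T01:00:00': 15.1},
        'PM10': {'2020-01-01T00:00:00': 30.2, '2020-01-01T01:00:00': 28.7},
    },
}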

If you have a terminal capable of ANSI codes, you may note that types are displayed in a different color (to distinguish them from literals like the “ukid” and “euid” keys), as are patterns within fixed length strings, and various keywords like “range=”.

Note

You may also notice that several of the types (definitely the outer “str”, but possibly other types within the top-level dictionary, like lat/long) are underlined. This indicates that these values are unique throughout the entire dataset, and thus potentially suitable as top-level keys if entered into a database.

Just because you can use something as a unique key, however, doesn’t mean you should (floating point values being a classic example).
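
If you want to check this for yourself, a quick sketch like the following (assuming you captured the output in air-quality.json as above) confirms that, for example, no two locations share the same “lat” value:

import json

# Load the generated dataset and check whether any 'lat' value repeats
with open('air-quality.json') as f:
    data = json.load(f)

lats = [record['lat'] for record in data.values()]
print(len(lats) == len(set(lats)))  # True if every latitude is unique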

Optional Keys

Let’s explore how structa handles various “problems” in the data. Firstly, we’ll make a copy of our script and add a chunk of code to remove approximately half of the altitude readings:

$ cp air-quality.py air-quality-opt.py
$ editor air-quality-opt.py
air-quality-opt.py
data = {
    location: {
        'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
        'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
        'lat': random.random() + 53.0,
        'long': random.random() - 3.0,
        'alt': random.randint(5, 100),
        'readings': {
            reading: {
                timestamp.isoformat(): loc_gen()
                for timestamp in timestamps
            }
            for reading, loc_gen in loc_gens.items()
        }
    }
    for location, loc_gens in generators.items()
}

for location in data:
    if random.random() < 0.5:
        del data[location]['alt']

json.dump(data, sys.stdout)

What does structa make of this?

$ python3 air-quality-opt.py > air-quality-opt.json
$ structa air-quality-opt.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt'?: int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}

Note that a question-mark has now been appended to the “alt” key in the second-level dictionary (if your terminal supports color codes, this should appear in red). This indicates that the “alt” key is optional and not present in every single dictionary at that level.
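
You can confirm which locations lost their altitude with a quick check along these lines (a rough sketch against the air-quality-opt.json file generated above):

import json

# Show which locations still have an 'alt' key after the random deletion
with open('air-quality-opt.json') as f:
    data = json.load(f)

print({location: 'alt' in record for location, record in data.items()})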

“Bad” Data

Next, we’ll make another script (a copy of air-quality-opt.py), which adds some more code to “corrupt” some of the timestamps:

$ cp air-quality-opt.py air-quality-bad.py
$ editor air-quality-bad.py
air-quality-bad.py
for location in data:
    if random.random() < 0.5:
        reading = random.choice(list(data[location]['readings']))
        date = random.choice(list(data[location]['readings'][reading]))
        value = data[location]['readings'][reading].pop(date)
        # Change the date to the 31st of February...
        data[location]['readings'][reading]['2020-02-31T12:34:56'] = value

json.dump(data, sys.stdout)

What does structa make of this?

$ python3 air-quality-bad.py > air-quality-bad.json
$ structa air-quality-bad.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt'?: int range=31..85,
        'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
        'lat': float range=53.29812..53.6833,
        'long': float range=-2.901626..-2.362118,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
        },
        'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
    }
}

Apparently nothing! It may seem odd that structa raised no errors, or even warnings, when it encountered subtly incorrect data. One might (incorrectly) assume that structa simply treats anything in a string that vaguely looks like a timestamp as one.

For the avoidance of doubt, this is not the case: structa does attempt to convert timestamps correctly and does not think February 31st is a valid date (unlike certain databases!). However, structa does have a “bad threshold” setting (structa --bad-threshold) which means not all data in a given sequence has to match the pattern under test.
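
To see why the bogus dates slip under that threshold, consider the numbers: the script corrupts at most one timestamp in each affected readings dictionary of 10,000 entries. A quick back-of-the-envelope calculation:

# At most one bogus date out of 10000 timestamps in any readings dictionary
corrupted = 1
total = 10000
print(f"{corrupted / total:.2%}")  # 0.01%, well below the threshold at which
                                   # structa would reject the timestamp pattern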

Multiple Inputs

Time for another script (based on a copy of the prior air-quality-bad.py script), which writes each location out to its own separate JSON file:

$ cp air-quality-bad.py air-quality-multi.py
$ editor air-quality-multi.py
air-quality-multi.py
for location in data:
    filename = location.lower().replace(' ', '-').replace('.', '')
    filename = 'air-quality-{filename}.json'.format(filename=filename)
    with open(filename, 'w') as out:
        json.dump({location: data[location]}, out)

We can pass all the files to structa as inputs simultaneously, which causes it to treat them as if they all share a comparable structure:

$ python3 air-quality-multi.py
$ ls *.json
air-quality-blackshire.json           air-quality-prestchester.json
air-quality-mancford-peccadillo.json  air-quality-salport.json
air-quality-mancford-shartson.json    air-quality-st-wigpools.json
$ structa air-quality-*.json
{
    str range="Blackshire".."St. Wigpools": {
        'alt': int range=15..92,
        'euid': str range="GB0213A".."GB1029A" pattern="GB[01][028-9][1-26-7][2-379]A",
        'lat': float range=53.49709..53.98315,
        'long': float range=-2.924566..-2.021445,
        'readings': {
            str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-2.982586..327.4161 }
        },
        'ukid': str range="UKA00148".."UKA00786" pattern="UKA00[135-7][13-47-8][06-9]"
    }
}

In this case, structa has merged the top-level mapping in each file into one large top-level mapping. It would do the same if a top-level list were found in each file too.
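
Conceptually, the merge is similar to the following sketch (this is just an illustration of the effect, not how structa is actually implemented):

import glob
import json

# Combine each file's top-level mapping into one large mapping,
# much as structa does when given all the files at once
merged = {}
for path in sorted(glob.glob('air-quality-*.json')):
    with open(path) as f:
        merged.update(json.load(f))

print(sorted(merged))  # the location names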

Conclusion

This concludes the structa tutorial series. You should now have some experience of using structa with more complex datasets, of tuning its various settings for different scenarios, and of what to look out for in the results to get the most out of its analysis.

In other words, if you wish to use structa from the command line, you should be all set. If you want help dealing with some specific scenarios, the sections in Recipes may be of interest. Alternatively, if you wish to use structa in your own Python scripts, the API Reference may prove useful.

Finally, if you wish to hack on structa yourself, please see the Development chapter for more information.