structa
structa is a small utility for analyzing repeating structures in large data sources. Typically this is something like a document oriented database in JSON format, or a CSV file of a database dump, or a YAML document.
Usage
Use from the command line:
structa <filename>
The usual --help
and --version
switches are available for more
information. The full documentation may also help understanding the myriad
switches!
Examples
The People in Space API shows the number of people currently in space, and their names and craft name:
curl -s http://api.open-notify.org/astros.json | structa
Output:
{
'message': str range="success" pattern="success",
'number': int range=10,
'people': [
{
'craft': str range="ISS".."Tiangong",
'name': str range="Akihiko Hoshide".."Thomas Pesquet"
}
]
}
The Python Package Index (PyPI) provides a JSON API for packages:
curl -s https://pypi.org/pypi/numpy/json | structa
Output:
{
'info': { str: value },
'last_serial': int range=9.0M,
'releases': {
str range="0.9.6".."1.9.3": [
{
'comment_text': str,
'digests': {
'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
},
'downloads': int range=-1,
'filename': str,
'has_sig': bool,
'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
'packagetype': str range="bdist_wheel".."sdist",
'python_version': str range="2.5".."source",
'requires_python': value,
'size': int range=1.9M..24.5M,
'upload_time': str of timestamp range=2006-12-02 02:07:43..2020-12-25 03:30:00 pattern=%Y-%m-%dT%H:%M:%S,
'upload_time_iso_8601': str of timestamp range=2009-04-06 06:19:25..2020-12-25 03:30:00 pattern=%Y-%m-%dT%H:%M:%S.%f%z,
'url': URL,
'yanked': bool,
'yanked_reason': value
}
]
},
'urls': [
{
'comment_text': str range="",
'digests': {
'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
'sha256': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
},
'downloads': int range=-1,
'filename': str,
'has_sig': bool,
'md5_digest': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
'packagetype': str range="bdist_wheel" pattern="bdist_wheel",
'python_version': str range="cp36".."pp36" pattern="Ip3d",
'requires_python': str range=">=3.6" pattern=">=3.6",
'size': int range=7.3M..15.4M,
'upload_time': str of timestamp range=2020-11-02 15:46:22..2020-11-02 16:18:20 pattern=%Y-%m-%dT%H:%M:%S,
'upload_time_iso_8601': str of timestamp range=2020-11-02 15:46:22..2020-11-02 16:18:20 pattern=%Y-%m-%dT%H:%M:%S.%f%z,
'url': URL,
'yanked': bool,
'yanked_reason': value
}
]
}
The Ubuntu Security Notices database contains the list of all security issues in releases of Ubuntu (warning, this one takes some time to analyze and eats about a gigabyte of RAM while doing so):
curl -s https://usn.ubuntu.com/usn-db/database.json | structa
Output:
{
str range="1430-1".."4630-1" pattern="dddd-d": {
'action'?: str,
'cves': [ str ],
'description': str,
'id': str range="1430-1".."4630-1" pattern="dddd-d",
'isummary'?: str,
'releases': {
str range="artful".."zesty": {
'allbinaries'?: {
str: { 'version': str }
},
'archs'?: {
str range="all".."source": {
'urls': {
URL: {
'md5': str pattern="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
'size': int range=20..1.2G
}
}
}
},
'binaries': {
str: { 'version': str }
},
'sources': {
str: {
'description': str,
'version': str
}
}
}
},
'summary': str,
'timestamp': float of timestamp range=2012-04-27 12:57:41..2020-11-11 18:01:48,
'title': str
}
}
Contents
Installation
structa is distributed in several formats. The following sections detail installation on a variety of platforms.
Ubuntu Linux
For Ubuntu Linux, it is simplest to install from the author’s PPA as follows (this also ensures you are kept up to date as new releases are made):
$ sudo add-apt-repository ppa://waveform/structa
$ sudo apt update
$ sudo apt install structa
If you wish to remove structa:
$ sudo apt remove structa
Microsoft Windows
Firstly, install a version of Python 3 (this must be Python 3.5 or later), or ensure you have an existing installation of Python 3.
Ideally, for the purposes of following the Getting Started you should add your Python 3 install to the system PATH variable so that python can be easily run from any command line.
You can install structa with the “pip” tool like so:
C:\Users\me> pip install structa
Upgrading structa can be done via pip too:
C:\Users\me> pip install --upgrade structa
And removal can be performed via pip:
C:\Users\me> pip uninstall structa
Other Platforms
If your platform is not covered by one of the sections above, structa is available from PyPI and can therefore be installed with the Python setuptools “pip” tool:
$ pip install structa
On some platforms you may need to use a Python 3 specific alias of pip:
$ pip3 install structa
If you do not have either of these tools available, please install the Python setuptools package first.
You can upgrade structa via pip:
$ pip install --upgrade structa
And removal can be performed as follows:
$ pip uninstall structa
Getting Started
Warning
Big fat “unfinished” warning: structa is still very much incomplete at this time and there’s plenty of rough edges (like not showing CSV column titles).
If you run into unfinished stuff, do check the issues first as I may have a ticket for that already. If you run into genuinely “implemented but broken” stuff, please do file an issue; it’s these things I’m most interested in at this stage.
Getting the most out of structa is part science, part art. The science part is understanding how structa works and what knobs it has to twiddle. The art bit is figuring out what to twiddle them to!
Pre-requisites
You’ll need the following to start this tutorial:
A structa installation; see Installation for more information on this.
A Python 3 installation; given that structa requires this to run at all, if you’ve got structa installed, you’ve got this too. However, it’ll help enormously if Python is in your system’s “PATH” so that you can run python scripts at the command line.
Some basic command line knowledge. In particular, it’ll help if you’re familiar with shell redirection and piping (note: while that link is on askubuntu.com the contents are equally applicable to the vast majority of UNIX shells, and even to Windows’ cmd!)
Basic Usage
We’ll start with some basic data structures and see how structa handles them. The following Python script dumps a list of strings representing integers to stdout in JSON format:
import sys
import json
json.dump([str(i) for i in range(1000)] * 3, sys.stdout)
This produces output that looks (partially) like this:
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25",
"26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
"38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
"62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
"74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
"86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
"98", "99", "100", "101", "102", "103", "104", "105", "106", "107", "108",
"109", "110", "111", "112", "113", "114", "115", "116", "117", "118",
"119", "120", "121", "122", "123", "124", "125", "126", "127", "128",
"129", "130",
// lots more output...
]
We can capture the output in a file and pass this to structa:
$ python3 str-nums.py > str-nums.json
$ structa str-nums.json
[ str of int range=0..999 pattern="d" ]
Alternatively, we can pipe the output straight to structa:
$ python3 str-nums.py | structa
[ str of int range=0..999 pattern="d" ]
The output shows that the data contains a list (indicated by the square-brackets surrounding the output) of strings of integers (“str of int”), which have values between 0 and 999 (inclusive). The “pattern” at the end indicates that the strings are in decimal (“d”) form (structa would also recognize octal, “o”, and hexadecimal “x” forms of integers).
Bad Data (--bad-threshold
)
Let’s see how structa handles bad data. We’ll add a non-numeric string into our list of numbers:
import sys
import json
json.dump(['foo'] + [str(i) for i in range(1000)] * 3, sys.stdout)
What does structa do in the presence of this “corrupt” data?
$ python3 bad-nums.py | structa
[ str of int range=0..999 pattern="d" ]
Apparently nothing! It may seem odd that structa raised no errors, or even
warnings when encountering subtly incorrect data. However, structa has a “bad
threshold” setting (structa --bad-threshold
) which means not all data
in a given sequence has to match the pattern under test.
This setting defaults to 1% (or 0.01) meaning that up to 1% of the values can fail to match and the pattern will still be considered valid. If we lower the bad threshold to zero, this is what happens:
$ python3 bad-nums.py | structa --bad-threshold 0
[ str range="0".."foo" ]
It’s still recognized as a list of strings, but no longer as string representations of integers.
How about mixing types? The following script outputs our errant string, “foo”, along with a list of numbers. However, note that this time the numbers are integers, not strings of integers. In other words we have a list of a string, and lots of integers:
import sys
import json
json.dump(['foo'] + list(range(1000)) * 3, sys.stdout)
$ python3 bad-types.py | structa
[ value ]
In this case, even with the default 1% bad threshold, structa doesn’t exclude the bad data; the analysis simply returns it as a list of mixed “values”.
This is because structa assumes that the types of data are at least consistent and correct, under the assumption that if whatever is generating your data hasn’t even got the data types right, you’ve got bigger problems! The bad threshold mechanism only applies to bad data within a homogenous type (typically bad string representations of numeric or boolean types).
Missing Data (--empty-threshold
)
Another type of “bad” data commonly encountered is empty strings which are
typically used to represent missing data, and (predictably) structa has
another knob that can be twiddled for this: structa
--empty-threshold
. The following script generates a list of strings of
integers in which most of the strings (~70%) are blank:
import sys
import json
import random
json.dump([
'' if random.random() < 0.7 else str(random.randint(0, 100))
for i in range(10000)
], sys.stdout)
Despite the vast majority of the data being blank, structa handles this as normal:
$ python3 mostly-blank.py | structa
[ str of int range=0..100 pattern="d" ]
This is because the default for structa --empty-threshold
is 99% or
0.99. If the proportion of blank strings in a field exceeds the empty
threshold, the field will simply be marked as a string without any further
processing. Hence, when we re-run this script with the setting turned down to
50%, the output changes:
$ python3 mostly-blank.py | structa --empty-threshold 50%
[ str range="".."99" ]
Note
For those slightly confused by the above output: structa hasn’t lost the “100” value, but because it’s now considered a string (not a string of integers), “100” sorts before “99” alphabetically.
It is also worth nothing that, by default, structa strips whitespace from
strings prior to analysis. This is probably not necessary for the vast majority
of modern datasets, but it’s a reasonably safe default, and can be controlled
with the structa --strip-whitespace
and structa
--no-strip-whitespace
options in any case.
Fields or Tables (--field-threshold
)
The next major knob that can be twiddled in structa is the structa
--field-threshold
. This is used to distinguish between mappings that act as a
“table” (mapping keys to records) and mappings that act as a record (mapping
field-names, typically strings, to their values).
To illustrate the difference between these, consider the following script:
import sys
import json
import random
json.dump({
str(flight_id): {
"flight_id": flight_id,
"passengers": random.randint(50, 200),
"from": random.choice([
"MAN", "LON", "LHR", "ABZ", "AMS", "AUS", "BCN",
"BER", "BHX", "BRU", "CHI", "ORK", "DAL", "EDI",
]),
}
for flight_id in range(200)
}, sys.stdout)
The generates a JSON file containing a mapping of mappings which looks something like this snippet (but with a lot more output):
{
"0": { "flight_id": 0, "passengers": 53, "from": "BHX" },
"1": { "flight_id": 1, "passengers": 157, "from": "AMS" },
"2": { "flight_id": 2, "passengers": 118, "from": "DAL" },
"3": { "flight_id": 3, "passengers": 111, "from": "MAN" },
"4": { "flight_id": 4, "passengers": 192, "from": "BRU" },
"5": { "flight_id": 5, "passengers": 69, "from": "DAL" },
"6": { "flight_id": 6, "passengers": 147, "from": "LON" },
"7": { "flight_id": 7, "passengers": 187, "from": "LON" },
"8": { "flight_id": 8, "passengers": 171, "from": "AMS" },
"9": { "flight_id": 9, "passengers": 89, "from": "DAL" },
"10": { "flight_id": 10, "passengers": 169, "from": "LHR" },
// lots more output...
}
The outer mapping is what structa would consider a “table” since it maps keys (in this case a string representation of an integer) to records. The inner mappings are what structa would consider “records” since they map a relatively small number of field names to values.
Note
Record fields don’t have to be simple scalar values (although they are here); they can be complex structures including lists or indeed further embedded records.
If structa finds mappings with more keys than the threshold, those mappings will be treated as tables. However, if mappings are found with fewer (or equal) keys to the threshold, they will be analyzed as records. It’s a rather arbitrary value that (unfortunately) usually requires some fore-knowledge of the data being analyzed. However, it’s usually quite easy to spot when the threshold is wrong, as we’ll see.
First, let’s take a look at what happens when the threshold is set correctly. When passed to structa, with the default field threshold of 20, we see the following output:
$ python3 simple-fields.py | structa
{
str of int range=0..199 pattern="d": {
'flight_id': int range=0..199,
'from': str range="ABZ".."ORK" pattern="Iii",
'passengers': int range=50..200
}
}
This indicates that structa has recognized the data as consisting of a mapping (indicated by the surrounding braces), which is keyed by a decimal string representation of an integer (in the range 0 to 199), and the values of which are another mapping with the keys “flight_id”, “from”, and “passengers”.
The reason the inner mappings were treated as a set of records was because all those mappings had less than 20 entries. The outer mapping had more than 20 entries (200 in this case) and thus was treated as a table.
What happens if we force the field threshold down so low that the inner mappings are also treated as a table?
$ python3 simple-fields.py | structa --field-threshold 2
{
str of int range=0..199 pattern="d": { str range="flight_id".."passengers": value }
}
The inner mappings are now defined simply as mappings of strings (in the range
“flight_id” to “passengers”, sorted alphabetically) which map to “value” (an
arbitrary mix of types). Anytime you see a mapping of { str: value }
in
structa’s output, it’s a fairly good clue that structa
--field-threshold
might be too low.
Merging Structures (--merge-threshold
)
The final major knob available for twiddling is the structa
--merge-threshold
which dictates how similar record mappings have to be in
order to be considered for merging. This only applies to mappings at the same
“level” with similar (but not necessarily perfectly identical) structures.
To illustrate, consider the following example script:
import sys
import json
import random
airports = {
"MAN", "LON", "LHR", "ABZ", "AMS", "AUS", "BCN",
"BER", "BHX", "BRU", "CHI", "ORK", "DAL", "EDI",
}
facilities = [
"WiFi", "Shopping", "Conferences", "Chapel", "Parking",
"Lounge", "Spotters Area", "Taxi Rank", "Train Station",
"Tram Stop", "Bus Station", "Duty Free",
]
data = {
airport: {
"code": airport,
"facilities": random.sample(
facilities, random.randint(3, len(facilities))),
"terminals": random.randint(1, 4),
"movements": random.randint(10000, 300000),
"passengers": random.randint(1000000, 30000000),
"cargo": random.randint(10000, 1000000),
}
for airport in airports
}
for entry in data.values():
# Exclude reporting terminals if the airport only has one
if entry['terminals'] == 1:
del entry['terminals']
# Exclude some other stats semi-randomly
if random.random() > 0.7:
del entry['movements']
if random.random() > 0.9:
del entry['cargo']
json.dump(data, sys.stdout)
In keeping with the prior examples, this generates a list of airports with associated statistics. When we run the results through structa they seem to produce sensible output:
$ python3 merge-dicts.py | structa
{
str range="ABZ".."ORK" pattern="Iii": {
'cargo'?: int range=55.0K..949.1K,
'code': str range="ABZ".."ORK" pattern="[A-EL-MO][A-EHMORU][IK-LNR-SUXZ]",
'facilities': [ str range="Bus Station".."WiFi" ],
'movements'?: int range=10.0K..295.7K,
'passengers': int range=1.0M..24.9M,
'terminals'?: int range=2..4
}
}
However, there are several things to note about the data:
The number of top-level entries (the airport codes) is less than the default field threshold (20). This means that the “outer” mapping will initially be treated as a record rather than a table (see the explanation of
--field-threshold
above).In some entries, statistics are missing. When “terminals” would be 1, it’s excluded, and 30% and 10% of entries will be missing their “movements” and “cargo” stats respectively.
The “code”, “facilities”, and “passengers” entries are always present out of a total of 6 fields that could be present. This means that at least 50% of all the fields are guaranteed to be present, which is the default level of
--merge-threshold
.
As noted above, structa’s initial pass will treat the outer mapping as a record so each airport will be analyzed as a separate entity. After this phase a first merge pass will run, which will compare all the airport records. After concluding that all contain at least 50% of the same fields as the rest, and that all field values found are compatible, those rows will be merged. What happens if we raise the merge threshold to 100%, which would require that every single airport record shared exactly the same fields?
$ python3 docs/examples/merge-dicts.py | structa --merge-threshold 100%
{
'ABZ': {
'cargo': int range=192.6K,
'code': str range="ABZ" pattern="ABZ",
'facilities': [ str range="Bus Station".."WiFi" ],
'passengers': int range=27.5M,
'terminals': int range=4
},
'AMS': {
'cargo': int range=606.4K,
'code': str range="AMS" pattern="AMS",
'facilities': [ str range="Bus Station".."WiFi" ],
'movements': int range=132.5K,
'passengers': int range=4.8M,
'terminals': int range=3
},
'AUS': {
'cargo': int range=607.4K,
'code': str range="AUS" pattern="AUS",
'facilities': [ str range="Bus Station".."WiFi" ],
'movements': int range=212.2K,
'passengers': int range=13.7M
},
...
A whole lot of output! When you get excessively large output consisting of
largely (but not completely) similar records, it’s a reasonable sign that
structa --merge-threshold
is set too high.
That said, the merge threshold is fairly forgiving. The specific algorithm used is as follows:
For two given mappings, find the length (number of fields) of the shortest mapping.
Calculate the minimum required number of common fields as the merge threshold percentage of the shortest length. For example, if the shortest mapping contains 8 fields, and the merge threshold is 50%, then there must be at least 4 common fields.
Note that in the case that one side is an empty mapping this will always permit the match as at least 0 common fields will be required percentage of the shortest length.
Other Switches
There are quite a few other switches in structa, but all are less important
than the four covered in the prior sections. The rest largely have to do with
specific formats (structa --csv-format
for CSV files,
structa --no-json-strict
for JSON files), the character encoding of
files (structa --encoding
, structa --encoding-strict
), or
tweaking the style of the output (structa --show-count
,
structa --show-lengths
).
Integer Handling
However, there are a couple that may be important for specific types of data.
The first is structa --max-numeric-len
which dictates the maximum
number of digits structa will consider as a number. This defaults to 30 which
is more than sufficient to represent all 64-bit integer values (which only
require 20 digits), with some lee-way for data that includes large integers
(which Python handles happily).
However, the default is deliberately lower than 32 because at that point, data which includes hex-encoded hash values (MD5, SHA1, etc.) typically wind up mis-representing those hashes as literal integers (which, technically, they are, but that’s not typically how users wish hash values to be interpreted).
Date Handling
The other important switches are those used in the detection of dates encoded
as numbers: structa --min-timestamp
and structa
--max-timestamp
. When dates are encoded as (potentially fractional)
day-offsets from the UNIX epoch (the 1st January, 1970), how does structa
determine that it’s looking at a set of dates rather than a set of numbers?
In a typical set of (arbitrary) numbers, it’s quite normal to find “0” or “1” commonly represented, or for the set of numbers to span over a large range (consider file-sizes which might span over millions or billions of bytes). However, most date-based sets, don’t tend to include values around the 1st or 2nd of January, 1970 (most data that’s dealt with is, to some degree, fairly contemporary), and moreover tends to cluster around values that vary by no more than a few thousand (after all 3000 is enough to represent nearly a decade’s worth of days).
Thus if we find that all numbers in a given set fall within some “reasonable” limits (structa defaults to 20 years prior, and 10 years after the current date) it’s a reasonable guess that we’re looking at dates encoded as numbers rather than an arbitrary set of numbers.
Conclusion
At this point, you should have a pretty good idea of the major controls that structa provides, what they do, and the circumstances under which you will need to fiddle with them. The next tutorial goes through a variety of scenarios with some datasets that are closer to the sort of size and complexity one might encounter in the real world.
However, it won’t be introducing any new functionality that we haven’t covered above and at this point you may simply want to take structa for a spin with your own datasets.
Real World Data
Warning
Big fat “unfinished” warning: structa is still very much incomplete at this time and there’s plenty of rough edges (like not showing CSV column titles).
If you run into unfinished stuff, do check the issues first as I may have a ticket for that already. If you run into genuinely “implemented but broken” stuff, please do file an issue; it’s these things I’m most interested in at this stage.
Pre-requisites
You’ll need the following to start this tutorial:
A structa installation; see Installation for more information on this.
A Python 3 installation; given that structa requires this to run at all, if you’ve got structa installed, you’ve got this too. However, it’ll help enormously if Python is in your system’s “PATH” so that you can run python scripts at the command line.
The scipy library must be installed for the scripts we’re going to be using to generate data. On Debian/Ubuntu systems you can run the following:
$ sudo apt install python3-scipy
On Windows, or if you’re running in a virtual environment, you should run the following:
$ pip install scipy
Some basic command line knowledge. In particular, it’ll help if you’re familiar with shell redirection and piping (note: while that link is on askubuntu.com the contents are equally applicable to the vast majority of UNIX shells, and even to Windows’ cmd!)
“Real World” Data
For this tutorial, we’ll use a custom made data-set which will allow us to tweak things and see what’s going on under structa’s hood a bit more easily.
The following script generates a fairly sizeable JSON file (~11MB) apparently recording various air quality readings from places which bear absolutely no resemblance whatsoever to my adoptive city (ahem):
import sys
import json
import random
import datetime as dt
from scipy.stats import skewnorm
readings = {
# stat: (min, max),
'O3': (0, 50),
'NO': (0, 200),
'NO2': (0, 100),
'PM10': (0, 100),
'PM2.5': (0, 100),
}
locations = {
# location: {stat: (skew, scale), ...}
'Mancford Peccadillo': {
'O3': (0, 1),
'NO': (5, 1),
'NO2': (0, 1),
'PM10': (10, 3),
'PM2.5': (10, 1),
},
'Mancford Shartson': {
'O3': (-10, 1),
'NO': (10, 1),
'NO2': (0, 1),
},
'Salport': {
'NO': (10, 1),
'NO2': (-10, 1/2),
'PM10': (5, 1/2),
'PM2.5': (5, 1/2),
},
'Prestchester': {
'O3': (1, 1),
'NO': (5, 1/2),
'NO2': (0, 1),
'PM10': (5, 1/2),
'PM2.5': (10, 1/2),
},
'Blackshire': {
'O3': (-10, 1),
'NO': (50, 1/2),
'NO2': (10, 1/2),
'PM10': (10, 1/2),
'PM2.5': (10, 1/2),
},
'St. Wigpools': {
'O3': (0, 1),
'NO': (10, 1),
'NO2': (5, 3/4),
'PM10': (5, 1/2),
'PM2.5': (5, 1/2),
},
}
def skewfunc(min, max, a=0, scale=1):
s = skewnorm(a)
real_min = s.ppf(0.0001)
real_max = s.ppf(0.9999)
real_range = real_max - real_min
res_range = max - min
def skewrand():
return min + res_range * scale * (s.rvs() - real_min) / real_range
return skewrand
generators = {
location: {
reading: skewfunc(read_min, read_max, skew, scale)
for reading, params in loc_readings.items()
for read_min, read_max in (readings[reading],)
for skew, scale in (params,)
}
for location, loc_readings in locations.items()
}
timestamps = [
dt.datetime(2020, 1, 1) + dt.timedelta(hours=n)
for n in range(10000)
]
data = {
location: {
'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
'lat': random.random() + 53.0,
'long': random.random() - 3.0,
'alt': random.randint(5, 100),
'readings': {
reading: {
timestamp.isoformat(): loc_gen()
for timestamp in timestamps
}
for reading, loc_gen in loc_gens.items()
}
}
for location, loc_gens in generators.items()
}
json.dump(data, sys.stdout)
If you run the script it will output JSON on stdout, which you can redirect to a file (or straight to structa, but given the script takes a while to run you may wish to capture the output to a file for experimentation purposes). Passing the output to structa should produce output something like this:
$ python3 air-quality.py > air-quality.json
$ structa air-quality.json
{
str range="Blackshire".."St. Wigpools": {
'alt': int range=31..85,
'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
'lat': float range=53.29812..53.6833,
'long': float range=-2.901626..-2.362118,
'readings': {
str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
},
'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
}
}
Note
It should be notable that the output of structa looks rather similar to the
end of the air-quality.py
script, where the “data” variable that is
ultimately dumped is constructed. This neatly illustrates the purpose of
structa: to summarize repeating structures in a mass of hierarchical data.
Looking at this output we can see that the data consists of a mapping (or Javascript “object”) at the top level, keyed by strings in the range “Blackshire” to “St. Wigpools” (when sorted).
Under these keys are more mappings which have six keys (which structa has displayed in alphabetical order for ease of reading):
alt which maps to an integer in some range (in the example above 31 to 85, but this will likely be different for you)
euid which maps to a string which always started with “GB” and is followed by several numerals
lat which maps to a floating point value around 53
long which maps to another floating point roughly around -2
ukid which maps to a string always starting with UKA00 followed by several numerals
And finally, readings which maps to another dictionary of strings …
Which maps to another dictionary which is keyed by timestamps in string format, which map to floating point values
If you have a terminal capable of ANSI codes, you may note that types are displayed in a different color (to distinguish them from literals like the “ukid” and “euid” keys), as are patterns within fixed length strings, and various keywords like “range=”.
Note
You may also notice that several of the types (definitely the outer “str”, but possibly other types within the top-level dictionary, like lat/long) are underlined. This indicates that these values are unique throughout the entire dataset, and thus potentially suitable as top-level keys if entered into a database.
Just because you can use something as a unique key, however, doesn’t mean you should (floating point values being a classic example).
Optional Keys
Let’s explore how structa handles various “problems” in the data. Firstly, we’ll make a copy of our script and add a chunk of code to remove approximately half of the altitude readings:
$ cp air-quality.py air-quality-opt.py
$ editor air-quality-opt.py
data = {
location: {
'euid': 'GB{:04d}A'.format(random.randint(200, 2000)),
'ukid': 'UKA{:05d}'.format(random.randint(100, 800)),
'lat': random.random() + 53.0,
'long': random.random() - 3.0,
'alt': random.randint(5, 100),
'readings': {
reading: {
timestamp.isoformat(): loc_gen()
for timestamp in timestamps
}
for reading, loc_gen in loc_gens.items()
}
}
for location, loc_gens in generators.items()
}
for location in data:
if random.random() < 0.5:
del data[location]['alt']
json.dump(data, sys.stdout)
What does structa make of this?
$ python3 air-quality-opt.py > air-quality-opt.json
$ structa air-quality-opt.json
{
str range="Blackshire".."St. Wigpools": {
'alt'?: int range=31..85,
'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
'lat': float range=53.29812..53.6833,
'long': float range=-2.901626..-2.362118,
'readings': {
str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
},
'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
}
}
Note that a question-mark has now been appended to the “alt” key in the second-level dictionary (if your terminal supports color codes, this should appear in red). This indicates that the “alt” key is optional and not present in every single dictionary at that level.
“Bad” Data
Next, we’ll make another script (a copy of air-quality-opt.py
), which
adds some more code to “corrupt” some of the timestamps:
$ cp air-quality-opt.py air-quality-bad.py
$ editor air-quality-bad.py
for location in data:
if random.random() < 0.5:
reading = random.choice(list(data[location]['readings']))
date = random.choice(list(data[location]['readings'][reading]))
value = data[location]['readings'][reading].pop(date)
# Change the date to the 31st of February...
data[location]['readings'][reading]['2020-02-31T12:34:56'] = value
json.dump(data, sys.stdout)
What does structa make of this?
$ python3 air-quality.py > air-quality-bad.json
$ structa air-quality-bad.json
{
str range="Blackshire".."St. Wigpools": {
'alt'?: int range=31..85,
'euid': str range="GB1012A".."GB1958A" pattern="GB1[0-139][13-58][2-37-9]A",
'lat': float range=53.29812..53.6833,
'long': float range=-2.901626..-2.362118,
'readings': {
str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-5.634479..335.6384 }
},
'ukid': str range="UKA00129".."UKA00713" pattern="UKA00[1-24-57][1-38][0-13579]"
}
}
Apparently nothing! It may seem odd that structa raised no errors, or even warnings when encountering subtly incorrect data. One might (incorrectly) assume that structa just thinks anything that vaguely looks like a timestamp in a string is such.
For the avoidance of doubt, this is not the case: structa does attempt to
convert timestamps correctly and does not think February 31st is a valid date
(unlike certain databases!). However, structa does have a “bad threshold”
setting (structa --bad-threshold
) which means not all data in a given
sequence has to match the pattern under test.
Multiple Inputs
Time for another script (based on a copy of the prior
air-quality-bad.py
script), which produces each location as its own
separate JSON file:
$ cp air-quality-bad.py air-quality-multi.py
$ editor air-quality-multi.py
for location in data:
filename = location.lower().replace(' ', '-').replace('.', '')
filename = 'air-quality-{filename}.json'.format(filename=filename)
with open(filename, 'w') as out:
json.dump({location: data[location]}, out)
We can pass all the files as inputs to structa simultaneously, which will cause it to assume that they should all be processed as if they have comparable structures:
$ python3 air-quality-multi.py
$ ls *.json
air-quality-blackshire.json air-quality-prestchester.json
air-quality-mancford-peccadillo.json air-quality-salport.json
air-quality-mancford-shartson.json air-quality-st-wigpools.json
$ structa air-quality-*.json
{
str range="Blackshire".."St. Wigpools": {
'alt': int range=15..92,
'euid': str range="GB0213A".."GB1029A" pattern="GB[01][028-9][1-26-7][2-379]A",
'lat': float range=53.49709..53.98315,
'long': float range=-2.924566..-2.021445,
'readings': {
str range="NO".."PM2.5": { str of timestamp range=2020-01-01 00:00:00..2021-02-20 15:00:00 pattern="%Y-%m-%dT%H:%M:%S": float range=-2.982586..327.4161 }
},
'ukid': str range="UKA00148".."UKA00786" pattern="UKA00[135-7][13-47-8][06-9]"
}
}
In this case, structa has merged the top-level mapping in each file into one large top-level mapping. It would do the same if a top-level list were found in each file too.
Conclusion
This concludes the structa tutorial series. You should now have some experience of using structa with more complex datasets, how to tune its various settings for different scenarios, and what to look out for in the results to get the most out of its analysis.
In other words, if you wish to use structa from the command line, you should be all set. If you want help dealing with some specific scenarios, the sections in Recipes may be of interest. Alternatively, if you wish to use structa in your own Python scripts, the API Reference may prove useful.
Finally, if you wish to hack on structa yourself, please see the Development chapter for more information.
Command Line Reference
Synopsis
structa [-h] [--version] [-f {auto,csv,json,yaml}] [-e ENCODING]
[--encoding-strict] [--no-encoding-strict]
[-F INT] [-M NUM] [-B NUM] [-E NUM] [--str-limit NUM]
[--hide-count] [--show-count] [--hide-lengths] [--show-lengths]
[--hide-pattern] [--show-pattern]
[--hide-range] [--show-range {hidden,limits,median,quartiles,graph}]
[--hide-samples] [--show-samples]
[--min-timestamp WHEN] [--max-timestamp WHEN]
[--max-numeric-len LEN] [--sample-bytes SIZE]
[--strip-whitespace] [--no-strip-whitespace]
[--csv-format FIELD[QUOTE]] [--yaml-safe] [--no-yaml-safe]
[file [file ...]]
Positional Arguments
- file
The data-file(s) to analyze; if this is - or unspecified then stdin will be read for the data; if multiple files are specified all will be read and analyzed as an array of similar structures
Optional Arguments
- -h, --help
show this help message and exit
- --version
show program’s version number and exit
- -f {auto,csv,json,yaml}, --format {auto,csv,json,yaml}
The format of the data file; if this is unspecified, it will be guessed based on the first bytes of the file; valid choices are auto (the default), csv, or json
- -e ENCODING, --encoding ENCODING
The string encoding of the file, e.g. utf-8 (default: auto). If “auto” then the file will be sampled to determine the encoding (see
--sample-bytes
)
- --encoding-strict, --no-encoding-strict
Controls whether character encoding is strictly enforced and will result in an error if invalid characters are found during analysis. If disabled, a replacement character will be inserted for invalid sequences. The default is strict decoding
- -F INT, --field-threshold INT
If the number of distinct keys in a map, or columns in a tuple is less than this then they will be considered distinct fields instead of being lumped under a generic type like str (default: 20)
- -M NUM, --merge-threshold NUM
The proportion of mapping fields which must match other mappings for them to be considered potential merge candidates (default: 50%)
- -B NUM, --bad-threshold NUM
The proportion of string values which are allowed to mismatch a pattern without preventing the pattern from being reported; the proportion of “bad” data permitted in a field (default: 1%)
- -E NUM, --empty-threshold NUM
The proportion of string values permitted to be empty without preventing the pattern from being reported; the proportion of “empty” data permitted in a field (default: 99%)
- --str-limit NUM
The length beyond which only the lengths of strs will be reported; below this the actual value of the string will be displayed (default: 20)
- --hide-count, --show-count
If set, show the count of items in containers, the count of unique scalar values, and the count of all sample values (if
--show-samples
is set). If disabled, counts will be hidden
- --hide-lengths, --show-lengths
If set, display the range of lengths of string fields in the same format as specified by
--show-range
- --hide-pattern, --show-pattern
If set, show the pattern determined for fixed length string fields. If disabled, pattern information will be hidden
- --hide-range, --show-range {hidden,limits,median,quartiles,graph}
Show the range of numeric (and temporal) fields in a variety of forms. The default is ‘limits’ which simply displays the minimum and maximum; ‘median’ includes the median between these; ‘quartiles’ shows all three quartiles between the minimum and maximum; ‘graph’ displays a crude chart showing the positions of the quartiles relative to the limits. Use
--hide-range
to hide all range info
- --hide-samples, --show-samples
If set, show samples of non-unique scalar values including the most and least common values. If disabled, samples will be hidden
- --min-timestamp WHEN
The minimum timestamp to use when guessing whether floating point fields represent UNIX timestamps (default: 20 years). Can be specified as an absolute timestamp (in ISO-8601 format) or a duration to be subtracted from the current timestamp
- --max-timestamp WHEN
The maximum timestamp to use when guessing whether floating point fields represent UNIX timestamps (default: 10 years). Can be specified as an absolute timestamp (in ISO-8601 format) or a duration to be added to the current timestamp
- --max-numeric-len LEN
The maximum number of characters that a number, integer or floating-point, may use in its representation within the file. Defaults to 30
- --sample-bytes SIZE
The number of bytes to sample from the file for the purposes of encoding and format detection. Defaults to 1m. Typical suffixes of k, m, g, etc. may be specified
- --strip-whitespace, --no-strip-whitespace
Controls whether leading and trailing found in strings in the will be left alone and thus included or excluded in any data-type analysis. The default is to strip whitespace
- --csv-format FIELD[QUOTE]
The characters used to delimit fields and strings in a CSV file. Can be specified as a single character which will be used as the field delimiter, or two characters in which case the second will be used as the string quotation character. Can also be “auto” which indicates the delimiters should be detected. Bear in mind that some characters may require quoting for the shell, e.g. ‘;”’
- --yaml-safe, --no-yaml-safe
Controls whether the “safe” or “unsafe” YAML loader is used to parse YAML files. The default is the “safe” parser. Only use
--no-yaml-safe
if you trust the source of your data
- --json-strict, --no-json-strict
Controls whether the JSON decoder permits control characters within strings, which isn’t technically valid JSON. The default is to be strict and disallow such characters
Recipes
The following sections cover analyzing various common data scenarios with structa, and how structa’s various options should be set to handle them.
Analyzing from a URL
While structa itself can’t read URLs directly, the fact you can pipe data to it makes it ideal for use with something like curl:
$ curl -s https://piwheels.org/packages.json | structa
[
(
str,
int range=0..32.8K,
int range=0..1.7M
)
]
Dealing with large records
In the Getting Started we saw the following script, which generates a
mapping of mappings, for the purposes of learning about structa
--field-threshold
:
import sys
import json
import random
json.dump({
str(flight_id): {
"flight_id": flight_id,
"passengers": random.randint(50, 200),
"from": random.choice([
"MAN", "LON", "LHR", "ABZ", "AMS", "AUS", "BCN",
"BER", "BHX", "BRU", "CHI", "ORK", "DAL", "EDI",
]),
}
for flight_id in range(200)
}, sys.stdout)
We saw what happens when the threshold is too low:
$ python3 simple-fields.py | structa --field-threshold 2
{
str of int range=0..199 pattern="d": { str range="flight_id".."passengers": value }
}
What happens if the threshold is set too high, resulting in the outer mapping being treated as a (very large!) record?
$ python3 simple-fields.py | structa --field-threshold 300
{
str of int range=0..199 pattern="d": {
'flight_id': int range=0..199,
'from': str range="ABZ".."ORK" pattern="[A-EL-MO][A-EHMORU][IK-LNR-SUXZ]",
'passengers': int range=50..199
}
}
Curiously it seems to have worked happily anyway, although the pattern of the “from” field is now considerably more complex. The reasons for this are relatively complicated, but has to do with a later pass of structa’s algorithm merging common sub-structures of records. The merging process unfortunately handles certain things (like the merging of string field patterns) rather crudely.
Hence, while it’s generally safe to bump structa --field-threshold
up
quite high whenever you need to, be aware that it will:
significantly slow down analysis of large files (because the merging process is quite slow)
complicate the pattern analysis of repeated string fields and a few other things (e.g. string representations of date-times)
In other words, whenever you find yourself in a situation where you need to bump up the field threshold, a reasonable procedure to follow is:
Bump the threshold very high (e.g. 1000) and run the analysis with
structa --show-count
enabled.Run the analysis again with the field threshold set below the count of the outer container(s), but above the count of the inner record mappings
The first run will probably be quite slow, but the second run will be much faster and will produce better output.
API Reference
In addition to being a utility, structa can also be used as an API from Python (either in a script, or just at the console).
The primary class of interest will generally be
Analyzer
in the structa.analyzer
module, but
it is important to understand the various classes in the structa.types
module to interpret the output of the analyzer.
Modules
structa.analyzer
The structa.analyzer
module contains the Analyzer
class which
is the primary entry point for using structa’s as an API. It can be constructed
without any arguments, and the analyze()
method can be
immediately used to determine the structure of some data. The
merge()
method can be used to further refine the returned
structure, and measure()
can be used before-hand if you wish to
use the progress callback to track the progress of long analysis runs.
A typical example of basic usage would be:
from structa.analyzer import Analyzer
data = {
str(i): i
for i in range(1000)
}
an = Analyzer()
structure = an.analyze(data)
print(structure)
The structure returned by analyze()
(and by
merge()
) will be an instance of one of the classes in the
structa.types
module, all of which have sensible str
and
repr()
output.
A more complete example, using Source
to figure out
the source format and encoding:
from structa.analyzer import Analyzer
from structa.source import Source
from urllib.request import urlopen
with urlopen('https://usn.ubuntu.com/usn-db/database-all.json') as f:
src = Source(data)
an = Analyzer()
an.measure(src.data)
structure = an.analyze(src.data)
structure = an.merge(structure)
print(structure)
- class structa.analyzer.Analyzer(*, bad_threshold=Fraction(1, 50), empty_threshold=Fraction(49, 50), field_threshold=20, merge_threshold=Fraction(1, 2), max_numeric_len=30, strip_whitespace=False, min_timestamp=None, max_timestamp=None, progress=None)[source]
This class is the core of structa. The various keyword-arguments to the constructor correspond to the command line options (see Command Line Reference).
The
analyze()
method is the primary method for analysis, which simply accepts the data to be analyzed. Themeasure()
method can be used to perform some pre-processing for the purposes of progress reporting (useful with very large datasets), whilemerge()
can be used for additional post-processing to improve the analysis output.- Parameters
bad_threshold (numbers.Rational) – The proportion of data within a field (across repetitive structures) which is permitted to be invalid without affecting the type match. Primarily useful with string representations. Valid values are between 0 and 1.
empty_threshold (numbers.Rational) – The proportion of strings within a field (across repetitive structures) which can be blank without affecting the type match. Empty strings falling within this threshold will be discounted by the analysis. Valid values are between 0 and 1.
field_threshold (int) – The minimum number of fields in a mapping before it will be treated as a “table” (a mapping of keys to records) rather than a record (a mapping of fields to values). Valid values are any positive integer.
merge_threshold (numbers.Rational) – The proportion of fields within repetitive mappings that must match for the mappings to be considered “mergeable” by the
merge()
method. Note that the proportion is calculated with the length of the shorter mapping in the comparision. Valid values are between 0 and 1.strip_whitespace (bool) – If
True
, whitespace is stripped from all strings prior to any further analysis.min_timestamp (datetime.datetime or None) – The minimum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.
max_timestamp (datetime.datetime or None) – The maximum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.
progress (object or None) – If specificed, must be an object with
update
andreset
methods that will be called to provide progress feedback. Seeprogress
for further details.
- analyze(data)[source]
Given some value data (typically an iterable or a mapping), return a
Type
descendent describing its structure.
- measure(data)[source]
Given some value data (typically an iterable or mapping), measure the number of items within it, for the purposes of accurately reporting progress during the running of the
analyze()
andmerge()
methods.If this is not called prior to these methods, they will still run successfully, but progress tracking (via the
progress
object) will be inaccurate as the total number of steps to process will never be calculated.As measurement is itself a potentially lengthy process, progress will be reported as a function of the top-level items within data during the run of this method.
- merge(struct)[source]
Given some struct (as returned by
analyze()
), merge common sub-structures within it, returning the new top level structure (anotherType
instance).
- property progress
The object passed as the progress parameter on construction.
If this is not
None
, it must be an object which implements the following methods:reset(*, total: int=None)
update(n: int=None)
The “reset” method of the object will be called with either the keyword argument “total”, indicating the new number of steps that have yet to complete, or with no arguments indicating the progress display should be cleared as a task is complete.
The “update” method of the object will be called with either the number of steps to increment by (as the positional “n” argument), or with no arguments indicating that the display should simply be refreshed (e.g. to recalculate the time remaining, or update a time elapsed display).
It is no coincidence that this is a sub-set of the public API of the tqdm progress bar project (as that’s what structa uses in its CLI implementation).
structa.chars
The structa.chars
module provides classes and constants for defining
and manipulating character classes (in the sense of regular expressions).
The primary class of interest is CharClass
, but most uses can likely
be covered by the set of constants defined in the module.
- class structa.chars.CharClass(chars)[source]
A descendent of
frozenset
intended to represent a character class in a regular expression. Can be instantiated from any iterable of single characters (including astr
).All operations of
frozenset
are supported, but return instances ofCharClass
instead (and thus, are only valid for operations which result in sets containing individual character values). For example:>>> abc = CharClass('abc') >>> abc CharClass('abc') >>> ghi = CharClass('ghi') >>> abc == ghi False >>> abc < ghi False >>> abc | ghi CharClass('abcghi') >>> abc < abc | ghi True
- difference(*others)[source]
Return the difference of two or more sets as a new set.
(i.e. all elements that are in this set but not the others.)
- intersection(*others)[source]
Return the intersection of two sets as a new set.
(i.e. all elements that are in both sets.)
- class structa.chars.AnyChar[source]
A singleton class (all instances are the same) which represents any possible character. This is comparable with, and compatible in operations with, instances of
CharClass
. For instance:>>> abc = CharClass('abc') >>> any_ = AnyChar() >>> any_ AnyChar() >>> abc < any_ True >>> abc > any_ False >>> abc | any_ AnyChar()
- structa.chars.char_range(start, stop)[source]
Returns a
CharClass
containing all the characters from start to stop inclusive (in unicode codepoint order). For example:>>> char_range('a', 'c') CharClass('abc') >>> char_range('0', '9') CharClass('0123456789')
Constants
- structa.chars.oct_digit
Represents any valid digit in base 8 (octal).
- structa.chars.dec_digit
Represents any valid digit in base 10 (decimal).
- structa.chars.hex_digit
Represents any valid digit in base 16 (hexidecimal).
- structa.chars.ident_first
Represents any character which is valid as the first character of a Python identifier.
- structa.chars.ident_char
Represents any character which is valid within a Python identifier.
structa.collections
- class structa.collections.FrozenCounter(it)[source]
An immutable variant of the
collections.Counter
class from the Python standard library.This implements all readable properties and behaviours of the
collections.Counter
class, but excludes all methods and behaviours which permit modification of the counter. The resulting instances are hashable and can be used as keys in mappings.- classmethod from_counter(counter)[source]
Construct a
FrozenCounter
from acollections.Counter
instance. This is generally much faster than attempting to construct from the elements of an existing counter.The counter parameter must either be a
collections.Counter
instance, or aFrozenCounter
instance (in which case it is returned verbatim).
structa.conversions
- structa.conversions.try_conversion(sample, conversion, threshold=0)[source]
Given a
Counter
sample of strings, call the specified conversion on each string returning the set of converted values.conversion must be a callable that accepts a single string parameter and returns the converted value. If the conversion fails it must raise a
ValueError
exception.If threshold is specified (defaults to 0), it defines the number of “bad” conversions (which result in
ValueError
being raised) that will be ignored. If threshold is exceeded, thenValueError
will be raised (or rather passed through from the underlying conversion). Likewise, if threshold is not exceeded, but zero conversions are successful thenValueError
will also be raised.
- structa.conversions.parse_bool(s, false='0', true='1')[source]
Convert the string s (stripped and lower-cased) to a bool, if it matches either the false string (defaults to ‘0’) or true (defaults to ‘1’). If it matches neither, raises a
ValueError
.
- structa.conversions.parse_duration(s)[source]
Convert the string s to a
relativedelta
. The string must consist of white-space and/or comma separated values which are a number followed by a suffix indicating duration. For example:>>> parse_duration('1s') relativedelta(seconds=+1) >>> parse_duration('5 minutes, 30 seconds') relativedelta(minutes=+5, seconds=+30) >>> parse_duration('1 year') relativedelta(years=+1)
Note that some suffixes like “m” can be ambiguous; using common abbreviations should avoid ambiguity:
>>> parse_duration('1 m') relativedelta(months=+1) >>> parse_duration('1 min') relativedelta(minutes=+1) >>> parse_duration('1 mon') relativedelta(months=+1)
The set of possible durations, and their recognized suffixes is as follows:
Microseconds: microseconds, microsecond, microsec, micros, micro, mseconds, msecond, msecs, msec, ms
Seconds: seconds, second, secs, sec, s
Minutes: minutes, minute, mins, min, mi
Hours: hours, hour, hrs, hr, h
Days: days, day, d
Weeks: weeks, week, wks, wk, w
Months: months, month, mons, mon, mths, mth, m
Years: years, year, yrs, yr, y
If conversion fails,
ValueError
is raised.
- structa.conversions.parse_duration_or_timestamp(s)[source]
Convert the string s to a
datetime
or arelativedelta
. Duration conversion is attempted to and, if this fails, date-time conversion is attempted. AValueError
is raised if both conversions fail.
structa.errors
The structa.errors
module defines all the custom exception and warning
classes used in structa.
structa.format
The structa.format
module contains various simple routines for “nicely”
formatting certain structures for output.
- structa.format.format_chars(chars, range_sep='-', list_sep='')[source]
Given a set of chars, returns a compressed string representation of all values in the set. For example:
>>> char_ranges({'a', 'b'}) 'ab' >>> char_ranges({'a', 'b', 'c'}) 'a-c' >>> char_ranges({'a', 'b', 'c', 'd', 'h'}) 'a-dh' >>> char_ranges({'a', 'b', 'c', 'd', 'h', 'i'}) 'a-dh-i'
range_sep and list_sep can be optionally specified to customize the strings used to separate ranges and lists of ranges respectively.
- structa.format.format_int(i)[source]
Reduce i by some appropriate power of 1000 and suffix it with an appropriate Greek qualifier (K for kilo, M for mega, etc). For example:
>>> format_int(0) '0' >>> format_int(10) '10' >>> format_int(1000) '1.0K' >>> format_int(1600) '1.6K' >>> format_int(2**32) '4.3G'
- structa.format.format_repr(self, **override)[source]
Returns a
repr()
style string for self in the formclass(name=value, name=value, ...)
.Note
At present, this function does not handle recursive structures unlike
reprlib.recursive_repr()
.
- structa.format.format_sample(value)[source]
Format a scalar value for output. The value can be a
str
,int
,float
,bool
,datetime
, orNone
.The result is a
str
containing a “nicely” formatted representation of the value. For example:>>> format_sample(1.0) '1' >>> format_sample(1.5) '1.5' >>> format_sample(200000000000) '200.0G' >>> format_sample(200000000000.0) '2e+11' >>> format_sample(None) 'null' >>> format_sample(False) 'false' >>> format_sample('foo') '"foo"' >>> format_sample(datetime.now()) '2021-08-16 14:05:04'
structa.source
- class structa.source.Source(source, *, encoding='auto', encoding_strict=True, format='auto', csv_delimiter='auto', csv_quotechar='auto', yaml_safe=True, json_strict=True, sample_limit=1048576)[source]
A generalized data source capable of automatically recognizing certain popular data formats, and guessing character encodings. Constructed with a mandatory file-like object as the source, and a multitude of keyword-only options, the decoded content can be access from
data
The source must have a
read()
method which, given a number of bytes to return, returns abytes
string up to that length, but has no requirements beyond this. Note that this means files over sockets or pipes are acceptable inputs.- Parameters
source (file) – The file-like object to decode (must have a
read
method).encoding (str) – The character encoding used in the source, or “auto” (the default) if it should be guessed from a sample of the data.
encoding_strict (bool) – If
True
(the default), raise an exception if character decoding errors occur. Otherwise, replace invalid characters silently.format (str) – If “auto” (the default), guess the format of the data source. Otherwise can be explicitly set to “csv”, “yaml”, or “json” to force parsing of that format.
csv_delimiter (str) – If “auto” (the default), attempt to guess the field delimiter when the “csv” format is being decoded using the
csv.Sniffer
class. Comma, semi-colon, space, and tab characters will be attempted. Otherwise must be set to the single characterstr
used as the field delimiter (e.g. “,”).csv_quotechar (str) – If “auto” (the default), attempt to guess the string delimiter when the “csv” format is being decoded using the
csv.Sniffer
class. Otherwise must be set to the single characterstr
used as the string delimiter (e.g. ‘”’).yaml_safe (bool) – If
True
(the default) the “safe” YAML parser from ruamel.yaml will be used.json_strict (bool) – If
True
(the default), control characters will not be permitted inside decoded strings.sample_limit (int) – The number of bytes to sample from the beginning of the stream when attempting to determine character encoding. Defaults to 1MB.
- property csv_dialect
The
csv.Dialect
used whenformat
is “csv”, orNone
otherwise.
- property data
The decoded data. Typically a
list
ordict
of values, but can be any value representable in the source format.
- property encoding
The character encoding detected or specified for the source, e.g. “utf-8”.
- property format
The data format detected or specified for the source, e.g. “csv”, “yaml”, or “json”.
structa.types
The structa.types
module defines the class hierarchy used to represent
the structural types of analyzed data. The root of the hierarchy is the
Type
class. The rest of the hierarchy is illustrated in the chart
below:
- class structa.types.Type[source]
The abstract base class of all types recognized by structa.
This class ensures that instances are hashable (can be used as keys in dictionaries), have a reasonable
repr()
value for ease of use at the REPL, can be passed to thexml()
function.However, the most important thing implemented by this base class is the equality test which can be used to test whether a given type is “compatible” with another type. The base test implemented at this level is that one type is compatible with another if one is a sub-class of the other.
Hence,
Str
is compatible withStr
as they are the same class (and hence one is, redundantly, a sub-class of the other). AndInt
is compatible withFloat
as it is a sub-class of the latter. HoweverInt
is not compatbile withStr
as both descend fromScalar
and are siblings rather than parent-child.
- class structa.types.Container(sample, content=None)[source]
Abstract base of all types that can contain other types. Constructed with a sample of values, and an optional definition of content.
This is the base class of
List
,Tuple
, andDict
. Note that it is not the base class ofStr
as, although that is a compound type, it cannot contain other types; structa treatsStr
as a scalar type.Container
extendsType
by permitting instances to be added to (compatible, by equality) instances, combining theircontent
appropriately.
- class structa.types.Dict(sample, content=None, *, similarity_threshold=0.5)[source]
Represents mappings (or dictionaries).
This concrete refinement of
Container
usesDictField
instances in itscontent
list.In the case that a mapping is analyzed as a “record” mapping (of fields to values), the
content
list will contain one or moreDictField
instances, for which thekey
attribute(s) will beField
instances.However, if the mapping is analyzed as a “table” mapping (of keys to records), the
content
list will contain a singleDictField
instance mapping the key’s type to the value structure.
- class structa.types.Tuple(sample, content=None)[source]
Represents sequences of heterogeneous types (typically tuples).
This concrete refinement of
Container
usesTupleField
instances in itscontent
list.Tuples are typically the result of an analysis of some homogeneous outer sequence (usually a
List
though sometimes aDict
) that contains heterogeneous sequences (theTuple
instance).
- class structa.types.List(sample, content=None)[source]
Represents sequences of homogeneous types. This only ever has a single
Type
descendent in itscontent
list.
- class structa.types.DictField(key, value=None)[source]
Represents a single mapping within a
Dict
, from thekey
to its correspondingvalue
. For example, aField
of a record mapping to some other type, or a genericStr
mapping to anInt
value.
- class structa.types.TupleField(index, value=None)[source]
Represents a single field within a
Tuple
, with theindex
(an integer number) and its correspondingvalue
.
- class structa.types.Scalar(sample)[source]
Abstract base of all types that cannot contain other types. Constructed with a sample of values.
This is the base class of
Float
(from whichInt
and thenBool
descend),Str
, andDateTime
.- property sample
A sequence of the sample values that the instance was constructed from (this will not be the original sequence, but one derived from that).
- class structa.types.Float(sample)[source]
Represents scalar floating-point values in datasets. Constructed with a sample of values.
- classmethod from_strings(sample, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a
StrRepr
to indicate a string representation of a set of floating-point values. Constructed with an sample of strings, a pattern (which currently must simply be “f”), and a bad_threshold of values which are permitted to fail conversion.
- class structa.types.Int(sample)[source]
Represents scalar integer values in datasets. Constructed with a sample of values.
- classmethod from_strings(sample, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a
StrRepr
to indicate a string representation of a set of integer values. Constructed with an sample of strings, a pattern (which may be “d”, “o”, or “x” to represent the base used in the string representation), and a bad_threshold of values which are permitted to fail conversion.
- class structa.types.Bool(sample)[source]
Represents scalar boolean values in datasets. Constructed with a sample of values.
- classmethod from_strings(iterable, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a
StrRepr
to indicate a string representation of a set of booleans. Constructed with an sample of strings, a pattern (which is a string of the form “false|true”, i.e. the expected string representations of theFalse
andTrue
values separated by a bar), and a bad_threshold of values which are permitted to fail conversion.
- class structa.types.DateTime(sample)[source]
Represents scalar timestamps (a date, and a time) in datasets. Constructed with a sample of values.
- classmethod from_numbers(pattern)[source]
Class method for constructing an instance wrapped in a
NumRepr
to indicate a numeric representation of a set of timestamps (e.g. day offset from the UNIX epoch).Constructed with an sample of number, a pattern (which can be a
StrRepr
instance if the numbers are themselves represented as strings, otherwise must be theInt
orFloat
instance representing the numbers), and a bad_threshold of values which are permitted to fail conversion.
- classmethod from_strings(iterable, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a
StrRepr
to indicate a string representation of a set of timestamps.Constructed with an sample of strings, a pattern (which must be compatible with
datetime.datetime.strptime()
), and a bad_threshold of values which are permitted to fail conversion.
- validate(value)[source]
Validate that value (which must be a
datetime
) lies within the range of sampled values.- Raises
TypeError – if value is not a
datetime.datetime
ValueError – if value is outside the range of sampled values
- class structa.types.Str(sample, pattern=None)[source]
Represents string values in datasets. Constructed with a sample of values, and an optional pattern (a sequence of
CharClass
instances indicating which characters are valid at which position in fixed-length strings).- pattern: [structa.chars.CharClass]
None
if the string is variable length or has no discernable pattern to its values. Otherwise a sequence ofCharClass
instances indicating the valid characters at each position of the string.
- class structa.types.Repr(content, pattern=None)[source]
Abstract base class for representations (string, numeric) of other types. Parent of
StrRepr
andNumRepr
.
- class structa.types.StrRepr(content, pattern=None)[source]
A string representation of an inner type. Typically used to wrap
Int
,Float
,Bool
, orDateTime
. Descends fromRepr
.
- class structa.types.NumRepr(content, pattern=None)[source]
A numeric representation of an inner type. Typically used to wrap
DateTime
. Descends fromRepr
.
- class structa.types.URL(sample, pattern=None)[source]
A specialization of
Str
for representing URLs. Currently does little more than trivial validation of the scheme.- validate(value)[source]
Validate that value starts with “http://” or “https://”
- Raises
ValueError – if value does not start with a valid scheme
- class structa.types.Field(value, count, optional=False)[source]
Represents a single key in a
DictField
mapping. This is used by the analyzer when it decides a mapping represents a “record” (a mapping of fields to values) rather than a “table” (a mapping of keys to records).Constructed with the value of the key, the count of mappings that the key appears in, and a flag indicating if the key is optional (defaults to
False
for mandatory).- optional: bool
If
True
, the key may be ommitted from certain mappings in the data. IfFalse
(the default), the key always appears in the owning mapping.
- validate(value)[source]
Validates that value matches the expected key value.
- Raises
ValueError – if value does not match the expected value
- class structa.types.Value(sample)[source]
A descendent of
Type
that represents any arbitrary type at all. This is used when the analyzer comes across a container of a multitude of (incompatible) types, e.g. a list of both strings and integers.It compares equal to all other types, and when added to other types, the result is a new
Value
instance.
- class structa.types.Empty[source]
A descendent of
Type
that represents a container with no content. For example, if the analyzer comes across a field which always contains an empty list, it would be represented as aList
instance whereList.content
was a sequence containing anEmpty
instance.It compares equal to all other types, and when added to other types, the result is the other type. This allows the merge phase to combine empty lists with a list of integers found at the same level, for example.
- validate(value)[source]
Trivial validation; always passes.
Note
This counter-intuitive behaviour is because the
Empty
value indicates a lack of type-information rather than a definitely empty container (after all, there’s usually little sense in having a container field which will always be empty in most hierarchical structures).The way this differs from
Value
is in the additive action.
- class structa.types.Stats(sample, card, min, q1, q2, q3, max)[source]
Stores cardinality, minimum, maximum, and (high) median of a sample of numeric values (or lengths of strings or containers), along with the specified sample of values.
Typically instances of this class are constructed via the
from_sample()
orfrom_lengths()
class methods rather than directly. However, instances can also be added to other instances to generate statistics for the combined sample set. Instances may also be compared for equality.- q2: int | float | str | datetime.datetime | ...
- sample: structa.collections.FrozenCounter
The sample data that the statistics were calculated from. This is always an instance of
FrozenCounter
.
- classmethod from_lengths(sample)[source]
Given an iterable of sample values, which must be of a homogeneous compound type (e.g.
str
,tuple
), construct an instance after calculating thelen()
of each item of the sample, and then the minimum, maximum, and quartile values of the lengths.
structa.xml
The structa.xml
module provides methods for generating and manipulating
XML, primarily in the form of xml.etree.ElementTree
objects. The main
class of interest is ElementFactory
, which can be used to generate
entire element-tree documents in a functional manner.
The xml()
function can be used in a similar manner to str
or
repr()
to generate XML representations of supported objects (most
classes within structa.types
support this). Finally,
get_transform()
can be used to obtain XSLT trees defined by structa
(largely for display purposes).
- class structa.xml.ElementFactory(namespace=None)[source]
A class inspired by Genshi for easy creation of ElementTree Elements.
The ElementFactory class was inspired by the Genshi builder unit in that it permits simple creation of Elements by calling methods on the tag object named after the element you wish to create. Positional arguments become content within the element, and keyword arguments become attributes.
If you need an attribute or element tag that conflicts with a Python keyword, simply append an underscore to the name (which will be automatically stripped off).
Content can be just about anything, including booleans, integers, longs, dates, times, etc. This class simply applies their default string conversion to them (except basestring derived types like string and unicode which are simply used verbatim).
For example:
>>> tostring(tag.a('A link')) '<a>A link</a>' >>> tostring(tag.a('A link', class_='menuitem')) '<a class="menuitem">A link</a>' >>> tostring(tag.p('A ', tag.a('link', class_='menuitem'))) '<p>A <a class="menuitem">link</a></p>'
- structa.xml.xml(obj)[source]
In a similar manner to
str
, this function calls the__xml__
method (if any) on obj, returning the result which is expected to be anElement
instance representing the object.
- structa.xml.get_transform(name)[source]
Return the XSLT transform defined by name in the
structa.ui
module.
- structa.xml.merge_siblings(elem)[source]
Consolidate the content of adjacent sibling child elements with the same tag. For example:
>>> x = XML('<doc><a>a</a><a>b</a><a>c</a><b>d</b><a>e</a></doc>') >>> tostring(merge_siblings(x)) b'<doc><a>abc</a><b>d</b><a>e</a></doc>'
Note that the function only deals with direct child elements of elem; it does nothing to descendents of those children, even if they have the same tag as their parent:
>>> x = XML('<doc><a>a<a>b</a></a><a>c</a><b>d</b><a>e</a></doc>') >>> tostring(merge_siblings(x)) b'<doc><a>a<a>b</a>c</a><b>d</b><a>e</a></doc>'
Development
The main GitHub repository for the project can be found at:
The project is currently in its early stages, but is quite useable and the documentation, while incomplete, should be useful to both users and developers wishing to hack on the project itself. The test suite is also nearing full coverage.
Development installation
If you wish to develop structa, obtain the source by cloning the GitHub repository and then use the “develop” target of the Makefile which will install the package as a link to the cloned repository allowing in-place development. The following example demonstrates this method within a virtual Python environment:
$ sudo apt install build-essential git virtualenvwrapper
After installing virtualenvwrapper
you’ll need to restart your shell before
commands like mkvirtualenv will operate correctly. Once you’ve
restarted your shell, continue:
$ cd
$ mkvirtualenv -p /usr/bin/python3 structa
$ workon structa
(structa) $ git clone https://github.com/waveform80/structa.git
(structa) $ cd structa
(structa) $ make develop
To pull the latest changes from git into your clone and update your installation:
$ workon structa
(structa) $ cd ~/structa
(structa) $ git pull
(structa) $ make develop
To remove your installation, destroy the sandbox and the clone:
(structa) $ deactivate
$ rmvirtualenv structa
$ rm -rf ~/structa
Building the docs
If you wish to build the docs, you’ll need a few more dependencies. Inkscape is used for conversion of SVGs to other formats, Graphviz is used for rendering certain charts, and TeX Live is required for building PDF output. The following command should install all required dependencies:
$ sudo apt install texlive-latex-recommended texlive-latex-extra \
texlive-fonts-recommended texlive-xetex graphviz inkscape \
python3-sphinx python3-sphinx-rtd-theme latexmk xindy
Once these are installed, you can use the “doc” target to build the documentation in all supported formats (HTML, ePub, and PDF):
$ workon structa
(structa) $ cd ~/structa
(structa) $ make doc
However, the easiest way to develop the documentation is with the “preview” target which will build the HTML version of the docs, and start a web-server to preview the output. The web-server will then watch for source changes (in both the documentation source, and the application’s source) and rebuild the HTML automatically as required:
$ workon structa
(structa) $ cd ~/structa
(structa) $ make preview
The HTML output is written to build/html
while the PDF output
goes to build/latex
.
Test suite
If you wish to run the structa test suite, follow the instructions in Development installation above and then make the “test” target within the sandbox:
$ workon structa
(structa) $ cd ~/structa
(structa) $ make test
The test suite is also setup for usage with the tox utility, in which case it will attempt to execute the test suite with all supported versions of Python. If you are developing under Ubuntu you may wish to look into the Dead Snakes PPA in order to install old/new versions of Python; the tox setup should work with the version of tox shipped with Ubuntu Focal, but more features (like parallel test execution) are available with later versions.
For example, to execute the test suite under tox, skipping interpreter versions which are not installed:
$ tox -s
To execute the test suite under all installed interpreter versions in parallel, using as many parallel tasks as there are CPUs, then displaying a combined report of coverage from all environments:
$ tox -p auto -s
$ coverage combine .coverage.py*
$ coverage report
Changelog
Release 0.3 (2021-10-27)
Fixed dictionary merging of scalar and field keys (#19)
Wrote full documentation including tutorials and API reference
Lots of other minor fixes …
Release 0.2.1 (2021-08-17 … later)
It’d help if you included the XSL for the UI …
Release 0.2 (2021-08-17)
Release 0.1 (2018-12-17)
Initial commit of something that works … ish
License
This file is part of structa.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA or see <https://www.gnu.org/licenses/>.