structa.analyzer

The structa.analyzer module contains the Analyzer class, which is the primary entry point for using structa as an API. It can be constructed without any arguments, and the analyze() method can be used immediately to determine the structure of some data. The merge() method can be used to further refine the returned structure, and measure() can be called beforehand if you wish to use the progress callback to track long analysis runs.

A typical example of basic usage would be:

from structa.analyzer import Analyzer

data = {
    str(i): i
    for i in range(1000)
}
an = Analyzer()
structure = an.analyze(data)
print(structure)

The structure returned by analyze() (and by merge()) will be an instance of one of the classes in the structa.types module, all of which have sensible str() and repr() output.

A more complete example, using Source to figure out the source format and encoding:

from structa.analyzer import Analyzer
from structa.source import Source
from urllib.request import urlopen

with urlopen('https://usn.ubuntu.com/usn-db/database-all.json') as f:
    src = Source(f)
    an = Analyzer()
    an.measure(src.data)
    structure = an.analyze(src.data)
    structure = an.merge(structure)
    print(structure)

class structa.analyzer.Analyzer(*, bad_threshold=Fraction(1, 50), empty_threshold=Fraction(49, 50), field_threshold=20, merge_threshold=Fraction(1, 2), max_numeric_len=30, strip_whitespace=False, min_timestamp=None, max_timestamp=None, progress=None)

This class is the core of structa. The keyword arguments to the constructor correspond to the command line options (see Command Line Reference).

The analyze() method is the primary method for analysis, which simply accepts the data to be analyzed. The measure() method can be used to perform some pre-processing for the purposes of progress reporting (useful with very large datasets), while merge() can be used for additional post-processing to improve the analysis output.

Parameters
  • bad_threshold (numbers.Rational) – The proportion of data within a field (across repetitive structures) which is permitted to be invalid without affecting the type match. Primarily useful with string representations. Valid values are between 0 and 1.

  • empty_threshold (numbers.Rational) – The proportion of strings within a field (across repetitive structures) which can be blank without affecting the type match. Empty strings falling within this threshold will be discounted by the analysis. Valid values are between 0 and 1.

  • field_threshold (int) – The minimum number of fields in a mapping before it will be treated as a “table” (a mapping of keys to records) rather than a record (a mapping of fields to values). Valid values are any positive integer.

  • merge_threshold (numbers.Rational) – The proportion of fields within repetitive mappings that must match for the mappings to be considered “mergeable” by the merge() method. Note that the proportion is calculated against the length of the shorter mapping in the comparison. Valid values are between 0 and 1.

  • strip_whitespace (bool) – If True, whitespace is stripped from all strings prior to any further analysis.

  • min_timestamp (datetime.datetime or None) – The minimum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.

  • max_timestamp (datetime.datetime or None) – The maximum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.

  • progress (object or None) – If specified, must be an object with update and reset methods that will be called to provide progress feedback. See progress for further details.
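To make the threshold arithmetic above concrete, the following is a minimal sketch of how bad_threshold and merge_threshold could be evaluated. It does not use structa and is not its implementation: the helper names (within_bad_threshold, match_proportion, looks_like_int) are hypothetical, and the choice of an inclusive comparison is an assumption made for illustration.

```python
from fractions import Fraction


def within_bad_threshold(values, is_valid, bad_threshold=Fraction(1, 50)):
    """Return True if the share of invalid values does not exceed the threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    values = list(values)
    if not values:
        return True
    bad = sum(1 for v in values if not is_valid(v))
    return Fraction(bad, len(values)) <= bad_threshold


def match_proportion(a, b):
    """Share of keys common to both mappings, relative to the shorter mapping."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return Fraction(0)
    return Fraction(len(a.keys() & b.keys()), shorter)


def looks_like_int(s):
    """Return True if the string s can be parsed as an integer."""
    try:
        int(s)
    except ValueError:
        return False
    return True


# 1 invalid value out of 100 gives 1/100, which is within the default of 1/50
samples = [str(i) for i in range(99)] + ['n/a']
print(within_bad_threshold(samples, looks_like_int))

# 2 of the 3 keys in the shorter mapping also appear in the longer one
short = {'id': 1, 'name': 'a', 'extra': None}
long_ = {'id': 2, 'name': 'b', 'size': 3, 'colour': 'red'}
print(match_proportion(short, long_))
```

Under these assumptions, the second result would then be compared with merge_threshold to decide whether the two mappings are candidates for merging.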

analyze(data)

Given some value data (typically an iterable or a mapping), return a Type descendant describing its structure.

measure(data)

Given some value data (typically an iterable or mapping), measure the number of items within it, for the purposes of accurately reporting progress during the running of the analyze() and merge() methods.

If this is not called prior to these methods, they will still run successfully, but progress tracking (via the progress object) will be inaccurate as the total number of steps to process will never be calculated.

As measurement is itself a potentially lengthy process, progress will be reported as a function of the top-level items within data during the run of this method.

merge(struct)

Given some struct (as returned by analyze()), merge common sub-structures within it, returning the new top level structure (another Type instance).

property progress

The object passed as the progress parameter on construction.

If this is not None, it must be an object which implements the following methods:

  • reset(*, total: int=None)

  • update(n: int=None)

The “reset” method of the object will be called with either the keyword argument “total”, indicating the new number of steps that have yet to complete, or with no arguments indicating the progress display should be cleared as a task is complete.

The “update” method of the object will be called with either the number of steps to increment by (as the positional “n” argument), or with no arguments indicating that the display should simply be refreshed (e.g. to recalculate the time remaining, or update a time elapsed display).

It is no coincidence that this is a subset of the public API of the tqdm progress bar project (as that’s what structa uses in its CLI implementation).
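For callers who do not want to depend on tqdm, the two-method interface described above can be met by a very small class. The following is a minimal sketch: the class name PrintProgress and its printed messages are hypothetical, and only the reset and update signatures are taken from the description above.

```python
class PrintProgress:
    """Minimal progress reporter exposing the reset/update interface."""

    def __init__(self):
        self.total = None
        self.done = 0

    def reset(self, *, total=None):
        # Called with total=N when a task starts, or with no arguments
        # when a task is complete and the display should be cleared
        self.total = total
        self.done = 0
        if total is None:
            print('progress cleared')

    def update(self, n=None):
        # Called with a step count to advance, or with no arguments
        # to simply refresh the display
        if n is not None:
            self.done += n
        if self.total:
            print(f'{self.done}/{self.total} steps')


# Exercise the object in the order the analyzer is described as calling it
progress = PrintProgress()
progress.reset(total=10)
progress.update(4)
progress.update()
progress.reset()
```

With structa installed, an instance of such a class would presumably be supplied as the progress keyword argument when constructing the Analyzer, with measure() called before analyze() so that a meaningful total is available.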