structa.analyzer
The structa.analyzer
module contains the Analyzer
class which
is the primary entry point for using structa’s as an API. It can be constructed
without any arguments, and the analyze()
method can be
immediately used to determine the structure of some data. The
merge()
method can be used to further refine the returned
structure, and measure()
can be used before-hand if you wish to
use the progress callback to track the progress of long analysis runs.
A typical example of basic usage would be:
from structa.analyzer import Analyzer
data = {
str(i): i
for i in range(1000)
}
an = Analyzer()
structure = an.analyze(data)
print(structure)
The structure returned by analyze()
(and by
merge()
) will be an instance of one of the classes in the
structa.types
module, all of which have sensible str
and
repr()
output.
A more complete example, using Source
to figure out
the source format and encoding:
from structa.analyzer import Analyzer
from structa.source import Source
from urllib.request import urlopen
with urlopen('https://usn.ubuntu.com/usn-db/database-all.json') as f:
src = Source(data)
an = Analyzer()
an.measure(src.data)
structure = an.analyze(src.data)
structure = an.merge(structure)
print(structure)
- class structa.analyzer.Analyzer(*, bad_threshold=Fraction(1, 50), empty_threshold=Fraction(49, 50), field_threshold=20, merge_threshold=Fraction(1, 2), max_numeric_len=30, strip_whitespace=False, min_timestamp=None, max_timestamp=None, progress=None)[source]
This class is the core of structa. The various keyword-arguments to the constructor correspond to the command line options (see Command Line Reference).
The
analyze()
method is the primary method for analysis, which simply accepts the data to be analyzed. Themeasure()
method can be used to perform some pre-processing for the purposes of progress reporting (useful with very large datasets), whilemerge()
can be used for additional post-processing to improve the analysis output.- Parameters
bad_threshold (numbers.Rational) – The proportion of data within a field (across repetitive structures) which is permitted to be invalid without affecting the type match. Primarily useful with string representations. Valid values are between 0 and 1.
empty_threshold (numbers.Rational) – The proportion of strings within a field (across repetitive structures) which can be blank without affecting the type match. Empty strings falling within this threshold will be discounted by the analysis. Valid values are between 0 and 1.
field_threshold (int) – The minimum number of fields in a mapping before it will be treated as a “table” (a mapping of keys to records) rather than a record (a mapping of fields to values). Valid values are any positive integer.
merge_threshold (numbers.Rational) – The proportion of fields within repetitive mappings that must match for the mappings to be considered “mergeable” by the
merge()
method. Note that the proportion is calculated with the length of the shorter mapping in the comparision. Valid values are between 0 and 1.strip_whitespace (bool) – If
True
, whitespace is stripped from all strings prior to any further analysis.min_timestamp (datetime.datetime or None) – The minimum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.
max_timestamp (datetime.datetime or None) – The maximum timestamp to use when determining whether floating point values potentially represent epoch-based datetime values.
progress (object or None) – If specificed, must be an object with
update
andreset
methods that will be called to provide progress feedback. Seeprogress
for further details.
- analyze(data)[source]
Given some value data (typically an iterable or a mapping), return a
Type
descendent describing its structure.
- measure(data)[source]
Given some value data (typically an iterable or mapping), measure the number of items within it, for the purposes of accurately reporting progress during the running of the
analyze()
andmerge()
methods.If this is not called prior to these methods, they will still run successfully, but progress tracking (via the
progress
object) will be inaccurate as the total number of steps to process will never be calculated.As measurement is itself a potentially lengthy process, progress will be reported as a function of the top-level items within data during the run of this method.
- merge(struct)[source]
Given some struct (as returned by
analyze()
), merge common sub-structures within it, returning the new top level structure (anotherType
instance).
- property progress
The object passed as the progress parameter on construction.
If this is not
None
, it must be an object which implements the following methods:reset(*, total: int=None)
update(n: int=None)
The “reset” method of the object will be called with either the keyword argument “total”, indicating the new number of steps that have yet to complete, or with no arguments indicating the progress display should be cleared as a task is complete.
The “update” method of the object will be called with either the number of steps to increment by (as the positional “n” argument), or with no arguments indicating that the display should simply be refreshed (e.g. to recalculate the time remaining, or update a time elapsed display).
It is no coincidence that this is a sub-set of the public API of the tqdm progress bar project (as that’s what structa uses in its CLI implementation).