structa.types
The structa.types module defines the class hierarchy used to represent the structural types of analyzed data. The root of the hierarchy is the Type class. The rest of the hierarchy is illustrated in the chart below:
- class structa.types.Type[source]
The abstract base class of all types recognized by structa.
This class ensures that instances are hashable (can be used as keys in dictionaries), have a reasonable repr() value for ease of use at the REPL, and can be passed to the xml() function.
However, the most important thing implemented by this base class is the equality test, which can be used to test whether a given type is “compatible” with another type. The base test implemented at this level is that one type is compatible with another if one is a sub-class of the other.
Hence, Str is compatible with Str as they are the same class (and hence one is, redundantly, a sub-class of the other). And Int is compatible with Float as it is a sub-class of the latter. However, Int is not compatible with Str as both descend from Scalar and are siblings rather than parent-child.
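The sub-class rule described above can be illustrated with a small stand-alone model. This is only a sketch: the classes below are hypothetical stand-ins that mirror the documented names and do not import or reproduce structa's actual implementation.

```python
# Toy model of "compatible if one class is a sub-class of the other".
# These classes are illustrative stand-ins, not structa's own types.

class Type:
    def __eq__(self, other):
        if not isinstance(other, Type):
            return NotImplemented
        # Compatible when either instance's class derives from the other's
        return isinstance(self, type(other)) or isinstance(other, type(self))

    def __hash__(self):
        # Equal objects must hash equally; a Type() instance compares equal
        # to every other type here, so a constant hash is the safe choice
        return hash(Type)

    def __repr__(self):
        return f"{type(self).__name__}()"


class Scalar(Type):
    pass


class Float(Scalar):
    pass


class Int(Float):
    pass


class Str(Scalar):
    pass


print(Str() == Str())    # True: same class
print(Int() == Float())  # True: Int is a sub-class of Float
print(Int() == Str())    # False: neither is a sub-class of the other
```

Because equality here means "compatible" rather than "identical", the hash has to be coarse enough that compatible instances never hash differently; the constant hash in the sketch is the simplest way to guarantee that.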
- class structa.types.Container(sample, content=None)[source]
Abstract base of all types that can contain other types. Constructed with a sample of values, and an optional definition of content.
This is the base class of List, Tuple, and Dict. Note that it is not the base class of Str as, although that is a compound type, it cannot contain other types; structa treats Str as a scalar type.
Container extends Type by permitting instances to be added to (compatible, by equality) instances, combining their content appropriately.
- class structa.types.Dict(sample, content=None, *, similarity_threshold=0.5)[source]
Represents mappings (or dictionaries).
This concrete refinement of Container uses DictField instances in its content list.
In the case that a mapping is analyzed as a “record” mapping (of fields to values), the content list will contain one or more DictField instances, for which the key attribute(s) will be Field instances.
However, if the mapping is analyzed as a “table” mapping (of keys to records), the content list will contain a single DictField instance mapping the key’s type to the value structure.
- class structa.types.Tuple(sample, content=None)[source]
Represents sequences of heterogeneous types (typically tuples).
This concrete refinement of Container uses TupleField instances in its content list.
Tuples are typically the result of an analysis of some homogeneous outer sequence (usually a List though sometimes a Dict) that contains heterogeneous sequences (the Tuple instance).
- class structa.types.List(sample, content=None)[source]
Represents sequences of homogeneous types. This only ever has a single Type descendant in its content list.
- class structa.types.DictField(key, value=None)[source]
Represents a single mapping within a Dict, from the key to its corresponding value. For example, a Field of a record mapping to some other type, or a generic Str mapping to an Int value.
- class structa.types.TupleField(index, value=None)[source]
Represents a single field within a Tuple, with the index (an integer number) and its corresponding value.
- class structa.types.Scalar(sample)[source]
Abstract base of all types that cannot contain other types. Constructed with a sample of values.
This is the base class of Float (from which Int and then Bool descend), Str, and DateTime.
- property sample
A sequence of the sample values that the instance was constructed from (this will not be the original sequence, but one derived from that).
- class structa.types.Float(sample)[source]
Represents scalar floating-point values in datasets. Constructed with a sample of values.
- classmethod from_strings(sample, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of floating-point values. Constructed with a sample of strings, a pattern (which currently must simply be “f”), and a bad_threshold of values which are permitted to fail conversion.
- class structa.types.Int(sample)[source]
Represents scalar integer values in datasets. Constructed with a sample of values.
- classmethod from_strings(sample, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of integer values. Constructed with a sample of strings, a pattern (which may be “d”, “o”, or “x” to represent the base used in the string representation), and a bad_threshold of values which are permitted to fail conversion.
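To make the roles of pattern and bad_threshold concrete, here is a minimal sketch of converting a sample of strings to integers in a given base while tolerating a limited number of failures. The function name parse_ints is hypothetical, and treating bad_threshold as an absolute count of permitted failures is an assumption; structa's own conversion logic may differ.

```python
# Hypothetical helper, not part of structa: converts strings to integers
# using the base implied by the pattern ("d", "o" or "x").
BASES = {"d": 10, "o": 8, "x": 16}


def parse_ints(sample, pattern, bad_threshold=0):
    base = BASES[pattern]
    good = []
    bad = 0
    for text in sample:
        try:
            good.append(int(text, base))
        except ValueError:
            bad += 1
            # Assumption: bad_threshold is a count of tolerated failures
            if bad > bad_threshold:
                raise ValueError(
                    f"more than {bad_threshold} values failed conversion"
                )
    return good


# "zz" is not valid hexadecimal, but one failure is tolerated here
print(parse_ints(["ff", "10", "zz"], "x", bad_threshold=1))  # [255, 16]
```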
- class structa.types.Bool(sample)[source]
Represents scalar boolean values in datasets. Constructed with a sample of values.
- classmethod from_strings(iterable, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of booleans. Constructed with a sample of strings, a pattern (which is a string of the form “false|true”, i.e. the expected string representations of the False and True values separated by a bar), and a bad_threshold of values which are permitted to fail conversion.
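The “false|true” pattern format can be illustrated with a short sketch that splits the pattern on the bar and uses the two halves as a lookup table. The function name parse_bools is hypothetical, exact (case-sensitive) matching is an assumption, and bad_threshold is again assumed to be a count of tolerated failures.

```python
# Hypothetical helper, not part of structa: maps strings to booleans
# according to a "false|true" style pattern.
def parse_bools(sample, pattern, bad_threshold=0):
    false_text, true_text = pattern.split("|")
    lookup = {false_text: False, true_text: True}
    good = []
    bad = 0
    for text in sample:
        if text in lookup:
            good.append(lookup[text])
        else:
            bad += 1
            # Assumption: bad_threshold is a count of tolerated failures
            if bad > bad_threshold:
                raise ValueError(
                    f"more than {bad_threshold} values failed conversion"
                )
    return good


print(parse_bools(["yes", "no", "yes"], "no|yes"))  # [True, False, True]
```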
- class structa.types.DateTime(sample)[source]
Represents scalar timestamps (a date, and a time) in datasets. Constructed with a sample of values.
- classmethod from_numbers(pattern)[source]
Class method for constructing an instance wrapped in a NumRepr to indicate a numeric representation of a set of timestamps (e.g. day offset from the UNIX epoch).
Constructed with a sample of numbers, a pattern (which can be a StrRepr instance if the numbers are themselves represented as strings, otherwise must be the Int or Float instance representing the numbers), and a bad_threshold of values which are permitted to fail conversion.
- classmethod from_strings(iterable, pattern, bad_threshold=0)[source]
Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of timestamps.
Constructed with a sample of strings, a pattern (which must be compatible with datetime.datetime.strptime()), and a bad_threshold of values which are permitted to fail conversion.
- validate(value)[source]
Validate that value (which must be a datetime) lies within the range of sampled values.
- Raises
TypeError – if value is not a datetime.datetime
ValueError – if value is outside the range of sampled values
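The combination of strptime()-based parsing and range validation can be sketched as follows. TimestampRange is a hypothetical stand-in written for illustration only; it is not structa's DateTime class, and the inclusive comparison against the sampled minimum and maximum is an assumption.

```python
from datetime import datetime


class TimestampRange:
    """Illustrative stand-in: remembers the range of a parsed sample."""

    def __init__(self, strings, pattern):
        # pattern must be a format accepted by datetime.strptime()
        parsed = [datetime.strptime(text, pattern) for text in strings]
        self.min = min(parsed)
        self.max = max(parsed)

    def validate(self, value):
        if not isinstance(value, datetime):
            raise TypeError(f"{value!r} is not a datetime")
        # Assumption: both ends of the sampled range are inclusive
        if not self.min <= value <= self.max:
            raise ValueError(f"{value} is outside the sampled range")


ts = TimestampRange(
    ["2021-01-01 00:00:00", "2021-12-31 23:59:59"],
    "%Y-%m-%d %H:%M:%S",
)
ts.validate(datetime(2021, 6, 15))  # inside the range, so no exception
```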
- class structa.types.Str(sample, pattern=None)[source]
Represents string values in datasets. Constructed with a sample of values, and an optional pattern (a sequence of CharClass instances indicating which characters are valid at which position in fixed-length strings).
- pattern: [structa.chars.CharClass]
None if the string is variable length or has no discernible pattern to its values. Otherwise a sequence of CharClass instances indicating the valid characters at each position of the string.
- class structa.types.Repr(content, pattern=None)[source]
Abstract base class for representations (string, numeric) of other types. Parent of StrRepr and NumRepr.
- class structa.types.StrRepr(content, pattern=None)[source]
A string representation of an inner type. Typically used to wrap Int, Float, Bool, or DateTime. Descends from Repr.
- class structa.types.NumRepr(content, pattern=None)[source]
A numeric representation of an inner type. Typically used to wrap DateTime. Descends from Repr.
- class structa.types.URL(sample, pattern=None)[source]
A specialization of Str for representing URLs. Currently does little more than trivial validation of the scheme.
- validate(value)[source]
Validate that value starts with “http://” or “https://”
- Raises
ValueError – if value does not start with a valid scheme
- class structa.types.Field(value, count, optional=False)[source]
Represents a single key in a DictField mapping. This is used by the analyzer when it decides a mapping represents a “record” (a mapping of fields to values) rather than a “table” (a mapping of keys to records).
Constructed with the value of the key, the count of mappings that the key appears in, and a flag indicating if the key is optional (defaults to False for mandatory).
- optional: bool
If True, the key may be omitted from certain mappings in the data. If False (the default), the key always appears in the owning mapping.
- validate(value)[source]
Validates that value matches the expected key value.
- Raises
ValueError – if value does not match the expected value
- class structa.types.Value(sample)[source]
A descendant of Type that represents any arbitrary type at all. This is used when the analyzer comes across a container of a multitude of (incompatible) types, e.g. a list of both strings and integers.
It compares equal to all other types, and when added to other types, the result is a new Value instance.
- class structa.types.Empty[source]
A descendant of Type that represents a container with no content. For example, if the analyzer comes across a field which always contains an empty list, it would be represented as a List instance where List.content was a sequence containing an Empty instance.
It compares equal to all other types, and when added to other types, the result is the other type. This allows the merge phase to combine empty lists with a list of integers found at the same level, for example.
- validate(value)[source]
Trivial validation; always passes.
Note
This counter-intuitive behaviour is because the Empty value indicates a lack of type-information rather than a definitely empty container (after all, there’s usually little sense in having a container field which will always be empty in most hierarchical structures).
The way this differs from Value is in the additive action.
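The difference in additive behaviour between Value and Empty can be shown with a toy model. The classes below are hypothetical stand-ins that only imitate the documented behaviour (both compare equal to everything; Value absorbs the other operand, Empty yields to it); the wildcard attribute is an invention of this sketch and not part of structa.

```python
# Toy model only: these classes imitate the documented behaviour and are
# not structa's implementation.

class Type:
    wildcard = False  # hypothetical flag used by this sketch

    def __eq__(self, other):
        if not isinstance(other, Type):
            return NotImplemented
        if self.wildcard or other.wildcard:
            return True
        return isinstance(self, type(other)) or isinstance(other, type(self))

    def __hash__(self):
        return hash(Type)


class Int(Type):
    pass


class Value(Type):
    wildcard = True

    def __add__(self, other):
        # Adding anything to "any type at all" still gives "any type at all"
        return Value()

    __radd__ = __add__


class Empty(Type):
    wildcard = True

    def __add__(self, other):
        # Empty carries no type information, so the other operand wins
        return other

    __radd__ = __add__


print(Int() == Value(), Int() == Empty())  # True True
print(type(Int() + Value()).__name__)      # Value
print(type(Int() + Empty()).__name__)      # Int
```

In this model a list of integers merged with an empty list stays a list of integers, whereas merging with Value would discard the integer type information.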
- class structa.types.Stats(sample, card, min, q1, q2, q3, max)[source]
Stores cardinality, minimum, maximum, and (high) median of a sample of numeric values (or lengths of strings or containers), along with the specified sample of values.
Typically instances of this class are constructed via the from_sample() or from_lengths() class methods rather than directly. However, instances can also be added to other instances to generate statistics for the combined sample set. Instances may also be compared for equality.
- q2: int | float | str | datetime.datetime | ...
- sample: structa.collections.FrozenCounter
The sample data that the statistics were calculated from. This is always an instance of FrozenCounter.
- classmethod from_lengths(sample)[source]
Given an iterable of sample values, which must be of a homogeneous compound type (e.g. str, tuple), construct an instance after calculating the len() of each item of the sample, and then the minimum, maximum, and quartile values of the lengths.
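The length-based statistics described above can be sketched with the standard library alone. Everything here is an assumption made for illustration: LengthStats and stats_from_lengths are hypothetical names, a plain collections.Counter stands in for FrozenCounter, card is taken to be the total number of items, and the quartiles use a simple index-based pick that may not match structa's method. Only the use of the high median for q2 follows the class description.

```python
from collections import Counter, namedtuple
from statistics import median_high

# Field names mirror the documented constructor arguments
LengthStats = namedtuple("LengthStats", "sample card min q1 q2 q3 max")


def stats_from_lengths(sample):
    lengths = sorted(len(item) for item in sample)
    if not lengths:
        raise ValueError("sample must not be empty")

    def pick(fraction):
        # Crude quartile: the element at the given fraction of the way
        # through the sorted lengths (an assumption, not structa's method)
        index = min(len(lengths) - 1, int(len(lengths) * fraction))
        return lengths[index]

    return LengthStats(
        sample=Counter(lengths),  # stand-in for FrozenCounter
        card=len(lengths),        # assumption: total number of items
        min=lengths[0],
        q1=pick(0.25),
        q2=median_high(lengths),  # the "(high) median"
        q3=pick(0.75),
        max=lengths[-1],
    )


stats = stats_from_lengths(["a", "bb", "ccc", "dddd"])
print(stats.min, stats.q2, stats.max)  # 1 3 4
```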