structa.types

The structa.types module defines the class hierarchy used to represent the structural types of analyzed data. The root of the hierarchy is the Type class. The rest of the hierarchy is illustrated in the chart below:

_images/types.svg
class structa.types.Type[source]

The abstract base class of all types recognized by structa.

This class ensures that instances are hashable (can be used as keys in dictionaries), have a reasonable repr() value for ease of use at the REPL, can be passed to the xml() function.

However, the most important thing implemented by this base class is the equality test which can be used to test whether a given type is “compatible” with another type. The base test implemented at this level is that one type is compatible with another if one is a sub-class of the other.

Hence, Str is compatible with Str as they are the same class (and hence one is, redundantly, a sub-class of the other). And Int is compatible with Float as it is a sub-class of the latter. However Int is not compatbile with Str as both descend from Scalar and are siblings rather than parent-child.

class structa.types.Container(sample, content=None)[source]

Abstract base of all types that can contain other types. Constructed with a sample of values, and an optional definition of content.

This is the base class of List, Tuple, and Dict. Note that it is not the base class of Str as, although that is a compound type, it cannot contain other types; structa treats Str as a scalar type.

Container extends Type by permitting instances to be added to (compatible, by equality) instances, combining their content appropriately.

content: list[Type]

A list of Type descendents representing the content of this instance.

lengths: Stats

The Stats of the lengths of the sample values.

sample: [list] | [tuple] | [dict]

The sample of values that this instance represents.

with_content(content)[source]

Return a new copy of this container with the content replaced with content.

class structa.types.Dict(sample, content=None, *, similarity_threshold=0.5)[source]

Represents mappings (or dictionaries).

This concrete refinement of Container uses DictField instances in its content list.

In the case that a mapping is analyzed as a “record” mapping (of fields to values), the content list will contain one or more DictField instances, for which the key attribute(s) will be Field instances.

However, if the mapping is analyzed as a “table” mapping (of keys to records), the content list will contain a single DictField instance mapping the key’s type to the value structure.

validate(value)[source]

Validate that value (which must be a dict) matches the analyzed mapping structure.

Raises

TypeError – if value is not a dict

class structa.types.Tuple(sample, content=None)[source]

Represents sequences of heterogeneous types (typically tuples).

This concrete refinement of Container uses TupleField instances in its content list.

Tuples are typically the result of an analysis of some homogeneous outer sequence (usually a List though sometimes a Dict) that contains heterogeneous sequences (the Tuple instance).

validate(value)[source]

Validate that value (which must be a tuple) matches the analyzed mapping structure.

Raises
class structa.types.List(sample, content=None)[source]

Represents sequences of homogeneous types. This only ever has a single Type descendent in its content list.

validate(value)[source]

Validate that value (which must be a list) matches the analyzed mapping structure.

Raises

TypeError – if value is not a list

class structa.types.DictField(key, value=None)[source]

Represents a single mapping within a Dict, from the key to its corresponding value. For example, a Field of a record mapping to some other type, or a generic Str mapping to an Int value.

key: Type

The Type descendent representing a single key in the mapping. This is usually a Scalar descendent, or a Field.

value: Type

The Type descendent representing a value in the mapping.

class structa.types.TupleField(index, value=None)[source]

Represents a single field within a Tuple, with the index (an integer number) and its corresponding value.

index: int

The index of the field within the tuple.

value: Type

The Type descendent representing a value in the tuple.

class structa.types.Scalar(sample)[source]

Abstract base of all types that cannot contain other types. Constructed with a sample of values.

This is the base class of Float (from which Int and then Bool descend), Str, and DateTime.

values: Stats

The Stats of the sample values.

property sample

A sequence of the sample values that the instance was constructed from (this will not be the original sequence, but one derived from that).

class structa.types.Float(sample)[source]

Represents scalar floating-point values in datasets. Constructed with a sample of values.

classmethod from_strings(sample, pattern, bad_threshold=0)[source]

Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of floating-point values. Constructed with an sample of strings, a pattern (which currently must simply be “f”), and a bad_threshold of values which are permitted to fail conversion.

validate(value)[source]

Validate that value (which must be a float) lies within the range of sampled values.

Raises
class structa.types.Int(sample)[source]

Represents scalar integer values in datasets. Constructed with a sample of values.

classmethod from_strings(sample, pattern, bad_threshold=0)[source]

Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of integer values. Constructed with an sample of strings, a pattern (which may be “d”, “o”, or “x” to represent the base used in the string representation), and a bad_threshold of values which are permitted to fail conversion.

validate(value)[source]

Validate that value (which must be an int) lies within the range of sampled values.

Raises
class structa.types.Bool(sample)[source]

Represents scalar boolean values in datasets. Constructed with a sample of values.

classmethod from_strings(iterable, pattern, bad_threshold=0)[source]

Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of booleans. Constructed with an sample of strings, a pattern (which is a string of the form “false|true”, i.e. the expected string representations of the False and True values separated by a bar), and a bad_threshold of values which are permitted to fail conversion.

validate(value)[source]

Validate that value is an int (with the value 0 or 1), or a bool. Raises TypeError or ValueError in the event that value fails to validate.

Raises
class structa.types.DateTime(sample)[source]

Represents scalar timestamps (a date, and a time) in datasets. Constructed with a sample of values.

classmethod from_numbers(pattern)[source]

Class method for constructing an instance wrapped in a NumRepr to indicate a numeric representation of a set of timestamps (e.g. day offset from the UNIX epoch).

Constructed with an sample of number, a pattern (which can be a StrRepr instance if the numbers are themselves represented as strings, otherwise must be the Int or Float instance representing the numbers), and a bad_threshold of values which are permitted to fail conversion.

classmethod from_strings(iterable, pattern, bad_threshold=0)[source]

Class method for constructing an instance wrapped in a StrRepr to indicate a string representation of a set of timestamps.

Constructed with an sample of strings, a pattern (which must be compatible with datetime.datetime.strptime()), and a bad_threshold of values which are permitted to fail conversion.

validate(value)[source]

Validate that value (which must be a datetime) lies within the range of sampled values.

Raises
class structa.types.Str(sample, pattern=None)[source]

Represents string values in datasets. Constructed with a sample of values, and an optional pattern (a sequence of CharClass instances indicating which characters are valid at which position in fixed-length strings).

lengths: Stats

The Stats of the lengths of the sample values.

pattern: [structa.chars.CharClass]

None if the string is variable length or has no discernable pattern to its values. Otherwise a sequence of CharClass instances indicating the valid characters at each position of the string.

validate(value)[source]

Validate that value (which must be a str) lies within the range of sampled values and, if pattern is not None, that it matches the pattern stored there.

Raises
class structa.types.Repr(content, pattern=None)[source]

Abstract base class for representations (string, numeric) of other types. Parent of StrRepr and NumRepr.

content: Type

The Type that this instance is a representation of. For example, a string representation of integer numbers would be represented by a StrRepr instance with content being a Int instance.

pattern: str | Type | None

Particulars of the representation. For example, in the case of string representations of integers, this is a string indicating the base (“o”, “d”, “x”). In the case of a numeric representation of a datetime, this is the Type (Int or Float) of the values.

class structa.types.StrRepr(content, pattern=None)[source]

A string representation of an inner type. Typically used to wrap Int, Float, Bool, or DateTime. Descends from Repr.

class structa.types.NumRepr(content, pattern=None)[source]

A numeric representation of an inner type. Typically used to wrap DateTime. Descends from Repr.

class structa.types.URL(sample, pattern=None)[source]

A specialization of Str for representing URLs. Currently does little more than trivial validation of the scheme.

validate(value)[source]

Validate that value starts with “http://” or “https://

Raises

ValueError – if value does not start with a valid scheme

class structa.types.Field(value, count, optional=False)[source]

Represents a single key in a DictField mapping. This is used by the analyzer when it decides a mapping represents a “record” (a mapping of fields to values) rather than a “table” (a mapping of keys to records).

Constructed with the value of the key, the count of mappings that the key appears in, and a flag indicating if the key is optional (defaults to False for mandatory).

value: str | int | float | tuple | ...

The value of the key.

count: int

The number of mappings that the key belongs to.

optional: bool

If True, the key may be ommitted from certain mappings in the data. If False (the default), the key always appears in the owning mapping.

validate(value)[source]

Validates that value matches the expected key value.

Raises

ValueError – if value does not match the expected value

class structa.types.Value(sample)[source]

A descendent of Type that represents any arbitrary type at all. This is used when the analyzer comes across a container of a multitude of (incompatible) types, e.g. a list of both strings and integers.

It compares equal to all other types, and when added to other types, the result is a new Value instance.

validate(value)[source]

Trivial validation; always passes, never raises an exception.

class structa.types.Empty[source]

A descendent of Type that represents a container with no content. For example, if the analyzer comes across a field which always contains an empty list, it would be represented as a List instance where List.content was a sequence containing an Empty instance.

It compares equal to all other types, and when added to other types, the result is the other type. This allows the merge phase to combine empty lists with a list of integers found at the same level, for example.

validate(value)[source]

Trivial validation; always passes.

Note

This counter-intuitive behaviour is because the Empty value indicates a lack of type-information rather than a definitely empty container (after all, there’s usually little sense in having a container field which will always be empty in most hierarchical structures).

The way this differs from Value is in the additive action.

class structa.types.Stats(sample, card, min, q1, q2, q3, max)[source]

Stores cardinality, minimum, maximum, and (high) median of a sample of numeric values (or lengths of strings or containers), along with the specified sample of values.

Typically instances of this class are constructed via the from_sample() or from_lengths() class methods rather than directly. However, instances can also be added to other instances to generate statistics for the combined sample set. Instances may also be compared for equality.

card: int

The number of items in the sample that the statistics were calculated from.

q1: int | float | str | datetime.datetime | ...

The first (lower) quartile of the sample.

q2: int | float | str | datetime.datetime | ...

The second quartile (aka the median) of the sample.

q3: int | float | str | datetime.datetime | ...

The third (upper) quartile of the sample.

max: int | float | str | datetime.datetime | ...

The largest value in the sample.

min: int | float | str | datetime.datetime | ...

The smallest value in the sample.

sample: structa.collections.FrozenCounter

The sample data that the statistics were calculated from. This is always an instance of FrozenCounter.

classmethod from_lengths(sample)[source]

Given an iterable of sample values, which must be of a homogeneous compound type (e.g. str, tuple), construct an instance after calculating the len() of each item of the sample, and then the minimum, maximum, and quartile values of the lengths.

classmethod from_sample(sample)[source]

Given an iterable of sample values, which must be of a homogeneous comparable type (e.g. int, str, float), construct an instance after calculating the minimum, maximum, and quartile values of the sample.

property median

An alias for the second quartile, q2.