structa.source

class structa.source.Source(source, *, encoding='auto', encoding_strict=True, format='auto', csv_delimiter='auto', csv_quotechar='auto', yaml_safe=True, json_strict=True, sample_limit=1048576)

A generalized data source capable of automatically recognizing certain popular data formats and guessing character encodings. It is constructed with a mandatory file-like object as the source, plus a number of keyword-only options. The decoded content can be accessed from the data property.

The source must have a read() method which, given a number of bytes to return, returns a bytes string of up to that length. There are no requirements beyond this, so files over sockets or pipes are acceptable inputs.
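To illustrate how little the read() requirement asks for, here is a minimal sketch of an object that satisfies it. ChunkReader is a hypothetical name used only for illustration and is not part of structa; it simply buffers an iterable of bytes chunks, much as data arriving over a pipe or socket might be buffered.

```python
class ChunkReader:
    """Hypothetical wrapper exposing read(n) over an iterable of bytes chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffer = b''

    def read(self, n=-1):
        # Pull chunks until n bytes are buffered or the input is exhausted;
        # a negative n means "read everything that remains".
        while n < 0 or len(self._buffer) < n:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        if n < 0:
            result, self._buffer = self._buffer, b''
        else:
            result, self._buffer = self._buffer[:n], self._buffer[n:]
        return result


reader = ChunkReader([b'{"a"', b': 1, ', b'"b": 2}'])
first = reader.read(6)   # at most 6 bytes
rest = reader.read()     # whatever is left
print(first + rest)
```

An object like this has no seek() or tell(), which is the situation the paragraph above describes for sockets and pipes.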

Parameters
  • source (file) – The file-like object to decode (must have a read method).

  • encoding (str) – The character encoding used in the source, or “auto” (the default) if it should be guessed from a sample of the data.

  • encoding_strict (bool) – If True (the default), raise an exception if character decoding errors occur. Otherwise, replace invalid characters silently.

  • format (str) – If “auto” (the default), guess the format of the data source. Otherwise can be explicitly set to “csv”, “yaml”, or “json” to force parsing of that format.

  • csv_delimiter (str) – If “auto” (the default), use the csv.Sniffer class to guess the field delimiter when the “csv” format is being decoded. Comma, semicolon, space, and tab characters will be attempted. Otherwise must be set to the single-character str used as the field delimiter (e.g. “,”).

  • csv_quotechar (str) – If “auto” (the default), use the csv.Sniffer class to guess the string delimiter when the “csv” format is being decoded. Otherwise must be set to the single-character str used as the string delimiter (e.g. ‘”’).

  • yaml_safe (bool) – If True (the default), the “safe” YAML parser from ruamel.yaml will be used.

  • json_strict (bool) – If True (the default), control characters will not be permitted inside decoded strings.

  • sample_limit (int) – The number of bytes to sample from the beginning of the stream when attempting to determine character encoding. Defaults to 1048576 bytes (1 MiB).
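Several of the options above correspond closely to behavior in the Python standard library. The sketch below uses only the standard library to show what each setting means in practice. It is an assumption that Source delegates to these exact calls internally; only csv.Sniffer is named in the parameter descriptions.

```python
import csv
import json

# encoding_strict: strict decoding raises on invalid bytes, while lenient
# decoding substitutes the replacement character U+FFFD.
raw = b'caf\xe9'  # Latin-1 bytes, which are not valid UTF-8
try:
    raw.decode('utf-8', errors='strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
lenient = raw.decode('utf-8', errors='replace')

# csv_delimiter / csv_quotechar: csv.Sniffer guesses a dialect from a sample,
# here restricted to comma, semicolon, space, and tab.
sample = 'id;name;score\n1;alice;3.5\n2;bob;4.0\n3;carol;2.5\n'
dialect = csv.Sniffer().sniff(sample, delimiters=',; \t')

# json_strict: a literal control character (here a tab) inside a JSON string
# is rejected in strict mode and accepted otherwise.
doc = '{"note": "a\tb"}'
try:
    json.loads(doc)
    json_strict_failed = False
except json.JSONDecodeError:
    json_strict_failed = True
relaxed = json.loads(doc, strict=False)

print(strict_failed, repr(lenient))
print(repr(dialect.delimiter), repr(dialect.quotechar))
print(json_strict_failed, relaxed)
```

When the sample contains no quoted fields, csv.Sniffer falls back to the double quote as the quote character, so an explicit csv_quotechar is only needed for less common conventions.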

property csv_dialect

The csv.Dialect used when format is “csv”, or None otherwise.

property data

The decoded data. Typically a list or dict of values, but can be any value representable in the source format.

property encoding

The character encoding detected or specified for the source, e.g. “utf-8”.

property format

The data format detected or specified for the source, e.g. “csv”, “yaml”, or “json”.
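Putting the constructor and properties together, a minimal usage sketch might look like the following. It relies only on the signature and properties documented above. The load helper is a hypothetical convenience written for this example, and it returns None when structa is not installed so that the snippet still runs.

```python
import io


def load(stream, **options):
    """Hypothetical helper: return Source(...).data, or None without structa."""
    try:
        from structa.source import Source
    except ImportError:
        return None
    # Keyword-only options are passed straight through to the constructor.
    return Source(stream, **options).data


# Any object with a suitable read() method will do; BytesIO is the simplest.
stream = io.BytesIO(b'[{"id": 1}, {"id": 2}]')
data = load(stream, encoding='utf-8', format='json')
print(data)
```

Leaving encoding and format at “auto” would instead ask Source to guess both from the content.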