Command Line Reference

Synopsis

structa [-h] [--version] [-f {auto,csv,json,yaml}] [-e ENCODING]
        [--encoding-strict] [--no-encoding-strict]
        [-F INT] [-M NUM] [-B NUM] [-E NUM] [--str-limit NUM]
        [--hide-count] [--show-count] [--hide-lengths] [--show-lengths]
        [--hide-pattern] [--show-pattern]
        [--hide-range] [--show-range {hidden,limits,median,quartiles,graph}]
        [--hide-samples] [--show-samples]
        [--min-timestamp WHEN] [--max-timestamp WHEN]
        [--max-numeric-len LEN] [--sample-bytes SIZE]
        [--strip-whitespace] [--no-strip-whitespace]
        [--csv-format FIELD[QUOTE]] [--yaml-safe] [--no-yaml-safe]
        [file [file ...]]

Positional Arguments

file

The data-file(s) to analyze; if this is - or unspecified then stdin will be read for the data; if multiple files are specified all will be read and analyzed as an array of similar structures

Optional Arguments

-h, --help

show this help message and exit

--version

show program’s version number and exit

-f {auto,csv,json,yaml}, --format {auto,csv,json,yaml}

The format of the data file; if this is unspecified, it will be guessed based on the first bytes of the file; valid choices are auto (the default), csv, or json

-e ENCODING, --encoding ENCODING

The string encoding of the file, e.g. utf-8 (default: auto). If “auto” then the file will be sampled to determine the encoding (see --sample-bytes)

--encoding-strict, --no-encoding-strict

Controls whether character encoding is strictly enforced and will result in an error if invalid characters are found during analysis. If disabled, a replacement character will be inserted for invalid sequences. The default is strict decoding

-F INT, --field-threshold INT

If the number of distinct keys in a map, or columns in a tuple is less than this then they will be considered distinct fields instead of being lumped under a generic type like str (default: 20)

-M NUM, --merge-threshold NUM

The proportion of mapping fields which must match other mappings for them to be considered potential merge candidates (default: 50%)

-B NUM, --bad-threshold NUM

The proportion of string values which are allowed to mismatch a pattern without preventing the pattern from being reported; the proportion of “bad” data permitted in a field (default: 1%)

-E NUM, --empty-threshold NUM

The proportion of string values permitted to be empty without preventing the pattern from being reported; the proportion of “empty” data permitted in a field (default: 99%)

--str-limit NUM

The length beyond which only the lengths of strs will be reported; below this the actual value of the string will be displayed (default: 20)

--hide-count, --show-count

If set, show the count of items in containers, the count of unique scalar values, and the count of all sample values (if --show-samples is set). If disabled, counts will be hidden

--hide-lengths, --show-lengths

If set, display the range of lengths of string fields in the same format as specified by --show-range

--hide-pattern, --show-pattern

If set, show the pattern determined for fixed length string fields. If disabled, pattern information will be hidden

--hide-range, --show-range {hidden,limits,median,quartiles,graph}

Show the range of numeric (and temporal) fields in a variety of forms. The default is ‘limits’ which simply displays the minimum and maximum; ‘median’ includes the median between these; ‘quartiles’ shows all three quartiles between the minimum and maximum; ‘graph’ displays a crude chart showing the positions of the quartiles relative to the limits. Use --hide-range to hide all range info

--hide-samples, --show-samples

If set, show samples of non-unique scalar values including the most and least common values. If disabled, samples will be hidden

--min-timestamp WHEN

The minimum timestamp to use when guessing whether floating point fields represent UNIX timestamps (default: 20 years). Can be specified as an absolute timestamp (in ISO-8601 format) or a duration to be subtracted from the current timestamp

--max-timestamp WHEN

The maximum timestamp to use when guessing whether floating point fields represent UNIX timestamps (default: 10 years). Can be specified as an absolute timestamp (in ISO-8601 format) or a duration to be added to the current timestamp

--max-numeric-len LEN

The maximum number of characters that a number, integer or floating-point, may use in its representation within the file. Defaults to 30

--sample-bytes SIZE

The number of bytes to sample from the file for the purposes of encoding and format detection. Defaults to 1m. Typical suffixes of k, m, g, etc. may be specified

--strip-whitespace, --no-strip-whitespace

Controls whether leading and trailing found in strings in the will be left alone and thus included or excluded in any data-type analysis. The default is to strip whitespace

--csv-format FIELD[QUOTE]

The characters used to delimit fields and strings in a CSV file. Can be specified as a single character which will be used as the field delimiter, or two characters in which case the second will be used as the string quotation character. Can also be “auto” which indicates the delimiters should be detected. Bear in mind that some characters may require quoting for the shell, e.g. ‘;”’

--yaml-safe, --no-yaml-safe

Controls whether the “safe” or “unsafe” YAML loader is used to parse YAML files. The default is the “safe” parser. Only use --no-yaml-safe if you trust the source of your data