minorg.annotation module

GFF3 class

class minorg.annotation.Annotation(entry, gff, **kwargs)[source]

Bases: object

Representation of GFF3 annotation/feature

__init__(entry, gff, **kwargs)[source]: Create an Annotation object.

generate_attr(original=True, fields=None)[source]

generate_bed(standardise=False)[source]

generate_gff(standardise=False)[source]

generate_str(fmt='GFF')[source]

get(*fields)[source]

get_attr(a, **kwargs)[source]

has_attr(a, vals, **kwargs)[source]

is_attr(a, val, **kwargs)[source]

property plus[source]

class minorg.annotation.Attributes(val, gff, entry, field_sep_inter=';', field_sep_intra=',', **for_dummy_gff)[source]

Bases: object

Representation of GFF3 feature attributes

__init__(val, gff, entry, field_sep_inter=';', field_sep_intra=',', **for_dummy_gff)[source]

get(a, fmt=<class 'list'>)[source]

has_attr(a, vals)[source]: Checks if at least 1 value in ‘vals’ is also a value of attribute ‘a’

is_attr(a, val)[source]: Checks if ‘val’ is a value of attribute ‘a’

map_field_name(a)[source]

standardise_fields(return_str=True)[source]
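The two separators and the two membership checks above can be illustrated with a minimal standalone sketch. It does not use minorg; `parse_attributes`, `value_matches` and `any_value_matches` are hypothetical helpers that only mirror the documented behaviour of field_sep_inter, field_sep_intra, is_attr and has_attr on a plain dictionary.

```python
# Illustrative stand-ins, NOT minorg code: all three function names are hypothetical.

def parse_attributes(val, field_sep_inter=";", field_sep_intra=","):
    """Split a GFF3 attribute column into {field name: [values]}."""
    attributes = {}
    for field in val.split(field_sep_inter):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing separator
        name, _, values = field.partition("=")
        attributes[name] = values.split(field_sep_intra) if values else []
    return attributes


def value_matches(attributes, a, val):
    """Mirror of is_attr: True if 'val' is one of the values of attribute 'a'."""
    return val in attributes.get(a, [])


def any_value_matches(attributes, a, vals):
    """Mirror of has_attr: True if at least one value in 'vals' is a value of 'a'."""
    return any(value_matches(attributes, a, v) for v in vals)


attrs = parse_attributes("ID=exon1;Parent=mRNA1,mRNA2")
print(attrs)
print(any_value_matches(attrs, "Parent", ["mRNA2", "mRNA9"]))
```
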

class minorg.annotation.GFF(fname=None, data=None, string=None, attr_mod=None, genetic_code=1, fmt=None, quiet=False, memsave=False, chunk_lines=1000, **kwargs)[source]

Bases: object

Representation of GFF3 file.

Large files can be indexed instead of read to memory by using memsave=True.

>>> my_gff = GFF('/path/to/large_gff.gff', memsave = True)

_fname[source]

path to GFF3 file

Type: str

_fmt[source]

GFF3 file format

Type: str

_data[source]

stores annotation data as a list of minorg.annotation.Annotation objects. Not used if fname is not None and memsave=True.

Type: list

_string[source]

stores raw string data if string!=None

Type: str

_attr_mod[source]

attribute modification mapping for non-standard attribute field names

Type: dict

_attr_fields[source]

full attribute field name mapping

Type: dict

_quiet[source]

print only essential messages

Type: bool

_kwargs[source]: stores additional arguments when parsing GFF3 entries to minorg.annotation.Annotation objects

_seqids[source]

stores order of seqids/molecules/chromosomes

Type: dict

_chunk_lines[source]

number of lines between each stored position (used for indexing when memsave=True)

Type: int

_indexed_file[source]

indexed GFF file

Type: minorg.index.IndexedFile

_memsave[source]

index file instead of reading data to memory

Type: bool

__iter__()[source]

Read from data stored in memory if self._data is not empty, else read directly from file.

Yields: minorg.annotation.Annotation

__init__(fname=None, data=None, string=None, attr_mod=None, genetic_code=1, fmt=None, quiet=False, memsave=False, chunk_lines=1000, **kwargs)[source]

Create a GFF object.

Parameters

fname (str) – optional, path to GFF3 file or BED file generated using gff2bed
data (list) – optional, list of minorg.annotation.Annotation objects
string (str) – optional, string contents in the format of a GFF3 file or BED file generated using gff2bed
attr_mod (dict) – optional, dictionary of mapping for non-standard attribute field names
genetic_code (int or str) – NCBI genetic code name or number
fmt (str) – optional, valid values: BED, GFF, GFF3. If not provided and fname != None, will be inferred from fname extension.
quiet (bool) – print only essential messages
memsave (bool) – index file instead of reading data to memory
chunk_lines (int) – number of lines between each stored line index (default=1000)
**kwargs – additional arguments when parsing GFF3 entries to minorg.annotation.Annotation objects

add_entry(gff_entry, duplicate_check=False) → None[source]

Add minorg.annotation.Annotation object to self’s data at self._data.

Parameters

gff_entry (Annotation) – required, Annotation object to add
duplicate_check (bool) – check for duplicates and only add gff_entry if not already in data

empty_copy(other=None) → Optional[minorg.annotation.GFF][source]

Shallow copy self’s attributes (BUT NOT DATA) to another minorg.annotation.GFF object.

If other=None, create new minorg.annotation.GFF object, copy attributes to it, and return the new object.

Parameters: other (minorg.annotation.GFF) – optional. If not provided, creates a new minorg.annotation.GFF object with copied attributes.
Returns: The new minorg.annotation.GFF object if other=None; otherwise None
Return type: minorg.annotation.GFF

get_features(*feature_types, index=False)[source]

Get entries of specific feature types.

Parameters

*feature_types (str) – GFF3 feature types to retrieve
index (bool) – return line number (index) instead of Annotation objects

Returns

list – Of minorg.annotation.Annotation objects if index=False
list – Of int line number of entries if index=True

get_features_and_subfeatures(*feature_ids, index=False, full=True, preserve_order=True)[source]

Get features with the given feature_ids and the subfeatures of those features. If full=True, subfeatures are discovered using get_subfeatures_full; otherwise, get_subfeatures is used.

Parameters

feature_ids (str) – parent feature IDs
index (bool) – return line number of feature instead of minorg.annotation.Annotation objects
full (bool) – discover subfeatures using get_subfeatures_full (all nested levels) instead of get_subfeatures
preserve_order (bool) – sort output by line number (i.e. preserve original order)

Returns

list – Of minorg.annotation.Annotation objects if index=False
list – Of int line number of entries if index=True

get_i(*indices, output_list=False, sort=True) → Optional[Union[list, minorg.annotation.Annotation]][source]

Get Annotation of GFF entry/entries by line index.

Parameters

indices (int) – line number(s) (indices) of entries to retrieve
output_list (bool) – return list even if only one line number is provided
sort (bool) – sort output by line number

Returns

list – Of Annotation if output_list=True or multiple indices were requested
Annotation – If output_list=False and only one index was requested
None – If output_list=False and the specified line does not exist
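The return-shape rules listed above (list, single object, or None) can be sketched as follows. This is an illustration under stated assumptions, not minorg's implementation: `lookup_by_line` is a hypothetical function, and plain strings stand in for Annotation objects.

```python
# Illustrative stand-in, NOT minorg code: 'lookup_by_line' is a hypothetical name.

def lookup_by_line(lines, *indices, output_list=False, sort=True):
    """Return entries by 0-based line index using the documented return shapes."""
    if sort:
        indices = sorted(indices)
    # Silently skip indices that do not exist (assumption for this sketch).
    found = [lines[i] for i in indices if 0 <= i < len(lines)]
    if output_list or len(indices) > 1:
        return found  # always a list in these cases
    return found[0] if found else None  # single entry, or None if missing


lines = ["a", "b", "c"]
print(lookup_by_line(lines, 2))
print(lookup_by_line(lines, 2, 0))
print(lookup_by_line(lines, 9))
```
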

get_i_raw(*indices, strip_newline=True, output_list=True, sort=True) → Optional[Union[list, str]][source]

Get raw string of entry/entries by line index.

Parameters

indices (int) – line numbers (indices) of entries to retrieve
strip_newline (bool) – remove newline from returned lines
output_list (bool) – return list even if only one line number is provided
sort (bool) – sort output by line number

Returns

list – Of str of entries if output_list=True or multiple lines were requested
str – Of entry if output_list=False and only one line was requested
None – If output_list=False and the specified line does not exist

get_id(*feature_ids, index=False, output_list=False, preserve_order=True)[source]

Get Annotation of GFF entry/entries by feature ID.

Parameters

*feature_ids (str) – feature ID(s)
index (bool) – return line number(s) of feature(s) instead of minorg.annotation.Annotation object(s)
output_list (bool) – return list even if only one feature ID is provided
preserve_order (bool) – sort output by line number (preserve original order)

Returns

list – If output_list=True or more than one feature ID was provided. List of minorg.annotation.Annotation objects if index=False. List of int if index=True.
Annotation – If output_list=False and only one feature ID was provided and index=False
int – If output_list=False and only one feature ID was provided and index=True
None – If output_list=False and no feature with the specified feature ID can be found

get_subfeatures(*feature_ids, feature_types=[], index=False)[source]

Get all features that are subfeatures of user-provided feature_ids.

Parameters

*feature_ids (str) – parent feature IDs
feature_types (list of str) – GFF3 feature type(s) to retain
index (bool) – return line number of feature instead of minorg.annotation.Annotation objects

Returns

list – Of minorg.annotation.Annotation objects if index=False
list – Of int line number of entries if index=True

get_subfeatures_full(*feature_ids, feature_types=[], index=False, preserve_order=True)[source]

Get all features that are subfeatures of user-provided feature_ids AND subfeatures of those subfeatures, until there are no sub-sub…sub-features left.

Parameters

*feature_ids (str) – parent feature IDs
feature_types (list of str) – GFF3 feature type(s) to retrieve
index (bool) – return line number of feature instead of minorg.annotation.Annotation objects
preserve_order (bool) – sort output by line number (preserve original order)

Returns

list – Of minorg.annotation.Annotation objects if index=False
list – Of int line number of entries if index=True
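The difference between one-level and full subfeature discovery can be sketched on plain dictionaries. This is a minimal illustration, not minorg's implementation: `children_of` and `descendants_of` are hypothetical names, and each entry is assumed to carry a line number, an optional ID and a list of Parent values.

```python
# Illustrative stand-ins, NOT minorg code: both function names are hypothetical.

def children_of(entries, feature_ids):
    """One level only: entries with at least one Parent in feature_ids."""
    wanted = set(feature_ids)
    return [e for e in entries if wanted.intersection(e["Parent"])]


def descendants_of(entries, feature_ids, preserve_order=True):
    """Repeat one-level discovery until no new subfeatures are found."""
    found = {}  # line number -> entry; also guards against revisiting
    frontier = set(feature_ids)
    while frontier:
        new = [e for e in children_of(entries, frontier) if e["line"] not in found]
        for entry in new:
            found[entry["line"]] = entry
        # Only entries with an ID can themselves be parents.
        frontier = {e["ID"] for e in new if e["ID"] is not None}
    output = list(found.values())
    if preserve_order:
        output.sort(key=lambda e: e["line"])
    return output


entries = [
    {"line": 0, "ID": "gene1", "Parent": []},
    {"line": 1, "ID": "mRNA1", "Parent": ["gene1"]},
    {"line": 2, "ID": "exon1", "Parent": ["mRNA1"]},
    {"line": 3, "ID": "gene2", "Parent": []},
    {"line": 4, "ID": None, "Parent": ["mRNA1"]},
]
print([e["line"] for e in children_of(entries, ["gene1"])])
print([e["line"] for e in descendants_of(entries, ["gene1"])])
```
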

has_file() → bool[source]: Whether self is associated with a file.

index(chunk_lines=None) → None[source]

Index file.

Parameters: chunk_lines (int) – optional, number of lines between each indexed line. If not provided, defaults to self._chunk_lines.
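The idea of storing one position every chunk_lines lines can be sketched with the standard library alone. This is an assumption-laden illustration of the general technique, not the behaviour of minorg.index.IndexedFile: `build_index` and `get_line` are hypothetical helpers, and an in-memory io.StringIO stands in for a file on disk.

```python
# Illustrative stand-ins, NOT minorg code: 'build_index' and 'get_line' are hypothetical.
import io


def build_index(handle, chunk_lines=1000):
    """Record the position of every chunk_lines-th line (lines 0, chunk_lines, ...)."""
    offsets = []
    handle.seek(0)
    line_number = 0
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line_number % chunk_lines == 0:
            offsets.append(position)
        line_number += 1
    return offsets


def get_line(handle, offsets, i, chunk_lines=1000):
    """Return line i (0-based) or None, reading at most chunk_lines lines."""
    chunk = i // chunk_lines
    if i < 0 or chunk >= len(offsets):
        return None
    handle.seek(offsets[chunk])  # jump to the nearest stored position
    for _ in range(i % chunk_lines):
        handle.readline()  # skip forward within the chunk
    line = handle.readline()
    return line.rstrip("\n") if line else None


handle = io.StringIO("".join(f"line{n}\n" for n in range(10)))
offsets = build_index(handle, chunk_lines=4)
print(offsets)
print(get_line(handle, offsets, 9, chunk_lines=4))
```

A smaller chunk_lines stores more positions and shortens each lookup, while a larger value keeps the index small at the cost of reading more lines per lookup.
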

index_seqids() → None[source]: Store order of seqid.

invert_attr_fields() → dict[source]

Generate a {<feature>: {<NONSTANDARD attribute field name>: <STANDARD attribute field name>}} mapping from self._attr_fields, which is in the format {<feature>: {<STANDARD attribute field name>: <NONSTANDARD attribute field name>}}.

Returns: Reversed attribute field name mapping
Return type: dict
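The inversion described above amounts to swapping keys and values inside each per-feature dictionary. The following is a minimal sketch, not minorg's code: `invert_mapping` is a hypothetical name, and the feature key "mRNA" and the non-standard field name "Locus_id" are made up for illustration.

```python
# Illustrative stand-in, NOT minorg code: 'invert_mapping' is a hypothetical name.

def invert_mapping(attr_fields):
    """Turn {feature: {standard: nonstandard}} into {feature: {nonstandard: standard}}."""
    return {
        feature: {nonstandard: standard for standard, nonstandard in mapping.items()}
        for feature, mapping in attr_fields.items()
    }


# Hypothetical example data: 'mRNA' entries use 'Locus_id' where GFF3 expects 'Parent'.
attr_fields = {"mRNA": {"ID": "ID", "Parent": "Locus_id"}}
inverted = invert_mapping(attr_fields)
print(inverted)
```

Note that this simple inversion assumes each non-standard name maps to exactly one standard name within a feature; duplicates would silently overwrite one another.
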

is_bed() → bool[source]: Whether input file format is BED.

is_gff() → bool[source]: Whether input file format is GFF.

is_indexed() → bool[source]: Whether file has been indexed.

iter_fasta_raw() → Generator[str, None, None][source]: Yields FASTA entries (excludes ‘##FASTA’ header)

iter_raw(include_fasta=False) → Generator[str, None, None][source]: Yields raw string read from file.

make_annotation_from_str_gen(strip_newline=True) → Callable[str, minorg.annotation.Annotation][source]

Create function to parse raw string entries into minorg.annotation.Annotation objects based on inferred data format (GFF3 or BED generated by gff2bed)

Parameters: strip_newline (bool) – whether to strip the newline, if present, from a string entry read from file
Returns: Function that parses a raw string entry into a minorg.annotation.Annotation object
Return type: func

parse() → None[source]: Read file stored at self._fname to memory.

parse_fasta() → None[source]: Read the FASTA section of the file stored at self._fname, and write sequence entries to temporary file at self._fasta. If self._fasta already exists, it will be deleted and replaced.

read_file(fname, fasta=False) → None[source]

Read file to memory.

Parameters

fname (str) – path to GFF3 file
fasta (bool) – parse FASTA section, if it exists, into IndexedFasta object stored at self.fasta

remove_tmp() → None[source]

sort() → None[source]

Sort data stored in memory. Does NOT sort indexed files.

Sort key: seqid, start, -end, source, feature type, str of attributes with standardised field names
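The sort key above can be written as a tuple, where negating end places longer features before shorter ones that share a start. This is a sketch under assumptions, not minorg's implementation: `Entry` and `sort_key` are hypothetical, and seqids are assumed to be ranked by a stored order (as self._seqids suggests) rather than alphabetically.

```python
# Illustrative stand-ins, NOT minorg code: 'Entry' and 'sort_key' are hypothetical.
from typing import NamedTuple


class Entry(NamedTuple):
    seqid: str
    start: int
    end: int
    source: str
    feature: str
    attributes: str  # attribute string, assumed already standardised


def sort_key(entry, seqid_order):
    """seqid rank, start, -end, source, feature type, attribute string."""
    return (
        seqid_order[entry.seqid],  # assumption: rank by stored seqid order
        entry.start,
        -entry.end,  # larger end sorts first when starts are equal
        entry.source,
        entry.feature,
        entry.attributes,
    )


entries = [
    Entry("Chr1", 1, 50, "src", "exon", "ID=e1"),
    Entry("Chr2", 5, 10, "src", "gene", "ID=g2"),
    Entry("Chr1", 1, 100, "src", "gene", "ID=g1"),
]
seqid_order = {"Chr2": 0, "Chr1": 1}  # hypothetical order of first appearance
ordered = sorted(entries, key=lambda e: sort_key(e, seqid_order))
print([e.attributes for e in ordered])
```
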

subset(feature_ids=None, feature_types=None, subfeatures=True, preserve_order=True)[source]

Subset data by feature ID (feature_ids) and/or feature type (feature_types) and generate new GFF object from them.

Arguments:

update_attr_fields(attr_mod=None) → None[source]

Update attribute fields with a user-provided attribute field modification dictionary. If none is provided, use self._attr_fields.

Parameters: attr_mod (dict) – optional, dictionary of attribute modifications

write(fout=None, entries=None, seqs={}, include_fasta=False, fasta_line_length=60, **kwargs)[source]: Writes entries to file

write_i(fout, indices, **kwargs)[source]: Executes get_i, then writes output to file

write_id(fout, feature_ids, **kwargs)[source]: Executes get_id, then writes output to file

minorg.annotation.get_recursively(d, default, *keys)[source]
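This function is undocumented here. Judging only from its signature, it may walk nested dictionaries key by key and fall back to a default; the sketch below shows that common pattern as a guess, not as minorg's actual behaviour. `lookup_nested` is a hypothetical name.

```python
# Illustrative guess, NOT minorg code: 'lookup_nested' is a hypothetical name and
# the behaviour shown is an assumption based on the signature (d, default, *keys).

def lookup_nested(d, default, *keys):
    """Follow 'keys' through nested dictionaries; return 'default' if any is missing."""
    current = d
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


nested = {"mRNA": {"Parent": "Locus_id"}}
print(lookup_nested(nested, None, "mRNA", "Parent"))
print(lookup_nested(nested, "Parent", "gene", "Parent"))
```
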