minorg.annotation module

GFF3 class

class minorg.annotation.Annotation(entry, gff, **kwargs)[source]

Bases: object

Representation of GFF3 annotation/feature

__init__(entry, gff, **kwargs)[source]

Create an Annotation object.

generate_attr(original=True, fields=None)[source]
generate_bed(standardise=False)[source]
generate_gff(standardise=False)[source]
generate_str(fmt='GFF')[source]
get(*fields)[source]
get_attr(a, **kwargs)[source]
has_attr(a, vals, **kwargs)[source]
is_attr(a, val, **kwargs)[source]
property plus[source]

class minorg.annotation.Attributes(val, gff, entry, field_sep_inter=';', field_sep_intra=',', **for_dummy_gff)[source]

Bases: object

Representation of GFF3 feature attributes

__init__(val, gff, entry, field_sep_inter=';', field_sep_intra=',', **for_dummy_gff)[source]
get(a, fmt=<class 'list'>)[source]
has_attr(a, vals)[source]

Checks if at least 1 value in ‘vals’ is also a value of attribute ‘a’

is_attr(a, val)[source]

Checks if ‘val’ is a value of attribute ‘a’
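
The difference between is_attr and has_attr can be illustrated with a minimal standalone sketch. It assumes attributes arrive as a GFF3-style string that uses the default separators shown in the signature (';' between fields, ',' between values). The helper parse_attributes and the two functions below are hypothetical illustrations, not the methods of minorg.annotation.Attributes.

```python
def parse_attributes(val, field_sep_inter=";", field_sep_intra=","):
    """Parse 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attrs = {}
    for field in val.split(field_sep_inter):
        if not field.strip():
            continue  # skip empty fields, e.g. after a trailing ';'
        key, _, values = field.strip().partition("=")
        attrs[key] = values.split(field_sep_intra) if values else []
    return attrs


def is_attr(attrs, a, val):
    """True if 'val' is one of the values of attribute 'a'."""
    return val in attrs.get(a, [])


def has_attr(attrs, a, vals):
    """True if at least one value in 'vals' is also a value of attribute 'a'."""
    return any(v in attrs.get(a, []) for v in vals)


attrs = parse_attributes("ID=mRNA1;Parent=gene1,gene2;")
single_match = is_attr(attrs, "Parent", "gene2")
any_match = has_attr(attrs, "Parent", ["gene3", "gene1"])
```

Here is_attr tests a single value, while has_attr succeeds as soon as any one of the candidate values matches.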

map_field_name(a)[source]
standardise_fields(return_str=True)[source]

class minorg.annotation.GFF(fname=None, data=None, string=None, attr_mod=None, genetic_code=1, fmt=None, quiet=False, memsave=False, chunk_lines=1000, **kwargs)[source]

Bases: object

Representation of GFF3 file.

Large files can be indexed instead of read to memory by using memsave=True.

>>> my_gff = GFF('/path/to/large_gff.gff', memsave = True)

_fname[source]

path to GFF3 file

Type

str

_fmt[source]

GFF3 file format

Type

str

_data[source]

stores annotation data as a list of minorg.annotation.Annotation objects. Not used when fname is provided and memsave=True.

Type

list

_string[source]

stores raw string data if string!=None

Type

str

_attr_mod[source]

attribute modification mapping for non-standard attribute field names

Type

dict

_attr_fields[source]

full attribute field name mapping

Type

dict

_quiet[source]

print only essential messages

Type

bool

_kwargs[source]

stores additional arguments when parsing GFF3 entries to minorg.annotation.Annotation objects

_seqids[source]

stores order of seqids/molecules/chromosomes

Type

dict

_chunk_lines[source]

number of lines between each stored position (used for indexing when memsave=True)

Type

int

_indexed_file[source]

indexed GFF file

Type

minorg.index.IndexedFile

_memsave[source]

index file instead of reading data to memory

Type

bool

__iter__()[source]

Read from data stored in memory if self._data is not empty, else read directly from file.

Yields

minorg.annotation.Annotation

__init__(fname=None, data=None, string=None, attr_mod=None, genetic_code=1, fmt=None, quiet=False, memsave=False, chunk_lines=1000, **kwargs)[source]

Create a GFF object.

Parameters
  • fname (str) – optional, path to GFF3 file or BED file generated using gff2bed

  • data (list) – optional, list of minorg.annotation.Annotation objects

  • string (str) – optional, string contents in the format of a GFF3 file or BED file generated using gff2bed

  • attr_mod (dict) – optional, dictionary of mapping for non-standard attribute field names

  • genetic_code (int or str) – NCBI genetic code name or number

  • fmt (str) – optional, valid values: BED, GFF, GFF3. If not provided and fname != None, will be inferred from fname extension.

  • quiet (bool) – print only essential messages

  • memsave (bool) – index file instead of reading data to memory

  • chunk_lines (int) – number of lines between each stored line index (default=1000)

  • **kwargs – additional arguments when parsing GFF3 entries to minorg.annotation.Annotation objects

add_entry(gff_entry, duplicate_check=False) None[source]

Add minorg.annotation.Annotation object to self’s data at self._data.

Parameters
  • gff_entry (Annotation) – required, Annotation object to add

  • duplicate_check (bool) – check for duplicates and only add gff_entry if not already in data

empty_copy(other=None) Optional[minorg.annotation.GFF][source]

Shallow copy self’s attributes (BUT NOT DATA) to another minorg.annotation.GFF object.

If other=None, create new minorg.annotation.GFF object, copy attributes to it, and return the new object.

Parameters

other (minorg.annotation.GFF) – optional. If not provided, creates a new minorg.annotation.GFF object with copied attributes.

Returns

New minorg.annotation.GFF object with copied attributes if other=None; otherwise None

Return type

minorg.annotation.GFF

get_features(*feature_types, index=False)[source]

Get entries of specific feature types.

Parameters
  • *feature_types (str) – GFF3 feature types to retrieve

  • index (bool) – return line number (index) instead of Annotation objects

Returns

get_features_and_subfeatures(*feature_ids, index=False, full=True, preserve_order=True)[source]

Get features with the given feature_ids and the subfeatures of those features. If full=True, get_subfeatures_full is used for subfeature discovery; otherwise get_subfeatures is used.

Parameters
  • *feature_ids (str) – parent feature IDs

  • index (bool) – return line number of feature instead of minorg.annotation.Annotation objects

  • full (bool) – discover subfeatures with get_subfeatures_full instead of get_subfeatures

  • preserve_order (bool) – sort output by line number (i.e. preserve original order)

Returns

get_i(*indices, output_list=False, sort=True) Optional[Union[list, minorg.annotation.Annotation]][source]

Get Annotation of GFF entry/entries by line index.

Parameters
  • *indices (int) – line number(s) (indices) of entries to retrieve

  • output_list (bool) – return list even if only one line number is provided

  • sort (bool) – sort output by line number

Returns

  • list – Of Annotation if output_list=True or multiple indices were requested

  • Annotation – If output_list=False and only one index was requested

  • None – If output_list=False and the specified line does not exist
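
The three return shapes listed above can be sketched with a plain Python list standing in for the file's entries. This is only an illustration of the documented behaviour; pick_by_index is a hypothetical name, and skipping non-existent line numbers is an assumption rather than confirmed library behaviour.

```python
def pick_by_index(entries, *indices, output_list=False, sort=True):
    """Mimic the documented return shapes using a plain list of entries."""
    if sort:
        indices = sorted(indices)
    # Assumption: line numbers that do not exist are silently skipped.
    selected = [entries[i] for i in indices if 0 <= i < len(entries)]
    if output_list or len(indices) > 1:
        return selected  # list when requested or when several indices are given
    return selected[0] if selected else None  # single entry, or None if missing


entries = ["entry0", "entry1", "entry2"]
single = pick_by_index(entries, 1)
as_list = pick_by_index(entries, 1, output_list=True)
several = pick_by_index(entries, 2, 0)
unsorted = pick_by_index(entries, 2, 0, sort=False)
missing = pick_by_index(entries, 9)
```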

get_i_raw(*indices, strip_newline=True, output_list=True, sort=True) Optional[Union[list, str]][source]

Get raw string of entry/entries by line index.

Parameters
  • *indices (int) – line number(s) (indices) of entries to retrieve

  • strip_newline (bool) – remove newline from returned lines

  • output_list (bool) – return list even if only one line number is provided

  • sort (bool) – sort output by line number

Returns

  • list – Of str of entries if output_list=True or multiple lines were requested

  • str – Of entry if output_list=False and only one line was requested

  • None – If output_list=False and the specified line does not exist

get_id(*feature_ids, index=False, output_list=False, preserve_order=True)[source]

Get Annotation of GFF entry/entries by feature ID.

Parameters
  • *feature_ids (str) – feature ID(s)

  • index (bool) – return line number(s) of feature(s) instead of minorg.annotation.Annotation object(s)

  • output_list (bool) – return list even if only one feature ID is provided

  • preserve_order (bool) – sort output by line number (preserve original order)

Returns

  • list – If output_list=True or more than one feature ID was provided. List of minorg.annotation.Annotation objects if index=False. List of int if index=True.

  • Annotation – If output_list=False and only one feature ID was provided and index=False

  • int – If output_list=False and only one feature ID was provided and index=True

  • None – If output_list=False and no feature with the specified feature ID can be found

get_subfeatures(*feature_ids, feature_types=[], index=False)[source]

Get all features that are subfeatures of user-provided feature_ids.

Parameters
  • *feature_ids (str) – parent feature IDs

  • feature_types (list of str) – GFF3 feature type(s) to retain

  • index (bool) – return line number of feature instead of minorg.annotation.Annotation objects

Returns

get_subfeatures_full(*feature_ids, feature_types=[], index=False, preserve_order=True)[source]

Get all features that are subfeatures of user-provided feature_ids AND subfeatures of those subfeatures, until there are no sub-sub…sub-features left.

Parameters
  • *feature_ids (str) – parent feature IDs

  • feature_types (list of str) – GFF3 feature type(s) to retrieve

  • index (bool) – return line number of feature instead of minorg.annotation.Annotation objects

  • preserve_order (bool) – sort output by line number (preserve original order)

Returns
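
The recursive discovery described for get_subfeatures_full can be sketched as a repeated parent lookup that stops once no new subfeatures turn up. This is a minimal standalone sketch, not the library's implementation: entries are modelled as hypothetical dicts with "id" and "parents" keys, and the function collect_descendants returns line indices in file order (comparable to index=True with preserve_order=True).

```python
def collect_descendants(entries, *feature_ids):
    """Return line indices of all descendants of feature_ids, in file order."""
    found = set()
    seen_ids = set(feature_ids)
    frontier = set(feature_ids)
    while frontier:
        next_frontier = set()
        for i, entry in enumerate(entries):
            if i in found:
                continue
            if frontier.intersection(entry["parents"]):
                found.add(i)
                # Only features with an ID can be parents of further subfeatures.
                if entry["id"] is not None and entry["id"] not in seen_ids:
                    seen_ids.add(entry["id"])
                    next_frontier.add(entry["id"])
        frontier = next_frontier
    return sorted(found)


entries = [
    {"id": "gene1", "type": "gene", "parents": []},
    {"id": "mRNA1", "type": "mRNA", "parents": ["gene1"]},
    {"id": None, "type": "exon", "parents": ["mRNA1"]},
    {"id": None, "type": "CDS", "parents": ["mRNA1"]},
    {"id": "gene2", "type": "gene", "parents": []},
    {"id": "mRNA2", "type": "mRNA", "parents": ["gene2"]},
]
descendants = collect_descendants(entries, "gene1")
```

A single pass over the parents of gene1 would find only mRNA1; the loop is what picks up the exon and CDS one level further down.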

has_file() bool[source]

Whether self is associated with a file.

index(chunk_lines=None) None[source]

Index file.

Parameters

chunk_lines (int) – optional, number of lines between each indexed line. If not provided, defaults to self._chunk_lines.
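
One way to picture what chunk_lines controls is a sparse index that stores a file position only every chunk_lines lines, then seeks to the nearest stored position and reads forward. The sketch below is an assumption about the general technique, not a description of how minorg.index.IndexedFile works internally; build_index and get_line are hypothetical helpers.

```python
import os
import tempfile


def build_index(fname, chunk_lines=1000):
    """Return byte offsets of lines 0, chunk_lines, 2 * chunk_lines, ..."""
    offsets = []
    with open(fname, "rb") as f:
        line_number = 0
        while True:
            position = f.tell()
            line = f.readline()
            if not line:
                break
            if line_number % chunk_lines == 0:
                offsets.append(position)
            line_number += 1
    return offsets


def get_line(fname, offsets, i, chunk_lines=1000):
    """Fetch line i (0-based) by seeking to the nearest stored offset."""
    chunk, skip = divmod(i, chunk_lines)
    if chunk >= len(offsets):
        return None
    with open(fname, "rb") as f:
        f.seek(offsets[chunk])
        for _ in range(skip):
            f.readline()  # read past the lines before the requested one
        line = f.readline()
    return line.decode().rstrip("\n") if line else None


# Demonstration on a small temporary file, storing a position every 3 lines.
with tempfile.NamedTemporaryFile("w", suffix=".gff", delete=False, newline="") as tmp:
    tmp.write("".join(f"line{n}\n" for n in range(10)))
offsets = build_index(tmp.name, chunk_lines=3)
seventh = get_line(tmp.name, offsets, 7, chunk_lines=3)
missing = get_line(tmp.name, offsets, 42, chunk_lines=3)
os.remove(tmp.name)
```

A smaller chunk_lines stores more positions and shortens the forward read; a larger value keeps the index small at the cost of reading more lines per lookup.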

index_seqids() None[source]

Store order of seqid.

invert_attr_fields() dict[source]

Generate a {<feature>: {<NONSTANDARD attribute field name>: <STANDARD attribute field name>}} mapping from self._attr_fields, which is in the format {<feature>: {<STANDARD attribute field name>: <NONSTANDARD attribute field name>}}.

Returns

Reversed attribute field name mapping

Return type

dict
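
The inversion described above amounts to swapping keys and values in each inner dictionary. A minimal sketch on a plain nested dict follows; the feature keys and the non-standard field name "Locus_id" are made up for illustration, and the standalone function invert is not the library method.

```python
def invert(attr_fields):
    """Swap standard and non-standard names within each feature's mapping."""
    return {
        feature: {nonstandard: standard for standard, nonstandard in mapping.items()}
        for feature, mapping in attr_fields.items()
    }


# Hypothetical mapping: {<feature>: {<STANDARD name>: <NONSTANDARD name>}}
attr_fields = {
    "all": {"ID": "ID", "Parent": "Parent"},
    "mRNA": {"Parent": "Locus_id"},
}
inverted = invert(attr_fields)
```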

is_bed() bool[source]

Whether input file format is BED.

is_gff() bool[source]

Whether input file format is GFF.

is_indexed() bool[source]

Whether file has been indexed.

iter_fasta_raw() Generator[str, None, None][source]

Yields FASTA entries (excludes ‘##FASTA’ header)

iter_raw(include_fasta=False) Generator[str, None, None][source]

Yields raw string read from file.
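
The split between annotation lines and the FASTA section can be sketched on a list of raw lines. In GFF3, the '##FASTA' directive marks the start of an embedded FASTA section. The two generators below are hypothetical, line-level illustrations; in particular, whether the library yields the '##FASTA' line itself when include_fasta=True is an assumption here.

```python
def iter_annotation_lines(lines, include_fasta=False):
    """Yield raw lines, stopping at '##FASTA' unless include_fasta is True."""
    for line in lines:
        if not include_fasta and line.rstrip("\n") == "##FASTA":
            return
        yield line


def iter_fasta_lines(lines):
    """Yield only the lines after the '##FASTA' header (header excluded)."""
    in_fasta = False
    for line in lines:
        if in_fasta:
            yield line
        elif line.rstrip("\n") == "##FASTA":
            in_fasta = True


lines = [
    "##gff-version 3\n",
    "Chr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">Chr1\n",
    "ACGTACGTA\n",
]
annotation = list(iter_annotation_lines(lines))
everything = list(iter_annotation_lines(lines, include_fasta=True))
fasta = list(iter_fasta_lines(lines))
```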

make_annotation_from_str_gen(strip_newline=True) Callable[str, minorg.annotation.Annotation][source]

Create function to parse raw string entries into minorg.annotation.Annotation objects based on inferred data format (GFF3 or BED generated by gff2bed)

Parameters

strip_newline (bool) – whether to strip the newline if it is present in a string entry read from file

Returns

Return type

func

parse() None[source]

Read file stored at self._fname to memory.

parse_fasta() None[source]

Read the FASTA section of the file stored at self._fname, and write sequence entries to temporary file at self._fasta. If self._fasta already exists, it will be deleted and replaced.

read_file(fname, fasta=False) None[source]

Read file to memory.

Parameters
  • fname (str) – path to GFF3 file

  • fasta (bool) – parse FASTA section, if it exists, into IndexedFasta object stored at self.fasta

remove_tmp() None[source]
sort() None[source]

Sort data stored in memory. Does NOT sort indexed files.

Sort key: seqid, start, -end, source, feature type, str of attributes with standardised field names
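
The effect of this sort key, especially the negated end position, can be seen in a small standalone sketch. Entry is a hypothetical record type, and seqids are compared as plain strings here, which is an assumption: the library may instead use the seqid order stored in self._seqids.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    attributes: str


def sort_key(entry):
    """seqid, start, -end, source, feature type, attribute string."""
    return (entry.seqid, entry.start, -entry.end,
            entry.source, entry.type, entry.attributes)


entries = [
    Entry("Chr1", "src", "exon", 100, 200, "Parent=mRNA1"),
    Entry("Chr1", "src", "gene", 100, 900, "ID=gene1"),
    Entry("Chr2", "src", "gene", 5, 50, "ID=gene3"),
    Entry("Chr1", "src", "mRNA", 100, 900, "ID=mRNA1;Parent=gene1"),
]
ordered = sorted(entries, key=sort_key)
ordered_types = [entry.type for entry in ordered]
```

Because the end coordinate is negated, features that start at the same position are ordered longest first, so a gene precedes the shorter exon nested inside it.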

subset(feature_ids=None, feature_types=None, subfeatures=True, preserve_order=True)[source]

Subset data by feature ID (feature_ids) and/or feature type (feature_types) and generate new GFF object from them.

Arguments:

update_attr_fields(attr_mod=None) None[source]

Update attribute fields with a user-provided attribute field modification dictionary. (Otherwise, use self._attr_fields.)

Parameters

attr_mod (dict) – optional, dictionary of attribute modifications

write(fout=None, entries=None, seqs={}, include_fasta=False, fasta_line_length=60, **kwargs)[source]

Writes entries to file

write_i(fout, indices, **kwargs)[source]

Executes get_i, then writes output to file

write_id(fout, feature_ids, **kwargs)[source]

Executes get_id, then writes output to file

minorg.annotation.get_recursively(d, default, *keys)[source]