minorg.reference module

Classes that integrate FASTA and GFF3 files for retrieval of feature sequences

class minorg.reference.AnnotatedFasta(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]

Bases: minorg.index.IndexedFasta

Representation of FASTA file with GFF3 annotations

name[source]

name/alias of reference genome

Type

str

fasta[source]

path to genome FASTA file

Type

str

_gff[source]

path to GFF3 file

Type

str

gff[source]

path to GFF3 file. If subset_annotation() has been called, this points to the file of subsetted annotations.

Type

str

genetic_code[source]

NCBI genetic code

Type

str or int

attr_mod[source]

attribute modification

Type

dict

memsave[source]

whether to index GFF3 file instead of reading to memory

Type

bool

assembly[source]

self, stores parsed sequence data. As this class inherits from IndexedFasta, self.assembly provides an analogous way of retrieving sequence data as self.annotation retrieves annotation data.

Type

AnnotatedFasta

annotation[source]

GFF object, stores parsed annotation data. If subset_annotation() has been called, this stores data of subsetted annotations.

Type

GFF

_annotated_features[source]

stores recently retrieved AnnotatedFeature objects

Type

dict

__init__(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]

Create an AnnotatedFasta object.

Parameters
  • fasta (str) – path to FASTA file of genome assembly

  • gff (str) – path to GFF3 file of genome annotation

  • name (str) – name of genome (default=’refrence’)

  • attr_mod (dict) – attribute modifications (default={}) (fmt:{‘<feature>’:{‘<standard attribute field name>’:’<nonstandard attribute field name used>’}}) (e.g.: {‘mRNA’: {‘Parent’: ‘Locus_id’}})

  • genetic_code (int or str) – NCBI genetic code (default=1) (see: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) (1 –> Standard Code; 2 –> Vertebrate Mitochondrial Code; etc.)

  • memsave (bool) – index GFF3 file instead of reading data to memory

annotated_feature(feature_id, feature_type=None) minorg.annotation.Annotation[source]

Retrieve annotated feature.

Also store feature’s annotation in self._annotated_features for future quick retrieval. If this method has been called before with the same feature ID, output will be retrieved from self._annotated_features.

Returns

Annotation object of feature

Return type

minorg.annotation.Annotation

feature_pos(feature_id) set[source]

Get range of feature and convert to set of positions.

Parameters

feature_id (str) – feature ID

Returns

Of integer value coordinates within feature range

Return type

set

feature_range(feature_id) tuple[source]

Get range of feature.

Parameters

feature_id (str) – feature ID

Returns

Of range. E.g (<start>, <end>)

Return type

tuple

get_feature_seq(*feature_ids, fout=None, feature_type=None, complete=False, adj_dir=False, translate=False, by_gene=False, pssm_id=None, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, fout_gff=None, quiet=True, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) dict[source]

Get sequences of multiple features. Except for feature_ids, all arguments are strictly speaking optional. If no arguments are specified, the sense strand of the specified features will be returned. For more complicated queries, in addition to modifying flags adj_dir, complete, and/or translate, users may optionally specify a feature type and/or Pssm-Id.

Parameters
  • fout (str) – path to output file; if not provided, dictionary of seqs will be returned

  • feature_type (str) – GFF feature type (e.g. “CDS”)

  • adj_dir (bool) – output sense strand

  • complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

  • translate (bool) – translate sequence. For biologically sound results, translate=True only be used with feature_type="CDS", adj_dir=True, and complete=False.

  • pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and rps_hits=None, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).

  • by_gene (bool) – merge overlapping domain ranges from different isoforms

  • db (str) – path to RPS-BLAST database

  • remote_rps (bool) – use remote RPS-BLAST; if True, db is not required

  • rpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)

  • rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided, db and rpsblast are not required.

  • thread (int) – number of threads for RPS-BLAST

  • mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files. tempfile.mkstemp will be used if not provided.

  • quiet (bool) – print only essential messages

  • fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional

  • seqid_template (str) –

    optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.

    • $source: reference genome alias

    • $gene: gene/isoform ID

    • $feature: GFF3 feature type

    • $isoform: isoform ID if by_gene = False, else same as $gene

    • $domain: Pssm-Id

    • $n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand

    • $complete: ‘complete’ if complete=True else ‘stitched’

    • $strand: ‘minus’ if adj_dir=False and not self.plus else ‘plus’

    • $sense: ‘NA’ if self.strand is not set else ‘antisense’ if adj_dir=False and not self.plus else ‘sense’

  • apply_template_to_dict (bool) – flatten output dict and assign sequence names

Returns

  • dict of dict of Bio.Seq.Seq – If apply_template_to_dict=False. Format if pssm_id=None: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}

  • dict of Bio.Seq.Seq – If apply_template_to_dict=True

parse_gff() None[source]

Parse GFF3 file stored gff

reduce_annotation(ids, fout=None, preserve_order=True) None[source]

Wrapper for self.subset_annotation so functions written before the renaming won’t break.

subset_annotation(*ids, fout=None, memsave=None, preserve_order=True) None[source]

Subset GFF3 annotations by feature ID (including subfeatures) and generate new file of subsetted features.

Parameters
  • *ids (str) – required, list or tuple of feature IDs

  • fout (str) – optional, path to output file. If not provided, tempfile.mkstemp will be used to generate a temporary file.

class minorg.reference.AnnotatedFeature(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]

Bases: minorg.annotation.Annotation

Representation of GFF3 feature annotation, including its subfeatures.

annotated_fasta[source]

parent AnnotatedFasta object

Type

AnnotatedFasta

annotation[source]
Type

minorg.annotation.GFF

__init__(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]

Create an AnnotatedFeature object.

Annotation data for feature and its subfeatures will be extracted from either gff or annotated_fasta ‘s minorg.reference.AnnotatedFasta.annotation attribute.

Parameters
adj_feature_range(feature_type, ranges, complete=True, **feature_range_kwargs) list[source]

Convert range in subfeatures of feature type to range in current feature. (For example, convert the range of a domain in a concatenated CDS sequence to position in mRNA.)

Parameters
  • feature_type (str) – feature type

  • ranges (list of tuple) – ranges relative to subfeatures of feature type

  • complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

Returns

Of ranges

Return type

list of tuples

adj_subfeature_range(subfeature_id, ranges, complete=True, **subfeature_range_kwargs) list[source]

Convert range in a subfeature to range in current feature. (For example, convert the range in mRNA to position in gene.)

Parameters
  • subfeature_id (str) – subfeature ID

  • ranges (list of tuple) – ranges relative to sub-subfeatures of feature type in subfeature

  • complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

Returns

Of ranges

Return type

list of tuples

convert_range(ranges, index=0, start_incl=True, end_incl=False) list[source]

Convert range(s) from 0-index [start, end).

Parameters
  • index (int) – output range index

  • start_incl (bool) – whether output range is start-inclusive

  • end_incl (bool) – whether output range is end-inclusive

Returns

In same structure as input.

Return type

tuple or list

copy_annotation() minorg.annotation.Annotation[source]

Generate new minorg.annotation.Annotation object with self’s annotation data.

Used to make new Annotation for domains that will be modified with domain range.

Returns

Return type

minorg.annotation.Annotation

feature_range(feature_type, genomic=False, union=True) list[source]

Return ranges of entries of a given feature type (e.g. “CDS”)

Parameters
  • feature_type (str) – GFF3 feature type

  • genomic (bool) – output range in genome instead of relative to sense strand of current feature

  • union (bool) – merge overlapping ranges

Returns

Of ranges

Return type

list of tuples

get_seq(fout=None, feature_type=None, adj_dir=True, complete=True, translate=False, ranges=[], pssm_id=None, by_gene=False, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, quiet=False, fout_gff=None, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) dict[source]

Get sequence of feature or subfeature. All arguments are strictly speaking optional. If no arguments are specified, self’s sense strand sequence will be returned. For more complicated queries, in addition to modifying flags adj_dir, complete, and/or translate, users may optionally specify a feature type, Pssm-Id, and/or restrict range relative to self.

Parameters
  • fout (str) – path to output file; if not provided, dictionary of seqs will be returned

  • feature_type (str) – GFF feature type (e.g. “CDS”)

  • adj_dir (bool) – output sense strand

  • complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

  • translate (bool) – translate sequence. For biologically sound results, translate=True only be used with feature_type="CDS", adj_dir=True, and complete=False.

  • ranges (list of tuple) – 0-indexed ranges, relative to sense strand of self

  • pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and rps_hits=None, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).

  • by_gene (bool) – merge overlapping domain ranges from different isoforms

  • db (str) – path to RPS-BLAST database

  • remote_rps (bool) – use remote RPS-BLAST; if True, db is not required

  • rpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)

  • rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided, db and rpsblast are not required.

  • thread (int) – number of threads for RPS-BLAST

  • mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files. tempfile.mkstemp will be used if not provided.

  • quiet (bool) – print only essential messages

  • fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional

  • seqid_template (str) –

    optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.

    • $source: reference genome alias

    • $gene: gene/isoform ID

    • $feature: GFF3 feature type

    • $isoform: isoform ID if by_gene = False, else same as $gene

    • $domain: Pssm-Id

    • $n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand

    • $complete: ‘complete’ if complete=True else ‘stitched’

    • $strand: ‘minus’ if adj_dir=False and not self.plus else ‘plus’

    • $sense: ‘NA’ if self.strand is not set else ‘antisense’ if adj_dir=False and not self.plus else ‘sense’

  • apply_template_to_dict (bool) – flatten output dict and assign sequence names

Returns

  • dict of dict of Bio.Seq.Seq – If apply_template_to_dict=False. Format if pssm_id=None: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}

  • dict of Bio.Seq.Seq – If apply_template_to_dict=True

property length[source]

Length of feature

range_from_genomic(ranges, **conversion_kwargs) list[source]

Output ranges relative to sense strand of self, where first base on sense strand of feature is 0

Parameters
  • ranges (list) – genomic range(s), 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])

  • index (int) – output range index

  • start_incl (bool) – whether output range is start-inclusive

  • end_incl (bool) – whether output range is end-inclusive

Returns

Of ranges

Return type

list of tuples

range_in_genomic_coords(ranges, index=0, start_incl=True, end_incl=False) list[source]

Convert range in feature to genomic coordinates.

Parameters
  • ranges (list of tuple) – [(start1, end1), (start2, end2), …] - relative to sense strand of feature, where first base on sense strand of feature is 0 - 0-indexed; start-inclusive, end-exclusive

  • index (int) – output index

  • start_incl (bool) – start inclusive output ranges

  • end_incl (bool) – end inclusive output ranges

Returns

Of ranges

Return type

list of tuples

range_to_genomic(ranges, **conversion_kwargs) list[source]

Convert ranges relative to sense strand of feature to their genomic positions.

Parameters
  • ranges (list) – range(s) relative to sense strand of self, 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])

  • index (int) – output range index

  • start_incl (bool) – whether output range is start-inclusive

  • end_incl (bool) – whether output range is end-inclusive

Returns

Of ranges

Return type

list of tuples

subfeature_range(subfeature_id, feature_type=None, genomic=False, union=True) list[source]

Return range of given feature type (e.g. “CDS”) of a sub-feature. (e.g. If self.type == “gene”, self.subfeature_range(“<mRNA isoform ID>”, “CDS”) will return CDS range(s) of the specified mRNA isoform)

Parameters
  • subfeature_id (str) – subfeature_id

  • feature_type (str) – GFF3 feature type

  • genomic (bool) – output range in genome instead of relative to sense strand of current feature

  • union (bool) – merge overlapping ranges

Returns

Of ranges

Return type

list of tuples

subset(id) minorg.reference.AnnotatedFeature[source]
exception minorg.reference.InvalidFeatureID(feature_id)[source]

Bases: minorg.MINORgError

Raised when feature_id does not exist.

__init__(feature_id)[source]