minorg.reference module

Classes that integrate FASTA and GFF3 files for retrieval of feature sequences

class minorg.reference.AnnotatedFasta(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]

Bases: minorg.index.IndexedFasta

Representation of FASTA file with GFF3 annotations

name[source]

name/alias of reference genome

Type: str

fasta[source]

path to genome FASTA file

Type: str

_gff[source]

path to GFF3 file

Type: str

gff[source]

path to GFF3 file. If subset_annotation() has been called, this points to the file of subsetted annotations.

Type: str

genetic_code[source]

NCBI genetic code

Type: str or int

attr_mod[source]

attribute modification

Type: dict

memsave[source]

whether to index GFF3 file instead of reading to memory

Type: bool

assembly[source]

self, stores parsed sequence data. As this class inherits from IndexedFasta, self.assembly provides an analogous way of retrieving sequence data as self.annotation retrieves annotation data.

Type: AnnotatedFasta

annotation[source]

GFF object, stores parsed annotation data. If subset_annotation() has been called, this stores data of subsetted annotations.

Type: GFF

_annotated_features[source]

stores recently retrieved AnnotatedFeature objects

Type: dict

__init__(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]

Create an AnnotatedFasta object.

Parameters

fasta (str) – path to FASTA file of genome assembly
gff (str) – path to GFF3 file of genome annotation
name (str) – name of genome (default=’refrence’)
attr_mod (dict) – attribute modifications (default={}) (fmt:{‘<feature>’:{‘<standard attribute field name>’:’<nonstandard attribute field name used>’}}) (e.g.: {‘mRNA’: {‘Parent’: ‘Locus_id’}})
genetic_code (int or str) – NCBI genetic code (default=1) (see: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) (1 –> Standard Code; 2 –> Vertebrate Mitochondrial Code; etc.)
memsave (bool) – index GFF3 file instead of reading data to memory

annotated_feature(feature_id, feature_type=None) → minorg.annotation.Annotation[source]

Retrieve annotated feature.

Also store feature’s annotation in self._annotated_features for future quick retrieval. If this method has been called before with the same feature ID, output will be retrieved from self._annotated_features.

Returns: Annotation object of feature
Return type: minorg.annotation.Annotation

feature_pos(feature_id) → set[source]

Get range of feature and convert to set of positions.

Parameters: feature_id (str) – feature ID
Returns: Of integer value coordinates within feature range
Return type: set

feature_range(feature_id) → tuple[source]

Get range of feature.

Parameters: feature_id (str) – feature ID
Returns: Of range. E.g (<start>, <end>)
Return type: tuple

get_feature_seq(*feature_ids, fout=None, feature_type=None, complete=False, adj_dir=False, translate=False, by_gene=False, pssm_id=None, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, fout_gff=None, quiet=True, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) → dict[source]

Get sequences of multiple features. Except for feature_ids, all arguments are strictly speaking optional. If no arguments are specified, the sense strand of the specified features will be returned. For more complicated queries, in addition to modifying flags adj_dir, complete, and/or translate, users may optionally specify a feature type and/or Pssm-Id.

Parameters

fout (str) – path to output file; if not provided, dictionary of seqs will be returned
feature_type (str) – GFF feature type (e.g. “CDS”)
adj_dir (bool) – output sense strand
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
translate (bool) – translate sequence. For biologically sound results, translate=True only be used with feature_type="CDS", adj_dir=True, and complete=False.
pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and rps_hits=None, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).
by_gene (bool) – merge overlapping domain ranges from different isoforms
db (str) – path to RPS-BLAST database
remote_rps (bool) – use remote RPS-BLAST; if True, db is not required
rpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)
rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided, db and rpsblast are not required.
thread (int) – number of threads for RPS-BLAST
mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files. tempfile.mkstemp will be used if not provided.
quiet (bool) – print only essential messages
fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional
seqid_template (str) –
optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.
- $source: reference genome alias
- $gene: gene/isoform ID
- $feature: GFF3 feature type
- $isoform: isoform ID if by_gene = False, else same as $gene
- $domain: Pssm-Id
- $n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand
- $complete: ‘complete’ if complete=True else ‘stitched’
- $strand: ‘minus’ if adj_dir=False and not self.plus else ‘plus’
- $sense: ‘NA’ if self.strand is not set else ‘antisense’ if adj_dir=False and not self.plus else ‘sense’
apply_template_to_dict (bool) – flatten output dict and assign sequence names

Returns

dict of dict of Bio.Seq.Seq – If apply_template_to_dict=False. Format if pssm_id=None: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}
dict of Bio.Seq.Seq – If apply_template_to_dict=True

parse_gff() → None[source]: Parse GFF3 file stored gff

reduce_annotation(ids, fout=None, preserve_order=True) → None[source]: Wrapper for self.subset_annotation so functions written before the renaming won’t break.

subset_annotation(*ids, fout=None, memsave=None, preserve_order=True) → None[source]

Subset GFF3 annotations by feature ID (including subfeatures) and generate new file of subsetted features.

Parameters

*ids (str) – required, list or tuple of feature IDs
fout (str) – optional, path to output file. If not provided, tempfile.mkstemp will be used to generate a temporary file.

class minorg.reference.AnnotatedFeature(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]

Bases: minorg.annotation.Annotation

Representation of GFF3 feature annotation, including its subfeatures.

annotated_fasta[source]

parent AnnotatedFasta object

Type: AnnotatedFasta

annotation[source]

Type: minorg.annotation.GFF

__init__(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]

Create an AnnotatedFeature object.

Annotation data for feature and its subfeatures will be extracted from either gff or annotated_fasta ‘s minorg.reference.AnnotatedFasta.annotation attribute.

Parameters

id (str) – required, ID of feature (typically gene name or isoform name)
annotated_fasta (AnnotatedFasta) – required, AnnotatedFasta object
gff (str or minorg.annotation.GFF) – optional, path to GFF file or minorg.annotation.GFF object
rps_hits (str) – optional, path to RPS-BLAST output (outfmt 6, with header)

adj_feature_range(feature_type, ranges, complete=True, **feature_range_kwargs) → list[source]

Convert range in subfeatures of feature type to range in current feature. (For example, convert the range of a domain in a concatenated CDS sequence to position in mRNA.)

Parameters

feature_type (str) – feature type
ranges (list of tuple) – ranges relative to subfeatures of feature type
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

Returns

Of ranges

Return type

list of tuples

adj_subfeature_range(subfeature_id, ranges, complete=True, **subfeature_range_kwargs) → list[source]

Convert range in a subfeature to range in current feature. (For example, convert the range in mRNA to position in gene.)

Parameters

subfeature_id (str) – subfeature ID
ranges (list of tuple) – ranges relative to sub-subfeatures of feature type in subfeature
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])

Returns

Of ranges

Return type

list of tuples

convert_range(ranges, index=0, start_incl=True, end_incl=False) → list[source]

Convert range(s) from 0-index [start, end).

Parameters

index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive

Returns

In same structure as input.

Return type

tuple or list

copy_annotation() → minorg.annotation.Annotation[source]

Generate new minorg.annotation.Annotation object with self’s annotation data.

Used to make new Annotation for domains that will be modified with domain range.

Returns
Return type: minorg.annotation.Annotation

feature_range(feature_type, genomic=False, union=True) → list[source]

Return ranges of entries of a given feature type (e.g. “CDS”)

Parameters

feature_type (str) – GFF3 feature type
genomic (bool) – output range in genome instead of relative to sense strand of current feature
union (bool) – merge overlapping ranges

Returns

Of ranges

Return type

list of tuples

get_seq(fout=None, feature_type=None, adj_dir=True, complete=True, translate=False, ranges=[], pssm_id=None, by_gene=False, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, quiet=False, fout_gff=None, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) → dict[source]

Get sequence of feature or subfeature. All arguments are strictly speaking optional. If no arguments are specified, self’s sense strand sequence will be returned. For more complicated queries, in addition to modifying flags adj_dir, complete, and/or translate, users may optionally specify a feature type, Pssm-Id, and/or restrict range relative to self.

Parameters

fout (str) – path to output file; if not provided, dictionary of seqs will be returned
feature_type (str) – GFF feature type (e.g. “CDS”)
adj_dir (bool) – output sense strand
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
translate (bool) – translate sequence. For biologically sound results, translate=True only be used with feature_type="CDS", adj_dir=True, and complete=False.
ranges (list of tuple) – 0-indexed ranges, relative to sense strand of self
pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and rps_hits=None, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).
by_gene (bool) – merge overlapping domain ranges from different isoforms
db (str) – path to RPS-BLAST database
remote_rps (bool) – use remote RPS-BLAST; if True, db is not required
rpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)
rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided, db and rpsblast are not required.
thread (int) – number of threads for RPS-BLAST
mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files. tempfile.mkstemp will be used if not provided.
quiet (bool) – print only essential messages
fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional
seqid_template (str) –
optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.
- $source: reference genome alias
- $gene: gene/isoform ID
- $feature: GFF3 feature type
- $isoform: isoform ID if by_gene = False, else same as $gene
- $domain: Pssm-Id
- $n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand
- $complete: ‘complete’ if complete=True else ‘stitched’
- $strand: ‘minus’ if adj_dir=False and not self.plus else ‘plus’
- $sense: ‘NA’ if self.strand is not set else ‘antisense’ if adj_dir=False and not self.plus else ‘sense’
apply_template_to_dict (bool) – flatten output dict and assign sequence names

Returns

dict of dict of Bio.Seq.Seq – If apply_template_to_dict=False. Format if pssm_id=None: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}
dict of Bio.Seq.Seq – If apply_template_to_dict=True

property length[source]: Length of feature

range_from_genomic(ranges, **conversion_kwargs) → list[source]

Output ranges relative to sense strand of self, where first base on sense strand of feature is 0

Parameters

ranges (list) – genomic range(s), 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])
index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive

Returns

Of ranges

Return type

list of tuples

range_in_genomic_coords(ranges, index=0, start_incl=True, end_incl=False) → list[source]

Convert range in feature to genomic coordinates.

Parameters

ranges (list of tuple) – [(start1, end1), (start2, end2), …] - relative to sense strand of feature, where first base on sense strand of feature is 0 - 0-indexed; start-inclusive, end-exclusive
index (int) – output index
start_incl (bool) – start inclusive output ranges
end_incl (bool) – end inclusive output ranges

Returns

Of ranges

Return type

list of tuples

range_to_genomic(ranges, **conversion_kwargs) → list[source]

Convert ranges relative to sense strand of feature to their genomic positions.

Parameters

ranges (list) – range(s) relative to sense strand of self, 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])
index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive

Returns

Of ranges

Return type

list of tuples

subfeature_range(subfeature_id, feature_type=None, genomic=False, union=True) → list[source]

Return range of given feature type (e.g. “CDS”) of a sub-feature. (e.g. If self.type == “gene”, self.subfeature_range(“<mRNA isoform ID>”, “CDS”) will return CDS range(s) of the specified mRNA isoform)

Parameters

subfeature_id (str) – subfeature_id
feature_type (str) – GFF3 feature type
genomic (bool) – output range in genome instead of relative to sense strand of current feature
union (bool) – merge overlapping ranges

Returns

Of ranges

Return type

list of tuples

subset(id) → minorg.reference.AnnotatedFeature[source]

exception minorg.reference.InvalidFeatureID(feature_id)[source]

Bases: minorg.MINORgError

Raised when feature_id does not exist.

__init__(feature_id)[source]