minorg.reference module
Classes that integrate FASTA and GFF3 files for retrieval of feature sequences
- class minorg.reference.AnnotatedFasta(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]
Bases:
minorg.index.IndexedFasta
Representation of FASTA file with GFF3 annotations
- gff[source]
path to GFF3 file. If
subset_annotation()
has been called, this points to the file of subsetted annotations.- Type
str
- assembly[source]
self, stores parsed sequence data. As this class inherits from IndexedFasta,
self.assembly
provides an analogous way of retrieving sequence data asself.annotation
retrieves annotation data.- Type
- annotation[source]
GFF object, stores parsed annotation data. If
subset_annotation()
has been called, this stores data of subsetted annotations.- Type
- __init__(fasta, gff, name='reference', attr_mod=None, genetic_code=1, memsave=True)[source]
Create an AnnotatedFasta object.
- Parameters
fasta (str) – path to FASTA file of genome assembly
gff (str) – path to GFF3 file of genome annotation
name (str) – name of genome (default=’refrence’)
attr_mod (dict) – attribute modifications (default={}) (fmt:{‘<feature>’:{‘<standard attribute field name>’:’<nonstandard attribute field name used>’}}) (e.g.: {‘mRNA’: {‘Parent’: ‘Locus_id’}})
genetic_code (int or str) – NCBI genetic code (default=1) (see: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) (1 –> Standard Code; 2 –> Vertebrate Mitochondrial Code; etc.)
memsave (bool) – index GFF3 file instead of reading data to memory
- annotated_feature(feature_id, feature_type=None) minorg.annotation.Annotation [source]
Retrieve annotated feature.
Also store feature’s annotation in
self._annotated_features
for future quick retrieval. If this method has been called before with the same feature ID, output will be retrieved fromself._annotated_features
.- Returns
Annotation object of feature
- Return type
- feature_pos(feature_id) set [source]
Get range of feature and convert to set of positions.
- Parameters
feature_id (str) – feature ID
- Returns
Of integer value coordinates within feature range
- Return type
set
- feature_range(feature_id) tuple [source]
Get range of feature.
- Parameters
feature_id (str) – feature ID
- Returns
Of range. E.g (<start>, <end>)
- Return type
tuple
- get_feature_seq(*feature_ids, fout=None, feature_type=None, complete=False, adj_dir=False, translate=False, by_gene=False, pssm_id=None, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, fout_gff=None, quiet=True, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) dict [source]
Get sequences of multiple features. Except for
feature_ids
, all arguments are strictly speaking optional. If no arguments are specified, the sense strand of the specified features will be returned. For more complicated queries, in addition to modifying flagsadj_dir
,complete
, and/ortranslate
, users may optionally specify a feature type and/or Pssm-Id.- Parameters
fout (str) – path to output file; if not provided, dictionary of seqs will be returned
feature_type (str) – GFF feature type (e.g. “CDS”)
adj_dir (bool) – output sense strand
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
translate (bool) – translate sequence. For biologically sound results,
translate=True
only be used withfeature_type="CDS"
,adj_dir=True
, andcomplete=False
.pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and
rps_hits=None
, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).by_gene (bool) – merge overlapping domain ranges from different isoforms
db (str) – path to RPS-BLAST database
remote_rps (bool) – use remote RPS-BLAST; if True,
db
is not requiredrpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)
rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided,
db
andrpsblast
are not required.thread (int) – number of threads for RPS-BLAST
mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files.
tempfile.mkstemp
will be used if not provided.quiet (bool) – print only essential messages
fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional
seqid_template (str) –
optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.
$source: reference genome alias
$gene: gene/isoform ID
$feature: GFF3 feature type
$isoform: isoform ID if by_gene = False, else same as $gene
$domain: Pssm-Id
$n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand
$complete: ‘complete’ if
complete=True
else ‘stitched’$strand: ‘minus’ if
adj_dir=False
and not self.plus else ‘plus’$sense: ‘NA’ if self.strand is not set else ‘antisense’ if
adj_dir=False
and not self.plus else ‘sense’
apply_template_to_dict (bool) – flatten output dict and assign sequence names
- Returns
dict of dict of Bio.Seq.Seq – If
apply_template_to_dict=False
. Format ifpssm_id=None
: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}dict of Bio.Seq.Seq – If
apply_template_to_dict=True
- reduce_annotation(ids, fout=None, preserve_order=True) None [source]
Wrapper for
self.subset_annotation
so functions written before the renaming won’t break.
- subset_annotation(*ids, fout=None, memsave=None, preserve_order=True) None [source]
Subset GFF3 annotations by feature ID (including subfeatures) and generate new file of subsetted features.
- Parameters
*ids (str) – required, list or tuple of feature IDs
fout (str) – optional, path to output file. If not provided,
tempfile.mkstemp
will be used to generate a temporary file.
- class minorg.reference.AnnotatedFeature(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]
Bases:
minorg.annotation.Annotation
Representation of GFF3 feature annotation, including its subfeatures.
- annotated_fasta[source]
parent
AnnotatedFasta
object- Type
- __init__(id, annotated_fasta, gff=None, rps_hits=None, feature_type=None)[source]
Create an AnnotatedFeature object.
Annotation data for feature and its subfeatures will be extracted from either
gff
orannotated_fasta
‘sminorg.reference.AnnotatedFasta.annotation
attribute.- Parameters
id (str) – required, ID of feature (typically gene name or isoform name)
annotated_fasta (
AnnotatedFasta
) – required,AnnotatedFasta
objectgff (str or
minorg.annotation.GFF
) – optional, path to GFF file orminorg.annotation.GFF
objectrps_hits (str) – optional, path to RPS-BLAST output (outfmt 6, with header)
- adj_feature_range(feature_type, ranges, complete=True, **feature_range_kwargs) list [source]
Convert range in subfeatures of feature type to range in current feature. (For example, convert the range of a domain in a concatenated CDS sequence to position in mRNA.)
- Parameters
feature_type (str) – feature type
ranges (list of tuple) – ranges relative to subfeatures of feature type
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
- Returns
Of ranges
- Return type
list of tuples
- adj_subfeature_range(subfeature_id, ranges, complete=True, **subfeature_range_kwargs) list [source]
Convert range in a subfeature to range in current feature. (For example, convert the range in mRNA to position in gene.)
- Parameters
subfeature_id (str) – subfeature ID
ranges (list of tuple) – ranges relative to sub-subfeatures of feature type in subfeature
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
- Returns
Of ranges
- Return type
list of tuples
- convert_range(ranges, index=0, start_incl=True, end_incl=False) list [source]
Convert range(s) from 0-index [start, end).
- Parameters
index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive
- Returns
In same structure as input.
- Return type
tuple or list
- copy_annotation() minorg.annotation.Annotation [source]
Generate new
minorg.annotation.Annotation
object with self’s annotation data.Used to make new Annotation for domains that will be modified with domain range.
- Returns
- Return type
- feature_range(feature_type, genomic=False, union=True) list [source]
Return ranges of entries of a given feature type (e.g. “CDS”)
- Parameters
feature_type (str) – GFF3 feature type
genomic (bool) – output range in genome instead of relative to sense strand of current feature
union (bool) – merge overlapping ranges
- Returns
Of ranges
- Return type
list of tuples
- get_seq(fout=None, feature_type=None, adj_dir=True, complete=True, translate=False, ranges=[], pssm_id=None, by_gene=False, db=None, remote_rps=False, rpsblast='rpsblast', rps_hits=None, thread=None, mktmp=None, quiet=False, fout_gff=None, seqid_template='$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense', apply_template_to_dict=False) dict [source]
Get sequence of feature or subfeature. All arguments are strictly speaking optional. If no arguments are specified, self’s sense strand sequence will be returned. For more complicated queries, in addition to modifying flags
adj_dir
,complete
, and/ortranslate
, users may optionally specify a feature type, Pssm-Id, and/or restrict range relative to self.- Parameters
fout (str) – path to output file; if not provided, dictionary of seqs will be returned
feature_type (str) – GFF feature type (e.g. “CDS”)
adj_dir (bool) – output sense strand
complete (bool) – merge output range(s) together into single range. Output will stll be a list of tuple. (e.g. [(<smallest start>, <largest end>)])
translate (bool) – translate sequence. For biologically sound results,
translate=True
only be used withfeature_type="CDS"
,adj_dir=True
, andcomplete=False
.ranges (list of tuple) – 0-indexed ranges, relative to sense strand of self
pssm_id (str or list) – Pssm-Id of domain(s) to restrict output sequence range to. If provided and
rps_hits=None
, RPS-BLAST will be conducted on concatenated CDS to search for domain(s).by_gene (bool) – merge overlapping domain ranges from different isoforms
db (str) – path to RPS-BLAST database
remote_rps (bool) – use remote RPS-BLAST; if True,
db
is not requiredrpsblast (str) – path to rpsblast or rpsblast+ executable (or name of command, if available at command line)
rps_hits (str) – path to RPS-BLAST output, outfmt 6, with header. If provided,
db
andrpsblast
are not required.thread (int) – number of threads for RPS-BLAST
mktmp (func) – function that takes a file name and generates an absolute path, used for temporary files.
tempfile.mkstemp
will be used if not provided.quiet (bool) – print only essential messages
fout_gff (str) – path to output GFF3 file in which to write domain features, entirely optional
seqid_template (str) –
optional, template for output sequence name. Template will be parsed by strings.Template. The default template is “$source|$gene|$feature|$isoform|$domain|$n|$complete|$strand|$sense”.
$source: reference genome alias
$gene: gene/isoform ID
$feature: GFF3 feature type
$isoform: isoform ID if by_gene = False, else same as $gene
$domain: Pssm-Id
$n: if multiple domains are present, they will be numbered according to proximity to 5’ of sense strand
$complete: ‘complete’ if
complete=True
else ‘stitched’$strand: ‘minus’ if
adj_dir=False
and not self.plus else ‘plus’$sense: ‘NA’ if self.strand is not set else ‘antisense’ if
adj_dir=False
and not self.plus else ‘sense’
apply_template_to_dict (bool) – flatten output dict and assign sequence names
- Returns
dict of dict of Bio.Seq.Seq – If
apply_template_to_dict=False
. Format ifpssm_id=None
: {self.id: {(0, self.length): <sequence>}} Format if pssm_id is provided: {<isoform_id>: {(start pos of domain in self, end pos of domain in self): <sequence>}}dict of Bio.Seq.Seq – If
apply_template_to_dict=True
- range_from_genomic(ranges, **conversion_kwargs) list [source]
Output ranges relative to sense strand of self, where first base on sense strand of feature is 0
- Parameters
ranges (list) – genomic range(s), 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])
index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive
- Returns
Of ranges
- Return type
list of tuples
- range_in_genomic_coords(ranges, index=0, start_incl=True, end_incl=False) list [source]
Convert range in feature to genomic coordinates.
- Parameters
ranges (list of tuple) – [(start1, end1), (start2, end2), …] - relative to sense strand of feature, where first base on sense strand of feature is 0 - 0-indexed; start-inclusive, end-exclusive
index (int) – output index
start_incl (bool) – start inclusive output ranges
end_incl (bool) – end inclusive output ranges
- Returns
Of ranges
- Return type
list of tuples
- range_to_genomic(ranges, **conversion_kwargs) list [source]
Convert ranges relative to sense strand of feature to their genomic positions.
- Parameters
ranges (list) – range(s) relative to sense strand of self, 0-indexed, start-inclusive, end-exclusive (e.g. [(start1, end1), (start2, end2), …])
index (int) – output range index
start_incl (bool) – whether output range is start-inclusive
end_incl (bool) – whether output range is end-inclusive
- Returns
Of ranges
- Return type
list of tuples
- subfeature_range(subfeature_id, feature_type=None, genomic=False, union=True) list [source]
Return range of given feature type (e.g. “CDS”) of a sub-feature. (e.g. If self.type == “gene”, self.subfeature_range(“<mRNA isoform ID>”, “CDS”) will return CDS range(s) of the specified mRNA isoform)
- Parameters
subfeature_id (str) – subfeature_id
feature_type (str) – GFF3 feature type
genomic (bool) – output range in genome instead of relative to sense strand of current feature
union (bool) – merge overlapping ranges
- Returns
Of ranges
- Return type
list of tuples
- subset(id) minorg.reference.AnnotatedFeature [source]
- exception minorg.reference.InvalidFeatureID(feature_id)[source]
Bases:
minorg.MINORgError
Raised when feature_id does not exist.