minorg.extract_homologue module
- minorg.extract_homologue.filter_min_cds_len(blast6cds_fname, merged, min_cds_len=0, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]
Filter candidate homologues by minimum CDS length.
- Parameters
blast6cds_fname (str) – required, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta
merged (list) – list of candidate homologue data output by merge_hits()
min_cds_len (int) – minimum CDS length in homologues (default=0)
- Returns
List of dicts of merged hit data, where fields for each entry are: molecule, start, end, length
- Return type
list
- minorg.extract_homologue.find_homologue_imap(indv_fout_query_dir_kwargs)[source]
Wrapper of find_homologue_indv function for pickling.
- minorg.extract_homologue.merge_hits(data, merge_within_range=100, min_id=90, min_len=100, check_id_before_merge=False, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]
Merge BLAST hits to generate candidate homologues.
- [optional] If check_id_before_merge=True, discard hits with % identity < min_id
- Merge hits within merge_within_range bp of each other
- Discard if no hit included in the merge has a % identity >= min_id
- Discard if resultant merged range is shorter than min_len bp
- Parameters
data (minorg.functions.BlastResult) – BLASTN result (or BlastResult coerced into reusable iterable)
merge_within_range (int) – maximum number of bases between hits to be merged
min_id (float) – minimum hit % identity
min_len (int) – minimum candidate homologue length
check_id_before_merge (bool) – discard hits with % identity < min_id before merging
colnames (tuple) – BLASTN fields to retain in output tsv of merged hits
- Returns
List of dicts of merged hit data, where fields for each entry are: molecule, start, end, length
- Return type
list
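The merge steps described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not minorg's own code: hits are plain dicts keyed by the default colnames, the gap between two hits is taken as the next start minus the previous end, and the function name merge_hits_sketch and the extra max_pident field are introduced here for illustration only.

```python
def merge_hits_sketch(hits, merge_within_range=100, min_id=90, min_len=100,
                      check_id_before_merge=False):
    """Illustrative sketch of the merge steps; NOT minorg's implementation."""
    if check_id_before_merge:
        # Optional step: drop low-identity hits before merging.
        hits = [h for h in hits if h["pident"] >= min_id]
    # Normalise coordinates so start <= end (minus-strand hits have sstart > send),
    # then sort by molecule and position.
    norm = sorted(
        (h["sseqid"], min(h["sstart"], h["send"]), max(h["sstart"], h["send"]), h["pident"])
        for h in hits
    )
    merged = []
    for molecule, start, end, pident in norm:
        last = merged[-1] if merged else None
        # Assumption: "within merge_within_range bp" means start - previous end <= range.
        if last and last["molecule"] == molecule and start - last["end"] <= merge_within_range:
            last["end"] = max(last["end"], end)
            last["max_pident"] = max(last["max_pident"], pident)
        else:
            merged.append({"molecule": molecule, "start": start, "end": end,
                           "max_pident": pident})
    kept = []
    for m in merged:
        m["length"] = m["end"] - m["start"] + 1
        # Discard if no hit in the merge reaches min_id, or if the range is too short.
        if m["max_pident"] >= min_id and m["length"] >= min_len:
            kept.append(m)
    return kept


# Made-up hits for demonstration only.
example_hits = [
    {"sseqid": "chr1", "sstart": 100, "send": 300, "pident": 95.0, "length": 201},
    {"sseqid": "chr1", "sstart": 350, "send": 500, "pident": 85.0, "length": 151},
    {"sseqid": "chr1", "sstart": 2000, "send": 2050, "pident": 99.0, "length": 51},
    {"sseqid": "chr2", "sstart": 900, "send": 700, "pident": 80.0, "length": 201},
]
merged_default = merge_hits_sketch(example_hits)
merged_prefiltered = merge_hits_sketch(example_hits, check_id_before_merge=True)
```

With the defaults, the 85% hit is absorbed into the neighbouring 95% hit, so the candidate spans 100 to 500. With check_id_before_merge=True the 85% hit is dropped first, and the candidate shrinks to 100 to 300.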
- minorg.extract_homologue.merge_hits_and_filter(blast6_fname, fout, fasta, quiet=True, min_cds_len=0, indv_i=1, colnames=('molecule', 'start', 'end', 'max_pident', 'length'), blast6cds_fname=None, lvl=0, **for_merge_hits)[source]
Merge hits to generate homologues and filter by minimum inferred CDS length.
- merge_hits(): Merge hits in proximity to each other to generate candidate homologues.
- [optional] Filter candidate homologues by minimum CDS length (min_cds_len), as determined by how many bases in the candidate are covered by hits from reference CDS sequences.
- Parameters
blast6_fname (str) – path to BLASTN output file (outfmt 6) prepended with header, where query was reference genomic sequences and subject was fasta
fout (str) – path to output FASTA file in which to write homologues
fasta (str) – path to FASTA file in which to find homologues
quiet (bool) – print only essential messages
min_cds_len (int) – minimum CDS length in homologues (default=0)
indv_i (str or int) – alias for fasta, used for generating sequence names
colnames (tuple or list) – fields in blast6_fname and blast6cds_fname
blast6cds_fname (str) – optional, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta; if provided, candidate homologues will be filtered by minimum CDS length
lvl (int) – optional, indentation level of printed messages
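The optional CDS-length filter counts how many bases of a candidate are covered by reference CDS hits. A minimal sketch of that idea follows, assuming candidates are dicts with molecule, start and end, and CDS hits are dicts with sseqid, sstart and send. The helper names covered_bases and filter_min_cds_len_sketch are hypothetical and do not reflect minorg's internals.

```python
def covered_bases(intervals):
    """Count bases covered by the union of inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the running block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def filter_min_cds_len_sketch(merged, cds_hits, min_cds_len=0):
    """Keep candidates whose CDS-covered length is at least min_cds_len (sketch only)."""
    kept = []
    for cand in merged:
        clipped = []
        for hit in cds_hits:
            if hit["sseqid"] != cand["molecule"]:
                continue
            lo, hi = sorted((hit["sstart"], hit["send"]))
            # Clip each CDS hit to the candidate's range.
            lo, hi = max(lo, cand["start"]), min(hi, cand["end"])
            if lo <= hi:
                clipped.append((lo, hi))
        if covered_bases(clipped) >= min_cds_len:
            kept.append(cand)
    return kept


# Made-up candidates and CDS hits for demonstration only.
candidates = [
    {"molecule": "chr1", "start": 100, "end": 500, "length": 401},
    {"molecule": "chr1", "start": 2000, "end": 2400, "length": 401},
]
cds_hits = [
    {"sseqid": "chr1", "sstart": 120, "send": 220},
    {"sseqid": "chr1", "sstart": 200, "send": 300},    # overlaps the previous hit
    {"sseqid": "chr1", "sstart": 2390, "send": 2450},  # only 11 bp fall inside the candidate
]
kept_strict = filter_min_cds_len_sketch(candidates, cds_hits, min_cds_len=150)
kept_all = filter_min_cds_len_sketch(candidates, cds_hits, min_cds_len=0)
```

Overlapping CDS hits are counted once, so the first candidate has 181 covered bases (120 to 300) and the second has 11. Only the first passes min_cds_len=150, while the default of 0 keeps both.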
- minorg.extract_homologue.recip_blast(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, **kwargs)[source]
Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)
- minorg.extract_homologue.recip_blast_multiref(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, attribute_mod={}, **kwargs)[source]
Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)
gff and fasta_ref must be dictionaries of {<alias>: <path to file>}
- Parameters
fasta_target (str) – path to FASTA file containing query sequences for reciprocal BLAST
directory (str) – path to directory in which to write temporary files
gff (dict) – dictionary of GFF3 files for subjects of reciprocal BLAST
fasta_ref (dict) – dictionary of FASTA files containing subject sequences for reciprocal BLAST
blastn (str) – optional, required only if check_reciprocal=True; blastn command (e.g. ‘blastn’) if available at CLI, else path to blastn executable
bedtools (str) – optional, required only if bedtools is not in the command-search path; path to directory containing BEDTools executables
attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
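The {<alias>: <path to file>} shape expected for gff and fasta_ref can be illustrated with a small pre-flight check. This is a sketch only: check_multiref_inputs is a hypothetical helper, the aliases and paths are placeholders, and the requirement that both dictionaries share the same aliases is an assumption rather than documented behaviour.

```python
def check_multiref_inputs(gff, fasta_ref):
    """Hypothetical pre-flight check; not part of minorg."""
    if not isinstance(gff, dict) or not isinstance(fasta_ref, dict):
        raise TypeError("gff and fasta_ref must be dicts of {alias: path}")
    # Assumption: every alias should have both a GFF3 file and a FASTA file.
    unmatched = set(gff) ^ set(fasta_ref)
    if unmatched:
        raise ValueError(f"aliases without a matching GFF3/FASTA pair: {sorted(unmatched)}")
    return sorted(gff)


# Placeholder aliases and paths for demonstration only.
gff = {"refA": "path/to/refA.gff3", "refB": "path/to/refB.gff3"}
fasta_ref = {"refA": "path/to/refA.fasta", "refB": "path/to/refB.fasta"}
aliases = check_multiref_inputs(gff, fasta_ref)
```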
- minorg.extract_homologue.remove_non_max_bitscore(fasta, bedtool, genes, relax=False, lvl=0, quiet=True, colnames_blast=['chrom', 'start', 'end', 'candidate', 'cstart', 'cend'], blast_metrics=['bitscore'], colnames_bed=['bed_chrom', 'bed_start', 'bed_end', 'id', 'score', 'strand', 'source', 'feature', 'phase', 'attributes', 'overlap'], colnames_gff=['bed_chrom', 'source', 'feature', 'bed_start', 'bed_end', 'score', 'strand', 'phase', 'attributes', 'overlap'], bedtools='', attribute_mod={}) → None[source]
Remove query sequences for which the subject feature in the query-subject hit with the max bitscore is not a target gene/feature. This occurs in-place.
- Parameters
fasta (str) – path to FASTA file of query sequences to reduce
bedtool (BedTool) – BedTool object where BLAST hits have been intersected with subject GFF3 files
genes (list) – gene/feature IDs of targets
relax (bool) – retain query sequences even if max bitscore hit overlaps with non-target feature so long as it also overlaps with a target feature
lvl (int) – printing indentation
quiet (bool) – silence non-essential messages
colnames_blast (list) – column names of BLAST output
blast_metrics (list) – additional column names of metrics in BLAST output
colnames_bed (list) – column names to use if the intersected annotation is in BED format
colnames_gff (list) – column names to use if the intersected annotation is in GFF3 format
bedtools (str) – path to directory containing BEDTools executables, if bedtools is not in the command-search path
attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
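The selection rule, including the effect of relax, can be sketched without BEDTools. The sketch assumes the intersected hits have already been reduced to (candidate, bitscore, feature ID) tuples. The function name candidates_to_keep is hypothetical, and treating tied top-scoring hits as a set of overlapped features is an assumption about how ties are handled.

```python
from collections import defaultdict


def candidates_to_keep(rows, genes, relax=False):
    """Return candidate IDs whose max-bitscore hit(s) land on target features (sketch)."""
    genes = set(genes)
    best = {}                  # candidate -> highest bitscore seen so far
    feats = defaultdict(set)   # candidate -> features overlapped at that bitscore
    for cand, bitscore, feature in rows:
        if cand not in best or bitscore > best[cand]:
            best[cand] = bitscore
            feats[cand] = {feature}
        elif bitscore == best[cand]:
            feats[cand].add(feature)
    keep = set()
    for cand, features in feats.items():
        if relax:
            # Relaxed: at least one overlapped feature is a target.
            ok = bool(features & genes)
        else:
            # Strict: every overlapped feature is a target.
            ok = features <= genes
        if ok:
            keep.add(cand)
    return keep


# Made-up rows for demonstration only.
rows = [
    ("cand1", 500.0, "geneA"),
    ("cand1", 300.0, "geneX"),
    ("cand2", 450.0, "geneX"),
    ("cand2", 200.0, "geneA"),
    ("cand3", 400.0, "geneA"),
    ("cand3", 400.0, "geneX"),
]
strict_keep = candidates_to_keep(rows, genes=["geneA"])
relaxed_keep = candidates_to_keep(rows, genes=["geneA"], relax=True)
```

Here cand2 is removed in both modes because its best hit is on a non-target feature. cand3, whose best hit overlaps both a target and a non-target feature, is retained only when relax=True.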