minorg.extract_homologue module

minorg.extract_homologue.filter_min_cds_len(blast6cds_fname, merged, min_cds_len=0, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]

Filter candidate homologues by minimum CDS length.

Parameters

blast6cds_fname (str) – required, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta
merged (list) – list of candidate homologue data output by merge_hits()
min_cds_len (int) – minimum CDS length in homologues (default=0)

Returns

List of dicts of merged hit data, where the fields of each entry are: molecule, start, end, length

Return type

list

minorg.extract_homologue.find_homologue_imap(indv_fout_query_dir_kwargs)[source]: Wrapper of find_homologue_indv function for pickling.

minorg.extract_homologue.get_merged_seqs(merged_f, fasta, fout, header=[], indv_i=1)[source]

minorg.extract_homologue.merge_hits(data, merge_within_range=100, min_id=90, min_len=100, check_id_before_merge=False, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]

Merge BLAST hits to generate candidate homologues.

[optional] If check_id_before_merge=True, discard hits with % identity < min_id
Merge hits within merge_within_range bp of each other
- Discard if no hit included in the merge has a % identity >= min_id
- Discard if resultant merged range is shorter than min_len bp

Parameters

data (minorg.functions.BlastResult) – BLASTN result (or BlastResult coerced into reusable iterable)
merge_within_range (int) – maximum number of bases between hits to be merged
min_id (float) – minimum hit % identity
min_len (int) – minimum candidate homologue length
check_id_before_merge (bool) – discard hits with % identity < min_id before merging
colnames (tuple) – BLASTN fields to retain in output tsv of merged hits

Returns

List of dicts of merged hit data, where the fields of each entry are: molecule, start, end, length

Return type

list
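The merge steps described above can be sketched roughly as follows. This is an illustrative reimplementation on plain dicts, not minorg's actual code: the name merge_hits_sketch, the max_pident bookkeeping, the normalisation of minus-strand coordinates, and the 1-based inclusive length are all assumptions made for the example.

```python
def merge_hits_sketch(hits, merge_within_range=100, min_id=90, min_len=100,
                      check_id_before_merge=False):
    """Hypothetical sketch. hits: iterable of dicts with sseqid, sstart, send, pident."""
    # Assumption: minus-strand hits report sstart > send, so normalise to start <= end.
    norm = [
        {"molecule": h["sseqid"],
         "start": min(h["sstart"], h["send"]),
         "end": max(h["sstart"], h["send"]),
         "pident": h["pident"]}
        for h in hits
    ]
    # Optional step: drop low-identity hits before any merging happens.
    if check_id_before_merge:
        norm = [h for h in norm if h["pident"] >= min_id]
    norm.sort(key=lambda h: (h["molecule"], h["start"]))

    # Merge hits on the same molecule that lie within merge_within_range bp.
    merged = []
    for h in norm:
        last = merged[-1] if merged else None
        if (last is not None and last["molecule"] == h["molecule"]
                and h["start"] - last["end"] <= merge_within_range):
            last["end"] = max(last["end"], h["end"])
            last["max_pident"] = max(last["max_pident"], h["pident"])
        else:
            merged.append({"molecule": h["molecule"], "start": h["start"],
                           "end": h["end"], "max_pident": h["pident"]})

    # Discard merged ranges with no hit >= min_id, or shorter than min_len bp.
    kept = []
    for m in merged:
        m["length"] = m["end"] - m["start"] + 1  # assumed 1-based inclusive
        if m["max_pident"] >= min_id and m["length"] >= min_len:
            kept.append(m)
    return kept


example_hits = [
    {"sseqid": "chr1", "sstart": 100, "send": 300, "pident": 95.0},
    {"sseqid": "chr1", "sstart": 350, "send": 500, "pident": 80.0},
    {"sseqid": "chr1", "sstart": 2000, "send": 2050, "pident": 99.0},
    {"sseqid": "chr2", "sstart": 900, "send": 700, "pident": 85.0},
]
merged_default = merge_hits_sketch(example_hits)
merged_strict = merge_hits_sketch(example_hits, check_id_before_merge=True)
```

In this toy input, the 80% identity hit extends the first candidate only when check_id_before_merge is False, which is the practical difference between the two modes.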

minorg.extract_homologue.merge_hits_and_filter(blast6_fname, fout, fasta, quiet=True, min_cds_len=0, indv_i=1, colnames=('molecule', 'start', 'end', 'max_pident', 'length'), blast6cds_fname=None, lvl=0, **for_merge_hits)[source]

Merge hits to generate homologues and filter by minimum inferred CDS length.

merge_hits(): Merge hits in proximity to each other to generate candidate homologues.
[optional] Filter candidate homologues by minimum CDS length (min_cds_len), as determined by how many bases in the candidate are covered by hits from reference CDS sequences.

Parameters

blast6_fname (str) – path to BLASTN output file (outfmt 6) prepended with header, where query was reference genomic sequences and subject was fasta
fout (str) – path to output FASTA file in which to write homologues
fasta (str) – path to FASTA file in which to find homologues
quiet (bool) – print only essential messages
min_cds_len (int) – minimum CDS length in homologues (default=0)
indv_i (str or int) – alias for fasta, used for generating sequence names
colnames (tuple or list) – fields in blast6_fname and blast6cds_fname
blast6cds_fname (str) – optional, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta; if provided, candidate homologues will be filtered by minimum CDS length
lvl (int) – optional, indentation level of printed messages
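One way to read the min_cds_len filter is as a coverage count: the number of bases in a candidate that are covered by at least one reference CDS hit. The sketch below illustrates that reading on plain Python data. The function names, the (molecule, start, end) tuple layout for CDS hits, and the 1-based inclusive coordinates are assumptions for the example, not minorg's internals.

```python
def covered_bases(intervals):
    """Count positions covered by at least one (start, end) interval (inclusive)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if cur_end is None or start > cur_end:
            # Close the previous run of overlapping intervals, if any.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping interval: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def filter_by_cds_coverage(candidates, cds_hits, min_cds_len=0):
    """Hypothetical helper: keep candidates with >= min_cds_len CDS-covered bases.

    candidates: dicts with molecule, start, end.
    cds_hits: (molecule, start, end) tuples with start <= end (assumed layout).
    """
    kept = []
    for cand in candidates:
        # Clip each overlapping CDS hit to the candidate's own range.
        clipped = [
            (max(s, cand["start"]), min(e, cand["end"]))
            for mol, s, e in cds_hits
            if mol == cand["molecule"] and s <= cand["end"] and e >= cand["start"]
        ]
        if covered_bases(clipped) >= min_cds_len:
            kept.append(cand)
    return kept


candidates = [
    {"molecule": "chr1", "start": 100, "end": 500},
    {"molecule": "chr1", "start": 1000, "end": 1200},
]
cds_hits = [("chr1", 120, 200), ("chr1", 180, 260), ("chr1", 1100, 1120)]
kept = filter_by_cds_coverage(candidates, cds_hits, min_cds_len=100)
```

Overlapping CDS hits are counted once per base, so two hits over the same exon do not inflate the inferred CDS length.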

minorg.extract_homologue.recip_blast(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, **kwargs)[source]: Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)

minorg.extract_homologue.recip_blast_multiref(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, attribute_mod={}, **kwargs)[source]

Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)

gff and fasta_ref must be dictionaries of {<alias>: <path to file>}

Parameters

fasta_target (str) – path to FASTA file containing query sequences for reciprocal BLAST
directory (str) – path to directory in which to write temporary files
gff (dict) – dictionary of GFF3 files for subjects of reciprocal BLAST
fasta_ref (dict) – dictionary of FASTA files containing subject sequences for reciprocal BLAST
blastn (str) – optional, required only if check_reciprocal=True, blastn command (e.g. ‘blastn’) if available at CLI else path to blastn executable
bedtools (str) – optional, required only if bedtools is not in the command-search path; path to directory containing BEDTools executables
attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
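Because gff and fasta_ref are both dictionaries of {<alias>: <path to file>}, it can help to check them before calling recip_blast_multiref. The sketch below assumes that each alias should appear in both dictionaries, which is not stated in the documentation above. The check_ref_aliases helper, the aliases, and the file paths are all hypothetical.

```python
def check_ref_aliases(gff, fasta_ref):
    """Hypothetical pre-flight check (not part of minorg).

    Assumes every reference alias needs both a GFF3 file and a FASTA file.
    """
    if not isinstance(gff, dict) or not isinstance(fasta_ref, dict):
        raise TypeError("gff and fasta_ref must both be dicts of {alias: path}")
    # Aliases present in one dictionary but not the other.
    unpaired = sorted(set(gff) ^ set(fasta_ref))
    if unpaired:
        raise ValueError(f"aliases without a matching GFF3/FASTA pair: {unpaired}")
    return sorted(gff)


# Placeholder aliases and paths, for illustration only.
gff = {"refA": "ref/refA.gff3", "refB": "ref/refB.gff3"}
fasta_ref = {"refA": "ref/refA.fasta", "refB": "ref/refB.fasta"}
aliases = check_ref_aliases(gff, fasta_ref)
```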

minorg.extract_homologue.remove_non_max_bitscore(fasta, bedtool, genes, relax=False, lvl=0, quiet=True, colnames_blast=['chrom', 'start', 'end', 'candidate', 'cstart', 'cend'], blast_metrics=['bitscore'], colnames_bed=['bed_chrom', 'bed_start', 'bed_end', 'id', 'score', 'strand', 'source', 'feature', 'phase', 'attributes', 'overlap'], colnames_gff=['bed_chrom', 'source', 'feature', 'bed_start', 'bed_end', 'score', 'strand', 'phase', 'attributes', 'overlap'], bedtools='', attribute_mod={}) → None[source]

Remove query sequences for which the subject feature in the query-subject hit with the max bitscore is not a target gene/feature. This occurs in-place.

Parameters

fasta (str) – path to FASTA file of query sequences to reduce
bedtool (BedTool) – BedTool object where BLAST hits have been intersected with subject GFF3 files
genes (list) – gene/feature IDs of targets
relax (bool) – retain query sequences even if max bitscore hit overlaps with non-target feature so long as it also overlaps with a target feature
lvl (int) – printing indentation
quiet (bool) – silence non-essential messages
colnames_blast (list) – column names of BLAST output
blast_metrics (list) – additional column names of metrics in BLAST output
colnames_bed (list) – column names if the intersected annotation is in BED format
colnames_gff (list) – column names if the intersected annotation is in GFF3 format
bedtools (str) – path to directory containing BEDTools executables if bedtools is not in the command-search path
attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
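The effect of relax on the max-bitscore filter can be illustrated with a small sketch. It works on plain dicts rather than a BedTool object, and the row layout (candidate, bitscore, feature_id) and the name queries_to_keep are assumptions for the example. The real function edits the FASTA file in place, whereas this sketch only returns the query names that would be retained.

```python
from collections import defaultdict


def queries_to_keep(rows, genes, relax=False):
    """Hypothetical sketch of the max-bitscore decision per query sequence.

    rows: dicts with candidate (query name), bitscore, and feature_id
    (ID of the subject feature overlapping the hit); assumed layout.
    """
    targets = set(genes)
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["candidate"]].append(row)

    keep = set()
    for query, hits in by_query.items():
        best = max(h["bitscore"] for h in hits)
        # Features overlapped by the hit(s) with the max bitscore.
        best_ids = {h["feature_id"] for h in hits if h["bitscore"] == best}
        on_target = best_ids & targets
        off_target = best_ids - targets
        # Strict: best hits must overlap target features only.
        # Relaxed: overlapping at least one target feature is enough.
        if on_target and (relax or not off_target):
            keep.add(query)
    return keep


rows = [
    {"candidate": "q1", "bitscore": 500, "feature_id": "geneA"},
    {"candidate": "q1", "bitscore": 300, "feature_id": "geneX"},
    {"candidate": "q2", "bitscore": 400, "feature_id": "geneX"},
    {"candidate": "q2", "bitscore": 350, "feature_id": "geneA"},
    {"candidate": "q3", "bitscore": 450, "feature_id": "geneA"},
    {"candidate": "q3", "bitscore": 450, "feature_id": "geneX"},
]
strict_keep = queries_to_keep(rows, genes=["geneA"])
relaxed_keep = queries_to_keep(rows, genes=["geneA"], relax=True)
```

Here q3 is the interesting case: its best-scoring hits overlap both a target and a non-target feature, so it is retained only when relax=True.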