minorg.extract_homologue module

minorg.extract_homologue.filter_min_cds_len(blast6cds_fname, merged, min_cds_len=0, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]

Filter candidate homologues by minimum CDS length.

Parameters
  • blast6cds_fname (str) – required, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta

  • merged (list) – list of candidate homologue data output by merge_hits()

  • min_cds_len (int) – minimum CDS length in homologues (default=0)

Returns

List of dicts of merged hit data, where the fields of each entry are: molecule, start, end, length

Return type

list

minorg.extract_homologue.find_homologue_imap(indv_fout_query_dir_kwargs)[source]

Wrapper of find_homologue_indv function for pickling.

minorg.extract_homologue.get_merged_seqs(merged_f, fasta, fout, header=[], indv_i=1)[source]

minorg.extract_homologue.merge_hits(data, merge_within_range=100, min_id=90, min_len=100, check_id_before_merge=False, colnames=('sseqid', 'sstart', 'send', 'pident', 'length'))[source]

Merge BLAST hits to generate candidate homologues.

  1. [optional] If check_id_before_merge=True, discard hits with % identity < min_id

  2. Merge hits within merge_within_range bp of each other
    • Discard if no hit included in the merge has a % identity >= min_id

    • Discard if resultant merged range is shorter than min_len bp

Parameters
  • data (minorg.functions.BlastResult) – BLASTN result (or BlastResult coerced into reusable iterable)

  • merge_within_range (int) – maximum number of bases between hits to be merged

  • min_id (float) – minimum hit % identity

  • min_len (int) – minimum candidate homologue length

  • check_id_before_merge (bool) – discard hits with % identity < min_id before merging

  • colnames (tuple) – BLASTN fields to retain in output tsv of merged hits

Returns

List of dicts of merged hit data, where the fields of each entry are: molecule, start, end, length

Return type

list
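The merging steps described for merge_hits() can be illustrated with a minimal sketch. This is not minorg's implementation: the function name merge_hits_sketch, the plain-dict input, the gap definition (next start minus previous end), the 1-based inclusive length, and the max_pident field are all assumptions made for illustration.

```python
def merge_hits_sketch(hits, merge_within_range=100, min_id=90, min_len=100,
                      check_id_before_merge=False):
    """Illustrative re-creation of the documented steps (hypothetical helper)."""
    # Step 1 (optional): drop low-identity hits before merging.
    if check_id_before_merge:
        hits = [h for h in hits if h["pident"] >= min_id]
    # Normalise coordinates so start <= end (minus-strand hits in
    # BLAST outfmt 6 have sstart > send), then sort by molecule and start.
    norm = sorted(
        (h["sseqid"], min(h["sstart"], h["send"]),
         max(h["sstart"], h["send"]), h["pident"])
        for h in hits
    )
    # Step 2: merge hits on the same molecule that lie close together.
    merged = []
    for molecule, start, end, pident in norm:
        last = merged[-1] if merged else None
        if (last and last["molecule"] == molecule
                and start - last["end"] <= merge_within_range):
            last["end"] = max(last["end"], end)
            last["max_pident"] = max(last["max_pident"], pident)
        else:
            merged.append({"molecule": molecule, "start": start,
                           "end": end, "max_pident": pident})
    # Discard merged ranges with no hit >= min_id or shorter than min_len.
    out = []
    for m in merged:
        m["length"] = m["end"] - m["start"] + 1  # assumes inclusive coords
        if m["max_pident"] >= min_id and m["length"] >= min_len:
            out.append(m)
    return out


hits = [
    {"sseqid": "chr1", "sstart": 100, "send": 300, "pident": 95.0},
    {"sseqid": "chr1", "sstart": 350, "send": 500, "pident": 80.0},
    {"sseqid": "chr1", "sstart": 2000, "send": 2050, "pident": 99.0},
    {"sseqid": "chr2", "sstart": 900, "send": 600, "pident": 85.0},
]
print(merge_hits_sketch(hits))
print(merge_hits_sketch(hits, check_id_before_merge=True))
```

In this toy input, the low-identity hit at 350-500 extends the first candidate when check_id_before_merge=False, but is dropped before merging when check_id_before_merge=True, so the two settings give candidates of different lengths.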

minorg.extract_homologue.merge_hits_and_filter(blast6_fname, fout, fasta, quiet=True, min_cds_len=0, indv_i=1, colnames=('molecule', 'start', 'end', 'max_pident', 'length'), blast6cds_fname=None, lvl=0, **for_merge_hits)[source]

Merge hits to generate homologues and filter by minimum inferred CDS length.

  1. merge_hits(): Merge hits in proximity to each other to generate candidate homologues.

  2. [optional] Filter candidate homologues by minimum CDS length (min_cds_len), as determined by how many bases in the candidate are covered by hits from reference CDS sequences.

Parameters
  • blast6_fname (str) – path to BLASTN output file (outfmt 6) prepended with header, where query was reference genomic sequences and subject was fasta

  • fout (str) – path to output FASTA file in which to write homologues

  • fasta (str) – path to FASTA file in which to find homologues

  • quiet (bool) – print only essential messages

  • min_cds_len (int) – minimum CDS length in homologues (default=0)

  • indv_i (str or int) – alias for fasta, used for generating sequence names

  • colnames (tuple or list) – fields in blast6_fname and blast6cds_fname

  • blast6cds_fname (str) – optional, path to BLASTN output file (outfmt 6) prepended with header, where query was reference CDS sequences and subject was fasta; if provided, candidate homologues will be filtered by minimum CDS length

  • lvl (int) – optional, indentation level of printed messages
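The optional CDS-length filter counts how many bases of each candidate are covered by hits from reference CDS sequences. A minimal sketch of that idea follows. It is not minorg's code: the helper names covered_bases and filter_by_cds_len, the dict-based inputs, and the 1-based inclusive coordinates are assumptions for illustration.

```python
def covered_bases(intervals):
    """Count bases covered by the union of closed (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous run (if any) and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def filter_by_cds_len(candidates, cds_hits, min_cds_len=0):
    """Keep candidates whose CDS-covered length is at least min_cds_len."""
    kept = []
    for cand in candidates:
        clipped = []
        for h in cds_hits:
            if h["sseqid"] != cand["molecule"]:
                continue
            lo, hi = sorted((h["sstart"], h["send"]))
            # Clip each CDS hit to the candidate's range.
            lo, hi = max(lo, cand["start"]), min(hi, cand["end"])
            if lo <= hi:
                clipped.append((lo, hi))
        if covered_bases(clipped) >= min_cds_len:
            kept.append(cand)
    return kept


candidates = [
    {"molecule": "chr1", "start": 100, "end": 500, "length": 401},
    {"molecule": "chr1", "start": 1000, "end": 1400, "length": 401},
]
cds_hits = [
    {"sseqid": "chr1", "sstart": 120, "send": 220},
    {"sseqid": "chr1", "sstart": 200, "send": 300},
    {"sseqid": "chr1", "sstart": 1390, "send": 1500},
]
print(filter_by_cds_len(candidates, cds_hits, min_cds_len=150))
```

Overlapping CDS hits are counted once, so two hits spanning 120-220 and 200-300 contribute 181 covered bases rather than 202.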

minorg.extract_homologue.recip_blast(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, **kwargs)[source]

Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)

minorg.extract_homologue.recip_blast_multiref(fasta_target, directory, gff, fasta_ref, blastn='blastn', bedtools='', keep_tmp=False, attribute_mod={}, **kwargs)[source]

Additional args: genes (tuple), quiet (bool), relax (bool), lvl (int)

gff and fasta_ref must be dictionaries of {<alias>: <path to file>}

Parameters
  • fasta_target (str) – path to FASTA file containing query sequences for reciprocal BLAST

  • directory (str) – path to directory in which to write temporary files

  • gff (dict) – dictionary of GFF3 files for subjects of reciprocal BLAST

  • fasta_ref (dict) – dictionary of FASTA files containing subject sequences for reciprocal BLAST

  • blastn (str) – optional, required only if check_reciprocal=True, blastn command (e.g. ‘blastn’) if available at CLI else path to blastn executable

  • bedtools (str) – optional, required only if bedtools is not in command-search path; path to directory containing BEDTools executables

  • attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
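As an illustration of the {<alias>: <path to file>} inputs expected for gff and fasta_ref, the sketch below builds two such dictionaries and checks that they share the same aliases before use. The file paths, the aliases, and the helper pair_references are hypothetical, and the requirement that both dictionaries use identical aliases is an assumption rather than documented behaviour.

```python
from pathlib import Path


def pair_references(gff, fasta_ref):
    """Pair annotation and sequence files by alias (hypothetical helper).

    Assumes every alias should appear in both dictionaries.
    """
    unmatched = set(gff) ^ set(fasta_ref)
    if unmatched:
        raise ValueError(f"aliases without a matching file: {sorted(unmatched)}")
    return {alias: (Path(gff[alias]), Path(fasta_ref[alias])) for alias in gff}


# Hypothetical aliases and paths, for illustration only.
gff = {"ref1": "annotations/ref1.gff3", "ref2": "annotations/ref2.gff3"}
fasta_ref = {"ref1": "genomes/ref1.fasta", "ref2": "genomes/ref2.fasta"}

refs = pair_references(gff, fasta_ref)
for alias, (gff_path, fasta_path) in refs.items():
    print(alias, gff_path, fasta_path)
```

Checking the aliases up front gives an early, readable error if one reference has an annotation but no sequence file, or the reverse.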

minorg.extract_homologue.remove_non_max_bitscore(fasta, bedtool, genes, relax=False, lvl=0, quiet=True, colnames_blast=['chrom', 'start', 'end', 'candidate', 'cstart', 'cend'], blast_metrics=['bitscore'], colnames_bed=['bed_chrom', 'bed_start', 'bed_end', 'id', 'score', 'strand', 'source', 'feature', 'phase', 'attributes', 'overlap'], colnames_gff=['bed_chrom', 'source', 'feature', 'bed_start', 'bed_end', 'score', 'strand', 'phase', 'attributes', 'overlap'], bedtools='', attribute_mod={}) → None[source]

Remove query sequences for which the subject feature in the query-subject hit with the max bitscore is not a target gene/feature. This occurs in-place.

Parameters
  • fasta (str) – path to FASTA file of query sequences to reduce

  • bedtool (BedTool) – BedTool object where BLAST hits have been intersected with subject GFF3 files

  • genes (list) – gene/feature IDs of targets

  • relax (bool) – retain query sequences even if max bitscore hit overlaps with non-target feature so long as it also overlaps with a target feature

  • lvl (int) – printing indentation

  • quiet (bool) – silence non-essential messages

  • colnames_blast (list) – column names of BLAST output

  • blast_metrics (list) – additional column names of metrics in BLAST output

  • colnames_bed (list) – column names to use if the intersected annotation is in BED format

  • colnames_gff (list) – column names to use if the intersected annotation is in GFF3 format

  • bedtools (str) – path to directory containing BEDTools executables if bedtools is not in command-search path

  • attribute_mod (dict) – optional, required only if non-standard attribute field names are present in GFF3 files. Dictionary describing attribute modification.
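The selection rule applied by remove_non_max_bitscore(), including the effect of relax, can be sketched on plain tuples. This is an illustration only, not minorg's implementation: the function queries_to_keep and the (query, bitscore, feature_id) row layout are assumptions, and hits tied for the maximum bitscore are pooled here, which may differ from the library's behaviour.

```python
def queries_to_keep(hits, genes, relax=False):
    """Return query IDs whose max-bitscore hit overlaps a target feature.

    hits: list of (query, bitscore, feature_id) rows, one per hit-feature
    overlap (hypothetical layout).
    """
    genes = set(genes)
    # Highest bitscore seen for each query.
    best = {}
    for query, bitscore, _ in hits:
        best[query] = max(best.get(query, bitscore), bitscore)
    # Features overlapped by each query's max-bitscore hit(s).
    top_features = {}
    for query, bitscore, feature_id in hits:
        if bitscore == best[query]:
            top_features.setdefault(query, set()).add(feature_id)
    keep = set()
    for query, features in top_features.items():
        on_target = features & genes
        off_target = features - genes
        # Strict mode rejects any off-target overlap; relax tolerates it
        # as long as a target feature is also overlapped.
        if on_target and (relax or not off_target):
            keep.add(query)
    return keep


hits = [
    ("q1", 500.0, "geneA"),
    ("q1", 300.0, "geneX"),
    ("q2", 400.0, "geneX"),
    ("q2", 200.0, "geneA"),
    ("q3", 450.0, "geneA"),
    ("q3", 450.0, "geneX"),
]
print(sorted(queries_to_keep(hits, ["geneA"])))
print(sorted(queries_to_keep(hits, ["geneA"], relax=True)))
```

Here q2 is always removed because its best hit lies on a non-target feature, while q3 is retained only with relax=True because its best hit overlaps both a target and a non-target feature.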