6. Tutorial (Python)
Please download the files in https://github.com/rlrq/MINORg/tree/master/examples. In all the examples below, replace “/path/to” with the appropriate full path, which is usually that of the directory containing these example files.
6.1. Setting up the tutorial
To ensure that the examples in this tutorial work, please replace ‘/path/to’ in the files ‘arabidopsis_genomes.txt’, ‘athaliana_genomes.txt’, and ‘subset_genome_mapping.txt’ with the full path to the directory containing the example files.
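If you would rather not edit the three files by hand, the replacement can be scripted. The following is a minimal sketch using only the Python standard library; the function name fill_in_paths is hypothetical (it is not part of MINORg), and the sketch assumes the placeholder appears literally as /path/to in each file.

```python
from pathlib import Path

# The three mapping files named in this section.
MAPPING_FILES = (
    "arabidopsis_genomes.txt",
    "athaliana_genomes.txt",
    "subset_genome_mapping.txt",
)

def fill_in_paths(example_dir, placeholder="/path/to", filenames=MAPPING_FILES):
    """Replace the placeholder with the absolute path of example_dir.

    Hypothetical helper, not part of MINORg. Returns the names of the
    files that were rewritten; files that are missing or that do not
    contain the placeholder are left untouched.
    """
    example_dir = Path(example_dir).resolve()
    rewritten = []
    for name in filenames:
        path = example_dir / name
        if not path.is_file():
            continue
        text = path.read_text()
        if placeholder in text:
            path.write_text(text.replace(placeholder, str(example_dir)))
            rewritten.append(name)
    return rewritten
```

Call it once with the directory that holds the downloaded example files, e.g. fill_in_paths("examples"). Keeping a copy of the original files is a sensible precaution, since the edit is done in place.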
6.2. Getting started
To begin, import the MINORg class.
>>> from minorg.MINORg import MINORg
To create a MINORg object:
>>> my_minorg = MINORg(directory = "/path/to/output/directory", prefix = "prefix")
Both directory and prefix are optional. If not provided, they will default to the current directory and ‘minorg’ respectively. If the directory does not currently exist, it will be created.
If you wish to use the default values specified in a config file, use this instead:
>>> my_minorg = MINORg(config = "/path/to/config.ini", directory = "/path/to/output/directory", prefix = "prefix")
You may now set your parameters using the attributes of your MINORg object. For a table listing the equivalent CLI arguments and MINORg attributes, see CLI vs Python.
6.3. IMPT: Note on executables
See: Executables
You can specify executables as such:
>>> my_minorg.blastn = '/path/to/blastn/executable'
>>> my_minorg.rpsblast = '/path/to/rpsblast/executable'
>>> my_minorg.mafft = '/path/to/mafft/executable'
>>> my_minorg.bedtools = '/path/to/bedtools2/bin'
Note that BEDTools is unique: if it is not in your command-search path, you should provide the path TO THE DIRECTORY CONTAINING ITS EXECUTABLES (i.e. there will not be a single ‘bedtools’ executable), and if it IS in your command-search path, you SHOULD NOT set the bedtools attribute. See Executables for more on executables.
6.4. Defining target sequences
6.4.1. User-provided targets
Let us begin with the simplest MINORg execution:
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_200_target")
>>> my_minorg.target = "/path/to/sample_CDS.fasta"
>>> my_minorg.full()
>>> my_minorg.resolve()
The above combination of arguments tells MINORg to generate gRNA from targets in a user-provided FASTA file (my_minorg.target = "/path/to/sample_CDS.fasta") and to output files into the directory /path/to/example_200_target. By default, MINORg generates 20 bp gRNA using NGG PAM. The full MINORg programme is executed by calling the full() method (my_minorg.full()). Don’t forget to call resolve() to remove any temporary files.
6.4.2. Reference gene(s) as targets
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_201_refgene")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050", "AT5G45060", "AT5G45200", "AT5G45210", "AT5G45220", "AT5G45230", "AT5G45240", "AT5G45250"]
>>> my_minorg.query_reference = True
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10") is used to specify information about a reference genome:
Positional argument 1: path to reference assembly (in this case "/path/to/subset_ref_TAIR10.fasta")
Positional argument 2: path to reference annotation (in this case "/path/to/subset_ref_TAIR10.gff")
Optional keyword argument 1 (alias): genome alias (in this case "TAIR10"); a unique name for the reference genome, used when referring to it in sequence names and output files. Autogenerated by MINORg if not provided.
See add_reference() and Non-standard reference for how to specify genetic code and non-standard attribute field names.
my_minorg.genes = ["AT5G45050", "AT5G45060", "AT5G45200", "AT5G45210", "AT5G45220", "AT5G45230", "AT5G45240", "AT5G45250"] tells MINORg the target gene(s), and my_minorg.query_reference = True tells MINORg to generate gRNA for reference gene(s).
6.4.3. Non-reference gene(s) as targets
6.4.3.1. Extending the reference
See also: Extended genome
If you have both genomic and CDS-only sequences of your target genes but not a GFF3 annotation file, MINORg can infer coding regions (CDS) for your target genes using extend_reference(). See Extended genome for how to name your sequences to ensure proper mapping of CDS to genes.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_202_ext")
>>> my_minorg.extend_reference("/path/to/sample_gene.fasta", "/path/to/sample_CDS.fasta")
>>> my_minorg.genes = ["AT1G10920"]
>>> my_minorg.query_reference = True
>>> my_minorg.full()
>>> my_minorg.resolve()
extend_reference() effectively adds new genes to the reference genome, so they can be used just like any reference gene. Therefore, they can also be used in combination with add_query().
6.4.3.2. Inferring homologues in unannotated genomes
See also: Non-reference homologue inference
If you would like MINORg to infer homologues in non-reference genomes, you can use add_query() to specify the FASTA files of those non-reference genomes.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_203_query")
>>> my_minorg.extend_reference("/path/to/sample_gene.fasta", "/path/to/sample_CDS.fasta")
>>> my_minorg.genes = ["AT1G10920"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654") and my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655") are used to specify information about query FASTA files.
The alias keyword argument is optional. If not provided, MINORg will generate a unique alias.
Query FASTA files are stored as a dictionary with the format {<alias>:<FASTA>} at the query attribute.
If you’d like to remove a query file that you’ve added, you can use:
>>> my_minorg.remove_query("9654")
The remove_query() method takes a query alias. If you did not specify an alias when using add_query() and do not know the alias of the file you wish to remove, you may view the query-FASTA mapping using the query attribute.
>>> my_minorg.query
{"9654": "/path/to/subset_9654.fasta", "9655": "/path/to/subset_9655.fasta"}
6.4.4. Domain as targets
MINORg allows users to specify the identifier of an RPS-BLAST position-specific scoring matrix (PSSM-Id) to further restrict the target sequence to a given domain associated with the PSSM-Id. This could be particularly useful when designing gRNA for genes that do not share conserved domain structures but do share a domain that you wish to knock out.
6.4.4.1. Local database
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_204_domain")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.rpsblast = "/path/to/rpsblast/executable"
>>> my_minorg.db = "/path/to/rpsblast/db"
>>> my_minorg.pssm_ids = ["214815"]
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, gRNA will be generated for the WRKY domain (PSSM-Id 214815 as of CDD database v3.18) of the gene AT5G45050. Users are responsible for providing the PSSM-Id of a domain that exists in the gene. If multiple PSSM-Ids are provided, overlapping domains will be combined and output WILL NOT distinguish between one PSSM-Id and another. Unlike other examples, the database (db) is not provided as part of the example files. If you are using the full Docker image pulled from rlrq/minorg, the database is bundled with the image. Otherwise, you will have to download it yourself. See RPS-BLAST local database for more information.
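To illustrate what "overlapping domains will be combined" means for the resulting target ranges, here is a small interval-merging sketch. It is not MINORg's code: the function name and the coordinates are invented for illustration, ranges are assumed to be half-open (start, end), and ranges that merely touch are assumed to stay separate.

```python
def merge_overlapping(ranges):
    """Merge overlapping (start, end) ranges; ends are exclusive.

    Illustrative only. Once two ranges are merged there is no record
    of which PSSM-Id each original range came from.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlaps the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical domain hits from two PSSM-Ids on one gene.
hits = [(100, 250), (200, 320), (500, 640)]
print(merge_overlapping(hits))  # [(100, 320), (500, 640)]
```

The first two hits collapse into a single target region, which is why the output cannot attribute a gRNA in that region to one PSSM-Id rather than the other.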
6.4.4.2. Remote database
While it is in theory possible to use the remote CDD database & servers instead of local ones, the -remote option for the ‘rpsblast’/‘rpsblast+’ command from the BLAST+ package has never worked for me. In any case, if your version of local rpsblast is able to access the remote database, you can use remote_rps.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_204_domain")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.rpsblast = "/path/to/rpsblast/executable"
>>> my_minorg.db = "Cdd"
>>> my_minorg.remote_rps = True
>>> my_minorg.pssm_ids = ["214815"]
>>> my_minorg.full()
>>> my_minorg.resolve()
6.5. Defining gRNA
See also: PAM
By default, MINORg generates 20 bp gRNA using SpCas9’s NGG PAM. You may specify a different gRNA length using length and a different PAM using pam.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_205_grna")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.length = 23
>>> from minorg import pam
>>> my_minorg.pam = pam.Cas12a
>>> my_minorg.full()
>>> my_minorg.resolve()
In the example above, MINORg will generate 23 bp gRNA (my_minorg.length = 23) using Cas12a’s unusual 5’ PAM pattern (TTTV<gRNA>) (my_minorg.pam = pam.Cas12a). MINORg has several built-in PAMs (see Preset PAM patterns for options), and also supports customisable PAM patterns using ambiguous bases and regular expressions (see PAM for format). To use preset PAMs, such as in the example above, you will first need to import MINORg’s minorg.pam module (from minorg import pam), then use pam.<preset pam alias> (such as pam.Cas12a) to refer to the desired PAM pattern.
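For readers unfamiliar with ambiguous bases, the sketch below shows how IUPAC codes such as N and V expand into regular-expression character classes. It only illustrates the idea and is not MINORg's PAM parser: the function name is hypothetical, and MINORg's own PAM syntax (described in PAM) has features this sketch does not cover.

```python
import re

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Expand a motif of IUPAC codes into a regex string (illustrative)."""
    parts = []
    for base in motif.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return "".join(parts)

print(iupac_to_regex("NGG"))   # [ACGT]GG
print(iupac_to_regex("TTTV"))  # TTT[ACG]

# V means "not T", so TTTC is a valid Cas12a PAM but TTTT is not.
print(bool(re.fullmatch(iupac_to_regex("TTTV"), "TTTC")))  # True
print(bool(re.fullmatch(iupac_to_regex("TTTV"), "TTTT")))  # False
```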
6.6. Filtering gRNA
MINORg supports 3 different gRNA filtering options, all of which can be used together.
6.6.1. Filter by GC content
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_206_gc")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.gc_min = 0.2
>>> my_minorg.gc_max = 0.8
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, MINORg will exclude gRNA with less than 20% (my_minorg.gc_min = 0.2) or greater than 80% (my_minorg.gc_max = 0.8) GC content. By default, minimum GC content is 30% and maximum is 70%.
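The GC filter itself is simple arithmetic. The sketch below is not MINORg's implementation: the function names are hypothetical, the bounds are assumed to be inclusive (consistent with "less than" and "greater than" above), and the third sequence is made up.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_gc(seq, gc_min=0.3, gc_max=0.7):
    """True if GC content lies within [gc_min, gc_max] (assumed inclusive)."""
    return gc_min <= gc_fraction(seq) <= gc_max

candidates = [
    "CTTCATCTTCTTCTCGAAAT",  # 7/20 = 0.35 GC
    "GATGTTTTCTTGAGCTTCAG",  # 8/20 = 0.40 GC
    "ATGCATGCATATATATATGA",  # 5/20 = 0.25 GC (made-up sequence)
]

# Default thresholds (30% to 70%) reject the third sequence...
print([passes_gc(s) for s in candidates])            # [True, True, False]
# ...while the relaxed 20% to 80% window accepts all three.
print([passes_gc(s, 0.2, 0.8) for s in candidates])  # [True, True, True]
```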
6.6.2. Filter by off-target
6.6.2.1. Using total mismatch/gap/unaligned
See: Total mismatch/gap/unaligned
Thresholds for total number of mismatches or gaps (and unaligned positions) required for an off-target gRNA hit to be considered non-problematic are controlled by ot_mismatch and ot_gap respectively. See Total mismatch/gap/unaligned for more.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_207_ot_ref")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.screen_reference = True
>>> my_minorg.add_background("/path/to/subset_ref_Araly2.fasta", alias = "araly")
>>> my_minorg.add_background("/path/to/subset_ref_Araha1.fasta", alias = "araha")
>>> my_minorg.add_background("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_background("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.ot_gap = 2
>>> my_minorg.ot_mismatch = 2
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, MINORg will screen gRNA for off-targets in:
The reference genome (my_minorg.screen_reference)
Four different FASTA files (my_minorg.add_background("<FASTA>", alias = "<alias>"))
    The alias keyword argument is optional. If not provided, MINORg will generate a unique alias.
    Note that any AT5G45050 homologues in these four FASTA files will NOT be masked. This means that only gRNA that do not target any AT5G45050 homologues in these four genomes will pass this off-target check.
    To mask homologues in these genomes, you will need to provide a FASTA file containing the sequences of their homologues using my_minorg.mask = ["/path/to/to_mask_1.fasta", "/path/to/to_mask_2.fasta"]. You may use subcommand seq() (see Subcommands) to identify these homologues and retrieve their sequences.
ot_gap and ot_mismatch control the minimum number of gaps or mismatches off-target gRNA hits must have to be considered non-problematic; any gRNA with at least one problematic gRNA hit will be excluded. By default, both values are set to ‘1’. See Off-target assessment for more on the off-target assessment algorithm.
In the case above, my_minorg.screen_reference = True is actually redundant as the genome(s) from which targets are obtained (which, because of my_minorg.query_reference, is the reference genome) are automatically included for background check.
However, in the example below, when the targets are from non-reference genomes, the reference genome is not automatically included for off-target assessment and thus screen_reference is NOT redundant. Additionally, do note that the genes specified using genes are masked in the reference genome, such that any gRNA hits to them are NOT considered off-target and will NOT be excluded.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_208_ot_nonref")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.screen_reference = True
>>> my_minorg.add_background("/path/to/subset_ref_Araly2.fasta", alias = "araly")
>>> my_minorg.add_background("/path/to/subset_ref_Araha1.fasta", alias = "araha")
>>> my_minorg.add_background("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.ot_gap = 2
>>> my_minorg.ot_mismatch = 2
>>> my_minorg.full()
>>> my_minorg.resolve()
6.6.2.2. Using position-specific mismatch/gap/unaligned
See: Position-specific mismatch/gap/unaligned
Finer control of off-target definition can be achieved using ot_pattern, which allows users to provide a pattern that specifies different thresholds for different positions along a gRNA. Unlike ot_mismatch and ot_gap, which specify the LOWER-bound of NON-problematic hits, ot_pattern specifies the UPPER-bound of PROBLEMATIC hits. By default, unaligned positions will be treated as mismatches, but this behaviour can be altered by setting ot_unaligned_as_mismatch to False. See Off-target pattern for how to build an off-target pattern, and Position-specific mismatch/gap/unaligned for more on how unaligned positions can be counted.
When ot_pattern is specified, ot_mismatch and ot_gap will be ignored.
The following example is identical to the first in Using total mismatch/gap/unaligned, except ot_mismatch and ot_gap are replaced with ot_pattern.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_209_ot_ref_pattern")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.screen_reference = True
>>> my_minorg.add_background("/path/to/subset_ref_Araly2.fasta", alias = "araly")
>>> my_minorg.add_background("/path/to/subset_ref_Araha1.fasta", alias = "araha")
>>> my_minorg.add_background("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_background("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.ot_pattern = "0mg-10,1mg-11-"
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, my_minorg.ot_pattern = "0mg-10,1mg-11-" means that MINORg will discard any gRNA with at least one off-target hit where:
There are no mismatches or gaps between positions -10 and -1, and there is no more than 1 mismatch or gap from position -11 to the 5’ end.
See Off-target pattern for how to build and interpret an off-target pattern.
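The rule described for "0mg-10,1mg-11-" can be written out as a small predicate. This sketch hard-codes that one pattern rather than parsing the pattern string, and it is not MINORg's implementation: the function name and input format are hypothetical, positions are assumed to be counted from the 3’ end of the gRNA (-1 being the last base), and unaligned positions are not modelled.

```python
def is_problematic(diff_positions, grna_length=20):
    """Apply the rule described for "0mg-10,1mg-11-" to one off-target hit.

    diff_positions: negative positions (-1 = 3' end of the gRNA) at which
    the hit has a mismatch or a gap relative to the gRNA.
    Returns True if the hit is close enough to count as problematic.
    """
    for pos in diff_positions:
        if not -grna_length <= pos <= -1:
            raise ValueError("position %d is outside the gRNA" % pos)
    # Window 1: positions -10 to -1 must have no mismatch or gap.
    near_3prime = sum(1 for pos in diff_positions if -10 <= pos <= -1)
    # Window 2: position -11 to the 5' end may have at most one.
    towards_5prime = sum(1 for pos in diff_positions if pos <= -11)
    return near_3prime == 0 and towards_5prime <= 1

print(is_problematic([]))          # True: perfect match
print(is_problematic([-15]))       # True: one mismatch, far from the 3' end
print(is_problematic([-15, -18]))  # False: two mismatches beyond -11
print(is_problematic([-3]))        # False: mismatch within -10 to -1
```

A gRNA is discarded as soon as any one of its off-target hits returns True.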
6.6.2.3. PAM-less off-target check
By default, MINORg does NOT check for the presence of PAM sites next to potential off-target hits. You may override this behaviour by setting ot_pamless to False. This tells MINORg to mark off-target hits that fail the ot_gap or ot_mismatch thresholds (or match ot_pattern) as problematic ONLY IF there is a PAM site nearby.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_210_ot_pamless")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.screen_reference = True
>>> my_minorg.add_background("/path/to/subset_ref_Araly2.fasta", alias = "araly")
>>> my_minorg.add_background("/path/to/subset_ref_Araha1.fasta", alias = "araha")
>>> my_minorg.add_background("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.ot_gap = 2
>>> my_minorg.ot_mismatch = 2
>>> my_minorg.ot_pamless = True
>>> my_minorg.full()
>>> my_minorg.valid_grna("background") ## gRNA that pass background filtering
gRNAHits(gRNA = 6)
>>> my_minorg.ot_pamless = False ## only remove gRNA from candidates if off-target hits have PAM site nearby
>>> my_minorg.full()
>>> my_minorg.valid_grna("background")
gRNAHits(gRNA = 12)
>>> my_minorg.resolve()
6.6.2.4. Skip off-target check
To skip off-target check entirely, use background_check = False when calling full().
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_211_skipbgcheck")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.full(background_check = False)
(several warning messages about the background check being unset will pop up, but you can ignore them)
>>> my_minorg.resolve()
6.6.3. Filter by feature
By default, when genes is set, MINORg restricts gRNA to coding regions (CDS). For more on how MINORg does this for inferred, unannotated homologues, see Within-feature inference. You may change the feature type in which to design gRNA using the attribute feature. See column 3 of your GFF3 file for valid feature types (see https://en.wikipedia.org/wiki/General_feature_format for more on GFF file format).
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_212_withinfeature")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.query_reference = True
>>> my_minorg.feature = "three_prime_UTR"
>>> my_minorg.full(background_check = False)
>>> my_minorg.resolve()
6.7. Generating minimum gRNA set(s)
6.7.1. Number of sets
By default, MINORg outputs a single gRNA set covering all targets. You may request more (mutually exclusive) sets using the sets attribute.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_213_set")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G46260", "AT5G46270", "AT5G46450", "AT5G46470", "AT5G46490", "AT5G46510", "AT5G46520"]
>>> my_minorg.query_reference = True
>>> my_minorg.sets = 5
>>> my_minorg.full()
>>> my_minorg.resolve()
6.7.2. Prioritise non-redundancy
By default, MINORg selects gRNA for sets using these criteria in decreasing order of priority:
Coverage (of as yet uncovered targets)
Proximity to 5’ end
Non-redundancy
Proximity is only assessed when there is a tie for coverage, and non-redundancy when there is a tie for both coverage and proximity. You may instead prioritise non-redundancy over proximity by setting prioritise_nr to True. MINORg will use a combination of approximate and optimal weighted set cover algorithms to output small sets with low redundancy. However, do note that the sets will in general be larger than when prioritise_nr is False.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_214_nr")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G46260", "AT5G46270", "AT5G46450", "AT5G46470", "AT5G46490", "AT5G46510", "AT5G46520"]
>>> my_minorg.query_reference = True
>>> my_minorg.prioritise_nr = True
>>> my_minorg.full()
>>> my_minorg.resolve()
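To make the priority order concrete, here is a plain greedy set-cover sketch with the tie-breaks described in this section. It is explicitly not MINORg's algorithm (which, as noted above, combines approximate and optimal weighted set cover): the function name, the input format, and the definitions of "proximity" (mean distance of the newly covered hits from the 5’ end) and "redundancy" (number of already-covered targets hit again) are all assumptions made for illustration.

```python
def greedy_set(candidates, prioritise_nr=False):
    """Pick gRNA until every target is covered.

    candidates: {gRNA name: {target: distance of the hit from the 5' end}}
    Hypothetical sketch; ties are finally broken by gRNA name.
    """
    uncovered = set()
    for hits in candidates.values():
        uncovered.update(hits)
    covered, chosen = set(), []

    def rank(name):
        hits = candidates[name]
        new = uncovered & set(hits)
        coverage = len(new)
        # Smaller mean distance = closer to the 5' end = better.
        proximity = sum(hits[t] for t in new) / coverage if new else float("inf")
        # Targets this gRNA hits that are already covered by chosen gRNA.
        redundancy = len(covered & set(hits))
        tie_breaks = (redundancy, proximity) if prioritise_nr else (proximity, redundancy)
        return (-coverage,) + tie_breaks + (name,)

    while uncovered:
        best = min(candidates, key=rank)
        chosen.append(best)
        covered |= set(candidates[best])
        uncovered -= covered
    return chosen

# Invented toy data: three gRNA, three targets.
toy = {
    "gRNA_A": {"t1": 10, "t2": 12},
    "gRNA_B": {"t2": 5, "t3": 20},
    "gRNA_C": {"t3": 40},
}
print(greedy_set(toy))                      # ['gRNA_A', 'gRNA_B']
print(greedy_set(toy, prioritise_nr=True))  # ['gRNA_A', 'gRNA_C']
```

After gRNA_A is chosen, gRNA_B and gRNA_C each cover the one remaining target. By default the gRNA closer to the 5’ end (gRNA_B) wins; with prioritise_nr = True the gRNA that does not re-hit an already covered target (gRNA_C) wins instead.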
6.7.3. Excluding gRNA
You may exclude specific gRNA sequences from any final gRNA set by setting exclude to the path of a FASTA file containing the sequences to exclude.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_215_exclude")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G46260", "AT5G46270", "AT5G46450", "AT5G46470", "AT5G46490", "AT5G46510", "AT5G46520"]
>>> my_minorg.query_reference = True
>>> my_minorg.exclude = "/path/to/sample_exclude_RPS6.fasta"
>>> my_minorg.full()
>>> my_minorg.resolve()
The gRNA names in the file passed to exclude do not matter. Only the sequences are used when determining whether to exclude a gRNA.
6.7.4. Accepting unknown checks
Sometimes, not all filtering checks (GC, background, and feature) are set for all sequences. This is not an issue if you use the full programme (i.e. full()), but may be relevant if you are re-generating sets using the ‘minimumset’ subcommand (i.e. minimumset()) with a modified mapping file OR a mapping file from the ‘filter’ subcommand where not all filters have been applied.
Let us take a look at ‘sample_custom_check.map’, where we’ve added a custom check called ‘my_custom_check’ in the last column:
gRNA id   gRNA sequence         target id  target sense  gRNA strand  start  end  group  background  GC    feature  my_custom_check
gRNA_001  CTTCATCTTCTTCTCGAAAT  targetA    NA            +            8      27   1      pass        pass  NA       pass
gRNA_001  CTTCATCTTCTTCTCGAAAT  targetB    NA            +            80     99   1      pass        pass  NA       pass
gRNA_002  GATGTTTTCTTGAGCTTCAG  targetA    NA            +            37     56   1      pass        pass  NA       NA
gRNA_002  GATGTTTTCTTGAGCTTCAG  targetB    NA            +            286    305  1      pass        pass  NA       pass
gRNA_002  GATGTTTTCTTGAGCTTCAG  targetC    NA            +            109    128  1      pass        pass  NA       fail
gRNA_002  GATGTTTTCTTGAGCTTCAG  targetD    NA            +            110    129  1      pass        pass  NA       fail
gRNA_003  ATGTTTTCTTGAGCTTCAGA  targetB    NA            +            38     57   1      pass        pass  NA       NA
gRNA_003  ATGTTTTCTTGAGCTTCAGA  targetC    NA            +            287    306  1      pass        pass  NA       pass
gRNA_003  ATGTTTTCTTGAGCTTCAGA  targetD    NA            +            110    129  1      pass        pass  NA       pass
There are three possible values for check status: ‘pass’, ‘fail’, and ‘NA’.
An invalid/unset check is an ‘NA’. If a check is unset for all entries (as is the case with the check ‘feature’ here), it will be ignored (i.e. the check is treated as ‘pass’ for all entries). However, when a check has been set for some entries but not others (as is the case with the ‘my_custom_check’ check here), MINORg will treat invalid/unset checks as ‘fail’ by default. This is because there isn’t enough information on whether this constitutes a pass or fail for the check, and MINORg prefers to be conservative when outputting gRNA. You may override this behaviour by setting accept_invalid to True. By doing so, MINORg will treat ‘NA’ as ‘pass’ for all checks.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_216_acceptinvalid")
>>> my_minorg.parse_grna_map_from_file("/path/to/sample_custom_check.map")
>>> my_minorg.accept_invalid = True
>>> my_minorg.minimumset()
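The ‘NA’ rules described in this section can be sketched as follows. This is an illustration of the stated rules applied to the check columns of ‘sample_custom_check.map’, not MINORg's code; the function name and the row format (one dictionary of check statuses per row) are hypothetical.

```python
def resolve_checks(rows, checks, accept_invalid=False):
    """Return one True/False per row (True = row passes all checks).

    rows: list of dicts mapping check name -> 'pass', 'fail' or 'NA'.
    """
    # A check that is 'NA' for every row was never set, so it is ignored.
    active = [c for c in checks if any(row[c] != "NA" for row in rows)]
    # Otherwise 'NA' counts as a fail unless accept_invalid is True.
    ok = {"pass", "NA"} if accept_invalid else {"pass"}
    return [all(row[c] in ok for c in active) for row in rows]

# my_custom_check column of the nine rows shown above, top to bottom.
custom = ["pass", "pass", "NA", "pass", "fail", "fail", "NA", "pass", "pass"]
rows = [
    {"background": "pass", "GC": "pass", "feature": "NA", "my_custom_check": c}
    for c in custom
]
checks = ["background", "GC", "feature", "my_custom_check"]

print(sum(resolve_checks(rows, checks)))                       # 5
print(sum(resolve_checks(rows, checks, accept_invalid=True)))  # 7
```

The ‘feature’ check never blocks a row because it is unset everywhere, whereas the two ‘NA’ entries in ‘my_custom_check’ only pass once accept_invalid is True.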
6.7.5. Manually approve gRNA sets
You may opt to manually inspect each gRNA set before MINORg writes them to file by using manual = True when executing full() or the minimum set subcommand minimumset().
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_217_manual")
>>> my_minorg.target = "/path/to/sample_CDS.fasta"
>>> my_minorg.full(manual = True)
ID sequence (Set 1)
gRNA_001 GGAATACAAGAGATTATCGA
Hit 'x' to continue if you are satisfied with these sequences. Otherwise, enter the sequence ID or
sequence of an undesirable gRNA (case-sensitive) and hit the return key to update this list: x
Final gRNA sequence(s) have been written to minorg_gRNA_final.fasta
Final gRNA sequence ID(s), gRNA sequence(s), and target(s) have been written to minorg_gRNA_final.map
1 mutually exclusive gRNA set(s) requested. 1 set(s) found.
Output files have been generated in /path/to/example_217_manual
6.8. Subcommands
MINORg comprises four main steps:
Target sequence identification
Candidate gRNA generation
gRNA filtering
Minimum gRNA set generation
As users may only wish to execute a subset of these steps instead of the full programme (full()), MINORg also provides four subcommands (methods) corresponding to these four steps:
seq()
grna()
filter(), which itself calls three other methods
minimumset()
The subcommands may be useful if you already have a preferred off-target/on-target assessment software. In this case, you may execute subcommands seq() and grna(), submit the gRNA output by MINORg for off-target/on-target assessment, update the .map file output by MINORg with the status of each gRNA for that off-target/on-target assessment, and execute minimumset() to obtain a desired number of minimum gRNA sets. Note that if you do this, you should re-read the updated .map file into MINORg using parse_grna_map_from_file() so MINORg can replace the gRNA data stored in memory with your updated gRNA data.
Each subcommand may require a different combination of attributes.
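If you follow the external-assessment workflow described above, the .map file has to be updated with the results of your own tool. The sketch below shows one way to add a check column using the standard csv module. It rests on assumptions that you should verify against your own files: that the .map file is tab-delimited with a header row, and that the identifier column is named ‘gRNA id’ as in the sample table in Accepting unknown checks. The function name add_check and the check name external_ot are hypothetical.

```python
import csv
import io

def add_check(map_text, check_name, status_by_id, id_column="gRNA id"):
    """Return map_text with check_name set for every row.

    status_by_id: {gRNA id: 'pass' or 'fail'}; gRNA missing from the
    mapping get 'NA'. Hypothetical helper, not part of MINORg.
    """
    reader = csv.DictReader(io.StringIO(map_text), delimiter="\t")
    fields = list(reader.fieldnames)
    if check_name not in fields:
        fields.append(check_name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # The same status is applied to every row (target) of a gRNA.
        row[check_name] = status_by_id.get(row[id_column], "NA")
        writer.writerow(row)
    return out.getvalue()

# Tiny invented example with only three of the usual columns.
example = (
    "gRNA id\tgRNA sequence\tGC\n"
    "gRNA_001\tCTTCATCTTCTTCTCGAAAT\tpass\n"
    "gRNA_002\tGATGTTTTCTTGAGCTTCAG\tpass\n"
)
print(add_check(example, "external_ot", {"gRNA_001": "fail"}))
```

The updated text can then be written to a file and read back with parse_grna_map_from_file(). Remember that ‘NA’ entries in a partially set check are treated as ‘fail’ unless accept_invalid is True.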
6.8.1. Subcommand seq()
The seq() subcommand identifies target sequences, whether by extracting them from a reference genome or inferring homologues in unannotated genomes. All parameters introduced in Defining target sequences (except attribute target) and Defining reference genomes apply. If you already have a FASTA file containing your target sequences, you may set target to the path of that FASTA file and skip this subcommand.
This step will output target sequences into a file ending with ‘_targets.fasta’. This filename will be stored at attribute target.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_218_subcmdseq")
>>> my_minorg.extend_reference("/path/to/sample_gene.fasta", "/path/to/sample_CDS.fasta")
>>> my_minorg.genes = ["AT1G10920"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.seq()
>>> my_minorg.target
'/path/to/example_218_subcmdseq/minorg/minorg_gene_targets.fasta'
6.8.2. Subcommand grna()
The grna() subcommand generates gRNA within target sequences from a target file. Unlike the command line version, it DOES NOT incorporate parts of the seq() and filter() subcommands. All parameters introduced in Defining gRNA apply.
By default, .map and FASTA files of gRNA sequences will be written automatically. You may override this behaviour by setting auto_update_files to False or using auto_update_files = False when instantiating a MINORg object (e.g. my_minorg = MINORg(directory = "/path/to/output/dir", auto_update_files = False)). In this case, only the FASTA file will be written. To manually write files, you should use the following methods. If you do not supply an output file path, it will be automatically generated:
write_all_grna_map(): write .map file containing all candidate gRNA (no checks will be set by grna() so all entries in check fields will be ‘NA’)
    Path to output file will be stored at grna_map
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_all.map
write_all_grna_fasta(): write FASTA file containing all candidate gRNA
    Path to output file will be stored at grna_fasta
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_all.fasta
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_219_subcmdgrna")
>>> my_minorg.target = "/path/to/sample_CDS.fasta"
>>> my_minorg.grna() ## default 3' NGG PAM
PAM pattern: .{20}(?=[GATC]GG)
>>> my_minorg.grna_hits
gRNAHits(gRNA = 201)
>>> from minorg import pam
>>> my_minorg.pam = pam.Cas12a ## 5' TTTV PAM
>>> my_minorg.grna() ## regenerate gRNA
PAM pattern: (?<=TTT[ACG]).{20}
>>> my_minorg.grna_hits
gRNAHits(gRNA = 95)
>>> my_minorg.pam = "ATV."
>>> my_minorg.grna() ## regenerate gRNA
PAM pattern: (?<=AT[ACG]).{20}
>>> my_minorg.grna_hits
gRNAHits(gRNA = 267)
>>> my_minorg.write_all_grna_fasta()
>>> my_minorg.grna_fasta
'/path/to/example_219_subcmdgrna/minorg/minorg_gRNA_all.fasta'
>>> my_minorg.write_all_grna_fasta("/path/to/another/location.fasta")
>>> my_minorg.grna_fasta
'/path/to/another/location.fasta'
gRNA data is stored at the attribute grna_hits, whose string representation shows the number of gRNA. In the above example, 201 different gRNA are generated from the target sequences in the target file “sample_CDS.fasta”. We then decided we want to generate gRNA for Cas12a instead, which has a 5’ TTTV PAM pattern. This yields 95 different gRNA. Finally, we decided to try a completely made-up 5’ ATV PAM pattern, netting us 267 different gRNA in the end. Satisfied, we wrote the sequences of these gRNA to file, and printed the path of the file.
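The "PAM pattern" lines printed in the session above are ordinary regular expressions, so their behaviour can be explored with Python's re module. The sketch below is not how MINORg searches for gRNA: the function name is hypothetical, only the given strand is searched, and each pattern is wrapped in a lookahead so that overlapping candidates are all reported.

```python
import re

def find_candidates(seq, pam_regex, length=20, pam_is_3prime=True):
    """Return (start, spacer) for every spacer of `length` bases that
    sits next to a PAM on the given strand. Overlapping hits are kept."""
    spacer = "(.{%d})" % length
    if pam_is_3prime:
        # Spacer followed by PAM, e.g. NGG.
        pattern = "(?=" + spacer + "(?=" + pam_regex + "))"
    else:
        # PAM followed by spacer, e.g. TTTV. Python lookbehind requires
        # a fixed-width PAM regex.
        pattern = "(?<=" + pam_regex + ")(?=" + spacer + ")"
    return [(m.start(), m.group(1)) for m in re.finditer(pattern, seq)]

# Invented toy sequences: 20 bases next to a single PAM.
print(find_candidates("A" * 20 + "TGG", "[GATC]GG"))
# [(0, 'AAAAAAAAAAAAAAAAAAAA')]
print(find_candidates("TTTC" + "G" * 20, "TTT[ACG]", pam_is_3prime=False))
# [(4, 'GGGGGGGGGGGGGGGGGGGG')]
```

The reported start is the 0-based index of the first spacer base. For a 5’ PAM the spacer begins immediately after the PAM, which is why the second hit starts at index 4.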
6.8.3. Subcommand filter()
The filter() subcommand takes in a compulsory MINORg .map file (which can be read using parse_grna_map_from_file()) and rewrites some/all checks. You can execute all filters (GC, off-target, and feature) using filter(), or execute checks separately using filter_gc(), filter_background(), and filter_feature().
By default, gRNA sequences and map files will be updated automatically whenever any of the filtering methods is called. You may override this behaviour by setting auto_update_files to False or using auto_update_files = False when instantiating a MINORg object (e.g. my_minorg = MINORg(directory = "/path/to/output/dir", auto_update_files = False)). To manually write files, you should use the following methods. If you do not supply an output file path, it will be automatically generated:
write_all_grna_map(): write .map file containing all candidate gRNA and checks
    Path to output file will be stored at grna_map
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_all.map
write_all_grna_fasta(): write FASTA file containing all candidate gRNA
    Path to output file will be stored at grna_fasta
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_all.fasta
    This file will NOT be auto updated as it is not affected by filtering check status
write_pass_grna_map(): write .map file containing all passing gRNA
    Path to output file will be stored at pass_map
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_pass.map
write_pass_grna_fasta(): write FASTA file containing all passing gRNA
    Path to output file will be stored at pass_fasta
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_pass.fasta
In all cases, you may rename the gRNA using rename_grna(), which takes in the path of a FASTA file containing the gRNA sequences you wish to rename, where the sequence ID of each entry is the new name for that sequence. This method should be used before you call any of the above methods to write gRNA to file.
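The renaming FASTA file is effectively a lookup table from sequence to new name. The sketch below illustrates that idea on in-memory data. It does not reproduce rename_grna() itself: the function names, the dictionary-based gRNA representation, and the new name in the example are all hypothetical.

```python
def parse_fasta(text):
    """Return [sequence ID, sequence] pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first word after '>'.
            words = line[1:].split()
            records.append([words[0] if words else "", ""])
        elif records:
            records[-1][1] += line.upper()
    return records

def rename(grna_by_name, fasta_text):
    """Rename gRNA whose sequence appears in the FASTA text.

    grna_by_name: {current name: sequence}. gRNA with no matching
    sequence in the FASTA text keep their current name.
    """
    new_name = {seq: seq_id for seq_id, seq in parse_fasta(fasta_text)}
    return {new_name.get(seq, old): seq for old, seq in grna_by_name.items()}

grna = {
    "gRNA_001": "CTTCATCTTCTTCTCGAAAT",
    "gRNA_002": "GATGTTTTCTTGAGCTTCAG",
}
fasta = ">my_favourite_gRNA\nCTTCATCTTCTTCTCGAAAT\n"
print(rename(grna, fasta))
# {'my_favourite_gRNA': 'CTTCATCTTCTTCTCGAAAT', 'gRNA_002': 'GATGTTTTCTTGAGCTTCAG'}
```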
6.8.3.1. Subcommand filter_gc()
All parameters introduced in Filter by GC content apply.
6.8.3.1.1. Filtering by GC content after calling full()
filter_gc() can be used on an active MINORg object even if you’ve already called full().
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_220_subcmdfilter_gc")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050", "AT5G45060", "AT5G45200", "AT5G45210", "AT5G45220", "AT5G45230", "AT5G45240", "AT5G45250"]
>>> my_minorg.query_reference = True
>>> my_minorg.full()
>>> my_minorg.grna_hits
gRNAHits(gRNA = 2141)
>>> my_minorg.valid_grna("GC") ## gRNA that pass GC filter
gRNAHits(gRNA = 1871)
>>> my_minorg.gc_min = 0.2
>>> my_minorg.gc_max = 0.8
>>> my_minorg.filter_gc() ## re-filter by GC content
>>> my_minorg.valid_grna("GC") ## gRNA that pass GC filter
gRNAHits(gRNA = 2097)
>>> my_minorg.minimumset()
>>> my_minorg.resolve()
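As a rough illustration of what a GC content filter computes, the sketch below calculates the GC fraction of a gRNA sequence and compares it against a minimum and maximum. The functions gc_content and passes_gc are hypothetical and are not part of MINORg. The inclusive bounds and the default values used here are assumptions made for this example only.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_gc(seq, gc_min=0.3, gc_max=0.7):
    """Hypothetical check: True if gc_min <= GC fraction <= gc_max.
    Inclusive bounds and default values are assumptions."""
    return gc_min <= gc_content(seq) <= gc_max

candidates = ["GTCGTTTCCGGAGACTATGA", "ATATATATATATATATATAT"]
passing = [seq for seq in candidates if passes_gc(seq, gc_min=0.2, gc_max=0.8)]
print(passing)
```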
6.8.3.1.2. Filtering GC content on output of another MINORg run
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_220_subcmdfilter_gc_pt2", auto_update_files = False)
>>> my_minorg.parse_grna_map_from_file("/path/to/sample_custom_check.map")
>>> my_minorg.valid_grna("GC")
gRNAHits(gRNA = 3)
>>> my_minorg.gc_min = 0.4
>>> my_minorg.gc_max = 0.6
>>> my_minorg.filter_gc()
>>> my_minorg.valid_grna("GC")
gRNAHits(gRNA = 1)
>>> my_minorg.write_pass_grna_fasta()
>>> my_minorg.resolve()
6.8.3.2. Subcommand filter_background()
All parameters introduced in Filter by off-target apply. Additionally, you should supply target sequences to target so that MINORg can mask them (this tells MINORg that any gRNA hits to them are in fact on-target and NOT off-target). Any additional sequences to be masked may be provided to mask as a list of paths to FASTA files. If you have set screen_reference to True to include reference genome(s) in the off-target screen (see Multiple reference genomes for how to specify multiple reference genomes), you may also add a FASTA file of the gene sequences to be masked to mask. You can generate these sequences using the seq() subcommand, but MAKE SURE TO USE A DIFFERENT MINORg OBJECT AND DIRECTORY TO AVOID OVERWRITING ANY PREVIOUSLY GENERATED FILES.
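Conceptually, masking means that a gRNA hit falling inside a masked region is not counted as off-target. The sketch below shows that idea with plain coordinate tuples. The function is_off_target, the tuple layout, and the coordinates are all hypothetical; MINORg's own masking works on sequences and alignments and may differ in detail.

```python
def is_off_target(hit, masked_regions):
    """Hypothetical check. 'hit' and each masked region are
    (sequence_id, start, end) tuples, 0-based and end-exclusive.
    A hit fully contained in a masked region is treated as on-target."""
    seq_id, start, end = hit
    for masked_id, masked_start, masked_end in masked_regions:
        if seq_id == masked_id and masked_start <= start and end <= masked_end:
            return False
    return True

# Made-up coordinates for illustration only
masked = [("Chr5", 1000, 5000)]
hits = [("Chr5", 1200, 1220), ("Chr5", 8000, 8020), ("Chr1", 1200, 1220)]
off_targets = [hit for hit in hits if is_off_target(hit, masked)]
print(off_targets)
```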
6.8.3.2.1. Filtering background after calling full()
Let us first execute MINORg.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_221_subcmdfilter_bg")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G46450", "AT5G46470", "AT5G46490", "AT5G46510", "AT5G46520"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.sets = 5
>>> my_minorg.full(background_check = False)
In the code above, we skipped the off-target check by passing background_check = False when executing full(). But we’ve changed our mind and would now like to screen the reference genome and the non-reference genomes that these targets are from, AND we don’t want our gRNA to be able to target any genes in ‘subset_9944.fasta’ and ‘subset_9947.fasta’. We also want to tell MINORg that it’s okay if a gRNA has off-target effects in the homologous genes AT5G46260 and AT5G46270 in the reference genome. We can do that using the filter_background() subcommand, followed by the minimumset() subcommand to regenerate minimum sets.
In order to do all this, we will need the gene sequences of AT5G46260 and AT5G46270 so that we can mask them in the reference genome. We can get them using the get_reference_seq() method.
>>> ot_minorg = MINORg(directory = "/path/to/example_221_subcmdfilter_bg_tomask") ## different directory
>>> ot_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> ot_minorg.genes = ["AT5G46260", "AT5G46270"]
>>> fout_to_mask = ot_minorg.mkfname("ref_to_mask.fasta") ## MINORg has a built-in method to generate file names within the output directory
>>> ot_minorg.get_reference_seq(fout = fout_to_mask) ## this method will return a dictionary of sequences, but will also write to file if 'fout' is used
>>> ot_minorg.resolve()
Now that we have the reference sequences to mask, we can pass the file name to my_minorg’s mask attribute, add our background files using add_background(), set screen_reference to True, call filter_background() to update off-target checks for all candidate gRNA, and execute minimumset() to regenerate our minimum gRNA sets. You may also wish to call write_all_grna_map(), write_pass_grna_map(), and/or write_pass_grna_fasta() to update the gRNA FASTA and .map files if auto_update_files has been set to False.
>>> my_minorg.mask.append(fout_to_mask)
>>> my_minorg.add_background("/path/to/subset_9944.fasta", alias = "9944")
>>> my_minorg.add_background("/path/to/subset_9947.fasta", alias = "9947")
>>> my_minorg.screen_reference = True
>>> my_minorg.filter_background()
>>> my_minorg.minimumset()
>>> my_minorg.resolve()
6.8.3.2.2. Filtering background on output of another MINORg run
Alternatively, if the original my_minorg object no longer exists, whether because you’ve closed the IDE session or deleted the object, you can read its .map file into a new MINORg object using parse_grna_map_from_file() as shown below. In this case, you can pass the IDs of the additional genes to be masked together with the original genes to genes, and you don’t need to use get_reference_seq(). Since we’re no longer querying ‘subset_9654.fasta’ and ‘subset_9655.fasta’, we can use add_background() to tell MINORg to search for off-target effects in them. And don’t forget to also provide the FASTA file of target sequences to target so MINORg can mask them!
>>> from minorg.MINORg import MINORg
>>> new_minorg = MINORg(directory = "/path/to/example_221_subcmdfilter_bg_new")
>>> new_minorg.parse_grna_map_from_file("/path/to/example_221_subcmdfilter_bg/minorg/minorg_gRNA_all.map")
>>> new_minorg.target = "/path/to/example_221_subcmdfilter_bg/minorg/minorg_gene_targets.fasta"
>>> new_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> new_minorg.genes = ["AT5G46260", "AT5G46270", "AT5G46450", "AT5G46470", "AT5G46490", "AT5G46510", "AT5G46520"]
>>> new_minorg.add_background("/path/to/subset_9654.fasta", alias = "9654")
>>> new_minorg.add_background("/path/to/subset_9655.fasta", alias = "9655")
>>> new_minorg.add_background("/path/to/subset_9944.fasta", alias = "9944")
>>> new_minorg.add_background("/path/to/subset_9947.fasta", alias = "9947")
>>> new_minorg.screen_reference = True
>>> new_minorg.filter_background()
>>> new_minorg.minimumset()
>>> new_minorg.resolve()
6.8.3.3. Subcommand filter_feature()
All parameters introduced in Filter by feature apply. Additionally, you will need to provide a FASTA file of target sequences (attribute target), reference genome(s) (see Defining reference genomes), and genes (attribute genes). The specified reference gene(s) will be extracted from the reference genome(s) and aligned with target sequence(s) in order for MINORg to infer feature boundaries in target sequence(s). See Within-feature inference for the algorithm of how feature boundaries are inferred.
6.8.3.3.1. Filtering feature after calling full()
Let us first execute MINORg.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_222_subcmdfilter_feature")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.genes = ["AT5G45050"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.full()
>>> my_minorg.valid_grna("feature") ## gRNA that pass within-feature filter
gRNAHits(gRNA = 368)
By default, MINORg sets the desired feature to ‘CDS’. You can re-assess and overwrite the ‘feature’ check in the .map file to only allow gRNA in other GFF features, such as the 3’ UTR, by updating feature and using filter_feature() to re-filter gRNA for the new feature.
>>> my_minorg.feature = "three_prime_UTR"
>>> my_minorg.filter_feature()
>>> my_minorg.valid_grna("feature") ## gRNA that pass within-feature filter
gRNAHits(gRNA = 5)
>>> my_minorg.minimumset()
>>> my_minorg.resolve()
6.8.3.3.2. Filtering feature on output of another MINORg run
As with Filtering background on output of another MINORg run, we can read in the output of a previous MINORg execution and filter that. This requires the .map file ending with ‘_all.map’ (parse using parse_grna_map_from_file()) as well as a FASTA file of target sequences (specify using target).
>>> from minorg.MINORg import MINORg
>>> new_minorg = MINORg(directory = "/path/to/example_222_subcmdfilter_feature_new")
>>> new_minorg.parse_grna_map_from_file("/path/to/example_222_subcmdfilter_feature/minorg/minorg_gRNA_all.map")
>>> new_minorg.target = "/path/to/example_222_subcmdfilter_feature/minorg/minorg_gene_targets.fasta"
>>> new_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> new_minorg.genes = ["AT5G45050"] ## MINORg needs to know which reference genes to align with the targets in order to infer feature ranges
>>> new_minorg.feature = "three_prime_UTR"
>>> new_minorg.filter_feature()
>>> new_minorg.minimumset()
>>> new_minorg.resolve()
6.8.4. Subcommand minimumset()
The minimumset() subcommand generates mutually exclusive minimum set(s) of gRNA, where each set is capable of covering all targets. It requires a MINORg .map file (the one that ends in ‘_gRNA_pass.map’ is sufficient, but ‘_gRNA_all.map’ would allow for filtering by a custom combination of fields). All parameters introduced in Generating minimum gRNA set(s) apply.
This step will write final gRNA sequences into a file ending with ‘_gRNA_final.fasta’. A file ending with ‘_gRNA_final.map’ that maps gRNA to their targets will also be generated. You may optionally specify the location of the FASTA and .map output files using:
final_map: path of .map file containing gRNA in final set(s)
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_final.map
final_fasta: path of FASTA file containing gRNA in final set(s)
    If output file is not specified, the output file will be <output_directory>/<prefix>/<prefix>_gRNA_final.fasta
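To give a feel for what "mutually exclusive minimum sets that each cover all targets" means, the sketch below uses a simple greedy set-cover heuristic. This is an illustrative stand-in and is NOT MINORg's actual algorithm (see Generating minimum gRNA set(s) for that). The functions greedy_cover and mutually_exclusive_sets, the gRNA IDs, and the target names are all hypothetical.

```python
def greedy_cover(coverage, targets):
    """Greedy set cover: repeatedly pick the gRNA covering the most
    still-uncovered targets. Returns a list of gRNA IDs, or None if
    the remaining gRNA cannot cover every target."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda g: len(coverage[g] & uncovered), default=None)
        if best is None or not coverage[best] & uncovered:
            return None
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

def mutually_exclusive_sets(coverage, targets, sets=1):
    """Build up to 'sets' covering sets that share no gRNA."""
    remaining = dict(coverage)
    result = []
    for _ in range(sets):
        chosen = greedy_cover(remaining, targets)
        if chosen is None:
            break
        result.append(chosen)
        for grna in chosen:
            del remaining[grna]  # used gRNA cannot appear in later sets
    return result

# Hypothetical gRNA -> covered targets mapping
coverage = {
    "gRNA_001": {"t1", "t2"},
    "gRNA_002": {"t3"},
    "gRNA_003": {"t1", "t2", "t3"},
    "gRNA_004": {"t1"},
}
found = mutually_exclusive_sets(coverage, {"t1", "t2", "t3"}, sets=3)
print(f"3 set(s) requested. {len(found)} set(s) found: {found}")
```

As in MINORg's own output, fewer sets may be found than were requested when the candidate gRNA run out.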
6.8.4.1. Regenerating minimum sets after calling full()
minimumset() can also be used on an active MINORg object.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_223_subcmdminimumset_pt1")
>>> ## <set up parameters>
>>> my_minorg.full()
>>> my_minorg.sets = 5
>>> my_minorg.minimumset() ## regenerate up to 5 gRNA sets
6.8.4.2. Generating minimum sets from output of another MINORg run
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_223_subcmdminimumset_pt2")
>>> my_minorg.parse_grna_map_from_file("/path/to/example_203_query/minorg/minorg_gRNA_all.map")
>>> my_minorg.target = "/path/to/example_203_query/minorg/minorg_gene_targets.fasta"
>>> my_minorg.prioritise_nr = True
>>> my_minorg.sets = 5
>>> my_minorg.minimumset(gc_check = False)
>>> my_minorg.resolve()
To break ties, MINORg may need to assess a gRNA’s proximity to the 5’ end of a target (ideally on the sense strand). It is therefore strongly suggested that target sequences be provided to target so that MINORg knows how long each target sequence is. This is especially important if the target sequences are antisense (you can check this using the .map file), which can happen when they are generated by MINORg’s inference of homologues in unannotated genomes. In the example above, we’ve asked MINORg to ignore the GC content check when generating minimum sets (my_minorg.minimumset(gc_check = False)).
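The reason target length matters can be seen in a small calculation: for an antisense target, the distance of a gRNA from the 5’ end of the sense strand depends on where the target ends. The sketch below is a hypothetical illustration of that arithmetic; the function distance_from_sense_5prime and its coordinate convention are assumptions and are not taken from MINORg.

```python
def distance_from_sense_5prime(start, end, target_length, target_is_sense=True):
    """Hypothetical helper. 'start' and 'end' are 0-based, end-exclusive
    positions of a gRNA on the target sequence as written in the FASTA file.
    For an antisense target, the sense 5' end is the END of the sequence,
    so the target length is needed to compute the distance."""
    if target_is_sense:
        return start
    return target_length - end

print(distance_from_sense_5prime(10, 30, 100, target_is_sense=True))
print(distance_from_sense_5prime(10, 30, 100, target_is_sense=False))
```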
6.8.5. Chaining subcommands
You may use subcommands separately if you’d like to inspect the outcome of each step and/or repeat a step with different parameters before proceeding with the next. MINORg tracks the output of previous steps, so you do not need to read them into MINORg before executing the next step.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_224_subcmd", prefix = "test", thread = 1)
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10", replace = True)
>>> my_minorg.add_reference("/path/to/subset_ref_Araly2.fasta", "/path/to/subset_ref_Araly2.gff", alias = "araly2")
>>> my_minorg.genes = ["AT1G33560", "AL1G47950.v2.1"]
>>> my_minorg.query_reference = True
>>> my_minorg.seq() ## generate target sequences
>>> my_minorg.target ## print path to FASTA file containing target sequences
'/path/to/example_224_subcmd/test/test_gene_targets.fasta'
>>> my_minorg.grna()
PAM pattern: .{20}(?=[GATC]GG)
>>> my_minorg.screen_reference = True
>>> my_minorg.filter_background()
Masking on-targets
Finding off-targets
>>> my_minorg.valid_grna("background")
gRNAHits(gRNA = 395)
>>> my_minorg.add_background("/path/to/subset_ref_Araha1.fasta", alias = "araha1") ## add background file
>>> my_minorg.filter_background() ## repeat background check with additional background file
Masking on-targets
Finding off-targets
>>> my_minorg.valid_grna("background") ## updated set of passing gRNA
gRNAHits(gRNA = 250)
>>> my_minorg.filter_gc()
>>> my_minorg.valid_grna("GC")
gRNAHits(gRNA = 355)
>>> my_minorg.valid_grna("background", "GC")
gRNAHits(gRNA = 223)
>>> my_minorg.valid_grna() ## gRNA filtered for all valid checks (at this point, background and GC)
/path/to/minorg/grna.py:823: MINORgWarning: The following hit checks have not been set: feature
gRNAHits(gRNA = 223)
>>> my_minorg.filter_feature() ## by default, MINORg only retains gRNA in CDS
>>> my_minorg.valid_grna("feature")
gRNAHits(gRNA = 324)
>>> my_minorg.valid_grna()
gRNAHits(gRNA = 181)
>>> my_minorg.minimumset(manual = True)
ID sequence (Set 1)
gRNA_026 GTCGTTTCCGGAGACTATGA
Hit 'x' to continue if you are satisfied with these sequences. Otherwise, enter the sequence ID or
sequence of an undesirable gRNA (case-sensitive) and hit the return key to update this list: gRNA_026
ID sequence (Set 1)
gRNA_223 TCAATCTCCATCATAGTCTC
Hit 'x' to continue if you are satisfied with these sequences. Otherwise, enter the sequence ID or
sequence of an undesirable gRNA (case-sensitive) and hit the return key to update this list: x
Final gRNA sequence(s) have been written to /path/to/example_224_subcmd/test/test_gRNA_final.fasta
Final gRNA sequence ID(s), gRNA sequence(s), and target(s) have been written to
/path/to/example_224_subcmd/test/test_gRNA_final.map
1 mutually exclusive gRNA set(s) requested. 1 set(s) found.
>>> my_minorg.write_all_grna_map() ## write .map file containing check information for all candidate gRNA
>>> my_minorg.write_all_grna_fasta() ## write FASTA file containing all candidate gRNA
>>> my_minorg.write_pass_grna_map() ## write .map file containing information for valid gRNA
>>> my_minorg.write_pass_grna_fasta() ## write FASTA file containing valid gRNA
>>> my_minorg.resolve() ## remove temporary files
It is highly recommended that you execute resolve() to remove any temporary files generated.
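The PAM pattern printed by grna() in the example above, .{20}(?=[GATC]GG), is a regular expression: 20 characters of any kind (the protospacer), followed by a lookahead for any base and then GG (an NGG PAM). The sketch below applies that pattern with Python's re module. The function find_protospacers is hypothetical, it searches the forward strand only, and the outer lookahead used to report overlapping candidates is an assumption of this example rather than a description of MINORg's implementation.

```python
import re

PAM_PATTERN = r".{20}(?=[GATC]GG)"  # pattern as printed by grna() above

def find_protospacers(seq):
    """Hypothetical helper: return (start, protospacer) for every
    forward-strand match, including overlapping ones."""
    # Wrapping the pattern in a zero-width lookahead lets finditer
    # advance one base at a time instead of skipping past each match.
    overlapping = re.compile(f"(?=({PAM_PATTERN}))")
    return [(m.start(), m.group(1)) for m in overlapping.finditer(seq.upper())]

single_hit = find_protospacers("GTCGTTTCCGGAGACTATGAAGGTT")
two_hits = find_protospacers("A" * 21 + "GGG")
print(single_hit)
print(two_hits)
```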
6.9. Defining reference genomes
6.9.1. Single reference genome
See example in Reference gene(s) as targets.
6.9.2. Multiple reference genomes
See also: Reference
You may specify genes from multiple reference genomes so long as those reference genomes have also been added using add_reference().
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_225_multiref")
>>> my_minorg.add_reference("/path/to/subset_ref_TAIR10.fasta", "/path/to/subset_ref_TAIR10.gff", alias = "TAIR10")
>>> my_minorg.add_reference("/path/to/subset_ref_Araly2.fasta", "/path/to/subset_ref_Araly2.gff", alias = "Araly2")
>>> my_minorg.add_reference("/path/to/subset_ref_Araha1.fasta", "/path/to/subset_ref_Araha1.gff", alias = "Araha1")
>>> my_minorg.genes = ["AT1G33560", "AL1G47950.v2.1", "Araha.3012s0003.v1.1"]
>>> my_minorg.query_reference = True
>>> my_minorg.full()
>>> my_minorg.resolve()
In the example above, MINORg will design gRNA for 3 highly conserved paralogues in 3 different species. Take care that every gene ID you use is either unique across all reference genomes OR shared only among your target genes. Otherwise, MINORg will also treat any undesired genes with the same gene IDs as targets.
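If you are unsure whether gene IDs collide across your reference genomes, you could check before running MINORg. The sketch below assumes you have already collected the gene IDs of each annotation into a dictionary keyed by reference alias. The function shared_gene_ids and the ID ‘gene1’ are hypothetical and not part of MINORg.

```python
from collections import defaultdict

def shared_gene_ids(gene_ids_by_reference):
    """Hypothetical helper: return {gene_id: [aliases]} for every gene ID
    that appears in more than one reference genome."""
    seen = defaultdict(set)
    for alias, gene_ids in gene_ids_by_reference.items():
        for gene_id in gene_ids:
            seen[gene_id].add(alias)
    return {gid: sorted(aliases) for gid, aliases in seen.items() if len(aliases) > 1}

# 'gene1' is a made-up ID used to show a collision
gene_ids_by_reference = {
    "TAIR10": ["AT1G33560", "gene1"],
    "Araly2": ["AL1G47950.v2.1", "gene1"],
    "Araha1": ["Araha.3012s0003.v1.1"],
}
collisions = shared_gene_ids(gene_ids_by_reference)
print(collisions)
```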
6.9.3. Non-standard reference
6.9.3.1. Non-standard genetic code
When using pssm_ids, users should ensure that the correct genetic code has been specified for reference genomes using the genetic_code keyword argument when adding reference genomes using add_reference(), as MINORg has to first translate CDS into peptides for domain search using RPS-BLAST. The default genetic code is the Standard Code. Please refer to https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for genetic code numbers and names.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_226_geneticcode")
>>> my_minorg.add_reference("/path/to/subset_ref_yeast_mt.fasta", "/path/to/subset_ref_yeast_mt.gff", alias = "yeast_mt", genetic_code = 3) ## specify genetic code here
>>> my_minorg.genes = ["gene-Q0275"]
>>> my_minorg.query_reference = True
>>> my_minorg.rpsblast = "/path/to/rpsblast/executable"
>>> my_minorg.db = "/path/to/rpsblast/db"
>>> my_minorg.pssm_ids = ["366140"]
>>> my_minorg.full()
>>> my_minorg.resolve()
In the above example, the gene ‘gene-Q0275’ is a yeast mitochondrial gene, and my_minorg.pssm_ids = ["366140"] specifies the PSSM-Id for the COX3 domain in the Cdd v3.18 RPS-BLAST database. The genetic code number for yeast mitochondrial code is ‘3’.
As a failsafe, MINORg does not terminate translated peptide sequences at the first stop codon. This ensures that any codons after an incorrectly translated premature stop codon will still be translated. Typically, a handful of mistranslated codons can still result in the correct RPS-BLAST domain hits, although hit scores may be slightly lower. Nevertheless, to ensure maximum accuracy, the correct genetic code is preferred.
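The effect of choosing the wrong genetic code, and of translating through stop codons, can be seen in a small standalone sketch. The codon tables below are written out from NCBI translation tables 1 (Standard) and 3 (Yeast Mitochondrial). The translate function is a hypothetical illustration and is not MINORg's translation routine, and the 9-base input is a made-up sequence.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of bases in translation tables
CODONS = ["".join(codon) for codon in product(BASES, repeat=3)]
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
YEAST_MT_AAS = "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODES = {
    1: dict(zip(CODONS, STANDARD_AAS)),
    3: dict(zip(CODONS, YEAST_MT_AAS)),
}

def translate(cds, genetic_code=1):
    """Translate every complete codon, WITHOUT stopping at stop codons
    ('*'). Unknown codons (e.g. containing N) become 'X'."""
    table = GENETIC_CODES[genetic_code]
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))

# TGA is a stop codon in the Standard Code but tryptophan (W) in code 3;
# CTT is leucine (L) in the Standard Code but threonine (T) in code 3.
print(translate("ATGTGACTT", genetic_code=1))
print(translate("ATGTGACTT", genetic_code=3))
```

Because translation continues past the ‘*’, codons after a wrongly called stop are still translated, which is the failsafe behaviour described above.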
6.9.3.2. Non-standard GFF3 attribute field names
See also: Attribute modification
MINORg requires standard attribute field names in GFF3 files in order to properly map subfeatures to their parent features (e.g. map CDS to mRNA, and mRNA to gene). Non-standard field names should be mapped to standard ones using the attr_mod (for ‘attribute modification’) keyword argument when adding reference genomes using add_reference().
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_227_attrmod")
>>> my_minorg.add_reference("/path/to/subset_ref_irgsp.fasta", "/path/to/subset_ref_irgsp.gff", alias = "irgsp", attr_mod = {"mRNA": {"Parent": "Locus_id"}}) ## specify attribute modifications
>>> my_minorg.genes = ["Os01g0100100"]
>>> my_minorg.query_reference = True
>>> my_minorg.full()
>>> my_minorg.resolve()
The IRGSP 1.0 reference genome for rice (Oryza sativa subsp. Nipponbare) uses a non-standard attribute field name for mRNA entries in its GFF3 file. Instead of ‘Parent’, which is the standard name of the field used to map a feature to its parent feature, mRNA entries in the IRGSP 1.0 annotation use ‘Locus_id’. See Attribute modification for more details on how to format the input to attr_mod.
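To illustrate what the attr_mod mapping expresses, the sketch below renames non-standard attribute fields in a GFF3 attribute column back to their standard names, using the same {feature type: {standard name: non-standard name}} layout as the example above. The function standardise_attributes and the transcript ID in the sample attribute string are hypothetical; this is not MINORg's internal implementation.

```python
def standardise_attributes(feature_type, attributes, attr_mod):
    """Hypothetical helper: given a GFF3 feature type and its raw
    attribute column, replace non-standard field names with the
    standard ones described by attr_mod."""
    # attr_mod maps standard -> non-standard; invert it for lookup
    renames = {nonstd: std for std, nonstd in attr_mod.get(feature_type, {}).items()}
    fields = []
    for field in attributes.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        fields.append(f"{renames.get(key, key)}={value}")
    return ";".join(fields)

attr_mod = {"mRNA": {"Parent": "Locus_id"}}
# The transcript ID below is made up for illustration
fixed = standardise_attributes("mRNA", "ID=transcript01;Locus_id=Os01g0100100", attr_mod)
print(fixed)
```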
6.10. Multithreading
MINORg supports multi-threading in order to process files in parallel. Any excess threads may also be used for BLAST. This is most useful when you are querying multiple genomes, have multiple reference genomes, or multiple background sequences.
NOTE for Docker users: Multithreading for parallel querying of multiple genomes and backgrounds is DISABLED for Docker distributions due to incompatibilities.
To run MINORg with parallel processing, set thread to the desired number of threads.
>>> from minorg.MINORg import MINORg
>>> my_minorg = MINORg(directory = "/path/to/example_228_thread")
>>> my_minorg.extend_reference("/path/to/sample_gene.fasta", "/path/to/sample_CDS.fasta")
>>> my_minorg.genes = ["AT1G10920"]
>>> my_minorg.add_query("/path/to/subset_9654.fasta", alias = "9654")
>>> my_minorg.add_query("/path/to/subset_9655.fasta", alias = "9655")
>>> my_minorg.thread = 2
>>> my_minorg.full()
>>> my_minorg.resolve()
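The general idea of processing several query files in parallel can be sketched with Python's standard library. This is only a conceptual illustration: the functions process_query and process_all are hypothetical placeholders, and MINORg's own parallelisation may be implemented differently.

```python
from concurrent.futures import ThreadPoolExecutor

def process_query(alias):
    """Placeholder for per-genome work (e.g. searching one query FASTA)."""
    return f"processed {alias}"

def process_all(aliases, thread=1):
    """Run process_query over all aliases using up to 'thread' workers.
    pool.map returns results in the same order as the inputs."""
    with ThreadPoolExecutor(max_workers=thread) as pool:
        return list(pool.map(process_query, aliases))

results = process_all(["9654", "9655"], thread=2)
print(results)
```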
6.11. Differences between CLI and Python versions
Note that, unlike the command line, the Python package does not support aliases even if the config file has been set up appropriately for command line executions. Therefore, there are no true equivalents to --cluster, --indv, or --reference.
6.11.1. To specify cluster genes
Analogous to --cluster and --gene.
Correct:
>>> my_minorg.genes = ['AT5G46260','AT5G46270','AT5G46450','AT5G46470','AT5G46490','AT5G46510','AT5G46520']
Incorrect:
>>> my_minorg.cluster_set = '/path/to/subset_cluster_mapping.txt'
>>> my_minorg.cluster = 'RPS6'
Attributes ‘cluster_set’ and ‘cluster’ do not exist. Setting them does not throw an error immediately, but will cause problems later.
6.11.2. To specify query FASTA files
Analogous to --indv and --query.
Correct:
>>> my_minorg.add_query('/path/to/subset_9654.fasta', alias = '9654')
>>> my_minorg.add_query('/path/to/subset_9655.fasta', alias = '9655')
Incorrect:
>>> my_minorg.genome_set = '/path/to/subset_genome_mapping.txt'
>>> my_minorg.indv = '9654,9655'
Attributes ‘genome_set’ and ‘indv’ do not exist. Setting them does not throw an error immediately, but will cause problems later.
6.11.3. To specify reference genomes
Analogous to --reference, --assembly, --annotation, --attr-mod, and --genetic-code.
Correct:
>>> my_minorg.add_reference('/path/to/TAIR10.fasta', '/path/to/TAIR10.gff3', alias = 'TAIR10', genetic_code = 1, attr_mod = {})
Note that attr_mod and genetic_code are optional if the annotation uses standard attribute field names and the standard genetic code, which the example above does.
Incorrect:
>>> my_minorg.reference_set = '/path/to/arabidopsis_genomes.txt'
>>> my_minorg.reference = 'TAIR10'
AttributeError: can't set attribute
Attribute ‘reference_set’ does not exist, and ‘reference’ is a property that users are not allowed to modify directly.
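The two behaviours described in this section (silently accepting a non-existent attribute, and rejecting assignment to a read-only property) are general Python behaviours. The toy class below reproduces them without MINORg. The class Demo is hypothetical and only mimics the pattern; the exact wording of the AttributeError message varies between Python versions.

```python
class Demo:
    """Toy stand-in (not the MINORg class) with a read-only property."""

    def __init__(self):
        self._reference = {}

    @property
    def reference(self):
        # No setter is defined, so direct assignment is rejected
        return self._reference

demo = Demo()

# Assigning to a name the class never defined silently creates a new
# attribute; nothing in the class will ever read it.
demo.cluster = "RPS6"

# Assigning to a property without a setter raises AttributeError.
try:
    demo.reference = "TAIR10"
    raised = False
except AttributeError as error:
    raised = True
    print("AttributeError:", error)
```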