8. Output
8.1. Directory structure
As a rule of thumb, with the default prefix ‘minorg’, MINORg generates the following files in the following directory structure:
<output_directory>/
 |-- minorg.log
 +-- minorg/
      |-- minorg_<domain>_mafft.fasta
      |-- minorg_<domain>_targets.fasta
      |-- minorg_gRNA_all.fasta
      |-- minorg_gRNA_all.map
      |-- minorg_gRNA_final.fasta
      |-- minorg_gRNA_final.map
      |-- minorg_gRNA_pass.fasta
      |-- minorg_gRNA_pass.map
      |-- minorg_gRNA_masked.txt
      +-- ref/
           |-- minorg_ref_<domain>_CDS.fasta
           |-- minorg_ref_<domain>_gene.fasta
           +-- minorg_ref_<domain>_pep.fasta
The exact combination of files generated depends on whether the full programme or a subcommand (and which subcommand) has been executed, as well as the combination of parameters used.
If domain is not specified (using --domain (CLI) or pssm_ids and/or domain_name (Python)), it defaults to ‘gene’. If the user has specified a custom prefix (using --prefix (CLI) or prefix (Python)), the prefix replaces ‘minorg’ in all file names.
When --cluster (CLI exclusive) is used, the general structure is maintained, except that ‘minorg_<cluster alias>’ replaces ‘minorg’ in all file names except ‘minorg.log’, and each cluster gets its own directory:
<output_directory>/
 |-- minorg.log
 +-- minorg_clusterA/
 |    |-- minorg_clusterA_<domain>_mafft.fasta
 |    |-- minorg_clusterA_<domain>_targets.fasta
 |    |-- minorg_clusterA_gRNA_all.fasta
 |    |-- minorg_clusterA_gRNA_all.map
 |    |-- minorg_clusterA_gRNA_final.fasta
 |    |-- minorg_clusterA_gRNA_final.map
 |    |-- minorg_clusterA_gRNA_pass.fasta
 |    |-- minorg_clusterA_gRNA_pass.map
 |    |-- minorg_clusterA_gRNA_masked.txt
 |    +-- ref/
 |         |-- minorg_clusterA_ref_<domain>_CDS.fasta
 |         |-- minorg_clusterA_ref_<domain>_gene.fasta
 |         +-- minorg_clusterA_ref_<domain>_pep.fasta
 +-- minorg_clusterB/
      |-- minorg_clusterB_<domain>_mafft.fasta
      |-- minorg_clusterB_<domain>_targets.fasta
      |-- minorg_clusterB_gRNA_all.fasta
      |-- minorg_clusterB_gRNA_all.map
      |-- minorg_clusterB_gRNA_final.fasta
      |-- minorg_clusterB_gRNA_final.map
      |-- minorg_clusterB_gRNA_pass.fasta
      |-- minorg_clusterB_gRNA_pass.map
      |-- minorg_clusterB_gRNA_masked.txt
      +-- ref/
           |-- minorg_clusterB_ref_<domain>_CDS.fasta
           |-- minorg_clusterB_ref_<domain>_gene.fasta
           +-- minorg_clusterB_ref_<domain>_pep.fasta
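The naming rules above can be sketched as a small helper that lists the paths one might expect for a given prefix, domain, and cluster alias. This is an illustration only: expected_outputs is a hypothetical function, not part of MINORg, and the files actually written depend on the subcommand and parameters used.

```python
from pathlib import Path

def expected_outputs(out_dir, prefix="minorg", domain="gene", cluster=None):
    """Hypothetical helper: list output paths implied by the naming rules."""
    # With --cluster, '<prefix>_<cluster alias>' replaces '<prefix>'
    # everywhere except in the log file name.
    stem = f"{prefix}_{cluster}" if cluster else prefix
    base = Path(out_dir) / stem

    names = [f"{stem}_{domain}_mafft.fasta", f"{stem}_{domain}_targets.fasta"]
    for category in ("all", "final", "pass"):
        names.append(f"{stem}_gRNA_{category}.fasta")
        names.append(f"{stem}_gRNA_{category}.map")
    names.append(f"{stem}_gRNA_masked.txt")

    paths = [Path(out_dir) / f"{prefix}.log"]
    paths += [base / name for name in names]
    paths += [
        base / "ref" / f"{stem}_ref_{domain}_{kind}.fasta"
        for kind in ("CDS", "gene", "pep")
    ]
    return paths
```

Checking which of these paths exist after a run (for example with Path.exists()) is a quick way to see which outputs a particular subcommand produced.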
8.2. Output files
8.2.1. minorg.log
This file is currently only generated when the CLI version of MINORg is used.
The log file has the following format:
((raw args))
minorg --directory ./example_07_domain --indv ref --gene AT5G45050 --assembly ./subset_ref_TAIR10.fasta --annotation ./subset_ref_TAIR10.gff --domain 214815
((expanded args))
reference_set: /path/to/arabidopsis_genomes.txt
genome_set: /path/to/subset_genome_mapping.txt
cluster_set: /path/to/cluster_mapping.txt
output_ver: 4 (default)
prioritise_nr: False (default)
auto: True (default)
sets: 1 (default)
accept_invalid: False (default)
exclude: None (default)
...
<more arguments>
where ‘raw args’ records the exact command and parameters used, and ‘expanded args’ lists all arguments, including those that the user did not explicitly specify in their command.
Warning and/or error messages may be logged after the arguments.
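For scripted inspection of a run, the log layout described above can be read with a few lines of Python. The sketch below assumes the section markers appear exactly as ‘((raw args))’ and ‘((expanded args))’ and that each expanded argument sits on its own ‘name: value’ line. parse_minorg_log is a hypothetical helper, not part of MINORg. Any warning or error line that happens to contain ‘: ’ would also be captured, so treat the result with care.

```python
def parse_minorg_log(text):
    """Hypothetical helper: split a MINORg log into command and arguments."""
    raw_command = None
    expanded = {}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if line in ("((raw args))", "((expanded args))"):
            section = line
        elif section == "((raw args))" and line and raw_command is None:
            # First non-empty line after the marker is the command as typed.
            raw_command = line
        elif section == "((expanded args))" and ": " in line:
            key, value = line.split(": ", 1)
            # Values not set by the user carry a ' (default)' suffix.
            is_default = value.endswith(" (default)")
            if is_default:
                value = value[: -len(" (default)")]
            expanded[key] = (value, is_default)
    return raw_command, expanded
```

Values are kept as strings; converting them to numbers or booleans is left to the caller because the log does not state argument types.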
8.2.2. XXX_mafft.fasta
The file ending in ‘_mafft.fasta’ contains a sequence alignment of targets to reference genes generated using MAFFT.
8.2.3. XXX_targets.fasta
The file ending in ‘_targets.fasta’ contains target sequence(s).
Names of target sequences derived from reference genes have the following format:
Reference|<reference alias>|<domain>|<n>|<feature type>|<stitched/complete>|<gene ID>|<range(s)>
Reference alias: Unique alias given to each reference genome
Domain: PSSM ID or domain name (‘gene’ if not specified)
n: If multiple domains are present, they are numbered according to proximity to the 5’ end of the sense strand
Feature type: GFF3 feature type
Stitched/complete: Whether sequences were concatenated
Stitched: Concatenated sequence, generated by stitching together regions of the requested GFF3 feature type. For example, individual CDS regions concatenated into a single translatable sequence are considered ‘stitched’.
Complete: Sequence that includes intervening regions that may or may not also be of the requested GFF3 feature type. For example, a sequence that spans the first base of the first CDS block to the last base of the last CDS block would be a ‘complete’ CDS sequence.
Gene ID: Gene ID
Range(s): Ranges of the gene that this sequence spans
For stitched sequences, there may be multiple feature ranges in the format ‘0-10,20-30’
For complete sequences, there will only be a single range spanning the first base of the first feature to the last base of the last feature of that feature type in the gene.
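The range field can be split into numeric pairs as in the sketch below. parse_ranges is a hypothetical helper, and the sketch makes no assumption about whether coordinates are 0- or 1-based or whether the end is inclusive, since that is not stated here.

```python
def parse_ranges(field):
    """Hypothetical helper: turn '0-10,20-30' into [(0, 10), (20, 30)]."""
    ranges = []
    for part in field.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges
```

A stitched sequence yields one pair per feature block, whereas a complete sequence yields a single pair.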
Names of target sequences discovered by homology inference have the following format:
<query alias>|<molecule>|<i>|<range(s)>
Query alias: Unique alias given to each query file
Molecule: Sequence ID of sequence the target is from
For example, if the query is a FASTA file with sequences named ‘ChrA’, ‘ChrB’, and ‘scaffold_001’, and the target is found on scaffold_001, the value of this field will be ‘scaffold_001’
i: Unique number given to each target sequence from the same query file
Range(s): Range of the target in the molecule
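A name in this four-field format can be unpacked as sketched below. HomologyTarget and parse_homology_name are hypothetical names, and the sketch assumes the query alias itself contains no ‘|’ character. The last two fields are taken from the right so that a molecule name containing ‘|’ would still be kept whole.

```python
from typing import NamedTuple

class HomologyTarget(NamedTuple):
    query_alias: str
    molecule: str
    index: int
    ranges: str

def parse_homology_name(name):
    """Hypothetical helper: split '<query alias>|<molecule>|<i>|<range(s)>'."""
    # Peel off <i> and <range(s)> from the right-hand side first.
    head, index, ranges = name.rsplit("|", 2)
    # Assumption: the alias has no '|', so the first '|' ends it.
    query_alias, molecule = head.split("|", 1)
    return HomologyTarget(query_alias, molecule, int(index), ranges)
```

For a made-up name such as ‘myquery|scaffold_001|3|100-250’, the molecule field would be ‘scaffold_001’.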
8.2.4. gRNA FASTA files
Names of files containing gRNA sequences have the following format: <prefix>_gRNA_<category>.fasta
The categories are:
all: all candidate gRNA, regardless of pass/fail status
pass: candidate gRNA that have passed all valid checks
final: final gRNA selected in minimum sets
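Because ‘pass’ is a subset of ‘all’, comparing the two files shows which candidates were rejected. The sketch below uses a minimal pure-Python FASTA reader and assumes that each FASTA header carries the gRNA ID as its first whitespace-delimited token. read_fasta and failed_grna are hypothetical helpers, not part of MINORg.

```python
def read_fasta(lines):
    """Minimal FASTA reader: map sequence ID to sequence."""
    records = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line.
            seq_id = line[1:].split()[0]
            records[seq_id] = ""
        elif seq_id is not None:
            records[seq_id] += line
    return records

def failed_grna(all_lines, pass_lines):
    """Return gRNA present in the 'all' file but absent from 'pass'."""
    all_records = read_fasta(all_lines)
    passed = read_fasta(pass_lines)
    return {k: v for k, v in all_records.items() if k not in passed}
```

Both functions accept any iterable of lines, so open file handles for minorg_gRNA_all.fasta and minorg_gRNA_pass.fasta can be passed in directly.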
8.2.5. gRNA .map files
Names of files containing information for mapping gRNA to targets have the following format: <prefix>_gRNA_<category>.map
As with gRNA FASTA files, the categories are:
all: all candidate gRNA, regardless of pass/fail status
pass: candidate gRNA that have passed all valid checks
final: final gRNA selected in minimum sets
These files are tab-separated and look like this:
gRNA id	gRNA sequence	target id	target sense	gRNA strand	start	end	set	background	GC	feature	my_custom_check
gRNA_001	CTATGGGTTTGGCGAAAGTA	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	4	23	1	pass	pass	fail	pass
gRNA_002	TCAAAAGTTCTCCTTATCCA	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	38	57	1	pass	pass	fail	pass
gRNA_003	AAATCTTTGATGTTTACTTA	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	79	98	1	pass	fail	fail	fail
gRNA_004	GTCTTTGCTTTTTACTTCTC	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	111	130	1	pass	pass	fail	pass
gRNA_005	TATAGATGTGCCAGCTCGAA	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	140	159	1	pass	pass	pass	fail
gRNA_006	GCTCGAAAGGTTGTTTTGCT	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	+	153	172	1	pass	pass	pass	fail
gRNA_007	TAAGTAATTACTGAAACATT	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	-	206	225	1	pass	fail	pass	pass
gRNA_008	CTGAAACATTTGGATCAGTG	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	-	196	215	1	pass	pass	pass	pass
gRNA_009	AGCAAAACAACCTTTCGAGC	Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382	sense	-	153	172	1	pass	pass	pass	pass
Column description:
gRNA id: Unique ID for each gRNA sequence, consistent with gRNA FASTA files
gRNA sequence: gRNA sequence (upper case)
target id: Sequence ID of target sequence, consistent with XXX_targets.fasta
target sense: Whether target sequence is sense or antisense
This is detected by alignment with reference genes.
If the user provided target sequences to the full programme or to the seq subcommand using --target, all entries in this field will be ‘NA’.
gRNA strand: Strand of gRNA relative to target sequence
start: Start position of gRNA in target sequence
end: End position of gRNA in target sequence
set: gRNA set number
Unless the file ends with ‘_final.map’, all entries in this field will be set to 1.
If the file ends with ‘_final.map’, this value corresponds to the set a gRNA is assigned to.
background: Status of background check (only in file ending with ‘_all.map’)
GC: Status of GC content check (only in file ending with ‘_all.map’)
feature: Status of within feature check (only in file ending with ‘_all.map’)
Users may provide custom checks in additional columns
In this example, I’ve named my custom check ‘my_custom_check’
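Since .map files are tab-separated with a header row, they can be loaded with Python's csv module. The sketch below assumes the header names match those shown in the example above (‘gRNA id’, ‘set’, and so on) and treats every column after ‘set’ as a check column, counting any value other than ‘pass’ as not passed. read_map, passes_all_checks, and ids_passing are hypothetical helpers, not part of MINORg.

```python
import csv
import io

# Assumption: these eight columns are always present with these names;
# every other column is treated as a check (built-in or custom).
STANDARD_COLUMNS = (
    "gRNA id", "gRNA sequence", "target id", "target sense",
    "gRNA strand", "start", "end", "set",
)

def read_map(handle):
    """Read a tab-separated .map file into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

def passes_all_checks(row):
    """True if every check column in this row has the value 'pass'."""
    return all(
        value == "pass"
        for column, value in row.items()
        if column not in STANDARD_COLUMNS
    )

def ids_passing(rows):
    """Sorted IDs of gRNA with at least one row passing every check."""
    return sorted({row["gRNA id"] for row in rows if passes_all_checks(row)})
```

A file on disk can be read with open(path, newline="") as the handle. With the example rows above, only gRNA_008 and gRNA_009 have ‘pass’ in all four check columns.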