8. Output

8.1. Directory structure

With the default prefix ‘minorg’, MINORg generally produces the following files in the following directory structure:

<output_directory>/
|-- minorg.log
+-- minorg/
    |-- minorg_<domain>_mafft.fasta
    |-- minorg_<domain>_targets.fasta
    |-- minorg_gRNA_all.fasta
    |-- minorg_gRNA_all.map
    |-- minorg_gRNA_final.fasta
    |-- minorg_gRNA_final.map
    |-- minorg_gRNA_pass.fasta
    |-- minorg_gRNA_pass.map
    |-- minorg_gRNA_masked.txt
    +-- ref/
        |-- minorg_ref_<domain>_CDS.fasta
        |-- minorg_ref_<domain>_gene.fasta
        +-- minorg_ref_<domain>_pep.fasta

The exact combination of files generated depends on whether the full programme or a subcommand (and which subcommand) has been executed, as well as the combination of parameters used.

If no domain is specified (via --domain in the CLI, or pssm_ids and/or domain_name in Python), <domain> defaults to ‘gene’. If the user specifies a custom prefix (via --prefix in the CLI, or prefix in Python), that prefix replaces ‘minorg’ in all file names.

When --cluster (CLI exclusive) is used, the general structure is maintained, except ‘minorg_<cluster alias>’ replaces ‘minorg’ for all files except ‘minorg.log’, and each cluster gets its own directory:

<output_directory>/
|-- minorg.log
+-- minorg_clusterA/
|   |-- minorg_clusterA_<domain>_mafft.fasta
|   |-- minorg_clusterA_<domain>_targets.fasta
|   |-- minorg_clusterA_gRNA_all.fasta
|   |-- minorg_clusterA_gRNA_all.map
|   |-- minorg_clusterA_gRNA_final.fasta
|   |-- minorg_clusterA_gRNA_final.map
|   |-- minorg_clusterA_gRNA_pass.fasta
|   |-- minorg_clusterA_gRNA_pass.map
|   |-- minorg_clusterA_gRNA_masked.txt
|   +-- ref/
|       |-- minorg_clusterA_ref_<domain>_CDS.fasta
|       |-- minorg_clusterA_ref_<domain>_gene.fasta
|       +-- minorg_clusterA_ref_<domain>_pep.fasta
+-- minorg_clusterB/
    |-- minorg_clusterB_<domain>_mafft.fasta
    |-- minorg_clusterB_<domain>_targets.fasta
    |-- minorg_clusterB_gRNA_all.fasta
    |-- minorg_clusterB_gRNA_all.map
    |-- minorg_clusterB_gRNA_final.fasta
    |-- minorg_clusterB_gRNA_final.map
    |-- minorg_clusterB_gRNA_pass.fasta
    |-- minorg_clusterB_gRNA_pass.map
    |-- minorg_clusterB_gRNA_masked.txt
    +-- ref/
        |-- minorg_clusterB_ref_<domain>_CDS.fasta
        |-- minorg_clusterB_ref_<domain>_gene.fasta
        +-- minorg_clusterB_ref_<domain>_pep.fasta
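
The naming rules above can be sketched as a small Python helper that lists the paths a run might produce. This is an illustration only: expected_outputs is a hypothetical function, not part of MINORg, and it assumes that a custom prefix also renames the log file and the output subdirectory. As noted above, a real run may generate only a subset of these files.

```python
from pathlib import PurePosixPath


def expected_outputs(output_directory, prefix="minorg", domain="gene", cluster=None):
    """List output paths following the layout described above.

    Hypothetical helper for illustration; not part of MINORg.
    """
    # With --cluster, '<prefix>_<cluster alias>' replaces '<prefix>'
    # everywhere except in the name of the log file.
    stem = f"{prefix}_{cluster}" if cluster else prefix
    root = PurePosixPath(output_directory)
    out = root / stem  # assumption: the subdirectory is named after the stem

    paths = [root / f"{prefix}.log"]
    paths += [out / f"{stem}_{domain}_{kind}.fasta" for kind in ("mafft", "targets")]
    for category in ("all", "final", "pass"):
        paths.append(out / f"{stem}_gRNA_{category}.fasta")
        paths.append(out / f"{stem}_gRNA_{category}.map")
    paths.append(out / f"{stem}_gRNA_masked.txt")
    paths += [out / "ref" / f"{stem}_ref_{domain}_{kind}.fasta"
              for kind in ("CDS", "gene", "pep")]
    return [str(p) for p in paths]


if __name__ == "__main__":
    for path in expected_outputs("output_directory", cluster="clusterA"):
        print(path)
```

Checking which of these paths exist after a run is one way to see what a given subcommand and parameter combination actually produced.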

8.2. Output files

8.2.1. minorg.log

This file is currently only generated when the CLI version of MINORg is used.

The log file has the following format:

((raw args))
minorg --directory ./example_07_domain --indv ref --gene AT5G45050 --assembly ./subset_ref_TAIR10.fasta --annotation ./subset_ref_TAIR10.gff --domain 214815

((expanded args))
reference_set:        /path/to/arabidopsis_genomes.txt
genome_set:   /path/to/subset_genome_mapping.txt
cluster_set:  /path/to/cluster_mapping.txt
output_ver:   4 (default)
prioritise_nr:        False (default)
auto: True (default)
sets: 1 (default)
accept_invalid:       False (default)
exclude:      None (default)
...
<more arguments>

where ‘raw args’ records the exact command and parameters used, and ‘expanded args’ lists all arguments, including those that the user did not explicitly specify in their command.

Warning and/or error messages may be logged after the arguments.
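
For record keeping, the log can be read back with a few lines of Python. The sketch below is not part of MINORg; parse_minorg_log is a hypothetical helper that assumes the ‘((raw args))’ and ‘((expanded args))’ section headers and the ‘name: value’ layout shown above. It does not attempt to tell trailing warning or error messages apart from arguments, so any later line containing a colon will also be captured.

```python
def parse_minorg_log(lines):
    """Split a MINORg log into the raw command and the expanded arguments.

    Hypothetical helper for illustration; not part of MINORg.
    """
    raw_command = None
    expanded = {}
    section = None
    for line in lines:
        line = line.strip()
        if line in ("((raw args))", "((expanded args))"):
            section = line  # remember which section we are in
        elif not line:
            continue  # skip blank lines
        elif section == "((raw args))" and raw_command is None:
            raw_command = line  # first non-blank line of the section
        elif section == "((expanded args))" and ":" in line:
            # Split on the first colon only, so values such as paths
            # that contain colons are kept whole.
            key, _, value = line.partition(":")
            expanded[key.strip()] = value.strip()
    return raw_command, expanded


if __name__ == "__main__":
    with open("minorg.log") as handle:  # path is an assumption
        command, arguments = parse_minorg_log(handle)
    print(command)
    print(arguments.get("sets"))
```

Values are kept as strings, including any ‘(default)’ suffix, so that nothing from the log is lost.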

8.2.2. XXX_mafft.fasta

The file ending in ‘_mafft.fasta’ contains a sequence alignment of targets to reference genes generated using MAFFT.

8.2.3. XXX_targets.fasta

The file ending in ‘_targets.fasta’ contains target sequence(s).

For reference genes, the names of target sequences have the following format:

Reference|<reference alias>|<domain>|<n>|<feature type>|<stitched/complete>|<gene ID>|<range(s)>
  • Reference alias: Unique alias given to each reference genome

  • Domain: PSSM ID or domain name (‘gene’ if not specified)

  • n: If multiple domains are present, they will be numbered according to proximity to 5’ of sense strand

  • Feature type: GFF3 feature type

  • Stitched/complete: Whether sequences were concatenated

    • Stitched: Concatenated sequence, generated by stitching together regions of the requested GFF3 feature type. For example, individual CDS regions concatenated into a single translatable sequence are considered ‘stitched’.

    • Complete: Sequence that includes intervening regions that may or may not also be of the requested GFF3 feature type. For example, a sequence that spans the first base of the first CDS block to the last base of the last CDS block would be a ‘complete’ CDS sequence.

  • Gene ID: Gene ID

  • Range(s): Ranges of the gene that this sequence spans

    • For stitched sequences, there may be multiple feature ranges in the format ‘0-10,20-30’

    • For complete sequences, there will only be a single range spanning the first base of the first feature to the last base of the last feature of that feature type in the gene.
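
The field layout above lends itself to simple splitting on ‘|’. The following Python sketch is illustrative only: ReferenceTarget, parse_ranges, and parse_reference_target are hypothetical names, not MINORg functions, and the code assumes that none of the fields themselves contain ‘|’. Names that do not have exactly the eight documented fields are rejected rather than guessed at.

```python
from typing import NamedTuple


class ReferenceTarget(NamedTuple):
    reference_alias: str
    domain: str
    n: str
    feature_type: str
    kind: str  # 'stitched' or 'complete'
    gene_id: str
    ranges: list


def parse_ranges(field):
    """Turn a range field such as '0-10,20-30' into [(0, 10), (20, 30)]."""
    return [tuple(int(x) for x in part.split("-")) for part in field.split(",")]


def parse_reference_target(name):
    """Parse a reference target name in the eight-field layout above.

    Hypothetical helper for illustration; not part of MINORg.
    """
    fields = name.split("|")
    if len(fields) != 8 or fields[0] != "Reference":
        raise ValueError(f"not an 8-field reference target name: {name!r}")
    _, alias, domain, n, feature, kind, gene_id, ranges = fields
    return ReferenceTarget(alias, domain, n, feature, kind, gene_id,
                           parse_ranges(ranges))


if __name__ == "__main__":
    # Made-up name that follows the documented layout.
    print(parse_reference_target(
        "Reference|TAIR10|gene|1|CDS|stitched|AT5G45050|0-10,20-30"))
```

A stitched sequence yields one tuple per feature range, while a complete sequence yields a single tuple.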

Target sequences discovered by homology inference have names in the following format:

<query alias>|<molecule>|<i>|<range(s)>
  • Query alias: Unique alias given to each query file

  • Molecule: Sequence ID of sequence the target is from

    • For example, if the query is a FASTA file with sequences named ‘ChrA’, ‘ChrB’, and ‘scaffold_001’, and the target is found on scaffold_001, the value of this field will be ‘scaffold_001’

  • i: Unique number given to each target sequence from the same query file

  • Range(s): Range of the target in the molecule
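
Names in this four-field format can be unpacked in a similar way. In the Python sketch below, parse_homology_target is a hypothetical helper, not a MINORg function. It assumes that the query alias contains no ‘|’ and splits the last two fields off from the right, so that a molecule ID that itself contains ‘|’ is still recovered intact. The range field is returned as-is because its exact syntax is not spelled out here.

```python
def parse_homology_target(name):
    """Parse '<query alias>|<molecule>|<i>|<range(s)>' into a dict.

    Hypothetical helper for illustration; not part of MINORg.
    """
    # The alias is everything before the first '|'.
    alias, rest = name.split("|", 1)
    # The index and range(s) are the last two fields; whatever remains
    # in between is the molecule (sequence ID), which may contain '|'.
    molecule, index, ranges = rest.rsplit("|", 2)
    return {
        "query_alias": alias,
        "molecule": molecule,
        "i": int(index),
        "ranges": ranges,  # kept as the raw string
    }


if __name__ == "__main__":
    # Made-up name that follows the documented layout.
    print(parse_homology_target("queryA|scaffold_001|3|100-250"))
```

A name with fewer than four fields raises a ValueError instead of returning partial data.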

8.2.4. gRNA FASTA files

Names of files containing gRNA sequences have the following format: <prefix>_gRNA_<category>.fasta

The categories are:

  • all: all candidate gRNA, regardless of pass/fail status

  • pass: candidate gRNA that have passed all valid checks

  • final: final gRNA selected in minimum sets
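
Because ‘pass’ is a subset of ‘all’, comparing the two files shows which candidates were filtered out. The Python sketch below uses a minimal FASTA reader so that it needs no third-party packages. read_fasta and failed_ids are hypothetical helpers, not part of MINORg, and the comparison assumes that a gRNA keeps the same ID in every category.

```python
def read_fasta(lines):
    """Read FASTA-formatted lines into a dict of {sequence ID: sequence}.

    Minimal reader for illustration; the ID is the first word of the header.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequences may span several lines
    return records


def failed_ids(all_records, pass_records):
    """Return IDs present in the 'all' file but absent from the 'pass' file."""
    return sorted(set(all_records) - set(pass_records))


if __name__ == "__main__":
    # File locations are assumptions based on the default prefix.
    with open("minorg/minorg_gRNA_all.fasta") as handle:
        all_grna = read_fasta(handle)
    with open("minorg/minorg_gRNA_pass.fasta") as handle:
        pass_grna = read_fasta(handle)
    print(failed_ids(all_grna, pass_grna))
```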

8.2.5. gRNA .map files

Names of files containing information for mapping gRNA to targets have the following format: <prefix>_gRNA_<category>.map

As with gRNA FASTA files, the categories are:

  • all: all candidate gRNA, regardless of pass/fail status

  • pass: candidate gRNA that have passed all valid checks

  • final: final gRNA selected in minimum sets

These files are tab-separated and look like this:

gRNA id       gRNA sequence   target id       target sense    gRNA strand     start   end     set     background      GC      feature my_custom_check
gRNA_001      CTATGGGTTTGGCGAAAGTA    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       4       23      1       pass    pass    fail    pass
gRNA_002      TCAAAAGTTCTCCTTATCCA    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       38      57      1       pass    pass    fail    pass
gRNA_003      AAATCTTTGATGTTTACTTA    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       79      98      1       pass    fail    fail    fail
gRNA_004      GTCTTTGCTTTTTACTTCTC    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       111     130     1       pass    pass    fail    pass
gRNA_005      TATAGATGTGCCAGCTCGAA    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       140     159     1       pass    pass    pass    fail
gRNA_006      GCTCGAAAGGTTGTTTTGCT    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   +       153     172     1       pass    pass    pass    fail
gRNA_007      TAAGTAATTACTGAAACATT    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   -       206     225     1       pass    fail    pass    pass
gRNA_008      CTGAAACATTTGGATCAGTG    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   -       196     215     1       pass    pass    pass    pass
gRNA_009      AGCAAAACAACCTTTCGAGC    Reference|Reference|214815|gene|stitched|AT5G45050|4139-4382    sense   -       153     172     1       pass    pass    pass    pass

Column description:

  1. gRNA id: Unique ID for each gRNA sequence, consistent with gRNA FASTA files

  2. gRNA sequence: gRNA sequence (upper case)

  3. target id: Sequence ID of target sequence, consistent with XXX_targets.fasta

  4. target sense: Whether target sequence is sense or antisense

    • This is detected by alignment with reference genes.

    • If the user provided target sequences to the full programme or to the seq subcommand using --target, all entries in this field will be ‘NA’.

  5. gRNA strand: Strand of gRNA relative to target sequence

  6. start: Start position of gRNA in target sequence

  7. end: End position of gRNA in target sequence

  8. set: gRNA set number

    • Unless the file ends with ‘_final.map’, all entries in this field will be set to 1.

    • If the file ends with ‘_final.map’, this value corresponds to the set a gRNA is assigned to.

  9. background: Status of background check (only in files ending with ‘_all.map’)

  10. GC: Status of GC content check (only in files ending with ‘_all.map’)

  11. feature: Status of within-feature check (only in files ending with ‘_all.map’)

  12. Users may provide custom checks in additional columns

    • In this example, the custom check is named ‘my_custom_check’
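
Since .map files are tab-separated with a header row, they can be read with Python’s standard csv module. The sketch below is illustrative rather than part of MINORg. read_map and failed_checks are hypothetical helpers, and the code assumes the first eight column names shown above, treating every further column (standard or custom) as a check whose value is ‘fail’ when the check was not passed.

```python
import csv

# The first eight columns described above; anything after them is a check.
STANDARD_COLUMNS = (
    "gRNA id", "gRNA sequence", "target id", "target sense",
    "gRNA strand", "start", "end", "set",
)


def read_map(handle):
    """Read a tab-separated .map file into a list of dicts, one per row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def failed_checks(row):
    """Return the names of the check columns marked 'fail' for this row."""
    return [name for name, value in row.items()
            if name not in STANDARD_COLUMNS and value == "fail"]


if __name__ == "__main__":
    # File location is an assumption based on the default prefix.
    with open("minorg/minorg_gRNA_all.map", newline="") as handle:
        rows = read_map(handle)
    for row in rows:
        failed = failed_checks(row)
        print(row["gRNA id"], "fails: " + ", ".join(failed) if failed else "passes all")
```

Reporting the failed check names, rather than a single pass/fail flag, makes it easier to see which filter removed a candidate, including any custom checks.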