3. Configuration

Note that all command line code snippets in the following tutorial are for bash terminal. You may have to adapt them according to your operating system.

3.1. config file

The config.ini file is used for two main purposes: setting default values and assigning short aliases to long values. All entries in the config.ini file that end with ‘alias’ or ‘sets’ are used to assign aliases, while the rest are for setting default values. Use the following to export the config file location as an environment variable:

$ export MINORG_CONFIG=/path/to/config.ini

If you’d like the config file to apply to all users, consider setting it globally.

3.2. Alias lookups

There are two types of alias lookups in MINORg: 1-level lookups and 2-level lookups.

3.2.1. 1-level lookup

1-level lookups are defined directly in the config file. Parameters that use 1-level lookups are:

  • --assembly

  • --annotation

  • --db

  • --attr-mod

  • --domain

To understand how aliases are assigned using the config file, let us take a look at assembly alias under section [lookup]. Here, we assign aliases for FASTA files of reference assemblies in the format <semicolon-separated alias>:<value>.

assembly alias = tair10;TAIR10:/path/to/subset_ref_TAIR10.fasta
                 araport11:/path/to/subset_ref_Araport11.fasta
                 araly2:/path/to/subset_ref_Araly2.fasta
                 araha1:/path/to/subset_ref_Araha1.fasta

With this setup, we can use --assembly TAIR10 instead of --assembly /path/to/subset_ref_TAIR10.fasta when building our MINORg command (although either way is acceptable). This applies to annotation alias (--annotation), rps database alias (--db), and gff attribute modification presets (--attr-mod) as well.

assembly

annotation

rps database

attr mod

description

assembly alias

annotation alias

rps database alias

gff attribute modification presets

[config] assign alias to values

--assembly

--annotation

--db

--attr-mod

[CLI] specify value (alias or raw)

Unlike the above parameters, domain alias is specified in the reverse format using commas instead of semi-colons (<value>:<comma-separated aliases) for readability reasons, as most PSSM-Ids are comprised of the same number of digits. Therefore, the domain alias in section [lookup] looks like this instead:

domain alias = 366714:TIR
               366375:NB-ARC,NBS
               375519:Rx_N

As with the above parameters, you can use --domain TIR instead of --domain 366714 (although either way is acceptable). Do note, however, that if you use --domain TIR, newly generate file and sequence names will use TIR instead of 366714 if the domain is to be included in the name.

3.2.2. 2-level lookup

2-level lookups are defined using a file that maps aliases to values. This means that there is a first layer of lookup that maps aliases to those files, and then a second layer of lookup in those files that maps aliases to values. Parameters that use 2-level lookups are:

  • --reference

  • --cluster

  • --indv

To understand how aliases are assigned, we shall use the sample_config.ini file (provided at https://github.com/rlrq/MINORg/blob/master/examples/sample_config.ini) and zoom in on reference genomes as an example. First, let use look at reference sets under the section [lookup]. Here, we assign aliases for lookup files (which at the command line we can specify using --reference-set).

reference sets = athaliana:/path/to/athaliana_genomes.txt
                 arabidopsis:/path/to/arabidopsis_genomes.txt

Each lookup file contains mapping of aliases to reference genome files as well as meta data. Let us look at the file athaliana_genomes.txt (provided at https://github.com/rlrq/MINORg/blob/master/examples/athaliana_genomes.txt).

tair10;TAIR10 /path/to/subset_ref_TAIR10.fasta        /path/to/subset_ref_TAIR10.gff  1
araport11     /path/to/subset_ref_Araport11.fasta     /path/to/subset_ref_Araport11.gff       1

This is a tab-separated file where:

  • column 1: semicolon-separated alias(es)

  • column 2: path to genome FASTA file

  • column 3: path to genome GFF3 file

  • column 4: NCBI genetic code number or name (optional if standard genetic code)

  • column 5: mapping of nonstandard GFF3 attribute field names to standard field names (optional if standard)

By using --reference-set athaliana in the command line execution, we can use --reference TAIR10 (or --reference tair10) to tell MINORg to use ‘/path/to/subset_ref_TAIR10.fasta’ and ‘/path/to/subset_ref_TAIR10.gff’ as reference assembly and annotation files respectively without having to type their paths out. In fact, we can even specify multiple reference genomes using --reference TAIR10,araport11 and making sure to use a comma to separate reference genome aliases.

But there’s more! We can set ‘athaliana’ as the default reference set AND ‘TAIR10’ as the default reference in the [data] section of the config file:

reference = TAIR10
reference set = athaliana

This way, unless you wish to use a different reference genome from the default, you won’t have to type --reference-set athaliana --reference TAIR10 either! This is particularly useful if you primarily design gRNA for only a single species. Of course, if you wish, you can combine all reference set files into a single massive file containing the mapping information for all possible reference genomes instead of having multiple files. However, I personally find it easier to maintain smaller files with descriptive names.

You can view the reference genomes in the default reference set using:

$ minorg --references ## prints contents of default reference set file

Valid genome aliases (defined in /path/to/athaliana_genomes.txt):

<semicolon-separated genome alias(es)>       <FASTA file>    <GFF3 file>     <NCBI genetic code>     <attribute name mapping>
tair10;TAIR10        /path/to/subset_ref_Araport11.fasta     /path/to/subset_ref_Araport11.gff
araport11    /path/to/subset_ref_TAIR10.fasta        /path/to/subset_ref_TAIR10.gff

To view the reference genomes in a non-default reference set, use:

$ minorg --reference-set arabidopsis --references ## prints contents of arabidopsis_genomes.txt instead

Valid genome aliases (defined in /path/to/arabidopsis_genomes.txt):

<semicolon-separated genome alias(es)>       <FASTA file>    <GFF3 file>     <NCBI genetic code>     <attribute name mapping>
tair10;TAIR10        /path/to/subset_ref_Araport11.fasta     /path/to/subset_ref_Araport11.gff
araport11    /path/to/subset_ref_TAIR10.fasta        /path/to/subset_ref_TAIR10.gff
araly2;alyrata2      /path/to/subset_ref_Araly2.fasta        /path/to/subset_ref_Araly2.gff
araha1;ahalleri1     /path/to/subset_ref_Araha1.fasta        /path/to/subset_ref_Araha1.gff

Note that you can also provide an alias mapping file that is not in the config file to MINORg by specifying the path to the file instead of using a non-existent alias (e.g. --reference-set /path/to/arabidopsis_genomes.txt).

3.2.2.1. Parameters

The same logic applies as well to cluster sets-cluster set (--cluster-set---cluster---clusters) and genome sets-genome set (--genome-set---indv---genomes), with the caveat that there is no option to set default clusters or query genomes.

reference

cluster

genome

description

reference sets

cluster sets

genome sets

[config] assign alias to lookup files

reference set

cluster set

genome set

[config] set default lookup file

reference

[config] set default value

--reference-set

--cluster-set

--genome-set

[CLI] specify lookup file (alias or path)

--reference

--cluster

--indv

[CLI] specify reference/cluster/indv (comma-separated alias(es))

--references

--clusters

--genomes

[CLI] print contents of lookup file to screen

3.2.2.2. Alternative parameters

Do note that, because of the nature of these lookups, you cannot simply provide the value(s) mapped to an alias to --reference, --cluster, or --indv. If the desired files/genes are not specified in any mapping file, you will have to use the following alternatives:

using aliases

not using aliases

reference

--reference <alias(es)>

--assembly <path to FASTA> --annotation <path to GFF3> --genetic-code <number or name> --attr-mod <attribute modification>*

cluster

--cluster <alias(es)>

--gene <comma-separated gene IDs>**

indv

--indv <alias(es)>

--query <path to FASTA>***

* --genetic-code and --attr-mod are optional if the reference genome uses the standard genetic code and standard GFF attribute field names respectively. Do note that you CANNOT SPECIFY MULTIPLE reference genomes if not using aliases.

** If not using aliases, each cluster must be processed separately (i.e. a different MINORg execution for each cluster), as MINORg has no way of knowing which gene belongs to which cluster if you use --gene.

*** You may specify multiple query FASTA files by using --query <FASTA> as many times as needed (e.g. --query <FASTA 1> --query <FASTA 2> --query <FASTA 3>).

3.2.3. 2-level lookup file formats

3.2.3.1. cluster

The lookup files specified in cluster sets should look like this:

RPS4;TTR1;RPS4_TTR1   AT5G45050,AT5G45060,AT5G45200,AT5G45210,AT5G45220,AT5G45230,AT5G45240,AT5G45250
RPS6  AT5G46260,AT5G46270,AT5G46450,AT5G46470,AT5G46490,AT5G46510,AT5G46520

This is a tab-separated file where:

  • column 1: semicolon-separated alias(es)

  • column 2: comma-separated gene IDs of genes in the cluster

3.2.3.2. genome

The lookup files specified in genome sets should look like this:

9654  /path/to/subset_9654.fasta
9655  /path/to/subset_9655.fasta
9944  /path/to/subset_9944.fasta
9947  /path/to/subset_9947.fasta

This is a tab-separated file where:

  • column 1: semicolon-separated alias(es)

  • column 2: path to FASTA file

3.2.3.3. reference

For ease of reference, the format for reference lookup files specified in reference sets is reproduced here:

tair10;TAIR10 /path/to/subset_ref_TAIR10.fasta        /path/to/subset_ref_TAIR10.gff  1
araport11     /path/to/subset_ref_Araport11.fasta     /path/to/subset_ref_Araport11.gff       1

This is a tab-separated file where:

  • column 1: semicolon-separated alias(es)

  • column 2: path to genome FASTA file

  • column 3: path to genome GFF3 file

  • column 4: NCBI genetic code number or name (optional if standard genetic code)

  • column 5: mapping of nonstandard GFF3 attribute field names to standard field names (optional if standard) (see Attribute modification format (reference lookup file) for format)