riboPicker @ SourceForge.net

rRNA identification and removal.

Manual

Web version
    Necessary resources
    Upload data
    From input to output

Removal of rRNAs

Standalone version
    Necessary resources
    Available options

Databases
    SILVA
    RDP-II
    Greengenes
    NCBI
    HMP
    Rfam
    Human
    rrnadb

Introduction

The majority of RNA recovered in metatranscriptomic studies is ribosomal RNA (rRNA), not mRNA, often exceeding 90% of the total reads (Stewart et al., 2010; doi:10.1038/ismej.2010.18). Even after various treatments prior to sequencing, the observed rRNA content decreases only slightly (Hei et al., 2010; doi:10.1038/nmeth.1507). It is estimated that misannotations of rRNA as proteins may cause up to 90% false positive matches of rRNA-like sequences in metatranscriptomic studies (Tripp et al., 2011; doi:10.1093/nar/gkr576). Those false positive matches are a serious concern for downstream analysis, possibly causing erroneous conclusions. Therefore, the removal of rRNA-like sequences presents a necessary step for all metatranscriptomic projects.

riboPicker is distributed under the GNU Public License (GPL). All its source codes are freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.

Web version

TOP OF PAGE

The interactive web interface of riboPicker can be used to automatically identify and efficiently remove rRNA-like sequences from genomic and metatranscriptomic datasets.

Necessary resources

Hardware
- Computer connected to the Internet

Software
- Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)

Files
- FASTA or FASTQ file with sequence data

Upload data to the riboPicker web version

To upload a new dataset in FASTA or FASTQ format to riboPicker, follow these steps:
  1. Go to http://ribopicker.sourceforge.net
  2. Click on "Use riboPicker" in the top menu on the right (the latest riboPicker web version should load)
  3. Select your FASTA or FASTQ file
  4. Select the database(s)
  5. Click "Continue"

From input to output

The graphic below shows the four basic steps of riboPicker's web interface: (i) Select a dataset and the databases; (ii) Automatic processing of the input data; (iii) Select thresholds and data for download; and (iv) download results.

riboPicker steps

riboPicker generates a Coverage vs. Identity (CI) plot and reference coverage plots to guide users in their selection of threshold values. The plots show the number of matching reads for different query coverage and alignment identity values. The number of matching reads with a specific coverage and identity value defines the size of each dot in the CI plot. The column and row sums at the top and right of the plot allow an easier identification of the number of sequences that match for a particular threshold value.

The coverage plots show were the metatranscriptomic sequences align on the rRNA sequences. The header of the plots shows the name of the database, sequence ID, domain, phyla, and length of the database sequence, as well as the used thresholds and the number of matching metatranscriptomic sequences. The bar at the x-axis is marked with a darker color in areas that have metatranscriptomic sequences aligned. This can be useful for areas with very low coverage in plots with high number of hits. The plots also provide an easy way to check for possible bias in the alignment or rRNA-removal prior to sequencing (wet lab).

Coverage plot example

There are currently three outputs that the users can choose from:
  (a) Sequence data separated by rRNA-like and non-rRNA (two files)
  (b) Coverage data for each database sequence with at least the number of specified hits
  (c) Summary report that contains information about the input, selected databases, thresholds selected, and rRNA-like sequences classified by database, domain and phyla

Removal of rRNA-like sequences

TOP OF PAGE

As described above, the rRNA-like sequences in metatranscriptomes can lead to false positive identifications when comparing the data to protein databases.

riboPicker uses the query sequence coverage and alignment identity to identify sequences that are similar to known rRNA sequences in the provided databases. The identity is a measure for how similar the query sequence to the reference sequence is and the coverage is a measure of how much of the query sequence is similar to the reference sequence.

Standalone version

TOP OF PAGE

The standalone version does not provide any graphical outputs and databases. The databases used for rRNA screening have to be generated as described in the readme file distributed with the source code. The readme file also contains information on the usage of the standalone version.

Necessary resources

Hardware
- Computer with a Linux/Unix or Mac OS X operating system

Software
- riboPicker standalone (available at http://ribopicker.sourceforge.net under "Downloads")
- Perl 5 (or higher)

Files
- FASTA or FASTQ file with sequence data

Available options

Option/flag	Description	Default	Range
-help or -h	Print the help message; ignore other arguments
-man	Print the full documentation; ignore other arguments
-version	Print program version; ignore other arguments
Input/Output options
-show_dbs	Prints a list of available databases
-f	Input file in FASTA or FASTQ format that contains the query sequences		FILE
-dbs	Name of remove database(s) to use. Names are according to their definition in the config file. Separate multiple database names by comma without spaces. Example: -dbs ssr,test,rrnadb	test	STRING
-out_dir	Directory where the results should be written. If the directory does not exist, it will be created.	Current directory	STRING
-no_seq_out	Prevents the generation of the FASTA/FASTQ output file for the given coverage and identity thresholds. This feature is e.g. useful for the web-version since -i and -c are set interactively and not yet defined at the data processing step.
-keep_tmp_files	Prevents from unlinking the generated tmp files. These usually include the id file and the .tsv file(s). This feature is e.g. useful for the web-version since .tsv files are used to dynamically generate the output files.
-id	Optional parameter. If not set, ID will be automatically generated to prevent from overwriting previous results. This option is useful if integrated into other tools and the output filenames need to be known. (Use this option to defined the output filename prefix. Output files will end in _rrna.fa and _nonrrna.fa, respectively.)		STRING
Alignment options
-i	Alignment identity threshold in percentage. The identity is calculated for the part of the query sequence that is aligned to a reference sequence. For example, a query sequence of 100 bp that aligns to a reference sequence over the first 50 bp with 40 matching positions has an identity value of 80%.		INT [1..100]
-c	Alignment coverage threshold in percent. The coverage is calculated for the part of the query sequence that is aligned to a reference sequence. For example, a query sequence of 100 bp that aligns to a reference sequence over the first 50 bp with 40 matching positions has an coverage value of 50%.		INT [1..100]
-l	Alignment length threshold used to define matching sequences as similar. For example, a query sequence of 100 bp that aligns to a reference sequence over the first 50 bp with 40 matching positions has an alignment length of 50.		INT
-S	Chunk size of reads in bp as used by BWA-SW	10000000	INT
-z	Z-best value as used by BWA-SW	1	INT
-T	Alignment score threshold as used by BWA-SW	30	INT

Databases

TOP OF PAGE

The web version already provides preprocessed databases for the identification of rRNA sequences. Here, you will find additional information for those databases and a detailed How To to generate the same databases for the standalone version (assuming you have access to a Unix/Linux-like command line with Perl installed).

SILVA database

The SILVA rRNA database project (http://www.arb-silva.de/ and http://www.ncbi.nlm.nih.gov/pubmed/17947321) is a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data. SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).

How To

The raw sequence data can be downloaded from the ARB-SILVA website at http://www.arb-silva.de/no_cache/download/archive/current/Exports/. There are several files offered and the ones that you should download are LSURef_XXX_tax_silva_trunc.fasta.tgz and SSURef_XXX_tax_silva_trunc.fasta.tgz, where XXX represents the version number such as 106 or 108. (The current way of how SILVA offers the files does not allow easy wget access and therefore requires you to download the files manually.)

The files are compressed and for easier handling during the next steps, we will extract the data:

tar -xzf LSURef_108_tax_silva_trunc.fasta.tgz
tar -xzf SSURef_108_tax_silva_trunc.fasta.tgz

(This will generate about 1 GB of output.)

Reformat extracted SILVA fasta files to remove spaces, convert RNA into DNA sequences, remove exact duplicates for a given accession number, and have sequence IDs in following form:

database_unique-sequence-number

The first input for the Perl script (formatSilvaData.pl) is the extracted FASTA file and the second one the name of the output file and sequence identifiers:

perl formatSilvaData.pl LSURef_108_tax_silva_trunc.fasta slr108
perl formatSilvaData.pl SSURef_108_tax_silva_trunc.fasta ssr108

If the taxon information is not available for a sequence (based on ARB), then SILVA assigns either "Unclassified", "unclassified", "other", or any other term that they decide to use in a new release. If the assignment is not one of the three listed here, then there will be a warning written to STDERR and the Perl script needs to be modified accordingly.

The output of the Perl script are the processed FASTA file and a file containing the taxon information. The taxon information is currently not used in the standalone version, but is required for the web-based version. For the two perl commands above, the output files are:

slr108.fasta   ssr108.fasta   slr108.taxon   ssr108.taxon

The last step is to create the database index. Since riboPicker uses BWA-SW in the backend, the processed FASTA file will be indexed using the BWA index function:

./bwa64 index -p slr108 slr108.fasta
./bwa64 index -p ssr108 ssr108.fasta

Make sure that you use the BWA version included in the riboPicker package to make sure you create the correct index for the modified BWA-SW program version. Instead of bwa64 you can use bwaMAC if you are running the commands on a Mac. If the bwa64 file is not in the current directory, you need to specify the path to the file instead of using "./bwa64" use e.g. "../ribopicker/bwa64 index -p slr108 slr108.fasta".

After creating the database index, copy the index files into the database directory of your choice and follow the instructions in the README file to setup riboPicker for the database use.

Ribosomal Database Project RDP-II database

The Ribosomal Database Project RDP-II database (http://rdp.cme.msu.edu/misc/resources.jsp and http://www.ncbi.nlm.nih.gov/pubmed/19004872) provides researchers with quality-controlled bacterial and archaeal small subunit rRNAs.

How To

The steps are similar to the ones described for SILVA and therefore, are shortened to only highlight the differences. The example commands are for version 10.28 and need to be modified for later releases of the database.

Get raw data (GenBank format and unaligned):

wget http://rdp.cme.msu.edu/download/release10_28_unaligned.gb.gz

Extract data:

gzip -d release10_28_unaligned.gb.gz

Format data (formatRdpData.pl):

perl formatRdpData.pl release10_28_unaligned.gb rdp1028

Create database index:

./bwa64 index -p rdp1028 rdp1028.fasta

GreenGenes database

The GreenGenes database (http://greengenes.lbl.gov/ and http://www.ncbi.nlm.nih.gov/pubmed/16820507) is a 16S rRNA database.

How To

The steps are similar to the ones described for SILVA and therefore, are shortened to only highlight the differences. The example commands are for the release from May 31, 2011 and need to be modified for later releases of the database.

Get raw data (GreenGenes format):

wget http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz

Extract data:

gzip -d greengenes16SrRNAgenes.txt.gz

Format data (formatGreengenesData.pl):

perl formatGreengenesData.pl greengenes16SrRNAgenes.txt gg20110531

Please note that the raw database may contain sequences annotated as vectors. If the phylum information is not available for Eukaryota, the next available classification such as family is used (if available).

Create database index:

./bwa64 index -p gg20110531 gg20110531.fasta

NCBI archaeal/bacterial complete genomes rRNA database

The NCBI archaeal/bacterial complete genomes rRNA database contains data from the Microbial Genomes Resources (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html and http://www.ncbi.nlm.nih.gov/pubmed/21097890) that present public data from prokaryotic (Archaea and Bacteria) genome sequencing projects. The sequence collection contains data from finished genomes as well as draft assemblies.

How To

The steps are similar to the ones described for SILVA and therefore, are shortened to only highlight the differences. The example commands are for a certain release and need to be modified for later releases of the database.

Get raw data (GenBank format):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz

Extract data (this will generate >1,000 separate directories for each genome):

tar -xzf all.gbk.tar.gz

Get list of GBK files:

find . -type f | sed "s#^.#$(pwd)#" | grep '.gbk$' > gbk_files.txt

Format data (ignore 5.8S, 4.5S, 6S and 7S rRNA annotations, and sequences longer than 5,000 bp; formatNcbiData.pl):

perl formatNcbiData.pl gbk_files.txt ncbibact20120117 > missing.txt

Check for missing information:

grep "Missing" missing.txt

If the file is empty, then you can create the index. If not, then you might need to modify the Perl script to add the missing information. You should check other outputs such as "Wrong annotation?" and "Missing taxon info" warnings using "less missing.txt" or a text editor, but you can ignore those as those will be dealed with in the Perl script (wrong annotations will be filtered out and missing taxon info will be assigned "unknown").

Create database index (for 16S, 23S and 5S sequences):

./bwa64 index -p ncbibact2012011716S ncbibact2012011716S.fasta
./bwa64 index -p ncbibact2012011723S ncbibact2012011723S.fasta
./bwa64 index -p ncbibact201201175S ncbibact201201175S.fasta

or using a loop in Perl

perl -e 'foreach(16,23,5){print STDERR `./bwa64 index -p ncbibact20120117$_\S ncbibact20120117$_\S.fasta`;}'

HMP reference genomes rRNA database

The HMP reference genomes rRNA database contains data from the Human Microbiome Project (http://www.hmpdacc.org/HMRGD/ and http://www.ncbi.nlm.nih.gov/pubmed/19819907).

How To

Get raw data (GenBank format; does only work if you accepted the Data Usage Agreement, download manually otherwise):

wget http://downloads.hmpdacc.org/data/reference_genomes/all_gbk_20111206.tar.gz

Extract data (creates original directory structure):

tar -xzf all_gbk_20111206.tar.gz

Format data (ignore 5.8S annotations in Bacterial genomes; formatHmpData.pl):

perl formatHmpData.pl local/db/repository/ncbi/dacc_reference_genomes/20111206/all_gbk_20111206/ hmp20111206

The HMP data contains rRNA misannotations and requires manual corrections of those. For example, a specific region should not be annotated as both 16S and 23S:

File: 19921.AASB02000001-AASB02000256.nuc.gbk
[...]
     gene            complement(1..1732)
                     /locus_tag="HMPREF0776_r0005"
     rRNA            complement(1..1732)
                     /locus_tag="HMPREF0776_r0005"
                     /product="23S ribosomal RNA"
     gene            complement(81..1620)
                     /locus_tag="HMPREF0776_r0006"
     rRNA            complement(81..1620)
                     /locus_tag="HMPREF0776_r0006"
                     /product="16S ribosomal RNA"
[...]

The easiest way to check for misannotations is to use the riboPicker web-based version as those databases contain the latest version and are manually checked for misannotations. If you, for example, compare the 23S HMP sequences against the 16S databases in riboPicker and find strong hits (use high thresholds and take a look at the coverage plots), then you should remove those before generating the database index. It should be noted that the rRNA sequence from the example above will be ignored as those are overalapping annotation. To include those sequences, either add them manually or correct the annotation.

Create database index (for 16S, 23S and 5S sequences):

./bwa64 index -p hmp2011120616S hmp2011120616S.fasta
./bwa64 index -p hmp2011120623S hmp2011120623S.fasta
./bwa64 index -p hmp201112065S hmp201112065S.fasta

or using a loop in Perl

perl -e 'foreach(16,23,5){print STDERR `./bwa64 index -p hmp20111206$_\S hmp20111206$_\S.fasta`;}'

Rfam database

The Rfam database (http://rfam.sanger.ac.uk/ and http://www.ncbi.nlm.nih.gov/pubmed/21062808) is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). The 5S and 5.8S data from Rfam adds rRNA sequences not included in the 16S and SILVA databases.

How To

Get raw data (requires manual download):

http://rfam.sanger.ac.uk/family/RF00001#tabview=tab2
http://rfam.sanger.ac.uk/family/RF00002#tabview=tab2

The data can be downloaded from those URLs. For the "Formatting Options", choose the following and then click on the "Generate" button:

Full
Name/start-end
FASTA (UNgapped)
Download

Format data (formatRfamSeqs.pl):

perl formatRfamSeqs.pl RF00001_full.txt rfam5s101
perl formatRfamSeqs.pl RF00002_full.txt rfam58s101

The following entries are ignored as they seem to be contamination rather than rRNA sequences: AY497569.1, AY497571.1, AY497570.1, AY497568.1, AY497572.1, and AY497573.1.

The taxon information requires multiple additional steps due to missings identifiers in the NCBI Taxonomy as a result of discontinued records at NCBI that are still used in Rfam. If you need to extract the taxon information, I can provide the additional Perl scripts that you will need to map the identifiers. For now, please download the tarred taxon data: rfam_taxon.tar.gz.

Create database index:

./bwa64 index -p rfam5s101 rfam5s101.fasta
./bwa64 index -p rfam58s101 rfam58s101.fasta

Human rRNA database

The human rRNA database contains the human subset of the SILVA database and the NCBI human genome rRNA sequences.

How To

To get the SILVA data, go to:

http://www.arb-silva.de/

Select "Browser" and then for both LSU and SSU follow this path:

Eukaryota > Metazoa > Chordata > Craniata > Vertebrata > Euteleostomi > Mammalia > Eutheria > Euarchontoglires > Primates > Haplorrhini > Catarrhini > Hominidae > Homo

Click on "Homo" to add sequences to the cart.

Click on "Download" in the cart panel (top right), wait for the download file to be generated and download the file when done.

Reformat the downloaded SILVA fasta files to remove spaces, convert RNA into DNA sequences, remove exact duplicates, and have sequence IDs (formatSilvaData.pl):

perl formatSilvaData.pl human_rrna_silva.fasta silvahuman20120220

Download the human genome data from NCBI:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p5_*.fa.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/gbs/hs_ref_GRCh37.p5_*.gbs.gz

Get list of files:

find . -type f | grep gbs | sed "s#.gbs.gz##" | sed "s#^.#$(pwd)#" > files.txt

Format data (formatNcbiHumanData.pl):

perl formatNcbiHumanData.pl files.txt ncbihuman20120220

(The .gz files do not need to be decompressed.)

Repeat for other human genomes at NCBI.

Combine data:

cat silvahuman20120220.fasta ncbihuman20120220.fasta > humanrrna20120220.fasta
cat silvahuman20120220.taxon ncbihuman20120220.taxon > humanrrna20120220.taxon

Create database index:

./bwa64 index -p humanrrna20120220 humanrrna20120220.fasta

Non-redundant rRNA database

The rrnadb is a non-redundant database that combines all the above databases and is provided for the standalone version. The database contains 16S, 18S, 23S, 28S, 5S and 5.8S rRNA sequences.

How To

The first step is to combine all the processed FASTA files from the databases above (this assumes that you followed the steps above to create the FASTA files for each database). If you removed all the raw FASTA files and the current directory only contains the preprocessed FASTA files, then you can simply use:

cat *.fasta > all.fa

Otherwise, you have to specify all the files separately that you want to combine:

cat slr108.fasta ssr108.fasta rdp1028.fasta gg20110531.fasta ncbibact2012011716S.fasta ncbibact2012011723S.fasta ncbibact201201175S.fasta hmp2011120616S.fasta hmp2011120623S.fasta hmp201112065S.fasta rfam5s101.fasta rfam58s101.fasta > all.fa

Format data using PRINSEQ (http://prinseq.sourceforge.net/):

perl prinseq-lite.pl -log -verbose -fasta all.fa -derep 12345 -out_good rrnadb -out_bad all_reps

Please note that this step requires several GB of memory (approx. 12 GB). If you don't have those resources available, you can ignore non-exact duplicates and modify the command for dereplication to "-derep 14".

Create database index:

./bwa64 index -p rrnadb rrnadb.fasta >bwa64.out 2>&1 &

This command has added ">bwa64.out 2>&1 &" at the end. This allows to create the database index in the background and write all the outputs from BWA to the file bwa64.out.