rRNA identification and removal.


If you can't find the answer to your question, take a look at the manual or the Q&A site.

What is riboPicker?

riboPicker is a publicly available tool that is able to automatically identify and efficiently remove rRNA-like sequences from metatranscriptomic and metagenomic datasets. It is easily configurable and provides a user-friendly interface. The interactive web interface facilitates visualizations of the results and export functionality for subsequent data processing, and is available at http://edwards.sdsu.edu/ribopicker or by clicking on "Use riboPicker" in the menu above.

How can I cite riboPicker?

If you use riboPicker, please cite:
Schmieder R, et al.: Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics 2012, 28:433-435. [PMID: 22155869]

	title = {Identification and removal of ribosomal RNA sequences from metatranscriptomes},
	volume = {28},
	issn = {1367-4811},
	url = {http://www.ncbi.nlm.nih.gov/pubmed/22155869},
	doi = {10.1093/bioinformatics/btr669},
	number = {3},
	journal = {Bioinformatics {(Oxford,} England)},
	author = {Schmieder, Robert and Lim, Yan Wei and Edwards, Robert},
	month = feb,
	year = {2012},
	note = {{PMID:} 22155869},
	pages = {433--435}

Why should I use riboPicker?

The majority of RNA recovered in metatranscriptomic studies is ribosomal RNA (rRNA), not mRNA, often exceeding 90% of the total reads (Stewart et al., 2010; doi:10.1038/ismej.2010.18). Even after various treatments prior to sequencing, the observed rRNA content decreases only slightly (He et al., 2010; doi:10.1038/nmeth.1507). It is estimated that misannotations of rRNA as proteins may cause up to 90% false positive matches of rRNA-like sequences in metatranscriptomic studies (Tripp et al., 2011; doi:10.1093/nar/gkr576). These false positive matches are a serious concern for downstream analysis and can lead to erroneous conclusions. The removal of rRNA-like sequences is therefore a necessary step for all metatranscriptomic projects.

There are several advantages in using riboPicker to remove rRNA-like sequences:
   - Removal of rRNA-like sequences improves the reliability of downstream data analysis
   - The web application allows users to pre-process their datasets without installing any software or preparing any databases
   - It takes about 15-20 minutes to screen an average-size metatranscriptome for rRNA-like sequences

How does it work?

The graphic below shows the four basic steps of riboPicker's web interface: (i) Select a dataset and the databases; (ii) Automatic processing of the input data; (iii) Select thresholds and data for download; and (iv) Download results.

Figure: riboPicker steps

Is there a standalone version of the program?

Yes, there is a standalone version of riboPicker available. The Perl code and modified BWA-SW source code (under "Downloads") can be used to run riboPicker as a standalone version, if required.
Currently, the web version uses the standalone version in the backend with the following parameter settings:
perl ribopicker.pl -no_seq_out -keep_tmp_files -id WEB_ID -dbs DATABASES -out_dir WEB_DIR -f INPUT_FILE -z 3
If you want to achieve the same results as the web version, please download the databases via FTP or HTTP and specify the parameter -z 3 (e.g.: perl ribopicker.pl -c 50 -i 75 -l 30 -z 3 -dbs DATABASES -f INPUT_FILE).

What file formats does riboPicker support?

You can submit files in FASTA or FASTQ format using the web version, and in FASTA format using the standalone version. The files can also be compressed in ZIP, BZIP2, LZOP or GZIP format (web version only).

What is the maximum number of sequences that I can submit to the web version?

There is no limit on the number of sequences that you can submit. However, there is a limit on the size of the file that you can upload: the web service currently accepts files up to 600 MB. If you compress your data, you can submit around 2 GB of sequence data.

Where can I set the threshold parameters?

The riboPicker web interface does not require the setting of threshold parameters (such as query coverage or alignment identity) before the data is processed. Instead, the threshold parameters are set after the data is processed. This allows users to choose parameters appropriate for their dataset without having to submit and process the same data several times with modified parameters. The riboPicker standalone version requires the thresholds as input prior to data processing.
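
Setting thresholds after processing means that one set of alignment results can be filtered repeatedly with different values. The following Python sketch illustrates the idea only; the Hit record, its field names, the example values and the use of inclusive (>=) comparisons are assumptions made for illustration, not riboPicker's internal data structures.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One alignment result (hypothetical record layout)."""
    query_id: str
    coverage: float  # percent of the query covered by the alignment
    identity: float  # percent identity of the alignment


def rrna_like(hits: List[Hit], min_coverage: float, min_identity: float) -> List[str]:
    """Return the IDs of queries whose hit meets both thresholds."""
    return [
        h.query_id
        for h in hits
        if h.coverage >= min_coverage and h.identity >= min_identity
    ]


# Alignment results are computed once ...
hits = [
    Hit("read1", 100.0, 99.0),
    Hit("read2", 92.0, 96.0),
    Hit("read3", 60.0, 100.0),
]

# ... and can then be filtered with different thresholds without re-aligning.
strict = rrna_like(hits, min_coverage=95, min_identity=98)
relaxed = rrna_like(hits, min_coverage=90, min_identity=95)
```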

What threshold values should I use?

The identity threshold should be set according to your expected error rate. This means that if your dataset has an average error rate of 2%, then your identity threshold should be set to 97% [= 100% - (error rate + 1% margin)] or below. The base N in your query sequence always mismatches the reference sequence; therefore, sequences containing Ns cannot be aligned with 100% identity (unless the Ns occur at the ends of the sequence and the alignment stops before reaching them).
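
The identity rule above can be written as a small calculation. This is a minimal Python sketch; the function name and the range check are illustrative additions, while the formula and the 1% margin come from the text.

```python
def max_identity_threshold(error_rate_pct: float, margin_pct: float = 1.0) -> float:
    """Upper bound for the identity threshold: 100% - (error rate + margin)."""
    total = error_rate_pct + margin_pct
    if not 0 <= total <= 100:
        raise ValueError("error rate plus margin must be between 0 and 100 percent")
    return 100.0 - total


# A dataset with an average error rate of 2% gives an upper bound of 97%.
print(max_identity_threshold(2.0))
```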
The coverage value should be selected based on the quality of your data. If your sequences are likely to have many errors at the 3'-end, then the alignment might not fully cover the query sequence. A value between 90% and 95% should be selected if unsure.
The coverage vs. identity plots can be helpful for the selection of the threshold values. The bar charts at the top and right show how the sequences were aligned. The higher the bar, the more sequences were aligned with this coverage or identity value. For example, if you see high bars in the right chart for 100% to 98% and low bars for 97% and below, then you should set your identity threshold to at most 98%.

How long do you keep the data submitted to the web version?

You can choose whether we keep your data accessible for one day (24 hours) or one week (168 hours). You can also ask us to delete the data once you are done, or to keep it for a longer period.

How were the databases for the web version generated?

The web-based version of riboPicker offers preprocessed reference databases for a variety of rRNA resources. The source data was converted to DNA sequence files and the corresponding taxonomic information was retrieved and manually curated. The sequence data was then collapsed for exact duplicates to reduce redundancy, while keeping the taxonomic information for each source sequence. Due to the possibility of misannotations, further filtering of the datasets was performed (such as length filtering) before creating the final database. Details for each database can be accessed using the "Details" button on the input page.
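
The conversion, collapsing and filtering steps described above can be sketched in Python as follows. This illustrates the general technique only and is not the pipeline used to build the web databases; the function names, the (sequence, taxonomy) input layout, the example records and the min_length value are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def to_dna(seq: str) -> str:
    """Convert an RNA sequence to upper-case DNA letters (U -> T)."""
    return seq.upper().replace("U", "T")


def collapse_duplicates(
    records: Iterable[Tuple[str, str]], min_length: int = 0
) -> Dict[str, List[str]]:
    """Collapse exact duplicate sequences and keep every source taxonomy.

    records: (sequence, taxonomy) pairs.
    Returns a mapping from each unique DNA sequence to its taxonomies.
    """
    collapsed: Dict[str, List[str]] = defaultdict(list)
    for seq, taxonomy in records:
        collapsed[to_dna(seq)].append(taxonomy)
    # Length filtering is applied after collapsing (example of "further filtering").
    return {s: taxa for s, taxa in collapsed.items() if len(s) >= min_length}


records = [
    ("ACGU", "Bacteria;A"),
    ("acgt", "Bacteria;B"),  # same sequence once converted to DNA
    ("GG", "Archaea;C"),     # removed by the length filter below
]
database = collapse_duplicates(records, min_length=3)
```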

I can't find the right database for my dataset. What should I do?

If you want to identify rRNA-like sequences in your dataset using data that is not listed as a database, you can contact us and we can add the database to the web version. If you want to use the databases from the web version, take a look at "Is there a standalone version of the program?" above. If you want to create your own database, follow the steps described under "Manual".

I am getting an error. What should I do?

ERROR: cannot find all database files for database "db" in dir "directory".
Please make sure that the directory path does not start with "~" and that it ends with "/". If you are unsure from which directory the program will be run, specify the full path to the database directory.
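
If the database directory is assembled by a wrapper script, the path can be normalized before it is handed to riboPicker. This is a minimal Python sketch that assumes a POSIX system; the function name normalize_db_dir and the example paths are hypothetical.

```python
import os


def normalize_db_dir(path: str) -> str:
    """Expand a leading "~", make the path absolute, and end it with "/"."""
    expanded = os.path.abspath(os.path.expanduser(path))
    return expanded if expanded.endswith("/") else expanded + "/"


# Example: a home-relative path becomes a full path with a trailing slash.
print(normalize_db_dir("~/rrnadb"))
```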

ERROR: system call "./bwa64 bwasw -A -f file.tsv -z 3 /directory/db file.fastq" failed: 32512.
Please make sure that you are running the program from the directory where the bwa64 file is located, or that you specified the full path to the directory where the bwa64 file is located in the config file under variable PROG_DIR.
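
The number 32512 in this message can be decoded by hand. Perl's system() returns the raw wait status, in which the exit code is stored in the high byte; assuming the message reports that value unchanged, the Python sketch below splits it into exit code and signal number (the function name is illustrative).

```python
from typing import Tuple


def decode_wait_status(status: int) -> Tuple[int, int]:
    """Split a 16-bit wait status into (exit_code, signal_number)."""
    return status >> 8, status & 0x7F


exit_code, signal_number = decode_wait_status(32512)
print(exit_code, signal_number)  # 127 0
```

POSIX shells conventionally use exit code 127 for "command not found", which is consistent with the advice above: the bwa64 executable was not found at the given location.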

How was the BWA-SW source code modified?

riboPicker uses a modified version of the BWA-SW source code. The file bwtsw2_aux.c was modified to generate an alternative, lightweight tab-separated output format containing only the data required by riboPicker (query identifier, reference identifier, query coverage and alignment identity). The file bwtsw2_aux.c was additionally modified to force a mismatch when aligning the ambiguous base N in query sequences, instead of randomly replacing it with A, C, G or T and possibly producing a match (the BWA-SW default). The files stdaln.c, stdaln.h and bwtsw2_aux.c were modified to include "R" for replacements in an extended version of the Cigar string, instead of using "M" for both match and replacement (mismatch). The files bwtsw2_main.c and bwtsw2.h were modified to fix the doubly defined parameter -s (changed to -s and -S), and to add the new parameters -A (generate alternative output), -R (output extended version of Cigar string with replacements) and -M (force Ns in the query sequence to mismatch). The modified version of BWA-SW is made available as part of the riboPicker source.
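
For readers who want to inspect the alternative output themselves, the following Python sketch parses a four-column tab-separated file. The exact layout is an assumption: it supposes one hit per line, with the columns in the order listed above and numeric coverage and identity values. The Alignment type, the function name and the example lines are hypothetical, so check a real output file before relying on it.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """One line of the alternative output (assumed column order)."""
    query_id: str
    reference_id: str
    coverage: float
    identity: float


def parse_alt_output(lines: Iterable[str]) -> Iterator[Alignment]:
    """Yield one Alignment per non-empty tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, reference_id, coverage, identity = line.split("\t")[:4]
        yield Alignment(query_id, reference_id, float(coverage), float(identity))


# Made-up example lines, not real riboPicker output.
example = ["read1\tref42\t100\t98.5\n", "\n", "read2\tref7\t91\t95\n"]
records = list(parse_alt_output(example))
```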