Welcome to BRIXL!
This site will host losslessly compressed transcriptomic data for model organisms.
A major challenge in analyzing RNA-seq data is that each data file typically runs to gigabytes, which causes problems in storage, transmission
and analysis. However, these files do not need to be so large and can be reduced without loss of information. Each RNA-seq
file that you generate or retrieve from databases, either in .SRA or .fastq format,
contains numerous identical reads stored as separate entries. I have processed
these data (with my program ARSDA) so that identical reads are represented in the SeqID_NumCopy format. For
example, among 44603541 forward reads in the SRR4011234.sra file
from NCBI, one read has 497027 identical copies, which are stored as 497027 separate entries. I group identical reads into sequence groups (SeqGr) and
store each group as a single entry, e.g., SeqGr1_497027 means that one read has 497027 identical copies in the .SRA
or .fastq file. This dramatically reduces not only the file size but also the time needed for downstream
RNA-seq data analysis. Instead of repeatedly mapping the same read half a million times against a
genome, you map it just once. I call this new format FASTA+, e.g.,
>SeqGr1_497027
ACCGGA...
>SeqGr2_56443
ACCGAC...
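As a rough illustration of the grouping idea (this is not ARSDA's actual implementation), the following Python sketch collapses identical reads into FASTA+ records. The function names collapse_reads and to_fasta_plus, the toy reads, and the choice to number groups by descending copy number are all assumptions made for this example.

```python
from collections import Counter


def collapse_reads(reads):
    """Group identical reads and return (seq_id, sequence) pairs.

    Assumption: groups are numbered by descending copy number, with ties
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(reads)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (f"SeqGr{index}_{copies}", sequence)
        for index, (sequence, copies) in enumerate(ranked, start=1)
    ]


def to_fasta_plus(records):
    """Render (seq_id, sequence) pairs as FASTA+ text."""
    return "".join(f">{seq_id}\n{sequence}\n" for seq_id, sequence in records)


# Toy reads for illustration only; real reads are much longer.
reads = ["ACCGGA", "ACCGAC", "ACCGGA", "ACCGGA", "ACCGAC", "TTGCAA"]
fasta_plus = to_fasta_plus(collapse_reads(reads))
print(fasta_plus)
```

Six input reads become three entries here, and the copy number in each ID keeps the information needed to recover the original counts.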
The generation of FASTA+ files is time-consuming but it needs to be done only once. To save time for the user, I have generated these FASTA+ files for
model species and created BLAST databases from them, so that you can download them for further data analysis. These data are grouped by species below.
If you use these data for RNA-seq analysis, you need to use the number after
the underscore in the sequence ID, which is the number of identical copies (e.g., you need to use 497027 in SeqGr1_497027). My program ARSDA
(for Analyzing RNA-Seq Data)
uses files in FASTA+ format for analyzing gene expression, alternative splicing and ribosome-profiling data. ARSDA is best used together with DAMBE. Both are menu-driven, user-friendly programs, and both take just a few clicks to install.
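If you process FASTA+ files with your own scripts rather than ARSDA, the copy number after the underscore has to be used as a weight. The Python sketch below shows one way to do this. The function names, the gene labels and the third sequence ID (SeqGr3_12) are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def copy_number(seq_id):
    """Return the integer after the last underscore, e.g. 497027 for SeqGr1_497027."""
    return int(seq_id.rsplit("_", 1)[1])


def weighted_counts(hits):
    """Sum copy numbers per gene for (seq_id, gene) pairs.

    Each FASTA+ entry that matches a gene contributes its full copy number,
    not 1, because it stands for that many identical reads.
    """
    counts = defaultdict(int)
    for seq_id, gene in hits:
        counts[gene] += copy_number(seq_id)
    return dict(counts)


# Hypothetical mapping results: which sequence group matched which gene.
hits = [
    ("SeqGr1_497027", "geneA"),
    ("SeqGr2_56443", "geneA"),
    ("SeqGr3_12", "geneB"),
]
print(weighted_counts(hits))
```

Counting each entry as a single read would give geneA a count of 2 instead of 553470, which is why the number after the underscore must not be ignored.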
Zipped BLAST databases derived from the original .SRA files can be found at the following links, together with a file of coding sequences (CDSs) for you to practise gene expression characterization using ARSDA:
|