Using Kallisto for expression analysis of published RNAseq data

update (2015-11-10):  Several readers have pointed out that using TPM for limma/voom analysis is inappropriate.  See more here from the developers of limma.  I have gone back and replaced the TPM-based differential analysis with the estimated count output from Kallisto.  Big thanks to Julien Roux, John Dunkan, and Alex for the comments!

I was very excited when I first read about Kallisto, from Lior Pachter's lab, for transcript abundance measurement from RNA-seq.  The advantage Kallisto offers over other algorithms is that, by performing 'pseudo-alignment' (or 'alignment light', as others have put it), it significantly cuts down on total run time.  The Kallisto pre-print reports run times of about 5 minutes for 30 million PE reads for pseudo-alignment and transcript abundance estimation, using the same Expectation-Maximization algorithm as RSEM, all on a standard laptop.  Not having to rely on your University's computer cluster of hundreds of compute nodes, or wait hours/days of runtime for RNAseq analysis, is an obvious cost/time advantage for Kallisto over existing pipelines.  In my RNAseq analysis work, we have been using the STAR aligner for alignment, followed by HTSeq-count for binning reads to gene features, and finally DESeq2 for gene expression analysis with raw counts.  In parallel, we also generate TPM and estimated count values with RSEM, which uses Bowtie for alignment to the transcriptome.  This generally takes several hours per sample on our University's HPC.

I wanted to try using Kallisto to analyze publicly available, already-published datasets, run differential gene expression analysis with limma-voom and then compare to the published results.  In my head, the workflow would be this:

  1. Download the raw fastq files from the NCBI Sequence Read Archive (SRA).
  2. Run Kallisto on the PE reads for each sample to generate estimated counts and TPM values.
  3. Use the R/Bioconductor package limma to identify differentially expressed transcripts.  Limma was originally developed to fit linear models to microarray data, but it also applies to RNAseq data by modeling the log counts per million (CPM) of the estimated counts (see the sketch after this list).
  4. Compare the results to the published findings.
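
For step 3, here is a minimal sketch of the limma-voom call chain, assuming counts is a transcript-by-sample matrix of Kallisto estimated counts and design is a model matrix (the full working version appears later in this post):

library("limma")
library("edgeR")
y <- DGEList(counts = counts)    # container for counts and library sizes
v <- voom(y, design)             # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design))  # fit linear models, then empirical Bayes moderation
topTable(fit)                    # top-ranked differentially expressed transcripts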

While researching how to do each step, I found Andrew McKenzie's very good blog detailing exactly what I had in mind.  In Andrew's R code, he used several functionals that I had not seen/used before, and I found his blog very well done.  I am going to re-use a lot of his R code, but also try to go a little further by analyzing a recent publication that I found interesting and by running differential gene expression analysis on Kallisto's estimated count values with the limma-voom package.  Since the publication reported differentially expressed genes, we can check whether the Kallisto output closely matches their findings.

Dr. David Tuveson's group recently published a very interesting paper in Cell (Boj et al. 2015) reporting the generation of organoid cultures from genetically engineered mouse models of pancreatic cancer.  The authors generated organoids from normal pancreas (mN), early-stage lesions called Pancreatic Intraepithelial Neoplasias or PanIN (mP), or pancreatic adenocarcinoma (mT).  The PanIN-stage organoids were generated from "KC" mice, which have a pancreas-specific gene promoter driving expression of Cre recombinase.  In turn, Cre recombinase 'turns on' an oncogenic Kras G12D mutation by excising a stop cassette upstream of the mutation.  Signaling from Kras G12D is sufficient to turn normal pancreatic epithelium into PanIN-stage early lesions; roughly 99% of PanINs have a Kras mutation based on sequencing studies.  To model pancreatic adenocarcinoma, the authors used the KPC model, which contains, in addition to Cre and Kras G12D, a p53 R172H hotspot mutation that exerts dominant-negative effects on the other allele.  Mutations in p53 occur in roughly 50% of human PDA.  KPC mice develop pancreatic adenocarcinoma that recapitulates the human disease with high fidelity, including multi-organ metastasis, which is rare in other mouse models of cancer.

The organoids generated by Boj et al. have the potential to serve as a platform for genetic manipulation of PDA cells at various stages, to investigate gene/network interactions or run drug screens.  For our purposes, the authors ran RNA sequencing on these organoids (~65 million paired-end reads per sample).  They report the Gene Expression Omnibus (GEO) accession number in their paper as GSE63348 and the Sequence Read Archive (SRA) accession as SRP049959.  To download the SRA files for 8 samples (2 normal, 3 PanIN and 3 tumor), I made a new directory and ran the following bash script:


#!/bin/bash

# mN5
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654626/SRR1654626.sra
# mN7
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654628/SRR1654628.sra
# mP1
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654633/SRR1654633.sra
# mP4
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654636/SRR1654636.sra
# mP5
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654637/SRR1654637.sra
# mT3
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654639/SRR1654639.sra
# mT5
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654641/SRR1654641.sra
# mT8
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR165/SRR1654643/SRR1654643.sra

fastq-dump --split-files *.sra

This script accesses the SRA database and downloads the .sra file for each sample.  I am not downloading every sample in the dataset, as I don't have the hard drive space for the full fastq file set.  The SRA Toolkit can then convert the SRA files into fastq files for each sample (fastq-dump --split-files *.sra).  The --split-files flag splits each SRA file into its paired-end reads (SRA_File_1.fastq and SRA_File_2.fastq).  This approach generated roughly 300 GB of fastq files on my computer, ready for analysis with Kallisto.

Once I had each sample downloaded and the fastq files split, we can run Kallisto for pseudo-alignment and transcript abundance measurement.  One part of Andrew's workflow that I had not used before was declaring an array.  Loading the SRA sample IDs into an array lets us loop through each element and run Kallisto.  This script does a few things:  1. It makes an array of the sample IDs.  2. For each sample ID, it runs a for-loop that makes an output directory, starts a timer, runs Kallisto, ends the timer and writes the total Kallisto run time to a log file.


#!/bin/bash

# # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# bash script to call kallisto quantification on SRA files
# David Balli
#
# program goal:
# 1. for each SRA file ID
# 2. make output directory
# 3. start clock
# 4. run kallisto quantification on paired end reads
# 5. stop clock
# 6. calculate Kallisto run time and log to file
# # # # # # # # # # # # # # # # # # # # # # # # # # # # #

INDEX="/Users/dballi/NGS_tools/kallisto/mm10_index"

# make an array of each SRA id 
declare -a arr=("SRR1654626" "SRR1654628" "SRR1654633" "SRR1654636" "SRR1654639" "SRR1654637" "SRR1654641" "SRR1654643")

for i in "${arr[@]}"
do
 mkdir output_"$i"
 start=`date +%s`
 kallisto quant -i $INDEX -o output_"$i" "$i""_1.fastq" "$i""_2.fastq"
 end=`date +%s`
 runtime=$((end-start))
 echo "your runtime: $runtime seconds for sample $i" >> time_log.txt
done

Once Kallisto had finished running:

$ cat time_log.txt
your runtime: 1249 seconds for sample SRR1654626
your runtime: 1470 seconds for sample SRR1654628
your runtime: 2037 seconds for sample SRR1654633
your runtime: 1685 seconds for sample SRR1654636
your runtime: 2382 seconds for sample SRR1654639
your runtime: 1587 seconds for sample SRR1654637
your runtime: 1646 seconds for sample SRR1654641
your runtime: 1904 seconds for sample SRR1654643

$ ls -h # to see each output directory
output_SRR1654636
output_SRR1654637
output_SRR1654639
output_SRR1654641
output_SRR1654626
output_SRR1654643
output_SRR1654628
output_SRR1654633

Kallisto ran for an average of 29 minutes per ~65-million-read sample. As I said before, our normal workflows can take several hours (even days) depending on dataset size.
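
As a quick sanity check, averaging the runtimes logged above in R confirms that figure:

mean(c(1249, 1470, 2037, 1685, 2382, 1587, 1646, 1904)) / 60
# [1] 29.08333   (~29 minutes per sample)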

Now we can load our estimated count values into R and analyze them with the limma package. Once again following Andrew's nice workflow, we will merge the values from each sample into one data frame and do some exploratory data analysis to compare to the published findings.

################################################################################
# R script for analysis of Kallisto tutorial for differential gene expression
# using data from Boj et al., Cell 160, 324–338, January 15, 2015
# updated October 10th, 2015
# Organoid Models of Human and Mouse Ductal Pancreatic Cancer
# 
# David Balli
################################################################################

library("genefilter")
library("ggplot2")
library("limma")
library("edgeR")
library('biomaRt')
library("NMF")
library("stringr")
library("plyr")
library("RColorBrewer")

setwd("~/NGS_tools/Computational_notebook/SRA:Kalisto/")
#("2015-7-7-Kallisto-tutorial.RData")
# save.image("2015-7-7-Kallisto-tutorial.RData")
directory <- '~/NGS_tools/Computational_notebook/SRA:Kalisto/'
Files <- grep("SRR", list.files(directory),value=T)
Files <- grep(".fastq", Files, value = T, invert=T) # but we don't want fastq files

# will be list of "output_SRR*" directories for each kallisto run
Files

# for each kallisto run output directory (Files)
# loop through each directory and read its abundance.txt file (columns include target_id, length, est_counts and tpm)
# and assign the Files name to each temporary table
# run: ?assign 
# for more info on assign() function

for (i in 1:length(Files)){
 tmp = read.table(file = paste0(Files[i],"/abundance.txt"), header = T)
 assign(Files[i], tmp)
}

# get all values from File list 
# again: typing "?base::mget" will retrieve documentation
sample_list = mget(Files)

# determine list of unique names
ListNames <- Map(function(x, i) setNames(x, ifelse(names(x) %in% "target_id", names(x), sprintf("%s.%d", names(x), i))), sample_list, seq_along(sample_list))

# Hadley Wickham's Advanced R website explains many of these Functionals
# http://adv-r.had.co.nz/Functionals.html
# "Map() is useful whenever you have two (or more) lists (or data frames) that you need to process in parallel."

# merge full kallisto table
# have to use Reduce() function as merge() will only merge two data.frames
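# e.g. Reduce(function(a, b) merge(a, b), list(df1, df2, df3)) folds to merge(merge(df1, df2), df3)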
full_results <- Reduce(function(...) merge(..., by='target_id', all = T), ListNames)

# subset estimated count values (updated from the original TPM-based analysis)
count_vals <- full_results[, grep("est_counts", names(full_results))]

# now we have to assign column and row IDs to the count table
row.names(count_vals) <- full_results$target_id

# setting column names is a little trickier as each directory has "output_SRR..." as its ID
# we will use regex with the stringr and plyr packages to get down to the SRR IDs
splitNames <- str_split(string=Files, pattern = "_", n = 2)
df <- ldply(splitNames, data.frame)
FinalNames <- df[df$X..i.. != "output",]

FinalNames
# [1] SRR1654626 SRR1654628 SRR1654633 SRR1654636 SRR1654637 SRR1654639 SRR1654641 SRR1654643
# 9 Levels: output SRR1654626 SRR1654628 SRR1654633 SRR1654636 SRR1654637 ... SRR1654643
# actual IDs
ActualNames <- c("mN5",'mN7','mP1','mP4','mP5','mT3','mT5','mT8')
Groups <- data.frame(Subtypes = c("mN","mN",'mP','mP','mP','mT','mT','mT'))

Names_count <- data.frame(FinalNames, ActualNames, Groups)
Names_count

colnames(count_vals) <- ActualNames

head(count_vals)
# mN5 mN7 mP1 mP4 mP5 mT3 mT5 mT8
# ENSMUST00000000001 10392.00000 12666.00000 14283.000 13103.00 13714.000 16091.0000 14870.0000 17685.00000
# ENSMUST00000000003 0.00000 0.00000 0.000 0.00 0.000 0.0000 0.0000 0.00000
# ENSMUST00000000010 0.00000 0.00000 0.000 0.00 0.000 0.0000 0.0000 0.00000
# ENSMUST00000000028 657.04100 672.57000 427.341 724.58 878.154 961.0980 1547.2900 1958.93000
# ENSMUST00000000033 1.28795 0.00000 0.000 0.00 0.000 2.5892 0.0000 2.58194
# ENSMUST00000000049 1.00000 2.09284 3.000 2.00 0.000 1.0000 2.2568 2.00000

Now that we have a matrix of estimated counts for the 8 samples, we are ready to run the limma package. First we need to make a design matrix, which specifies which samples belong to which group for the pairwise comparisons.


des <- factor(Groups$Subtypes)
design <- model.matrix(~0+des)
colnames(design) <- levels(des)
design
#   mN mP mT
#1  1  0  0
#2  1  0  0
#3  0  1  0
#4  0  1  0
#5  0  1  0
#6  0  0  1
#7  0  0  1
#8  0  0  1
#attr(,"assign")
#[1] 1 1 1
#attr(,"contrasts")
#attr(,"contrasts")$des
# [1] "contr.treatment"

Now we are ready to filter lowly expressed genes from the count table. Then we can use principal component analysis to compare our data to the principal component plot in their Figure 5A.

# filter out lowly expressed genes
A <- rowSums(count_vals)
isexpr <- A > 1
table(isexpr)
# isexpr
# FALSE TRUE 
# 21367 66831 
count_vals <- count_vals[isexpr, ]
dim(count_vals)
y <- DGEList(counts = count_vals)

# let's do some exploratory data analysis of this RNAseq dataset
# first with principal component analysis like Figure 5A in Boj et al.,
# and compare whether Kallisto's estimated counts can replicate their data

log_count <- log2(y$counts + 1)

pcCount <- prcomp(t(log_count))

# 2D PCA plot with ggplot2
scoresCount <- data.frame(pcCount$x, Names_count) 

brewer.pal(3,"Set2")
my_palette <- brewer.pal(3,"Set2") # "#66C2A5" "#FC8D62" "#8DA0CB"

# PCA
(pcaCount <- ggplot(scoresCount, aes(x=PC1, y=PC2)) +
 geom_point(size = 4, aes(col=factor(scoresCount$Subtypes))) +
 ggtitle("Principal Components\nUsing Kallisto Estimated Counts") +
 geom_text(aes(label=scoresCount$ActualNames),hjust=0.5, vjust=1) + 
 theme_minimal())
ggsave(filename = "2015-7-9-Boj-PCA.png",plot = pcaTpm, width = 8, height = 6)

(Figure: PCA plot from Figure 5A of Boj et al., left; PCA plot of Kallisto estimated counts, right.)

The PCA plot in Figure 5A of Boj et al. is on the left, and a PCA plot of estimated counts generated by Kallisto (made in ggplot2) is on the right.  Even though we analyzed only about half the dataset, we can clearly see similar principal components in both graphs.

Now we can run the limma-voom linear model to identify differentially expressed transcripts.  Then I want to take the Ensembl IDs and replace them with MGI symbols using the biomaRt package.


exp <- voom(counts = y$counts, design = design, plot = T)
fit <- lmFit(exp, design = design)

# set up contrast matrix of conditions
cont.matrix <- makeContrasts("mT-mN",'mT-mP','mN-mP', levels = design)
fit <- contrasts.fit(fit, cont.matrix)
fit <- eBayes(fit)
topTable(fit, adjust="BH", number = Inf, p=0.05)
summary(decideTests(fit))
# mT-mN mT-mP mN-mP
# -1 62 25 29
# 0 66732 66777 66761
# 1 37 29 41

# compare the mT vs mN and mP vs mN contrasts and sort by LogFC
options(digits = 3)

# comparing normal organoids to Tumor organoids 
res_mTvsmN <- (topTable(fit, coef = 1, adjust = 'BH', n = Inf, p = 0.05)[,-2])
res_mTvsmN <- res_mTvsmN [order(-res_mTvsmN$logFC),]
head(res_mTvsmN )
nrow(res_mTvsmN ) # 99 genes

# comparing normal organoids to PanIN organoids
res_mNvsmP <- (topTable(fit, coef = 3, adjust='BH', n = Inf, p = 0.05)[,-2])
res_mNvsmP <- res_mNvsmP[order(-res_mNvsmP$logFC),]
head(res_mNvsmP)
nrow(res_mNvsmP) # 70 genes

So in all, using the abbreviated dataset, there are 99 differentially expressed genes between tumor organoids and normal organoids (mT vs mN) and 70 differentially expressed genes between PanIN-stage organoids and normal (mN vs mP).  Now let's map MGI symbols to the Ensembl IDs to get a better idea of what genes are here.


# let's also change from Ensembl ID to MGI symbol via biomaRt
mart <- useMart(biomart = 'ensembl', dataset = "mmusculus_gene_ensembl")

Names <- row.names(count_vals)
# map to mouse genome names 
attributes = c("mgi_symbol", 'entrezgene',"ensembl_transcript_id")
# Extract information from biomart for gene names
results <- getBM(attributes = attributes,
 filters = "ensembl_transcript_id",
 values = Names,
 mart = mart)
head(results)

# merge each results table with the biomart data to give MGI symbols to each ensemble id
res_mTvsmN$ensembl_transcript_id <- row.names(res_mTvsmN)
res_mTvsmN <- merge(res_mTvsmN, results, by = "ensembl_transcript_id", sort = F)
res_mTvsmN

res_mNvsmP$ensembl_transcript_id <- row.names(res_mNvsmP)
res_mNvsmP <- merge(res_mNvsmP, results, by = "ensembl_transcript_id", sort = F)
res_mNvsmP

There is some overlap between the differentially expressed genes reported in the paper and the genes identified here with limma-voom and estimated counts from Kallisto.  Obviously, not downloading the entire dataset limits our power to reproduce Boj et al.'s data as closely as possible, but it looks like the Kallisto output replicates Boj et al.'s analysis in significantly less time.

How to make a co-mutation plot

It has been way too long since I posted here, and I thought it was time to add something.

Our lab is planning to analyze a series of cancer genomes via exome sequencing.  Part of my preparation has been deciding on the best way to present NGS data and, more specifically, the mutations we identify in our samples.  The most prevalent way to present this kind of data is the co-mutation (comut) plot.  This blog post from the Broad Institute discusses the back story of how Nico Stransky came up with the idea.  Seemingly every study published by TCGA, ICGC or any genomics group uses comut plots to represent mutation burden across their cohort.

(Figure: mutsig_coMut-v2, an example comut plot.)

What motivated me to write this post is that there doesn't seem to be a decent tutorial on how to generate these plots.  I am just going to focus on the body of the graph for now, as the accompanying bar graphs are fairly standard to generate.  I am also going to explore the biomaRt package, as it is a great source for basically everything you would ever need in bioinformatics.


# how to make a Co-mutation (comut) plot with ggplot2
library("ggplot2")
library('biomaRt')

# first - lets get random gene names from biomaRt
listMarts() # will show all available databases

mart <- useMart(biomart = "ensembl")
listDatasets(mart) # will list all available datasets
mart <- useMart(biomart = 'ensembl', dataset = "mmusculus_gene_ensembl")
head(listAttributes(mart), n = 50)

# Extract information from biomart for uniprot gene names
results <- getBM(attributes = c("uniprot_genename"), mart = mart)

#random sample of 40 genes
RandomGenes <- sample(results$uniprot_genename,40)
#RandomGenes

# Now we will make an example data frame for a Co-mutation plot

# This data is generally obtained from MutSig output (or Genome MuSiC, etc)

# with 40 genes of interest in 100 subjects
df <- expand.grid(gene=1:40, subj=1:100)
df$Mutation <- as.factor(sample(c(rep("Missense", 1500),
                                  rep("Nonsense", 500),
                                  rep("Frame Shift", 1000),
                                  rep("Indel", 250),
                                  rep("Splice Site", 750)), 4000))  # 4000 calls, one per gene-subject cell
df$Mutation[sample(4000,3700)] <- NA

df$gene <- RandomGenes

# now for a Comut plot with ggplot2
mut <- ggplot(df, aes(x=subj, y=gene, height=0.8, width=0.8))
(mut <- mut + geom_tile(aes(fill=Mutation)) +
scale_fill_brewer(palette = "Set1", na.value="Grey90") +
xlab("Subject") +
ggtitle("Example Comut plot") +
theme(
legend.key = element_rect(fill='NA'),
legend.key.size = unit(0.4, 'cm'),
legend.title = element_blank(),
legend.position="bottom",
legend.text = element_text(size=8, face="bold"),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_text(colour="Black"),
axis.title.x=element_text(face="bold"),
axis.title.y=element_blank(),
panel.grid.major.x=element_blank(),
panel.grid.major.y=element_blank(),
panel.grid.minor.x=element_blank(),
panel.grid.minor.y=element_blank(),
panel.background=element_blank()
))

ggsave(mut,file="2015-5-24-ExampleComutplot.png",width=10,height=8)

And here you have a fairly simple comut plot (compared to the example above).  But I think it is a good beginning plot to build a more complex figure around; one marginal piece is sketched below the figure.

(Figure: 2015-5-24-ExampleComutplot.png, the plot generated above.)
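
To start building toward the more complex figure, the marginal bar graph of per-subject mutation burden can be computed from the same df used above (a sketch only; the styling would need to be matched to the main panel):

# count non-NA mutation calls per subject for a top-margin bar graph
burden <- aggregate(!is.na(df$Mutation), by = list(subj = df$subj), FUN = sum)
names(burden)[2] <- "n_mutations"
(bar <- ggplot(burden, aes(x = subj, y = n_mutations)) +
  geom_bar(stat = "identity") +
  xlab("Subject") + ylab("Mutations") +
  theme_minimal())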

RNAseq pipeline – alignment to DE analysis

RNA sequencing has quickly surpassed microarrays for differential gene expression analysis in most labs.  I wanted to post an RNAseq pipeline going from raw sequencing reads to differential expression analysis and various other downstream analyses.  There are quite a few pipelines for RNAseq analysis, and each has its pros/cons.  One of the most popular is alignment to the genome with the STAR aligner, binning of sequencing reads to genes/exons with HTSeq-count, and finally differential analysis with DESeq/DESeq2.  I linked the online manuals for each tool here, but I would highly suggest the initial publications for each, as those include benchmarking against other programs and the general theory behind each algorithm (here, here and here – the original DESeq paper now has over 1800 citations on Google Scholar!).  I'd suggest DESeq2.

This example assumes that you are starting with RNA sequencing reads in fastq format after quality control (FastQC, etc.) and demultiplexing (fastq-multx, etc.).  Those steps will be a future post.  Depending on who is doing your actual sequencing (in-house core, BGI, etc.), they will probably release the fastq reads to you after each of these steps has been done.

1. Alignment to the genome with STAR aligner v2.4.0.1.  STAR is a wonderfully fast and accurate alignment program written by Alexander Dobin.  Alignment speeds with STAR are advertised at 400 million pairs per hour for 100 bp paired-end reads on a 12-core server (compared to TopHat at 20 million reads per hour).  In our hands, on 50 bp paired-end reads, we align about 600 million pairs per hour on average, with a ~75-80% unique alignment rate.  The key with STAR is that it needs A LOT of memory, at least 30 GB.  Our cluster has dedicated job queues to allocate up to 64 GB of memory for each STAR alignment job.

1a) Genome index generation.  Before alignment can be done, you must prepare the necessary genome index files.  The basic inputs are 1) the genome in fasta format and 2) an annotation GTF file (needed in later steps).  You can download the genome files for human/mouse/etc. from Illumina's iGenomes.  For the new version of STAR, there are pre-made index files, depending on which reference you want to build.  They have various builds and model organisms.  We have been using the mouse mm10 reference from UCSC and have been happy with the alignment rates.  I downloaded the mouse mm10 reference from iGenomes and renamed it "mm10_index/".  You will also have to designate a directory for the genome index files STAR produces ("~/STAR_mm10/" or "/path/to/STAR_mm10/").

STAR --runMode genomeGenerate \
--genomeDir /path/to/STAR_mm10/ \
--genomeFastaFiles ~/mm10_index/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa \
--runThreadN 16

After STAR runs, you should have a few files in the /STAR_mm10/ directory:

  1. Genome
  2. genomeParameters.txt
  3. chrLength.txt, chrNameLength.txt, chrStart.txt
  4. SA, SAindex
  5. Log.out

STAR will use these files for alignment in the next step.

1b) Alignment to the genome (or transcriptome).  The newer version (2.4.0) has the option to align specifically to the transcriptome instead of the genome.  This option is for downstream analysis with RSEM (another future post).  I'll detail the basic STAR alignment job for now.

STAR --genomeDir /path/to/STAR_mm10/ \
--readFilesIn read_1.fq read_2.fq \
--runThreadN 16 \
--outSAMtype BAM SortedByCoordinate 

STAR will generate a few files:

  1. Log.out
  2. Log.progress.out
  3. Log.final.out – summary mapping statistics – will give you an idea of alignment QC
  4. Aligned.sortedByCoord.out.bam – your reads aligned in standard BAM format sorted by coordinate – similar to samtools sort command
  5. SJ.out.tab – high confidence splice junction in tab-delimited format

To generate an alignment file to the transcriptome with STAR v2.4.0:

STAR --genomeDir /path/to/STAR_mm10/ \
--readFilesIn read_1.fq read_2.fq \
--runThreadN 16 \
--quantMode TranscriptomeSAM

I haven't had a chance to really run through this option yet (it's more for RSEM than for DESeq2), but I am excited to try.

2) Sorting, munging, etc. (optional).  The new version of STAR can output a BAM file sorted by genomic coordinate.  Earlier versions didn't, and you would have to run the Aligned.out.sam file through various samtools munging steps to get the alignment into the correct format for downstream read counting.  This example uses samtools-0.1.19.  First, convert Aligned.out.sam to BAM with samtools view, piping the output directly to samtools sort to sort by genome position; finally, convert back from BAM to SAM.  If you have the python module pysam installed, you can use the BAM file directly in the next steps.

samtools view -buS Aligned.out.sam | samtools sort -@ 8 -m 800000000 - Aligned.sorted
samtools view Aligned.sorted.bam > Aligned.sorted.sam

3) Raw count generation using HTSeq-count.

htseq-count has a few options:

  • -m union : union is one of three modes htseq-count offers for assigning reads to features (genes/exons), and it is the recommended option per the htseq-count manual.  A read is counted toward a gene only if it unambiguously overlaps that single gene; reads overlapping multiple genes are discarded as ambiguous.
  • -r pos : the BAM/SAM file is sorted by genomic coordinate/position
  • -i gene_name : tells htseq-count to use the gene name attribute as the feature ID in the output file
  • -a 10 : tells htseq-count to skip reads with an alignment quality score less than 10
  • --stranded=no : the RNAseq reads are not strand specific

htseq-count -m union \
-r pos \
-i gene_name \
-a 10 \
--stranded=no \
Aligned.sortedByCoord.out.bam \
~/mm10_index/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf > output_basename.counts

The output of htseq-count is a tab-delimited text file containing two columns: 1) gene ID and 2) number of read counts. Below is an example of various cadherins with the number of reads mapped to each.

Cdh1 731
Cdh10 41
Cdh11 129
Cdh12 0
Cdh13 50538
Cdh15 0
Cdh16 0
Cdh17 648
Cdh18 0
Cdh19 0
Cdh2 570
Cdh20 0
Cdh22 0
Cdh23 11
Cdh24 3
Cdh26 140
Cdh3 3
Cdh4 0
Cdh5 237
Cdh6 496
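
For a quick look at one of these files in R before moving on (a minimal sketch; output_basename.counts is whatever name you gave the htseq-count output above):

counts <- read.table("output_basename.counts", sep = "\t",
                     col.names = c("gene_id", "count"))
head(counts)
tail(counts)  # the last rows are htseq-count's summary counters (no_feature, ambiguous, etc.)

DESeq2's DESeqDataSetFromHTSeqCount function, used in the next step, reads these per-sample files directly and strips those summary rows for you.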

4) differential gene expression analysis using DESeq2. 

<pre>####################################################################################
# bioinformatic analysis of RNA-seq data using DESeq2
#
# combined from DESeq2 manual and vignette
# http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
# and from Dave Wheeler's blog at Massey University
# http://dwheelerau.com/2014/02/17/how-to-use-deseq2-to-analyse-rnaseq-data/
#
# analysis per htseq count -intersections-nonempty
# controlling for batch (different RNAseq prep libraries)
#
# notes:
#
####################################################################################

library("DESeq2")
setwd("~/path/to/working/directory/")
directory <- "/path/to/counts/directory/"

# can merge individual sample files (i.e. ctrl1.counts, ctrl2.counts, etc.)
sampleFiles <- grep("counts"list.files(directory),value=T)

# view sampleFiles
sampleFiles

# can designate different batches of samples (i.e. different sequencers,
# PE vs SE, different library preps (eg. BATCH1 vs BATCH2))
sampleBatch <- c("Batch1","Batch1","Batch1","Batch1","Batch1","Batch1",
                 "Batch2","Batch2","Batch2","Batch2")

# set sampleConditions and sampleTable for experimental conditions
sampleCondition <- c("Control, Experimental")

sampleTable <- data.frame(sampleName = sampleFiles,
                          fileName = sampleFiles,
                          condition = sampleCondition,
                          Batch = sampleBatch)

# view sampleTable
sampleTable 

ddsHTseq <- DESeqDataSetFromHTSeqCount( sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~condition) 

## view ddsHTseq - should give summary of class, data, etc.
ddsHTseq

colData(ddsHTseq)$condition <- factor(colData(ddsHTseq)$condition,
                                      levels = c('Control','Experimental'))

# gut of DESeq2 analysis
dds <- DESeq(ddsHTseq)

res <- results(dds)
# order results by padj value (most significant to least)
res <- res[order(res$padj),]
head(res)
# should see DataFrame of baseMean, log2Foldchange, stat, pval, padj 

# save data results and normalized reads to csv!
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized=T)), by='row.names',sort=F)
names(resdata)[1] <- 'gene'
head(resdata)
write.csv(resdata, file="DATE-DESeq2-results-with-normalized.csv")

# send normalized counts to tab delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds, normalized = T)), file = 'DATE_DESeq2_normalized_counts.txt', sep = '\t')

# produce DataFrame of results of statistical tests
# a good way to record the experimental test conditions
mcols(res, use.names = T)
write.csv(as.data.frame(mcols(res, use.names = T)), file = "DATE-DESeq2-test-conditions.csv")

# replacing outlier values with estimated values as predicted by the distribution, using
# the "trimmed mean" approach. recommended if you have several replicates per treatment
# DESeq2 will automatically do this if you have 7 or more replicates

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < 0.1,
             cleaned = results(ddsClean)$padj < 0.1)
addmargins(tab)
write.csv(as.data.frame(tab),file = 'DATE-DESeq2-replaceoutliers.csv')
resClean <- results(ddsClean)
resClean <- resClean[order(resClean$padj),]
head(resClean)
write.csv(as.data.frame(resClean),file = 'DATE-DESeq2-replaceoutliers-results.csv')

####################################################################################
# Exploratory data analysis of RNAseq data with DESeq2
#
# these next R scripts are for a variety of visualization, QC and other plots to
# get a sense of what the RNAseq data looks like based on DESEq2 analysis
#
# 1) MA plot
# 2) rlog and variance stabilizing transformations
# 3) variance stabilization plot
# 4) heatmap of clustering analysis
# 5) PCA plot
#
#
####################################################################################

# MA plot of RNAseq data for entire dataset
# http://en.wikipedia.org/wiki/MA_plot
# genes with padj < 0.1 are colored Red
plotMA(dds, ylim=c(-8,8),main = "RNAseq experiment")
dev.copy(png, "DATE-DESeq2_MAplot_initial_analysis.png")
dev.off()

# transform raw counts into normalized values
# DESeq2 has two options:  1) rlog transformed and 2) variance stabilization
# variance stabilization is very good for heatmaps, etc.
rld <- rlogTransformation(dds, blind=T)
vsd <- varianceStabilizingTransformation(dds, blind=T)

# save normalized values
write.table(as.data.frame(assay(rld)), file = 'DATE-DESeq2-rlog-transformed-counts.txt', sep = '\t')
write.table(as.data.frame(assay(vsd)), file = 'DATE-DESeq2-vst-transformed-counts.txt', sep = '\t')

# plot to show effect of transformation
# axis is square root of variance over the mean for all samples
par(mai = ifelse(1:4 <= 2, par('mai'),0))
px <- counts(dds)[,1] / sizeFactors(dds)[1]
ord <- order(px)
ord <- ord[px[ord] < 150]
ord <- ord[seq(1,length(ord),length=50)]
last <- ord[length(ord)]
vstcol <- c('blue','black')
matplot(px[ord], cbind(assay(vsd)[,1], log2(px))[ord, ],type='l', lty = 1, col=vstcol, xlab = 'n', ylab = 'f(n)')
legend('bottomright',legend=c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
dev.copy(png,"DATE-DESeq2_variance_stabilizing.png")
dev.off()

# clustering analysis
library("gplots")
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, type, sep=" : "))
# OR
# if you want the conditions used
# rownames(mat) <- colnames(mat) <- with(colData(dds),condition)
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(13, 13))
dev.copy(png, "DATE-DESeq2-clustering.png")
dev.off()

# Principal components plot
# will show additional clustering of samples
# showing basic PCA function in R from DESeq2 package
# this lacks sample IDs and only broad sense of sample clustering
# its not nice - but it does the job
print(plotPCA(rld, intgroup = c("condition")))
dev.copy(png, "DATE-DESeq2_PCA_initial_analysis.png")
dev.off()

# or ggplot PCA plot
library("grDevices")
library('ggplot2')
library("genefilter")

rv <- rowVars(assay(rld))
select <- order(rv, decreasing=T)[seq_len(min(500,length(rv)))]
pc <- prcomp(t(assay(vsdMF)[select,]))

# set condition
condition <- c("condition1", 'condition2')

scores <- data.frame(sampleFiles, pca$x, condition)

(pcaplot <- ggplot(scores, aes(x = PC1, y = PC2, col = (factor(condition))))
+ geom_point(size = 5)
+ ggtitle("Principal Components")
+ scale_colour_brewer(name = " ", palette = "Set1")
+ theme(
  plot.title = element_text(face = 'bold'),
  legend.position = c(0,0),
  legend.key = element_rect(fill = 'NA'),
  legend.text = element_text(size = 10, face = "bold"),
  axis.text.y = element_text(colour = "Black"),
  axis.text.x = element_text(colour = "Black"),
  axis.title.x = element_text(face = "bold"),
  axis.title.y = element_text(face = 'bold'),
  panel.grid.major.x = element_blank(),
  panel.grid.major.y = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.background = element_rect(color = 'black',fill = NA)
))

ggsave(pcaplot,file="DATE-PCA_ggplot2.pdf")

# scatter plot of rlog transformations between Sample conditions
# nice way to compare control and experimental samples
head(assay(rld))
plot(log2(1+counts(dds,normalized=T)[,1:2]),col='black',pch=20,cex=0.3, main='Log2 transformed')
plot(assay(rld)[,1:2],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,3:4],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,5:6],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,7:8],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,9:10],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,11:12],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,13:14],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,15:16],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")

# heatmap of data
library("RColorBrewer")
library("gplots")
# 1000 top expressed genes with heatmap.2
select <- order(rowMeans(counts(dds,normalized=T)),decreasing=T)[1:1000]
my_palette <- colorRampPalette(c("blue",'white','red'))(n=1000)
heatmap.2(assay(vsd)[select,], col=my_palette,
          scale="row", key=T, keysize=1, symkey=T,
          density.info="none", trace="none",
          cexCol=0.6, labRow=F,
          main="TITLE")
dev.copy(png, "DATE-DESeq2-HEATMAP.png")
dev.off()

# top 2000 genes based on row variance with heatmap3
library(heatmap3)
colsidecolors = c("darkgreen","darkgreen",
 "mediumpurple2","mediumpurple2",
 "mediumpurple2","mediumpurple2",
 "darkgreen","darkgreen",
 "mediumpurple2","mediumpurple2",
 "darkgreen","darkgreen",
 "mediumpurple2","mediumpurple2",
 "mediumpurple2","mediumpurple2")
rv <- rowVars(assay(vsd))
select <- order(rv, decreasing=T)[seq_len(min(2000,length(rv)))]
my_palette <- colorRampPalette(c("blue", "white", "red"))(1024)
heatmap3(assay(vsd)[select,],col=my_palette,
 labRow = F,cexCol = 0.8,margins=c(6,6))
dev.copy(pdf, "DATE-DESeq2-heatmap3.pdf")
dev.off()

sessionInfo()

###############################################################
#
# Optional analyses
#
###############################################################

# multifactor designs
# can analyze with more than one factor influencing the counts (e.g., different library preps, PE vs. SE)
# from manual section 1.5 for more information and rationale

# make copy of DESeq2 results
ddsMF <- dds

# change design formulate controlling for Batch
design(ddsMF) <- formula(~ Batch + condition)

# rerun DESeq analysis
ddsMF <- DESeq(ddsMF)
resMF <- results(ddsMF)

# order by padj values
resMF <- resMF[order(resMF$padj),]

head(resMF)
# should see DataFrame of baseMean, log2Foldchange, stat, pval, padj
# save data 'resMF' to csv!
write.csv(as.data.frame(resMF),file='DATE-DESeq2_batchcontroll_initial_analysis.csv')

Future post idea: pairing HTSeq-DESeq2 for DE gene expression with DEXSeq for splice variants, as this pipeline is agnostic to differential exon usage.  *It may miss true DE genes if total exon counts are the same but different exons are used in different conditions (per Lior Pachter).

Future post idea: comparing STAR-HTSeq-DESeq2 to Bowtie-RSEM for gene expression analysis on the same dataset.

Things to learn to do bioinformatics.

I know what you are thinking.  "Not another bioinformatics blog…".  There are a lot of blogs from a wide variety of scientists today, ranging from doctoral students working towards their degrees in bioinformatics to leaders of the field, and all are fantastic resources for information and discussion.  I believe I fall somewhere outside of everyone else.  I didn't major in computer science in undergrad or use anything besides Excel for my data analysis in graduate school.  My training is in molecular biology, and my Ph.D. dissertation investigated the molecular biology of fibrosis in mouse models.  For my current postdoctoral work, I am interested in understanding the genetics of cancer progression (using mouse models) and have undertaken learning "to do bioinformatics" for my lab.  The most exposure I had to Unix tools was an elective course I took in undergrad, from which all I remembered were the basics of working and moving around in a Unix environment (cd, mkdir, ls, etc.).  Over the past year, I have had to learn, sometimes on the fly, what I needed to know to accomplish the task at hand.  This has mainly been RNA sequencing analysis, from QC and alignment to calculation of differential expression.  We are also starting to venture into whole exome sequencing for somatic variant detection in cancer.  I also think that coming from a molecular biology background may make this blog interesting, in that I am not necessarily interested in building a better BWA algorithm.  Using bioinformatic tools to interrogate biological datasets and form better hypotheses is the direction in which biology is going.  In a few years (some would say now), I think every biologist will need at least a rudimentary understanding of bioinformatics to do their job well.  Sequencing has just become too (relatively) cheap not to understand how the data is generated and analyzed.

A big motivation to start a blog was how scattered the information is regarding learning "to do bioinformatics".  When my mentor and I discussed my current project, we both imagined the time required to perform the data analysis properly would be rather small (10% of my time).  Perhaps this shows the naivety of two traditional molecular biologists deciding to venture into large-scale, next-gen sequencing projects and data analysis.

There are other great resources/blogs for those interested in learning computational/bioinformatic skills.  A few I read are Stephen Turner's blog, Tommy Tang's blog, and David Ting's blog, all of which are very good sources for technical learning/examples.  Also, Nick Loman's blog had a post dealing with learning bioinformatics.  These and many others are great sources of information ranging from basic how-to's to posts that are math/computer-science heavy.  How many fields have leading scientists writing blog posts discussing data analysis?  One great point in Nick Loman's post was that to really learn how to analyze your NGS data, you must learn by doing ("sink or swim" is another analogy).  Personally, that is how I have learned so far, and as I mentioned above, it tends to be learning on the fly.  An R-bloggers post that I found early on when I was starting out (here) offers a lot of guidance in terms of basic skills and areas to learn for a budding bioinformatics scientist.  I think this list is fantastic, but I found that its order doesn't necessarily match the order in which I learned things.  I want to re-organize it into the order that I think makes the most sense for prioritizing skills for someone coming from my background (a bioinformatics blog for biologists).

1) Learn to use the command line.

The vast majority of your analysis and data manipulation cannot be done through a graphical user interface (GUI), or in other words, what you use to interact with your MacBook.  Using the command line in a Terminal window to interact with your computer through a shell offers a much more powerful way to use a computer than pointing and clicking at icons.  The vast majority of the bioinformatic tools in use today (BWA, GATK, Cufflinks, DESeq, Samtools) are called and used via the command line.

MIT has a great primer on command line usage.  Same with Brown.  The Victorian bioinformatics consortium has a nice presentation on why the ability to work in the command line is important for bioinformatic analysis.

Linux/Unix operating systems also have "built-in" programs available at the command line that make data manipulation (data "munging") very fast and easy…once you learn the language.  sed, awk, sort, cut and grep are programs that allow you to manipulate data and will make your life easier.  I wish I had tried to learn awk earlier.  Stephen Turner's github has a nice recipe list of common commands and their usages.  I really recommend learning them as early as possible, as it will save you time later on.  The combination of awk/sed may, for example, help you get the gene annotation GTF file from UCSC into the correct format for an RSEM RNAseq analysis.

2) R.

The programming language R is a statistical computing and graphics environment.  R has its supporters and also those who suggest other languages for statistical analysis (Python and the newer Julia), but I picked up R first and think it's fantastic.

  • It is free, open-source and supported by a wide community, so there are numerous packages available for practically any statistical analysis you want to run (hierarchical clustering, K-means, t-tests, ANOVA, linear regression, generalized linear models, principal component analysis… I could go on and on).
  • For biologists, Bioconductor is an open-source suite of packages specifically for high-throughput, next-gen sequencing data analysis.  A significant portion of my data analysis is done using these packages, and quite a few job postings I have seen for bioinformatics positions explicitly ask for familiarity with Bioconductor.  Other languages, mainly python, don't seem to have as many biology-specific packages/modules compared to R.  I have heard of some labs operating entirely in R.

Suggested readings, sources:

  • I originally bought this book: Beginning R: The Statistical Programming Language (Amazon).  It covers the basics of R programming, built-in statistical packages (Anova, linear regression, etc) and the basic plotting functions.
  • Code school has a nice online tutorial for basic R programming.
  • Frank Mcown's site has a cookbook for the basic R graphics plots.  I am not a fan of base R graphics now (explained below), but when I was learning, seeing the syntax in the recipes helped me pick up the language.
  • O’reilly books/manuals have been very useful in my experience  (Amazon).
  • Coursera’s R programming MOOC.

2b) RStudio.

Don't even bother using the terminal for R.  Seriously.  Go here and download RStudio.  RStudio is an integrated development environment (IDE) that is a great instrument for designing experiments, running data analyses, and generating publication-quality plots/graphs.  Below is a crude screenshot of RStudio as an example of some of the things you can do.  R is ideal for data analysis because you can save a snapshot of everything you have done.  That way, I can come back to an RNAseq analysis in two months (without having thought about it) and continue without re-running previous steps (or wondering what I was doing before).  Also, the best part of RStudio is that it is free for desktop (they make their money on the server/professional versions for people with that kind of money).

(Screenshot: the RStudio IDE.)

2c) Learn the Hadley-verse, or why I wish I had found ggplot2 earlier.

Hadley Wickham is a statistician (currently at RStudio) who has developed some amazing R packages for data analytics.  His paper discussing the tidy data format, which makes analysis and plotting as easy as possible, is definitely worth a read.  His ggplot2 package is an excellent resource for data visualization (and widely used).  I don't think I could ever go back to using Prism, Excel, or even the default plotting functions in R after learning ggplot2.  I actually didn't prioritize learning the underlying syntax of ggplot2 for a while (it is different from standard R plot syntax), but reading Hadley's book really explains the structure of the syntax and the reasoning behind it.  His other creations, like plyr and reshape, are on my to-learn list as well.  Once I learned how to use ggplot2, I quickly recognized that most of the publications I had been reading use ggplot2 for the majority of their data presentation.
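
To give a flavor of the syntax, here is a toy example using R's built-in mtcars dataset (nothing NGS-specific, just the grammar of mapping variables to aesthetics and adding layers):

library("ggplot2")
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")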

3) Scripting language (python/perl).

Being able to write short scripts for data analysis/munging is a great skill to have.  Oftentimes I will have numerous text files from multiple samples, each containing two columns of GENEID and RNAseq counts, that need combining into a single file of GENEID, SAMPLE1, SAMPLE2…SAMPLE-Nth.  Sometimes files like these are too large to load into Excel and combine.  That is why being able to write a python script to programmatically parse each file and combine columns/lines into the format you need is important.  That is one example, but there are many uses for being able to write scripts for data analysis.  To be honest, I feel it is something I still struggle with, and progress probably depends on how much time you are willing to put into it.  Some resources for python:

  • Code Academy has a nice online, free tutorial for Python.
  • Learn Python the Hard Way isn't actually hard at all and is highly recommended as a great way for beginners to learn python.  My only issue with these types of online tutorials is that they are geared more towards building simple games with python.
  • Python for Biologists is great because it uses actual bioinformatic analyses as examples: how to parse fasta files, etc.
  • O'Reilly's Python for Data Analysis is a book I have; it goes into detail about many of the statistical and plotting tools available in python modules like pandas and SciPy.  I'd recommend it if you really don't like R.

4) High Performance Computing/ Parallel environments on your university’s cluster or Amazon EC2.

It is not uncommon for RNAseq projects to produce hundreds of gigabytes of data.  The standard MacBook does not have the memory or hard drive to adequately process data of this size (or even bigger).  That is where your University's high-performance computing cluster (or an Amazon web service like EC2) comes in, as it has the resources to process vast quantities of data from multiple samples in a somewhat reasonable amount of time.  The University where I work has a cluster of over 150 server nodes, with 16-core processors and up to 256 GB of RAM per node.  Throw in about 2 petabytes of total storage and we are talking serious firepower for analysis of NGS data at a major university.  The ability to run bioinformatic programs in parallel significantly improves your ability to analyze very large datasets.

If your institution doesn't have access to an HPC, you could instead use Amazon Web Services EC2.  It is a commercially available HPC that numerous companies use (Dropbox and Spotify, to name two).  EC2 is nice because you only pay for the resources you use, and they have options for bioinformatics.

Learning to work in a Linux environment on an HPC is something I had to learn early, and I couldn't imagine not having access to one now.  The datasets being generated are just too large to run on a laptop or even a smaller lab server (the RNAseq alignment program we use needs at least 64 GB of RAM – factor in 20-50+ samples and you can understand why HPCs are so critical).  I have found that several clusters around the US have tutorials online for users.  Depending on the operating system (usually different distributions of Linux), these tutorials may not be exactly what you need, but they will explain the basics of batch job submission, parallelism, bash scripting, etc.  See here, here and here for HPC documentation from various Universities that I googled at random.

5) version control (git/svn).

Version control is a management system to track changes to a given program or script.  It tracks these changes over time and allows you to recall/reset to a specific earlier version if the script/program has since changed.  A rough analogy is to consider version control a backup of a particular script: if you accidentally (or intentionally) break a program, you still have the previous version stored away to revert back to.  Git is probably the most widely used version control system in both software engineering and bioinformatics (Subversion is another).

6) Relational Databases (various flavors of SQL).

To be honest, this is an area that I haven't had the time/need to learn at all.  However, I think learning the basics of SQL is important.  Ensembl offers MySQL databases for each of their reference genomes.  TCGA also has a SQL-like syntax for querying its databases.  Finally, for our exome sequencing project, loading all the called variants across samples into a database allows much easier querying of SNPs located in specific genes/chromosomes/regions of interest.
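
As a hypothetical sketch of that last use case (assuming the called variants were loaded into an SQLite table named variants with sample_id, gene, chrom, pos, ref and alt columns; RSQLite is one of several R interfaces to relational databases):

library("DBI")
library("RSQLite")
con <- dbConnect(SQLite(), "variants.db")
# every called variant in a gene of interest, across all samples
dbGetQuery(con, "SELECT sample_id, chrom, pos, ref, alt
                 FROM variants WHERE gene = 'Kras'")
dbDisconnect(con)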

So those are the areas/skills, in the order that I think makes the most sense, for someone starting to learn "to do bioinformatics".  The order may not apply to everyone, but command line usage and R are fairly universal in bioinformatics, whereas SQL may be more specialized.  Hopefully you found this post helpful, and I promise to try not to let this blog get too boring.

Something I forgot to include that is in the R-bloggers post is having a solid understanding of statistics (or appreciating the fact that you don't know a lot of statistics).  This blog post from Drew Conway is more generalized for data analytics but applies very well to bioinformatics.  Having the ability to programmatically analyze data, and having an area of expertise, without an understanding of the underlying statistical principles can/will be a recipe for disaster when analyzing bioinformatic data.  I spend a lot of my time making sure I understand the underlying models/theory behind the programs I use, as well as reading benchmarking papers comparing similar tools (like this one, for example), to try to determine the best way to analyze our data.  You probably won't build a better bioinformatic mousetrap, but don't use the wrong one for the job.