Using R to evaluate DGE for differential expression

Moderated statistical tests for assessing differences in tag abundance. (2007). Robinson MD, Smyth GK. Bioinformatics. 23(21):2881-7. PMID: 17881408

The authors of this paper develop a statistical test using a negative binomial distribution to model noise in the data. They release R code (available at to support their treatment, but it is uncommented. The windows version installs from their zip file, but I was unable to get the linux version to build.

Here I apply the code and examine the results in a sample data set.

At the website above you can find a tar archive called: msage_0.7.tar.gz. I can't get the package to install in R, so I simply unpack it and use the .R files directly. In addition, the code calls a code block (kepler.R) which is not included in the distribution. However, it is available here:

Incidentally, the block of code: kepler.R is not actually required for the Robinson-Smyth algorithm. It is only called for the comparison of their algorithm to Kepler's algorithm quoted in the paper.

Basically there are two files of R code in the msage package: fitNBP.R and msage.R that can be sourced. The file msage.R has a source statement inside: source("kepler.R"), however this code block is not included. To get the code to function, download sage_analysis.r, and rename it kepler.R (written by Thomas Kepler at Duke). It contains a function called: glm.poisson.disp() that is called by msage.R.

Since I can't get the package to install, I simply place the .R files in my working directory (along with kepler.R), and source them:

# set working directory and load library functions

This doesn't strike me as the optimal way to proceed, as you end up with a messy workspace (71 items in your environment), but I don't know a better way.

Assemble some data

There is a function in msage which can assemble data from individual tab-delimited tag files of the form: tag, count. It's called: read.sage() and takes an array of file names to read. It returns a list with two elements: a data matrix in the first position, and an array of library sizes as the second element. The input files simply need two columns, lists of tags (or gene names) and the associated counts separated by a tab character. They do not have to be normalized to tpm.

Assuming you had a directory of tag results files you could read them in as follows:

# read mydir to get a list of file names with .txt extensions
files <- dir("mydir", "txt", full.n=T)
# read in files of my choosing
dge <- read.sage(files[c(3,4,7,8)])

dge will now be a list containing a data matrix with counts for all the tags from all the files that were read, and an array with the total counts found in each file (also known as the size of each library).

However I already have my data in a table.

                wt1     wt2     mut1    mut2
NM_008665       3347    2740    3185    2842
CF805324        3336    2897    3500    2838
NM_030248       3252    2754    3231    2604
BC093504        3241    3179    3482    2943
AK043166        3157    7257    55      8837
NM_133703       3153    3196    2839    3588
NM_198419       3131    2787    2631    3576
BX511437        2893    2017    2429    2059
NM_028110       2844    2611    2279    2237
BF470700        2830    2629    3343    2333
AK028146        2681    2464    2104    2501

Once we have some data we need to make a list to hand to the first function. The list contains the data matrix, the sample groupings, and the library sizes. You may want to filter your data to only contain rows of the matrix that have reasonable tag counts. A set of columns with only 1 count in half of them won't be very meaningful.

My data is in a table called gex.all (if it's a data frame you can reference it as.matrix as I do below). I need to find the size of each library (aka the sum of each column), and I want to filter it a little:

lib.size <- colSums(gex.all)
# create a filter for really low numbers of reads
k <- rowSums(gex.all) > 4

# create a list structure
cws <- list(data=as.matrix(gex.all[k,]), groups=c(1,1,2,2), lib.size=lib.size)

# run the first function from msage
# this takes a while
alpha <- alpha.approxeb(cws)

# this takes a while
ms <- msage(cws, alpha=alpha$alpha)

adj.p <- p.adjust(ms$exact, "fdr")
k <- (adj.p < .05)

# create table with some results
all <- cbind(cws$data,adj.p,sign(rowSums(cws$data[,1:2]) - rowSums(cws$data[,3:4])))

# save results to disk
write.table(all[k,], "all.csv", sep=",", col.names=NA)

The functions: alpha.approxeb() and msage() took a combined time of 36 hours to run on a data set with 18000 rows.

Plotted below is a snapshot of the results. Spots are labeled with a color gradient based on p-value. Red spots have p-values of 0.01 or less.

Kepler Sage Analysis

The data can also be analyzed by putting a wrapper around Kepler's algorithm, as described in.

Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. (2005). Lu J, Tomfohr JK, Kepler TB. BMC Bioinformatics. 6:165. PMID: 15987513

Using the code referenced above (sage_analysis_code.r), for a data matrix containing transcripts per million

# for data matrix mgex
results <- c()
for(i in 1:nrow(mgex)){
  tag.count <- as.numeric(mgex[i,])

  #fit the overdispersed log-linear model
  fit <- glm(tag.count~offset(log(lib.size))+group,family=poisson(link=log)); <- glm.poisson.disp(fit,verbose=FALSE);
  phi <-$dispersion;
  t.value <- summary($coefficients[2,'z value'];
  p.value <- 2*pt(-abs(t.value),$df.residual);

  results <- rbind(results, c(tag.count, t.value, phi, p.value))

  # spit out some output so I know things are happening
  if( ! i %% 100 ){
     cat("row number ", i, "\n")

rownames(results) <- rownames(mgex)
colnames(results) <- c(colnames(mgex), "t", "phi", "p")

write.table(results, file="kepler_sage_results.txt", sep=",", col.names=NA)

CategoryR ChrisSeidel

R DGE NB (last edited 2008-04-18 22:51:07 by ChrisSeidel)