The following 345 words could not be found in the dictionary of 615 words (including 615 LocalSpellingWords) and are highlighted below:

above   abs   abundance   actually   addition   adj   adjust   aka   algorithm   all   along   alpha   already   an   Analysis   analysis   analyzed   and   apply   approach   approxeb   archive   around   array   Assemble   assemble   assessing   associated   Assuming   at   At   attachment   au   authors   available   based   Basically   be   below   better   binomial   bioinf   Bioinformatics   biostat   block   build   but   by   C093504   called   calls   can   cat   Category   cbind   character   Chris   cmd   coefficients   col   colnames   color   column   columns   combined   comparison   console   contain   containing   contains   count   counts   cry1   csv   cws   db   delimited   described   develop   df   dge   differences   differential   dir   directly   disp   dispersion   distribution   don   download   Duke   duke   dulci   each   edu   element   elements   entrez   environment   evaluate   exact   examine   expression   F470700   F805324   family   fdr   file   files   filter   find   fit   flush   follows   for   form   found   frame   from   full   function   functions   gene   get   gex   glm   gov   gradient   group   groupings   groups   gz   had   half   hand   happening   hours   however   However   Identifying   if   in   In   Incidentally   included   individual   input   inside   install   installs   items   jpg   K028146   K043166   Kepler   kepler   know   known   labeled   less   lib   libraries   library   linear   link   linux   list   lists   little   log   Lu   make   matrix   me   meaningful   messy   mgex   million   model   Moderated   ms   msage   multiple   mut1   mut2   My   my   mydir   names   ncbi   need   negative   nih   nlm   noise   normalized   now   nrow   number   numeric   of   offset   on   Once   only   optimal   or   out   output   overdispersed   package   paper   per   phi   place   Plotted   poisson   position   proceed   ps   pt   pubmed   putting   quoted   rbind   read   reasonable   Red   reference   referenced   release   rename   required   residual   resources   results   retrieve   returns   Robinson   row   rownames   rows   run   S1   S2   Sage   sage   sample   second   Seidel   sep   separated   set   setwd   sign   simply   Since   sites   size   sizes   Smyth   snapshot   so   solexa   Solexa   some   source   sourced   spit   Spots   spots   statement   statistical   strike   sum   summary   Sums   support   tab   table   tag   tags   takes   tar   test   tests   that   The   the   their   them   There   there   They   things   this   This   Thomas   time   to   To   Tomfohr   took   total   tpm   transcripts   treatment   two   uid   unable   uncommented   unpack   up   use   Using   using   value   values   verbose   version   very   want   was   way   we   website   wehi   were   which   will   windows   with   won   working   workspace   wrapper   write   written   wt1   wt2   X511437   You   you   your   zip  

Clear message

Using R to evaluate DGE for differential expression

Moderated statistical tests for assessing differences in tag abundance. (2007). Robinson MD, Smyth GK. Bioinformatics. 23(21):2881-7. PMID: 17881408

The authors of this paper develop a statistical test using a negative binomial distribution to model noise in the data. They release R code (available at bioinf.wehi.edu.au/resources/) to support their treatment, but it is uncommented. The windows version installs from their zip file, but I was unable to get the linux version to build.

Here I apply the code and examine the results in a sample data set.

At the website above you can find a tar archive called: msage_0.7.tar.gz. I can't get the package to install in R, so I simply unpack it and use the .R files directly. In addition, the code calls a code block (kepler.R) which is not included in the distribution. However, it is available here: http://dulci.biostat.duke.edu/sage/sage_analysis_code.r

Incidentally, the block of code: kepler.R is not actually required for the Robinson-Smyth algorithm. It is only called for the comparison of their algorithm to Kepler's algorithm quoted in the paper.

Basically there are two files of R code in the msage package: fitNBP.R and msage.R that can be sourced. The file msage.R has a source statement inside: source("kepler.R"), however this code block is not included. To get the code to function, download sage_analysis.r, and rename it kepler.R (written by Thomas Kepler at Duke). It contains a function called: glm.poisson.disp() that is called by msage.R.

Since I can't get the package to install, I simply place the .R files in my working directory (along with kepler.R), and source them:

# set working directory and load library functions
setwd("E:/cry1/Solexa/Smyth")
source("fitNBP.R")
source("msage.R")

This doesn't strike me as the optimal way to proceed, as you end up with a messy workspace (71 items in your environment), but I don't know a better way.

Assemble some data

There is a function in msage which can assemble data from individual tab-delimited tag files of the form: tag, count. It's called: read.sage() and takes an array of file names to read. It returns a list with two elements: a data matrix in the first position, and an array of library sizes as the second element. The input files simply need two columns, lists of tags (or gene names) and the associated counts separated by a tab character. They do not have to be normalized to tpm.

Assuming you had a directory of tag results files you could read them in as follows:

# read mydir to get a list of file names with .txt extensions
files <- dir("mydir", "txt", full.n=T)
# read in files of my choosing
dge <- read.sage(files[c(3,4,7,8)])

dge will now be a list containing a data matrix with counts for all the tags from all the files that were read, and an array with the total counts found in each file (also known as the size of each library).

However I already have my data in a table.

                wt1     wt2     mut1    mut2
NM_008665       3347    2740    3185    2842
CF805324        3336    2897    3500    2838
NM_030248       3252    2754    3231    2604
BC093504        3241    3179    3482    2943
AK043166        3157    7257    55      8837
NM_133703       3153    3196    2839    3588
NM_198419       3131    2787    2631    3576
BX511437        2893    2017    2429    2059
NM_028110       2844    2611    2279    2237
BF470700        2830    2629    3343    2333
AK028146        2681    2464    2104    2501
...

Once we have some data we need to make a list to hand to the first function. The list contains the data matrix, the sample groupings, and the library sizes. You may want to filter your data to only contain rows of the matrix that have reasonable tag counts. A set of columns with only 1 count in half of them won't be very meaningful.

My data is in a table called gex.all (if it's a data frame you can reference it as.matrix as I do below). I need to find the size of each library (aka the sum of each column), and I want to filter it a little:

lib.size <- colSums(gex.all)
# create a filter for really low numbers of reads
k <- rowSums(gex.all) > 4

# create a list structure
cws <- list(data=as.matrix(gex.all[k,]), groups=c(1,1,2,2), lib.size=lib.size)

# run the first function from msage
# this takes a while
alpha <- alpha.approxeb(cws)

# this takes a while
ms <- msage(cws, alpha=alpha$alpha)

adj.p <- p.adjust(ms$exact, "fdr")
k <- (adj.p < .05)

# create table with some results
all <- cbind(cws$data,adj.p,sign(rowSums(cws$data[,1:2]) - rowSums(cws$data[,3:4])))

# save results to disk
write.table(all[k,], "all.csv", sep=",", col.names=NA)

The functions: alpha.approxeb() and msage() took a combined time of 36 hours to run on a data set with 18000 rows.

Plotted below is a snapshot of the results. Spots are labeled with a color gradient based on p-value. Red spots have p-values of 0.01 or less.

Kepler Sage Analysis

The data can also be analyzed by putting a wrapper around Kepler's algorithm, as described in.

Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. (2005). Lu J, Tomfohr JK, Kepler TB. BMC Bioinformatics. 6:165. PMID: 15987513

Using the code referenced above (sage_analysis_code.r), for a data matrix containing transcripts per million

# for data matrix mgex
results <- c()
for(i in 1:nrow(mgex)){
  tag.count <- as.numeric(mgex[i,])

  #fit the overdispersed log-linear model
  fit <- glm(tag.count~offset(log(lib.size))+group,family=poisson(link=log));
  fit.ps <- glm.poisson.disp(fit,verbose=FALSE);
  phi <- fit.ps$dispersion;
  t.value <- summary(fit.ps)$coefficients[2,'z value'];
  p.value <- 2*pt(-abs(t.value),fit.ps$df.residual);

  results <- rbind(results, c(tag.count, t.value, phi, p.value))

  # spit out some output so I know things are happening
  if( ! i %% 100 ){
     cat("row number ", i, "\n")
     flush.console()
  }
}

rownames(results) <- rownames(mgex)
colnames(results) <- c(colnames(mgex), "t", "phi", "p")

write.table(results, file="kepler_sage_results.txt", sep=",", col.names=NA)

CategoryR ChrisSeidel

R DGE NB (last edited 2008-04-18 22:51:07 by ChrisSeidel)