The following 192 words could not be found in the dictionary of 615 words (including 615 LocalSpellingWords) and are highlighted below:

above   alignment   an   analysis   analyzing   and   Args   args   assign   assigned   awkward   Bam   bam   bam2   bam2gd   bamchrs   bamfile   based   basically   bd   be   been   below   better   bit   bowtie   builds   bunch   but   call   called   can   case   cat   certain   Ch   chipseq   chr   chriv   chrnames   chromosome   chromosomes   command   convert   Convert   creates   each   Eland   environment   every   Export   extracts   figure   file   filename   files   flag   flags   follows   For   for   form   frame   from   function   further   Genome   genome   get   However   in   In   ina   index   indicates   installed   integers   into   ip   isn   iv   lane1   lane2   lane3   lane4   lately   length   levels   library   like   list   little   match   me   might   minus   mito   must   mutant   my   mygd   mysterious   name   names   new   Note   Object   object   objects   odd   of   officially   on   order   out   package   paste   plus   pos   position   positions   read   Reads   reads   representing   required   rm   rname   Rsamtools   save   sc   scan   Scchr01   Scchr02   Scchr03   Scchr04   Scchr05   Scchr06   Scchr07   Scchr08   Scchr09   Scchr10   Scchr11   Scchr12   Scchr13   Scchr14   Scchr15   Scchr16   Scmito   script   seem   selects   sep   seqnames   sh   shell   Short   sname   so   start   still   strand   strands   Striv   stuck   sublist   syntax   that   The   the   There   think   this   those   to   unmapped   use   using   usually   vanilla   ve   vectors   way   wce   weed   what   whereby   which   with   With   wt   yet  

Clear message

Convert BAM file to a GenomeDataObject

There must be a better way to do this, but I can't figure it out.

For ChIP SEQ analysis I usually start with Eland Export files that I read into R with ShortReads, and convert to GenomeData objects for use with the chipseq package. However, lately I've been stuck with BAM files, which are still mysterious and awkward to me.

In order to get the files into R in a form I can start analyzing, I use Rsamtools, which isn't officially out yet, but it's installed in my environment and can read BAM files. The code below basically reads in the BAM file, extracts out the chromosome, strand, and position, for every read, and builds a GenomeData object.

A genome data object is basically a list of chromosomes, whereby each chromosome has s sublist of "+" and "-" vectors of integers representing alignment positions for each strand.

Note: the code below selects reads based on certain flags. In my case, the BAM file is from bowtie, and the reads have flags of 0,16,4. I think 0 is plus strand 16 is minus strand, and 4 indicates an unmapped read so I have to weed those out.

library(Rsamtools)
library(chipseq)

# create a dataframe for chr name substitutions
chrnames <- data.frame(sc=c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrVI", "chrVII", "chrVIII", "chrIX", "chrX", "chrXI", "chrXII", "chrXIII", "chrXIV", "chrXV", "chrXVI", "mito"), bam=c("Scchr01", "Scchr02", "Scchr03", "Scchr04", "Scchr05", "Scchr06", "Scchr07", "Scchr08", "Scchr09", "Scchr10", "Scchr11", "Scchr12", "Scchr13", "Scchr14", "Scchr15", "Scchr16", "Scmito"))

# since I run this as a script
# capture command line args
Args <- commandArgs()

bamfile <- Args[4]
sname <- Args[5]

# read in bam data
bd <- scanBam(bamfile)

# get chr names
bamchrs <- levels(bd[[1]][["rname"]])

# only grab entries with certain flags
iv <- bd[[1]][["flag"]] == 0 | bd[[1]][["flag"]] == 16

seqnames <- bd[[1]][["rname"]][iv]
positions <- bd[[1]][["pos"]][iv]
strand <- bd[[1]][["strand"]][iv]

# no longer need the BAM data
rm(bd)

# create empty list
mygd <- list()
for( i in 1:length(bamchrs) ){
  # create index vectors for positions and strands
  chriv <- seqnames == bamchrs[i]
  plusStriv <- strand == "+"
  minusStriv <- strand == "-"
  # create genome data-like list
  mygd[[i]] <- list("+"=positions[chriv & plusStriv], "-"=positions[chriv & minusStriv])
}

# replace their names with UCSC names
names(mygd) <- chrnames[match(bamchrs, chrnames[,"bam"]),"sc"]

# convert to a GenomeData object
mygd <- GenomeData(listData = mygd)

filename <- paste(sname, ".RData", sep="")
assign(sname, mygd)

save(list=sname, file=filename)

The syntax of the save call might seem a little bit odd (list=sname), but that's what's required to save the object with the new name I assigned it using the "assign" function.

With the script above, I can convert a bunch of BAM files to GD objects using a shell script as follows:

#!/usr/local/bin/bash

cat bam2GD.r | R --vanilla --args lane1.bam mutant_wce
cat bam2GD.r | R --vanilla --args lane2.bam mutant_ip
cat bam2GD.r | R --vanilla --args lane3.bam wt_wce
cat bam2GD.r | R --vanilla --args lane4.bam wt_ip

I save the above code ina file called: bam2gd.sh, and it creates a bunch of GenomeData objects for further analysis.

R/BAM2GenomeDataObject (last edited 2010-07-22 22:33:44 by ChrisSeidel)