#acl ChrisSeidel:read,write,delete,revert All:read
#format wiki
#language en

= Convert BAM file to a GenomeDataObject =

There must be a better way to do this, but I can't figure it out.

For ChIP SEQ analysis I usually start with Eland Export files that I read into R with ShortReads, and convert to GenomeData objects for use with the chipseq package. However, lately I've been stuck with BAM files, which are still mysterious and awkward to me.

In order to get the files into R in a form I can start analyzing, I use Rsamtools, which isn't officially out yet, but it's installed in my environment and can read BAM files. The code below basically reads in the BAM file, extracts out the chromosome, strand, and position, for every read, and builds a GenomeData object.

A genome data object is basically a list of chromosomes, whereby each chromosome has s sublist of "+" and "-" vectors of integers representing alignment positions for each strand.

Note: the code below selects reads based on certain flags. In my case, the BAM file is from bowtie, and the reads have flags of 0,16,4. I think 0 is plus strand 16 is minus strand, and 4 indicates an unmapped read so I have to weed those out.

{{{
library(Rsamtools)
library(chipseq)

# create a dataframe for chr name substitutions
chrnames <- data.frame(sc=c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrVI", "chrVII", "chrVIII", "chrIX", "chrX", "chrXI", "chrXII", "chrXIII", "chrXIV", "chrXV", "chrXVI", "mito"), bam=c("Scchr01", "Scchr02", "Scchr03", "Scchr04", "Scchr05", "Scchr06", "Scchr07", "Scchr08", "Scchr09", "Scchr10", "Scchr11", "Scchr12", "Scchr13", "Scchr14", "Scchr15", "Scchr16", "Scmito"))

# since I run this as a script
# capture command line args
Args <- commandArgs()

bamfile <- Args[4]
sname <- Args[5]

# read in bam data
bd <- scanBam(bamfile)

# get chr names
bamchrs <- levels(bd[[1]][["rname"]])

# only grab entries with certain flags
iv <- bd[[1]][["flag"]] == 0 | bd[[1]][["flag"]] == 16

seqnames <- bd[[1]][["rname"]][iv]
positions <- bd[[1]][["pos"]][iv]
strand <- bd[[1]][["strand"]][iv]

# no longer need the BAM data
rm(bd)

# create empty list
mygd <- list()
for( i in 1:length(bamchrs) ){
  # create index vectors for positions and strands
  chriv <- seqnames == bamchrs[i]
  plusStriv <- strand == "+"
  minusStriv <- strand == "-"
  # create genome data-like list
  mygd[[i]] <- list("+"=positions[chriv & plusStriv], "-"=positions[chriv & minusStriv])
}

# replace their names with UCSC names
names(mygd) <- chrnames[match(bamchrs, chrnames[,"bam"]),"sc"]

# convert to a GenomeData object
mygd <- GenomeData(listData = mygd)

filename <- paste(sname, ".RData", sep="")
assign(sname, mygd)

save(list=sname, file=filename)
}}}

The syntax of the save call might seem a little bit odd (list=sname), but that's what's required to save the object with the new name I assigned it using the "assign" function.

With the script above, I can convert a bunch of BAM files to GD objects using a shell script as follows:

{{{
#!/usr/local/bin/bash

cat bam2GD.r | R --vanilla --args lane1.bam mutant_wce
cat bam2GD.r | R --vanilla --args lane2.bam mutant_ip
cat bam2GD.r | R --vanilla --args lane3.bam wt_wce
cat bam2GD.r | R --vanilla --args lane4.bam wt_ip

}}}
I save the above code ina file called: bam2gd.sh, and it creates a bunch of GenomeData objects for further analysis.