The following 132 words could not be found in the dictionary of 615 words (including 615 LocalSpellingWords) and are highlighted below:

alignment   alignments   bed   beyond   big   bits   by   can   certain   chomp   chop   chr   chr2   chr3   chr4   chrmap   chromosome   chrs   chrsize   close   cmap   color   companion   converts   correct   delimited   descriptions   descriptor   didn   die   dna   Drosophila   each   Eland   eland   else   ends   eq   equal   exists   export   Export   fa   failures   feedback   file   filtering   for   format   fp   from   get   give   greater   Het   if   in   interested   join   length   less   line   lines   map   mapping   match   matches   Matches   melanogaster   name   names   newval   next   nice   No   no   of   off   often   oldval   only   open   or   organism   our   outfile   output   P5   parse   pass   perl   processed   program   re   read   reads   reflect   repeat   scalar   score   script   seqlen   seqname   sequences   short   size   skip   Solexa   some   split   start   strand   tab   takes   text   than   that   the   This   threshold   to   tr   ugly   unless   usage   used   values   we   which   while   you   your  

Clear message

Solexa Eland Export format to BED file

Here is a perl script which converts from Eland export format to BED format.

#!/usr/bin/perl
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome or contig names 
# (eland fa match files) in the format:
# UCSC_chr_name chr_length eland_name
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k 1,1 -k 2,2n infile.bed
# (sort in place, first column, then by second column numeric)

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# create output filename
$outfile = $ARGV[1];
$outfile =~ s/\.txt$/\.bed/;
open(FOUT, ">$outfile") || die("can't open output file: $outfile");

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome name mapping file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($newval, $size, $oldval) = split(/\t/, $line);
    $chrmap{$oldval} = $newval;
    $chrsize{$oldval} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[10];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs
    next unless(exists($chrmap{$seqname}));

    $seqlen = length($bits[8]);
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
        print STDERR "start less than or equal to 0:   ", $start, "\n";
        print STDERR join("\t", @bits), "\n";
        $start = 1;
    }

    if($end > $chrsize{$seqname}){
        print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
        print STDERR join("\t", @bits), "\n";
        $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
        $strand = "+";
        $color = "0,0,255";
    }
    else{
        $strand = "-";
        $color = "255,0,0";
    }

    $score = 0;
    print FOUT join("\t", $chrmap{$seqname}, $start, $end, $read_code, $score, $strand, $start, $end, $color), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}

close(FOUT);
print STDERR "output file: $outfile\n";

it takes a companion file of chromosome descriptions. This file is used by the program to map the chromosome names used for the alignment - which are often big ugly names to nice short chromosome names. It also has the length of each chromosome. It's a tab delimited text file that you create for your organism to reflect the alignment values.

chr2L   23011544        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2L.fa
chr2LHet        368872  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2LHet.fa
chr2R   21146708        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2R.fa
chr2RHet        3288761 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2RHet.fa
chr3L   24543557        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3L.fa
chr3LHet        2555491 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3LHet.fa
chr3R   27905053        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3R.fa
chr3RHet        2517507 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3RHet.fa
chr4    1351857 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.4.fa
chrX    22422827        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.X.fa
chrXHet 204112  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.XHet.fa
chrYHet 347038  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.YHet.fa

Solexa/ElandExport2BED (last edited 2009-10-21 20:20:41 by ChrisSeidel)