Solexa Eland Export format to BED file

Here is a perl script which converts from Eland export format to BED format.

#!/usr/bin/perl
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome or contig names 
# (eland fa match files) in the format:
# UCSC_chr_name chr_length eland_name
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k 1,1 -k 2,2n infile.bed
# (sort in place, first column, then by second column numeric)

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# create output filename
$outfile = $ARGV[1];
$outfile =~ s/\.txt$/\.bed/;
open(FOUT, ">$outfile") || die("can't open output file: $outfile");

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome name mapping file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($newval, $size, $oldval) = split(/\t/, $line);
    $chrmap{$oldval} = $newval;
    $chrsize{$oldval} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[10];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs
    next unless(exists($chrmap{$seqname}));

    $seqlen = length($bits[8]);
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
        print STDERR "start less than or equal to 0:   ", $start, "\n";
        print STDERR join("\t", @bits), "\n";
        $start = 1;
    }

    if($end > $chrsize{$seqname}){
        print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
        print STDERR join("\t", @bits), "\n";
        $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
        $strand = "+";
        $color = "0,0,255";
    }
    else{
        $strand = "-";
        $color = "255,0,0";
    }

    $score = 0;
    print FOUT join("\t", $chrmap{$seqname}, $start, $end, $read_code, $score, $strand, $start, $end, $color), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}

close(FOUT);
print STDERR "output file: $outfile\n";

it takes a companion file of chromosome descriptions. This file is used by the program to map the chromosome names used for the alignment - which are often big ugly names to nice short chromosome names. It also has the length of each chromosome. It's a tab delimited text file that you create for your organism to reflect the alignment values.

chr2L   23011544        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2L.fa
chr2LHet        368872  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2LHet.fa
chr2R   21146708        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2R.fa
chr2RHet        3288761 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2RHet.fa
chr3L   24543557        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3L.fa
chr3LHet        2555491 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3LHet.fa
chr3R   27905053        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3R.fa
chr3RHet        2517507 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3RHet.fa
chr4    1351857 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.4.fa
chrX    22422827        Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.X.fa
chrXHet 204112  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.XHet.fa
chrYHet 347038  Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.YHet.fa

Solexa/ElandExport2BED (last edited 2009-10-21 20:20:41 by ChrisSeidel)