In making Venn diagrams to look at the overlap of sets, I often wonder how significant a given amount of overlap is. What is the likelihood of seeing that much overlap between two sets simply by chance?

One way to assess this is to use the hypergeometric distribution. The R language has a nice function for calculating the p-value, but the explanation of how to use it involves an urn of black and white balls.

phyper(q, m, n, k, lower.tail = FALSE)

q = the number of white balls drawn from the urn (without replacement)

m = the number of white balls in the urn

n = the number of black balls in the urn

k = the number of balls drawn from the urn (sample size)
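One detail of the tail is easy to miss: with `lower.tail = FALSE`, `phyper` returns P(X > q), the probability of drawing strictly more than q white balls, not P(X >= q). The sketch below checks this on a small made-up urn (the counts are illustrative, not from any experiment) by summing the point probabilities from `dhyper` over the tail.

```r
# Toy urn; these counts are made up for illustration only.
m <- 10   # white balls in the urn
n <- 20   # black balls in the urn
k <- 8    # balls drawn without replacement
q <- 4    # white balls among those drawn

# Upper tail as returned by phyper: P(X > q), strictly greater than q.
p_more_than_q <- phyper(q, m, n, k, lower.tail = FALSE)

# Same quantity from the point probabilities P(X = x), x = q + 1, ..., k.
p_more_check <- sum(dhyper((q + 1):k, m, n, k))

# "At least q" white balls needs q - 1 as the first argument.
p_at_least_q <- phyper(q - 1, m, n, k, lower.tail = FALSE)
p_at_least_check <- sum(dhyper(q:k, m, n, k))

stopifnot(isTRUE(all.equal(p_more_than_q, p_more_check)))
stopifnot(isTRUE(all.equal(p_at_least_q, p_at_least_check)))
```

So if the question is "what is the chance of an overlap at least this large?", pass `q - 1` rather than `q`.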

## Example comparing gene sets

Let's say you want to compare sets of genes identified in two independent experiments. For instance, in experiment one, you identify 1000 genes up-regulated under a given condition. In experiment two, you identify 2872 genes with promoters bound by a transcription factor. Now you want to compare the two experiments to see if the up-regulated genes are also those bound by the transcription factor. A Venn diagram of the two experiments shows that the two sets (1000 up-regulated genes and 2872 TF-bound genes) have an intersection of 448. Is this significant? The total number of genes in the experiment is 14,800.

q = 448

m = 1000

n = 14800 - 1000

k = 2872
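As a rough sanity check before looking at a p-value, it can help to ask how much overlap chance alone would produce. The mean of the hypergeometric distribution is k * m / (m + n); the sketch below evaluates it for the numbers in this example (the variable name `expected_overlap` is just for illustration).

```r
m <- 1000           # up-regulated genes ("white balls")
n <- 14800 - 1000   # all other genes ("black balls")
k <- 2872           # TF-bound genes (the "draw")

# Mean of the hypergeometric distribution: k * m / (m + n).
expected_overlap <- k * m / (m + n)
round(expected_overlap, 1)   # 194.1
```

The observed intersection of 448 is roughly 2.3 times that expectation.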

phyper(448, 1000, 13800, 2872, lower.tail = FALSE)

[1] 1.906314e-81
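The mapping from set sizes to urn parameters can be wrapped in a small function. This is a minimal sketch, not a standard R function: the name `overlap_pvalue` and its argument names are hypothetical, and it assumes the question is "an overlap of at least this size", which is why it passes `overlap - 1` to `phyper`. As a cross-check, a one-sided Fisher's exact test on the 2x2 table of counts rests on the same hypergeometric distribution.

```r
# Hypothetical helper (not part of base R): p-value for an overlap of at
# least `overlap` items between two sets drawn from `total` items.
overlap_pvalue <- function(overlap, size_a, size_b, total) {
  stopifnot(overlap <= min(size_a, size_b), max(size_a, size_b) <= total)
  # overlap - 1 makes the upper tail P(X >= overlap) rather than P(X > overlap).
  phyper(overlap - 1, size_a, total - size_a, size_b, lower.tail = FALSE)
}

p <- overlap_pvalue(448, 1000, 2872, 14800)

# Cross-check with Fisher's exact test. The matrix is filled column by column:
# column 1 = up-regulated genes (TF-bound, not bound),
# column 2 = all other genes    (TF-bound, not bound).
tab <- matrix(c(448, 1000 - 448,
                2872 - 448, 14800 - 1000 - 2872 + 448),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

p / p_fisher   # ratio of the two p-values; close to 1 if they agree
```

Because it includes the case of exactly 448 shared genes, `p` is slightly larger than the strict upper tail P(X > 448).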

## Making Venn Diagrams

* I wrote a utility for making Venn diagrams: venn diagrams

* But someone else wrote a better one recently: venny