ENCODE resources at Ensembl
The ENCODE project aims to discover all functional elements in the human genome. Ensembl is involved in ENCODE in two ways:
- Dr Ewan Birney, who is the joint leader of Ensembl from the EBI side, also leads the analysis group for ENCODE.
- Ensembl aims to integrate all genome-wide functional datasets, with ENCODE datasets expected to form a large proportion of these, to provide integrated functional annotation on the human genome. An initial dataset to this end was released on June 13th as part of Ensembl 45 (see below).
Ensembl works in tight coordination with the UCSC group, which is the data collection centre (DCC) for ENCODE.
Results of the recent ENCODE analysis
On June 14th, 2007, the paper detailing the analysis of the ENCODE pilot project was published in Nature along with 36 different companion papers in Genome Research. Dr Birney led the analysis for the main paper. The key results of this paper are:
- Transcription is more complex than expected, with many non-coding transcripts intercalating with standard protein-coding genes. However, there was little evidence for protein-coding genes outside of established sets.
- There are many more Transcription Start Sites (TSS) than expected, around 10-fold more than the number of protein-coding genes.
- Regulatory information is distributed in a clustered manner across the genome, and its distribution near TSSs is symmetrical.
- There are many distal DNaseI Hypersensitive Sites (DHSs), some of which show binding by CTCF.
- Replication is correlated with histone structure in a more detailed manner than previously known.
- Evolutionary constraint can be detected at far higher resolution than was previously feasible; surprisingly, many experimentally identified features did not show constraint.
A number of key intermediate files from the analysis, along with other ENCODE resources, are available for FTP download.
Functional genomics in Ensembl
Building on this initial analysis, Ensembl aims to provide a richer annotation of the human genome. The key concept, introduced in Ensembl 45, is the "Regulatory Build". The Regulatory Build aims to provide a single "best guess" set of regulatory elements, with growing annotation of those elements from different experiments. This "Regulatory Build" augments the standard "Gene Build" (itself incorporating both protein-coding and non-protein-coding genes) to form the union of functional elements in the genome.
We will also augment our Gene Build to take account of new functional datasets such as CAGE and PET tags across the genome. These tags are already available in Ensembl for Human and Mouse, using data generated by the RIKEN group in Japan and the GIS group in Singapore. We will use these markers of transcription start sites and termini to provide more accurate transcript definition.
An initial Regulatory Build has been developed in Ensembl by Paul Flicek and colleagues. It integrates 8 genome-wide datasets, mainly in pre-publication "resource" status, including the DNaseI Hypersensitive site set from Greg Crawford's group at Duke University, a set of 6 histone modifications from Martin Hirst's group at the BGGC at Vancouver, and the CTCF dataset from Bing Ren's group at UCSD.
This set takes DNaseI Hypersensitive Sites, CTCF binding regions and H3K4me3 as three "focus regions", each defining a potential element. The union of these three foci defines 110,000 elements across the genome. We then took all the factors (including the additional 5 histone marks) to look for specific patterns diagnostic of certain features. A number of patterns that were highly enriched for gene starts, genic regions and distal regions (away from genes) were developed.
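To make the idea of "focus regions" and their union concrete, here is a minimal Python sketch. It is not the Ensembl Regulatory Build code: the function names, the half-open coordinate convention, the toy coordinates, and the placeholder names `mark_A` and `mark_B` are all assumptions made for illustration. It merges overlapping or abutting focus regions on a single chromosome into candidate elements, then records which additional marks overlap each element, giving a simple presence/absence pattern per element.

```python
from typing import Dict, List, Tuple

# Assumption: intervals are half-open [start, end) on a single chromosome.
Interval = Tuple[int, int]


def merge_foci(focus_sets: Dict[str, List[Interval]]) -> List[Interval]:
    """Union the focus regions of all sets into non-overlapping elements."""
    all_regions = sorted(r for regions in focus_sets.values() for r in regions)
    merged: List[Interval] = []
    for start, end in all_regions:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous element: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def annotate(
    elements: List[Interval], marks: Dict[str, List[Interval]]
) -> Dict[Interval, Tuple[str, ...]]:
    """For each element, list (sorted) the marks that overlap it."""
    return {
        el: tuple(
            sorted(
                name
                for name, regions in marks.items()
                if any(overlaps(el, r) for r in regions)
            )
        )
        for el in elements
    }


# Toy coordinates, invented for illustration only.
foci = {
    "DNaseI": [(100, 200), (500, 600)],
    "CTCF": [(150, 250)],
    "H3K4me3": [(580, 700), (900, 950)],
}
# Placeholder names standing in for the additional histone marks.
marks = {
    "mark_A": [(240, 300)],
    "mark_B": [(600, 650), (920, 930)],
}

elements = merge_foci(foci)
patterns = annotate(elements, marks)
print(elements)
print(patterns)
```

In this sketch, the tuple of overlapping marks plays the role of a "pattern"; deciding which patterns are enriched near gene starts, genic regions or distal regions would require comparing them against gene annotation, which is not shown here.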
The results of this analysis can be seen in the "Regulatory features" track on Ensembl displays in human (on by default) and are available to download from our FTP site.
We are extremely grateful to the Crawford and Hirst laboratories for use of their data in pre-publication status, in line with the open data access of ENCODE and the human genome project, and to Bing Ren's group at UCSD for the CTCF dataset (Kim et al. 2007. Cell 128:1231-45). In the future we hope to integrate more functional datasets and use genetic association studies, such as those published in Stranger et al., to provide the link between elements and genes.
For general questions about the dataset or how to access it, please email the helpdesk at firstname.lastname@example.org. If you wish to learn more about future plans, please email Steve Searle for questions on transcription information, and Paul Flicek and Ewan Birney on regulatory information. (All of us are on the helpdesk email, which is internally tracked to ensure a response to each question, so we recommend this route. However, we realise that some people have specific strategic questions that they may be more comfortable sending to us directly.)