ENCODE resources at Ensembl

The ENCODE project aims to discover all functional elements in the human genome. Ensembl is involved in ENCODE in two ways:

  • Dr Ewan Birney, who is the joint leader of Ensembl from the EBI side, also leads the analysis group for ENCODE.
  • Ensembl aims to integrate all genome-wide functional datasets, with ENCODE datasets expected to form a large proportion of these, to provide integrated functional annotation on the human genome. An initial dataset to this end was released on June 13th as part of Ensembl 45 (see below).

Ensembl works in tight coordination with the UCSC group, which is the data collection centre (DCC) for ENCODE.

Results of the recent ENCODE analysis

On June 14th, 2007, the paper detailing the analysis of the ENCODE pilot project was published in Nature along with 36 different companion papers in Genome Research. Dr Birney led the analysis for the main paper. The key results of this paper are:

  • Transcription is more complex than expected, with many non-coding transcripts intercalating with standard protein-coding genes. However, there was little evidence for protein-coding genes outside of established sets.
  • There are many more Transcription Start Sites (TSS) than expected, around 10-fold more than the number of protein-coding genes.
  • Regulatory information is distributed in a clustered manner across the genome, and its distribution near TSSs is symmetrical.
  • There are many distal DNaseI Hypersensitive Sites (DHSs), some of which show binding by CTCF.
  • Replication is correlated with histone structure in a more detailed manner than previously known.
  • Evolutionary constraint can be detected at far higher resolution than was previously feasible; surprisingly, many experimentally identified features did not show constraint.

A number of key intermediate files from the analysis, along with other ENCODE resources, are available for FTP download.

Functional genomics in Ensembl

Building on this initial analysis, Ensembl aims to provide a richer annotation of the human genome. The key concept introduced in Ensembl 45 is the "Regulatory Build", which aims to provide a single "best guess" set of regulatory elements, with growing annotation of those elements from different experiments. The Regulatory Build augments the standard "Gene Build" (itself incorporating both protein-coding and non-protein-coding genes) to form the union of functional elements in the genome.

We will also augment our Gene Build to take account of new functional datasets such as CAGE and PET tags across the genome. These tags are already available in Ensembl for human and mouse, using data generated by the RIKEN group in Japan and the GIS group in Singapore. We will use these markers of transcription start sites and termini to provide more accurate transcript definitions.

An initial Regulatory Build has been developed in Ensembl by Paul Flicek and colleagues. It integrates 8 genome-wide datasets, mainly in pre-publication "resource" status, including the DNaseI Hypersensitive site set from Greg Crawford's group at Duke University, a set of 6 histone modifications from Martin Hirst's group at the BGGC at Vancouver, and the CTCF dataset from Bing Ren's group at UCSD.

The Regulatory Build takes DNaseI Hypersensitive Sites, CTCF binding regions and H3K4me3 as three "focus regions", each defining a potential element. The union of these three foci defines 110,000 elements across the genome. We then used all the factors (including an additional 5 histone marks) to look for specific patterns diagnostic of certain features. This yielded a number of patterns that are highly enriched for gene starts, genic regions and distal regions (away from genes).
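
As an illustration of what taking the "union" of several focus sets can mean in practice, the following is a minimal Python sketch. It assumes that overlapping focus regions on the same chromosome are merged into a single element; the text above does not spell out the exact merging rule, so this is an assumption. The function name `union_of_foci` and the toy coordinates are hypothetical and are not part of any Ensembl API or dataset.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end), with 1-based inclusive coordinates
Interval = Tuple[str, int, int]


def union_of_foci(*focus_sets: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping intervals from any number of focus sets.

    Assumption: two focus regions belong to the same element when they
    share at least one base on the same chromosome.
    """
    # Pool all regions and sort by chromosome, then start, then end.
    all_regions = sorted(region for fs in focus_sets for region in fs)
    merged: List[Interval] = []
    for chrom, start, end in all_regions:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous element: extend it if needed.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            # No overlap: start a new element.
            merged.append((chrom, start, end))
    return merged


# Toy coordinates for illustration only (not real data).
dhs = [("7", 100, 200), ("7", 500, 600)]
ctcf = [("7", 150, 250)]
h3k4me3 = [("7", 590, 700), ("X", 10, 20)]

elements = union_of_foci(dhs, ctcf, h3k4me3)
for chrom, start, end in elements:
    print(f"{chrom}:{start}-{end}")
```

In this toy case the five focus regions collapse into three elements, because two pairs of regions overlap. A real build might also merge regions that lie within some distance of each other; that would need only a small change to the overlap test.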

The results of this analysis can be seen in the "Regulatory features" track on Ensembl displays in human (on by default) and are available to download from our FTP site.

We are extremely grateful to the Crawford and Hirst laboratories for the use of their data in pre-publication status, in line with the open data access approach of ENCODE and the Human Genome Project, and to Bing Ren's group at UCSD for the CTCF dataset (Kim et al. 2007. Cell 128:1231-45). In the future we hope to integrate more functional datasets and to use genetic association studies, such as those published in Stranger et al., to provide the link between elements and genes.

If you have questions about the dataset or how to access it, please email our helpdesk.

ENCODE regions

Region name  Chr.  start..end            Description               Compara
ENr324       X     122609996..123109995  Random Picks              MultiSpecies
ENm006       X     152767492..154063081  Manual Picks:ChrX         MultiSpecies
ENr231       1     149424685..149924684  Random Picks              MultiSpecies
ENr131       2     234156564..234656627  Random Picks              MultiSpecies
ENr331       2     219985590..220485589  Random Picks              MultiSpecies
ENr112       2     51512209..52012208    Random Picks              MultiSpecies
ENr121       2     118011044..118511043  Random Picks              MultiSpecies
ENr113       4     118466104..118966103  Random Picks              MultiSpecies
ENr212       5     141880151..142380150  Random Picks              MultiSpecies
ENm002       5     131284314..132284313  Manual Picks:Interleukin  MultiSpecies
ENr221       5     55871007..56371006    Random Picks              MultiSpecies
ENr222       6     132218540..132718539  Random Picks              MultiSpecies
ENr223       6     73789953..74289952    Random Picks              MultiSpecies
ENr323       6     108371397..108871396  Random Picks              MultiSpecies
ENr334       6     41405895..41905894    Random Picks              MultiSpecies
ENm013       7     89621625..90736048    Manual Picks              MultiSpecies
ENm001       7     115597757..117475182  Manual Picks:CFTR         MultiSpecies
ENm010       7     26924046..27424045    Manual Picks:HOXA         MultiSpecies
ENm012       7     113720369..114720368  Manual Picks:FOXP2        MultiSpecies
ENm014       7     125865892..127029088  Manual Picks              MultiSpecies
ENr321       8     118882221..119382220  Random Picks              MultiSpecies
ENr232       9     130725123..131225122  Random Picks              MultiSpecies
ENr114       10    55153819..55653818    Random Picks              MultiSpecies
ENr312       11    130604798..131104797  Random Picks              MultiSpecies
ENr332       11    63940889..64440888    Random Picks              MultiSpecies
ENm009       11    4730996..5732587      Manual Picks:Beta         MultiSpecies
ENm011       11    1699992..2306039      Manual Picks:IGF2/H19     MultiSpecies
ENm003       11    115962316..116462315  Manual Picks:Apo          MultiSpecies
ENr123       12    38626477..39126476    Random Picks              MultiSpecies
ENr111       13    29418016..29918015    Random Picks              MultiSpecies
ENr132       13    112338065..112838064  Random Picks              MultiSpecies
ENr311       14    52947076..53447075    Random Picks              MultiSpecies
ENr322       14    98458224..98958223    Random Picks              MultiSpecies
ENr233       15    41520089..42020088    Random Picks              MultiSpecies
ENm008       16    1..500000             Manual Picks:Alpha        MultiSpecies
ENr313       16    60833950..61333949    Random Picks              MultiSpecies
ENr211       16    25780428..26280428    Random Picks              MultiSpecies
ENr213       18    23719232..24219231    Random Picks              MultiSpecies
ENr122       18    59412301..59912300    Random Picks              MultiSpecies
ENm007       19    59023585..60024460    Manual Picks:Chr19        MultiSpecies
ENr333       20    33304929..33804928    Random Picks              MultiSpecies
ENr133       21    39244467..39744466    Random Picks              MultiSpecies
ENm005       21    32668237..34364221    Manual Picks:Chr21        MultiSpecies
ENm004       22    30133954..31833953    Manual Picks:Chr22        MultiSpecies
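
For readers who want to work with these coordinates programmatically, here is a small Python sketch that parses one row of the table above and computes the region length. The class `EncodeRegion`, the function `parse_region` and the `chr:start-end` output format are illustrative choices, not part of any Ensembl API. The sketch assumes that the coordinates are 1-based and inclusive.

```python
import re
from typing import NamedTuple


class EncodeRegion(NamedTuple):
    """One ENCODE pilot region (hypothetical helper type)."""
    name: str
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1

    def location(self) -> str:
        # A common "chromosome:start-end" style location string.
        return f"{self.chrom}:{self.start}-{self.end}"


# Region name, chromosome, then "start..end"; the rest of the row is ignored.
ROW_PATTERN = re.compile(r"^(EN[mr]\d{3})\s+(\S+)\s+(\d+)\.\.(\d+)")


def parse_region(row: str) -> EncodeRegion:
    """Parse a table row such as 'ENr324 X 122609996..123109995 ...'."""
    match = ROW_PATTERN.match(row.strip())
    if match is None:
        raise ValueError(f"Unrecognised region row: {row!r}")
    name, chrom, start, end = match.groups()
    return EncodeRegion(name, chrom, int(start), int(end))


region = parse_region("ENr324 X 122609996..123109995 Random Picks MultiSpecies")
print(region.name, region.location(), region.length)
```

Under the inclusive-coordinate assumption, the length is `end - start + 1`, which gives exactly 500,000 bases for the row parsed here.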