Human assembly and gene annotation

Assembly

This site provides a data set based on the December 2013 Homo sapiens high coverage assembly GRCh38 from the Genome Reference Consortium. This assembly is used by UCSC to create their hg38 database. The data set consists of gene models built from the genewise alignments of the human proteome as well as from alignments of human cDNAs using the cDNA2genome model of exonerate.

This release of the assembly has the following properties:

  • contig length total 3.4 Gb.
  • chromosome length total 3.1 Gb (excluding haplotypes).

It also includes 261 alt loci scaffolds, mainly in the LRC/KIR complex on chromosome 19 (35 alternate sequence representations) and the MHC region on chromosome 6 (7 alternate sequence representations).

Watch a video on YouTube about patches and haplotypes in the human genome.

Patches

As the GRC maintains and improves the assembly, patches are being introduced. Currently, assembly patches are of two types:

  • Novel patch: new sequences that add alternative sequence at a locus and will remain as haplotypes in the next major assembly release by GRC.
  • Fix patch: sequences that correct the reference sequence and will replace the given region of the reference assembly at the next major assembly release by GRC.

Neanderthal genome

A preliminary assembly of the Neanderthal (Homo sapiens neanderthalensis) genome is available via the Neanderthal Genome Browser, an Ensembl-powered project based at the Max Planck Institute.

The human genome assembly represented here corresponds to GenBank Assembly ID GCA_000001405.15.

Gene annotation

The Ensembl human gene annotations have been updated using Ensembl's automatic annotation pipeline. The updated annotation incorporates new protein and cDNA sequences which have become publicly available since the last GRCh38 genebuild (December 2013).

In the current release, we continue to display a joint gene set based on the merge between the automatic annotation from Ensembl and the manually curated annotation from Havana. See the Statistics section below for the corresponding GENCODE version number. The Consensus Coding Sequence (CCDS) identifiers have also been mapped to the annotations. More information is available from the CCDS project.

Updated manual annotation from Havana is merged into the Ensembl annotation every release. Transcripts from the two annotation sources are merged if they share the same internal exon-intron boundaries (i.e. have an identical splicing pattern), with slight differences in the terminal exons allowed. Importantly, all Havana transcripts are included in the final Ensembl/Havana merged (GENCODE) gene set.
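The merge criterion described above can be illustrated with a short sketch. This is not the Ensembl/Havana merge code: the exon representation (lists of `(start, end)` tuples), the function names `intron_chain` and `can_merge`, the example coordinates, and the `max_terminal_diff` tolerance are all assumptions made for illustration. The text only says that "slight" differences in the terminal exons are allowed and gives no threshold.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive; hypothetical representation


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by a transcript's exons as (start, end) pairs."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]


def can_merge(a: List[Exon], b: List[Exon], max_terminal_diff: int = 50) -> bool:
    """Sketch of the merge test: identical internal exon-intron boundaries,
    with the outer ends of the terminal exons allowed to differ slightly.

    max_terminal_diff is an assumed tolerance; the source states no number.
    """
    chain_a, chain_b = intron_chain(a), intron_chain(b)
    if not chain_a or chain_a != chain_b:
        # Different splicing pattern, or single-exon transcripts,
        # which this sketch does not attempt to handle.
        return False
    a_sorted, b_sorted = sorted(a), sorted(b)
    start_diff = abs(a_sorted[0][0] - b_sorted[0][0])
    end_diff = abs(a_sorted[-1][1] - b_sorted[-1][1])
    return start_diff <= max_terminal_diff and end_diff <= max_terminal_diff


# Hypothetical transcripts on the same strand of the same chromosome.
ensembl_tx = [(100, 200), (300, 400), (500, 650)]
havana_tx = [(120, 200), (300, 400), (500, 620)]   # same introns, shorter ends
skipped_tx = [(100, 200), (500, 650)]              # middle exon skipped

print(can_merge(ensembl_tx, havana_tx))   # True
print(can_merge(ensembl_tx, skipped_tx))  # False
```

Comparing intron chains rather than whole exons is what lets the terminal exons vary: the first exon's start and the last exon's end do not appear in any intron coordinate.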

Additional manual annotation of this genome can be found in Vega.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly: GRCh38 (Genome Reference Consortium Human Build 38), INSDC Assembly GCA_000001405.15, Dec 2013
Database version: 76.38
Base Pairs: 3,381,944,081
Golden Path Length: 3,096,649,726
Genebuild by: Ensembl
Genebuild method: Full genebuild
Genebuild started: Jan 2014
Genebuild released: Jul 2014
Genebuild last updated/patched: Jul 2014
Gencode version: GENCODE 20

Gene counts (Primary assembly)

Coding genes: 20,389 (incl. 500 readthrough)
Small non coding genes: 9,656
Long non coding genes: 14,470 (incl. 180 readthrough)
Pseudogenes: 14,345 (incl. 4 readthrough)
Gene transcripts: 194,353

Definitions of the categories above:

  • Coding genes: genes and/or transcripts that contain an open reading frame (ORF).
  • Long non coding genes: usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.
  • Pseudogenes: a pseudogene shares an evolutionary history with a functional protein-coding gene but has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.
  • Gene transcripts: nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.
  • Readthrough: readthrough transcripts are tagged by HAVANA and defined as transcripts connecting two independent loci. A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs). Readthrough transcripts are also annotated by RefSeq.
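
The readthrough definition used in these counts can be expressed as a small check. This is only an illustration of the definition, not how HAVANA assigns the tag (the tagging is manual). The locus names, coordinates, and the helper names `overlaps`, `loci_overlapped` and `looks_like_readthrough` are hypothetical, and all exons are assumed to lie on the same chromosome and strand.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive; hypothetical representation


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def loci_overlapped(candidate: List[Exon], own_locus: str,
                    exons_by_locus: Dict[str, List[Exon]]) -> Set[str]:
    """Loci, other than the candidate's own, whose exons overlap the candidate's exons."""
    return {
        locus
        for locus, exons in exons_by_locus.items()
        if locus != own_locus
        and any(overlaps(c, e) for c in candidate for e in exons)
    }


def looks_like_readthrough(candidate: List[Exon], own_locus: str,
                           exons_by_locus: Dict[str, List[Exon]]) -> bool:
    """Apply the stated definition: exons overlap exons of two or more other loci."""
    return len(loci_overlapped(candidate, own_locus, exons_by_locus)) >= 2


# Hypothetical annotation: two independent loci plus the candidate's own locus.
annotation = {
    "GENE_A": [(100, 200), (300, 400)],
    "GENE_B": [(900, 1000), (1100, 1200)],
    "GENE_A-GENE_B": [(100, 200), (300, 400), (1100, 1200)],
}
candidate_tx = annotation["GENE_A-GENE_B"]

print(looks_like_readthrough(candidate_tx, "GENE_A-GENE_B", annotation))       # True
print(looks_like_readthrough(annotation["GENE_A"], "GENE_A", {"GENE_A": annotation["GENE_A"], "GENE_B": annotation["GENE_B"]}))  # False
```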

Gene counts (Alternative sequence)

Coding genes: 2,098 (incl. 23 readthrough)
Small non coding genes: 656
Long non coding genes: 336 (incl. 7 readthrough)
Pseudogenes: 855
Gene transcripts: 11,830

Other

Genscan gene predictions: 50,117
Short Variants: 65,134,479