EnsemblEnsembl mirror

Variation API Tutorial

This is a tutorial about getting the Ensembl Variation data programmatically, using a Perl API. To get a description about the Ensembl Variation data, please go to this page.

Introduction

This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the concepts and conventions in the Ensembl Core API tutorial is assumed. Documentation about the Variation database schema is available here , and while not necessary for this tutorial, an understanding of the database tables may help as many of the adaptor modules are table-specific.


Code Conventions

Refer to the Ensembl Core tutorial for a good description of the coding conventions normally used in Ensembl. We try as much as possible to stick to these rules in Variation.


Connecting an Ensembl Variation database

Connecting to an Ensembl Variation database is made simple by using the Bio::EnsEMBL::Registry module:

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

The use of the registry ensures you will load the correct versions of the Ensembl databases for the software release it can find on a database instance. Using the registry object, you can then create any of number of database adaptors. Each of these adaptors is responsible for generating an object of one type. The Ensembl variation API uses a number of object types that relate to the data stored in the database. For example, in order to generate variation objects, you should first create a variation adaptor:

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor(
	'human',	# species
	'variation',	# database
	'variation'	# object type
);

my $variation = $variation_adaptor->fetch_by_name('rs1333049');

The get_adaptor method will automatically create a connection to the relevant database; in the example above, a connection will be made to the variation database for human. The three parameters passed specify the species, database and object type you require. Below is a non exhaustive list of Ensembl variation adaptors that are most often used

  • IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
  • LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
  • PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
  • ReadCoverageAdaptor to fetch Bio::EnsEMBL::Variation::ReadCoverage objects
  • TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
  • VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
  • VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects

Only some of these adaptors will be used for illustration as part of this tutorial through commented perl scripts code.



Variations

Variations in the genome

One of the most important uses for the variation database is to be able to get all variations in a certain region in the genome. Below it is a simple commented perl script to illustrate how to get all variations in chromosome 25 in zebrafish.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $slice_adaptor = $registry->get_adaptor('danio_rerio', 'core', 'slice'); #get the database adaptor for Slice objects
my $slice = $slice_adaptor->fetch_by_region('chromosome',25); #get chromosome 25 in zebrafish

my $vf_adaptor = $registry->get_adaptor('danio_rerio', 'variation', 'variationfeature'); #get adaptor to VariationFeature object
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); #return ALL variations defined in $slice

foreach my $vf (@{$vfs}){
  print "Variation: ", $vf->variation_name, " with alleles ", $vf->allele_string, 
        " in chromosome ", $slice->seq_region_name, " and position ", $vf->start,"-",$vf->end,"\n";
}

Consequence type of variations

Another common use of the variation database is to retrieve the effects that variations have on a transcript. In human, Ensembl also provides SIFT and PolyPhen predictions of the effects of non-synonymous protein changes. For a complete list of the consequence types predicted by Ensembl, click here.

Consequences are represented by a hierarchy of objects as shown below:

  • TranscriptVariation - represents the overlap of a variation feature and a transcript
    • TranscriptVariationAllele - the component represented by a particular allele of a variation
      • OverlapConsequence - represents the consequence itself

In the example below, it is explained how to get all variations in a particular human transcript and see what is the effect of that variation in the transcript, including the PolyPhen and SIFT predictions. It is also shown how to retrieve the Sequence Ontology terms for the consequences.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $stable_id = 'ENST00000393489'; #this is the stable_id of a human transcript
my $transcript_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'transcript'); #get the adaptor to get the Transcript from the database
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id); #get the Transcript object

my $trv_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'transcriptvariation'); #get the adaptor to get TranscriptVariation objects
my $trvs = $trv_adaptor->fetch_all_by_Transcripts([$transcript]); #get ALL effects of Variations in the Transcript

foreach my $tv (@{$trvs}) {
	my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles();

	foreach my $tva(@{$tvas}) {
		my @ensembl_consequences;
		my @so_consequences;
		
		my $ocs = $tva->get_all_OverlapConsequences();
		
		foreach my $oc(@{$ocs}) {
			push @ensembl_consequences, $oc->display_term;
			push @so_consequences, $oc->SO_term;
		}
		
		my $sift = $tva->sift_prediction;
		my $polyphen = $tva->polyphen_prediction;
		
		print
			"Variation ", $tv->variation_feature->variation_name,
			 " allele ", $tva->variation_feature_seq,
			 " has consequence ", join(",", @ensembl_consequences),
			 " (SO ", join(",", @so_consequences), ").";
			 
		if(defined($sift)) {
			print " SIFT=$sift";
		}
		if(defined($polyphen)) {
			print " PolyPhen=$polyphen";
		}
		
		print "\n";
	}
}

It is also possible to calculate consequence types for variations not currently in the database. In the example below, we create a VariationFeature object from scratch, given a slice and VariationFeatureAdaptor object, and use this to calculate the consequence of our new SNP. Here we use the aggregation methods in the TranscriptVariation object instead of retrieving each TranscriptVariationAllele and OverlapConsequence object.

A script is also provided in ftp://ftp.ensembl.org/pub/misc-scripts/ that will calculate consequence types from a list of variations provided in an input file.

use strict;
use Bio::EnsEMBL::Registry;

# get registry
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');

my $vfa = $reg->get_adaptor('human', 'variation', 'variationfeature');
my $sa = $reg->get_adaptor('human', 'core', 'slice');

# get a slice for the new feature to be attached to
my $slice = $sa->fetch_by_region('chromosome', 13);

# create a new VariationFeature object
my $new_vf = Bio::EnsEMBL::Variation::VariationFeature->new(
  -start => 114268626,
  -end => 114268626,
  -slice => $slice,           # the variation must be attached to a slice
  -allele_string => 'A/C',    # the first allele should be the reference allele
  -strand => 1,
  -map_weight => 1,
  -adaptor => $vfa,           # we must attach a variation feature adaptor
  -variation_name => 'newSNP',
);

# get the consequence types
my $cons = $new_vf->get_all_TranscriptVariations();

foreach my $con(@{$new_vf->get_all_TranscriptVariations}) {
  foreach my $string(@{$con->consequence_type}) {
    print
      "Variation ", $new_vf->variation_name,
      " at position ", $new_vf->seq_region_start,
      " on chromosome ", $new_vf->seq_region_name,
      " has consequence ", $string,
      " in transcript ", $con->transcript->stable_id, "\n";
  }
}

Variations, Flanking sequences and Genes

Below is a complete example on how to use the variation API to retrieve different data from the database. In this particular example, we want to get, for a list of variation names, information about alleles, flanking sequences, locations, effects of variations in transcripts, position in the transcript (in case it has a coding effect) and genes containing the transcripts.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); #get the different adaptors for the different objects needed
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');
my $gene_adaptor = $registry->get_adaptor('human', 'core', 'gene');

my @rsIds = qw(rs1367827 rs1367830);
foreach my $id (@rsIds){
# get Variation object
  my $var = $va_adaptor->fetch_by_name($id); #get the Variation from the database using the name
  get_VariationFeatures($var);
}

sub get_VariationFeatures{
  my $var = shift;
  # get all VariationFeature objects: might be more than 1 !!!
  foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
      print $vf->variation_name(),","; # print rsID
      print $vf->allele_string(),","; # print alleles
      print join(",",@{$vf->consequence_type()}),","; # print consequenceType
      print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]"; #print the allele string
      print substr($var->three_prime_flanking_seq,0,10), ","; # print RefSeq
      print $vf->seq_region_name, ":", $vf->start,"-",$vf->end; # print position in Ref in format Chr:start-end
      get_TranscriptVariations($vf); # get Transcript information
  }
}

sub get_TranscriptVariations{
  my $vf = shift; 
  
  # get all TranscriptVariation objects: might be more than 1 !!!
  my $transcript_variations = $vf->get_all_TranscriptVariations; #get ALL the effects of the variation in 
                                                                    # different Transcripts
  if (defined $transcript_variations){
    foreach my $tv (@{$transcript_variations}){
      print ",", $tv->pep_allele_string if (defined $tv->pep_allele_string);
                                              # the AA change, but only if it is in a coding region
      my $gene = $gene_adaptor->fetch_by_transcript_id($tv->transcript->dbID);
      print ",",$gene->stable_id if (defined $gene->external_name); # and the external gene name
    }
  }
  print "\n";
}

Failed variations

In the Variation pipeline, quality checks are performed in order to remove variations that contain errors or inconsistencies. Prior to Ensembl release 61, much of the data related to these variations that failed the quality checks were deleted from the database and no longer accessible. Starting from release 61, these variations are flagged as being failed but all related data is kept in the database and can be retrieved via the API or the web interface.

The default API behaviour when fetching multiple objects is not to return data for variations that have been flagged as failed but this can be modified in the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor module by setting the include_failed_variations flag. This will affect all adaptors that have been created (and will be created) using the connection until the behaviour is explicitly changed again (or the connection to the database is closed).

Below it is a simple commented perl script to illustrate how to get data related to variations that have been flagged as failed.

	
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Get the database adaptor for Slice objects for human
my $slice_adaptor = $registry->get_adaptor('human', 'core', 'slice');
# Get a slice from chromosome 6
my $slice = $slice_adaptor->fetch_by_region('chromosome','6',133017695,133161157); 

# Get adaptor to VariationFeature object
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');

# First, get all variations on the slice. The default behaviour is to exclude variations that have been flagged as failed
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); 

# Count the number of variations returned
print "There are " . scalar(@{$vfs}) . " variations in the " . $slice->display_id() . " region\n";

# Now, tell the DBAdaptor that we want the failed variations as well. We get the DBAdaptor via the db() subroutine in our adaptor
$vf_adaptor->db->include_failed_variations(1);

# Again, fetch all variations on the slice but this time include the failed variations
$vfs = $vf_adaptor->fetch_all_by_Slice($slice); 

# Count the number of variations returned, this number is likely to be higher than when failed variations were omitted
print "There are " . scalar(@{$vfs}) . " variations in the " . $slice->display_id() . " region\n";

There are a few subroutines to query the failed status of a variation. In the code snippet below we loop over the variations we got and find out how many are flagged as failed and why.

	
my $failed = 0;
my %descriptions;
foreach my $vf (@{$vfs}) {
	
	# Check whether the variation for this variation feature is failed
	if ($vf->variation->is_failed()) {
		$failed++;
		
		# Get the reason why the variation failed and count the number of variations failed for the same reason.
		map {$descriptions{$_}++} @{$vf->variation->get_all_failed_descriptions()};
	}
}

# Print the results
print "There are $failed variations on slice " . $slice->display_id() . " that have been flagged as failed:\n";
map { print "\t" . $descriptions{$_} . " variations failed because " . $_ . "\n"; } keys(%descriptions);
	

NOTE: some fetch methods are unaffected by the include_failed_variations flag in DBAdaptor. These are methods that return data related to specific variations, rather than a general batch fetch. For example, the fetch_by_dbID and fetch_all_by_VariationFeatures methods in TranscriptVariationAdaptor will not be affected by the flag whereas the fetch_all_by_Transcripts method will be.

Even though a variation has not been failed in the Variation pipeline, some of the alleles associated with it may have been. Variations having failed alleles are not themselves flagged as failed but we can query them to see if any associated alleles have been flagged as failed.

	
foreach my $vf (@{$vfs}) {
	
	# Check whether the variation for this variation feature has failed alleles
	if ($vf->variation->has_failed_alleles()) {
		
		# Get all the alleles associated with the variation and check which ones have failed and why
		my %failed_alleles;
		foreach my $allele (@{$vf->variation->get_all_Alleles()}) {
		
			# Check if the allele is flagged as failed
			if ($allele->is_failed()) {
				
				# Get the reason why
				$failed_alleles{$allele->subsnp() . ":" . $allele->allele()} = $allele->failed_description();
			}
		}
		
		# Print the result
		print "Variation " . $vf->variation_name() . " has the following failed alleles:\n";
		map { print "\t" . $_ . " has been flagged as failed because " . $failed_alleles{$_} . "\n"; } keys(%failed_alleles);
	}
}
	

Phenotype annotations

The Ensembl variation API provides some methods to retrieve phenotype data associated with various Ensembl object types (variations, structural variations, genes, and QTLs). These data, stored into the Ensembl variation database, are imported from different sources (e.g. NHGRI GWAS catalog, OMIM, UniProt, Open Access GWAS Database, EGA, COSMIC, HGMD, ...).

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Fetch a variation object
my $var_adaptor = $registry->get_adaptor('human', 'variation', 'variation');
my $var = $var_adaptor->fetch_by_name('rs1421085');

# Fetch all the phenotype features associated with the variation
my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature');
foreach my $pf (@{$pf_adaptor->fetch_all_by_Variation($var)}) {
	print "Variation ", $pf->variation_names, " is associated with the phenotype '", $pf->phenotype->description,
	      "' in source ", $pf->source;
			
	print " with a p-value of ",$pf->p_value if (defined($pf->p_value));
	
	print ".\n";
	
	print "The risk allele is ", $pf->risk_allele, ".\n" if (defined($pf->risk_allele));
}

It is also possible to retrieve all features associated with a phenotype description and an optional source name.

my $source_name = 'NHGRI_GWAS_catalog';
my $phenotype   = 'Cardiac hypertrophy';

my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature');
 
foreach my $pf (@{$pf_adaptor->fetch_all_by_phenotype_description_source_name($phenotype,$source_name)}) {
	my $external_reference = $pf->external_reference;
	$external_reference =~ s/\// ID: /;
	
	print $pf->type, " ", $pf->object_id, " is associated with the phenotype '", $phenotype,
	      "' in source ", $source_name," (", $external_reference,")\n";
}		

By default, the API will only returns the variations with significant phenotype associations (low P value). However the database stores also non significant phenotype associations. In order to fetch this data, an extra line in the code needs to be added after the creation of the adaptor:

my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature');

$pf_adaptor->db->include_non_significant_phenotype_associations(1); # Use the value 0 to revert to only significant associations.

Variation sets

Variation sets group variations that share some properties i.e. from the same project/study, individual, phenotype, chip, to make it easier to retrieve these for analysis.
> For example, all the variations identified in the 1000 Genomes - phase 1 study have been grouped into one variation set ('1000 Genomes - All').

Sets can be be part of supersets or divided into subsets depending on the hierarchical relationships between them.
> For example, the set representing variations identified in the 1000 Genomes - phase 1 study is named '1000 genomes - All' and has 4 subsets based on population: '1000 Genomes - EUR', '1000 Genomes - ASN', '1000 Genomes - AMR', '1000 Genomes - AFR'.

A list of all the variation sets with their description is available on the Variation Data Description page.
NOTE: The variation sets and the populations are differents! e.g. The variation set associated with the European 1000 Genomes data is "1000 Genomes - EUR", whereas the associated population is "1000GENOMES:phase_1_EUR".

In the Ensembl Variation API, variation sets are handled by the module VariationSet, and the accompanying adaptor module, VariationSetAdaptor, that interfaces with the underlying database. A VariationSet has a name attribute, a brief description and also a short name attribute. In the example below, we will connect to a human variation database, get a VariationSetAdaptor, fetch all available variation sets and print out their names and hierarchical relationships.

First, we get a connection and an adaptor:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

# Load the registry from db
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Get a VariationSetAdaptor on the human variation database
my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');

Next, we get the VariationSet objects from the adaptor. In order to display the hierarchical relationships between VariationSets, we fetch only the top level sets and for each set, recursively get the subsets. We print the name and short name attributes for each set and indicate the hierarchy by increasing indentation.


# Get all top-level variation sets
my $top_vss = $vs_adaptor->fetch_all_top_VariationSets();

# Loop over the top level variation sets and recursively print the subsets
foreach my $top_vs (@{$top_vss}) {
	print_set($top_vs);
}

# We define a function that will help us recurse over the set hierarchy and print the data   
sub print_set {
	my $set = shift;
	my $indent = shift || "";
	
	# Print the set attributes
	printf("\%s\%s (\%s)\n",$indent,$set->name(),$set->short_name());
	
	# Get the subsets that have the current set as immediate parent
	my $subsets = $set->get_all_sub_VariationSets(1);
	
	# Call the print subroutine for each of the subsets with an increased indentation
	foreach my $subset (@{$subsets}) {
		print_set($subset,$indent . "  ");
	}
}

Running the code in the example above on the human variation database produces a table similar to the one on the Variation Data Description page.

In order to get objects for the variations assigned to a variation set, we use the get_Variation_Iterator method. Note that rather than returning a reference to a list of all variation objects, this returns an Iterator object. The Iterator allows you to iterate over large numbers of objects without trying to fit them all into memory at once which could otherwise easily cause your script to crash. The main methods we will use on the iterator are the has_next() and next() methods. In the example below, we fetch a variation set that contains variations that have been linked to phenotypes and print the first 10 examples. Note that getting variations from a variation set that has subsets below it automatically returns the variations from the subsets.


  # Get the variation set for the phenotype-associated variants.
  my $vs = $vs_adaptor->fetch_by_short_name('ph_variants');
  
  printf("\%s (\%s):\n\t\%s\n",$vs->name(),$vs->short_name(),$vs->description());
  
  my $limit = 10;
  my $fetched = 0;
  my $it = $vs->get_Variation_Iterator();
  
  # Get the first 10 examples and print some data from them
  while ($fetched < $limit && $it->has_next()) {
  	
    # Print the name of the variation
    my $var = $it->next();
    printf("\t\%s has been found to be associated with:\n",$var->name());
  	
    # Get the PhenotypeFeature objects for the variation
    my $pfs = $var->get_all_PhenotypeFeatures();
  	
    # Loop over the annotations and print the phenotypes
    foreach my $pf (@{$pfs}) {
      my @output = ();
      push @output, 'Desc: ' . $pf->phenotype->description;
      push @output, 'p-value: ' . $pf->p_value if (defined $pf->p_value);
      push @output, 'study_type: ' . $pf->study->type if (defined $pf->study && defined $pf->study->type);
      push @output, 'study_name: ' . $pf->study->name if (defined $pf->study && defined $pf->study->name);
      print join(' ', @output), "\n";
    }
    $fetched++;
  }

You can also use Slices to retrieve VariationFeatures belonging to a desired variation set in a particular genomic region. In the example below, we fetch all VariationFeatures discovered by the third 1000 genomes pilot study on the RIC8A gene.

  # Get the variation set object
  $vs = $vs_adaptor->fetch_by_short_name('1kg_hct');

  # Get a gene adaptor and the gene object for the RIC8A gene
  my $gene_adaptor = $registry->get_adaptor('human','core','gene');
  my $gene = $gene_adaptor->fetch_by_display_label('RIC8A');

  # Get the variation features on the slice belonging to the variation set
  my $vfs = $vs->get_all_VariationFeatures_by_Slice($gene->feature_Slice());

  # Loop over the variation features and print the variation using HGVS notation relative to the canonical transcript
  foreach my $vf (@{$vfs}) {
    my $tvs = $vf->get_all_TranscriptVariations([$gene->canonical_transcript()]);
    my $tv = $tvs->[0];
    my @hgvs_notations = ();
    foreach my $hgvs_type (qw/hgvs_genomic hgvs_transcript hgvs_protein/) {
      for my $hgvs_notation (values %{$tv->$hgvs_type}) {
        push @hgvs_notations, ((defined $hgvs_notation) ? $hgvs_notation : "no $hgvs_type");
      }
    }
    print $vf->variation_name, ': ', join(', ', @hgvs_notations), "\n";
  }

The variation sets can also be retrieved from the structural variations.

  # Get the StructuralVariation object
  my $sv_adaptor  = $registry->get_adaptor( 'human', 'variation', 'structuralvariation' );
  my $sv = $sv_adaptor->fetch_by_name('esv2663683');

  # Get the structural variation features
  my $svfs = $sv->get_all_StructuralVariationFeatures();

  # Loop over the structural variation features and print the corresponding variation set(s)
  foreach my $svf (@{$svfs}) {
    my $vss = $svf->get_all_VariationSets();
    foreach my $vs (@{$vss}) {
      print "The structural variation belongs to the set ", $vs->name, "\n";
    }
  }


Alleles and Genotypes

Alleles and frequencies

Most variations in the Ensembl Variation database have associated allele frequencies. Each allele object associated with a variation represents an observation of a variant allele in a given population, and may have an associated frequency. Populations are also represented as objects.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('mus_musculus', 'variation', 'variation');

my $variation = $variation_adaptor->fetch_by_name('rs4224828');
my $alleles = $variation->get_all_Alleles();

foreach my $allele (@{$alleles}) {
  next unless (defined $allele->population);
  my $allele_string   = $allele->allele;
  my $frequency       = $allele->frequency || 'NA';
  my $population_name = $allele->population->name;
  printf("Allele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name);
}

Genotypes

Many variations in the Ensembl Variation database, especially those genotyped in the HapMap or 1000 Genomes projects, have associated genotypes. Genotypes are stored per individual as IndividualGenotype objects, and are associated with a Variation and Individual object. They can be retrieved from an IndividualGenotypeAdaptor, or directly from a Variation object.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

my $variation = $variation_adaptor->fetch_by_name('rs1333049');

my $genotypes = $variation->get_all_IndividualGenotypes();

foreach my $genotype(@{$genotypes}) {
	print "Individual ", $genotype->individual->name, " has genotype ", $genotype->genotype_string, "\n";
}

LD calculation

Requirement

In order to be able to use the LD calculation, you need to compile the C source code and install a Perl module, called IPC::Run.
There is more information on how to do this in here.

In the example below, it calculates the LD in a region in human chromosome 6 for a HAPMAP population, but only prints when there is a high LD

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $chr = 6;  #defining the region in chromosome 6
my $start = 25_834_000;
my $end = 25_854_000;
my $population_name = 'CSHL-HAPMAP:HapMap-CEU'; #we only want LD in this population

my $slice_adaptor = $registry->get_adaptor('human', 'core', 'slice'); #get adaptor for Slice object
my $slice = $slice_adaptor->fetch_by_region('chromosome',$chr,$start,$end); #get slice of the region


my $population_adaptor = $registry->get_adaptor('human', 'variation', 'population'); #get adaptor for Population object
my $population = $population_adaptor->fetch_by_name($population_name); #get population object from database

my $ldFeatureContainerAdaptor = $registry->get_adaptor('human', 'variation', 'ldfeaturecontainer'); #get adaptor for LDFeatureContainer object
my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice($slice,$population); #retrieve all LD values in the region

foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}){
  if ($r_square->{r2} > 0.8){ #only print high LD, where high is defined as r2 > 0.8
    print "High LD between variations ", $r_square->{variation1}->variation_name,"-", $r_square->{variation2}->variation_name, "\n";
  }
}

Please note the following warning MSG: variation features have no pairwise ld information is returned when there is no data in the LDFeatureContainer structure for the specified pair of VariationFeature objects. Here are some reasons why this may happen:

  • the LDFeatureContainer does not span the region contained by one or both VariationFeature objects specified
  • there are no genotypes for one or both variations, meaning no LD can be calculated
  • one or both variations is non-variant in the specified population (i.e. has a minor allele frequency of 0)
  • the estimated r2 value is lower than the cut-off threshold in the calculation software (the threshold is 0.05)

Looking at other associated data will help to narrow down the reason. If there are variants in the region spanned by the LDFeatureContainer with genotypes with a minor allele frequency greater than zero, then assume the r2 is < 0.05, which essentially means no correlation.


Specific strain information

With the apparition of the new technologies, one of the new functionalities that the variation API has is the possibility to work with your specific strain as if it was the reference one, and compare it against others. In the example, we create a StrainSlice object for a region in Craig Venter's sequence and compare it against the reference sequence.

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';
my $host= 'ensembldb.ensembl.org';
my $user= 'anonymous';

$reg->load_registry_from_db(
	-host => $host,
    -user => $user
);

# get exon adaptor from core
my $sa = $reg->get_adaptor("human", "core", "slice");

my $slice = $sa->fetch_by_region('chromosome', 8, 9213000, 9216000);

# get strainSlice from the slice
my $venter = $slice->get_by_strain("Venter");

my @differences = @{$venter->get_all_AlleleFeatures_Slice()};

foreach my $diff (@differences){
  print "Locus ", $diff->seq_region_start, "-", $diff->seq_region_end, ", Venter's alleles: ",$diff->allele_string, "\n";
}


Structural variants

Structural variations

The Ensembl variation database also stores information about structural variations. These data are imported from the DGVa (Database of Genomic Variants archive). Structural variants are remapped to the current genome assembly using the $slice->project() method from the Ensembl Core API, allowing for one gap in the resultant mapping.

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';
my $host= 'ensembldb.ensembl.org';
my $user= 'anonymous';

$reg->load_registry_from_db(
  -host => $host,
  -user => $user
);

# Get adaptor to StructuralVariation object
my $sva = $reg->get_adaptor("human", "variation", "structuralvariation");

# Get the StructuralVariation object
my $sv = $sva->fetch_by_name('esv2663683');

print $sv->variation_name, " ",  $sv->var_class, " (SO term: ", $sv->class_SO_term, ")\n";

# Get Study information
if ($sv->study) {
  print "This structural variation comes from the study ", $sv->source, "-", $sv->study->name, ": ", $sv->study->description, "\n";
}

Like the variation, you can fetch the structural variations using a Slice.

#  Get adaptor to Slice object from core
my $sa = $reg->get_adaptor("human", "core", "slice");

my $slice = $sa->fetch_by_region('chromosome', 5, 1, 1000000);

# get all structural variation features on the slice
my $svfs = $slice->get_all_StructuralVariationFeatures();

# StructuralVariationFeature objects
foreach my $svf (@$svfs) {
  print $svf->variation_name, " ", $svf->seq_region_name, ":", $svf->seq_region_start, "-", $svf->seq_region_end, " ", 
	$svf->var_class, " (SO term: ", $svf->class_SO_term, ")\n";
}

You can also retrieve all the supporting evidences associated to a structural variant.

# Get adaptor to StructuralVariation object from variation
my $sva = $reg->get_adaptor("human", "variation", "structuralvariation");

# Get the StructuralVariation object
my $sv = $sva->fetch_by_name('esv2663683');

# Get the supporting evidences associated to the structural variation
my $ssvs = $sv->get_all_SupportingStructuralVariants();

foreach my $ssv (@$ssvs) {
  print $ssv->variation_name, " ",  $ssv->var_class, " (SO term: ", $ssv->class_SO_term, ")\n";
}


Others

Fetching Iterators

Sometimes an adaptor method may return more objects than can fit into memory at once, and for such cases the variation API provides methods to fetch Iterator objects instead of the usual array reference. An Iterator is an object which allows you to iterate over the entire set of objects fetched by the adaptor without loading them all into memory at once, by calling the next() method to fetch each object in turn. You can tell when you have finished iterating over the set of objects by using the has_next() method which returns true when the Iterator still has objects to fetch and false when it is exhausted. These are the most commonly used Iterator methods, but there are also some other useful methods which allow you to transform and filter the set of objects in an analogous way to using map and grep on standard perl arrays, using the methods called (imaginatively) map() and grep(). Various other methods are also supported, please see the Iterator API documentation for full details.

Some example code using Iterators to fetch somatic mutations from the VariationAdaptor is listed below. After the first fetch_Iterator_somatic() call we just iterate over all somatic mutations found in the variation database, using next() to fetch each object, and has_next() to check when we are done. In the second call we filter the set of mutations down to only those which have a source of 'COSMIC' using the grep() method. To do this we pass an anonymous subroutine as the only argument to grep() which checks each Variation object in turn (set to $_ for convenience in the subroutine) and returns a true value if the source is equal to 'COSMIC' and false otherwise. This anonymous subroutine is very similar to a block that you would use when using perl's own grep function, but you have to prefix it with the 'sub' keyword. In the third call to fetch_Iterator_somatic we use the map() method to return the name of each Variation rather than the object itself using an anonymous subroutine just as you would use a block for perl's map function. In the final call we combine the grep() and map() methods to return an Iterator over all the names of somatic mutations from COSMIC. The map() and grep() methods return a new Iterator and only do their transforming and filtering on each object in turn, without needing all objects in memory, they are therefore much more memory-efficient than mapping and grepping a native perl array.

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host   => 'ensembldb.ensembl.org',
    -user   => 'anonymous',
);

my $va = $reg->get_adaptor('human', 'variation', 'variation') or die;

my $somatic_iter = $va->fetch_Iterator_somatic();

while ($somatic_iter->has_next()) {
    my $mutation = $somatic_iter->next();
    print $mutation->name(), "\n";
}

my $cosmic_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source eq 'COSMIC'});

while ($cosmic_iter->has_next()) {
    print $cosmic_iter->next()->name(), "\n";
}

my $name_iter = $va->fetch_Iterator_somatic()->map(sub {$_->name});

while ($name_iter->has_next()) {
    print $name_iter->next(), "\n";
}

my $cosmic_name_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source eq 'COSMIC'})->map(sub {$_->name});

while ($cosmic_name_iter->has_next()) {
    print $cosmic_name_iter->next(), "\n";
}


Fetching variation features or structural variation features from a Slice

Variation features and structural variation features can be fetched using a Slice object (from the Core API).

# Slice adaptor
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
# Slice object
my $slice = $slice_adaptor->fetch_by_region('chromosome', 3, 125439614, 125462946, 1);

# Fetch the variation feature objects overlapping the Slice
my @var_features = @{$slice->get_all_VariationFeatures()};

# Fetch the structural variation feature objects overlapping the Slice
my @sv_features = @{$slice->get_all_StructuralVariationFeatures()};

print "Number of variations: ", scalar @var_features, "\n";
print "Number of structural variations: ", scalar @sv_features, "\n";

It is also possible to fetch all the variation features overlapping a structural variation feature, using a Slice through a structural variation feature object.

# Structural variation adaptor
my $sv_adaptor = $registry->get_adaptor( 'human', 'variation', 'structuralvariation' );
# Structural variation object
my $sv = $sv_adaptor->fetch_by_name('esv2663683');

# Structural variation feature(s)
foreach my $svf (@{$sv->get_all_StructuralVariationFeatures()}){
  # Corresponding Slice
  my $slice = $svf->feature_Slice();

  # Fetch the variation feature objects overlapping the Slice
  my @var_features = @{$slice->get_all_VariationFeatures()};
  print "Number of variations: ", scalar (@var_features), "\n";
}

And you can do the same with the variation features, for example.

# Variation adaptor
my $var_adaptor = $registry->get_adaptor( 'human', 'variation', 'variation' );
# Variation object
my $var = $var_adaptor->fetch_by_name('rs1333049');

# Variation feature(s)
foreach my $vf (@{$var->get_all_VariationFeatures()}){
  # Corresponding Slice
  my $slice = $vf->feature_Slice();

  # Fetch the structural variation feature objects overlapping the Slice
  my @svf_features = @{$slice->get_all_StructuralVariationFeatures()};
  print "Number of structural variations: ", scalar (@svf_features), "\n";
}

Further help

For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.