Genome Assemblies

Assembly versions

The genome assemblies presented by Ensembl Genomes come from a wide variety of sources. Ensembl Genomes follows a policy, compatible with the Browser Genome Release Agreement, of only using assemblies that have been submitted to the INSDC archives (ENA, GenBank and DDBJ). Assemblies are usually loaded directly from the sequences and assembly information deposited in ENA or GenBank, even where the annotation is subsequently loaded from another source. In a small number of cases, legacy data may not be completely consistent with the data in the INSDC archives. Ensembl Genomes is actively working to resolve these inconsistencies.

Details of each individual assembly are available on the information page for each species in the browser (e.g. Anopheles gambiae). These pages describe in outline how the assembly was created, and provide stable, third-party identifiers for the assembly. In addition to an accepted community assembly identifier, these identifiers include an accession number from the INSDC Genome Assembly Database, which is used as an authoritative source of assemblies.

When an assembly changes between releases, the old and new assemblies are mapped to each other automatically, so that features located on the old assembly can be projected forward onto the new one. This process and its uses are described in more detail in Assembly Mapping.

Assembly information in Ensembl Genomes databases

The assembly identifiers described above are also used in the MySQL databases that underlie the Ensembl platform, including the Ensembl API. For most databases, the final number in the database name reflects the version number of the corresponding INSDC Assembly Database entry, e.g. nasonia_vitripennis_core_21_74_2 corresponds to GCA_000002325.2. Note that this may not hold true for assemblies where an existing version history is not reflected by the corresponding INSDC assembly record. Within the database, the assembly identifiers can be found in the meta tables of the core databases, with the keys 'assembly.name' for the established assembly name and 'assembly.accession' for the INSDC assembly accession.
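
The naming convention and meta keys described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this document: the database name layout is inferred from the single example given (species, "core", two release numbers, then the assembly version), the meta table is assumed to expose columns named meta_key and meta_value, and an in-memory SQLite table stands in for a real MySQL core database. The assembly name value is a placeholder, not the real name of any assembly.

```python
import re
import sqlite3

# Layout inferred from the one example in the text (an assumption):
# <species>_core_<release number>_<release number>_<assembly version>
DB_NAME_PATTERN = re.compile(
    r"^(?P<species>.+?)_core_(?P<eg_release>\d+)_(?P<ensembl_release>\d+)"
    r"_(?P<assembly_version>\d+)$"
)


def assembly_version_from_db_name(db_name):
    """Return the trailing assembly version of a core database name."""
    match = DB_NAME_PATTERN.match(db_name)
    if match is None:
        raise ValueError(f"unrecognised core database name: {db_name!r}")
    return int(match.group("assembly_version"))


def accession_version(accession):
    """Return the version suffix of an accession such as GCA_000002325.2."""
    _, dot, version = accession.rpartition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"accession has no version suffix: {accession!r}")
    return int(version)


def assembly_meta(cursor):
    """Fetch the two assembly keys from a meta table via a DB-API cursor."""
    cursor.execute(
        "SELECT meta_key, meta_value FROM meta "
        "WHERE meta_key IN ('assembly.name', 'assembly.accession')"
    )
    return dict(cursor.fetchall())


# Stand-in for a core database: an in-memory SQLite table (hypothetical data).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE meta (meta_key TEXT, meta_value TEXT)")
connection.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [
        ("assembly.name", "PLACEHOLDER_NAME"),  # placeholder, not a real name
        ("assembly.accession", "GCA_000002325.2"),
    ],
)

meta = assembly_meta(connection.cursor())
db_name = "nasonia_vitripennis_core_21_74_2"

# True when the database name and the INSDC accession carry the same version.
versions_match = (
    assembly_version_from_db_name(db_name)
    == accession_version(meta["assembly.accession"])
)
print(versions_match)
```

Because the version in the database name may not match the INSDC record for assemblies with an older version history, a mismatch here should be treated as something to check rather than as an error.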

In addition to top-level assembly identifiers, all "sequence-level" sequence regions (normally contigs) use the versioned INSDC accession as their name. "Top-level" sequences (e.g. chromosomes or supercontigs) usually use the names assigned by their submitter, but the INSDC accession is provided as a synonym to allow third-party data to be mapped to the correct location.
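
The use of synonyms to place third-party data can be sketched as a simple lookup. This is an illustration only: the region names and accessions below are invented placeholders, and the (submitter name, INSDC accession) pairs are assumed to have been retrieved from the database beforehand by whatever means is available.

```python
def build_synonym_index(pairs):
    """Map both the submitter name and its INSDC synonym to the submitter name.

    `pairs` is an iterable of (submitter_name, insdc_accession) tuples.
    """
    index = {}
    for submitter_name, insdc_accession in pairs:
        index[submitter_name] = submitter_name
        index[insdc_accession] = submitter_name
    return index


def resolve_region(index, identifier):
    """Return the top-level region name for either kind of identifier."""
    try:
        return index[identifier]
    except KeyError:
        raise KeyError(f"unknown sequence region: {identifier!r}") from None


# Hypothetical data: these names and accessions are placeholders.
synonym_index = build_synonym_index(
    [
        ("chromosome_A", "EXAMPLE000001.1"),
        ("chromosome_B", "EXAMPLE000002.1"),
    ]
)

# A third-party feature reported against an INSDC accession is placed on
# the region that carries the submitter-assigned name.
print(resolve_region(synonym_index, "EXAMPLE000002.1"))
```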