Variant Effect Predictor Custom annotations
VEP can integrate custom annotation from standard format files into your results by using the --custom flag.
These files may be hosted locally or remotely, with no limit to the number or size of the files. The files must be indexed using the tabix utility (BED, GFF, GTF, VCF); bigWig files contain their own indices.
Annotations typically appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.
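As a rough illustration of the key=value layout described above, the sketch below pulls one custom annotation out of VEP's default tab-delimited output. It assumes the Extra column is the last column and that its pairs are separated by semicolons; `extra_value` is a hypothetical helper name and the sample line is made up, not real VEP output.

```shell
# Print "variant <TAB> value" for one key found in the Extra column.
# Assumes Extra is the last tab-delimited field and that header lines
# start with '#'. extra_value is a hypothetical helper, not part of VEP.
extra_value() {
    # $1 = annotation key (the short name passed to --custom); input on stdin
    awk -F '\t' -v key="$1" '
        /^#/ { next }                      # skip header/comment lines
        {
            n = split($NF, pairs, ";")     # Extra column is the last field
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
                    print $1 "\t" substr(pairs[i], eq + 1)
            }
        }'
}

# Made-up 14-column line for illustration only.
printf 'var1\t1:10500\tA\t-\t-\t-\t-\t-\t-\t-\t-\t-\t-\tIMPACT=MODIFIER;Phenotype=Feature1\n' |
    extra_value Phenotype
```

In practice the function would read the real output file instead, for example `extra_value Phenotype < output.txt`.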
VEP supports the following formats:
- Gene/transcript annotations
- GFF : a format for describing genes and other genomic features (see the format specifications).
- GTF : a similar format derived from GFF (see the format specifications).
See more documentation about GFF/GTF format requirements for VEP.
NOTE: GFF and GTF annotations require a FASTA file in offline mode.
- Variant data
- VCF : a format used to describe genomic variants. VEP will use the 3rd column of the file as the identifier. INFO fields from records may be added to the VEP output.
- Basic/uninterpreted data
- BED : a simple tab-delimited format containing 3-12 columns of data. The first 3 columns contain the coordinates of the feature. If available, VEP will use the 4th column of the file as the identifier of the feature.
- bigWig : a format for storage of dense continuous data. VEP uses the value for the given position as the "identifier". Note that bigWig files contain their own indices, and do not need to be indexed by tabix. Requires Bio::DB::BigFile.
Any other files can be easily converted to be compatible with VEP; the easiest format to produce is a BED-like file containing coordinates and an (optional) identifier:
chr1 10000 11000 Feature1
chr3 25000 26000 Feature2
chrX 99000 99001 Feature3
Chromosomes can be denoted by either e.g. "chr7" or "7", "chrX" or "X".
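As one possible conversion, the sketch below turns a comma-separated table into a tab-delimited BED-like file. The input layout (a header row followed by chrom,start,end,name) is an assumption made for this example, `csv_to_bed` is a hypothetical helper name, and coordinates are passed through unchanged, so they are assumed to already follow BED conventions (0-based start).

```shell
# Convert "chrom,start,end,name" rows (with a header row) into
# tab-delimited BED-like lines. The identifier column is optional.
csv_to_bed() {
    awk -F ',' 'BEGIN { OFS = "\t" }
        NR == 1 { next }                   # drop the header row
        NF >= 3 {
            if (NF >= 4 && $4 != "") print $1, $2, $3, $4
            else                     print $1, $2, $3
        }'
}

printf 'chrom,start,end,name\nchr1,10000,11000,Feature1\nchrX,99000,99001,\n' | csv_to_bed
```

Rows without an identifier are written with three columns, in which case VEP falls back to an identifier derived from the coordinates.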
Custom annotation files must be prepared in a particular way in order to work with tabix and therefore with VEP. Files must be stripped of comment lines, sorted in chromosome and position order, compressed using bgzip and finally indexed using tabix. Here are some examples of that process for:
- GFF file
grep -v "#" myData.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > myData.gff.gz
tabix -p gff myData.gff.gz
- BED file
grep -v "#" myData.bed | sort -k1,1 -k2,2n -k3,3n -t$'\t' | bgzip -c > myData.bed.gz
tabix -p bed myData.bed.gz
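The same steps can be wrapped in small reusable functions. This is a minimal sketch for BED input, assuming bgzip and tabix are on the PATH; `clean_sort_bed` and `prepare_bed` are hypothetical names. One deliberate difference from the one-liners above: only lines that begin with "#" are removed, whereas `grep -v "#"` also drops any data line that contains "#" anywhere, for example in a feature name.

```shell
# Strip comment lines and sort a BED stream by chromosome, start, end.
clean_sort_bed() {
    tab=$(printf '\t')
    grep -v '^#' | sort -t "$tab" -k1,1 -k2,2n -k3,3n
}

# Compress and index one BED file; assumes bgzip and tabix are installed.
# $1 = input BED file; writes $1.gz and $1.gz.tbi next to it.
prepare_bed() {
    clean_sort_bed < "$1" | bgzip -c > "$1.gz" && tabix -p bed "$1.gz"
}

# Demo of the cleaning and sorting step on inline data (no bgzip/tabix needed).
printf '#track\nchr3\t25000\t26000\tFeature2\nchr1\t10000\t11000\tFeature1\n' | clean_sort_bed
```

A call such as `prepare_bed myData.bed` would then produce myData.bed.gz and its index in one step.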
The tabix utility has several preset filetypes that it can process, and it can also process any arbitrary filetype containing at least a chromosome and position column. See the documentation for details.
If you are going to use the file remotely (i.e. over HTTP or FTP protocol), you should ensure the file is world-readable on your server.
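A quick way to check this on the server is sketched below. The file names are placeholders and `is_world_readable` is a hypothetical helper. The sketch treats the tabix index the same way as the data file, since VEP needs to read both. Directory permissions and web server configuration are outside its scope.

```shell
# Succeeds if the "other" read bit is set on the file.
is_world_readable() {
    [ -n "$(find "$1" -prune -perm -004 2>/dev/null)" ]
}

# Placeholder names: the data file and its tabix index.
for f in myData.bed.gz myData.bed.gz.tbi; do
    if [ -e "$f" ] && ! is_world_readable "$f"; then
        chmod a+r "$f"     # grant read permission to all users
    fi
done
```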
Each custom file that you pass to VEP can be configured individually. Beyond the filepath, there are further options, each of which is specified in a comma-separated list, like this:
./vep [...] --custom Filename,Short_name,File_type,Annotation_type,Force_report_coordinates,VCF_fields
The options are as follows:
- Filename :
The path to the file. For tabix-indexed files, VEP will check that both the file and the corresponding .tbi file exist. For remote files, VEP will check that the tabix index is accessible on startup.
- Short name :
A name for the annotation that will appear as the key in the key=value pairs in the results.
If not defined, this will default to the annotation filename for the first set of annotations added (e.g. "myPhenotypes.bed.gz" in the second example below if the short name was missing).
- File type :
"bed", "gff", "gtf", "vcf" or "bigwig"
- Annotation type :
"exact" or "overlap" (if left blank, assumed to be overlap). When using "exact", only annotations whose coordinates match exactly those of the variant will be reported. This would be suitable for position-specific information such as conservation scores, allele frequencies or phenotype information. Using "overlap", any annotation that overlaps the variant by even 1bp will be reported.
- Force report coordinates :
"0" or "1" (if left blank, assumed to be 0). If set to "1", this forces VEP to output the coordinates of an overlapping custom feature instead of any found identifier (or value in the case of bigWig) field. If set to "0" (the default), VEP will output the identifier field if one is found; if none is found, then the coordinates are used instead.
- VCF fields :
You can specify any info type (e.g. "AC") present in the INFO field of the custom input VCF, to add these as custom annotations:
- If using "exact" annotation type, allele-specific annotation will be retrieved.
- The INFO field name will be prefixed with the short name, e.g. using short name "test", the INFO field "foo" will appear as "test_FOO" in the VEP output.
- In VCF files the custom annotations are added to the CSQ INFO field.
- Alleles in the input and VCF entry are trimmed in both directions in an attempt to match complex or poorly formatted entries.
# BigWig file
./vep [...] --custom frequencies.bw,Frequency,bigwig,exact,0
# BED file
./vep [...] --custom http://www.myserver.com/data/myPhenotypes.bed.gz,Phenotype,bed,exact,1
# VCF file
./vep [...] --custom ftp://ftp.ensemblgenomes.org/pub/bacteria/data_files/homo_sapiens/GRCh37/variation_genotype/TOPMED_GRCh37.vcf.gz,,vcf,exact,0,TOPMED
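Because the options are positional, it can help to assemble the --custom value from named parts and validate them first. The following is only a sketch: `custom_arg` is a hypothetical helper, not part of VEP, and it simply applies the allowed values and blank defaults listed above.

```shell
# Assemble the comma-separated value for --custom from its parts.
# Usage: custom_arg FILE SHORT_NAME FILE_TYPE [ANNOTATION_TYPE] [COORDS] [VCF_FIELD...]
custom_arg() {
    file=$1; short=$2; ftype=$3
    annot=${4:-overlap}        # blank means "overlap"
    coords=${5:-0}             # blank means "0"
    case $ftype in
        bed|gff|gtf|vcf|bigwig) ;;
        *) echo "unknown file type: $ftype" >&2; return 1 ;;
    esac
    case $annot in
        exact|overlap) ;;
        *) echo "unknown annotation type: $annot" >&2; return 1 ;;
    esac
    case $coords in
        0|1) ;;
        *) echo "force report coordinates must be 0 or 1" >&2; return 1 ;;
    esac
    arg="$file,$short,$ftype,$annot,$coords"
    # Any remaining arguments are INFO field names (VCF files only).
    if [ "$#" -gt 5 ]; then
        shift 5
        for field in "$@"; do arg="$arg,$field"; done
    fi
    printf '%s\n' "$arg"
}

custom_arg clinvar.vcf.gz ClinVar vcf exact 0 CLNSIG CLNREVSTAT CLNDN
```

The result can then be passed on the command line as `--custom "$(custom_arg ...)"` alongside the other VEP options.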
Example - ClinVar
We include the most recent public variant and phenotype data available in each Ensembl release, but some projects release data more frequently than we do.
If you want the very latest annotations, you can take the data files from your preferred projects (in any format listed in Data formats) and use them as a VEP custom annotation.
The example below shows how to use ClinVar VCF files as a VEP custom annotation:
- Download the VCF files (you need the compressed VCF file and the index file), e.g.:
# Compressed VCF file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
# Index file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
- Example command:
./vep [...] --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN
## Where the selected ClinVar INFO fields (from the ClinVar VCF file) are:
# - CLNSIG: Clinical significance for this single variant
# - CLNREVSTAT: ClinVar review status for the Variation ID
# - CLNDN: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
# Of course you can select the INFO fields you want in the ClinVar VCF file
# Quick example on GRCh38:
./vep --id "1 230710048 230710048 A/G 1" --species homo_sapiens -o /path/to/output/output.txt --cache --offline --assembly GRCh38 --custom /path/to/custom_files/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN
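Before running VEP, it may be worth confirming that the INFO fields you plan to request are actually declared in the downloaded VCF header. The sketch below assumes header lines of the usual form `##INFO=<ID=NAME,...>` with ID as the first key; `missing_info_fields` is a hypothetical helper name. It reads the file with gzip, which can decompress bgzip output.

```shell
# Print each requested INFO field that is not declared in the header
# of a (b)gzip-compressed VCF, one name per line.
# Usage: missing_info_fields FILE.vcf.gz FIELD [FIELD...]
missing_info_fields() {
    vcf=$1; shift
    # Header lines start with '##'; stop reading at the first other line.
    header=$(gzip -dc "$vcf" | awk '/^##/ { print; next } { exit }')
    for field in "$@"; do
        printf '%s\n' "$header" | grep -q "^##INFO=<ID=$field," ||
            printf '%s\n' "$field"
    done
}

# Only runs if the ClinVar file from the download step is present.
if [ -f clinvar.vcf.gz ]; then
    missing_info_fields clinvar.vcf.gz CLNSIG CLNREVSTAT CLNDN
fi
```

An empty result means every requested field is declared in the header.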