CRAVAT: Cancer-Related Analysis of VAriants Toolkit

Introduction

CRAVAT is a web server with a simple interface for performing cancer-related analysis of variants. To cite CRAVAT, please use this article.

CRAVAT currently employs three analysis tools: CHASM, SNVGet, and VEST. For more information on these tools, refer to the Analysis Tools chapter.

To learn how to use CRAVAT, refer to the How to Use chapter.

To learn how to interpret the reports produced by CRAVAT, refer to the Output Report chapter.

How to Cite

To cite CRAVAT, please use the following literature:
  • Douville C, Carter H, Kim R, Niknafs N, Diekhans M, Stenson PD, Cooper DN, Ryan M, Karchin R (2013). CRAVAT: Cancer-Related Analysis of VAriants Toolkit. Bioinformatics, 29(5):647-648.
To cite CHASM, please use the following literature:
  • Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R (2009). Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res, 69(16):6660-7.
To cite VEST, please use the following literature:
  • Douville C, Masica DL, Stenson PD, Cooper DN, Gygax DM, Kim R, Ryan M, Karchin R (2015). Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-indel). Human Mutation, doi: 10.1002/humu.22911.
  • Carter H, Douville C, Stenson P, Cooper D, Karchin R (2013). Identifying Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genomics, 14(Suppl 3):S3.
To cite SNVBox, please use the following literature:
  • Wong WC, Kim D, Carter H, Diekhans M, Ryan M, Karchin R (2011). CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics, 27(15):2147-2148.

Analysis Tools of CRAVAT

CRAVAT currently employs three analysis tools, CHASM, SNVGet, and VEST:
  • CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) is a method that predicts the functional significance of somatic missense variants observed in the genomes of cancer cells. It allows variants to be prioritized for subsequent functional studies based on the probability that they confer increased fitness to a cancer cell. CHASM uses a machine learning method called Random Forest to distinguish between driver and passenger somatic missense variants. The Random Forest is trained on a positive class of drivers curated from the COSMIC database and a negative class of passengers generated in silico according to passenger base substitution frequencies estimated for a specific tumor type. Each variant is represented by a list of features, including amino acid substitution properties, alignment-based estimates of conservation at the variant position, predicted local structure, and annotations from the UniProt Knowledgebase. Only missense mutations are analyzed by CHASM. For more information on CHASM, please visit http://wiki.chasmsoftware.org and refer to this and this article.
  • VEST is a method that predicts the functional effect of a variant. The classifier and null distribution for VEST were updated on November 12, 2012, so VEST results obtained before November 13, 2012 may differ from those obtained after that date. For more information on VEST, please visit http://wiki.chasmsoftware.org and refer to this article.
  • SNVGet retrieves selected predictive features for a variant. Features can be broadly categorized into 3 types:
    • Amino Acid Substitution features
    • Protein-based position-specific features
    • Exon-specific features
    Only missense mutations are analyzed by SNVGet. For more information on SNVBox (the database that SNVGet queries), please visit http://wiki.chasmsoftware.org and refer to this article.
  • For more information on CRAVAT, please refer to this article.

User Account

CRAVAT provides user accounts. You can create an account, retrieve or change your password, and check the status of your jobs and download their results through the "My Jobs" page. Your username is your email.

There are two ways to create a CRAVAT account:
  • When you submit a job for the first time, CRAVAT will create an account with your email and a temporary password and this account information will be sent to you as a part of the result notification email.
  • A CRAVAT account can be created by clicking "Log-In" > "Create an account" on the top menu.
If you forget your password, click "Log-In" > "Forgot password?", type your username (your email), and click "Submit". A temporary password will be sent to you.
To change your password, first log in, and then click "My Profile" > "Change password". In the "Change Password" pop-up window, type your current password, your new password, and again your new password. Click "Submit".
After logging in, click "My Jobs" on the top menu to open the My Jobs page in a new browser tab. This page shows your past and current jobs with their parameters and status (success, fail, running, or in-queue). You can download the result files from this page by clicking "Here" in the "Result file" column.

Submitting an Analysis Job

Input

Prepare the mutations you wish to score as text, either as amino-acid residue substitutions or as genomic-coordinate variants, in the following formats:
  • Comment lines: All lines that start with ">", "#", or "!" are ignored as comments.
  • UID in the examples below is a string that uniquely identifies each variant-sample pair. A UID must not contain a comma.
  • Genomic-coordinate format (separated by a tab or a space):
    				# UID / Chr. / Position / Strand / Ref. base / Alt. base / Sample ID (optional)
    				TR1	chr17	7577506	-	G	T	TCGA-02-0231
    				TR2	chr10	123279680	-	G	A	TCGA-02-3512
    				TR3	chr13	49033967	+	C	A	TCGA-02-3532
    				TR4	chr7	116417505	+	G	T	TCGA-02-1523
    				TR5	chr7	140453136	-	T	A	TCGA-02-0023
    				TR6	chr17	37880998	+	G	T	TCGA-02-0252
    				Ins1 chr17	37880998	+	-	T	TCGA-02-0252
    				Del1 chr17	37880998	+	A	-	TCGA-02-0252
    				CSub1 chr2	39871235	+	ATGCT	GA	TCGA-02-0252
    				
    Position is a 1-based open coordinate. For insertions, use "-" as the reference base; for deletions, use "-" as the alternate base. In the above example, Ins1 inserts "T" between the 37880997th and the 37880998th bases, Del1 deletes the "A" at the 37880998th position, and CSub1 changes "ATGCT" at the 39871235th through 39871239th positions to "GA". If you do not have strand information from your sequencing results, it is likely that the variants are all reported on the + strand. Make sure that the reference base you report matches the base at the reported position in the hg19 reference sequence (or hg18 if you checked the hg18 checkbox).

    * The old format for indels, in which you specify the base before the insertion/deletion location, is still supported. However, if the old format is used in any row of your input, your entire input will be treated as being in the old format.

  • Amino-acid residue substitution format (separated by a tab or a space):
    						# UID / Transcript / AA change / Sample ID (optional)
    						TR1	NM_001126116.1	D127Y	TCGA-02-0231
    						TR2	NM_001144919.1	R162Q	TCGA-02-3512
    						TR3	NM_000321.2	Q702K	TCGA-02-3532
    						TR4	NM_000245.2	A1108S	TCGA-02-1523
    						TR5	NM_004333.4	V600E	TCGA-02-0023
    						TR6	NM_001005862.1	G746V	TCGA-02-0252
    						
    Transcript identifiers can be from NCBI RefSeq (NM accessions), CCDS, or Ensembl (ENST accessions). RefSeq and CCDS accessions can be specified without version numbers. The format of the "AA change" column is (reference AA)(AA position)(alternate AA), without "(" and ")". Reference and alternate AAs must each be a single residue from the 20 standard amino acids.
VCF format v4.0 and above is supported by CRAVAT. CRAVAT converts VCF format input to CRAVAT format and uses the converted input for analysis and annotation. Only CHROM, POS, ID, REF, ALT, GT, and sample name fields will be preserved in the conversion (The ID field in VCF format will become the UID field in CRAVAT format. If there are multiple samples in the VCF format input, the sample name will be added to ID in the ID -> UID conversion to differentiate the same variant from different samples).
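
The two CRAVAT-format input layouts described above can be checked locally before upload. The following is a minimal Python sketch under stated assumptions, not CRAVAT's own parser: the function name parse_cravat_line and the dictionary keys are hypothetical, the two layouts are told apart by a simple heuristic (a strand symbol in the fourth column), and the old indel format is not handled.

```python
import re

COMMENT_PREFIXES = (">", "#", "!")
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_CHANGE = re.compile(r"^([%s])(\d+)([%s])$" % (AMINO_ACIDS, AMINO_ACIDS))


def parse_cravat_line(line):
    """Parse one input line into a dict; return None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(COMMENT_PREFIXES):
        return None
    fields = stripped.split()  # columns are separated by a tab or a space
    uid = fields[0]
    if "," in uid:
        raise ValueError("UID must not contain a comma: %r" % uid)
    # Heuristic assumed by this sketch: a genomic-coordinate row has at least
    # six columns and a strand symbol in the fourth one.
    if len(fields) >= 6 and fields[3] in ("+", "-"):
        return {
            "format": "genomic",
            "uid": uid,
            "chrom": fields[1],
            "position": int(fields[2]),
            "strand": fields[3],
            "ref": fields[4],
            "alt": fields[5],
            "sample": fields[6] if len(fields) > 6 else None,
        }
    # Otherwise try the amino-acid layout: UID, transcript, AA change, sample.
    if len(fields) >= 3:
        match = AA_CHANGE.match(fields[2])
        if match:
            return {
                "format": "protein",
                "uid": uid,
                "transcript": fields[1],
                "ref_aa": match.group(1),
                "aa_position": int(match.group(2)),
                "alt_aa": match.group(3),
                "sample": fields[3] if len(fields) > 3 else None,
            }
    raise ValueError("Unrecognized input line: %r" % line)
```

Running every line of an input file through such a check catches malformed rows early; it does not verify that the reference base matches hg19 or hg18, which only the server can confirm.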

CRAVAT runs jobs through a queuing system with two separate queues for small and large jobs, so that small jobs are not held up behind longer-running jobs. Currently, small jobs are defined as those with 25,000 or fewer mutations. A job with 25,000 mutations takes approximately 1 hour for any analysis, although it may finish earlier if the server is not heavily loaded. As of March 4, 2014, the largest single job CRAVAT has processed had 4 million mutations.

Analysis

Choose an analysis type:
  • Cancer driver analysis: This analysis predicts whether the submitted variants are cancer drivers or not.
  • Pathogenicity analysis: This analysis predicts whether the submitted variants will have any pathogenic effect on their translated proteins or not.
  • Gene annotation only: This analysis provides GeneCards and PubMed information on the genes containing the submitted variants.
When an analysis type is chosen, the options for analysis programs appear. Multiple analysis programs can be chosen, and if any of the programs needs a cancer tissue type to be specified, a list box for selecting the cancer type also appears.
Currently, the following tissue types can be chosen in CRAVAT.
Name    Full name    Source    Date
Bladder    Bladder Urothelial Carcinoma    BLCA (TCGA)    Jun 2013
Blood-Lymphocyte    Chronic Lymphocytic Leukemia    CLL (ICGC)    Mar 2013
Blood-Myeloid    Acute Myeloid Leukemia    LAML (TCGA)    Jun 2013
Brain-Cerebellum    Medulloblastoma    MB (mixed source)    Dec 2010
Brain-Glioblastoma-Multiforme    Glioblastoma Multiforme    GBM (TCGA)    Jun 2013
Brain-Lower-Grade-Glioma    Brain Lower Grade Glioma    LGG (TCGA)    Jun 2013
Breast    Breast Invasive Carcinoma    BRCA (TCGA)    Jun 2012
Cervix    Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma    CESC (TCGA)    Jun 2013
Colon    Colon Adenocarcinoma    COAD (TCGA)    Jun 2013
Head and Neck    Head and Neck Squamous Cell Carcinoma    HNSC (TCGA)    Jun 2013
Kidney-Chromophobe    Kidney Chromophobe    KICH (TCGA)    Jun 2013
Kidney-Clear-Cell    Kidney Renal Clear Cell Carcinoma    KIRC (TCGA)    Jun 2013
Kidney-Papillary-Cell    Kidney Renal Papillary Cell Carcinoma    KIRP (TCGA)    Jun 2013
Liver-Nonviral    Hepatocellular Carcinoma (Secondary to Alcohol and Adiposity)    HCCA (ICGC)    Mar 2013
Liver-Viral    Hepatocellular Carcinoma (Viral)    HCCV (ICGC)    Mar 2013
Lung-Adenocarcinoma    Lung Adenocarcinoma    LUAD (TCGA)    Jun 2013
Lung-Squamous Cell    Lung Squamous Cell Carcinoma    LUSC (TCGA)    Jun 2013
Melanoma    Melanoma    ML (Yardena Samuels lab)    Dec 2011
Other    General purpose    OV (TCGA)    Jun 2013
Ovary    Ovarian Serous Cystadenocarcinoma    OV (TCGA)    Jun 2013
Pancreas    Pancreatic Cancer    PNCC (ICGC)    Mar 2013
Prostate-Adenocarcinoma    Prostate Adenocarcinoma    PRAD (TCGA)    Jun 2013
Rectum    Rectum Adenocarcinoma    READ (TCGA)    Jun 2013
Skin    Skin Cutaneous Melanoma    SKCM (TCGA)    Jun 2013
Stomach    Stomach Adenocarcinoma    STAD (TCGA)    Jun 2013
Thyroid    Thyroid Carcinoma    THCA (TCGA)    Jun 2013
Uterus    Uterine Corpus Endometrioid Carcinoma    UCEC (TCGA)    Jun 2013

Lastly, check "Include gene annotation" if you want the result email to include the GeneCards and PubMed annotation of the genes containing the submitted variants.

Submit

Enter your email address (not needed if you have logged in). If you want to receive a machine-friendly, tab-separated text version of the CRAVAT analysis report in addition to the default Microsoft Excel version, check "Include text reports for machine processing". Then click "SUBMIT". When all the analyses are complete, an email with the reports will be sent to you. If you have logged in, you can check the status and history of your jobs on the 'My Jobs' page, where you can also download your results by clicking 'Here' in the 'Result file' column.

RESTful Web Service

With CRAVAT's RESTful web service, you can submit jobs and check their status without using a browser.

Jobs

  • Job submission via POST

    URL: http://www.cravat.us/CRAVAT/rest/service/submit
    Method: POST
    Consumes: Multipart/form-data
    Produces: a JSON object, notable fields of which are as follows.
    • status: "submitted" for successful job submission, "submissonfailed" for an error in the job submission
    • errormsg: If there was any error during the job submission, the error message is written here.
    • jobid: The Job ID of the submitted job. This job ID can be used to check the status of the job later using "status" method which is explained below.
    Form data parameters (* = essential parameters):
    • analyses: "CHASM", "SnvGet", "VEST", "CHASM;VEST", "CHASM;SnvGet", "VEST;SnvGet", or "CHASM;VEST;SnvGet"
    • chasmclassifier: classifier name for CHASM analysis
    • *email: email of the submitter
    • functionalannotation: "on" or "off". GeneCards and PubMed annotation.
    • hg18: "on" or "off". Input mutations are in hg18 coordinates or not.
    • *inputfile: Input mutation file. This is from the file input element in the POST form.
    • mupitinput: "on" or "off". MuPIT input format returned or not.
    • tsvreport: "on" or "off". Text format reports returned or not.
  • Job submission via GET

    URL: http://www.cravat.us/CRAVAT/rest/service/submit
    Method: GET
    Produces: a JSON object, notable fields of which are as follows.
    • status: "submitted" for successful job submission, "submissonfailed" for an error in the job submission
    • errormsg: If there was any error during the job submission, the error message is written here.
    • jobid: The Job ID of the submitted job. This job ID can be used to check the status of the job later using "status" method which is explained below.
    Query parameters (* = essential parameters):
    • analyses: "CHASM", "SnvGet", "VEST", "CHASM;SnvGet", or "VEST;SnvGet"
    • chasmclassifier: classifier name for CHASM analysis
    • *email: email of the submitter
    • functionalannotation: "on" or "off". GeneCards and PubMed annotation.
    • hg18: "on" or "off". Input mutations are in hg18 coordinates or not.
    • *mutations: a string with mutations, the format of which is the same as described in the "Input" section above.
    • mupitinput: "on" or "off". MuPIT input format returned or not.
    • tsvreport: "on" or "off". Text format reports returned or not.
  • Job status checking

    URL: http://www.cravat.us/CRAVAT/rest/service/status
    Method: GET
    Produces: a JSON object, notable fields of which are as follows.
    • status: "running" for still running, "success" for successful completion, "jobfailed" for failed
    • errormsg: Error message if the job failed.
    • resultfileurl: If the job completed successfully, the URL of the result file.
    Query parameters (* = essential parameter):
    • *jobid: The job ID to query.
    Example: http://www.cravat.us/CRAVAT/rest/service/status?jobid=test@20140204_102423
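
The GET endpoints above can be driven from a script. The following Python sketch only builds the request URLs and interprets a status reply; the helper names (build_submit_url, build_status_url, interpret_status) are hypothetical, and the sketch assumes the server accepts standard percent-encoded query strings (for example "%40" for "@"), which the documentation above does not state explicitly.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://www.cravat.us/CRAVAT/rest/service"


def build_submit_url(email, mutations, analyses="VEST", **options):
    """Build a GET submission URL; options are extra query parameters such as hg18="on"."""
    params = {"email": email, "mutations": mutations, "analyses": analyses}
    params.update(options)
    return BASE_URL + "/submit?" + urlencode(params)


def build_status_url(jobid):
    """Build the URL that reports the status of a submitted job."""
    return BASE_URL + "/status?" + urlencode({"jobid": jobid})


def interpret_status(response_text):
    """Return (finished, detail) from the JSON body of a status reply.

    detail is the result file URL on success, the error message on failure,
    and None while the job is still running.
    """
    reply = json.loads(response_text)
    status = reply.get("status")
    if status == "success":
        return True, reply.get("resultfileurl")
    if status == "jobfailed":
        return True, reply.get("errormsg")
    return False, None
```

To send a request, a URL built this way can be passed to urllib.request.urlopen and the decoded response body handed to interpret_status; how often to poll is left to the caller.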

Single Variant

  • Single variant Web API

    URL: http://www.cravat.us/CRAVAT/rest/service/query
    Method: GET
    Produces: a JSON object, notable fields of which are as follows.
    • Chromosome: Chromosome of the variant
    • Position: Position of the variant
    • Strand: DNA strand on which the variant lies
    • Reference base: Base(s) at the variant position in the reference genome (hg18 or hg19)
    • Alternate base: Sequence of the variant
    • Hugo symbol: Gene symbol from HUGO in which the variant resides
    • Sequence ontology transcript: Transcript used to get the most severe sequence ontology. If more than one transcript has the most severe sequence ontology, the longest RefSeq transcript (if not, the longest Ensembl one, or the longest CCDS one, in this order) is chosen.
    • Protein sequence change: Protein sequence change for the Sequence ontology column
    • Sequence ontology: Sequence Ontology annotation. See Sequence Ontology section below. When more than one sequence ontology is found due to multiple transcript mapping, the most severe consequence is reported, according to the order of FI, FD, SG, SS, SL, II, ID, CS, MS, and SY.
    • Sequence ontology all transcripts: Sequence ontology for each transcript mapped to the variant position. An asterisk is assigned to the transcript that was used to get the most severe sequence ontology.
    • ExAC total allele frequency: Total allele frequency from ExAC
    • ExAC allele frequency (African/African American): ExAC allele frequency in African and African American population
    • ExAC allele frequency (Latino): ExAC allele frequency in Latino population
    • ExAC allele frequency (East Asia): ExAC allele frequency in East Asian population
    • ExAC allele frequency (Finish): ExAC allele frequency in Finnish population
    • ExAC allele frequency (Non-Finnish European): ExAC allele frequency in Non-Finnish European population
    • ExAC allele frequency (Other): ExAC allele frequency in Other population
    • ExAC allele frequency (South Asian): ExAC allele frequency in South Asian population
    • 1000 Genomes allele frequency: Allele frequency from the 1000 Genomes project
    • ESP6500 allele frequency (European American): Allele frequency in the European American population, from ESP6500
    • ESP6500 allele frequency (African American): Allele frequency in the African American population, from ESP6500
    • Transcript in COSMIC: COSMIC Transcript that is mapped to the input variant
    • Protein sequence change in COSMIC: Protein sequence change caused by the variant, according to COSMIC
    • Occurrences in COSMIC [exact nucleotide change]: How many times the variant is observed in COSMIC
    • Occurrences in COSMIC by primary sites [exact nucleotide change]: How many times the mutation is observed in COSMIC, grouped by primary sites
    • Mappability Warning: Warning codes for whether the mutation's mapping is reliable or not. See Mappability section below.
    • Driver Genes: Cancer driver gene hits (oncogenes and tumor suppressor genes) according to Vogelstein et al.
    • TARGET: TARGET drug association DB hits
    • dbSNP: dbSNP record which has the mutation
    Query parameters (* = essential parameter):
    • *mutation: The chromosome, position, strand direction, reference base, and alternate base of the variant, separated by underscores (chromosome_position_strand_refBase_altBase)
    Example: http://www.cravat.us/CRAVAT/rest/service/query?mutation=chr22_30421786_+_A_T
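
As an illustration of the underscore-separated mutation parameter, the Python sketch below assembles a query URL in the same form as the example above. The function name build_query_url is hypothetical. The "+" strand symbol is left unencoded to match the documented example; whether the server also accepts a percent-encoded "%2B" is not stated here.

```python
QUERY_URL = "http://www.cravat.us/CRAVAT/rest/service/query"


def build_query_url(chrom, position, strand, ref, alt):
    """Join the five variant fields with underscores and append them to the query URL."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-', got %r" % strand)
    mutation = "_".join([chrom, str(position), strand, ref, alt])
    return QUERY_URL + "?mutation=" + mutation


# Reproduces the documented example URL.
print(build_query_url("chr22", 30421786, "+", "A", "T"))
```

The JSON reply can then be decoded with json.loads and the fields listed above looked up by name.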

Downloadable Results

Upon successful submission and analysis, you will receive a link to your results via email; the results will be available for 30 days from the date of submission. If you have logged in, you can also check the status and history of your jobs on the 'My Jobs' page, where you can download your results by clicking 'Here' in the 'Result file' column. The results are delivered as one zip-compressed file containing several report files, including a Microsoft Excel spreadsheet and optional tab-separated text files. There are three levels of analysis: variant, codon, and gene. The spreadsheet has one tab per level, and the tab-separated text reports have one .tsv file per level. SNVGet analysis results also appear as a separate tab or file. The result of the analysis at each level is shown as a table, whose columns are explained below.

Variant Analysis Result

Column Meaning
Input line number Line number from the input file
ID Unique ID of a mutation input line
Chromosome Chromosome of the mutation
Position Position of the mutation
Strand DNA strand on which the mutation lies
Reference base(s) Base(s) at the mutation position in the reference genome (hg18 or hg19)
Alternate base(s) Sequence of the mutation
Sample ID ID of the sample from which the mutation was observed
HUGO symbol Gene symbol from HUGO in which the mutation resides
Sequence ontology Sequence Ontology annotation. See Sequence Ontology section below. When more than one sequence ontology is found due to multiple transcript mapping, the most severe consequence is reported, according to the order of FI, FD, SG, SS, SL, II, ID, CS, MS, and SY.
Protein sequence change Protein sequence change for the Sequence ontology column.
QUAL Phred-scaled quality score for the assertion made in the alternate bases. This column appears only with a VCF-format input.
FILTER PASS if the mutation position passed all filters. Otherwise, a semicolon-separated list of codes for filters that fail (e.g. "q10;s50"). This column appears only with a VCF-format input.
Zygosity "hom" or "het" depending on whether the alternate allele is present on both chromosomes or only one of them, respectively. This column appears only with a VCF-format input.
CHASM cancer driver p-value (missense) Empirically-derived p-value of the CHASM cancer driver score. Only missense mutations are considered.
CHASM cancer driver FDR (missense) Benjamini-Hochberg false discovery rate. Only missense mutations are considered.
VEST pathogenicity p-value (non-silent) Empirically-derived p-value of the VEST pathogenicity score. Only non-silent mutations are considered.
VEST pathogenicity FDR (non-silent) Benjamini-Hochberg false discovery rate. Only non-silent mutations are considered.
Mappability Warning Warning codes for whether the mutation's mapping is reliable or not. See Mappability section below.
Driver Genes Cancer driver gene hits (oncogenes and tumor suppressor genes) according to Vogelstein et al.
TARGET TARGET drug association DB hits
dbSNP dbSNP record which has the mutation
1000 Genomes allele frequency Allele frequency from the 1000 Genomes project
ESP6500 allele frequency (average) Average allele frequency from ESP6500
ExAC total allele frequency Total allele frequency from ExAC
Occurrences in COSMIC by primary sites [exact nucleotide change] How many times the mutation is observed in COSMIC, grouped by primary sites
Number of samples in study having the exact nucleotide change Number of samples in study having the exact nucleotide change
MuPIT Link If the mutation falls on a known protein structure or a homology model (see here), it can be visualized with MuPIT by clicking the link in this column.
GeneCards summary Information on the gene containing the mutation, pulled from GeneCards
Number of retrieved articles from PubMed Number of records retrieved from PubMed using the name of the gene containing the mutation and "cancer" as keywords. The keywords are first searched in MeSH terms. If nothing is found, the titles and abstracts of the literature are searched. If still nothing is found, the keywords are searched without restriction on where they appear.
PubMed search term Link to the PubMed search result with the mutation's gene name and "cancer" as keywords
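
The FDR columns above are described as Benjamini-Hochberg false discovery rates. For readers who want to see how p-values become such rates, here is a minimal Python sketch of the standard Benjamini-Hochberg step-up adjustment. It illustrates the general procedure only; how CRAVAT computes its FDR columns internally is not described here, and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Variants whose adjusted value falls below a chosen threshold (0.05 is a common choice, not a CRAVAT requirement) are the ones usually carried forward.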

Variant Additional Details Result

Column Meaning
Input line number Line number from the input file
ID Unique ID of a mutation input line
Chromosome Chromosome of the mutation
Position Position of the mutation
Strand DNA strand on which the mutation lies
Reference base(s) Base(s) at the mutation position in the reference genome (hg18 or hg19)
Alternate base(s) Sequence of the mutation
Sample ID ID of the sample from which the mutation was observed
HUGO symbol Gene symbol from HUGO in which the mutation resides
Sequence ontology Sequence Ontology annotation. See Sequence Ontology section below. When more than one sequence ontology is found due to multiple transcript mapping, the most severe consequence is reported, according to the order of FI, FD, SG, SS, SL, II, ID, CS, MS, and SY.
Sequence ontology transcript Transcript used to get the most severe sequence ontology. If more than one transcript has the most severe sequence ontology, the longest RefSeq transcript (if not, the longest Ensembl one, or the longest CCDS one, in this order) is chosen.
Sequence ontology transcript strand The strand (+ or -) of the transcript used to get the sequence ontology
Protein sequence change Protein sequence change for the Sequence ontology column
Sequence ontology all transcripts Sequence ontology for each transcript mapped to the variant position. An asterisk is assigned to the transcript that was used to get the most severe sequence ontology.
CHASM cancer driver score transcript Transcript used to get the CHASM cancer driver score
Cancer missense driver score (1 - CHASM score) 1 - CHASM cancer driver score. Closer to 1 means that the mutation is more likely a cancer driver.
CHASM cancer driver p-value (missense) Empirically-derived p-value of the CHASM cancer driver score. Only missense mutations are considered.
CHASM cancer driver FDR (missense) Benjamini-Hochberg false discovery rate. Only missense mutations are considered.
Cancer missense driver score of all transcripts Cancer missense driver score (1 - CHASM score) and p-value of each transcript that has mapping to the input variant. Format is Transcript:Protein sequence change(Cancer missense driver score:CHASM cancer driver p-value). An asterisk is assigned to the transcript that has the highest cancer missense driver score.
VEST pathogenicity score transcript Transcript used to get VEST pathogenicity score
VEST pathogenicity score (missense) VEST pathogenicity score for missense variants
VEST pathogenicity score (frameshift indels) VEST pathogenicity score for frameshift indels
VEST pathogenicity score (inframe indels) VEST pathogenicity score for inframe indels
VEST pathogenicity score (stop-gain) VEST pathogenicity score for stop-gain variants
VEST pathogenicity score (stop-loss) VEST pathogenicity score for stop-loss variants
VEST pathogenicity score (splice site) VEST pathogenicity score for splice site variants
VEST pathogenicity score and p-value of all transcripts (non-silent) VEST pathogenicity score and p-value of each transcript that has mapping to the input variant. Format is Transcript:Protein sequence change(VEST pathogenicity score:VEST pathogenicity p-value). An asterisk is assigned to the transcript that has the highest VEST pathogenicity score.
ESP6500 allele frequency (European American) Allele frequency in the European American population, from ESP6500
ESP6500 allele frequency (African American) Allele frequency in the African American population, from ESP6500
ExAC allele frequency (Latino) ExAC allele frequency in Latino population
ExAC allele frequency (African/African American) ExAC allele frequency in African and African American population
ExAC allele frequency (East Asian) ExAC allele frequency in East Asian population
ExAC allele frequency (Finnish) ExAC allele frequency in Finnish population
ExAC allele frequency (Non-Finnish European) ExAC allele frequency in Non-Finnish European population
ExAC allele frequency (Other) ExAC allele frequency in Other population
ExAC allele frequency (South Asian) ExAC allele frequency in South Asian population
Transcript in COSMIC COSMIC Transcript that is mapped to the input variant
Protein sequence change in COSMIC Protein sequence change caused by the variant, according to COSMIC
Occurrences in COSMIC [exact nucleotide change] How many times the variant is observed in COSMIC
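
The "all transcripts" columns above use the layout Transcript:Protein sequence change(score:p-value). The Python sketch below parses a single entry of that form. Several details are assumptions made for illustration: the function name parse_transcript_entry, the treatment of the asterisk as a marker attached to either end of the transcript name, and the sample values in the usage note, which are hypothetical and not taken from a real report. How multiple entries are separated within one cell is not specified above, so splitting is left to the caller.

```python
import re

# Transcript, then ":", the protein change, then "(score:p-value)".
ENTRY = re.compile(
    r"^(?P<transcript>[^:]+):(?P<change>[^(]+)"
    r"\((?P<score>[^:]+):(?P<pvalue>[^)]+)\)$"
)


def parse_transcript_entry(entry):
    """Parse one 'Transcript:Change(score:p-value)' entry into a dict."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError("Unrecognized entry: %r" % entry)
    transcript = match.group("transcript")
    return {
        # Assumption: the asterisk is attached to the transcript name.
        "transcript": transcript.strip("*"),
        "selected": "*" in transcript,
        "change": match.group("change"),
        "score": float(match.group("score")),
        "pvalue": float(match.group("pvalue")),
    }
```

For example, the hypothetical entry "*NM_004333.4:V600E(0.93:0.001)" would yield the transcript NM_004333.4 flagged as selected, with score 0.93 and p-value 0.001.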

Variant Non-coding Result

Non-coding regions are regions in the genome that are not in a protein coding portion of a gene. This includes UTR, intron, non-coding RNA, and intergenic regions.
Column Meaning
Input line number Line number from the input file
ID Unique ID of a mutation input line
Chromosome Chromosome of the mutation
Position Position of the mutation
Strand DNA strand on which the mutation lies
Reference base(s) Base(s) at the mutation position in the reference genome (hg18 or hg19)
Alternate base(s) Sequence of the mutation
Sample ID ID of the sample from which the mutation was observed
HUGO symbol Gene symbol from HUGO in which the mutation resides
Sequence ontology Sequence Ontology annotation. See Sequence Ontology section below. When more than one sequence ontology is found due to multiple transcript mapping, the most severe consequence is reported, according to the order of FI, FD, SG, SS, SL, II, ID, CS, MS, and SY.
QUAL Phred-scaled quality score for the assertion made in the alternate bases. This column appears only with a VCF-format input.
FILTER PASS if the mutation position passed all filters. Otherwise, a semicolon-separated list of codes for filters that fail (e.g. "q10;s50"). This column appears only with a VCF-format input.
Zygosity "hom" or "het" depending on whether the alternate allele is present on both chromosomes or only one of them, respectively. This column appears only with a VCF-format input.
Mappability Warning Warning codes for whether the mutation's mapping is reliable or not. See Mappability section below.
dbSNP dbSNP record which has the mutation
1000 Genomes allele frequency Allele frequency from the 1000 Genomes project
ESP6500 allele frequency (average) Average allele frequency from ESP6500
ExAC total allele frequency Total allele frequency from ExAC
Occurrences in COSMIC by primary sites [exact nucleotide change] How many times the mutation is observed in COSMIC, grouped by primary sites
Number of samples in study having the exact nucleotide change Number of samples in study having the exact nucleotide change

Gene Level Analysis Result

Column Meaning
HUGO Symbol Gene symbol from HUGO in which the mutation resides
Sequence ontology Sequence Ontology annotation. See Sequence Ontology section below.
Cancer missense driver score (1-CHASM score) Highest cancer missense driver score found in the gene. The closer to 1, the more likely the gene contains a cancer driver variant.
VEST pathogenicity score (non-silent) Highest VEST pathogenicity score found in the gene. The closer to 1, the more likely the gene contains a pathogenic variant.
VEST pathogenicity composite p value (non-silent) Composite p-value based on Stouffer's Z-score method
VEST pathogenicity FDR (non-silent) Composite FDR based on Stouffer's Z-score method
Driver Genes Cancer driver gene hits (oncogenes and tumor suppressor genes) according to Vogelstein et al.
TARGET TARGET drug association DB hits
Occurrences in COSMIC [gene mutated] How many times any mutation in the gene is observed in COSMIC
Occurrences in COSMIC by primary sites [gene mutated] How many times any mutation in the gene is observed in COSMIC, grouped by primary sites
Number of samples in study having the gene mutated Number of samples in study having the gene mutated
MuPIT Link If the mutations in the gene fall on a known protein structure, they can be visualized with MuPIT by clicking the link in this column.
GeneCards summary (from http://www.genecards.org) Information on the gene containing the mutation, pulled from GeneCards
Number of retrieved articles from PubMed Number of records retrieved from PubMed using the name of the gene containing the mutation and "cancer" as keywords. The keywords are first searched in MeSH terms. If nothing is found, the titles and abstracts of the literature are searched. If still nothing is found, the keywords are searched without restriction on where they appear.
PubMed search term Link to the PubMed search result with the mutation's gene name and "cancer" as keywords
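
The gene-level composite p-value above is described as based on Stouffer's Z-score method. The Python sketch below shows the unweighted form of that method for combining one-sided p-values. It is an illustration under assumptions, not CRAVAT's implementation: whether CRAVAT applies weights or other adjustments is not stated here, and the function name stouffer_p_value and the clamping constant are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_p_value(p_values):
    """Combine one-sided p-values with the unweighted Stouffer Z-score method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    normal = NormalDist()
    eps = 1e-12  # the inverse CDF is undefined at exactly 0 and 1, so clamp
    z_scores = [
        normal.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps)) for p in p_values
    ]
    # Sum of k standard normal Z-scores, rescaled to unit variance.
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - normal.cdf(combined_z)
```

Several moderately small p-values in the same gene combine into a smaller composite value, which is why a gene can stand out even when no single variant does.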

SNVBox Analysis Result

Column Meaning
Input line number Line number from the input file
ID Unique ID of a variant input line
Chromosome Chromosome of the mutation
Position Position of the mutation
Strand DNA strand on which the mutation lies
Reference base(s) Base(s) at the mutation position in the reference genome (hg18 or hg19)
Alternate base(s) Sequence of the mutation
Sample ID ID of the sample from which the mutation was observed
HUGO Symbol Gene symbol from HUGO in which the mutation resides
Sequence ontology Sequence Ontology annotation. See Sequence Ontology section below.
Sequence ontology transcript Transcript used to get the most severe sequence ontology. If more than one transcript has the most severe sequence ontology, the longest RefSeq transcript (if not, the longest Ensembl one, or the longest CCDS one, in this order) is chosen.
Protein sequence change Position and amino acid changed by the variant, in the representative transcript
For a comprehensive explanation of the other columns of the SNVBox analysis result table, please refer to this document.

Input Errors Result

Column Meaning
Input line number Line number from the input file
Input line UID Unique ID of the input line
Gene The gene the input occurs on
Error The reason this input line caused an error
Input Line The variant from the input file that caused the error

Sequence Ontology Codes

Code    Meaning
SY    Synonymous Variant
SL    Stop Lost
SG    Stop Gained
MS    Missense Variant
II    Inframe Insertion
FI    Frameshift Insertion
ID    Inframe Deletion
FD    Frameshift Deletion
CS    Complex Substitution

The source of Sequence Ontology terms is here.
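
The result tables state that when a variant maps to several transcripts, the most severe consequence is reported in the order FI, FD, SG, SS, SL, II, ID, CS, MS, SY. The Python sketch below applies that ordering to a list of codes. The function name most_severe is hypothetical, and ignoring unrecognized codes is a choice made for this sketch. Note that SS appears in the severity order although it is not listed in the code table above.

```python
# Severity order as given in the result table descriptions, most severe first.
SEVERITY_ORDER = ["FI", "FD", "SG", "SS", "SL", "II", "ID", "CS", "MS", "SY"]
RANK = {code: index for index, code in enumerate(SEVERITY_ORDER)}


def most_severe(codes):
    """Return the most severe code in codes, or None if none is recognized."""
    known = [code for code in codes if code in RANK]
    if not known:
        return None
    # A lower rank means a more severe consequence.
    return min(known, key=RANK.__getitem__)
```

For a variant annotated as SY on one transcript, MS on another, and SG on a third, this ordering selects SG.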

Mappability Codes

Code    Meaning
A75    The hg19 reference genome has more than 1 location with the 75 mer sequence from the query position
ACR    ACRO1 (Human acromeric satellite)
ALC    ALR/Alpha
BSR    Beta satellite repeat/beta
CAT    (CATTC)n
CHM    Chromosome M
CNR    Centromeric Repeat
GAA    (GAATG)n
GAG    (GAGTG)n
HMI    High artifact island
LMI    Low artifact island
LSU    Large subunit rRNA Hsa
snR    Small nuclear RNA
SSU    Small subunit rRNA Hsa
STL    Satellite repeat
TAR    TAR1
TII    HSATII (Human satellite II DNA)
TLM    Telomeric repeat

The source of the mappability tags is here.

CRAVAT Galaxy Tool

A Galaxy Tool for querying CRAVAT is available at https://toolshed.g2.bx.psu.edu/view/insilicosolutions/cravat/9e29dd2972ab. CRAVAT input format is used.