The following are questions that have been frequently asked.
For more support please open a new issue.
Both Cancer Cell Fractions (CCF) and binary data, but only one type of data at a time (you should not use CCF for some patients with, and binary data for others).
To generate the input data for the tool, you need to prepare a set of high-quality somatic calls for all of your patients (e.g., mutations and copy number data). Then:
To generate CCF values you can use tools for sub-clonal deconvolution that estimate the sub-populations admixture in your bulk samples. This is a common step for phylogenetic analysis of bulk cancer sequencing data, we suggest to use tools like MOBSTER, pyClone, sciClone etc.
If you use binary data, you do not need to carry out sub-clonal deconvolution and you can immediately use
Important remark. The analysis with binary data is faster to set up, but is usually lower-resolution compared to the analysis with CCF.
The tool can compute trees form your data, but can also receive pre-computed trees in input.
According to the input data distinct types of trees are computed:
clone trees from CCF;
mutation trees from binary data
There are no hard constraint, but rather suggestions that apply to any Cancer Evolution analysis.
At a broad level, to get reliable CCFs estimates, whole-genome sequencing generates the best signal to deconvolve cancer subpopulations. However, you can carry out a CCF-based analysis from whole-exome sequencing, as shown in the TRACERx study that we analysed with
REVOLVER. To use just binary data and mutation trees, you can also use a simpler targeted panel.
Notice that sequencing coverage and purity impact on the quality of the analysis; if you want to detect small sub-clones, or low-frequency mutations in a CCF-based analysis, you have to sequence at high coverage samples with “good” purity. The TRACERx study, for instance, reached ~430x median coverage on exome regions.
To find repeated evolutionary trajectories, you need to detect a reliable statistical signal. If your target subtype contains 10% of certain cancer types, maybe you won’t be able to pick it up with ~50 patients. Note that the best cohorts that we analyzed so far, which are the largest available to date, have either 50 or 100 patients each.
Concerning the number of samples per patient, it very much depends on the level of spatial heterogeneity of the tumour tissue under scrutiny.
REVOLVER supports any number of samples/ biopsies and, in principle, with CCFs, you can also use single-samples cohorts. Instead, in order to use binary data you need multi-region sampling (>1 sample per patient).
Always beware that the extent to which a single-biopsy can capture the phylogenetic structure of a complex tumour is debatable.
We did not focus the manuscript on this topic because there are not yet multi-patient single-cell datasets available in the community.
The statistical framework and the implementation, however, support this type of data because, once the trees for each patient have been built and scored, then the inference strategy is independent of the input data.
Notice that since single-cell data are binary, you will have to use
REVOLVER’s mutation trees that 1) do not allow for missing data and 2) employ an infinite site assumption. Probably, if you have a single-cell tree-generation method, you might want to input your trees for this analysis.
If you detect multiple driver mutations in different positions/ nucleotides of a gene
GENE, then these might appear in (possibly) different positions of the phylogentic tree for the patient. It is possible to handle these events in
REVOLVER, but this requires some considerations.
REVOLVER computes correlated evolutionary trajectories among recurrent drivers, which are indexed using an ID-based identification system. The ID has to be unique, and consistent across all patients where we think the same driver occurs; in many cases this could be just the gene name
In the above case, however, id-ing all the driver mutations in the gene as
GENE renders the id not-unique. This would not work with the current model implemented by
REVOLVER, and the tool would raise an error during the computation. You can circumvent this by using a finer-grained resolution in the preparation of these data: for instance, you can add to the id the nucleotide position of the mutation, using something like
YYY are positions or protein domains.
Doing so would allow one using multiple driver events onto the same gene. A obvious drawback could be, however, that those drivers might be far less “recurrent” across the cohort, requiring larger datasets to find the repeated trajectories that involve those events.
A phylogenetic clone tree is a tree built from CCFs, real values that provide the estimate of cancer cells harbouring a certain SNV/ Copy Number/ etc.
A mutation tree is built from binary data, which is a simple 0/1 format to represent the absence/ presence of a certain SNV/ Copy Number/ etc. in a sample.