The Community Oncologist |
Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, and Biomedical Research Institute, Foundation for Research and Technology-Hellas, Ioannina, Greece and Institute for Clinical Research and Health Policy Studies, Tufts University School of Medicine, Boston, Massachusetts, USA
Key Words. Molecular profiling • Microarrays • Clinical use • Clinical practice • Prediction • Prognosis
Correspondence: John P.A. Ioannidis, M.D., Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina 45110, Greece. Telephone: 302651097807; Fax: 302651097867; e-mail: jioannid{at}cc.uoi.gr
Received November 1, 2006; accepted for publication January 10, 2007.
Access and take the CME test online and receive 1 AMA PRA Category 1 Credit™ at CME.TheOncologist.com
LEARNING OBJECTIVES
After completing this course, the reader will be able to:
| ABSTRACT |
|---|
Disclosure of potential conflicts of interest is found at the end of this article.
| INTRODUCTION |
|---|
Cancer applications represent the most prominent medical domain of this exponentially increasing literature [4]. Moreover, breast cancer is the first field in which molecular profiling has been approved and reimbursed for clinical use. Several other molecular profiles are also close to clinical application in cancer patients. The interest is intense: several of the most-cited articles across all medicine in the last decade pertain to molecular profiling [5]. This review examines whether molecular-profiling technologies are likely to affect clinical decision making in the routine management of cancer patients. I discuss what the difficulties are and how they could possibly be bypassed.
| FAILED RESEARCH ON CANCER MARKERS: WHAT IS DIFFERENT NOW? |
|---|
The relative failure of cancer markers to date can teach us several lessons that may be useful to also consider for complex molecular profiles. Until now, tumor marker research has depended mostly on small underpowered studies. Most of them represent opportunistic analyses performed post hoc based on readily available databases and assays. Validation of claims has been uncommon, fragmented, and incomplete [10, 11].
What is different now? Studies on molecular profiling do not have larger sample sizes than past one-marker-at-a-time studies [12]. However, they are much larger in terms of complexity and volume of information. In theory, they should capture biological complexity more comprehensively [13]. However, this complexity requires even more heightened attention to robust design, methodological detail, and avoidance of bias. Table 1 shows some prerequisites to making a cancer marker useful for the clinic. One may examine whether the current status of new molecular profiles satisfies these criteria and whether we can bypass the problems of the past.
| ASSAY DEVELOPMENT AND STANDARDIZATION |
|---|
This complexity sensitized researchers early to the need for presenting information in a way that crucial aspects of the measurements are accurately conveyed. The minimum information about a microarray experiment (MIAME) guidelines largely serve this purpose [16]. Nevertheless, it is unclear whether the whole series of complex decisions can be fully captured even with best intentions. This does not pertain only to the laboratory component of the measurements. The analytical calibration and informatics analysis plan can also be very convoluted. Preprocessing of data can be cumbersome. Major decisions need to be made for transformation, normalization, data filtering, use of shrinkage methods, removal of technical artifacts, and whether background subtraction is to be used, to name a few decision nodes [17, 18]. For each decision node, the possibilities are numerous, and new methods and software appear almost weekly in the literature. Additional decisions are made on quality control measures to eliminate, or not, samples and readings as ineligible. Finally, analysis of data can use unsupervised or supervised methods with literally hundreds of minor or major variants on how exactly to arrive at a molecular profile. Bias can lurk at each of these steps [19, 20].
The availability of routine informatics platforms does not necessarily improve the situation. Many of the analytical decisions are made with little understanding about what they entail. A side effect of the simplification of commercial statistical software has been the ability of nonexperts to apply it. This risk is magnified with bioinformatics software. Moreover, data may be analyzed with many different approaches, but only the "best" results may be shown.
It is not unfair to say that any molecular profile emerges eventually out of an abyss of experiments and analyses. This is still acceptable, provided that whatever emerges from the abyss is then standardized for further extended testing and practical use. Standardization means that the proposed profile is made definitive and considered fixed for further testing; detailed instructions are provided so that the profile can be measured and analyzed in exactly the same way in different collections of data.
As of now, several molecular profiles have reached the point of being fairly standardized. This is particularly the case for breast cancer, where at least four profiles exist based on a 21-gene recurrence score (Oncotype DX®; Genomic Health, Redwood City, CA), 70-gene signature, 76-gene signature, and wound-response profile [21–24]. Some profiles for hematological malignancies may also be considered standardized [25–27]. For other malignancies, profiles are mostly in a more exploratory phase.
Standardization should not leave ambivalent points and subjective interpretation. For example, if a molecular profile is supposed to create three categories of high risk, intermediate risk, and low risk, then these three categories should be kept separate and ordered in all subsequent analyses and evaluations of the profile. Oncotype DX® has such a categorization. However, in some validation analyses the intermediate category has been merged with the low category, with the excuse that "their survival curves were not significantly different" [28]. Other analyses focus more on the comparison of the extreme (high versus low) categories [21]. Even for such a simple three-class categorization, there are at least seven contrasts that can be conceived of, and of these only the first one properly represents the original categorization (Table 2). When continuous value cutoffs are involved, changing the cutoff can create a literature with an infinite number of variants of the classifier.
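Table 2 is not reproduced here, so the following is only a sketch of one way the count of seven can arise: the original ordered three-way comparison plus every two-group contrast that can be formed by merging or dropping categories. The function name `enumerate_contrasts` and the category labels are illustrative assumptions, not part of any published analysis.

```python
from itertools import product

# Illustrative labels for a three-class profile (an assumption for this sketch).
CATEGORIES = ("low", "intermediate", "high")


def enumerate_contrasts(categories=CATEGORIES):
    """List the three-way comparison plus every distinct two-group contrast."""
    # First entry: the original categorization, all classes kept separate.
    contrasts = [tuple((c,) for c in categories)]
    seen = set()
    # Put each category in group A, in group B, or leave it out of the contrast.
    for assignment in product(("A", "B", None), repeat=len(categories)):
        group_a = tuple(c for c, g in zip(categories, assignment) if g == "A")
        group_b = tuple(c for c, g in zip(categories, assignment) if g == "B")
        if not group_a or not group_b:
            continue  # a contrast needs two non-empty groups
        key = frozenset((group_a, group_b))  # A-vs-B equals B-vs-A
        if key not in seen:
            seen.add(key)
            contrasts.append((group_a, group_b))
    return contrasts


if __name__ == "__main__":
    for number, contrast in enumerate(enumerate_contrasts(), start=1):
        print(number, " vs ".join("+".join(group) for group in contrast))
```

Under these assumptions the enumeration yields seven comparisons, of which only the first keeps all three categories separate and ordered.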
| DEMONSTRATION OF DIAGNOSTIC, PROGNOSTIC, AND PREDICTIVE PERFORMANCE |
|---|
There are two approaches to test whether the proposed profile is an accurate classifier: cross-validation and independent validation [17]. Regardless of the approach, different metrics may be used to describe the classifier performance. One may present statistical testing measures (e.g., p-values), multiplicative effect measures (e.g., likelihood ratios or hazard ratios), or absolute effect measures (e.g., sensitivity and specificity). While all information has some utility, absolute effect measures are most meaningful from a clinical perspective (Table 3) [29].
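As a minimal sketch of how these metrics relate to one another, the function below derives sensitivity, specificity, and likelihood ratios from a 2×2 table of predicted versus observed outcomes. The counts in the usage example are hypothetical and were chosen only to give round numbers; the function name `classifier_metrics` is likewise an assumption for illustration.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Summarize a 2x2 table (true/false positives and negatives).

    Assumes every margin is non-zero and specificity is strictly
    between 0 and 1, so that no ratio divides by zero.
    """
    sensitivity = tp / (tp + fn)  # share of events the profile flags
    specificity = tn / (tn + fp)  # share of non-events it clears
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 patients with the event, 100 without.
    for name, value in classifier_metrics(tp=90, fp=60, fn=10, tn=40).items():
        print(f"{name}: {value:.2f}")
```

With these hypothetical counts, a sensitivity of 0.90 and a specificity of 0.40 correspond to a positive likelihood ratio of only 1.5, which illustrates why a single favorable metric can give a misleading impression of clinical usefulness.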
| VALIDATION OF PERFORMANCE |
|---|
Independent validation means that a profile is generated in one data set and is then tested in one or more completely different data sets. It is not the same population being resampled, and there is no overlap between the training and testing datasets. In theory, independent validation is a most rigorous technique. If it is applied correctly, results should be reliable. However, there are many reasons why independent validation may not be performed or reported appropriately, as summarized and exemplified in Table 4. Validation may further be compromised by flexibility in definitions of outcomes, including even survival [30].
These bias threats are not just theoretical concerns. We know that they eventually have an impact in the circulating literature. While we cannot tell which of these several biases has played the key role(s) in each case, we have evidence that the validation performance of several proposed signatures in the literature is inflated. Michiels et al. [31], using a multiple random sampling approach, showed that of seven molecular profiles with proposed high classifier accuracy, five really should not have had classifier accuracy better than chance, if the training and validation had been performed truly without any bias. The other two had only modest classifier accuracy. Some published classifier accuracies were completely incompatible even with the 95% credibility boundary of what one would get based on 500 possible training–testing validations: the published results were far too good to be true. The gradual decline in classifier accuracy across subsequent studies is occasionally alarming. The first paper on the 70-gene signature proposed practically perfect accuracy, while the latest "validation" shows sensitivity of 90% and specificity of about 40% for time to distant metastasis at 5 years, and slightly worse performance for survival at 10 years [22, 29].
Another possible hint to bias with inflated results is the fact that the genes selected for each proposed profile are very unstable. Juxtaposition of the proposed profiles for breast cancer shows that their overlap is minimal to nonexistent. Different splits of the training and validation data result in very different genes being selected [32]. To have some certainty about the selected genes, the required sample size for the training process should be in the range of many thousands [33], that is, about 100-fold larger than the sample sizes that have been used to date. A counterargument, however, is that specific genes are not so important and what matters for a profile is to include some members of key pathways implicated in the biological behavior. Different genes may have interchangeable roles in different profiles [28].
While some interchangeable features make sense, one would be skeptical when nearly all proposed multigene profiles seem to work well. The predictions of the poorly overlapping 70-gene signature and 21-gene recurrence score of Oncotype DX® agree for about 80% of patients, and similar high concordance is seen between other distinct molecular profiles [28]. However, this is based on a dataset that was used in part for the training of three of the five compared profiles (including three of the four that show concordance). In general, retaining the training data in a combined training-plus-testing database spuriously inflates the accuracy and concordance of profiles. Training data should always be discarded in the independent validation process. Figure 1 shows another example, in which consideration of the training and testing data together can make a failed validation seem as if it were highly successful [34, 35].
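The inflation that results from pooling training and testing data can be illustrated with a toy simulation. The sketch below is not a reconstruction of any published analysis: it assumes pure-noise "expression" data with no true signal, a crude gene-selection rule, and arbitrary sample sizes, and all names (`simulate_validation`, `n_selected`, and so on) are hypothetical.

```python
import random


def simulate_validation(n_train=20, n_test=200, n_genes=500, n_selected=20, seed=0):
    """Train a toy profile on pure noise and score it three ways."""
    rng = random.Random(seed)

    def make_samples(n):
        # Labels alternate 0/1; expression values are noise unrelated to the label.
        labels = [i % 2 for i in range(n)]
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n)]
        return data, labels

    train_x, train_y = make_samples(n_train)
    test_x, test_y = make_samples(n_test)

    def class_mean(gene, label):
        values = [row[gene] for row, y in zip(train_x, train_y) if y == label]
        return sum(values) / len(values)

    # "Training": rank genes by their class difference in the training set only.
    mean1 = [class_mean(g, 1) for g in range(n_genes)]
    mean0 = [class_mean(g, 0) for g in range(n_genes)]
    selected = sorted(
        range(n_genes), key=lambda g: abs(mean1[g] - mean0[g]), reverse=True
    )[:n_selected]

    def predict(row):
        # Each selected gene adds its signed distance from the class midpoint.
        score = 0.0
        for g in selected:
            direction = 1.0 if mean1[g] > mean0[g] else -1.0
            score += direction * (row[g] - (mean1[g] + mean0[g]) / 2.0)
        return 1 if score > 0 else 0

    def accuracy(data, labels):
        return sum(predict(row) == y for row, y in zip(data, labels)) / len(labels)

    return {
        "training_only": accuracy(train_x, train_y),
        "independent_test": accuracy(test_x, test_y),
        "training_plus_test": accuracy(train_x + test_x, train_y + test_y),
    }


if __name__ == "__main__":
    for name, value in simulate_validation().items():
        print(f"{name}: {value:.2f}")
```

Because the data contain no signal, accuracy in the independent test set should hover around chance, whereas accuracy in the training set is expected to be much higher. The pooled figure is a weighted average of the two, so it looks better than the honest estimate whenever training samples are retained.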
| PROVISION OF INDEPENDENT INFORMATION BEYOND AND ABOVE CLASSIC RISK FACTORS |
|---|
For most applications of molecular profiling to date, such investigation of incremental classification ability is not performed at all [12]. Large datasets are required to do this [36]. Other excuses include the lack of information or incomplete information on classic risk factors, poor standardization of classic risk factors, and even a lack of consensus about which classic risk factors are important to consider. Many classic risk factors and tumor markers are supported by unreliable evidence.
Some of the best work on addressing the incremental discriminatory ability of molecular profiles has been done in breast cancer. The 21-gene recurrence score and the 70-gene signature may provide independent information after multivariate adjustment for classic risk factors [29, 34]. Even here, though, the verdict is not final, and other investigators have reached opposite conclusions when gene expression is compared with a strong classic predictive system (e.g., the Nottingham Prognostic Index) or optimized combinations of conventional markers [37, 38].
Molecular profiles may have modest to strong correlation with some classic risk factors, for example, tumor size and grade. Therefore, when both the molecular profile and the classic risk factors are included in multivariate models, coefficients may be affected by collinearity and become unstable. Generally, a variable that carries information from many subvariables (the typical molecular profile) may outperform one that is more narrowly defined, when both are included in the same model. Collinearity may do more damage to the coefficient of the classic risk factor.
Sometimes the molecular profile coefficient remains formally significant, while other classic risk factors are no longer statistically significant in the multivariate model. However, this is clinically meaningless when we have routinely available classic risk factors. Age, sex, tumor size, and lymph node status are known without any special effort. It makes no sense to say that if one can measure the molecular profile, then we don't need to know the age, sex, tumor size, and lymph node status. The real question is the absolute increment in discrimination offered by a model with the molecular profile plus classic routine risk factors versus a model with classic risk factors alone. This question is usually not addressed at all. The few presented data suggest that this incremental benefit is small in the best successes to date. For the 70-gene signature for breast cancer, the area under the curve (in the curve showing the interchange between specificity and sensitivity) for metastasis at 5 years improves from 0.659 with the classic risk factors (Adjuvant! Online) to only 0.681 using the molecular profile [29].
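The increment reported above is 0.681 − 0.659 = 0.022 in area under the curve. As a sketch of how such a comparison can be computed, the function below estimates the area under the curve as the probability that a patient with the event receives a higher predicted risk than a patient without it. The risk scores and outcomes in the usage example are invented for illustration and do not come from any study; `auc` is an illustrative helper name.

```python
def auc(scores, outcomes):
    """Area under the ROC curve by pairwise comparison (ties count one half).

    scores: predicted risks; outcomes: 1 for an event, 0 for no event.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Hypothetical data: three patients with the event, four without.
    outcomes = [1, 1, 1, 0, 0, 0, 0]
    classic_only = [0.60, 0.40, 0.30, 0.50, 0.20, 0.35, 0.10]
    classic_plus_profile = [0.70, 0.45, 0.30, 0.50, 0.20, 0.25, 0.10]

    baseline = auc(classic_only, outcomes)
    combined = auc(classic_plus_profile, outcomes)
    print(f"classic risk factors alone: {baseline:.3f}")
    print(f"classic plus profile:       {combined:.3f}")
    print(f"absolute increment:         {combined - baseline:.3f}")
```

The quantity of interest is the last line: the absolute gain of the combined model over the model with classic risk factors alone, rather than the performance of the profile in isolation.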
Paradoxically, the selection of appropriate study populations for which a molecular profile is most needed may sometimes underestimate the importance of classic risk factors. For example, the dataset of 295 women with breast cancer that was used for the training of several of the breast cancer signatures was chosen in a way that ensured that classic predictors had little to offer [39]: eligibility criteria included a strict cutoff to ensure small tumor size, a strict cutoff for young age (52 years), and no infiltration of apical lymph nodes. Within this narrow range, classic risk factors are already reduced to have little influence, so even a mediocre profile would add incremental information. Would this profile be able to add equally incremental information in a more general population in which the classic predictors carry more information? Once in clinical practice, generalization of the use of tumor markers beyond their original training and validation setting ("transportability") is likely to happen [40]. The broader population in which a predictive tool is eventually applied may not necessarily be plausibly related to what was studied in the training and validation setting.
| NONSELECTIVE AND TRANSPARENT ACCUMULATION OF EVIDENCE |
|---|
The exact extent of selective reporting biases in gene-expression profiling research is unknown. However, selection biases are unavoidable when the prevailing mentality remains that these studies should remain data-rich exercises on a few subjects [13]. Failure to act on this front may generate unreliable literature [45]. At a minimum, studies with large sample sizes and those that reach close to clinical translation should enjoy full transparency. The poststudy odds of a research finding being true are small when effect sizes are small, when studies are small, when a field is "hot" (many teams working on it), when there is strong interest in the results, when databases are large, and when analyses are more flexible [45]. Molecular profiling research fulfils all these criteria.
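A simple Bayesian calculation shows why small studies with weak prior support yield low poststudy odds. The sketch below assumes a single unbiased study analyzed at a fixed significance threshold. It deliberately leaves out bias, analytical flexibility, and competition among multiple teams, all of which would lower the result further. The function names and the input values are illustrative assumptions.

```python
def post_study_odds(prior_odds, power, alpha=0.05):
    """Odds that a statistically significant finding is true.

    prior_odds: odds that the probed relationship is real before the study.
    power: probability of a significant result when the relationship is real.
    alpha: probability of a significant result when it is not.
    """
    return prior_odds * power / alpha


def post_study_probability(prior_odds, power, alpha=0.05):
    """Same quantity expressed as a probability."""
    odds = post_study_odds(prior_odds, power, alpha)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical scenarios: a well-powered confirmatory study versus an
    # underpowered exploratory search where few candidates are truly related.
    print(f"{post_study_probability(prior_odds=1.0, power=0.80):.3f}")
    print(f"{post_study_probability(prior_odds=0.01, power=0.20):.3f}")
```

In the second scenario the probability that a significant finding is true falls below 5%, even before any allowance is made for selective reporting.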
| DEMONSTRATION OF CLINICAL EFFECT (EFFICACY) |
|---|
Answering these questions is not easy. For example, Oncotype DX® was approved for use before any results from clinical trials were obtained. Clinical Laboratory Improvement Amendments (CLIA) approval and reimbursement through the Centers for Medicare & Medicaid Services were granted considering only the good performance of retrospective data from available datasets (with the caveats discussed above) and extrapolating that this would also translate into a clinical benefit. Nevertheless, a large clinical trial, the Trial Assigning IndividuaLized Options for Treatment (Rx) (TAILORx), is also already under way to validate the utility of the profile prospectively. Another large trial, the Microarray In Node negative Disease may Avoid ChemoTherapy (MINDACT) trial, has started with the aim to test the utility of the 70-gene signature [15].
Both trials test the utility of the respective molecular signature only in selected strata of patients. In TAILORx, patients are randomized to receive or not receive chemotherapy only if they have an Oncotype DX® score between 11 and 25. Patients with higher scores are all given chemotherapy and patients with lower scores are not given chemotherapy. All patients receive hormonal therapy. In the MINDACT trial, patients whose prediction is "high risk" according to both the traditional prognostic tool (Adjuvant! Online) and the 70-gene signature are given chemotherapy, patients whose prediction is "low risk" according to both tools are not given chemotherapy, while patients who have discordant predictions (high risk by Adjuvant! Online and low risk by the 70-gene signature) are randomized to either receive or not receive chemotherapy.
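The allocation rules described in this paragraph can be written out as two small functions. This is only a restatement of the text as code, not the trials' actual protocols: the function names and return strings are illustrative, and the reverse discordant case in MINDACT (low clinical risk, high molecular risk) is flagged as unspecified because the paragraph does not describe it.

```python
def tailorx_assignment(recurrence_score):
    """Chemotherapy allocation by Oncotype DX score, as described in the text.

    All patients receive hormonal therapy regardless of the result.
    """
    if recurrence_score < 11:
        return "no chemotherapy"
    if recurrence_score <= 25:
        return "randomized: chemotherapy or no chemotherapy"
    return "chemotherapy"


def mindact_assignment(clinical_risk, molecular_risk):
    """Chemotherapy allocation by Adjuvant! Online and 70-gene risk calls."""
    if clinical_risk == "high" and molecular_risk == "high":
        return "chemotherapy"
    if clinical_risk == "low" and molecular_risk == "low":
        return "no chemotherapy"
    if clinical_risk == "high" and molecular_risk == "low":
        return "randomized: chemotherapy or no chemotherapy"
    # Low clinical risk with high molecular risk is not covered by the text.
    return "not specified in the text"


if __name__ == "__main__":
    for score in (5, 11, 25, 26):
        print(score, "->", tailorx_assignment(score))
    print(mindact_assignment("high", "low"))
```

Writing the rules this way makes the point of the paragraph explicit: in both designs, randomization applies only to one stratum, so the trials test the profile only where its prediction is intermediate or disagrees with the conventional tool.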
What outcomes should these trials have? Survival is the most important, patient-relevant outcome [46], but some investigators have argued against survival being the key outcome in cancer trials [47], and trials with survival as the primary endpoint require extremely long follow-up. Moreover, molecular profiling may have no impact on survival, but may still affect other patient-relevant outcomes such as avoidance of drug toxicity and improvement in quality of life. More rational use of treatments may also reduce costs, with less chemotherapy being administered and fewer hospitalizations resulting from adverse events. The clinical utility of potential outcomes should be carefully graded [48].
No matter what outcomes are selected, one has to ensure that the design of the trials carries objectivity. For example, blinding may not be feasible given the nature of the intervention, but allocation concealment should be guaranteed. Knowledge of the intervention assignment may affect the interpretation of soft, subjective outcomes and decisions such as quality of life, use of treatment, and hospital admission.
Of the two ongoing trials, the primary outcomes of TAILORx are disease-free survival, distant recurrence-free interval, recurrence-free interval, and overall survival. The trial expects to follow up patients for up to 20 years. For the MINDACT trial, the primary endpoint is distant metastasis-free survival, but the power calculations are not based on the comparison of the two randomized arms. The trial is powered to reject the null hypothesis that the 5-year distant metastasis-free survival rate is 92% in the discordant-test patients (low risk molecular, high risk clinical prediction), who receive no chemotherapy, if the true rate is 95%. Thus, it aims to prove that molecular prediction can safely be used to spare chemotherapy and thus its toxicity and cost in selected patients.
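To give a rough sense of scale for a single-arm hypothesis of this kind, the sketch below applies the usual normal-approximation sample size formula for a one-sample test of a proportion. It is not the trial's actual calculation: the one-sided significance level of 0.025 and the power of 80% are assumptions chosen for illustration, and censoring, dropout, and the time-to-event nature of the endpoint are ignored. The function name `one_sample_size` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def one_sample_size(p_null, p_true, alpha=0.025, power=0.80):
    """Patients needed to reject p_null when the true proportion is p_true.

    Normal approximation for a one-sided, one-sample test of a proportion.
    alpha and power defaults are assumptions, not values from the trial.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(p_null * (1 - p_null))
        + z_beta * sqrt(p_true * (1 - p_true))
    )
    return ceil((numerator / (p_true - p_null)) ** 2)


if __name__ == "__main__":
    # Null: 92% distant metastasis-free survival at 5 years; alternative: 95%.
    print(one_sample_size(p_null=0.92, p_true=0.95))
```

Under these assumed settings the approximation calls for several hundred patients in the discordant stratum alone, which illustrates why trials of this design need a large total enrollment.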
The number of available profiles may escalate geometrically in the future. Should each new profile be tested with one or more clinical trials? This will make the approval and use of these tests very cumbersome and their development and testing prohibitively costly. In this regard, proof-of-concept trials may be recommended. However, the decision of whether or not a new profile can fit into the already tested concept is difficult. For treatments at least, we have been repeatedly misled into believing that testing one intervention suffices and all others would be "similar" [49]. Certainly most available tests that are already used in everyday clinical practice have not been evaluated for clinical utility in the setting of clinical trials. Yet should this be an argument in favor of continuing this laxity toward diagnostic and predictive evidence as opposed to therapeutic interventions?
Perhaps the answer to these questions should be given on a case-by-case basis. We should consider each time what the prospects are for improving outcomes for a specific disease, stage, and setting. For example, for a disease and stage for which no effective treatment exists, one may argue that molecular markers may be able to identify people who do benefit, despite no average effect. While the concept is attractive, we hardly have any examples in which this approach has worked to date to transform an ineffective treatment into an effective one for a subgroup. More likely the benefits should be sought in individualizing treatments that are already known to have some average benefit [50].
| DEMONSTRATION OF BENEFIT ON ROUTINE CLINICAL USE (EFFECTIVENESS) |
|---|
Given the funding limitations for conducting clinical research, clinical trials should try to replicate as closely as possible the wide variety of conditions and settings in which a particular profile may be employed. Both TAILORx and the MINDACT trial, the two ongoing clinical trials, have struggled to introduce pragmatic designs [51], as we discussed above. Still, TAILORx is performing a randomized comparison in a stratum of intermediate-risk patients according to the molecular predictor score without any effort being made to compare this with a traditional clinical prediction.
Perhaps more importantly, it is difficult to model in a clinical trial the extent of misuse of a technology that will occur when it is available on the market. By definition, even the most pragmatic trials do the best possible to ensure that the diagnostic or predictive test is used and interpreted appropriately. Simplifying these tests to the maximum possible extent before introduction into clinical use is also critical. Some of the proposed profiles are already simplified. However, others are using more complex classifications [52] that may escape the average oncology specialist. Interestingly, the split of the Oncotype DX® score into three categories of risk was different in the studies that led to the training and validation of this molecular predictor (<18, 18–30, >30) from the three categories examined in the "pragmatic" TAILORx (<11, 11–25, >25). When there is such inconsistency even in the research phase, moving the goalpost (perhaps inappropriately) may be even more frequent in clinical applications.
Misuse may take different forms. These include employing a test when it is not likely to be helpful (the results will then lead to misleading reassurance about the test-driven decisions) and not using a test when it could provide useful information. These mishaps go beyond the errors in management that would result from simple misclassification due to inherently imperfect test performance.
| INTEGRATION IN CLINICAL CARE |
|---|
A series of other challenges follows for which we have no clear answers yet. What are some minimal quality criteria that should exist and how are they to be enforced? How will these tests be introduced in the clinical care routine? Who will order them? Would they require a minimum of specialization and knowledge-based training? Should they be used only by "experts"? If so, what level of expertise is warranted and required? Should the use of these tests be audited? If so, what should be the audit criteria? With the rapidly increasing number of tests, should expertise be continuously re-evaluated and reinforced?
Perhaps the above concerns are exaggerated. Multicenter experience should give us a clearer picture [53–56]. As medicine evolves toward a more patient- rather than physician-centered decision-making environment, patients may be prime motivating forces advocating the use of these tests. The prospect of individualized medicine may empower people to use technology directly. However, perhaps use of these complex technologies by patients without expert input may complicate or even worsen care and outcomes.
| COST-EFFECTIVENESS |
|---|
Until then, what is a suitable price for a molecular profile? A formal decision and cost-effectiveness analysis is premature in the absence of estimates of the true benefits of this technology. The current price of molecular profiles moving into the market is a few thousand dollars per test. This is expensive compared with most tests used in modern medicine. However, in the absence of the complete picture, it is impossible to tell whether this is worth it or not.
| THE FUTURE IS NOW |
|---|
First, we need full transparency of the experimentation and results obtained in this field, and every effort should be made to avoid selective reporting. Second, we need larger datasets for all stages of development of these assays, including training, validation, proof of concept, and proof of clinical merit. Third, there is no excuse for anything less than full, extensive independent validation under completely independent conditions for any molecular profile that wants to be even a candidate for clinical use. Fourth, evaluation of classifier accuracy should consider all available classic risk factors, and this should be done in settings and with datasets for which classic risk factors are given a fair chance to show their merit. Replacing routine, free information with costly, convoluted biological signatures makes no sense. Fifth, although there is a debate about how many clinical trials we need, at the moment we have none that have been completed and only a couple that are ongoing with quite novel designs. Therefore, we are certainly far from the point of saying that we have had enough randomized evidence. Trials should be designed carefully to minimize the potential for biased results and they should have a pragmatic outlook. Finally, we should keep thinking about how to best employ these tests when they emerge for widespread use into clinical practice, some time soon.
| DISCLOSURE OF POTENTIAL CONFLICTS OF INTEREST |
|---|
| REFERENCES |
|---|