Toxicogenomic analysis methods for predictive toxicology
Jeff Maggioli, Aubree Hoover and Lee Weng
Rosetta Biosoftware, 401 Terry Avenue North, Seattle, WA 98109, U.S.A.
Abstract
Toxicogenomics, the application of genomic data to elucidate or predict an organism’s response to a toxicant, can inform the drug development process in important ways. Standardized approaches to many types of toxicogenomic questions are still being formulated. In particular, a significant body of proof-of-principle studies has emerged that demonstrates a range of statistical methodologies applied to predictive toxicology. These studies rely on class prediction methods (mathematical models generated using the gene expression profiles of known toxins from representative toxicological classes) to predict the toxicological effect of a compound based on the similarities between its gene expression profile and the profiles of a given toxicological class. Class prediction methods hold promise for increasing the rate at which compounds can be evaluated for toxicity early in the drug discovery process, while at the same time reducing the length of toxicological studies and their associated costs. Class prediction methods build on class comparison and class discovery steps, which guide, respectively, the selection of genes whose response can be used to distinguish among the toxicological classes and the determination of the number of classes distinguishable using the response of these genes. Together these steps use a variety of complementary statistical techniques to achieve a successful class prediction model. This report reviews some of the themes that appear to be emerging in the application of these techniques to predictive toxicology over the short history of toxicogenomics.
Keywords: Predictive toxicology; Toxicogenomics; Gene expression; Review
1. Introduction
As a provider of commercial software for gene-expression data analysis and management, we work closely with pharmaceutical companies, biotechnology organizations, and academic institutions. Through our interaction with these entities, we have observed a clear trend toward the use of expression profiling in toxicogenomic endeavors. When applied to mechanistic toxicology studies, gene expression data can be mined to find which genes out of hundreds or thousands monitored are perturbed by a treatment, providing important clues about a toxin’s underlying mechanism of toxicity (Afshari, Nuwaysir, & Barrett, 1999). A growing number of studies demonstrate that gene expression data are also useful for class prediction studies (predictive toxicology), in which expression signatures from known toxins are used to predict the toxicological class of an unknown toxicant (e.g., a new chemical entity or drug candidate). If robust class prediction methods can be routinely generated, the drug development process will benefit: candidate compounds could be analyzed faster and development costs reduced, because many compounds could be eliminated before undergoing traditional toxicological studies.
Gene expression data can provide an early indication of toxicity because toxin-mediated changes in gene expression are often detectable before clinical chemistry, histopathology, or clinical observations suggest a toxic effect (Ulrich & Friend, 2001). However, to fulfill the promise of accelerating preclinical evaluation of drug candidates, many hurdles remain, including the creation of databases containing relevant gene expression data from studies of known toxins, division of known toxins into toxicant classes distinguishable using expression data, understanding the time and dose-dependency of gene response, and correlation of gene response to phenotypes (Pognan, 2004). Class prediction methods only indicate possible relationships between gene response and phenotypes. Further study is necessary to distinguish causative from reactive genes.
In addition, the information-rich data set and the dynamic nature of gene expression present computational challenges for the routine use of class prediction methods in drug development. Genomic data sets are a complex matrix, often containing thousands of individual data points. Genes useful for class prediction are selected for inclusion in the discriminatory gene set, whose expression values (the gene signature) can be used to distinguish among the toxicological classes studied. This discriminatory gene set must be discovered against a complex background of other gene expression changes, some resulting from factors unrelated to the treatment (e.g., sampling time) and others a result of the treatment, but not useful for class prediction, as they are perturbed in a similar way by diverse toxins (e.g., genes involved in metabolic pathways) (Hamadeh, Bushel, Jayadev, Matin et al., 2002).
Gene expression experiments also present a challenge to traditional statistical significance testing because significant change must be calculated for datasets with many variables (potentially tens of thousands) but few available experimental replicates. Finally, toxins may affect gene expression in complex ways, requiring statistical methods that can consider the interaction of expression changes, i.e., genes excluded by statistical significance tests like ANOVA (because they do not change significantly across groups) may still have predictive value when coupled to the response of genes that do change.
The most successful computational methods for class prediction are the supervised learning methods (or classifiers). These methods rely on a training set consisting of gene expression profiles from representatives of the different toxicological classes to be modeled. The gene signatures from samples in the training set and the knowledge of their origins (toxicological class) are used to derive a set of algorithms that can be used to classify unknowns. Class prediction methods most often follow class comparison and class discovery steps, which, respectively, inform the selection of the discriminatory gene set and help to define the toxicological classes distinguishable by gene expression signatures. These steps often make use of complementary statistical techniques. For example, in the class comparison step, statistical significance tests like ANOVA may be applied to select genes that vary among the toxicological classes being studied. Clustering techniques, considered unsupervised learning methods in that they only consider gene expression data and not toxicological class information when representing similarities between treatments, are usually applied in the class discovery step. Literature examples exist that describe the class comparison, class discovery, and class prediction steps for the classification of known hepatotoxins as either peroxisome proliferators or enzyme inducers (Hamadeh, Bushel, Jayadev, DiSorbo et al., 2002), known toxins to one of five characterized toxicological classes (Thomas et al., 2001), and toxic metals into seven or nine distinct groups (Tsai et al., 2005).
The published accounts of class prediction methods, including the related steps of class comparison and class discovery, show much diversity in the statistical methods applied, with some general themes repeated in many, but not all, studies (e.g., the use of unsupervised learning methods for class discovery and supervised learning methods for class prediction). This review attempts to survey the ways different groups are approaching the computational challenges posed by the use of gene expression data for predictive toxicology and provide a discussion of possible future directions for class prediction methods.
2. The process of predictive toxicogenomics
At a high level, the process leading up to a successful class prediction model can be represented as three to five steps (see Fig. 1).
Data Preparation: Datasets are corrected for sources of variability that result from causes other than the treatments under study (e.g., hybridization differences in preparation of the microarrays, variable recoveries of mRNA, fluorescent dye labeling efficiencies).
Class Comparison: The prepared data from a training set are analyzed to define the discriminatory gene set, that is, the set of genes that allow for differentiation among the toxicological classes represented in the training set.
Class Discovery: The similarity among treatments present in the training set is visualized using techniques like clustering, an unsupervised learning method that groups treatments based only on the similarities in their gene signatures and does not employ knowledge about the samples’ toxicological class.
Class Prediction: Unknown or blinded samples are assigned to toxicological classes. Typically, a classifier (supervised learning method) is applied to the gene signatures of the training set samples to generate a mathematical model for predicting the toxicological class of unknowns.
Evaluation: The model generated for class prediction is evaluated. Blinded samples can be used to estimate success rates in predicting the toxicological class of unknowns, or individual samples from the training set can be used to evaluate the model, using a “leave-one-out” validation approach.
Fig. 1. An abstract view of the relationship of class prediction to the related steps that inform it.
Current literature examples include a wide variety of techniques used for class prediction and the related steps shown in Fig. 1. Though it is convenient to break toxicogenomic studies down into discrete steps, mapping statistical methods to these steps can be a challenge. Multiple statistical methods are often evaluated at each step in the process of generating a class prediction model. Furthermore, techniques commonly associated with one step may be applied in another. For example, supervised learning methods, typically associated with the class prediction step, may be applied as a class comparison technique, to identify the discriminatory gene set (Hamadeh, Bushel, Jayadev, DiSorbo et al., 2002). Finally, because supervised learning methods do not lend themselves well to visualization, a number of other statistical techniques may be used to provide qualitative views of the similarities between treatment groups. The flow diagram shown in Fig. 2 attempts to capture the diversity of approaches used to generate class prediction methods.
Fig. 2. An information-centric view of the class prediction process and the steps that inform it. Families of techniques are represented by the blue boxes (e.g., Hypothesis Testing includes parametric methods like t-tests, ANOVA, and non-parametric methods like Wilcoxon and SAM). In any one study, multiple techniques from the same family are often applied for comparison. The evaluation step informs the success of each technique. Selection of statistical methods and discriminatory gene sets is often refined in an iterative process to generate a final classification model for unknowns or blinded samples.
2.1. Data preparation
All microarray experiments are affected by systematic and random error. Random error can be generated by factors such as background noise, scanner noise, and hybridization noise. The ideal way to reduce random error is to generate many replicates and perform data analysis on the combined replicates. When replicates are limited, as they often are in microarray studies, an estimation of random errors can be useful. Systematic errors have known causes or well-understood behaviors and can be corrected. Examples of microarray systematic error include differences in scanner sensitivity and non-zero background intensities. Preprocessing algorithms such as background subtraction, normalization, and de-trending can reduce or eliminate systematic error. An in-depth discussion of data preprocessing and normalization methods is beyond the scope of this paper, but can be found in a number of references (e.g., Baldi & Hatfield, 2002).
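As a rough illustration of the kind of systematic-error correction described above, the following Python sketch applies background subtraction followed by a simple median scaling across arrays. The function names, the positive floor value, the example intensities, and the choice of median scaling are all assumptions made for this example; they are not taken from any study cited in this review, and real pipelines generally use more elaborate normalization methods.

```python
from statistics import median


def background_subtract(intensities, background, floor=1.0):
    """Subtract per-spot background estimates from raw intensities.

    Values are clamped at a small positive floor (an assumption of this
    sketch) so that later log-ratio calculations stay defined.
    """
    return [max(i - b, floor) for i, b in zip(intensities, background)]


def median_scale(arrays):
    """Rescale each array so that all arrays share the same median.

    The common target is the mean of the per-array medians, which removes
    a global intensity offset such as a scanner sensitivity difference.
    """
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]


# Hypothetical spot intensities for two arrays measuring the same three genes.
raw = [[120.0, 220.0, 320.0], [220.0, 420.0, 620.0]]
bg = [[20.0, 20.0, 20.0], [20.0, 20.0, 20.0]]
corrected = [background_subtract(r, b) for r, b in zip(raw, bg)]
normalized = median_scale(corrected)
```

In this toy case the second array is uniformly twice as bright as the first after background subtraction, and median scaling brings the two onto a common scale.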
2.2. Class comparison
A common approach to class comparison is to search for a discriminatory gene set among expression profiles generated from studies of toxins representative of known toxicological classes. Statistical significance testing is often used to select the discriminatory gene sets. For example, to select discriminatory genes from a training set of expression profiles from rats exposed to nine toxic metals, an ANOVA F-test was used to find those genes that varied significantly across the nine treatment groups, and an OVA (one-versus-all) test identified genes whose expression varied significantly when each group was compared to the average of the other eight (Tsai et al., 2005). The resulting two discriminatory gene sets (the set defined from the F-test and the union of the nine groups returned from the OVA analysis), as well as a third set, consisting of those genes appearing in both original sets, were then evaluated for their ability to classify toxic metals successfully.
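The two kinds of test mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, the one-versus-all comparison is implemented here as a Welch t statistic of one class against all other samples pooled (an assumption, since the exact OVA statistic is not specified in this review), and genes are ranked by statistic rather than by p-value to keep the example free of external libraries.

```python
import math
from statistics import mean, variance


def f_statistic(groups):
    """One-way ANOVA F statistic for a single gene.

    groups is a list of lists: one list of expression values per class.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def one_versus_all_t(groups, index):
    """Welch t statistic comparing one class against all others pooled.

    Assumes at least two values on each side and some within-group
    variation; otherwise the standard error below would be zero.
    """
    target = groups[index]
    rest = [x for i, g in enumerate(groups) if i != index for x in g]
    se = math.sqrt(variance(target) / len(target) + variance(rest) / len(rest))
    return (mean(target) - mean(rest)) / se


def top_genes_by_f(expression, n):
    """Return the n gene names with the largest F statistics.

    expression maps a gene name to its per-class groups of values.
    """
    ranked = sorted(expression, key=lambda g: f_statistic(expression[g]), reverse=True)
    return ranked[:n]


# Hypothetical data: gene_a differs across three classes, gene_b does not.
expression = {
    "gene_a": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    "gene_b": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
}
selected = top_genes_by_f(expression, 1)
```

Gene sets from the two tests could then be combined by union or intersection using ordinary Python set operations.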
Another approach is to apply dimension-reducing techniques such as principal component analysis (PCA), multidimensional scaling (MDS), or wavelet transformation (Yang, Blomme, & Waring, 2004). Rather than requiring the selection of specific genes from a data set, these techniques reduce the high dimensionality of the original data set, which can include thousands of variables, into a smaller number of weighted variables. One disadvantage of this approach is that information about which genes are modified most for individual classes is obscured (Tsai et al., 2005).
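To make the idea of "a smaller number of weighted variables" concrete, the sketch below estimates only the first principal component using power iteration, in plain Python. The function name, the fixed starting direction, and the iteration count are assumptions of this example; in practice a full PCA from a numerical library would be used, and the other techniques named above (MDS, wavelet transformation) work differently.

```python
import math


def first_principal_component(samples, iterations=200):
    """Estimate the first principal component by power iteration.

    samples is a list of expression profiles (one list of gene values per
    sample). Returns (loadings, scores): one weight per gene and one
    projected value per sample. The sign of both is arbitrary.
    """
    n_genes = len(samples[0])
    means = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    centered = [[s[g] - means[g] for g in range(n_genes)] for s in samples]

    v = [1.0 / math.sqrt(n_genes)] * n_genes  # naive starting direction
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        # Multiply by X^T X without forming the gene-by-gene matrix.
        nxt = [sum(row[g] * sc for row, sc in zip(centered, scores))
               for g in range(n_genes)]
        norm = math.sqrt(sum(w * w for w in nxt))
        if norm == 0:
            break  # no variance along the current direction
        v = [w / norm for w in nxt]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Hypothetical two-gene profiles for three samples lying along one direction.
loadings, scores = first_principal_component([[2.0, 0.0], [4.0, 1.0], [6.0, 2.0]])
```

Each sample is summarized by a single score, while the loadings show how strongly each gene contributes to it. Once several components are combined, the per-gene contributions become harder to interpret, which is the disadvantage noted above.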
A combination approach is sometimes used. To identify a discriminatory set for classifying hepatotoxins, ANOVA analysis was used to identify the top 200 genes that varied among the groups studied. Wavelet transformation was then applied to the expression profiles from these 200 genes, reducing their response into seven components (Yang et al., 2004). These seven components were carried forward to the class prediction stage.
2.3. Class discovery
Clustering methods are often applied to visualize the similarities between individual treatments as well as multiple treatments from different toxicological classes. Hierarchical clustering represents the distance between samples visually: more similar gene expression profiles are grouped together. Clustering is a valuable exploratory technique for helping to characterize how many classes a given set of treatments can be divided into.
Two common types of clustering algorithms are hierarchical and partitioning algorithms. Hierarchical algorithms yield a hierarchy of clusters for a data set that can be visualized in a dendrogram tree (see Fig. 3). Data sets belonging to the same branch of a cluster are similar to each other at some level, whereas data sets in separate branches are less similar.
Fig. 3. An example of two types of hierarchical clustering algorithms applied to the data sets derived from rats treated with 15 known hepatotoxins (taken from Waring et al., 2001). The 2D clusters were generated using the Rosetta Resolver® System. Reproduced with permission.
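The hierarchy that a dendrogram displays can be produced by repeatedly merging the two closest clusters. The following Python sketch does this with average linkage and Euclidean distance; both choices, the function names, and the treatment labels are assumptions of the example and are not the algorithms used to generate Fig. 3.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles, distance=euclidean):
    """Average-linkage agglomerative clustering.

    profiles maps a treatment name to its expression profile. Returns the
    merge history as (cluster_a, cluster_b, distance) tuples, which is the
    information a dendrogram draws from the leaves up to the root.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical profiles: two replicate treatments from each of two classes.
profiles = {
    "A1": [1.0, 0.0], "A2": [1.1, 0.0],
    "B1": [5.0, 5.0], "B2": [5.0, 5.2],
}
merges = agglomerate(profiles)
```

Early merges join the most similar treatments; the distance recorded at each merge corresponds to the branch height in a dendrogram.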
Partitioning algorithms like K-Means divide the data set into an a priori specified number of clusters that are viewed in tabular format to make inferences about their similarity within a cluster. Because partitioning algorithms result in bins, unique inferences about the relationship of each data point in a cluster to every other data point in the cluster are not apparent. Similarly, further inferences about the relationship of all the data points in one cluster to all the data points in another cluster cannot be drawn (see Fig. 4).
Fig. 4. Representation of a K-means cluster.
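A partitioning algorithm of the K-means type can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and the centroids are initialized from the first k profiles so that the example is deterministic, whereas practical implementations usually start from random or otherwise spread-out centroids and repeat the run several times.

```python
def kmeans(points, k, iterations=100):
    """Partition points into k clusters (Lloyd's algorithm).

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of points[i]. Only bin membership is reported; no relationship
    between the members of a bin, or between bins, is produced.
    """
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical two-gene profiles forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = kmeans(points, k=2)
```

Note that the number of clusters must be supplied in advance, in contrast to the hierarchical approach, where it can be chosen after inspecting the dendrogram.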
The results of clustering are influenced by the type of similarity measure used to calculate the distance between items in the clusters. Distance-based similarity measures, like Euclidean distance, emphasize the magnitude of the fold changes between data sets. Correlation-based measures, such as correlation with mean subtraction, emphasize the pattern of the fold changes.
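The difference between the two families of measures is easy to see on a small example. In the Python sketch below, two hypothetical log-ratio profiles share the same pattern but differ twofold in magnitude; the function names and values are illustrative assumptions.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation: both profiles are mean-centered before comparison."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(x * y for x, y in zip(da, db)) / denom


def correlation_distance(a, b):
    """One minus the correlation coefficient."""
    return 1.0 - pearson(a, b)


# Same pattern of change, but profile b responds twice as strongly as a.
a = [1.0, 2.0, -1.0, 0.5]
b = [2.0, 4.0, -2.0, 1.0]
d_euclid = euclidean_distance(a, b)
d_corr = correlation_distance(a, b)
```

Here the Euclidean distance is 2.5, reflecting the difference in magnitude, while the correlation distance is essentially zero because the pattern is identical. A clustering run on the same data could therefore group these two profiles together or apart depending on the measure chosen.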
Cluster analysis was used to study how gene response varied over time for a given treatment (Hamadeh, Bushel, Jayadev, Matin et al., 2002). The results of time course analysis can help further refine the genes included in the discriminatory gene set, as one of the common goals for a class prediction method is a time-independent model, that is, a model that excludes genes whose response is highly unstable with time. In the same study, clustering, PCA, and correlation analysis were all used to demonstrate the similarities between test compounds from the two classes (peroxisome proliferators and enzyme inducers) and provide preliminary evidence that creation of a model to distinguish these classes should be attainable.
Hierarchical clustering algorithms using distance-based (Euclidean distance) or correlation-based (one minus correlation coefficient) similarity measures were compared for their ability to cluster datasets from rats treated with nine toxic metals (Tsai et al., 2005). Though clustering was an investigative step to examine the number of classes represented by the nine treatments, the clusters were compared by examining their ability to group replicate treatments together into eight groups in a class prediction type exercise (groupings were defined by the study authors). Though clustering methods are useful for investigating similarities between treatments (class discovery), clustering is not recommended for use in class prediction. Clustering is a subjective technique whose results are highly influenced by the selection of the clustering algorithm and similarity metric (Simon, Radmacher, Dobbin, & McShane, 2003).
2.4. Class prediction and evaluation
Class prediction typically relies on supervised learning methods (classifiers) to assign a toxicant to a known group. The methods use a discriminatory gene set (or a data set that has been reduced using a dimension-reducing technique) derived from a training set, to obtain a mathematical model that can predict the class membership of unknowns. One active area of study is the practice of filtering out invariant gene expression signals versus applying classifiers to the full data set. A recent study compares the effects of two different types of data filtering on the performance of four classifiers for distinguishing genotoxic from non-genotoxic compounds (Van Delft et al., 2005). There are a variety of classification methods available, including Linear Discriminant Analysis (LDA), nearest-neighbor (NN) methods, Naïve Bayesian classifiers, as well as machine learning methods, such as bagging methods, support vector machines (SVM), and artificial neural networks (ANN). Dudoit et al. compared many of these methods for their ability to classify tumors and found that for their test data set, LDA and NN methods yielded the best prediction accuracies (Dudoit, Fridlyand, & Speed, 2002).
In an approach that combines class comparison and class prediction, Hamadeh et al. used both k-nearest neighbors (kNN) and LDA classifiers to select the genes most useful for distinguishing between the two classes of compounds being studied: peroxisome proliferators and enzyme inducers (Hamadeh, Bushel, Jayadev, DiSorbo et al., 2002). The kNN method (applied with a genetic algorithm used for searching) yielded a ranked list of genes useful for distinguishing between the two compound classes. LDA (applied after an initial ANOVA analysis to exclude genes whose expression did not change significantly across compound classes) yielded a second set of genes that appeared to discriminate between compound classes. The 22 genes appearing in the intersection of these two sets were used to classify blinded samples. Classification was accomplished using correlation set analysis, in which the pairwise Pearson correlation coefficient is calculated between each sample in the training set and each blinded sample. Samples were considered similar if their correlation coefficient was greater than 0.8.
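The correlation-based matching described above can be sketched in Python as follows. Only the Pearson coefficient and the 0.8 threshold come from the text; the function names, the data layout, and the sample values are hypothetical, and because the way conflicting or missing matches were resolved is not described here, the sketch simply lists the training samples that exceed the threshold.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two expression profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(x * y for x, y in zip(da, db)) / denom


def correlation_matches(training, blinded_profile, threshold=0.8):
    """List training samples whose profile correlates with a blinded sample.

    training maps a sample name to (toxicological_class, profile), where
    each profile holds values for the discriminatory gene set only.
    Returns (sample_name, toxicological_class, r) tuples with r > threshold,
    sorted from most to least similar.
    """
    matches = []
    for name, (tox_class, profile) in training.items():
        r = pearson(profile, blinded_profile)
        if r > threshold:
            matches.append((name, tox_class, r))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical four-gene signatures for one training sample per class.
training = {
    "pp_1": ("peroxisome proliferator", [2.0, 1.5, -1.0, 0.2]),
    "ei_1": ("enzyme inducer", [-1.0, 0.3, 2.0, 1.1]),
}
matches = correlation_matches(training, [1.8, 1.6, -0.8, 0.1])
```

An empty result would indicate that the blinded sample does not resemble any training sample closely enough to be assigned under this criterion.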
Tsai et al. compared Fisher’s Linear Discriminant Analysis (FLDA) and kNN approaches for predicting the class of gene signatures from the liver tissue of rats exposed to various toxic metals (Tsai et al., 2005). A leave-one-out validation approach was used to evaluate the methods. In this approach, the classifier is calculated using data from all but one sample. The resulting algorithm is then used to classify the sample left out. This process is repeated until all samples have been classified. A leave-one-out validation approach was also taken by Thomas et al. (2001), who applied a Naïve Bayesian classifier to assign samples to five distinct toxicological classes.
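The leave-one-out procedure can be written as a short loop. The Python sketch below pairs it with a basic kNN classifier using Euclidean distance and majority voting; the function names, the value of k, and the sample data are assumptions of this example and do not reproduce the settings of the cited studies.

```python
import math
from collections import Counter


def knn_predict(train, profile, k=3):
    """Predict a class by majority vote among the k nearest training samples.

    train is a list of (class_label, profile) pairs. If the vote is tied,
    the class of the nearest tied neighbor wins.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[1], profile))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the result."""
    correct = 0
    for i, (label, profile) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if knn_predict(train, profile, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical two-gene signatures for two toxicological classes.
samples = [
    ("class_a", [0.0, 0.1]), ("class_a", [0.2, 0.0]), ("class_a", [0.1, 0.2]),
    ("class_b", [5.0, 5.1]), ("class_b", [5.2, 5.0]), ("class_b", [5.1, 5.2]),
]
accuracy = leave_one_out_accuracy(samples, k=3)
```

One caveat worth keeping in mind: if the discriminatory gene set is chosen using all samples before this loop runs, the held-out sample has already influenced the model, so any gene selection step should be repeated inside the loop for the estimate to be unbiased.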
Especially for classifiers built from small training sets, class prediction of a larger set of independent samples may be necessary to characterize the method’s true performance (Simon et al., 2003). A known weakness of classifiers is the tendency to “overfit” the data in the training set, which limits the utility of the model for predicting profiles outside of the training set. With this view, the leave-one-out validation approach is a reasonable first step in characterizing the performance of a method. But to accurately characterize the method’s ability to predict the class of unknowns, a validation would need to include samples independent from the training set, representing each toxicological class recognized by the model (Simon et al., 2003).
3. The future of predictive toxicogenomics
With further study, the computational methods used in class comparison, class discovery, class prediction, and evaluation will likely become more standardized. For example, the choice of supervised learning methods for specific applications is likely to be narrowed by ongoing research, in which the theoretical merits and performance of various classifiers are compared. The oncology literature includes a comparison of linear discriminant analysis, classification tree, and nearest neighbor methods (Dudoit et al., 2002). The methods were compared for their ability to successfully predict tumor class using the gene expression data from a number of published studies. A more recent study has appeared in the toxicology literature, comparing the performance of four supervised learning methods in distinguishing genotoxic from non-genotoxic carcinogens (Van Delft et al., 2005). In this study, the evaluation of the classifier methods was combined with an evaluation of different input data sets (different methods for class comparison). An analysis of methods of error rate reporting for classification methods has also appeared, stressing the importance of a sufficiently large and diverse independent validation dataset for adequate characterization of class prediction methods (Simon et al., 2003).
A number of other publications have begun to detail the many obstacles that remain before class prediction methods can fulfill the promise of accelerating the drug development process, or possibly even replacing some traditional toxicological studies. Among these challenges are the cost-intensive process of building relevant databases of gene expression profiles of known toxins (Lühe et al., 2005; Van Delft et al., 2005), the difficulties of comparing gene expression data collected using different technologies (Hayes et al., 2005), and the challenge of making useful predictions of toxicity across species or from in vitro systems (e.g., cultured primary hepatocytes) to living organs in human beings (Pognan, 2004).
Though many obstacles remain, work continues to make class prediction methods robust and sufficiently relevant for routine use in the toxicological evaluation of novel compounds. Continual refinement in the application and evaluation of computational approaches will remain an important part of this effort.