Bruce, Craig L. (2010) Classification and interpretation in quantitative structure-activity relationships. PhD thesis, University of Nottingham.
A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. In addition, one should apply robust and appropriate statistical tests to the models to assess their significance or the significance of any apparent improvements. The real impact of a QSAR, however, perhaps lies in its chemical insight and interpretation, an aspect which is often overlooked.
This thesis covers three main topics: a comparison of contemporary classifiers, the interpretability of random forests and the use of interpretable descriptors. The choice of data mining technique and descriptors entirely determines the available interpretation. We demonstrate the success of interpretable approaches on a variety of data sets.
Using robust multiple comparison statistics with eight data sets, we demonstrate that a random forest has predictive accuracy comparable to that of the de facto standard, the support vector machine. A random forest is inherently more interpretable than a support vector machine, due to the underlying tree construction. We can extract some chemical insight from the random forest directly, but further insight is available with additional tools. Because a single decision tree is easier to interpret than a random forest, we have employed a selection of tools to obtain useful interpretation from the forest. These include alternative representations of the trees using SMILES and SMARTS. Using existing methods we can compare and cluster the trees in this representation. Descriptor analysis and importance can be measured at the tree and forest level. Pathways in the trees can be compared and frequently occurring subgraphs identified. These tools have been built around the Weka machine learning workbench and are designed to allow new functionality to be added.
The interpretability of a model depends on both the model and the descriptors, and the descriptors must describe something meaningful. To this end we have used the TMACC descriptors on the Solubility Challenge and literature data sets. We report how our retrospective analysis confirms existing knowledge and how we identify novel C-domain inhibition of ACE.
To test our hypotheses we extended and developed existing software, forming two applications. The Nottingham Cheminformatics Workbench (NCW) generates TMACC descriptors and allows the user to build and analyse models, including visualising the chemical interpretation. Forest Based Interpretation (FBI) provides various tools for interpreting a random forest model. Both applications are written in Java and come with full documentation. Simple installation wizards are available for Windows, Linux and Mac.
|Item Type:||Thesis (PhD)|
|Uncontrolled Keywords:||qsar data mining cheminformatics chemoinformatics random forest machine learning|
|Faculties/Schools:||UK Campuses > Faculty of Science > School of Chemistry|
|Deposited By:||Dr Craig L Bruce|
|Deposited On:||12 Apr 2011 11:59|
|Last Modified:||12 Apr 2011 11:59|