Statistical Analysis of Proteomic Mass Spectrometry Data

Handley, Kelly (2007) Statistical Analysis of Proteomic Mass Spectrometry Data. PhD thesis, University of Nottingham.



This thesis considers the statistical modelling and analysis of proteomic mass spectrometry data. Proteomics is a relatively new field of study, and established methods of analysis do not yet exist. Mass spectrometry output is high-dimensional, so we first develop an algorithm to identify peaks in the spectra and thereby reduce the dimensionality of the datasets. We use the identified peaks, together with a variety of classification methods, to examine the classification of new spectra based on a training set. Another way to reduce the complexity of the problem is to fit a parametric model to the data. We model the data as a mixture of Gaussian peaks with parameters representing the peak locations, heights and variances, and apply a Bayesian Markov chain Monte Carlo (MCMC) algorithm to estimate these parameters. The resulting estimates are used to identify m/z values at which differences between groups are apparent, where the m/z value of an ion is its mass divided by its charge. A multilevel modelling framework is also considered in order to incorporate the structure in the data, and locations exhibiting differences are again obtained.
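
As a rough illustration of the two ideas above, the sketch below evaluates a sum of Gaussian-shaped peaks on an m/z grid and then picks out local maxima above a threshold. It is not the thesis's peak-identification algorithm or its MCMC sampler, neither of which is specified in this abstract. The function names (`gaussian_peak_mixture`, `local_maxima`), the grid, the threshold and all parameter values are hypothetical choices made purely for illustration.

```python
import math


def gaussian_peak_mixture(mz, locations, heights, variances):
    """Intensity at one m/z value as a sum of Gaussian-shaped peaks.

    Each peak is described by a location, a height and a variance,
    matching the three kinds of parameter named in the abstract.
    """
    return sum(
        h * math.exp(-0.5 * (mz - mu) ** 2 / var)
        for mu, h, var in zip(locations, heights, variances)
    )


def local_maxima(intensities, threshold):
    """Indices of interior points above `threshold` that exceed both neighbours.

    This is a deliberately naive stand-in for a peak-finding step,
    assumed here for illustration only.
    """
    return [
        i
        for i in range(1, len(intensities) - 1)
        if intensities[i] > threshold
        and intensities[i] > intensities[i - 1]
        and intensities[i] > intensities[i + 1]
    ]


# Hypothetical noiseless spectrum with two peaks on a grid of spacing 0.1.
mz_grid = [1000.0 + 0.1 * i for i in range(1001)]
spectrum = [
    gaussian_peak_mixture(x, [1020.0, 1060.0], [5.0, 3.0], [4.0, 9.0])
    for x in mz_grid
]

peak_idx = local_maxima(spectrum, threshold=1.0)
peak_mz = [mz_grid[i] for i in peak_idx]
```

Here `peak_mz` holds the grid points nearest the two simulated peak locations, so the 1001-point spectrum is summarised by two m/z values. Real spectra are noisy and would need smoothing and baseline handling before a step like this could be useful.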

We consider two mass spectrometry datasets in detail. The first consists of mass spectra from breast cancer cells which either have or have not been treated with the chemotherapeutic agent Taxol. The second consists of mass spectra from melanoma cells classified as stage I or stage IV using the TNM system. Using the MCMC and multilevel techniques described above we show that, in both datasets, small subsets of the available m/z values can be identified which exhibit significant differences in protein expression between groups. We also see that good classification of new data can be achieved using a small number of m/z values, and that the classification rate does not fall greatly compared with results from the complete spectra. For both datasets we compare our results with those in the literature which use other techniques on the same data. We conclude by discussing potential areas for further research.

Item Type: Thesis (PhD)
Supervisors: Dryden, Ian L.; Browne, William J.
Uncontrolled Keywords: Markov chain Monte Carlo, MCMC, multilevel modelling, classification, high-dimensional, bioinformatics
Faculties/Schools: UK Campuses > Faculty of Science > School of Mathematical Sciences
ID Code: 287
Deposited By: Kelly Handley
Deposited On: 22 Oct 2007
Last Modified: 06 Feb 2009 14:43