Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions.

Somorjai RL, Dolenko B, Baumgartner R.

Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada R3B 1Y6. Ray.Somorjai@nrc-cnrc.gc.ca

MOTIVATION: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease.

RESULTS: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample-to-feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
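
The non-uniqueness described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' method: the data are pure Gaussian noise, the sample and feature counts are arbitrary, and the function names `perfectly_separating_features` and `simulate` are hypothetical. With 5 samples per class, a single noise feature separates the classes perfectly with probability 2/252 (about 0.8%), so among 5000 features roughly 40 such 'perfect' features are expected by chance alone.

```python
import random


def perfectly_separating_features(X, y):
    """Return indices of features for which a single threshold
    splits the two classes with zero training errors."""
    hits = []
    n_features = len(X[0])
    for j in range(n_features):
        class0 = [row[j] for row, label in zip(X, y) if label == 0]
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        # Perfect separation: every value of one class lies below
        # every value of the other class.
        if max(class0) < min(class1) or max(class1) < min(class0):
            hits.append(j)
    return hits


def simulate(n_per_class=5, n_features=5000, seed=0):
    """Generate class labels and pure-noise features (an assumption:
    no feature carries any real class information), then count the
    features that nevertheless classify the training set perfectly."""
    rng = random.Random(seed)
    y = [0] * n_per_class + [1] * n_per_class
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in y]
    return perfectly_separating_features(X, y)


if __name__ == "__main__":
    hits = simulate()
    print(f"{len(hits)} of 5000 noise features separate the classes perfectly")
```

Each feature returned here is an equally 'optimal' one-feature classifier on the training samples, yet none has any relevance to the labels by construction. Raising `n_per_class` shrinks the count quickly, which is the dataset-sparsity effect in miniature.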

PMID: 12912828 [PubMed - indexed for MEDLINE]