Extremely low false discovery rates were observed for all datasets, as derived from comparing the distributions of classifier accuracies in unmodified and randomly permuted data, indicating high statistical significance of each classifier. Every classifier except the prediction of ulcerative colitis transformation had an estimated false discovery rate well below 0.01 from 10 independent permutations; the lower performance in this dataset was likely due to the smaller number of experimental samples included therein. Selected classification tasks are shown in Figure 2, including the distribution of gene pair accuracy and a graphical representation of top-scoring classifiers. As would be expected, the vast majority of gene pairs have low predictive accuracy in the given classification tasks, with only a small fraction exhibiting strong correlation with phenotype.
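The permutation-based estimate described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names `pair_scores` and `permutation_fdr`, the score threshold, and the choice of a top-scoring-pair style score (the absolute difference between classes in how often gene i is expressed below gene j) are all assumptions made for the example.

```python
import numpy as np

def pair_scores(X, y):
    """Score every gene pair (i < j) on a samples-by-genes matrix X.

    Assumed score: |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    y holds binary class labels (0 or 1).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is below gene j in sample s
    less = X[:, :, None] < X[:, None, :]
    p0 = less[y == 0].mean(axis=0)
    p1 = less[y == 1].mean(axis=0)
    upper = np.triu_indices(X.shape[1], k=1)  # keep each pair once
    return np.abs(p0 - p1)[upper]

def permutation_fdr(X, y, threshold, n_perm=10, seed=0):
    """Estimate the FDR among pairs scoring at or above `threshold`.

    The expected number of false discoveries is taken as the mean count
    of pairs passing the threshold after randomly permuting class labels.
    """
    rng = np.random.default_rng(seed)
    observed = int(np.sum(pair_scores(X, y) >= threshold))
    if observed == 0:
        return float("nan")  # nothing discovered, FDR undefined
    null_counts = [
        np.sum(pair_scores(X, rng.permutation(y)) >= threshold)
        for _ in range(n_perm)
    ]
    return min(1.0, float(np.mean(null_counts)) / observed)
```

With only a handful of permutations, as in the 10 used here, the estimate is coarse; it bounds the false discovery rate rather than pinning it down precisely.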
Importantly, the random permutation of class labels sharply reduces the apparent accuracy of the classification algorithm for most datasets, indicating that the classifiers derived from the original, unmodified data are statistically significant and correspond to true molecular separation of the two phenotypes rather than being a product of chance. These results compare favourably with classifiers reported for these datasets using other statistical classification methods.
Discussion
We have shown that simple two-transcript gene expression classifiers can accurately classify a wide spectrum of human diseases. This algorithm is invariant to data normalization and generates robust, statistically significant biological classifiers even in the context of low sample sizes.
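The invariance to normalization can be illustrated with a short sketch. It assumes the two-transcript classifier is a pairwise rule that compares the expression of two genes within each sample, so that only their relative order matters. The function `tsp_predict`, the class assignment, and the synthetic data are hypothetical and serve only to show why any strictly increasing per-sample transformation leaves the predictions unchanged.

```python
import numpy as np

def tsp_predict(X, i, j):
    """Predict class 1 when gene i is below gene j in a sample, else class 0.

    Which ordering maps to which class is an arbitrary choice for this sketch.
    """
    X = np.asarray(X)
    return (X[:, i] < X[:, j]).astype(int)

rng = np.random.default_rng(0)

# Synthetic positive "intensities": 8 samples by 4 genes (illustrative only)
raw = rng.lognormal(mean=5.0, sigma=1.0, size=(8, 4))

# A per-sample scaling factor followed by a log transform, standing in
# for a typical normalization step
scale = rng.uniform(0.5, 2.0, size=(8, 1))
normalized = np.log2(raw * scale)

# Both steps are strictly increasing within each sample, so the order of
# any two genes in that sample, and hence the prediction, is preserved
same = np.array_equal(tsp_predict(raw, 0, 1), tsp_predict(normalized, 0, 1))
print(same)
```

Transformations that are not monotone within a sample, or that adjust each gene separately across samples, would not carry this guarantee.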
Our results reveal that many pathological processes, even those not traditionally considered genetic in nature, such as infections and inflammatory disorders, can be diagnosed through just two transcriptional measurements. Whereas previous work has shown the diagnostic value of gene expression perturbations, this study demonstrates that as few as two transcriptional measurements can reliably detect diverse human diseases. Transcriptional networks themselves can thus be seen to encode aspects of pathological phenotypes, with strong correlation observed between gene expression status and disease state. These transcriptional signatures were sufficiently robust to be detected even in tissue samples of possibly heterogeneous cell populations.
The accuracies observed in these simple diagnostic modalities were comparable to pre-existing transcription-based classifiers that rely on more complex, multivariate measurements. For example, a 12-gene classifier generated against the same Crohn's disease dataset using a weighted voting scheme exhibited a cross-validation accuracy of 94%, compared with an equivalent TSP cross-validation performance of 87%.