Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines

Michael P. S. Brown*, William Noble Grundy†‡, David Lin*, Nello Cristianini§, Charles Walsh Sugnet¶, Terrence S. Furey*, Manuel Ares, Jr.¶, and David Haussler*

*Department of Computer Science and ¶Center for Molecular Biology of RNA, Department of Biology, University of California, Santa Cruz, Santa Cruz, CA 95064; †Department of Computer Science, Columbia University, New York, NY 10025; §Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom

Edited by David Botstein, Stanford University School of Medicine, Stanford, CA, and approved November 15, 1999 (received for review August 31, 1999)

This paper was submitted directly (Track II) to the PNAS office. Abbreviations: SVM, support vector machine; MYGD, Munich Information Center for Protein Sequences Yeast Genome Database; TCA, tricarboxylic acid. ‡To whom reprint requests should be addressed at: Department of Computer Science, Columbia University, 450 Computer Science Building, Mail Code 0401, 1214 Amsterdam Avenue, New York, NY 10027. E-mail: bgrundy@cs.columbia.edu. The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. Initial experiments (1) suggest that genes of similar function yield similar expression patterns in microarray hybridization experiments. As data from such experiments accumulate, it will be essential to have accurate means for extracting biological significance and using the data to assign functions to genes.

Currently, most approaches to the computational analysis of gene expression data attempt to learn functionally significant classifications of genes in an unsupervised fashion. A learning method is considered unsupervised if it learns in the absence of a teacher signal. Unsupervised gene expression analysis methods begin with a definition of similarity (or a measure of distance) between expression patterns, but with no prior knowledge of the true functional classes of the genes. Genes are then grouped by using a clustering algorithm such as hierarchical clustering (1, 2) or self-organizing maps (3).

Support vector machines (SVMs) (4-6) and other supervised learning techniques use a training set to specify in advance which data should cluster together. As applied to gene expression data, an SVM would begin with a set of genes that have a common function: for example, genes coding for ribosomal proteins or genes coding for components of the proteasome. In addition, a separate set of genes that are known not to be members of the functional class is specified.
These two sets of genes are combined to form a set of training examples in which the genes are labeled positively if they are in the functional class and are labeled negatively if they are known not to be in the functional class. A set of training examples can easily be assembled from literature and database sources. Using this training set, an SVM would learn to discriminate between the members and non-members of a given functional class based on expression data. Having learned the expression features of the class, the SVM could recognize new genes as members or as non-members of the class based on their expression data. The SVM could also be reapplied to the training examples to identify outliers that may have previously been assigned to the incorrect class in the training set. Thus, an SVM would use the biological information in the investigator's training set to determine what expression features are characteristic of a given functional group and use this information to decide whether any given gene is likely to be a member of the group.

SVMs offer two primary advantages with respect to previously proposed methods such as hierarchical clustering and self-organizing maps. First, although all three methods employ distance (or similarity) functions to compare gene expression measurements, SVMs are capable of using a larger variety of such functions. Specifically, SVMs can employ distance functions that operate in extremely high-dimensional feature spaces, as described in more detail below. This ability allows the SVMs implicitly to take into account correlations between gene expression measurements. Second, supervised methods like SVMs take advantage of prior knowledge (in the form of training data labels) in making distinctions between one type of gene and another. In an unsupervised method, when related genes end up far apart according to the distance function, the method has no way to know that the genes are related.

We describe here the use of SVMs to classify genes based on gene expression. We analyze expression data from 2,467 genes from the budding yeast Saccharomyces cerevisiae measured in 79 different DNA microarray hybridization experiments (1). From these data, we learn to recognize five functional classes from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) (http://www.mips.biochem.mpg.de/proj/yeast). In addition to SVM classification, we subject these data to analyses by four competing machine learning techniques, including Fisher's linear discriminant (7), Parzen windows (8), and two decision tree learners (9, 10). The SVM method outperforms all other methods investigated here. We then use SVMs developed for these functional groups to predict functional associations for 15 yeast ORFs of unknown function.
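As a concrete illustration of the training-set construction described above, the following sketch builds a labeled gene set from two annotated lists. The function and variable names are our own illustrative choices, not part of the paper; the gene lists would come from a source such as MYGD.

```python
import numpy as np

def build_training_labels(gene_names, positive_genes, negative_genes):
    """Label each gene +1 if it is a known class member, -1 if it is a
    known non-member, and 0 if unannotated (excluded from training)."""
    positives, negatives = set(positive_genes), set(negative_genes)
    labels = np.zeros(len(gene_names), dtype=int)
    for i, name in enumerate(gene_names):
        if name in positives:
            labels[i] = 1
        elif name in negatives:
            labels[i] = -1
    return labels

# Hypothetical usage, with genes named as in the paper's tables:
# labels = build_training_labels(all_genes,
#                                positive_genes={"YAL003W", "YPL037C"},
#                                negative_genes=known_non_members)
```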
Methods and Approach

DNA Microarray Data. Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions (11, 12). The result, from an experiment with n genes on a single chip, is a series of n expression-level ratios. Typically, the numerator of each ratio is the expression level of the gene in the varying condition of interest, whereas the denominator is the expression level of the gene in some reference condition. The data from a series of m such experiments may be represented as a gene expression matrix, in which each of the n rows consists of an m-element expression vector for a single gene. Following Eisen et al. (1), we do not work directly with the ratio as discussed above but rather with its normalized logarithm. We define X_i to be the logarithm of the ratio of expression level E_i for gene X in experiment i to the expression level R_i of gene X in the reference state, normalized so that the expression vector X = (X_1, ..., X_79) has Euclidean length 1:

$$X_i = \frac{\log(E_i/R_i)}{\sqrt{\sum_{j=1}^{79} \log^2(E_j/R_j)}} \qquad [1]$$

The expression measurement X_i is positive if the gene is induced (turned up) with respect to the reference state and negative if it is repressed (turned down) (1).

Initial analyses described here are carried out by using a set of 79-element gene expression vectors for 2,467 yeast genes (1). These genes were selected by Eisen et al. (1) based on the availability of accurate functional annotations. The data were generated from spotted arrays using samples collected at various time points during the diauxic shift (12), the mitotic cell division cycle (13), sporulation (14), and temperature and reducing shocks, and are available on the Stanford web site (http://rana.stanford.edu/clustering).

Predictions of ORFs of unknown function were made by using a slightly different set of data that did not include the temperature and reducing shocks data. The data included 6,221 genes, of which 2,467 were the annotated genes described above. The 80-element gene expression vectors used for these experiments included 65 of the 79 elements from the initial data used, plus 15 additional mitotic cell division cycle time points not used by Eisen et al. (1). These data are also available on the Stanford web site.
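As a concrete restatement of Eq. 1, this short sketch (our own, using numpy) turns one gene's raw expression and reference levels into the unit-length vector of normalized log ratios:

```python
import numpy as np

def normalized_log_ratios(E, R):
    """Eq. 1: X_i = log(E_i / R_i), scaled so the vector of all 79
    measurements has Euclidean length 1. E and R hold one gene's
    expression levels across the experiments and reference states."""
    x = np.log(np.asarray(E, dtype=float) / np.asarray(R, dtype=float))
    return x / np.linalg.norm(x)
```

Positive entries of the result then correspond to induced genes and negative entries to repressed ones, as described above.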
Support Vector Machines. Each vector X in the gene expression matrix may be thought of as a point in an m-dimensional expression space. In theory, a simple way to build a binary classifier is to construct a hyperplane separating class members (positive examples) from non-members (negative examples) in this space. Unfortunately, most real-world problems involve nonseparable data for which there does not exist a hyperplane that successfully separates the positive from the negative examples. One solution to the inseparability problem is to map the data into a higher-dimensional space and define a separating hyperplane there. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training examples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable. However, translating the training set into a higher-dimensional space incurs both computational and learning-theoretic costs. Furthermore, artificially separating the data in this way exposes the learning system to the risk of finding trivial solutions that overfit the data.

SVMs elegantly sidestep both difficulties (4). They avoid overfitting by choosing the maximum margin separating hyperplane from among the many that can separate the positive from negative examples in the feature space. Also, the decision function for classifying points with respect to the hyperplane involves only dot products between points in the feature space. Because the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space, a support vector machine can locate the hyperplane without ever representing the space explicitly, simply by defining a function, called a kernel function, that plays the role of the dot product in the feature space. This technique avoids the computational burden of explicitly representing the feature vectors.

For some data sets, the SVM may not be able to find a separating hyperplane in feature space, either because the kernel function is inappropriate for the training data or because the data contain mislabeled examples. The latter problem can be addressed by using a soft margin that allows some training examples to fall on the wrong side of the separating hyperplane. Completely specifying a support vector machine therefore requires specifying two parameters: the kernel function and the magnitude of the penalty for violating the soft margin. The settings of these parameters depend on the specific data at hand.

Given an expression vector X for each gene, the simplest kernel K(X, Y) that we can use to measure the similarity between genes X and Y is the dot product in the input space, $K(X, Y) = \vec{X} \cdot \vec{Y} = \sum_{i=1}^{79} X_i Y_i$. For technical reasons (see http://www.cse.ucsc.edu/research/compbio/genex), we add 1 to this kernel, obtaining a kernel defined by $K(X, Y) = \vec{X} \cdot \vec{Y} + 1$. When this dot product kernel is used, the feature space is essentially the same as the 79-dimensional input space, and the SVM will classify the examples with a separating hyperplane in this space. Squaring this kernel, i.e., defining $K(X, Y) = (\vec{X} \cdot \vec{Y} + 1)^2$, yields a quadratic separating surface in the input space. The corresponding separating hyperplane in the feature space includes features for all pairwise mRNA expression interactions X_i X_j, where 1 ≤ i, j ≤ 79. Raising the kernel to higher powers yields polynomial separating surfaces of higher degrees in the input space. In general, the kernel of degree d is defined by $K(X, Y) = (\vec{X} \cdot \vec{Y} + 1)^d$. In the feature space of this kernel, for any gene X there are features for all d-fold interactions between mRNA measurements, represented by terms of the form $X_{i_1} X_{i_2} \cdots X_{i_d}$, where $1 \le i_1, \ldots, i_d \le 79$. We experiment here with these kernels for degrees d = 1, 2, and 3.

We also experiment with a radial basis kernel (15), which has a Gaussian form $K(X, Y) = \exp(-\|\vec{X} - \vec{Y}\|^2 / 2\sigma^2)$, where σ is the width of the Gaussian. In our experiments, σ is set equal to the median of the Euclidean distances from each positive example to the nearest negative example (16).
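The kernels just described are compact enough to state directly in code. This is a sketch under our own naming, with the width heuristic from ref. 16 implemented as the median over positive examples of the distance to the nearest negative example:

```python
import numpy as np

def dot_product_kernel(x, y, d=1):
    """K(X, Y) = (X . Y + 1)^d; d = 1, 2, and 3 in the experiments."""
    return (np.dot(x, y) + 1.0) ** d

def radial_basis_kernel(x, y, sigma):
    """K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def median_width(positives, negatives):
    """Width heuristic: median Euclidean distance from each positive
    training example to its nearest negative training example."""
    return float(np.median([
        min(np.linalg.norm(p - n) for n in negatives) for p in positives
    ]))
```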
The gene functional classes examined here contain very few members relative to the total number of genes in the data set. This leads to an imbalance in the number of positive and negative training examples that, in combination with noise in the data, is likely to cause the SVM to make incorrect classifications. When the magnitude of the noise in the negative examples outweighs the total number of positive examples, the optimal hyperplane located by the SVM will be uninformative, classifying all members of the training set as negative examples. We combat this problem by modifying the matrix of kernel values computed during SVM optimization. Let X^(1), ..., X^(n) be the genes in the training set, and let K be the matrix defined by the kernel function K on this training set; i.e., $K_{ij} = K(X^{(i)}, X^{(j)})$. By adding to the diagonal of the kernel matrix a constant whose magnitude depends on the class of the training example, one can control the fraction of misclassified points in the two classes. This technique ensures that the positive points are not regarded as noisy labels. For positive examples, the diagonal element is modified by $K_{ij} := K_{ij} + \lambda (n^{+}/N)$, where $n^{+}$ is the number of positive training examples, N is the total number of training examples, and λ is a scale factor. A similar formula is used for the negative examples, with $n^{+}$ replaced by $n^{-}$. In the experiments reported here, the scale factor is set to 0.1. A more mathematically detailed discussion of the techniques employed here is available at http://www.cse.ucsc.edu/research/compbio/genex.
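A sketch of this diagonal modification, assuming the kernel matrix K has already been computed on the training set and labels is a vector of ±1 values (the names are ours; λ = 0.1 as in the text):

```python
import numpy as np

def soften_diagonal(K, labels, lam=0.1):
    """Add lam * n_plus / N to the diagonal entries of positive
    examples and lam * n_minus / N to those of negative examples."""
    K = np.array(K, dtype=float)  # copy; leave the caller's matrix intact
    y = np.asarray(labels)
    N = len(y)
    n_plus = int(np.sum(y == 1))
    n_minus = int(np.sum(y == -1))
    for i in range(N):
        K[i, i] += lam * (n_plus if y[i] == 1 else n_minus) / N
    return K
```

Because the positive class is tiny, the constant added to positive diagonal entries is small, so positive examples are penalized less for margin violations than negatives are.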
Experimental Design. Using the class definitions made by the MYGD, we trained SVMs to recognize six functional classes: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones, and helix-turn-helix proteins. The MYGD class definitions come from biochemical and genetic studies of gene function, whereas the microarray expression data measure mRNA levels of genes. Many classes in MYGD, especially structural classes such as protein kinases, will be unlearnable from expression data by any classifier. The first five classes were selected because they represent categories of genes that are expected, on biological grounds, to exhibit similar expression profiles. Furthermore, Eisen et al. (1) suggested that the mRNA expression vectors for these classes cluster well using hierarchical clustering. The sixth class, the helix-turn-helix proteins, is included as a control group. Because there is no reason to believe that the members of this class are similarly regulated, we did not expect any classifier to learn to recognize members of this class based on mRNA expression measurements.

The performance of the SVM classifiers was compared with that of four standard machine learning algorithms: Parzen windows, Fisher's linear discriminant, and two decision tree learners (C4.5 and MOC1). Descriptions of these algorithms can be found at http://www.cse.ucsc.edu/research/compbio/genex. Performance was tested by using a three-way cross-validated experiment. The gene expression vectors were randomly divided into three groups. Classifiers were trained by using two-thirds of the data and were tested on the remaining third. This procedure was then repeated two more times, each time using a different third of the genes as test genes.

The performance of each classifier was measured by examining how well the classifier identified the positive and negative examples in the test sets. Each gene in the test set can be categorized in one of four ways: true positives are class members according to both the classifier and MYGD; true negatives are non-members according to both; false positives are genes that the classifier places within the given class but MYGD classifies as non-members; false negatives are genes that the classifier places outside the class but MYGD classifies as members. We report the number of genes in each of these four categories for each of the learning methods we tested. To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives. The cost for each method is compared with the cost C(N) of using the null learning procedure, which classifies all test examples as negative. We define the cost savings of using the learning procedure M as S(M) = C(N) − C(M).

Experiments predicting functions of unknown genes were performed by first training SVM classifiers on the 2,467 annotated genes for the five learnable classes. For each class, the remaining 3,754 genes were then classified by the SVM.
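The cost and cost-savings measures reduce to a few lines. In this sketch (our own naming), the null-procedure cost follows from the fact that classifying everything as negative yields no false positives and one false negative per class member:

```python
def cost(fp, fn):
    """C(M) = fp(M) + 2 * fn(M); false negatives weighted twice."""
    return fp + 2 * fn

def cost_savings(fp, fn, tp):
    """S(M) = C(N) - C(M), where the null procedure N labels every
    test gene negative, so fn(N) = fn + tp (all class members)."""
    return cost(0, fn + tp) - cost(fp, fn)

# Check against the first row of Table 1 (TCA, degree-1 dot product SVM):
assert cost_savings(fp=18, fn=5, tp=12) == 6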
Results and Discussion

SVMs Outperform Other Methods. Our experiments show that some functional classes of genes can be recognized by using SVMs trained on DNA microarray expression data. We compare SVMs to four non-SVM methods and find that SVMs provide superior performance.

Table 1 summarizes the results of a three-fold cross-validation experiment using all eight of the classifiers tested, including four SVM variants, Parzen windows, Fisher's linear discriminant, and two decision tree learners. Performance is evaluated in the standard machine learning setting, in which each method must produce a positive or negative classification label for each member of the test set based only on what it has learned from the training set. The first four columns are the categories false positive (FP), false negative (FN), true positive (TP), and true negative (TN), and the fifth is a measure of overall performance.

Table 1. Comparison of error rates for various classification methods

Class  Method       FP  FN   TP     TN  S(M)
TCA    D-p 1 SVM    18   5   12  2,432     6
TCA    D-p 2 SVM     7   9    8  2,443     9
TCA    D-p 3 SVM     4   9    8  2,446    12
TCA    Radial SVM    5   9    8  2,445    11
TCA    Parzen        4  12    5  2,446     6
TCA    FLD           9  10    7  2,441     5
TCA    C4.5          7  17    0  2,443    -7
TCA    MOC1          3  16    1  2,446    -1
Resp   D-p 1 SVM    15   7   23  2,422    31
Resp   D-p 2 SVM     7   7   23  2,430    39
Resp   D-p 3 SVM     6   8   22  2,431    38
Resp   Radial SVM    5  11   19  2,432    33
Resp   Parzen       22  10   20  2,415    18
Resp   FLD          10  10   20  2,427    30
Resp   C4.5         18  17   13  2,419     8
Resp   MOC1         12  26    4  2,425    -4
Ribo   D-p 1 SVM    14   2  119  2,332   224
Ribo   D-p 2 SVM     9   2  119  2,337   229
Ribo   D-p 3 SVM     7   3  118  2,339   229
Ribo   Radial SVM    6   5  116  2,340   226
Ribo   Parzen        6   8  113  2,340   220
Ribo   FLD          15   5  116  2,331   217
Ribo   C4.5         31  21  100  2,315   169
Ribo   MOC1         26  26   95  2,320   164
Prot   D-p 1 SVM    21   7   28  2,411    35
Prot   D-p 2 SVM     6   8   27  2,426    48
Prot   D-p 3 SVM     3   8   27  2,429    51
Prot   Radial SVM    2   8   27  2,430    52
Prot   Parzen       21   5   30  2,411    39
Prot   FLD           7  12   23  2,425    39
Prot   C4.5         17  10   25  2,415    33
Prot   MOC1         10  17   18  2,422    26
Hist   D-p 1 SVM     0   2    9  2,456    18
Hist   D-p 2 SVM     0   2    9  2,456    18
Hist   D-p 3 SVM     0   2    9  2,456    18
Hist   Radial SVM    0   2    9  2,456    18
Hist   Parzen        2   3    8  2,454    14
Hist   FLD           0   3    8  2,456    16
Hist   C4.5          2   2    9  2,454    16
Hist   MOC1          2   5    6  2,454    10
HTH    D-p 1 SVM    60  14    2  2,391   -56
HTH    D-p 2 SVM     3  16    0  2,448    -3
HTH    D-p 3 SVM     1  16    0  2,450    -1
HTH    Radial SVM    0  16    0  2,451     0
HTH    Parzen       14  16    0  2,437   -14
HTH    FLD          14  16    0  2,437   -14
HTH    C4.5          2  16    0  2,449    -2
HTH    MOC1          6  16    0  2,445    -6

The methods are the SVMs using the scaled dot product kernel raised to the first, second, and third power, the radial basis function SVM, Parzen windows, Fisher's linear discriminant, and the two decision tree learners, C4.5 and MOC1. The next five columns are the false positive, false negative, true positive, and true negative rates summed over three cross-validation splits, followed by the total cost savings [S(M)], as defined in the text.

For every class (except the helix-turn-helix class), the best-performing method is a support vector machine using the radial basis or a higher-dimensional dot product kernel. Other cost functions, with different relative weights of the false positive and false negative rates, yield similar rankings of performance. In five separate tests, the radial basis SVM performs better than Fisher's linear discriminant. Under the null hypothesis that the methods are equally good, the probability that the radial basis SVM would be the best all five times is 2⁻⁵ ≈ 0.03. The results also show the inability of all classifiers to learn to recognize genes that produce helix-turn-helix proteins, as expected.

The results shown in Table 1 for higher-order SVMs are considerably better than the corresponding error rates for clusters derived in an unsupervised fashion. For example, using hierarchical clustering, the histone cluster identified only 8 of the 11 histones, and the ribosome cluster found only 112 of the 121 genes and included 14 others that were not ribosomal genes (1).

We repeated the experiment with all four SVMs four more times with different random splits of the data. The results show that the variance introduced by the random splitting of the data is small relative to the mean. The easiest-to-learn functional classes are those with the smallest ratio of standard deviation to mean cost savings. For example, for the radial basis SVM, the means and standard deviations of the cost savings for the two easiest classes, ribosomal proteins and histones, are 225.8 ± 2.9 and 18.0 ± 0.0, respectively. The most difficult class, TCA cycle, had a mean and standard deviation of 10.4 ± 3.0. Results for the other classes and other kernel functions are similar (http://www.cse.ucsc.edu/research/compbio/genex).

Significance of Consistently Misclassified Annotated Genes. The five different three-fold cross-validation experiments, each performed with four different kernels, yield a total of 20 experiments per functional class. Across all five functional classes (excluding helix-turn-helix) and all 20 experiments, 25 genes are misclassified in at least 19 of the 20 experiments (Table 2). In general, these disagreements with MYGD reflect the different perspective provided by the expression data, which represents the genetic response of the cell, and the MYGD definitions, which have been arrived at through experiments or protein structure predictions. For example, in MYGD, the members of a complex are defined by biochemical co-purification, whereas the expression data may identify proteins that are not physically part of the complex but contribute to proper functioning of the complex.
This will lead to disagreements in the form of false positives. Disagreements between the SVM and MYGD in the form of false negatives may occur for a number of reasons. First, genes that are classified in MYGD primarily by structure (e.g., protein kinases) may have very different expression patterns. Second, genes that are regulated at the translational or protein level, rather than at the transcriptional level as measured by the microarray experiments, cannot be correctly classified by expression data alone. Third, genes for which the microarray data are corrupt may not be correctly classified. False positives and false negatives represent cases in which further biological experimentation may be fruitful.

Table 2. Consistently misclassified genes

Class  Gene     Locus   Error  Description
TCA    YPR001W  CIT3    FN     Mitochondrial citrate synthase
TCA    YOR142W  LSC1    FN     α subunit of succinyl-CoA ligase
TCA    YLR174W  IDP2    FN     Isocitrate dehydrogenase
TCA    YIL125W  KGD1    FN     α-ketoglutarate dehydrogenase
TCA    YDR148C  KGD2    FN     Component of α-ketoglutarate dehydrog. complex (mito)
TCA    YBL015W  ACH1    FP     Acetyl-CoA hydrolase
Resp   YPR191W  QCR2    FN     Ubiquinol cytochrome-c reductase core protein 2
Resp   YPL271W  ATP15   FN     ATP synthase ε subunit
Resp   YPL262W  FUM1    FP     Fumarase
Resp   YML120C  NDI1    FP     Mitochondrial NADH ubiquinone 6 oxidoreductase
Resp   YKL085W  MDH1    FP     Mitochondrial malate dehydrogenase
Resp   YGR207C          FN     Electron-transferring flavoprotein, β chain
Resp   YDL067C  COX9    FN     Subunit VIIa of cytochrome c oxidase
Ribo   YPL037C  EGD1    FP     β subunit of the nascent-polypeptide-associated complex
Ribo   YLR406C  RPL31B  FN     Ribosomal protein L31B (L34B) (YL28)
Ribo   YLR075W  RPL10   FP     Ribosomal protein L10
Ribo   YDL184C  RPL41A  FN     Ribosomal protein L41A (YL41) (L47A)
Ribo   YAL003W  EFB1    FP     Translation elongation factor EF-1β
Prot   YHR027C  RPN1    FN     Subunit of 26S proteasome (PA700 subunit)
Prot   YGR270W  YTA7    FN     Member of CDC48/PAS1/SEC18 family of ATPases
Prot   YGR048W  UFD1    FP     Ubiquitin fusion degradation protein
Prot   YDR069C  DOA4    FN     Ubiquitin isopeptidase
Prot   YDL020C  RPN4    FN     Involved in ubiquitin degradation pathway
Hist   YOL012C  HTA3    FN     Histone-related protein
Hist   YKL049C  CSE4    FN     Required for proper kinetochore function

The table lists all 25 genes that are most consistently misclassified by the SVMs. Two types of errors are included: a false positive (FP) occurs when the SVM includes the gene in the given class but the MYGD classification does not; a false negative (FN) occurs when the SVM does not include the gene in the given class but the MYGD classification does.

Many of the false positive genes in Table 2 are known from biochemical studies to be important for the functional class assigned by the SVM, even though MYGD has not included these genes in their classification. For example, YAL003W and YPL037C, assigned repeatedly to the cytoplasmic ribosome class, are not strictly ribosomal proteins; however, both are important for proper functioning of the ribosome. YAL003W encodes a translation elongation factor, EFB1, known to be required for the proper functioning of the ribosome (17).
YPL037C, EGD1, is part of the nascent polypeptide-associated complex, which has been shown to bind translating ribosomes and help target nascent polypeptides to several locations, including the endoplasmic reticulum and mitochondria (18). The cell ensures that expression of these proteins keeps pace with the expression of ribosomal proteins, as shown in Fig. 1. Thus, the SVM classifies YAL003W and YPL037C with ribosomal proteins.

[Fig. 1. Expression profile of YPL037C compared with the MYGD class of cytoplasmic ribosomal proteins. YPL037C is classified as a ribosomal protein by the SVMs but is not included in the class by MYGD. The figure shows the expression profile for YPL037C, along with standard deviation bars for the class of cytoplasmic ribosomal proteins. Ticks along the x axis represent the beginnings of experimental series.]

A false positive in the respiration class, YML120C, encodes NADH:ubiquinone oxidoreductase. In yeast, this enzyme replaces respiration complex 1 (19) and is crucial for the transfer of high-energy electrons from NADH to ubiquinone, and thus for respiration (19, 20). A consistent false positive in the proteasome class is YGR048W (UFD1). Although not strictly part of the proteasome, YGR048W is necessary for proper functioning of the ubiquitin pathway (21), which delivers proteins to the proteasome for proteolysis. Another interesting false positive in the TCA class is YBL015W (ACH1), an acetyl-CoA hydrolase. Although this enzyme catalyzes what could be considered an unproductive reaction on a key TCA cycle-glyoxylate cycle substrate, its activity could be very important in regulating metabolic flux. Hence, it may be significant that expression of this enzyme parallels that of true TCA cycle enzymes.

A distinct set of false positives puts members of the TCA pathway, YPL262W and YKL085W, in the respiration class. Although MYGD separates the TCA pathway and respiration, both classes are important for the production of ATP. In fact, the expression profiles of these two classes are strikingly similar (data not shown). Thus, although MYGD considers these two classes separate, both the expression data and other experimental work suggest that there is significant regulatory overlap. The current SVMs may lack sufficient sensitivity to resolve two such intimately related functional classes using expression data alone.

Some of the false negatives occur when a protein assigned to a functional class based on structure has a special function that demands a different regulation strategy. For example, YKL049C is classified as a histone protein by MYGD based on its 61% amino acid similarity with histone protein H3. YKL049C is thought to act as a part of the centromere (22); however, the expression data show that it is not co-regulated with histone genes. A similar situation arises in the proteasome class. Both YDL020C and YDR069C may be loosely associated with the proteasome (23-25), but the SVM does not classify them as belonging to the proteasome because they are regulated differently from the rest of the proteasome during sporulation.

One limitation inherent in the use of gene expression data is that some genes are regulated at the translational and protein levels. For example, four of the five genes that the SVM was unable to identify as members of the TCA class are genes encoding enzymes known to be regulated allosterically by ADP/ATP, succinyl-CoA, and NAD/NADPH (26). Thus, the activities of these enzymes are regulated by means that do not involve changes in mRNA level. If their mRNA levels do not keep pace with those of other TCA cycle enzymes, the SVM will not be able to classify them correctly by expression data alone.

Other discrepancies appear to be caused by corrupt data. For example, the SVM classifies YLR075W as a cytoplasmic ribosomal protein, but MYGD did not. However, YLR075W is a ribosomal protein (27, 28), and the original annotation in MYGD has since been corrected.
Some proteins, for example YGR207C and YGR270W, may be prematurely placed in functional classes based only on protein sequence similarities. Other errors occur in the expression data itself. Occasionally, the microarrays contain bad probes or are damaged, and some locations in the gene expression matrix are marked as containing corrupt data. Four of the genes listed in Table 2 (YPR001W, YPL271W, YHR027C, and YOL012C) are marked as such. In addition, although the SVM correctly assigns YDL075W to the ribosomal protein class, YLR406C, essentially a duplicate sequence copy of YDL075W, is not assigned to that class. Similarly, YDL184C is not assigned to the ribosome class despite the correct assignment of its near twin YDL133C-A. Because pairs of nearly identical genes such as these cannot be distinguished by hybridization, it is likely that the YLR406C and YDL184C data are also questionable.

Table 3. Predicted functional classifications for previously unannotated genes

Class  Gene     Locus      Comments
TCA    YHR188C             Conserved in worm, Schizosaccharomyces pombe, human
TCA    YKL039W  PTM1       Major transport facilitator family; likely integral membrane protein; similar YHL017W not co-regulated
Resp   YKR016W             Not highly conserved, possible homolog in S. pombe
Resp   YKR046C             No convincing homologs
Resp   YPR020W  ATP20      Subsequently annotated: subunit of mitochondrial ATP synthase complex
Resp   YLR248W  CLK1/RCK2  Cytoplasmic protein kinase of unknown function
Ribo   YKL056C             Homolog of translationally controlled tumor protein, abundant, conserved and ubiquitous protein of unknown function
Ribo   YNL119W             Possible remote homologs in several divergent species
Ribo   YNL255C  GIS2       Cellular nucleic acid binding protein homolog, seven CCHC (retroviral) type zinc fingers
Ribo   YNL053W  MSG5       Protein-tyrosine phosphatase, overexpression bypasses growth arrest by mating factor
Ribo   YNL217W             Similar to bis(5'-nucleotidyl)-tetraphosphatases
Prot   YDR330W             Ubiquitin regulatory domain protein, S. pombe homolog
Prot   YJL036W             Member of sorting nexin family
Prot   YDL053C             No convincing homologs
Prot   YLR387C             Three C2H2 zinc fingers, similar YBR267W not co-regulated

The table lists the names of unannotated genes that were classified as members of a particular functional class by at least three of the four SVM methods. No unannotated histones were predicted.