International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.4, August 2012
DOI: 10.5121/ijcsea.2012.2409

Classification of Microarray Gene Expression Data by Gene Combinations using Fuzzy Logic (MGC-FL)

V. Bhuvaneswari 1 and .Vanitha 2

1 Assistant Professor, Department of Computer Applications, Bharathiar University, Coimbatore, India
2 M.Phil Research Scholar, Department of Computer Applications, Bharathiar University, Coimbatore, India

Abstract

Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a given diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. This paper proposes a framework to find informative gene combinations and to classify gene combinations into their relevant subtype by using fuzzy logic. The genes are ranked based on their statistical scores and highly informative genes are filtered. These genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value for each gene is calculated to select the top gene combinations, which are then used to classify lymphoma subtypes by means of fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and is analyzed to assess the accuracy of the results. The work is implemented in Java.

Keywords: Feature selection, T-Test, Fuzzy, Classification, Clustering

1. INTRODUCTION

Data mining, or knowledge discovery, is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques. Data mining is considered the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [13].

Microarrays are capable of profiling the gene expression patterns of tens of thousands of genes in a single experiment. Gene expression data can be a valuable source for understanding the genes and the biological associations between them. Such data has high dimension and few samples, so gene selection, i.e. feature selection, is very important in determining the classification accuracy. The dataset used in this work is the Lymphoma Dataset, which includes 4026 gene expression values with their subtypes.

The task of feature selection is generally divided into two aspects: eliminating irrelevant features and eliminating redundant ones. Irrelevant features usually disturb the learner and degrade the accuracy, while redundant features add to computational cost without bringing in new information. Not all the genes used in the expression profile are informative, and many of them are redundant. Finding informative genes greatly reduces the computational burden and the noise arising from irrelevant genes. Reducing the number of genes by feature selection while still retaining the best class prediction accuracy for the classifier is vital in classification [2].

Gene ranking simplifies gene expression tests to include only a very small number of genes rather than thousands of genes. The goal is to identify a small subset of genes which together give accurate predictions.
The importance ranking of each gene is done using a feature ranking measure called the T-Test, which ranks the genes based on their statistical score.

The T-Test method considers classes with different samples. The mean value of each gene's expression in a class is calculated. The T-score (TS) used here is a t-statistic between the centroid of a specific class and the overall centroid of all the classes. The T-scores of the genes are sorted and the genes with the highest T-scores are ranked from 1 to 100. The genes with the highest scores are retained as informative genes, which are used for gene combinations.

Fuzzy logic is a superset of conventional Boolean logic. Unlike other logical systems, fuzzy logic deals with imprecise or uncertain knowledge. The set of informative genes with gene expression data is converted into fuzzy values using Type-1 fuzzy logic. The different gene combinations are identified and an intermediate value is calculated for each gene combination. Further, the lymphoma subtypes are classified based on the fuzzy rules on a test dataset.

The fuzzified informative genes are used to find gene combinations, which are used for classifying the dataset to find its lymphoma subtypes. Specifically, single-gene, two-gene and three-gene combinations are formed from the selected informative genes. The purpose of generating gene combinations is to find out whether they can classify lymphoma subtypes.

A fuzzy rule involves a fuzzy condition and a fuzzy conclusion. The intermediate values calculated for single-gene, two-gene and three-gene combinations are used to frame fuzzy rules to classify the lymphoma subtypes, such as DLBCL, FL and CLL, of the test dataset. The test dataset consists of one hundred random genes selected from the whole dataset of 4026 genes with its samples.

Clustering is the process of organizing objects into groups whose members are similar in some respect. Here the gene combinations, such as two-gene and three-gene combinations, are grouped into a set of disjoint classes, called clusters, so that genes within a class have high similarity to each other, while genes in separate classes are more dissimilar. Finally, the gene combinations are verified and their correlation is compared with a hierarchical clustering approach by grouping all the informative genes. Then the classification accuracy of the gene combinations is analyzed based on their efficiency in classifying the subtypes, such as DLBCL, FL and CLL, of the test dataset.

This paper is organized as follows. Section 2 provides the literature study of the various feature selection methods, gene classification and fuzzy logic for biological databases. Section 3 explores the methodology for Microarray Gene Classification using Fuzzy Logic (MGC-FL). In Section 4 the implemented results are verified and validated. The final section draws the conclusion of the paper.

2. REVIEW OF LITERATURE

In [3] Qinghua Huang et al. (2011) discussed the importance of feature selection. The objective of feature selection is to find optimal or suboptimal subsets of the original feature sets, removing irrelevant features while preserving intrinsic class information.

In [15] Patharawut Saengsiri et al. (2011) described the benefits of feature selection. They proposed three feature selection methods: Correlation-based Feature Selection, Gain Ratio and Information Gain. Correlation-based Feature Selection rests on the relevance between a feature and the target class, evaluated heuristically. The Gain Ratio technique improves on Information Gain and is based on information theory.

In [17] Alok Sharma et al. (2011) proposed a feature selection algorithm for classification problems using transcriptome data. The proposed algorithm explores and provides a way to investigate important genes.
It is observed that the algorithm finds a small gene subset that provides high classification accuracy on several DNA microarray gene expression datasets.

In [6] Yan-Fei Wang et al. (2011) proposed a type-2 fuzzy membership test (Type-2 FM test) for disease-associated gene identification on microarrays to improve on traditional fuzzy methods. The results showed that the Type-2 FM test performs better than traditional fuzzy methods when analyzing microarray data with similar expression values and noise.

In [7] Pablo Martin-Munoz et al. (2010) presented a new algorithm, FuzzyCN2, for extracting conjunctive fuzzy classification rules. This algorithm produces an ordered list of fuzzy rules.

In [20] Yan-Fei Wang et al. (2010) proposed combining the FCM method with empirical mode decomposition (EMD) for clustering microarray data in order to reduce the effect of noise. The method is called the fuzzy C-means method with empirical mode decomposition (FCM-EMD).

In [4] Lipo Wang et al. (2010) discussed ranking of genes using two methods, T-Score (TS) and Class Separability (CS). All genes in the training data set are ranked using a certain ranking criterion and a small number of highly ranked genes is retained. In the T-Test statistical method the T-scores are calculated for each gene and the genes with the highest T-scores are selected.

In [16] Wutao Chen, Huijuan Lu et al. (2009) compared various feature selection methods for selecting informative genes, that is, choosing genes whose expression levels differ strongly between different types of samples. Among the various feature selection methods, such as SNR, t-test, Fisher and information gain, the t-test was found to be an effective method for the binary classification problem.

In [11] Zarita Zainuddin et al. (2009) discussed microarray data preprocessing. Microarray data consists of an overwhelming number of genes relative to the number of samples. However, the majority of such genes are probably irrelevant in discriminating between the subclasses of heterogeneous cancers. Hence, gene selection is a crucial aspect of microarray data analysis.

In [12] Wutao Chen et al. (2009) introduced classification of gene expression data using an artificial neural network based on sample filtering. Simulation tests were carried out to verify the proposed strategy using Leukemia data sets, and the test results were compared with those of a single artificial neural network.

In [9] Jahangheer Shaik et al. (2009) presented Fuzzy-Adaptive-Subspace-Iteration-based Two-way Clustering (FASIC) of microarray data to find differentially expressed genes from two-sample microarray experiments.
In [10] Keon Myung Lee et al. (2009) introduced three fuzzy set-based microarray data analysis techniques used to find local clusters, to locate contrasting groups, and to filter groups with a specific pattern.

In [19] Mingrui Zhang et al. (2009) evaluated several validity measures in fuzzy clustering and developed a new measure for a fuzzy c-means algorithm which uses a Pearson correlation in its distance metrics.

In [18] Ming Chen et al. (2008) focused on a method of optimizing neural network classifiers by a Genetic Algorithm based on the principle of gene reconfiguration, and implemented classification by training the weights.

In [5] Qingzhong Liu et al. (2006) presented a scheme of recursive feature addition for gene selection and combined classifiers for the purpose of classifying tumor tissues using DNA microarray data.

In [8] Nilesh N. Karnik et al. (1999) introduced a type-2 fuzzy logic system (FLS), which handles rule uncertainties. It involves the operations of fuzzification, inference, and output processing.

3. PROBLEM FORMULATION AND METHODOLOGY

The proposed framework, Microarray Gene Classification using Fuzzy Logic (MGC-FL), shown in Figure 1, is used to find informative gene combinations and to classify gene combinations into their relevant subtype by using fuzzy logic. In the initial phase the noisy data is removed and genes are ranked based on their statistical scores. The highly informative genes are filtered based on the ranking of genes. In the classification phase the informative genes are fuzzified and 2-gene and 3-gene combinations are identified. The intermediate value for each gene combination is calculated to classify lymphoma subtypes by using fuzzy rules. In the final phase the top gene combinations are compared with clustering and the classification accuracy of the gene combinations is analyzed.

Figure 1. Framework for Microarray Gene Classification using Fuzzy Logic (MGC-FL)

[Figure: Lymphoma Dataset, followed by three phases. Preprocessing Phase: Removal of Noisy Data, Ranking of Genes, Filtering Informative Genes. Fuzzy Classification Phase: Fuzzification of Informative Genes, Identifying Gene Combinations, Gene Intermediate Value Calculation, Fuzzy Rules, Classifying Test Dataset. Accuracy Verification Phase: Comparing Gene Combination with Hierarchical Clustering, Verifying Classification Accuracy of Gene Combination.]
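
The gene ranking step of the preprocessing phase and the enumeration of 2-gene and 3-gene combinations can be illustrated with a short sketch. It is written in Python for brevity and is not the authors' Java implementation. The T-score is assumed to take the commonly used form TS_i = max_k |mean_ik - mean_i| / (m_k * s_i), with m_k = sqrt(1/n_k + 1/n) and s_i the pooled within-class standard deviation; the text above only states that the score is a t-statistic between a class centroid and the overall centroid, so this exact form is an assumption. The function names t_scores, top_genes and gene_combinations are hypothetical, and the fuzzification and intermediate value steps are not shown because their formulas are not given here.

```python
from itertools import combinations
from math import sqrt


def t_scores(expr, labels):
    """Return one T-score per gene (assumed formula, see lead-in).

    expr   -- list of genes, each a list of expression values (one per sample)
    labels -- class label of each sample, in the same order as the values
    """
    classes = sorted(set(labels))
    n, k = len(labels), len(classes)
    scores = []
    for row in expr:
        overall = sum(row) / n  # overall centroid of this gene
        groups = {c: [x for x, lab in zip(row, labels) if lab == c]
                  for c in classes}
        means = {c: sum(v) / len(v) for c, v in groups.items()}  # class centroids
        # Pooled within-class standard deviation s_i.
        ss = sum((x - means[c]) ** 2 for c, v in groups.items() for x in v)
        s = sqrt(ss / (n - k))
        # Largest scaled distance between a class centroid and the overall centroid.
        best = max(abs(means[c] - overall) / sqrt(1 / len(groups[c]) + 1 / n)
                   for c in classes)
        if s > 0:
            scores.append(best / s)
        else:
            # No within-class spread: infinite score if the centroids differ.
            scores.append(float("inf") if best > 0 else 0.0)
    return scores


def top_genes(expr, labels, top=100):
    """Indices of the `top` genes with the highest T-scores, best first."""
    scores = t_scores(expr, labels)
    order = sorted(range(len(expr)), key=lambda g: scores[g], reverse=True)
    return order[:top]


def gene_combinations(genes, size):
    """All unordered combinations of `size` genes (size 2 or 3 in the paper)."""
    return list(combinations(genes, size))


# Toy data: 4 genes, 6 samples, 2 samples per lymphoma subtype.
labels = ["DLBCL", "DLBCL", "FL", "FL", "CLL", "CLL"]
expr = [
    [1.0, 1.2, 1.1, 0.9, 1.0, 1.1],  # nearly flat across subtypes
    [5.0, 5.2, 1.0, 1.1, 1.0, 0.9],  # high in DLBCL only
    [0.2, 0.1, 0.3, 0.2, 2.0, 2.2],  # high in CLL only
    [0.5, 1.5, 0.4, 1.6, 0.6, 1.4],  # varies, but not by subtype
]
informative = top_genes(expr, labels, top=3)
pairs = gene_combinations(informative, 2)
triples = gene_combinations(informative, 3)
```

On the toy data the two genes that separate a subtype from the rest receive the highest scores. With the 100 informative genes retained in the paper, the same enumeration would yield 4950 two-gene and 161700 three-gene combinations, which is why only the top combinations are carried forward.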