Neural Techniques for Improving the Classification Accuracy of Microarray Data Set using Rough Set Feature Selection Method
Bichitrananda Patra
#1
, Sujata Dash
#2
, B. K. Tripathy
#3
1
Department of CSE, KMBB College of Engineering and Technology, Bhubaneswar, Odisha, India
2
Department of CSE, GIFT Engineering College, Bhubaneswar, Odisha, India
3
VITUniversity, Vellore, Tamilnadu, India
1
bnpatra@gmail.com
2
sujata_dash@yahoo.com
3
tripathybk@vit.ac.in
Abstract
 Classification, a data mining task is an effective method to classify the data in the process of Knowledge Data Discovery. Classification method algorithms are widely used in medical field to classify the medical data for diagnosis. Feature Selection increases the accuracy of the Classifier because it eliminates irrelevant attributes. This paper analyzes the performance of neural network classifiers with and without feature selection in terms of accuracy and efficiency to build a model on four different datasets. This paper provides rough feature selection scheme, and evaluates the relative performance of four different neural network classification procedures such as Learning Vector Quantisation (LVQ)  LVQ1, LVQ3, optimizedlearningrate LVQ1 (OLVQ1), and The SelfOrganizing Map (SOM) incorporating those methods. Experimental results show that the LVQ3 neural classification is an appropriate classification method makes it possible to construct high performance classification models for microarray data.
Keywords
 Data Mining, Rough, Feature Selection, Learning Vector Quantisation, SelfOrganizing Map, Classification
.
I.
I
NTRODUCTION
Knowledge Data Discovery (KDD) is a process of deriving hidden knowledge from databases. KDD consists of several phases like Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, Knowledge representation. Data mining is one of the important phases of knowledge data discovery. Data mining is a technique which is used to find new, hidden and useful patterns of knowledge from large databases. There are several data mining functions such as Concept descriptions, Association Rules, Classification, Prediction, Clustering and Sequence discovery to find the useful patterns. Data preprocessing is applied before data mining to improve the quality of the data. Data preprocessing includes data cleaning, data integration, data transformation and data reduction techniques. Cleaning is used to remove noisy data and missing values. Integration is used to extract data from multiple sources and storing as a single repository. Transformation transforms and normalizes the data in a consolidated form suitable for mining. Reduction reduces the data by adopting various techniques i.e., aggregating the data, attribute subset selection, dimensionality reduction, numerosity reduction and generation of concept hierarchies. The data reduction is also called as feature selection. Feature selection or attribute selection identifies the relevant attributes which are useful to the data mining task. Applying feature selection with data mining technique improves the quality of the data by removing irrelevant attributes. Rough set theory was developed by Palwak [1] in the early 1980s and has been used in data analysis, pattern recognition, and data mining and knowledge discovery [2, 3]. Recently, rough set theory has also been employed to select feature subset [4, 5, 1, 6, 7]. In the rough set community, feature selection algorithms are attributereduct oriented, that is, finding optimal reduct of condition attributes of a given data set. Two main approaches to finding attribute reducts are recognized as discernibility functionbased and attribute dependencybased [8, 1]. These algorithms, however, suffer from intensive computations of either discernibility functions for the former or positive regions for the latter, although some computation efficiency improvement has been made in some new developments. LVQ was applied successfully to areas such as audio compression, data compression, data transmission, facial recognition, radar signal processing, finance and insurance, production control, sale and marketing, and so on. Keeping all these issues in view, LVQ could be applied to such simple structured data, with higher confidence than that of SOM. One of the most amazing features of LVQ algorithm is that it can take very few vectors to obtain excellent classification results. The idea behind LVQ is to take away codebook vectors from the decision surfaces to clearly demarcate the class borders. We begin with a brief overview of the Rough Set framework in Section 2 and in Section 3 presents the proposed supervised
RSAR subset evaluation algorithm. Section 4 describes the learning vector quantization classification algorithms are introduced. The experimental framework and results are described in Section 5and 6 respectively. Finally, the conclusion and future work are presented in Section 7. II.
R
OUGH
S
ETS
The theory of rough sets begins with the notion of an approximation space, which is a pair
<
U
,
A
>
, where U is a nonempty set (the universe of discourse), i.e., U =
{x
1
, . . . , x
i
, . . . , x
n
}
, and A is a family of attributes, also called knowledge in the universe.
V
is the value domain of A, and
is an information function
: U
×
A
→ V
. An approximation space is also called an information system [9]. Any subset P of knowledge A defines an equivalence (also called indiscernibility) relation
IND
(P) on U
{(
)
̂
̂(
)}
If
(
)
IND
(P), then
and
are indiscernible by attributes from P. The partition of U generated by
IND
(P) is denoted as
*,

+⁄
Where
,

is the equivalence class containing
. The elements in
,

are indiscernible or equivalent with respect to knowledge P. Equivalence classes, also termed as information granules and are used to characterize arbitrary subsets of U. The equivalence classes of
IND
(P) and the empty set
∅
are the elementary sets in the approximation space
<
U
,
A
>
. Given an arbitrary set
X
⊆
U, in general, it may not be possible to precisely describe
X
in
⟨⟩
. One may characterize
X
by a pair of lower and upper approximations defined as follows [8]:
*,
 ,
⊆+
̅ *,
 ,
 +
That is, the lower approximation
is the union of all the elementary sets that are subsets of
X
, and the upper approximation
̅
is the union of all the elementary sets that have a nonempty intersection with
X
. The tuple
〈 〉
is the representation of an ordinary set
X
in the approximation space
〈〉
or is simply called the rough set of
X
. The lower (respectively, upper) approximation
(respectively,
)) is interpreted as the collection of those elements of U that definitely (respectively, possibly) belong to
X
. The lower approximation is also sometimes called the positive region, denoted as
POS
P
(
X
). A set
X
is said to be definable (or exact) in
⟨⟩
. If and only if
̅
Otherwise,
X
is indefinable and termed as a rough set. III.
R
OUGH
S
ET
A
TTRIBUTE
R
EDUCTION
(RSAR)
S
UBSET
E
VALUATION
The reduction of attributes is achieved by comparing equivalence relations generated by sets of attributes. Attributes are removed so that the reduced set provides the same predictive capability of the decision feature as the srcinal. A reduct is defined as a subset of minimal cardinality R
min
of the conditional attribute set C such that
γ
R
(D) =
γ
C
(D) [10, 11] The Quick Reduct algorithm given in figure 1, attempts to calculate a reduct without exhaustively generating all possible subsets. It starts off with an empty set and adds in turn, one at a time, those attributes that result in the Quick Reduct(C, D). C, the set of all conditional features; D, the set of decision features. (1) R
{ } (2) do (3) T
R (4)
x
(C
R) (5) if
*+
(D) >
T
(D) (6) T
R
{
x
} (7) R
T (8) until
γ
R
(D) ==
γ
C
(D) (9) return R
Fig. 1. The Quick Reduct Algorithm
greatest increase in the rough set dependency metric, until this produces its maximum possible value for the dataset.
IV.
LVQ
C
LASSIFICATIONS
LVQ was applied successfully to areas such as audio compression, data compression, data transmission, facial recognition, radar signal processing, finance and insurance, production control, sale and marketing, and so on. Keeping all these issues in view, LVQ could be applied to such simple structured data, with higher confidence than that of SOM. One of the most amazing features of LVQ algorithm is that it can take very few vectors to obtain excellent classification results. The idea behind LVQ is to take away codebook vectors from the decision surfaces to clearly demarcate the class borders [12].
A.
The LVQ1
Assume that a number of 'codebook vectors' m
i
(free parameter vectors) are placed into the input space to approximate various domains of the input vector x by their quantized values. Usually several codebook vectors are assigned to each class of x values, and x is then decided to belong to the same class to which the nearest m
i
belongs. Let c = arg min (x  m
i
) (1) define the nearest m
i
to x, denoted by m
c
. Values for the m
i
that approximately minimize the misclassification errors in the above nearestneighbor classification can be found as asymptotic values in the following learning process. Let x(t) be a sample of input and let the m
i
(t) represent sequences of the m
i
in the discretetime domain. Starting with properly defined initial values, the following equations define the basic LVQ1 process [21]: m
c
(t + 1) = m
c
(t) + alpha(t)[x(t)  m
c
(t)]
if x and m
c
belong to the same class, m
c
(t + 1) = m
c
(t) alpha(t)[x(t)  m
c
(t)] if x and m
c
belong to different classes, m
i
(t + 1) = m
i
(t) for i not in c. Here 0 < alpha (t) < 1, and alpha (t) may be constant or decrease monotonically with time. In the above basic LVQ1 it is recommended that alpha should initially be smaller than 0.1; linear decrease in time is used.
B.
The LVQ3
The LVQ2 algorithm was based on the idea of differentially shifting the decision borders towards the Bayes limits, while no attention was paid to what might happen to the location of the m
i
in the long run if this process were continued. Therefore it seems necessary to include corrections that ensure that the m
i
continue approximating the class distributions, at least roughly. Combining these ideas, we now obtain an improved algorithm that may be called LVQ3 [21] : m
i
(t + 1) = m
i
(t)  alpha(t)[x(t)  m
i
(t)], m
j
(t + 1) = m
j
(t) + alpha(t)[x(t)  m
j
(t)], where mi and mj are the two closest codebook vectors to x, whereby x and mj belong to the same class, while x and mi belong to different classes, respectively; furthermore x must fall into the 'window'; m
k
(t + 1) = m
k
(t) + epsilon alpha(t)[x(t)  m
k
(t)] (5) for k in {i, j}, if x, m
i
, and m
j
belong to the same class.
C.
The optimizedlearningrate LVQ1 (OLVQ1)
The basic LVQ1 algorithm is now modified in such a way that an individual learning rate
α
i
(t) is assigned to each m
i
. We then get the following discrete time learning process. Let c be defined by Eq. (1). Then m
c
(t + 1) = m
c
(t) + α
c
(t)[x(t)  m
c
(t)] if x is classified correctly, m
c
(t + 1) = m
c
(t) 
α
c
(t)[x(t)  m
c
(t)] (6) if the classification of x is incorrect, m
i
(t + 1) = m
i
(t) for i not in c. Next we address the problem of whether the
α
i
(t) can be determined optimally for fastest possible convergence of (6). If we express (6) in the form m
c
(t + 1) = [1  s(t)
α
c
(t)]m
c
(t) + s(t)
α
c
(t)x(t) (7) where s(t) = +1 if the classification is correct and s(t) = 1 if the classification is wrong, we first directly see that m
c
(t) is statistically independent of x(t). It may also be obvious that the statistical accuracy of the learned codebook vector values is optimal if the effects of the corrections made at different times, when referring to the end of the learning period, are of equal weight. Notice that m
c
(t + 1) contains a "trace" from x(t) through the last term in (7), and "traces" from the earlier x(t'); t' = 1,2,... t1 through m
c
(t). The (absolute) magnitude of the last "trace" from x(t)
is scaled down by the factor α
c
(t), and, for instance, the "trace" from x(t  1) is scaled down by [1  s(t)
α
c
(t)]
α
c
(t  1). Now we first stipulate that these two scalings must be identical:
α
c
(t) = [1  s(t)
α
c
(t)]
α
c
(t  1) (8) If this condition is then made to hold for all t, by induction it can be shown that the "traces" collected up to time t from the entire earlier x will be scaled down by an equal amount at the end, and thus the "optimal" values of
α
i
(t) are determined by the recursion
α
c
(t) =
α
c
(t  1)/ (1 + s(t)
α
c
(t  1)) (9) Any user of the LVQ_PAK can easily become convinced about that (9) really provides for fast convergence. A precaution must be made, however: since
α
c
(t) can also increase, it is especially important that it does not rise above the value 1; the learning program OLVQ1 is even more restrictive, it
never allows any α
i
to rise above its initial value[21].
D.
The SelfOrganizing Map
T. Kohonen introduced the SelfOrganizing Map (SOM) [13]. It is an unsupervised learning process, which learns the distribution of a set of patterns without any class information. It has the property of topology preservation. There is a competition among the neurons to be activated or fired .The result is that only one neuron that wins the competition is fired and is called winnertakes all neuron. SOMs may be onedimensional, twodimensional or multidimensional, but the most common ones are either onedimensional or twodimensional maps. The number of input connections depends on the number of attributes to be used in the classification. The neuron with weights closest to the input data vector is declared the winner during the training. Then the weights of all of the neurons in the neighborhood of the winning neuron are adjusted by an amount inversely proportional to the distance. It clusters and classifies the data set based on the set of attributes used. The algorithm is summarized as follows [14]: Step 1 Initialization: Choose random values for the initial weight vectors w
j
(0), the weight vectors being different for j =1,2,...l where l is the total number of neurons Step 2 Sampling: Draw a sample x from the input space with a certain probability. Step 3 Similarity Matching: Find the best matching (winning) neuron i (x) at time steps n by using the minimum distance Euclidean criterion Step 4 Updating: Adjust the synaptic weight vector of all neurons by
using the update formula where, η(n) is the learning
rate parameter, and h
j
, i(x)(n) is the neighborhood function centered around the winning neuron i(x).
Both η(n) and h
j
, i(x) (n) are varied dynamically during learning for best results. Step 5 Continue with Step2 until no noticeable changes in the feature map are observed.
V.
E
XPERIMENTS
A.
D
ATASETS
Six used microarray gene expression data sets are chosen for our experiments: Colon tumor, ALLAML Leukemia, Lung cancer, Prostate_Tumor_GEMS, Brain_Tumor_GEMS and Leukemia_GEMS. The first three data is taken from http://sdmc.lit.org.sg/GEDatasets/Datasets.html and other data is taken from http://www.gemssystem.org. Table I summarize these datasets. We conducted the experiments on these six data sets by applying Rough Set Attribute Reduction (RSAR) Subset Evaluation
method for feature reduction and Learning Vector Quantisation (LVQ)  LVQ1, LVQ3, optimizedlearningrate LVQ1 (OLVQ1), and The SelfOrganizing Map (SOM) for
neural classification
of the reduced datasets. We used Weka, a wellknown comprehensive toolset for machine learning and data mining [15] as our main experimental platform. We evaluated the performance of feature reduction in Weka environment with four classifiers, using 10fold Cross Validation.
T
ABLE
I D
ATASET
I
NFORMATION
Dataset # classes # instances # attributes
Colon tumor (Train/ Test) 2 40 / 22 2000 Leukemia (Train/ Test) 2 47 / 25 7129 Lung cancer (Train/ Test) 2 109 / 72 12533 Prostate Tumor (Train/ Test) 2 61 / 41 10509 Brain Tumor (Train/ Test) 5 54 / 36 5920 Leukemia_GEMS (Train/ Test) 3 43 / 29 11225
First, a simple method introduced in [16] is used to discretize the domain of each attribute because rough sets
methods require discretization input. Any data larger than μ +
σ
/2
were transformed to state 1; any data
between μ +
σ
/2
and μ −
σ
/2
were transformed to state 0; any data smaller than
μ −
σ
/2
were transformed to state 
1. where σ is standard deviation, μ is mean of a gene. These three states correspond
to the overexpression, baseline, and underexpression. Then our method is employed to searching for informative genes for classification.
B.
Parameter Settings for RSAR Feature Selection
We used Weka, a wellknown comprehensive toolset for machine learning and data mining [17], as our main experimental platform. We evaluated the performance of feature selection methods in Weka environment with an implementation of the Quick Reduct algorithm of rough set attribute reduction (RSAR) Applicable for use on large gene expression (GE) datasets with numeric continuous data. Evaluate subset using rough set dependency, to return a feature subset giving only the rough set positive region. Feature subset evaluation merit value between 0.0 and 1.0. Not all datasets will reach maximum dependency of 1.0. And Best first may start with the empty set of attributes and search forward with the different parameter as direction : Set the direction of the search as forward, lookup Cache Size : Set the maximum size of the lookup cache of evaluated subsets, at here the default is 1. After using Rough Set Attribute Reduction (RSAR) subset evaluation and best first search in training dataset, 9 genes are left in colon data set, 3 genes are left in leukemia data set, 13 genes are left in lung cancer data set, 14 genes are left in prostate data set, 36 genes are left in brain tumor data set and 14 genes are left in leukemia_GEMS cancer data set.
C.
Comparison of ANN Algorithms with respect to the Parametric Changes Learning Vector Quantisation (LVQ)  LVQ1
. A single BMU (best matching unit) is selected and moved closer or further away from each data vector, per iteration.
Learning Vector Quantisation (LVQ)  LVQ3
.The same as LVQ1, except only if the classes of the BMU
’
s match, otherwise, the learning rate modified by the epsilon is used on BMU's.
Learning Vector Quantisation (LVQ)  OLVQ1
.The same as LVQ1, except each codebook vector has its own learning rate. If the BMU has the same class, the individual learning rate is increased, otherwise it is decreased.
Self Organising Map (SOM
),
Aka Kohonen Feature Map[18]. The SOM algorithm is not intended to be used for classification, this is a version of the SOM that supports supervised learning, as well as unsupervised learning. Class labels can be assigned either dynamically via voting during training, or codebook vector labeling using after training. To evaluate the performance of the proposed method, the selected feature subsets were evaluated by 10fold cross validation for
the above parameters were changed and correspondingly the correctly classified instances were computed. The intention of these experiments was to bring out the accuracy or overall behavior of the four ANN algorithms, which were shown in Table II without feature selection and with RSAR feature selection shown in Table III.
VI.
R
ESULTS AND
C
OMPARISON
We started experiment by evaluating performance accuracies of different classifiers, LVQ1, LVQ3, OLVQ1 and SOM on four binary and two multi class datasets using 10fold Cross Validation (CV) without using feature selection algorithms. The results of the 10fold CV accuracy for the classifiers are shown in table II. After feature selection, the selected feature subsets were evaluated using the above classification algorithms using 10fold CV method are shown in Table III. The experimental results show that the accuracy of microarray data which had feature selection implemented was
better than without feature selection datasets. Comparing binary class and multi class datasets, the accuracy of the binary datasets was better than the multi class datasets. Again, we can observe from Table II and Table III and Graph1 and Graph2 that the proposed method effectively increases classification accuracy and selects a smaller number of feature subsets. During the RSAR subset evaluation of the proposed method returns very small sets of genes compared to alternative variable selection methods, while retaining predictive performance. Our method of gene selection will not return sets of genes that are highly correlated, because they are redundant. This method will be useful when considering the design of diagnostic tools, where having a small set of probes is often desirable and to help understand the results from other gene selection approaches that return many genes, so as to understand which ones of those genes have the largest signal to noise ratio and could be used as surrogates for complex processes involving many correlated genes.
T
ABLE
II C
LASSIFICATION ACCURACY BY USING FOUR
ANN
CLASSIFIER WITH DIFFERENT PARAMETER SETTINGS BEFORE SUBSET EVALUATION
Datasets Lvq1 Lvq3 Olvq1 SOM
Colon 85 80 90 47.5 Leukemia 95.7447 95.7447 87.234 57.4468 Lung 98.1651 96.3303 95.4128 75.2294 Prostate Tumor 75.4098 81.9672 80.3279 68.8525 Brain Tumor 79.6296 81.4815 79.6296 61.111 Leukemia2_GEMS 90.6977 93.0233 93.0233 62.7907
T
ABLE
III C
LASSIFICATION ACCURACY BY USING FOUR
ANN
CLASSIFIER WITH DIFFERENT PARAMETER SETTINGS AFTER
RSAR
SUBSET EVALUATION OF TRAINING DATASETS
.
Datasets Lvq1 Lvq3 Olvq1 SOM
Colon 87.5 87.5 87.5 67.5 Leukemia 95.7447 95.7447 95.7447 93.617 Lung 97.2477 97.2477 97.2477 95.4128 Prostate Tumor 93.4426 91.8033 91.8033 88.5246 Brain Tumor 79.6296 81.4815 77.7778 66.6667 Leukemia2_GEMS 90.6977 90.6977 90.6977 81.3953
Graph1:
Classification accuracy by using four ANN classifier with different parameter settings before subset evaluation
Graph2:
Classification accuracy by using four ANN classifier with different parameter settings after RSAR subset evaluation of training datasets.
Based on the comparison of results produced by SOM and LVQ algorithms on the microarray gene expression datasets, LVQ produced better results than SOM, and out of the three LVQ algorithms LVQ3 was the best. The classification accuracy of LVQ with RSAR subset evaluation was found to be in the range of 87.5 to 97.2477 and without subset evaluation was found to be in the range of 75.4098 to 98.
When comparing binary class and multi class datasets, the accuracy of the binary datasets was better than the multi class datasets. For testing data sets of binary class data sets were evaluated using the above classification algorithms using 10fold CV method are shown in Table IV.
0102030405060708090100110
C o l o n L e u k e m i a L u n g P r o s t a t e T u m o r B r a i n T u m o r L e u k e m i a 2_ G E M S
Lvq1Lvq3Olvq1SOM0102030405060708090100110
C o l o n L e u k e m i a L u n g P r o s t a t e T u m o r B r a i n T u m o r L e u k e m i a 2_ G E M S
Lvq1Lvq3Olvq1SOM