An Excited Cuckoo Search-Grey Wolf Adaptive Kernel SVM for Effective Pattern Recognition in DNA Microarray Cancer Chips
Abstract
The scarcity of patient samples, the curse of dimensionality and the class imbalance of the available
DNA microarray chips remain major obstacles to classifying cancerous tissues accurately and reliably
without overfitting. These challenges are magnified when resource-constrained devices (limited
computational power and memory) such as smartphones, tablets and personal digital assistants are
used to mine these datasets, which makes effective portable microarray data mining very difficult
to achieve. Consequently, gene selection and classification have become the most researched topics in
DNA microarray-based cancer diagnosis. An effective gene selection phase derives an informative gene
subset from an otherwise high-dimensional dataset to reduce noise, computational overhead and model
overfitting, whereas an enhanced learning and classification phase builds a model that
accurately and reliably classifies a given patient DNA sample. This research has formulated a novel
memetic approach for optimal gene selection, EACSIDGWO, which combines an Excited (E) Adaptive Cuckoo
Search (ACS) with an Intensification Dedicated Grey Wolf Optimizer (IDGWO). In EACSIDGWO, the step
size of the ACS and the nonlinear control strategy of parameter a of the IDGWO are made
adaptive via the complete voltage and current responses of a direct current (DC) excited
resistor-capacitor (RC) circuit. Since the population has higher diversity in the early stages of the
proposed algorithm, both the ACS and the IDGWO are jointly involved in local exploitation.
In the later stages, to promote mature convergence, the role of the ACS
is switched to global exploration while the IDGWO continues to conduct local exploitation. The
performance of EACSIDGWO as a gene selector is evaluated on six standard DNA microarray chips
derived from the University of California, Irvine (UCI) repository, namely Ovarian Cancer (4000 genes),
Central Nervous System Cancer (7129 genes), Colon Cancer (2000 genes), Breast Cancer Wisconsin
(prognosis) (33 genes), Breast Cancer Wisconsin (diagnostic) (30 genes) and SPECTF Heart Cancer (44 genes). The
EACSIDGWO achieved the most compact informative gene subsets along with the highest
classification accuracies as follows: Ovarian Cancer (274 genes, 100%), Central Nervous System
Cancer (1208 genes, 72%), Colon Cancer (538 genes, 91%), Breast Cancer Wisconsin (prognosis) (5
genes, 87%), Breast Cancer Wisconsin (diagnostic) (3 genes, 98%) and SPECTF Heart Cancer (4 genes,
88%). Extended Binary Cuckoo Search (EBCS), the second-best published state-of-the-art algorithm,
attained the following: Ovarian Cancer (1811 genes, 99%), Central Nervous System Cancer (3446
genes, 67%), Colon Cancer (988 genes, 89%), Breast Cancer Wisconsin (prognosis) (6 genes, 86%),
Breast Cancer Wisconsin (diagnostic) (3 genes, 97%) and SPECTF Heart Cancer (6 genes, 86%). The
results indicate that the proposed technique is superior both in reducing the size of the
informative gene subsets and in locating the most significant gene subsets. To improve the
performance of the classification phase (the last stage of DNA microarray-based cancer analysis),
another novel hybrid model is proposed. This model, PSO-PCA-LGP-MCSVM, is based on particle swarm
optimization (PSO), principal component analysis (PCA) and a multiclass support vector machine
(MCSVM). The MCSVM adopts a novel hybrid Linear-Gaussian-Polynomial (LGP) kernel
formulated in this research. The hybrid LGP kernel combines the advantages of three
standard kernels (Linear, Gaussian and Polynomial): a Gaussian kernel
embedding a Polynomial kernel is linearly combined with a Linear kernel. The global
gene extraction, prediction and learning ability of this model was compared against three
single-kernel models, PSO-PCA-L-MCSVM (using a single Linear kernel), PSO-G-MCSVM (using a single
Gaussian kernel) and PSO-P-MCSVM (using a single Polynomial kernel), on four datasets: Colon Cancer,
Acute Lymphoblastic Leukemia-Acute Myeloid Leukemia (ALL-AML), St. Jude Leukemia and Lung
Cancer. Using three extended evaluation metrics (G-mean, Accuracy (Acc) and F-score),
the proposed model achieved the following: Colon Cancer (G-mean: 0.88, Acc: 0.88, F-score: 0.87),
ALL-AML (G-mean: 0.94, Acc: 0.94, F-score: 0.94), Lung Cancer (G-mean: 0.99, Acc: 0.97, F-score:
0.96) and St. Jude Leukemia dataset (G-mean: 0.97, Acc: 0.96, F-score: 0.90). The PSO-G-MCSVM,
the second-best published model, attained the following: Colon Cancer (G-mean: 0.82, Acc: 0.82, F-score:
0.82), ALL-AML (G-mean: 0.94, Acc: 0.94, F-score: 0.94), Lung Cancer (G-mean: 0.98, Acc:
0.96, F-score: 0.93) and St. Jude Leukemia dataset (G-mean: 0.97, Acc: 0.95, F-score: 0.85).
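
The three metrics reported above can be computed from a confusion matrix. The sketch below is illustrative only and is not taken from the thesis: it assumes that G-mean is the geometric mean of the per-class recalls and that the F-score is macro-averaged over classes, since the abstract does not state which averaging was used. The function names `confusion_matrix` and `evaluate` are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def evaluate(y_true, y_pred, n_classes):
    """Return G-mean, accuracy and macro F-score (assumed definitions)."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    tp = np.diag(cm).astype(float)   # correctly classified samples per class
    support = cm.sum(axis=1)         # true samples per class
    predicted = cm.sum(axis=0)       # predicted samples per class

    # Per-class recall and precision; classes with no samples get 0.
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)

    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)

    return {
        # Geometric mean of recalls: drops to 0 if any class is never detected,
        # which is why it is informative on imbalanced data.
        "g_mean": float(np.prod(recall) ** (1.0 / n_classes)),
        "accuracy": float(tp.sum() / cm.sum()),
        "f_score": float(f1.mean()),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the thesis.
    print(evaluate([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], n_classes=2))
```

Unlike accuracy, the G-mean penalizes a classifier that ignores a minority class, which matters for the class imbalance discussed at the start of the abstract.
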
Considering the compact informative gene subsets selected and the very high classification
accuracy reported, it is evident that the proposed models are promising DNA microarray data
mining tools for cost-effective computers and online servers, as well as resource-constrained mobile
devices.
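
The abstract describes the LGP kernel only in words: a Gaussian kernel embedding a Polynomial kernel, linearly combined with a Linear kernel. One possible reading, offered here as an assumption rather than as the thesis's formulation, is to evaluate the Gaussian kernel on distances measured in the polynomial feature space and to mix the result with a Linear kernel through a weight in [0, 1]. The function names and the parameters `weight`, `gamma`, `degree` and `coef0` are hypothetical.

```python
import numpy as np


def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    """Standard polynomial kernel (x . y + coef0) ** degree."""
    return (X @ Y.T + coef0) ** degree


def lgp_kernel(X, Y, weight=0.5, gamma=0.1, degree=2, coef0=1.0):
    """Assumed LGP form: weight * Linear + (1 - weight) * Gaussian-of-Polynomial."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Squared distance in the polynomial feature space:
    # ||phi(x) - phi(y)||^2 = Kp(x, x) - 2 Kp(x, y) + Kp(y, y)
    k_xy = polynomial_kernel(X, Y, degree, coef0)
    k_xx = (np.einsum("ij,ij->i", X, X) + coef0) ** degree
    k_yy = (np.einsum("ij,ij->i", Y, Y) + coef0) ** degree
    sq_dist = k_xx[:, None] - 2.0 * k_xy + k_yy[None, :]
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero

    gaussian_of_poly = np.exp(-gamma * sq_dist)
    linear = X @ Y.T

    # A convex combination of two valid kernels is itself a valid kernel.
    return weight * linear + (1.0 - weight) * gaussian_of_poly


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))  # toy data, not a microarray dataset
    K = lgp_kernel(X, X, weight=0.3)
    print(K.shape, np.linalg.eigvalsh(K).min())
```

A Gram function of this shape can be passed as a callable `kernel` to scikit-learn's `SVC`, with the hyperparameters fixed beforehand (for example via `functools.partial`). In the thesis these hyperparameters are presumably tuned by PSO; that step is not sketched here.
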
Publisher
University of Nairobi
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
Usage Rights
http://creativecommons.org/licenses/by-nc-nd/3.0/us/