    Principal component analysis and linear discriminant analysis in gene expression data

    Date
    2013-11
    Author
    Kagereki, Edwin M
    Type
    Thesis
    Language
    en

    Abstract
    Datasets from microarray experiments enable the measurement of the gene expression profiles of cells. Statistical models may be used to classify samples into physiological categories based on the gene expression profile. However, gene classification as a domain of research is not straightforward because of some inherent properties of the data, mainly multidimensionality and noise. The thesis studied three aspects of gene expression analysis: dimension reduction, classification of the expression profiles, and the variability of gene expression data due to covariates such as age and gender. The dataset used in the thesis is the GEO dataset GSE34105. Principal Component Analysis and Eigen-R2 methods were applied to dissect the overall variation. Subsequently, a linear discriminant classifier was built, and the effect of the number of principal components retained on the accuracy of the classifier was assessed using leave-one-out cross-validation. All data analysis was done in R 3.0.1 and R 2.6.2 with the relevant packages.
    The first three components accounted for a cumulative 33.34% of the total variance (23.26%, 6.02% and 4.06% respectively). In our study, age explained 0.8% of the variance, disease condition 26.5% and gender only 1.59%. The accuracy of the linear discriminant classifier was highly dependent on the number of principal components retained: the error rate increased systematically from 6% to 33% as the number of retained components increased from 3 to 70.
    The fact that the first few principal components explained a large proportion of the variance suggests that only a few genes accounted for a significant amount of the variance. This aligns with the knowledge that only a small number of genes carry relevant attributes and that gene expression data contain noise, which can be described as technical and biological distortions of the data. In conclusion, a proper understanding of the variability of gene expression data is key to drawing sound biological conclusions. Appreciating the variability contributed by other biological factors is important in study design.
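The workflow the abstract describes (PCA for dimension reduction, followed by a linear discriminant classifier scored by leave-one-out cross-validation over different numbers of retained components) can be sketched roughly as below. This is an illustrative Python/NumPy sketch, not the thesis code, which was written in R. The data are synthetic rather than GSE34105, the function names (`pca_fit`, `lda_fit`, `lda_predict`, `loocv_error`) are hypothetical, and refitting PCA inside each fold is an assumption about how the validation might be organised.

```python
import numpy as np


def pca_fit(X, k):
    """Return the column means and the top-k principal axes of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def lda_fit(Z, y):
    """Fit a linear discriminant with a pooled within-class covariance."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(y) - len(classes))
    cov += 1e-6 * np.eye(Z.shape[1])  # small ridge for stability (assumption)
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, np.linalg.inv(cov), log_priors


def lda_predict(model, z):
    """Assign z to the class with the highest linear discriminant score."""
    classes, means, precision, log_priors = model
    quad = np.einsum("ij,jk,ik->i", means, precision, means)
    scores = means @ precision @ z - 0.5 * quad + log_priors
    return classes[np.argmax(scores)]


def loocv_error(X, y, k):
    """Leave-one-out error rate of LDA on the first k principal components."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        train = np.arange(n) != i
        # PCA is refit on the training samples only, so the held-out
        # sample never influences the projection (assumption).
        mean, axes = pca_fit(X[train], k)
        model = lda_fit((X[train] - mean) @ axes.T, y[train])
        pred = lda_predict(model, (X[i] - mean) @ axes.T)
        mistakes += int(pred != y[i])
    return mistakes / n


# Synthetic stand-in for an expression matrix: 40 samples x 200 "genes",
# two classes that differ only in the first 20 genes (made-up values).
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :20] += 2.0

errors_by_k = {k: loocv_error(X, y, k) for k in (3, 10, 30)}
for k, err in errors_by_k.items():
    print(f"{k:2d} components: LOOCV error rate = {err:.3f}")
```

Comparing the printed error rates across values of `k` mirrors the comparison reported in the abstract, although the numbers from synthetic data say nothing about the thesis results themselves.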
    URI
    http://erepository.uonbi.ac.ke:8080/xmlui/handle/11295/60021
    Citation
    A Thesis Submitted In Partial Fulfillment For The Degree Of Masters Of Science In Medical Statistics, 2013
    Publisher
    University of Nairobi
     
    School of Medicine
     
    Collections
    • Faculty of Health Sciences (FHS) [4486]

    Copyright © 2022 
    University of Nairobi Library
    Contact Us | Send Feedback

     

     
