Natural Language Processing for Mapping Free-text Medical Diagnoses to ICD-11 Codes
Abstract
The standardization of medical coding practices through the International Classification of Diseases (ICD), from ICD-10 to the current ICD-11, underscores the global importance of accurately classifying diseases, symptoms, and medical procedures. As the burden of chronic disease intensifies worldwide, particularly in developing countries such as Kenya, there is a need for innovative methods of healthcare delivery and management. This project addresses this challenge by applying Natural Language Processing (NLP) techniques to categorize free-text medical diagnoses, using data from the KEMRI-Wellcome Trust. Research Electronic Data Capture (REDCap), which functions similarly to Electronic Health Records (EHRs), presents a valuable resource for clinical and translational research, with the potential to enhance patient outcomes and inform healthcare decision-making. However, the unstructured nature of clinical text poses a significant obstacle, limiting access to critical information about patient diagnoses, treatments, and outcomes. To overcome this barrier, the study aimed to develop NLP algorithms capable of preprocessing and extracting relevant text features from free-text medical diagnosis data drawn from EHRs; to optimize Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), BERT (Bidirectional Encoder Representations from Transformers), and hybrid BERT-LSTM models for categorizing free-text medical diagnoses; and to evaluate their performance using appropriate metrics. Data preprocessing, exploratory data analysis, and model development were carried out in the Anaconda distribution and implemented in Jupyter Notebook and the Spyder Integrated Development Environment (IDE), version 5.4.3.
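The hybrid architecture described above can be sketched as a contextual encoder feeding an LSTM classification head. This is a minimal, illustrative PyTorch sketch only: a small `nn.TransformerEncoder` stands in for a pretrained BERT so the example stays self-contained, and the vocabulary size, dimensions, and class count are placeholder assumptions, not values from the study.

```python
import torch
import torch.nn as nn

class HybridEncoderLSTM(nn.Module):
    """Sketch of a hybrid contextual-encoder + BiLSTM classifier.

    A small nn.TransformerEncoder stands in for BERT here; in the study,
    a pretrained BERT would supply the contextual token representations
    that feed the LSTM.
    """
    def __init__(self, vocab_size=1000, d_model=64, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(d_model, 64, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))     # contextual features
        _, (h, _) = self.lstm(x)                    # sequential modelling
        pooled = torch.cat([h[-2], h[-1]], dim=-1)  # final fwd + bwd states
        return self.classifier(pooled)              # logits over ICD-11 codes

model = HybridEncoderLSTM()
logits = model(torch.randint(0, 1000, (2, 16)))  # batch of 2 token sequences
print(logits.shape)  # one logit vector per sequence: (2, num_classes)
```

In practice the classifier head would map to the full set of ICD-11 codes present in the training data rather than the 10 placeholder classes used here.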
The results indicated that the hybrid BERT-LSTM model performed best, with the highest accuracy (83%), precision (84%), recall (83%), and F1-score (82%). This was because the model combines the strengths of BERT's contextual understanding with the LSTM's sequential processing capabilities. The study recommended enhanced preprocessing techniques and regular model updates, including continued fine-tuning, as well as exploring further avenues for improving hybrid models. Suggested directions for further research include implementing the hybrid and other metric-learning models for NLP techniques in real-time, real-world settings.
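The evaluation metrics reported above can be computed with scikit-learn. This is an illustrative sketch with hypothetical ICD-11 labels and predictions, not the study's data; weighted averaging is assumed here because ICD code distributions are typically imbalanced.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold ICD-11 codes and model predictions (illustrative only).
y_true = ["1A00", "CA40", "5A11", "CA40", "1A00", "5A11"]
y_pred = ["1A00", "CA40", "CA40", "CA40", "1A00", "5A11"]

accuracy = accuracy_score(y_true, y_pred)
# Weighted averaging weights each class's score by its support,
# which accounts for class imbalance across ICD codes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```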
Publisher
University of Nairobi
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
http://creativecommons.org/licenses/by-nc-nd/3.0/us/