A Comparative Analysis of Unsupervised Outlier Detection Methods for Data Quality Assurance
Abstract
Data quality assurance is a key component of research. Routinely checking for errors in
large datasets is almost impossible unless automated, intelligent mechanisms are put in place.
The quality of results from data analysis depends heavily on the underlying state of the data:
quality data leads to effective and unbiased reporting. Errors introduced into the data are
inevitable, hence the need for error-checking mechanisms.
Error-checking mechanisms such as range checks, quantile ranges, and z-scores are limited
to continuous data types and are effective only for data with a small feature space. Errors in
dichotomous and character data types are easily missed, hence the need for methods that
scan for anomalies across all data types and scale to extremely large datasets. Two-pass
verification, on the other hand, is a gold-standard method for checking the quality state of data.
It involves re-entering a random sample of observations from the same source documents
to measure the accuracy and consistency of the data. It is an accurate process;
however, it is tedious and manual, and for larger datasets it relies on random
sampling.
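As a hedged illustration of the conventional checks mentioned above (range checks, quantile ranges, and z-scores), the sketch below flags suspect values in a single continuous column. The column, bounds, and thresholds are hypothetical; the z-score cutoff of 3 and IQR multiplier of 1.5 are conventional defaults, not values taken from this study.

```python
from statistics import mean, stdev, quantiles

def flag_errors(values, lo=None, hi=None, z_cut=3.0, iqr_k=1.5):
    """Return indices of suspect values using range, z-score, and IQR checks.

    Thresholds (z_cut=3, iqr_k=1.5) are conventional defaults, not prescriptive.
    """
    m, s = mean(values), stdev(values)
    q1, _, q3 = quantiles(values, n=4)            # quartiles (default exclusive method)
    iqr = q3 - q1
    iqr_lo, iqr_hi = q1 - iqr_k * iqr, q3 + iqr_k * iqr
    flagged = set()
    for i, v in enumerate(values):
        if lo is not None and v < lo:             # range check, lower bound
            flagged.add(i)
        if hi is not None and v > hi:             # range check, upper bound
            flagged.add(i)
        if s > 0 and abs(v - m) / s > z_cut:      # z-score check
            flagged.add(i)
        if v < iqr_lo or v > iqr_hi:              # quantile (IQR) check
            flagged.add(i)
    return sorted(flagged)

# Hypothetical example: ages with one impossible entry
ages = [34, 29, 41, 38, 52, 47, 33, 45, 39, 240]
print(flag_errors(ages, lo=0, hi=120))
```

Note how the limitation stated above shows up directly: the checks apply per numeric column, so errors in dichotomous or character fields pass through untouched.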
We propose alternative error-checking methods based on machine learning outlier detection
algorithms. Instead of randomly selecting a set of observations, the outlying observations
are cross-referenced against source documents for possible errors.
We evaluated two unsupervised machine learning algorithms, k-means clustering and
isolation forest, for outlier detection. The detected outliers form the sample of observations
to be validated and verified. We then compared the anomaly scores from two-pass
verification, k-means, and isolation forest, using the normalized mutual information score
and the coefficient of determination to measure the strength of the correlation.
The results indicate that unsupervised machine learning methods are viable
alternatives for data quality assurance, with flexibility for future considerations and
improvements. Isolation forest performed better than k-means clustering.
Publisher
University of Nairobi
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
http://creativecommons.org/licenses/by-nc-nd/3.0/us/