A Comparative Analysis of Unsupervised Outlier Detection Methods for Data Quality Assurance
Abstract
Data quality assurance is a key component of research. Routinely checking for errors in
large datasets is almost impossible unless automated, intelligent mechanisms are put in place.
The quality of results from data analysis depends heavily on the underlying state of the data:
quality data leads to effective and unbiased reporting. Errors introduced into the data are
inevitable, hence the need for error-checking mechanisms.
Error-checking mechanisms such as range checks, quantile ranges, and z-scores are limited
to continuous data types and are effective only for data with a small feature space. Errors in
dichotomous and character data types are easily missed, hence the need for methods that
scan for anomalies across all data types and scale to extremely large datasets. Two-pass
verification, on the other hand, is a gold-standard method for checking the quality state of data.
It involves re-entering a random sample of observations from the same source documents
to measure the accuracy and consistency of the data. It is an accurate process;
however, it is tedious and manual, and for larger datasets it relies on random
sampling.
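As a hedged illustration of the conventional checks mentioned above (range checks, quantile ranges, and z-scores), the sketch below flags suspect values in a single continuous column. The column, bounds, and thresholds are hypothetical; the z-score cutoff of 3 and IQR multiplier of 1.5 are conventional defaults, not values taken from this study.

```python
from statistics import mean, stdev, quantiles

def flag_errors(values, lo=None, hi=None, z_cut=3.0, iqr_k=1.5):
    """Return indices of suspect values using range, z-score, and IQR checks.

    Thresholds (z_cut=3, iqr_k=1.5) are conventional defaults, not prescriptive.
    """
    m, s = mean(values), stdev(values)
    q1, _, q3 = quantiles(values, n=4)            # quartiles (default exclusive method)
    iqr = q3 - q1
    iqr_lo, iqr_hi = q1 - iqr_k * iqr, q3 + iqr_k * iqr
    flagged = set()
    for i, v in enumerate(values):
        if lo is not None and v < lo:             # range check, lower bound
            flagged.add(i)
        if hi is not None and v > hi:             # range check, upper bound
            flagged.add(i)
        if s > 0 and abs(v - m) / s > z_cut:      # z-score check
            flagged.add(i)
        if v < iqr_lo or v > iqr_hi:              # quantile (IQR) check
            flagged.add(i)
    return sorted(flagged)

# Hypothetical example: ages with one impossible entry
ages = [34, 29, 41, 38, 52, 47, 33, 45, 39, 240]
print(flag_errors(ages, lo=0, hi=120))
```

Note how the limitation stated above shows up directly: the checks apply per numeric column, so errors in dichotomous or character fields pass through untouched.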
We propose alternative error-checking methods based on machine learning outlier detection
algorithms. Instead of randomly selecting a set of observations, the outlying observations
are cross-referenced against source documents for possible errors.
We evaluated two unsupervised machine learning algorithms, k-means clustering and
isolation forest, for outlier detection. The detected outliers form the sample of observations
to be validated and verified. We then compared the anomaly scores from two-pass
verification, k-means, and isolation forest, using the normalized mutual information score
and the coefficient of determination to measure the strength of the correlation.
The results indicate that unsupervised machine learning methods are viable
alternatives for data quality assurance, with flexibility for future considerations and
improvements. Isolation forest performed better than k-means clustering.
Publisher
University of Nairobi
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
http://creativecommons.org/licenses/by-nc-nd/3.0/us/