Holistic approach for efficient extraction of web data
Abstract
The volume of information available on the internet, in digital libraries, in news sources
and in company databases or intranets has grown tremendously, and much of it is valuable.
Information from the World Wide Web supports decision making in sectors ranging across
the social, political and economic spheres. Such information would be more valuable if it
were available to end users and other application systems in the required formats. This
has created a need for tools that help users extract relevant information quickly and
effectively. We
explore an efficient mechanism for extracting web data through analysis of HTML tags
and patterns. HTML constitutes a large percentage of web content. However, much of
this content lacks a strict structure and a proper schema. Additionally, web content has a
higher update frequency and greater semantic heterogeneity than other formats, such as
XML, that are more rigid in structure. We have produced a
customised generic model that can be used to extract unstructured data from the web and
load it into a database. The main contribution is an automated process for locating,
extracting and storing data from HTML web sources. Such data is then available to other
application software for analysis and other processing.
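
As an illustration only, the following minimal sketch shows one way a tag-based
locate, extract and store pipeline could look. It is not the model developed in the
thesis. It assumes the data of interest sits in table cells (`<td>` inside `<tr>`), and the
sample markup, the `items` table and its `name` and `price` columns are hypothetical
names chosen for the example. It uses only the Python standard library.

```python
import sqlite3
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by enclosing <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested in inline tags (e.g. <b>) still lands in the open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Locate and extract data rows from an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store_rows(conn, rows):
    """Store two-column rows in a hypothetical 'items' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT)")
    conn.executemany(
        "INSERT INTO items (name, price) VALUES (?, ?)",
        [tuple(r) for r in rows if len(r) == 2],  # skip rows of another shape
    )
    conn.commit()


# Hypothetical sample page fragment, used in place of a fetched web page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Maize</td><td>30</td></tr>
  <tr><td>Beans <b>(dry)</b></td><td>85</td></tr>
</table>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_rows(conn, extract_rows(SAMPLE_HTML))
    for row in conn.execute("SELECT name, price FROM items"):
        print(row)
```

Once the rows are in the database, other application software can query them with
ordinary SQL. Real pages are rarely this regular, so a practical extractor would need
pattern rules for each source layout and tolerance for unclosed or malformed tags.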
Citation
Master of Science in Computer Science
Sponsorship
University of Nairobi
Publisher
University of Nairobi School of Computing and Informatics