Holistic approach for efficient extraction of web data

Didas, Malekia

Date

2011

Author

Didas, Malekia

Type

Thesis

Abstract

The volume of information available on the internet, in digital libraries, news sources, and company databases or intranets is growing tremendously, and much of it is valuable. Information from the World Wide Web serves different sectors, spanning the social, political and economic spheres, as an input to decision making. Such information would be more valuable if it were available to end users and to other application systems in the formats they require. This has created a need for tools that help users extract relevant information quickly and effectively. We explore an efficient mechanism for extracting web data through analysis of HTML tags and patterns. HTML constitutes a large percentage of web content. However, much of this content lacks strict structure and a proper schema. Additionally, web content is updated frequently and is semantically heterogeneous compared with other formats, such as XML, that are more rigid in structure. We have produced a customised generic model that can be used to extract unstructured data from the web and load it into a database. The main contribution is an automated process for locating, extracting and storing data from HTML web sources. Such data is then available to other application software for analysis and further processing.
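
The locate, extract and store steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's model: it assumes the target data sits in an HTML table, uses only Python's standard `html.parser` and `sqlite3` modules, and the names `TableRowParser`, `extract_rows`, `store_rows`, the `items` table and the sample page are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Locate <tr> rows and collect the text of their <td>/<th> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return every table row in the page as a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store_rows(conn, rows):
    """Store two-column rows in a hypothetical 'items' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT)")
    conn.executemany(
        "INSERT INTO items (name, price) VALUES (?, ?)",
        [row for row in rows if len(row) == 2],
    )
    conn.commit()


# Hypothetical sample page; a real source would be fetched over HTTP.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Maize</td><td>30</td></tr>
  <tr><td>Beans</td><td>55</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
store_rows(conn, rows[1:])  # skip the header row
stored = conn.execute("SELECT name, price FROM items ORDER BY rowid").fetchall()
print(stored)
```

Once the rows are in the database, other application software can query them with ordinary SQL, which is the kind of downstream use the abstract refers to. Real pages are often less regular than this sample, so a practical extractor would need more tolerant pattern matching.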

URI

http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/13136

Citation

Master of Science in Computer Science

Sponsorship

University of Nairobi

Publisher

University of Nairobi

School of Computing and Informatics

Subject

Web data extraction
Structured data
Semi-structured and unstructured data

Collections

Faculty of Science & Technology (FST) [4213]