Loughborough University

Loughborough University Institutional Repository

Please use this identifier to cite or link to this item: https://dspace.lboro.ac.uk/2134/14687

Title: A teachable semi-automatic web information extraction system based on evolved regular expression patterns
Authors: Siau, Nor Zainah
Keywords: TS-WIE
Dynamic grammar definition
Genetic programming
Regular expression patterns and structural patterns (DOM)
Issue Date: 2014
Publisher: © Nor Zainah Siau
Abstract: This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP), with an extensible model to enhance the automatic extractor. The system uses a human teacher to identify and extract relevant information from semi-structured HTML web pages. Regular expressions, chosen as the pattern-matching tool, are generated automatically from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon when new tokens appear in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements.
Description: A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.
URI: https://dspace.lboro.ac.uk/2134/14687
Appears in Collections: PhD Theses (Computer Science)

Files associated with this item:

File                    Size       Format
Thesis-2014-Siau.pdf    4.85 MB    Adobe PDF
Form-2014-Siau.pdf      11.09 MB   Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.