Loughborough University
Leicestershire, UK
LE11 3TU
+44 (0)1509 263171

Loughborough University Institutional Repository

Please use this identifier to cite or link to this item: https://dspace.lboro.ac.uk/2134/13685

Title: A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts
Authors: Wu, Zimin
Issue Date: 1992
Publisher: © Zimin Wu
Abstract: Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases, to represent the original text. In current English text retrieval systems, this process of content representation is accomplished by extracting words using spaces and punctuation marks as word delimiters. The same technique cannot easily be applied to Chinese texts, which contain no obvious word boundaries: they appear as a linear sequence of non-spaced or equally spaced ideographic characters, and the number of characters in a word varies. The solution to the problem lies in morphological and syntactic analyses of Chinese morphemes, words and phrases. The idea is inspired by experiments on English computational morphology and its application to English text retrieval, mainly automatic compound and phrase indexing. These areas are particularly germane to Chinese because typographically there are no morph and phrase boundaries in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases exceeding two Chinese characters can be characterised by a grammar that describes the concatenation behaviour of morphological and syntactic categories. This is examined using the following three procedures: (1) text segmentation: texts are divided into one- and two-character segments by searching a dictionary containing over 17000 morphemes and words, which are tagged with morphological and syntactic categories; (2) category disambiguation: for the resulting morphemes and words tagged with more than one category, the correct one is selected based on context; (3) parsing: the segments are analysed using the grammar, which combines them into compound and complex words and phrases for indexing and retrieval. The utilities employed in the experiment include CCDOS, an extended version of MS-DOS providing a Chinese I/O system, Chinese Wordstar for text input and Chinese dBASE III for dictionary construction. Source code is written in Turbo BASIC, including its database toolbox. Thirty texts are drawn randomly from newspapers to form the sample for the experiment. The results prove that the partial syntactic analysis-based approach can extract keywords with a good degree of accuracy.
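The three procedures named in the abstract (segmentation, category disambiguation, parsing) can be illustrated with a minimal sketch. This is not the thesis code, which was written in Turbo BASIC. The four-entry dictionary, the category labels (N, V, SUFFIX, PART), the two grammar rules, the preference for two-character matches and the single context rule are all assumptions invented for illustration.

```python
# Toy dictionary: entries are one or two characters long and carry one or
# more categories. Entries and category names are hypothetical.
DICTIONARY = {
    "图书": {"N"},
    "馆": {"SUFFIX"},
    "管理": {"N", "V"},
    "的": {"PART"},
}

# Hypothetical grammar: pairs of adjacent categories that may be
# concatenated, mapped to the category of the combined unit.
GRAMMAR = {
    ("N", "SUFFIX"): "N",
    ("N", "N"): "N",
}


def segment(text):
    """Split text into one- and two-character segments by dictionary lookup.

    Assumption: a two-character dictionary match is preferred over a
    single character; unmatched characters become one-character segments.
    """
    segments = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in DICTIONARY:
            segments.append(pair)
            i += 2
        else:
            segments.append(text[i])
            i += 1
    return segments


def disambiguate(segments):
    """Attach one category to each segment, using the previous segment as context."""
    tagged = []
    for seg in segments:
        cats = DICTIONARY.get(seg, {"UNK"})
        if len(cats) == 1:
            cat = next(iter(cats))
        else:
            prev = tagged[-1][1] if tagged else None
            # Placeholder context rule: after a nominal segment prefer N;
            # otherwise fall back to an arbitrary but deterministic choice.
            if prev in ("N", "SUFFIX") and "N" in cats:
                cat = "N"
            else:
                cat = sorted(cats)[0]
        tagged.append((seg, cat))
    return tagged


def parse(tagged):
    """Combine adjacent segments whenever a grammar rule licenses it."""
    stack = []
    for item in tagged:
        stack.append(item)
        while len(stack) >= 2 and (stack[-2][1], stack[-1][1]) in GRAMMAR:
            (left, lcat), (right, rcat) = stack[-2], stack[-1]
            stack[-2:] = [(left + right, GRAMMAR[(lcat, rcat)])]
    return stack


def index_terms(text):
    """Return the nominal units found in text as candidate index terms."""
    return [unit for unit, cat in parse(disambiguate(segment(text))) if cat == "N"]


print(index_terms("图书馆管理"))
```

With this toy data, the input is first cut into the segments 图书, 馆 and 管理, and the grammar then joins them into one compound that exceeds two characters, which is the kind of unit the hypothesis in the abstract concerns.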
Description: A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.
URI: https://dspace.lboro.ac.uk/2134/13685
Appears in Collections: PhD Theses (Information Science)

Files associated with this item:

File                 Size       Format
Thesis-1992-Wu.pdf   8.15 MB    Adobe PDF
Form-1992-Wu.pdf     41.06 kB   Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.