Loughborough University
Browse
Thesis-1992-Wu.pdf (7.96 MB)

A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts

Download (7.96 MB)
thesis
posted on 2013-11-27, 11:59 authored by Zimin Wu
Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases, to represent the original text. In the current English text retrieval systems, this process of content representation is accomplished by extracting words using spaces and punctuations as word delimiters. The same technique cannot easily be applied to Chinese texts which contain no obvious word boundaries; they appear to be a linear sequence of non-spaced or equally spaced ideographic characters and thenumber of characters in words varies. The solution to the problem lies in morphological and syntactic analyses of Chinese morphemes, words and phrases. The idea is inspired by the experiments on English computational morphology and its application to English text retrieval, mainly automatic compound and phrase indexing. These areas are particularly germane to Chinese because typographically there are no morph and phrase boundaries in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases exceeding two Chinese characters can be characterised by a grammar that describes the concatenation behaviour of morphological and syntactic categories. This is examined using the following three procedures: (1) text segmentation - texts are divided into one and two character segments by searching a dictionary containing over 17000 morphemes and words, which are tagged with 'morphological and syntactic categories. (2) category disambiguation - for the resulting morphemes and words tagged with more than one category, the correct one is selected based on context (3) parsing - the segments are analysed using the grammar, which combines them into compound and complex words and phrases for indexing and retrieval. The utilities employed in the experiment include CCOOS, an extended version of MSOOS providing for Chinese I/O system,Chinese Wordstar for text input and Chinese dBASEIII for dictionary construction. Source codes are written in Turbo BASIC including its database toolbox. Thiny texts are drawn randomly from newspapers to form thcsample for the experiment. The results prove that the partial syntactic analysis-based approach can extract keywords with a good degree of accuracy.

History

School

  • Science

Department

  • Information Science

Publisher

© Zimin Wu

Publication date

1992

Notes

A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.

EThOS Persistent ID

uk.bl.ethos.587865

Language

  • en

Usage metrics

    Loughborough Theses

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC