CRF-Based Automatic Syntactic Parsing of English Functional Clauses


1 Introduction


1.1Research Background
The first and foremost difficulty in parsing is ambiguity. Linguists, philosophers, and logicians have long agreed that natural languages are so vague and ambiguous that they cannot be described in the same way as artificial languages. Humans, with their extensive language knowledge, can rule out wrong readings, but the machine cannot.

The second difficulty is the search space. Because parsing is complicated and involves a great deal of computation, the longer the sentence, the more space the machine may need. An effective parser is therefore always needed to reduce the complexity.

To solve the first problem, a proper grammar, which is a formal specification of the legal syntactic structures allowed in English [3], is needed. The approaches of the formal linguist Chomsky have heavily influenced modern computational linguistics, and many theories based on generative grammar have prevailed, such as Probabilistic Context-Free Grammar (PCFG), Head-driven Phrase Structure Grammar (HPSG), and Lexical Functional Grammar (LFG). Though they have contributed a great deal to this area, they cannot solve the semantic problems of parsing because they pay little attention to the nature of language, its meaning, and its context. In view of this, Systemic Functional Grammar (SFG) is introduced in this paper. As Bloor [4] put it, it studies "actual instances of language that have been used by speakers or writers". Eggins [5] also argued that SFG is a "functional-semantic approach to language which explores both how people use language in different contexts, and how language is structured for use as a semiotic system". For syntactic parsing, we choose the clause theories in SFG, aiming to improve parsing quality.

To solve the second problem, the computational model we choose for parsing is Conditional Random Fields (CRFs). CRFs are a probabilistic framework for labeling and segmenting sequential data [6], used in shallow parsing, named entity recognition, and gene finding.
Linear-chain CRFs are especially popular in NLP because they can predict a sequence of labels for a sequence of input samples. Because the corpus used in this thesis is small, and the parsing result takes the form of phrases rather than trees, CRFs are more appropriate.
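The decoding step of a linear-chain CRF can be sketched as a Viterbi search over tag sequences. The snippet below is a minimal illustration with hand-set emission and transition scores; the tags, tokens, and scores are hypothetical examples, not the thesis's trained model.

```python
def viterbi(tokens, tags, emission, transition):
    """Find the highest-scoring tag sequence for `tokens`.

    emission[(tag, token)]  -> score of emitting `token` under `tag`
    transition[(prev, tag)] -> score of moving from `prev` to `tag`
    Missing entries default to a large penalty.
    """
    NEG = -1e9
    # best[i][t] = (score, backpointer) for tag t at position i
    best = [{t: (emission.get((t, tokens[0]), NEG), None) for t in tags}]
    for i in range(1, len(tokens)):
        col = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0]
                 + transition.get((p, t), NEG)
                 + emission.get((t, tokens[i]), NEG), p)
                for p in tags)
            col[t] = (score, prev)
        best.append(col)
    # backtrack from the best-scoring final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


tags = ["B-S", "B-P", "B-C"]
emission = {("B-S", "company"): 2.0, ("B-P", "raised"): 2.0, ("B-C", "prices"): 1.0}
transition = {("B-S", "B-P"): 1.0, ("B-P", "B-C"): 1.0}
print(viterbi(["company", "raised", "prices"], tags, emission, transition))
```

In a real CRF the emission scores come from weighted feature functions learned during training; the dynamic program itself, however, has exactly this shape.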
………..


1.2 Research Questions
This thesis presents an application of the clause theories in Halliday's Systemic Functional Grammar to automatic parsing, using a small Business English corpus, and aims to build a scalable parser useful for machine translation. The research questions are as follows:
(1) How can parsing be realized using the clause theories in SFG? Will the parsing results be favorable?
(2) What kind (or kinds) of errors appear most often in parsing? How many kinds of errors are there? What are their causes? And how can these mistakes be avoided?
……….


2 Literature Review


2.1 Grammars Related to Parsing
Chomsky's linguistic theories opened the possibility of a computer-science model that enables a programmer to accomplish meaningful linguistic goals systematically. Though his theories are controversial, they laid the foundation for many other theories, such as the Extended Standard Theory, the Revised Extended Standard Theory, the Principles and Parameters Theory, and the Minimalist Program, to name just a few [7]. In China, Zhou Ming [8] proposed a typical rule-based parsing model. He defined eleven kinds of phrases and seventeen kinds of dependency relationships, and compiled more than one thousand phrase identification rules and over three hundred dependency rules.

The advantage of rule-based parsing models is that the machine gains access to the widest possible range of natural-language usage, but their disadvantages weigh heavier [9]. Firstly, making rules is a sophisticated job that depends too much on the linguist's knowledge and experience, so the rules listed can never cover all aspects of natural language. Secondly, the compatibility of all the rules cannot be ensured: as rules accumulate in the system, conflicts arise. Thus, since the late 1990s, computational linguists have applied statistical mathematics to analyzing language data, which has been the mainstream ever since [10].

The advantages of statistical parsing models are as follows [11]. Firstly, they depend less on human prior knowledge, which makes them more objective and independent. Secondly, statistical parsing models make it possible to separate language knowledge from the algorithm; in other words, the same corpus can be used with many algorithms. As new language phenomena appear, the only thing needed is to enlarge the corpus, instead of adding more rules to the system. Thirdly, with the development of computer hardware and devices, storage and computation costs have been largely reduced.
Also, the abundant global internet resources provide the necessary environment for the development of statistical, corpus-based parsing models. Several widely used grammars for statistical parsing are introduced in this section.
………


2.2 Important Theories in Systemic Functional Grammar
Because this thesis studies the clause from the perspective of Halliday's Systemic Functional Grammar, some basic tenets that make the theory particularly important need to be introduced. Following the tradition of J.R. Firth and the Prague School, and also influenced by Malinowski's anthropological studies, Michael Halliday set out to develop his theory, known as Systemic Functional Linguistics, whose main concern is the function of language in use.

In Lynn Yu Luo's thesis, there is a comparison between the functional and traditional views of language, which illustrates the nature of SFG clearly; see Table 2.1. From this table, it is noticeable that the traditional and functional views differ in many respects. While the functional view tends to see language as a societal phenomenon and as a resource for meaning making, the traditional view tends to regard language primarily as a mental phenomenon and to explain it as a set of rules. The functional view emphasizes the importance of context and sees the text as a whole, while the traditional view cares only about the grammatical accuracy of the text and analyzes it at the level of the sentence and below. Moreover, where the functional view treats language learning as an ongoing process of extending one's resources for making meaning, the traditional view sees language learning as acquiring correct forms.

………


3 Methodology ........ 18
3.1 Corpus-Based Study ........ 18
3.2 Conditional Random Fields (CRFs) ........ 18
3.3 IOB Representation ........ 19
3.4 SQL Server ........ 20
4 Parsing of Clause ........ 21
4.1 Syntax at Clause Level ........ 21
4.2 Tagging Rules for Clause-Ranked Constituent Functions ........ 22
5 Results and Discussion ........ 34
5.1 Evaluation Metrics ........ 34
5.2 Results and Discussion of the Open Tests ........ 34
5.3 Error Analysis and Discussion ........ 36


5 Results and Discussion


5.1 Evaluation Metrics
In simple terms, precision can be seen as a measure of exactness or quality: high precision means that an algorithm returned substantially more relevant results than irrelevant ones. Recall is a measure of completeness or quantity, so high recall means that an algorithm has returned most of the relevant results. The F-measure is derived to measure the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision; the balanced F-measure (β = 1) weights recall and precision equally.

In our experiment, a functional constituent is considered valid only when both its IOB tags and their order are correct. In order to obtain a reliable result, six-fold cross tests are performed. The corpus is divided into six groups, with each group equally covering all fourteen business situations. In each test, one group is chosen as the testing data and the remaining five are the training data. The details of the training and testing corpora of each test are given in the following table.
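The metrics above, under the "both the IOB tags and their order must be correct" criterion, amount to exact span matching. The sketch below illustrates this with toy gold and predicted IOB sequences; the tags are illustrative, not taken from the thesis corpus.

```python
def spans(tags):
    """Extract labeled (start, end, label) spans from an IOB tag sequence.

    A constituent counts only when its B-/I- tags and their order match
    exactly, mirroring the thesis's validity criterion.
    """
    out, start, label = [], None, None
    for i, t in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if t.startswith("B-") or t == "O":
            if label is not None:
                out.append((start, i, label))
            start, label = (i, t[2:]) if t.startswith("B-") else (None, None)
    return out

def prf(gold, pred):
    """Precision, recall, and balanced F-measure over exact span matches."""
    g, p = set(spans(gold)), set(spans(pred))
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = ["B-S", "I-S", "B-P", "B-C", "I-C"]
pred = ["B-S", "I-S", "B-P", "B-C", "B-C"]
print(prf(gold, pred))  # 2 of the 4 predicted spans match 2 of the 3 gold spans
```

Note how splitting one gold complement into two predicted spans hurts both precision (extra wrong span) and recall (the gold span is never matched exactly).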
……….


Conclusion


This part concludes the main findings and limitations of this thesis; some suggestions for future work are also given.

This thesis studied the parsing of English clauses for phrase-based statistical machine translation in the field of Natural Language Processing, under the guidance of Halliday's SFG. Different from previous studies on functional clause syntax, this thesis combined and expanded the functional constituents of the clause into the following major types: subject (S), predicator (P), complement (C), first/second/third/fourth complement (C1/C2/C3/C4), adjunct (D), residue of predicator (PR), and residue of complement (CR).

With the new tag set, a parsing system using CRFs was employed to automatically identify the different functions of clause constituents. Experiments were carried out to test the effectiveness of our parsing. The parsing results using our tagging method achieved a high proficiency, with a precision of 92.5%, recall of 91.96%, and F-measure of 92.18%. The best-identified functions were P and S, with precision, recall, and F-measure all over 97%. The second-best identified functional type was C1, with a precision of 93.39%, recall of 88.62%, and F-measure of 90.86%. The identification results for D, C, C2, and PR were comparatively low.
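The tag set summarized above can be made concrete by encoding a clause in IOB form, one label per token. The example clause and its constituent segmentation below are my own hypothetical illustration (in particular, the C1/C2 assignment is assumed), not data from the thesis corpus.

```python
# Convert a clause annotated with functional constituents (labels drawn from
# the thesis's tag set: S, P, C1, C2, D, ...) into per-token IOB labels.

def to_iob(constituents):
    """constituents: list of (function_label, [tokens]) pairs, in clause order."""
    out = []
    for label, tokens in constituents:
        for i, tok in enumerate(tokens):
            # first token of a constituent gets B-, the rest get I-
            out.append((tok, ("B-" if i == 0 else "I-") + label))
    return out

clause = [
    ("S", ["The", "manager"]),
    ("P", ["sent"]),
    ("C1", ["the", "client"]),
    ("C2", ["an", "invoice"]),
    ("D", ["yesterday"]),
]
for tok, tag in to_iob(clause):
    print(f"{tok}\t{tag}")
```

Sequences in this form (token, IOB functional tag) are exactly what a linear-chain CRF consumes as training data and emits as output.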
…………
References (omitted)

