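• "Constructing a Model for the Automatic Identification of Move Structure in English Research Article Abstracts" (《英语学术论文摘要语步结构自动识别模型的构建》)
  • Author: Liu Xia (刘霞)
  • Institution: Beijing Foreign Studies University
  • Thesis title  Constructing a Model for the Automatic Identification of Move Structure in English Research Article Abstracts
    Author  Liu Xia (刘霞)
    Discipline  Foreign Languages and Literatures
    Degree-granting institution  Beijing Foreign Studies University
    Supervisor  Wang Lifei (王立非)
    Year of publication  2016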
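    Chinese abstract (translated)  In the era of big data, it is crucial to reflect the knowledge structure and development of a discipline scientifically and comprehensively. Abstracts are a convenient and important data source for disciplinary knowledge mining, but existing knowledge-mining methods struggle to locate the key information moves in an abstract, let alone mine knowledge within a move. This calls for a model that identifies abstract moves automatically.
    Building on research in automatic text classification, three types of automatic move-identification models have emerged in natural language processing, each with strengths and weaknesses. Bag-of-words models built purely on word-frequency statistics can exhaust lexical features, but they neither filter nor group those features, which leads to feature sparsity. Models built on rule-based extraction of linguistic features avoid sparsity but do not extract all features comprehensively and systematically. The third type, which combines bag-of-words and contextual features, performs well but only on structured abstracts; its performance on the large number of unstructured abstracts remains unsatisfactory.
    In response, this study takes existing automatic move-structure identification models as its starting point and addresses their insufficient extraction of linguistic features. It draws on linguistic theory and methods to extract new features, and on research methods from corpus linguistics, natural language processing, information retrieval and statistics, in an attempt to build a better-performing model that automatically identifies the move structure of common types of English abstracts.
    The model was built in roughly four stages. (1) Corpus preparation and pre-processing. We downloaded the abstracts of all articles published in the journal Applied Linguistics from 1993 to 2014, as indexed in the Web of Science database, and excluded book reviews, conference papers and editorials, leaving 440 abstracts. The texts were then cleaned, automatically POS-tagged and syntactically parsed. (2) Manual annotation. Three researchers in the field annotated the corpus manually over the course of a year. They first annotated top-down with schemes proposed in earlier research, then bottom-up without any existing scheme, and finally combined the two approaches, settling on a six-move annotation scheme with the complete sentence as the annotation unit. Agreement between two annotators working independently was good (Kappa = .785); the points of disagreement were then discussed and revised over several rounds until full agreement was reached. (3) Feature extraction and model construction. After the move structure had been annotated, a range of tools and methods was used to extract effective move-predicting features, and these features and the data were used to train a classifier (conditional random fields) to obtain the model. (4) Model validation. The model was used to predict the move categories of the validation set; the predictions were compared with the manual annotations to measure the model's performance, which was then compared with existing models of the same kind to explore the model's strengths and weaknesses.
    The main findings fall into three areas: move analysis of abstracts, effective predictive features of move structure, and model performance. First, the study goes beyond traditional methods of move analysis: its analysis of a large body of data confirms and refines existing genre theory. Second, it verifies the effectiveness of four features extracted by existing models and confirms the predictive power of three newly added features; comparison shows that the new features extracted with corpus methods outperform those extracted with traditional methods. Across the three feature dimensions, meaning features identify moves best (F=0.609), followed by contextual features (F=0.428), with form features lowest (F=0.317). Third, the study builds an automatic identification model for abstract move structure whose performance (F=0.7819) is the best among existing automatic identification models; on informative abstracts it improves on the best existing model by 4.5%. To ensure comparability, we trained the bag-of-words model AntMover on the same corpus, and our model outperformed AntMover by about 23%.
    The model lays the foundation for the next step in disciplinary knowledge mining: locating abstract moves and the key knowledge inside them. In addition, automatic move identification goes beyond the manual identification long used in ESP, making it possible for theoretical and empirical move analysis to reach more disciplines and research areas and to develop into a more comprehensive, multi-perspective and multi-dimensional field.
    Keywords: genre analysis, move structure, automatic identification, English abstracts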
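The inter-annotator agreement reported above (Kappa = .785) is presumably Cohen's kappa, since it compares two annotators; the abstract does not say which variant was used. As a minimal sketch under that assumption, the statistic can be computed as follows. The function name `cohen_kappa`, the move labels `M1` to `M5` and the six-sentence example are hypothetical illustrations, not data or code from the thesis.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of the coders' marginal proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical move labels for six sentences (not data from the thesis).
coder_1 = ["M1", "M2", "M3", "M4", "M4", "M5"]
coder_2 = ["M1", "M2", "M3", "M3", "M4", "M5"]
kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred to a simple percentage when the label distribution is uneven.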
    English abstract  In the era of Big Data, it is of great significance to explore the knowledge structure and discover the trends of a research area. Abstracts are widely acknowledged as an accessible and important source of information for Knowledge Discovery. Nevertheless, existing data mining techniques cannot identify key information moves, let alone anchor the key information inside these moves. This gap calls for a model that automatically identifies move structures.
    Building on research in automatic text categorization, three types of models have been proposed in the field of Natural Language Processing, each with its own strengths and limitations. The first type, commonly referred to as “bag of words” models, is based on word frequency and statistical methods. It can exhaust all term features, but it is prone to feature sparsity because it performs no feature selection. The second type is based on linguistic rules; it avoids feature sparsity but cannot exhaust all features. The third type, which integrates both “bag of words” and contextual features, performs better, but it can only deal with structured abstracts; its performance on unstructured abstracts is not satisfactory.
    In response, this study aims to construct a model that automatically identifies the move structure of more types of English abstracts with better performance. Knowledge and techniques from several disciplines were employed, including Corpus Linguistics, Natural Language Processing, Information Retrieval and Statistics. In addition, theories and concepts in Linguistics (e.g. Move Analysis) were drawn on to make up for the limitations of the existing models.
    The model was constructed in four stages: (1) Data preparation and pre-processing.
    At this stage we downloaded from the Web of Science database the abstracts of all English research articles published in Applied Linguistics from 1993 to 2014. Altogether, 440 texts were collected after book reviews, conference articles and editorials were removed. The data were cleaned and pre-processed with POS tagging and parsing. (2) Manual annotation. The manual annotation was carried out by highly experienced researchers in the field and took a whole year. The coding scheme, with six moves and the sentence as the analytical unit, integrated the top-down approach (i.e. coding based on existing coding schemes) and the bottom-up approach (i.e. coding with no existing coding scheme). Two coders then coded all the texts independently and achieved a high degree of agreement (Kappa = .785). Finally, they discussed and resolved every difference between their codings to reach complete agreement. (3) Feature extraction and model construction. After manual annotation, various features were extracted to predict the move structure. The effective predictive features were identified and used to construct the model with a Conditional Random Fields classifier. (4) Model evaluation. At the evaluation stage, we used 10-fold cross-validation to evaluate the system. The data were randomly divided into 10 sets; each time, 9 sets were used to train the classifier and 1 set to test it. The final performance was the average over the ten runs. In addition, the final performance was compared with that of previous models to explore the advantages and shortcomings of this model.
    This study contributes to current research in three main respects. Firstly, it contributes to genre study by being data-based: unlike traditional genre studies, it approaches genre on the basis of big data.
    Secondly, this research has validated the effectiveness of four features proposed by existing models. In addition, it identified three new features and confirmed that these features, retrieved with the corpus approach, are more effective predictors than those obtained in traditional ways. Across the three feature dimensions, meaning-oriented features are the strongest predictors of moves (F=0.609), form-oriented features are the weakest (F=0.317), and contextual features lie in between (F=0.428). Thirdly, this study has constructed an effective model for the identification of move structure. The performance of this model (F=0.7819) is the best so far among existing models. Its performance on informative abstracts (F=0.8218) is 4.5% higher than that of the best existing model. To ensure comparability, we used the same data set to train the “bag of words” model AntMover; our model outperformed AntMover by about 23%.
    Constructing a model for the automatic identification of move structure in English research article abstracts is a necessary step for Knowledge Discovery to anchor the key moves and, further, to pinpoint the key information inside each move. In addition, automatic identification of move structure moves beyond the manual analysis that has long been the practice in ESP, which could help theoretical and empirical studies of move analysis develop into a more comprehensive, multi-perspective and multi-dimensional field of study by integrating with other research areas.
    Keywords: genre analysis, move structure, automatic identification, English abstracts
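The 10-fold cross-validation described in stage (4) can be sketched as below. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not say whether the reported F values are macro- or micro-averaged (macro-averaging is assumed here), nor whether folds were drawn over abstracts or sentences (a flat list of items is assumed). The learner `train_majority` is a deliberately trivial stand-in, not a Conditional Random Fields classifier, and all names and toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k nearly equal folds."""
    if k > n_items:
        raise ValueError("k must not exceed the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over the labels present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def cross_validate(items, labels, train_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    fold_scores = []
    for held_out in k_fold_indices(len(items), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        predict = train_fn([items[i] for i in train_idx], [labels[i] for i in train_idx])
        gold = [labels[i] for i in held_out]
        predicted = [predict(items[i]) for i in held_out]
        fold_scores.append(macro_f1(gold, predicted))
    return sum(fold_scores) / k


def train_majority(train_items, train_labels):
    """Stand-in learner (NOT a CRF): always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda item: majority


# Hypothetical toy data: 40 sentences, three of every four labelled "M1".
sentences = [f"sentence {i}" for i in range(40)]
moves = ["M2" if i % 4 == 0 else "M1" for i in range(40)]
mean_f1 = cross_validate(sentences, moves, train_majority, k=10)
print(f"mean macro-F1 over 10 folds: {mean_f1:.3f}")
```

For a sequence labeller such as a CRF, folds would more naturally be drawn over whole abstracts so that the sentences of one abstract never straddle the training and test sets; the flat split above is kept only for brevity.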