Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Authors: Jie ZHOU, Bi-cheng LI, Gang CHEN

Affiliation: Department of Signal Analysis and Information Processing, Zhengzhou Information Science and Technology Institute

Funding: Project supported by the National Natural Science Foundation of China (No. 14BXW028)

Publication: Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿(英文版))

Year, volume, issue: 2015, Vol. 16, No. 11

Pages: 940-956

Abstract: Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; combining the two improves precision and coverage. We then identify the types of implicit mentions by using the boundary information of outgoing links. Selecting sentences related to the domains of the test data allows us to train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.

Keywords: NER corpora; Chinese Wikipedia; Entity classification; Domain adaptation; Corpus selection

Subject classification: 081203 [081203]; 08 [Engineering]; 0835 [0835]; 0812 [Engineering - Surveying and Mapping]

DOI: 10.1631/FITEE.1500067

Collection number: 203459279...
