Distant supervision for relation extraction without labeled data

Basic information

Authors: {mikemintz,sbills,rion,jurafsky}@cs.stanford.edu
Published: August 2009

Freebase

The columns in the dataset are defined as:
creation_timestamp (Unix epoch time in milliseconds)
creator
deletion_timestamp (Unix epoch time in milliseconds)
deletor
subject (MID)
predicate (MID)
object (MID/Literal)
language_code

1352854086000,/user/mwcl_wikipedia_en,1352855856000,/user/mwcl_wikipedia_en,/m/03r90,/type/object/key,/wikipedia/en/$B816,en
1355171076000,/user/mwcl_musicbrainz,1364258198000,/user/turtlewax_bot,/m/0nncp9z,/music/recording/artist,/m/01vbfm4,en
1176630380000,/user/mwcl_images,1335928144000,/user/gardening_bot,/m/029w57m,/common/image/size,/m/0kly56,en
1292854917000,/user/mwcl_musicbrainz,1364823418001,/user/mbz_pipeline_merge_bot,/m/0fv1vl8,/type/object/type,/common/topic,en
1205530905000,/user/mwcl_images,1336022041000,/user/gardening_bot,/m/01x5scz,/common/licensed_object/license,/m/02x6b,en
1302391361000,/user/content_administrator,1336190973000,/user/gardening_bot,/m/0gkb45y,/type/object/type,/type/content,en
1176728962002,/user/mwcl_images,1335954186000,/user/gardening_bot,/m/08430h,/common/topic/image,/m/02cs147,en
1172002568007,/user/mwcl_chefmoz,1283588560000,/user/delete_bot,/m/01z4c1z,/type/object/name,La Casa Rosa Mexican Restaurant,en
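
As a rough illustration of the column layout above, the following sketch parses one of the sample rows into a named record. The `DeletedTriple` type and the `parse_row` and `to_utc` helpers are hypothetical names introduced here, and the sketch assumes each row is plain comma-separated text with exactly eight fields.

```python
import csv
from datetime import datetime, timezone
from typing import NamedTuple


class DeletedTriple(NamedTuple):
    """One row, following the column list above (hypothetical type)."""
    creation_timestamp: int   # Unix epoch time in milliseconds
    creator: str
    deletion_timestamp: int   # Unix epoch time in milliseconds
    deletor: str
    subject: str              # MID
    predicate: str
    obj: str                  # MID or literal
    language_code: str


def parse_row(line: str) -> DeletedTriple:
    """Parse one comma-separated row; assumes exactly eight fields."""
    fields = next(csv.reader([line]))
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    created, creator, deleted, deletor, subj, pred, obj, lang = fields
    return DeletedTriple(int(created), creator, int(deleted), deletor,
                         subj, pred, obj, lang)


def to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


row = parse_row(
    "1172002568007,/user/mwcl_chefmoz,1283588560000,/user/delete_bot,"
    "/m/01z4c1z,/type/object/name,La Casa Rosa Mexican Restaurant,en"
)
print(row.subject, row.predicate, row.obj)
print(to_utc(row.creation_timestamp).isoformat())
```

A literal object that itself contains a comma would break the eight-field assumption, so a real loader would need to check how the dump escapes such values.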

Motivation

Supervised relation extraction suffers from a number of problems, however. Labeled training data is expensive to produce and thus limited in quantity. Also, because the relations are labeled on a particular corpus, the resulting classifiers tend to be biased toward that text domain. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.

Introduction

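Three mainstream approaches:

Supervised learning. Related papers:
Zhou et al., 2005; Zhou et al., 2007; Surdeanu and Ciaramita, 2007

Unsupervised information extraction. Related papers: Shinyama and Sekine, 2006; Banko et al., 2007

Bootstrapping from a small number of seed instances or patterns. Related papers: Brin, 1998; Riloff and Jones, 1999; Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002; Etzioni et al., 2005; Pennacchiotti and Pantel, 2006; Bunescu and Mooney, 2007; Rozenfeld and Feldman, 2008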

Many early algorithms for relation extraction used little or no syntactic information. For example, the DIPRE algorithm by Brin (1998) used string-based regular expressions in order to recognize relations such as author-book, while the SNOWBALL algorithm by Agichtein and Gravano (2000) learned similar regular expression patterns over words and named entity tags. Hearst (1992) used a small number of regular expressions over words and part-of-speech tags to find examples of the hypernym relation. The use of these patterns has been widely replicated in successful systems, for example by Etzioni et al. (2005).

Others, such as Ravichandran and Hovy (2002) and Pantel and Pennacchiotti (2006), use the same formalism of learning regular expressions over words and part-of-speech tags to discover patterns indicating a variety of relations.

More recent approaches have used deeper syntactic information derived from parses of the input sentences, including work exploiting syntactic dependencies by Lin and Pantel (2001) and Snow et al. (2005), and work in the ACE paradigm such as Zhou et al. (2005) and Zhou et al. (2007).
Perhaps most similar to our distant supervision algorithm is the effective method of Wu and Weld (2007), who extract relations from a Wikipedia page by using supervision from the page's infobox.
Unlike their corpus-specific method, which is specific to a (single) Wikipedia page, our algorithm allows us to extract evidence for a relation from many different documents, and from any genre.

Approach

For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier.
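
The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the knowledge base and sentences are toy data, `collect_training_sentences` is a hypothetical helper, and entities are located by plain substring matching, whereas a real system would rely on a named entity tagger.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Toy knowledge base: (entity 1, entity 2) -> relation name. Illustrative only.
KB: Dict[Pair, str] = {
    ("Steven Spielberg", "Saving Private Ryan"): "/film/director/film",
    ("Edwin Hubble", "Marshfield"): "/people/person/place_of_birth",
}

# Toy unlabeled corpus.
SENTENCES: List[str] = [
    "Steven Spielberg directed Saving Private Ryan in 1998.",
    "Saving Private Ryan earned Steven Spielberg an Academy Award.",
    "Edwin Hubble was born in Marshfield, Missouri.",
    "Marshfield is a city in Missouri.",
]


def collect_training_sentences(kb: Dict[Pair, str],
                               sentences: List[str]) -> Dict[Pair, List[str]]:
    """Group sentences by the KB entity pair they mention.

    Assumption: an entity is "mentioned" if its name occurs as a substring.
    """
    matched: Dict[Pair, List[str]] = defaultdict(list)
    for sentence in sentences:
        for e1, e2 in kb:
            if e1 in sentence and e2 in sentence:
                matched[(e1, e2)].append(sentence)
    return dict(matched)


examples = collect_training_sentences(KB, SENTENCES)
for pair, sents in examples.items():
    # Every matched sentence inherits the relation label of its entity pair.
    print(KB[pair], pair, len(sents))
```

Grouping by entity pair, rather than by sentence, lets features from several sentences about the same pair be pooled into one training instance.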

Because our algorithm is supervised by a database, rather than by labeled text, it does not suffer from the problems of overfitting and domain-dependence that plague supervised systems. Supervision by a database also means that, unlike in unsupervised approaches, the output of our classifier uses canonical names for relations.

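Method

Extract features for the entity pairs that appear in Freebase and use them to build the training data;
Train a multiclass logistic regression model.

Every pair of entities that appears together in a sentence is treated as a potential relation instance and serves as test data;
Classify these entity pairs with the trained model.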
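
To make the train-then-classify loop above concrete, here is a small pure-Python sketch of multiclass (softmax) logistic regression over sparse binary features. It is an illustration under stated assumptions, not the paper's code: the feature strings, the `NO_RELATION` label, the toy training set, and the plain SGD settings (`epochs`, `lr`) are all made up for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[List[str], str]          # (active feature names, relation label)
Weights = Dict[Tuple[str, str], float]   # (label, feature) -> weight


def softmax_scores(weights: Weights, labels: List[str],
                   features: List[str]) -> Dict[str, float]:
    """Return P(label | features) for every label."""
    logits = [sum(weights.get((label, f), 0.0) for f in features)
              for label in labels]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for stability
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def train(examples: List[Example], epochs: int = 50,
          lr: float = 0.5) -> Tuple[Weights, List[str]]:
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({label for _, label in examples})
    weights: Weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in examples:
            probs = softmax_scores(weights, labels, features)
            for label in labels:
                # Gradient per active feature: indicator(gold) - P(label).
                grad = (1.0 if label == gold else 0.0) - probs[label]
                for f in features:
                    weights[(label, f)] += lr * grad
    return weights, labels


def classify(weights: Weights, labels: List[str],
             features: List[str]) -> Tuple[str, float]:
    """Return the most probable relation and its probability."""
    probs = softmax_scores(weights, labels, features)
    best = max(labels, key=lambda label: probs[label])
    return best, probs[best]


TRAIN: List[Example] = [
    (["between:directed", "ner1:PERSON"], "/film/director/film"),
    (["between:'s film", "ner1:PERSON"], "/film/director/film"),
    (["between:was born in", "ner1:PERSON"], "/people/person/place_of_birth"),
    (["between:is located near", "ner1:LOCATION"], "NO_RELATION"),
]

weights, labels = train(TRAIN)
relation, confidence = classify(weights, labels,
                                ["between:was born in", "ner1:PERSON"])
print(relation, round(confidence, 3))
```

The probability returned by `classify` is the kind of confidence score that can later be used to rank extracted pairs.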

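Feature selection

Lexical features:
a) the sequence of words between the two entities;
b) the part-of-speech tags of these words;
c) a flag indicating which entity appears first in the sentence;
d) a window of k words to the left of the first entity;
e) a window of k words to the right of the second entity.
Syntactic features:
a) the shortest dependency path between the two entities;
b) the left and right window nodes of the two entities.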
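
The lexical features listed above can be sketched as one conjoined string per value of k. This is a minimal sketch under assumptions: the `lexical_feature` helper, the feature string layout, and the hand-assigned part-of-speech tags are illustrative, and the syntactic features are omitted because they require a dependency parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag)
Span = Tuple[int, int]    # token indices, end exclusive


def _fmt(tokens: List[Token]) -> str:
    """Render tokens as 'word/TAG' separated by spaces."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def lexical_feature(tokens: List[Token], e1: Span, e2: Span, k: int) -> str:
    """Conjoin the order flag, left window, middle words, and right window.

    The two spans must not overlap; they may appear in either order.
    """
    e1_first = e1[0] < e2[0]
    first, second = (e1, e2) if e1_first else (e2, e1)
    left = tokens[max(0, first[0] - k):first[0]]      # k words before first entity
    middle = tokens[first[1]:second[0]]               # words between the entities
    right = tokens[second[1]:second[1] + k]           # k words after second entity
    return "|".join([
        "E1_FIRST" if e1_first else "E2_FIRST",
        f"L:{_fmt(left)}",
        f"M:{_fmt(middle)}",
        f"R:{_fmt(right)}",
    ])


# Hand-tagged example sentence (tags are assumptions, not tagger output).
TOKENS: List[Token] = [
    ("Astronomer", "NN"), ("Edwin", "NNP"), ("Hubble", "NNP"),
    ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
    ("Marshfield", "NNP"), (",", ","), ("Missouri", "NNP"),
]
HUBBLE: Span = (1, 3)
MARSHFIELD: Span = (6, 7)

for k in (0, 1, 2):
    print(lexical_feature(TOKENS, HUBBLE, MARSHFIELD, k))
```

Because each feature is a conjunction of all its parts, two sentences share a feature only when every part matches, which favors precision over recall.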

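Named entity tag features

Person, location, organization, and other.
Other points to note:
The features are conjoined and fed into the multiclass logistic regression model.
Negative examples: randomly select entity pairs that do not appear in Freebase (some of these may be wrong, since a pair missing from Freebase can still be related).
Training and test data: half of the relation instances in Freebase are used for training and the other half for testing. The text comes from Wikipedia, split 2:1 between training and testing. At test time, only instances that did not appear during training (that is, not in the training portion of Freebase) are classified.
Selecting results: classify all entity pairs and assign each one a confidence score for its predicted relation. Then rank the pairs by confidence and keep the top n.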
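
The negative sampling and top-n selection steps above could look roughly like this. It is a toy-scale sketch with hypothetical helper names (`sample_negatives`, `top_n`, `precision`) and made-up predictions; enumerating every entity pair is only feasible for tiny inputs, and the precision check assumes held-out Freebase facts serve as the reference.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
Prediction = Tuple[Pair, str, float]   # (entity pair, relation, confidence)


def sample_negatives(entities: List[str], kb_pairs: Set[Pair],
                     n: int, seed: int = 0) -> List[Pair]:
    """Randomly pick up to n ordered entity pairs that are absent from the KB."""
    rng = random.Random(seed)
    candidates = [(a, b) for a in entities for b in entities
                  if a != b and (a, b) not in kb_pairs]
    return rng.sample(candidates, min(n, len(candidates)))


def top_n(predictions: List[Prediction], n: int) -> List[Prediction]:
    """Rank predictions by confidence (highest first) and keep the best n."""
    return sorted(predictions, key=lambda p: p[2], reverse=True)[:n]


def precision(selected: List[Prediction], held_out: Dict[Pair, str]) -> float:
    """Fraction of selected predictions confirmed by held-out KB facts."""
    if not selected:
        return 0.0
    correct = sum(1 for pair, rel, _ in selected if held_out.get(pair) == rel)
    return correct / len(selected)


entities = ["Edwin Hubble", "Marshfield", "Steven Spielberg", "Saving Private Ryan"]
kb_pairs = {("Edwin Hubble", "Marshfield"),
            ("Steven Spielberg", "Saving Private Ryan")}
negatives = sample_negatives(entities, kb_pairs, n=3)

# Made-up classifier output and held-out facts, for illustration only.
predictions: List[Prediction] = [
    (("A", "B"), "r1", 0.9),
    (("C", "D"), "r2", 0.4),
    (("E", "F"), "r1", 0.7),
]
held_out = {("A", "B"): "r1", ("E", "F"): "r2"}

best = top_n(predictions, 2)
print(negatives)
print(best, precision(best, held_out))
```

Lowering n trades recall for precision, since only the most confident predictions are kept.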

Results

Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%.
