Distant supervision for relation extraction without labeled data

# Freebase

The columns in the dataset are defined as:

• creation_timestamp (Unix epoch time in milliseconds)
• creator
• deletion_timestamp (Unix epoch time in milliseconds)
• deletor
• subject (MID)
• predicate (MID)
• object (MID/Literal)
• language_code

1352854086000,/user/mwcl_wikipedia_en,1352855856000,/user/mwcl_wikipedia_en,/m/03r90,/type/object/key,/wikipedia/en/\$B816,en
1355171076000,/user/mwcl_musicbrainz,1364258198000,/user/turtlewax_bot,/m/0nncp9z,/music/recording/artist,/m/01vbfm4,en
1176630380000,/user/mwcl_images,1335928144000,/user/gardening_bot,/m/029w57m,/common/image/size,/m/0kly56,en
1292854917000,/user/mwcl_musicbrainz,1364823418001,/user/mbz_pipeline_merge_bot,/m/0fv1vl8,/type/object/type,/common/topic,en
1176728962002,/user/mwcl_images,1335954186000,/user/gardening_bot,/m/08430h,/common/topic/image,/m/02cs147,en
1172002568007,/user/mwcl_chefmoz,1283588560000,/user/delete_bot,/m/01z4c1z,/type/object/name,La Casa Rosa Mexican Restaurant,en
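
As a rough illustration of the column layout above, the sketch below parses one row into a typed record. The names `TripleRecord`, `from_millis`, and `parse_row` are hypothetical and chosen only for this example. It assumes plain comma-separated rows with exactly eight fields; a literal object value containing an unquoted comma would need extra handling.

```python
import csv
from datetime import datetime, timezone
from typing import NamedTuple


class TripleRecord(NamedTuple):
    """One row of the dataset; field names follow the column list (hypothetical type)."""
    created: datetime
    creator: str
    deleted: datetime
    deletor: str
    subject: str
    predicate: str
    obj: str
    language_code: str


def from_millis(ms: str) -> datetime:
    """Convert a Unix epoch timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


def parse_row(line: str) -> TripleRecord:
    """Parse one comma-separated row (assumes exactly eight fields)."""
    fields = next(csv.reader([line]))
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    created, creator, deleted, deletor, subject, predicate, obj, lang = fields
    return TripleRecord(
        from_millis(created), creator, from_millis(deleted), deletor,
        subject, predicate, obj, lang,
    )


row = parse_row(
    "1355171076000,/user/mwcl_musicbrainz,1364258198000,/user/turtlewax_bot,"
    "/m/0nncp9z,/music/recording/artist,/m/01vbfm4,en"
)
print(row.subject, row.predicate, row.obj)
```

Converting the timestamps on load keeps later filtering by creation or deletion date straightforward.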

# Motivation

Supervised relation extraction suffers from a number of problems. Labeled training data is expensive to produce and thus limited in quantity. Also, because the relations are labeled on a particular corpus, the resulting classifiers tend to be biased toward that text domain. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms and allowing the use of corpora of any size.

# Introduction

• The NIST Automatic Content Extraction (ACE) RDC 2003 and 2004 corpora, for example, include over 1,000 documents in which pairs of entities have been labeled with 5 to 7 major relation types and 23 to 24 subrelations, totaling 16,771 relation instances. ACE systems then extract a wide variety of lexical, syntactic, and semantic features, and use supervised classifiers to label the relation mention holding between a given pair of entities in a test set sentence, optionally combining relation mentions.

• An alternative approach, purely unsupervised information extraction, extracts strings of words between entities in large amounts of text, and clusters and simplifies these word strings to produce relation-strings (Shinyama and Sekine, 2006; Banko et al., 2007). Unsupervised approaches can use very large amounts of data and extract very large numbers of relations, but the resulting relations may not be easy to map to relations needed for a particular knowledge base.

• A third approach has been to use a very small number of seed instances or patterns to do bootstrap learning. These seeds are used with a large corpus to extract a new set of patterns, which are used to extract more instances, which are used to extract more patterns, in an iterative fashion. The resulting patterns often suffer from low precision and semantic drift.

Many early algorithms for relation extraction used little or no syntactic information. For example, the DIPRE algorithm by Brin (1998) used string-based regular expressions in order to recognize relations such as author-book, while the SNOWBALL algorithm by Agichtein and Gravano (2000) learned similar regular expression patterns over words and named entity tags. Hearst (1992) used a small number of regular expressions over words and part-of-speech tags to find examples of the hypernym relation. The use of these patterns has been widely replicated in successful systems, for example by Etzioni et al. (2005).

Others, such as Ravichandran and Hovy (2002) and Pantel and Pennacchiotti (2006), use the same formalism of learning regular expressions over words and part-of-speech tags to discover patterns indicating a variety of relations.

More recent approaches have used deeper syntactic information derived from parses of the input sentences, including work exploiting syntactic dependencies by Lin and Pantel (2001) and Snow et al. (2005), and work in the ACE paradigm such as Zhou et al. (2005) and Zhou et al. (2007).
Perhaps most similar to our distant supervision algorithm is the effective method of Wu and Weld (2007) who extract relations from a Wikipedia page by using supervision from the page’s infobox.
Unlike their corpus-specific method, which is specific to a (single) Wikipedia page, our algorithm allows us to extract evidence for a relation from many different documents, and from any genre.

# Approach

For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier.

• Our algorithm uses Freebase (Bollacker et al., 2008), a large semantic database, to provide distant supervision for relation extraction. Freebase contains 116 million instances of 7,300 relations between 9 million entities. The intuition of distant supervision is that any sentence that contains a pair of entities that participate in a known Freebase relation is likely to express that relation in some way. Since there may be many sentences containing a given entity pair, we can extract very large numbers of (potentially noisy) features that are combined in a logistic regression classifier.

Because our algorithm is supervised by a database, rather than by labeled text, it does not suffer from the problems of overfitting and domain-dependence that plague supervised systems. Supervision by a database also means that, unlike in unsupervised approaches, the output of our classifier uses canonical names for relations.
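
The labeling step of distant supervision can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the function name `distant_label`, the input shapes, and the example entities and relation name are all assumptions made for this sketch. Each sentence is assumed to arrive with the set of entity names already found in it.

```python
from collections import defaultdict


def distant_label(kb_triples, sentences):
    """Collect, for every knowledge-base triple, the sentences mentioning both entities.

    kb_triples: iterable of (entity1, relation, entity2) tuples.
    sentences:  iterable of (text, entities) pairs, where entities is the set
                of entity names detected in that sentence.
    Returns a dict mapping (entity1, relation, entity2) to a list of sentence texts.
    """
    # Index the knowledge base by ordered entity pair for fast lookup.
    relations_by_pair = defaultdict(set)
    for e1, rel, e2 in kb_triples:
        relations_by_pair[(e1, e2)].add(rel)

    examples = defaultdict(list)
    for text, entities in sentences:
        for e1 in entities:
            for e2 in entities:
                if e1 == e2:
                    continue
                # Any sentence containing both entities is treated as a
                # (possibly noisy) example of each relation they share.
                for rel in relations_by_pair.get((e1, e2), ()):
                    examples[(e1, rel, e2)].append(text)
    return dict(examples)


# Hypothetical toy data, invented for this example.
kb = [("Alice Smith", "directed", "Night Train")]
corpus = [
    ("Alice Smith filmed Night Train in Paris.", {"Alice Smith", "Night Train", "Paris"}),
    ("Night Train, by Alice Smith, opened last week.", {"Alice Smith", "Night Train"}),
    ("Alice Smith visited Paris.", {"Alice Smith", "Paris"}),
]
for triple, texts in distant_label(kb, corpus).items():
    print(triple, len(texts))
```

Grouping sentences by entity pair is what lets features from many noisy mentions be pooled before classification.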

# Method

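• Basic assumption: if two entities participate in a relation, any sentence that contains both entities may express that relation.
• Training phase:
Use a named entity tagger (NET) to label persons, organizations, and locations.

• Testing phase:
Use a named entity tagger (NET) to label persons, organizations, and locations.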

## Feature selection

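3.1. Lexical features:
a) the sequence of words between the two entities;
b) the part-of-speech tags of these words;
c) a flag indicating which entity appears first;
d) a left window of size k;
e) a right window of size k.
3.2. Syntactic features:
a) the shortest dependency path between the two entities;
b) the left and right windows of the two entities.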
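
The word-level features a) to e), listed before the syntactic features, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `lexical_features`, the dictionary keys, and the half-open token spans are hypothetical, the sentence is assumed to be already tokenized and POS-tagged, and the two entity spans are assumed not to overlap. The syntactic features are not covered because they need a dependency parser.

```python
def lexical_features(tokens, tags, e1_span, e2_span, k=2):
    """Extract word-level features for one sentence and one entity pair.

    tokens, tags: parallel lists of words and part-of-speech tags.
    e1_span, e2_span: (start, end) half-open token index ranges of the entities.
    k: window size on each side of the entity pair.
    """
    e1_span, e2_span = tuple(e1_span), tuple(e2_span)
    # Order the spans by position so "between" and the windows are well defined.
    first, second = sorted([e1_span, e2_span])
    middle = slice(first[1], second[0])
    left = slice(max(0, first[0] - k), first[0])
    right = slice(second[1], second[1] + k)
    return {
        "between_words": tuple(tokens[middle]),   # a) words between the entities
        "between_tags": tuple(tags[middle]),      # b) their part-of-speech tags
        "e1_first": first == e1_span,             # c) which entity appears first
        "left_window": tuple(tokens[left]),       # d) up to k words to the left
        "right_window": tuple(tokens[right]),     # e) up to k words to the right
    }


# Hypothetical example sentence, invented for this sketch.
tokens = ["The", "director", "Alice", "Smith", "filmed", "Night", "Train", "in", "Paris", "."]
tags = ["DT", "NN", "NNP", "NNP", "VBD", "NNP", "NNP", "IN", "NNP", "."]
features = lexical_features(tokens, tags, e1_span=(2, 4), e2_span=(5, 7), k=2)
print(features)
```

Returning tuples keeps each feature hashable, so it can be used directly as a key when counting features across sentences.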

# Results

Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%.
