• ## CRF

千次阅读 2019-03-19 15:20:30
随机场：由若干个子集组成的...CRF是马尔科夫随机场的特例，它假设马尔科夫随机场中只有X和Y两种变量，X一般是给定的，而Y一般是在给定X的条件下的输出。 BiLSTM-CRF在NER上的实践总结 使用基于字的BILSTM-CRF，可...
随机场：由若干个子集组成的一个整体，而每个子集都按照某个分布随机赋予一个值，这个场就叫随机场。
马尔科夫随机场：随机场中某$^{h_{1}}$一位置的赋值仅与其相邻位置的赋值有关，和与其不相邻位置的赋值无关。
CRF是马尔科夫随机场的特例，它假设马尔科夫随机场中只有X和Y两种变量，X一般是给定的，而Y一般是在给定X的条件下的输出。
BiLSTM-CRF在NER上的实践总结
使用基于字的BILSTM-CRF，可以用如Bakeoff-3评测采用的BIO表注集
常用的网络结构是双向LSTM后接一个softmax层，输出各个label的概率，然后再接一个CRF层。
句子中的每一个单元都代表着字或词嵌入构成的向量，在输入模型训练之前，设置Dropout缓解过拟合字或词向量作为BILSTM-CRF模型的输入，双向LSTM层用来自动提取句子特征。双向LSTM接受句子的每个字的embedding序列作为输入，将正向LSTM的隐层输出$^{h_{1}}$和反向LSTM输出的隐层输出$^{h_{2}}$进行拼接$^{h_{t}}$=[$^{h_{1}}$;$^{h_{2}}$]，得到完整的隐状态序列；在经过设置的Dropout后，再介入一个线性层，将隐状态映射为k维(k是标签集的标签数)。线性层的输出$R{^{n*k}}$为提取的句子特征，可以理解为将字$^{x_{i}}$分类到第j个标签的打分值。如果在线性层后接一层softmax层，就变成了对各个字进行k分类了。但是这样没有考虑连续字标注的合理性(无法利用前面标注过的信息)，所以需要CRF层来进行标注。CRF层用来进行句子级的序列标注。CRF层的参数是一个(k+2) *(k+2)的矩阵A，Aij表示的是从第i个标签到第j个标签的转移得分，所以在一个位置进行标注的时候就可以利用前面标注过的标签。加2 是因为要为句子首部添加一个起始状态以及为句子尾部添加一个终止状态。如果一个长度等于句子长度的标签序列y=(y1,y2,y2,.......yn),模型对于句子x的标签打分等于y的打分公式:$\fn_jvn score\left ( x, \right y)= \m\sum_{i=1}^{n} P_{i,yi}\dotplus \sum_{i=1}^{n+1}A_{yi-1,yi}$可以看出整个序列的打分等于各位置的打分之和，而且每个位置的打分由两部分得到：LSTM输出的$\fn_jvn ^{P_{i}}$以及CRF的转移矩阵A决定。
softmax层的输出是相互独立的，虽然BiLSTM学习到了上下文的信息，但是输出信息互不影响，没有CRF的话只是在每一步挑选一个最大概率的label输出。这样会有问题：虽然每次的输出都是最大概率的label，但我们并不能保证每次预测都是正确的。而CRF中有转移特征，它会考虑输出label的顺序性。或者更准确的说，CRF层可以为最后预测的标签添加一些约束来保证预测的标签是合法的。在训练的过程中，这些约束可以通过CRF层自动学习到。
公式化表示条件随机场
在给定的观测序列X时，某个待定标记序列Y的概率可以定义为：

$^{t_{j}}$($^{y_{i-1}}$,$^{y_{i}}$,x,i)是转移概率，$S_{k}$($^{y_{i}}$,x,i)是状态概率，表示观测序列在位置i的标记概率

算法理论参考
CRF中的约束就是CRF的特征函数。特征函数接收四个参数：
句子(标注词性的句子)句子s中的第i个单词序列标注给第i个单词标注的词性序列标注给第i-1个单词标注的词性
特征函数的输出是0或1，0表示评分不符合这个特征，1表示要评分的标注序列符合这个特征
为了构建一个条件随机场，首先要定义一个特征函数集。每个特征函数都以整个句子s，当前位置i，位置i-1的标签作为输入，然后为每一个特征函数赋予一个权重，然后针对每一个标注序列，对这些特征函数加权求和。


展开全文
• ## crf

千次阅读 2012-04-05 20:26:49
看到不错的介绍crf的资料 转自：http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/ 原来crf是通过下降法调节参数的 又了解到无约束的最优化问题比较容易求解的，只要是可以求导的，...
看到不错的介绍crf的资料
转自：http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/
原来crf是通过下降法调节参数的
又了解到无约束的最优化问题比较容易求解的，只要是可以求导的，不管是二次、三次.....可以用最速下降法，牛顿法.....

Introduction

Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?
One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.
By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.
Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.
Part-of-Speech Tagging
Let’s go into some more detail, using the more common example of part-of-speech tagging.
In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.
For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.
So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions

fi
.
Feature Functions in a CRF
In a CRF, each feature function is a function that takes in as input:
a sentence s the position i of a word in the sentence the label

li
of the current word the label

li−1
of the previous word
and outputs a real-valued number (though the numbers are often just either 0 or 1).
(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)
For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.
Features to Probabilities
Next, assign each feature function

fj
a weight

λj
(I’ll talk below about how to learn these weights from the data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:

score(l|s)=∑mj=1∑ni=1λjfj(s,i,li,li−1)

(The first sum runs over each feature function

j
, and the inner sum runs over each position

i
of the sentence.)
Finally, we can transform these scores into probabilities

p(l|s)
between 0 and 1 by exponentiating and normalizing:

p(l|s)=exp[score(l|s)]∑l′exp[score(l′|s)]=exp[∑mj=1∑ni=1λjfj(s,i,li,li−1)]∑l′exp[∑mj=1∑ni=1λjfj(s,i,l′i,l′i−1)]

Example Feature Functions
So what do these feature functions look like? Examples of POS tagging features could include:

f1(s,i,li,li−1)=1
if

li=
ADVERB and the ith word ends in “-ly”; 0 otherwise.
If the weight

λ1
associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

f2(s,i,li,li−1)=1
if

i=1
,

li=
VERB, and the sentence ends in a question mark; 0 otherwise.
Again, if the weight

λ2
associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.

f3(s,i,li,li−1)=1
if

li−1=

li=
NOUN; 0 otherwise.
Again, a positive weight for this feature means that adjectives tend to be followed by nouns.

f4(s,i,li,li−1)=1
if

li−1=
PREPOSITION and

li=
PREPOSITION.
A negative weight

λ4
for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.
And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.
Now let’s step back and compare CRFs to some other common machine learning techniques.
Smells like Logistic Regression…
The form of the CRF probabilities

p(l|s)=exp[∑mj=1∑ni=1fj(s,i,li,li−1)]∑l′exp[∑mj=1∑ni=1fj(s,i,l′i,l′i−1)]
might look familiar.
That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.
Looks like HMMs…
Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

p(l,s)=p(l1)∏ip(li|li−1)p(wi|li)

where

p(li|li−1)
are transition probabilities (e.g., the probability that a preposition is followed by a noun);

p(wi|li−1)
are emission probabilities (e.g., the probability that a noun emits the word “dad”).
So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.
Note that the log of the HMM probability is

logp(l,s)=logp(l0)+∑ilogp(li|li−1)+∑ilogp(wi|li)
. This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.
That is, we can build a CRF equivalent to any HMM by…
For each HMM transition probability

p(li=y|li−1=x)
, define a set of CRF transition features of the form

fx,y(s,i,li,li−1)=1
if

li=y
and

li−1=x
. Give each feature a weight of

wx,y=logp(li=y|li−1=x)
. Similarly, for each HMM emission probability

p(wi=z|li=x)
, define a set of CRF emission features of the form

gx,y(s,i,li,li−1)=1
if

wi=z
and

li=x
. Give each feature a weight of

wx,z=logp(wi=z|li=x)
.
Thus, the score

p(l|s)
computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.
However, CRFs can model a much richer set of label distributions as well, for two main reasons:
CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark. CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g.,

0<=p(wi|li)<=1,∑wp(wi=w|l1)=1)
, the weights of a CRF are unrestricted (e.g.,

logp(wi|li)
can be anything it wants).
Learning Weights
Let’s go back to the question of how to learn the feature weights in a CRF. One way is (surprise) to use gradient ascent.
Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones, for each training example…
Go through each feature function

fi
, and calculate the gradient of the log probability of the training example with respect to

λi
:

∂∂wjlogp(l|s)=∑mj=1fi(s,j,lj,lj−1)−∑l′p(l′|s)∑mj=1fi(s,j,l′j,l′j−1)
Note that the first term in the gradient is the contribution of feature

fi
under the true label, and the second term in the gradient is the expected contribution of feature

fi
under the current model. This is exactly the form you’d expect gradient ascent to take. Move

λi
in the direction of the gradient:

λi=λi+α[∑mj=1fi(s,j,