
    原文链接:https://applenob.github.io/hmm.html

    隐马尔科夫模型(HMM)及其Python实现

    目录

    1.基础介绍

    首先看下模型结构,对模型有一个直观的概念(原文此处有一张 HMM 的结构示意图):

    描述下这个图:

    分成两排,第一排是 y 序列,第二排是 x 序列。每个 x 都只有一个 y 指向它,每个 y 也都有另一个 y 指向它。

    OK,直觉上的东西说完了,下面给出定义(参考《统计学习方法》):

    • 状态序列(上图中的 y,下面的 I): 隐藏的马尔科夫链随机生成的状态序列,称为状态序列(state sequence)
    • 观测序列(上图中的 x,下面的 O): 每个状态生成一个观测,而由此产生的观测的随机序列,称为观测序列(observation sequence)
    • 隐马尔科夫模型: 隐马尔科夫模型是关于时序的概率模型,描述由一个隐藏的马尔科夫链随机生成不可观测的状态随机序列,再由各个状态生成一个观测而产生观测随机序列的过程。

    形式定义

    设 Q 是所有可能的状态的集合,V 是所有可能的观测的集合:

    Q = {q_1, q_2, ..., q_N},V = {v_1, v_2, ..., v_M}

    其中,N 是可能的状态数,M 是可能的观测数。

    I 是长度为 T 的状态序列,O 是对应的观测序列:

    I = (i_1, i_2, ..., i_T),O = (o_1, o_2, ..., o_T)

    A 是状态转移矩阵:A = [a_ij]_{N×N},其中 i = 1,2,...,N;j = 1,2,...,N

    a_ij 表示在时刻 t 处于状态 q_i 的条件下,在时刻 t+1 转移到状态 q_j 的概率:

    a_ij = P(i_{t+1} = q_j | i_t = q_i)

    B 是观测概率矩阵:B = [b_j(k)]_{N×M},其中 k = 1,2,...,M;j = 1,2,...,N

    b_j(k) 表示在时刻 t 处于状态 q_j 的条件下生成观测 v_k 的概率:

    b_j(k) = P(o_t = v_k | i_t = q_j)

    π 是初始状态概率向量:π = (π_i),其中 π_i = P(i_1 = q_i)

    隐马尔科夫模型由初始状态概率向量 π、状态转移概率矩阵 A 和观测概率矩阵 B 决定。π 和 A 决定状态序列,B 决定观测序列。因此,隐马尔科夫模型 λ 可以用三元符号表示,即 λ = (A, B, π)。A、B、π 称为隐马尔科夫模型的三要素。

    隐马尔科夫模型的两个基本假设

    (1)齐次马尔科夫性假设:隐马尔科夫链在任意时刻 t 的状态只依赖于其前一时刻的状态,与其他时刻的状态及观测无关,也与时刻 t 本身无关。

    (2)观测独立性假设:任意时刻的观测只依赖于该时刻马尔科夫链的状态,与其他观测和状态无关。
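
    这两条假设用条件概率写出来就是(记号沿用上文的形式定义):

    齐次马尔科夫性:P(i_t | i_{t-1}, o_{t-1}, ..., i_1, o_1) = P(i_t | i_{t-1}),t = 1,2,...,T
    观测独立性:P(o_t | i_T, o_T, i_{T-1}, o_{T-1}, ..., i_{t+1}, o_{t+1}, i_t, i_{t-1}, o_{t-1}, ..., i_1, o_1) = P(o_t | i_t)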

    一个关于感冒的实例

    定义讲完了,举个实例,参考 hankcs 和知乎上感冒预测的例子(实际上都来自 Wikipedia:https://en.wikipedia.org/wiki/Viterbi_algorithm#Example ),这里我用最简单的语言去描述。

    假设你是一个医生,眼前有个病人,你的任务是确定他是否得了感冒。

    • 首先,病人的状态(Q)只有两种:{感冒,没有感冒}。
    • 然后,病人的感觉(观测 V)有三种:{正常,冷,头晕}。
    • 手头有病人的病例,你可以从病例的第一天确定 π(初始状态概率向量);
    • 然后根据其他病例信息,确定 A(状态转移矩阵),也就是病人某天是否感冒和他第二天是否感冒的关系;
    • 还可以确定 B(观测概率矩阵),也就是病人某天是什么感觉和他那天是否感冒的关系。

    In [1]:

    import numpy as np
    

    In [2]:

    # 对应状态集合Q
    states = ('Healthy', 'Fever')
    # 对应观测集合V
    observations = ('normal', 'cold', 'dizzy')
    # 初始状态概率向量π
    start_probability = {'Healthy': 0.6, 'Fever': 0.4}
    # 状态转移矩阵A
    transition_probability = {
        'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
        'Fever': {'Healthy': 0.4, 'Fever': 0.6},
    }
    # 观测概率矩阵B
    emission_probability = {
        'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
        'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
    }
    

    In [3]:

    # 随机生成观测序列和状态序列    
    def simulate(T):
    
        def draw_from(probs):
            """
            1.np.random.multinomial:
            按照多项式分布,生成数据
            >>> np.random.multinomial(20, [1/6.]*6, size=2)
                    array([[3, 4, 3, 3, 4, 3],
                           [2, 4, 3, 4, 0, 7]])
             For the first run, we threw 3 times 1, 4 times 2, etc.  
             For the second, we threw 2 times 1, 4 times 2, etc.
            2.np.where:
            >>> x = np.arange(9.).reshape(3, 3)
            >>> np.where( x > 5 )
            (array([2, 2, 2]), array([0, 1, 2]))
            """
            return np.where(np.random.multinomial(1,probs) == 1)[0][0]
    
        observations = np.zeros(T, dtype=int)
        states = np.zeros(T, dtype=int)
        states[0] = draw_from(pi)
        observations[0] = draw_from(B[states[0],:])
        for t in range(1, T):
            states[t] = draw_from(A[states[t-1],:])
            observations[t] = draw_from(B[states[t],:])
        return observations, states
    

    In [4]:

    def generate_index_map(labels):
        id2label = {}
        label2id = {}
        i = 0
        for l in labels:
            id2label[i] = l
            label2id[l] = i
            i += 1
        return id2label, label2id
     
    states_id2label, states_label2id = generate_index_map(states)
    observations_id2label, observations_label2id = generate_index_map(observations)
    print(states_id2label, states_label2id)
    print(observations_id2label, observations_label2id)
    
    {0: 'Healthy', 1: 'Fever'} {'Healthy': 0, 'Fever': 1}
    {0: 'normal', 1: 'cold', 2: 'dizzy'} {'normal': 0, 'cold': 1, 'dizzy': 2}
    

    In [5]:

    def convert_map_to_vector(map_, label2id):
        """将概率向量从dict转换成一维array"""
        v = np.zeros(len(map_), dtype=float)
        for e in map_:
            v[label2id[e]] = map_[e]
        return v
    
     
    def convert_map_to_matrix(map_, label2id1, label2id2):
        """将概率转移矩阵从dict转换成矩阵"""
        m = np.zeros((len(label2id1), len(label2id2)), dtype=float)
        for line in map_:
            for col in map_[line]:
                m[label2id1[line]][label2id2[col]] = map_[line][col]
        return m
    

    In [6]:

    A = convert_map_to_matrix(transition_probability, states_label2id, states_label2id)
    print(A)
    B = convert_map_to_matrix(emission_probability, states_label2id, observations_label2id)
    print(B)
    observations_index = [observations_label2id[o] for o in observations]
    pi = convert_map_to_vector(start_probability, states_label2id)
    print(pi)
    
    [[ 0.7  0.3]
     [ 0.4  0.6]]
    [[ 0.5  0.4  0.1]
     [ 0.1  0.3  0.6]]
    [ 0.6  0.4]
    

    In [7]:

    # 生成模拟数据
    observations_data, states_data = simulate(10)
    print(observations_data)
    print(states_data)
    # 相应的label
    print("病人的状态: ", [states_id2label[index] for index in states_data])
    print("病人的观测: ", [observations_id2label[index] for index in observations_data])
    
    [0 0 1 1 2 1 2 2 2 0]
    [0 0 0 0 1 1 1 1 1 0]
    病人的状态:  ['Healthy', 'Healthy', 'Healthy', 'Healthy', 'Fever', 'Fever', 'Fever', 'Fever', 'Fever', 'Healthy']
    病人的观测:  ['normal', 'normal', 'cold', 'cold', 'dizzy', 'cold', 'dizzy', 'dizzy', 'dizzy', 'normal']
    

    2.HMM的三个问题

    HMM在实际应用中,一般会遇上三种问题:

    • 1.概率计算问题:给定模型 λ=(A,B,π) 和观测序列 O=(o_1,o_2,...,o_T),计算在模型 λ 下观测序列 O 出现的概率 P(O|λ)。
    • 2.学习问题:已知观测序列 O=(o_1,o_2,...,o_T),估计模型参数 λ=(A,B,π),使 P(O|λ) 最大,即用极大似然估计的方法估计参数。
    • 3.预测问题(也称为解码(decoding)问题):已知观测序列 O=(o_1,o_2,...,o_T) 和模型 λ=(A,B,π),求使条件概率 P(I|O) 最大的状态序列 I=(i_1,i_2,...,i_T),即给定观测序列,求最有可能的对应的状态序列。

    回到刚才的例子,这三个问题就是:

    • 1.概率计算问题:如果给定模型参数,病人某一系列观测症状出现的概率。
    • 2.学习问题:根据病人某一系列观测的症状,学习模型参数。
    • 3.预测问题:根据学到的模型,预测病人这几天是不是感冒了。

    2.1 概率计算问题

    概率计算问题计算的是:在模型 λ 下观测序列 O 出现的概率 P(O|λ)。

    直接计算

    对于一个确定的状态序列 I=(i_1, i_2, ..., i_T),其概率是:P(I|λ) = π_{i_1} a_{i_1 i_2} a_{i_2 i_3} ... a_{i_{T-1} i_T}。

    对这样一个状态序列,产生观测序列 O=(o_1, o_2, ..., o_T) 的概率是:P(O|I,λ) = b_{i_1}(o_1) b_{i_2}(o_2) ... b_{i_T}(o_T)。

    I 和 O 的联合概率为:P(O,I|λ) = P(O|I,λ) P(I|λ) = π_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) ... a_{i_{T-1} i_T} b_{i_T}(o_T)。

    对所有可能的 I 求和,得到 P(O|λ) = Σ_I P(O,I|λ) = Σ_{i_1,...,i_T} π_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) ... a_{i_{T-1} i_T} b_{i_T}(o_T)。

    如果直接计算,时间复杂度太高,是 O(T N^T)。
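
    为了直观感受"直接计算"的含义(以及复杂度为什么是指数级的),下面给出一个按定义枚举所有状态序列的小示例。这不是原文代码,仅作演示,A、B、pi 采用后文代码中的 numpy 数组形式,obs_seq 是观测的下标序列:

    import itertools
    import numpy as np

    def brute_force_prob(obs_seq, A, B, pi):
        """按定义枚举所有状态序列 I,累加 P(O,I|λ);复杂度约 O(T*N^T),只适合很短的序列"""
        N = A.shape[0]
        total = 0.0
        for I in itertools.product(range(N), repeat=len(obs_seq)):
            p = pi[I[0]] * B[I[0], obs_seq[0]]
            for t in range(1, len(obs_seq)):
                p *= A[I[t-1], I[t]] * B[I[t], obs_seq[t]]
            total += p
        return total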

    前向算法(或者后向算法)

    首先引入前向概率

    给定模型 λ,定义到时刻 t 为止部分观测序列为 o_1, o_2, ..., o_t 且状态为 q_i 的概率为前向概率,记作:

    α_t(i) = P(o_1, o_2, ..., o_t, i_t = q_i | λ)

    用感冒例子描述就是:某一天是否感冒以及这天和这天之前所有的观测症状的联合概率。

    后向概率定义类似。

    助记图片

    前向算法

    输入:隐马模型 λ,观测序列 O;输出:观测序列概率 P(O|λ)。

      1. 初值(t=1):α_1(i) = P(o_1, i_1 = q_i | λ) = π_i b_i(o_1),i = 1,2,...,N
      2. 递推:对 t = 1,2,...,T-1,α_{t+1}(i) = [Σ_{j=1}^N α_t(j) a_{ji}] b_i(o_{t+1})
      3. 终结:P(O|λ) = Σ_{i=1}^N α_T(i)

    前向算法理解:

    前向算法使用前向概率的概念,记录每个时刻的前向概率,使得在递推计算下一个时刻的前向概率时,只需要上一个时刻的所有前向概率即可,本质上是用空间换时间。这样的时间复杂度是 O(N²T)。

    前向算法/后向算法python实现:

    In [8]:

    def forward(obs_seq):
        """前向算法"""
        N = A.shape[0]
        T = len(obs_seq)
        
        # F保存前向概率矩阵
        F = np.zeros((N,T))
        F[:,0] = pi * B[:, obs_seq[0]]
    
        for t in range(1, T):
            for n in range(N):
                F[n,t] = np.dot(F[:,t-1], (A[:,n])) * B[n, obs_seq[t]]
    
        return F
    
    def backward(obs_seq):
        """后向算法"""
        N = A.shape[0]
        T = len(obs_seq)
        # X保存后向概率矩阵
        X = np.zeros((N,T))
        X[:,-1:] = 1
    
        for t in reversed(range(T-1)):
            for n in range(N):
                X[n,t] = np.sum(X[:,t+1] * A[n,:] * B[:, obs_seq[t+1]])
    
        return X
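
    这里补一个简单的校验示例(非原文内容,假设前面定义 A、B、pi 的单元格已经运行):前向算法和后向算法算出来的 P(O|λ) 应当一致,可以用来互相验证实现是否正确。

    obs = [observations_label2id[o] for o in ('normal', 'cold', 'dizzy')]
    F = forward(obs)
    X = backward(obs)
    p_forward = np.sum(F[:, -1])                      # P(O|λ) = Σ_i α_T(i)
    p_backward = np.sum(pi * B[:, obs[0]] * X[:, 0])  # P(O|λ) = Σ_i π_i b_i(o_1) β_1(i)
    print(p_forward, p_backward)                      # 两个数值应在浮点误差内相等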
    

    2.2学习问题

    学习问题我们这里只关注非监督的学习算法,有监督的学习算法在有标注数据的前提下,使用极大似然估计法可以很方便地估计模型参数。

    非监督的情况,也就是我们只有一堆观测数据,对应到感冒预测的例子,即,我们只知道病人之前的几天是什么感受,但是不知道他之前是否被确认为感冒。

    在这种情况下,我们可以使用EM算法,将状态变量视作隐变量。使用EM算法学习HMM参数的算法称为Baum-Welch算法。

    模型表达式:

     

    P(O|λ) = Σ_I P(O|I,λ) P(I|λ)

    Baum-Welch算法

    (1). 确定完全数据的对数似然函数

    完全数据是 (O, I) = (o_1, o_2, ..., o_T, i_1, ..., i_T)。

    完全数据的对数似然函数是:log P(O,I|λ)。

    (2). EM算法的E步:

     

    Q(λ, λ^) = Σ_I log P(O,I|λ) P(O,I|λ^)

    注意,这里忽略了对于 λ 而言是常数因子的 1/P(O|λ^)。

    其中,λ^ 是隐马尔科夫模型参数的当前估计值,λ 是要极大化的隐马尔科夫模型参数。

    又有:

    P(O,I|λ) = π_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) ... a_{i_{T-1} i_T} b_{i_T}(o_T)

    于是 Q(λ, λ^) 可以写成:

    Q(λ, λ^) = Σ_I log π_{i_1} P(O,I|λ^) + Σ_I (Σ_{t=1}^{T-1} log a_{i_t i_{t+1}}) P(O,I|λ^) + Σ_I (Σ_{t=1}^{T} log b_{i_t}(o_t)) P(O,I|λ^)

    (3). EM算法的M步:

    极大化 Q 函数 Q(λ, λ^),求模型参数 A、B、π。

    应用拉格朗日乘子法对各参数求偏导,解得 Baum-Welch 模型参数估计公式:

    • a_ij = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)
    • b_j(k) = Σ_{t=1, o_t=v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
    • π_i = γ_1(i)

    其中 γ_t(i) 和 ξ_t(i,j) 分别是:

    γ_t(i) = P(i_t = q_i | O, λ) = P(i_t = q_i, O | λ) / P(O|λ) = α_t(i) β_t(i) / Σ_{j=1}^N α_t(j) β_t(j)

    读作 gamma,即,给定模型参数和所有观测,时刻 t 处于状态 q_i 的概率。

    ξ_t(i,j) = P(i_t = q_i, i_{t+1} = q_j | O, λ) = P(i_t = q_i, i_{t+1} = q_j, O | λ) / P(O|λ) = P(i_t = q_i, i_{t+1} = q_j, O | λ) / Σ_{i=1}^N Σ_{j=1}^N P(i_t = q_i, i_{t+1} = q_j, O | λ)

    读作 xi,即,给定模型参数和所有观测,时刻 t 处于状态 q_i 且时刻 t+1 处于状态 q_j 的概率。

    代入 P(i_t = q_i, i_{t+1} = q_j, O | λ) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j),

    得到:ξ_t(i,j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j)

    Baum-Welch算法的python实现:

    In [9]:

    def baum_welch_train(observations, A, B, pi, criterion=0.05):
        """无监督学习算法——Baum-Weich算法"""
        n_states = A.shape[0]
        n_samples = len(observations)
    
        done = False
        while not done:
            # alpha_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | hmm)
            # Initialize alpha
            alpha = forward(observations)
    
            # beta_t(i) = P(O_t+1 O_t+2 ... O_T | q_t = S_i , hmm)
            # Initialize beta
            beta = backward(observations)
            # ξ_t(i,j)=P(i_t=q_i,i_{i+1}=q_j|O,λ)
            xi = np.zeros((n_states,n_states,n_samples-1))
            for t in range(n_samples-1):
                denom = np.dot(np.dot(alpha[:,t].T, A) * B[:,observations[t+1]].T, beta[:,t+1])
                for i in range(n_states):
                    numer = alpha[i,t] * A[i,:] * B[:,observations[t+1]].T * beta[:,t+1].T
                    xi[i,:,t] = numer / denom
    
            # γ_t(i):gamma_t(i) = P(q_t = S_i | O, hmm)
            gamma = np.sum(xi,axis=1)
            # Need final gamma element for new B
            # xi的第三维长度n_samples-1,少一个,所以gamma要计算最后一个
            prod =  (alpha[:,n_samples-1] * beta[:,n_samples-1]).reshape((-1,1))
            gamma = np.hstack((gamma,  prod / np.sum(prod))) #append one more to gamma!!!
            
            # 更新模型参数
            newpi = gamma[:,0]
            newA = np.sum(xi,2) / np.sum(gamma[:,:-1],axis=1).reshape((-1,1))
            newB = np.copy(B)
            num_levels = B.shape[1]
            sumgamma = np.sum(gamma,axis=1)
            for lev in range(num_levels):
                mask = observations == lev
                newB[:,lev] = np.sum(gamma[:,mask],axis=1) / sumgamma
            
            # 检查是否满足阈值
            if np.max(abs(pi - newpi)) < criterion and \
                            np.max(abs(A - newA)) < criterion and \
                            np.max(abs(B - newB)) < criterion:
                done = 1
            A[:], B[:], pi[:] = newA, newB, newpi
        return newA, newB, newpi
    

    回到预测感冒的问题,下面我们先自己建立一个HMM模型,再模拟出一个观测序列和一个状态序列。

    然后,只用观测序列去学习模型,获得模型参数。

    In [10]:

    A = np.array([[0.5, 0.5],[0.5, 0.5]])
    B = np.array([[0.3, 0.3, 0.3],[0.3, 0.3, 0.3]])
    pi = np.array([0.5, 0.5])
    
    observations_data, states_data = simulate(100)
    newA, newB, newpi = baum_welch_train(observations_data, A, B, pi)
    print("newA: ", newA)
    print("newB: ", newB)
    print("newpi: ", newpi)
    
    newA:  [[ 0.5  0.5]
     [ 0.5  0.5]]
    newB:  [[ 0.28  0.32  0.4 ]
     [ 0.28  0.32  0.4 ]]
    newpi:  [ 0.5  0.5]
    

    2.3预测问题

    考虑到预测问题是求给定观测序列下条件概率 P(I|O) 最大的状态序列 I=(i_1,i_2,...,i_T),可以把这个问题类比成最短路问题:

    我们可以把求 P(I|O) 的最大值类比成求节点间距离的最小值,于是考虑类似于动态规划的 viterbi 算法。

    首先引入两个变量 δ 和 ψ。

    定义在时刻 t 状态为 i 的所有单个路径 (i_1, i_2, i_3, ..., i_t) 中概率的最大值为(这里考虑 P(I,O) 便于计算,因为 P(O) 给定时,P(I|O) 正比于 P(I,O)):

     

    δ_t(i) = max_{i_1,i_2,...,i_{t-1}} P(i_t = i, i_{t-1}, ..., i_1, o_t, o_{t-1}, ..., o_1 | λ)

    读作 delta,其中,i = 1,2,...,N。

    得到其递推公式:

    δ_t(i) = max_{1≤j≤N} [δ_{t-1}(j) a_{ji}] b_i(o_t)

    定义在时刻 t 状态为 i 的所有单个路径 (i_1, i_2, i_3, ..., i_{t-1}, i) 中概率最大的路径的第 t-1 个结点为:

    ψ_t(i) = argmax_{1≤j≤N} [δ_{t-1}(j) a_{ji}]

    读作 psi,其中,i = 1,2,...,N。

    下面介绍维特比算法。

    维特比(viterbi)算法(动态规划):

    输入:模型 λ=(A,B,π) 和观测 O=(o_1,o_2,...,o_T);

    输出:最优路径 I* = (i*_1, i*_2, ..., i*_T)。

    (1).初始化:

    δ_1(i) = π_i b_i(o_1)

    ψ_1(i) = 0

    (2).递推。对 t = 2,3,...,T:

    δ_t(i) = max_{1≤j≤N} [δ_{t-1}(j) a_{ji}] b_i(o_t)

    ψ_t(i) = argmax_{1≤j≤N} [δ_{t-1}(j) a_{ji}]

    (3).终止:

    P* = max_{1≤i≤N} δ_T(i)

    i*_T = argmax_{1≤i≤N} δ_T(i)

    (4).最优路径回溯,对 t = T-1, T-2, ..., 1:

    i*_t = ψ_{t+1}(i*_{t+1})

    求得最优路径 I* = (i*_1, i*_2, ..., i*_T)。

    注:上面的 b_i(o_t) 和 ψ_{t+1}(i*_{t+1}) 的括号,并不是函数,而是类似于数组取下标的操作。

    viterbi算法python实现(V对应δ,prev对应ψ):

    In [11]:

    def viterbi(obs_seq, A, B, pi):
        """
        Returns
        -------
        V : numpy.ndarray
            V [s][t] = Maximum probability of an observation sequence ending
                       at time 't' with final state 's'
        prev : numpy.ndarray
            Contains a pointer to the previous state at t-1 that maximizes
            V[state][t]
            
        V对应δ,prev对应ψ
        """
        N = A.shape[0]
        T = len(obs_seq)
        prev = np.zeros((T - 1, N), dtype=int)
    
        # DP matrix containing max likelihood of state at a given time
        V = np.zeros((N, T))
        V[:,0] = pi * B[:,obs_seq[0]]
    
        for t in range(1, T):
            for n in range(N):
                seq_probs = V[:,t-1] * A[:,n] * B[n, obs_seq[t]]
                prev[t-1,n] = np.argmax(seq_probs)
                V[n,t] = np.max(seq_probs)
    
        return V, prev
    
    def build_viterbi_path(prev, last_state):
        """Returns a state path ending in last_state in reverse order.
        最优路径回溯
        """
        T = len(prev)
        yield(last_state)
        for i in range(T-1, -1, -1):
            yield(prev[i, last_state])
            last_state = prev[i, last_state]
            
    def observation_prob(obs_seq):
        """ P( entire observation sequence | A, B, pi ) """
        return np.sum(forward(obs_seq)[:,-1])
    
    def state_path(obs_seq, A, B, pi):
        """
        Returns
        -------
        V[last_state, -1] : float
            Probability of the optimal state path
        path : list(int)
            Optimal state path for the observation sequence
        """
        V, prev = viterbi(obs_seq, A, B, pi)
        # Build state path with greatest probability
        last_state = np.argmax(V[:,-1])
        path = list(build_viterbi_path(prev, last_state))
    
        return V[last_state,-1], reversed(path)
    

    继续感冒预测的例子,根据刚才学得的模型参数,再去预测状态序列,观测准确率。

    In [12]:

    states_out = state_path(observations_data, newA, newB, newpi)[1]
    p = 0.0
    for s in states_data:
        if next(states_out) == s: 
            p += 1
    
    print(p / len(states_data))
    
    0.54
    

    因为是随机生成的样本,因此准确率较低也可以理解。
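
    作为对照,也可以用真实参数(而不是学到的 newA、newB、newpi)做维特比解码,再算一次同样的准确率,看看差距有多大。下面是一个简单示例(非原文内容;由于前面 In [10] 里的 A、B、pi 已被训练过程覆盖,这里用 convert_map_to_matrix / convert_map_to_vector 重新构造真实参数):

    true_A = convert_map_to_matrix(transition_probability, states_label2id, states_label2id)
    true_B = convert_map_to_matrix(emission_probability, states_label2id, observations_label2id)
    true_pi = convert_map_to_vector(start_probability, states_label2id)

    states_out_true = state_path(observations_data, true_A, true_B, true_pi)[1]
    correct = sum(1 for s, s_hat in zip(states_data, states_out_true) if s == s_hat)
    print(correct / len(states_data))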

    使用Viterbi算法计算病人的病情以及相应的概率:

    In [13]:

    A = convert_map_to_matrix(transition_probability, states_label2id, states_label2id)
    B = convert_map_to_matrix(emission_probability, states_label2id, observations_label2id)
    observations_index = [observations_label2id[o] for o in observations]
    pi = convert_map_to_vector(start_probability, states_label2id)
    V, p = viterbi(observations_index, newA, newB, newpi)
    print(" " * 7, " ".join(("%10s" % observations_id2label[i]) for i in observations_index))
    for s in range(0, 2):
        print("%7s: " % states_id2label[s] + " ".join("%10s" % ("%f" % v) for v in V[s]))
    print('\nThe most possible states and probability are:')
    p, ss = state_path(observations_index, newA, newB, newpi)
    for s in ss:
        print(states_id2label[s])
    print(p)
    
                normal       cold      dizzy
    Healthy:   0.140000   0.022400   0.004480
      Fever:   0.140000   0.022400   0.004480
    
    The most possible states and probability are:
    Healthy
    Healthy
    Healthy
    0.00448
    

    3.完整代码

    代码主要参考Hankcs的博客,hankcs参考的是colostate大学的教学代码

    完整的隐马尔科夫用类包装的代码:

    In [15]:

    class HMM:
        """
        Order 1 Hidden Markov Model
     
        Attributes
        ----------
        A : numpy.ndarray
            State transition probability matrix
        B: numpy.ndarray
            Output emission probability matrix with shape(N, number of output types)
        pi: numpy.ndarray
            Initial state probability vector
        """
        def __init__(self, A, B, pi):
            self.A = A
            self.B = B
            self.pi = pi
        
        def simulate(self, T):
     
            def draw_from(probs):
                """
                1.np.random.multinomial:
                按照多项式分布,生成数据
                >>> np.random.multinomial(20, [1/6.]*6, size=2)
                        array([[3, 4, 3, 3, 4, 3],
                               [2, 4, 3, 4, 0, 7]])
                 For the first run, we threw 3 times 1, 4 times 2, etc.  
                 For the second, we threw 2 times 1, 4 times 2, etc.
                2.np.where:
                >>> x = np.arange(9.).reshape(3, 3)
                >>> np.where( x > 5 )
                (array([2, 2, 2]), array([0, 1, 2]))
                """
                return np.where(np.random.multinomial(1,probs) == 1)[0][0]
    
            observations = np.zeros(T, dtype=int)
            states = np.zeros(T, dtype=int)
            states[0] = draw_from(self.pi)
            observations[0] = draw_from(self.B[states[0],:])
            for t in range(1, T):
                states[t] = draw_from(self.A[states[t-1],:])
                observations[t] = draw_from(self.B[states[t],:])
            return observations,states
        
        def _forward(self, obs_seq):
            """前向算法"""
            N = self.A.shape[0]
            T = len(obs_seq)
    
            F = np.zeros((N,T))
            F[:,0] = self.pi * self.B[:, obs_seq[0]]
    
            for t in range(1, T):
                for n in range(N):
                    F[n,t] = np.dot(F[:,t-1], (self.A[:,n])) * self.B[n, obs_seq[t]]
    
            return F
        
        def _backward(self, obs_seq):
            """后向算法"""
            N = self.A.shape[0]
            T = len(obs_seq)
    
            X = np.zeros((N,T))
            X[:,-1:] = 1
    
            for t in reversed(range(T-1)):
                for n in range(N):
                    X[n,t] = np.sum(X[:,t+1] * self.A[n,:] * self.B[:, obs_seq[t+1]])
    
            return X
        
        def baum_welch_train(self, observations, criterion=0.05):
            """无监督学习算法——Baum-Weich算法"""
            n_states = self.A.shape[0]
            n_samples = len(observations)
    
            done = False
            while not done:
                # alpha_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | hmm)
                # Initialize alpha
                alpha = self._forward(observations)
    
                # beta_t(i) = P(O_t+1 O_t+2 ... O_T | q_t = S_i , hmm)
                # Initialize beta
                beta = self._backward(observations)
    
                xi = np.zeros((n_states,n_states,n_samples-1))
                for t in range(n_samples-1):
                    denom = np.dot(np.dot(alpha[:,t].T, self.A) * self.B[:,observations[t+1]].T, beta[:,t+1])
                    for i in range(n_states):
                        numer = alpha[i,t] * self.A[i,:] * self.B[:,observations[t+1]].T * beta[:,t+1].T
                        xi[i,:,t] = numer / denom
    
                # gamma_t(i) = P(q_t = S_i | O, hmm)
                gamma = np.sum(xi,axis=1)
                # Need final gamma element for new B
                prod =  (alpha[:,n_samples-1] * beta[:,n_samples-1]).reshape((-1,1))
                gamma = np.hstack((gamma,  prod / np.sum(prod))) #append one more to gamma!!!
    
                newpi = gamma[:,0]
                newA = np.sum(xi,2) / np.sum(gamma[:,:-1],axis=1).reshape((-1,1))
                newB = np.copy(self.B)
    
                num_levels = self.B.shape[1]
                sumgamma = np.sum(gamma,axis=1)
                for lev in range(num_levels):
                    mask = observations == lev
                    newB[:,lev] = np.sum(gamma[:,mask],axis=1) / sumgamma
    
                if np.max(abs(self.pi - newpi)) < criterion and \
                                np.max(abs(self.A - newA)) < criterion and \
                                np.max(abs(self.B - newB)) < criterion:
                    done = 1
    
                self.A[:],self.B[:],self.pi[:] = newA,newB,newpi
        
        def observation_prob(self, obs_seq):
            """ P( entire observation sequence | A, B, pi ) """
            return np.sum(self._forward(obs_seq)[:,-1])
    
        def state_path(self, obs_seq):
            """
            Returns
            -------
            V[last_state, -1] : float
                Probability of the optimal state path
            path : list(int)
                Optimal state path for the observation sequence
            """
            V, prev = self.viterbi(obs_seq)
    
            # Build state path with greatest probability
            last_state = np.argmax(V[:,-1])
            path = list(self.build_viterbi_path(prev, last_state))
    
            return V[last_state,-1], reversed(path)
    
        def viterbi(self, obs_seq):
            """
            Returns
            -------
            V : numpy.ndarray
                V [s][t] = Maximum probability of an observation sequence ending
                           at time 't' with final state 's'
            prev : numpy.ndarray
                Contains a pointer to the previous state at t-1 that maximizes
                V[state][t]
            """
            N = self.A.shape[0]
            T = len(obs_seq)
            prev = np.zeros((T - 1, N), dtype=int)
    
            # DP matrix containing max likelihood of state at a given time
            V = np.zeros((N, T))
            V[:,0] = self.pi * self.B[:,obs_seq[0]]
    
            for t in range(1, T):
                for n in range(N):
                    seq_probs = V[:,t-1] * self.A[:,n] * self.B[n, obs_seq[t]]
                    prev[t-1,n] = np.argmax(seq_probs)
                    V[n,t] = np.max(seq_probs)
    
            return V, prev
    
        def build_viterbi_path(self, prev, last_state):
            """Returns a state path ending in last_state in reverse order."""
            T = len(prev)
            yield(last_state)
            for i in range(T-1, -1, -1):
                yield(prev[i, last_state])
                last_state = prev[i, last_state]
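
    下面给出这个类的一个简单使用示例(非原文内容,仅作演示,沿用前文感冒例子的参数和 states_id2label 等映射):

    A = convert_map_to_matrix(transition_probability, states_label2id, states_label2id)
    B = convert_map_to_matrix(emission_probability, states_label2id, observations_label2id)
    pi = convert_map_to_vector(start_probability, states_label2id)

    h = HMM(A, B, pi)
    obs_seq, state_seq = h.simulate(50)            # 用真实参数生成一段观测/状态序列

    h2 = HMM(np.array([[0.5, 0.5], [0.5, 0.5]]),   # 随便给一组初始参数
             np.ones((2, 3)) / 3,
             np.array([0.5, 0.5]))
    h2.baum_welch_train(obs_seq)                   # 只用观测序列训练
    print(h2.A, h2.B, h2.pi)

    prob, path = h.state_path(obs_seq)             # 用真实参数做维特比解码
    print(prob, [states_id2label[s] for s in path])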
    Python实现HMM(隐马尔可夫模型)


    前几天用MATLAB实现了HMM的代码,这次用python写了一遍,依据仍然是李航博士的《统计学习方法》

    由于第一次用python,所以代码可能会有许多缺陷,但是所有代码都用书中的例题进行了测试,结果正确。

    这里想说一下 python:在编写 HMM 的过程中参看了之前写的 MATLAB 程序,发现两者有很多相似的地方,都用到了 numpy 库。

    在写 python 代码的过程中最让我头疼的是数组下标:和 MATLAB 矩阵下标从 1 开始不同,numpy 数组下标都是从 0 开始,而且数组的维数也需要谨慎,一不小心就会出现 too many indices for array 的错误。

    程序最后是维特比算法,运行过程中出现了 __main__:190: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future 的警告,原因是回溯路径时用浮点数组的元素做了下标;把下标转成整数(见下面 viterbi 函数中的 int() 处理)即可消除,这不影响计算结果。下面贴出我的代码。

    # -*- coding: utf-8 -*-
    """
    Created on Thu Feb 16 19:28:39 2017
    2017-4-2
        ForwardBackwardAlg函数功能:实现前向算法
        理论依据:李航《统计学习方法》
    2017-4-5
        修改了ForwardBackwardAlg函数名称为ForwardAlgo以及输出的alpha数组形式
        完成了BackwardAlgo函数功能:后向算法
        以及函数FBAlgoAppli:计算在观测序列和模型参数确定的情况下,
        某一个隐含状态对应相应的观测状态的概率
    2017-4-6
        完成BaumWelchAlgo函数一次迭代
    2017-4-7
        实现维特比算法
    @author: sgp
    """
    
    import numpy as np
    
    #输入格式如下:
    #A = np.array([[.5,.2,.3],[.3,.5,.2],[.2,.3,.5]])
    #B = np.array([[.5,.5],[.4,.6],[.7,.3]])
    #Pi = np.array([[.2,.4,.4]])
    #O = np.array([[1,2,1]])
    
    #应用ndarray在数组之间进行相互运算时,一定要确保数组维数相同!
    #比如:
    #In[93]:m = np.array([1,2,3,4])
    #In[94]:m
    #Out[94]: array([1, 2, 3, 4])
    #In[95]:m.shape
    #Out[95]: (4,)
    #这里表示的是一维数组
    #In[96]:m = np.array([[1,2,3,4]])
    #In[97]:m
    #Out[97]: array([[1, 2, 3, 4]])
    #In[98]:m.shape
    #Out[98]: (1, 4)
    #而这里表示的就是二维数组
    #注意In[93]和In[96]的区别,多一对中括号!!
    
    #N = A.shape[0]为数组A的行数, H = O.shape[1]为数组O的列数
    #在下列各函数中,alpha数组和beta数组均为N*H二维数组,也就是横向坐标是时间,纵向是状态
    
    def ForwardAlgo(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        M = A.shape[1]#数组A的列数
        H = O.shape[1]#数组O的列数
    
        sum_alpha_1 = np.zeros((M,N))
        alpha = np.zeros((N,H))
        r = np.zeros((1,N))
        alpha_1 = np.multiply(Pi[0,:], B[:,O[0,0]-1])
        
        alpha[:,0] = np.array(alpha_1).reshape(1,N)#alpha_1是一维数组,在使用np.multiply的时候需要升级到二维数组。#错误是IndexError: too many indices for array
        for h in range(1,H):
            for i in range(N):
                for j in range(M):
                    sum_alpha_1[i,j] = alpha[j,h-1] * A[j,i]
                r = sum_alpha_1.sum(1).reshape(1,N)#同理,将数组升级为二维数组
                alpha[i,h] = r[0,i] * B[i,O[0,h]-1]
        #print("alpha矩阵: \n %r" % alpha)    
        p = alpha.sum(0).reshape(1,H)
        P = p[0,H-1]
        #print("观测概率: \n %r" % P)
        #return alpha
        return alpha, P
        
    def BackwardAlgo(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        M = A.shape[1]#数组A的列数
        H = O.shape[1]#数组O的列数
        
        #beta = np.zeros((N,H))
        sum_beta = np.zeros((1,N))
        beta = np.zeros((N,H))
        beta[:,H-1] = 1
        p_beta = np.zeros((1,N))
        
        for h in range(H-1,0,-1):
            for i in range(N):
                for j in range(M):
                    sum_beta[0,j] = A[i,j] * B[j,O[0,h]-1] * beta[j,h]
                beta[i,h-1] = sum_beta.sum(1)
        #print("beta矩阵: \n %r" % beta)
        for i in range(N):
            p_beta[0,i] = Pi[0,i] * B[i,O[0,0]-1] * beta[i,0]
        p = p_beta.sum(1).reshape(1,1)
        #print("观测概率: \n %r" % p[0,0])
        return beta, p[0,0]
      
    def FBAlgoAppli(A,B,Pi,O,I):
        #计算在观测序列和模型参数确定的情况下,某一个隐含状态对应相应的观测状态的概率
        #例题参考李航《统计学习方法》P189习题10.2
        #输入格式:
        #I为二维数组,存放所求概率P(it = qi,O|lambda)中it和qi的角标t和i,即P=[t,i]
        alpha,p1 = ForwardAlgo(A,B,Pi,O)
        beta,p2 = BackwardAlgo(A,B,Pi,O)
        p = alpha[I[0,1]-1,I[0,0]-1] * beta[I[0,1]-1,I[0,0]-1] / p1
        return p
        
    def GetGamma(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        H = O.shape[1]#数组O的列数
        Gamma = np.zeros((N,H))
        alpha,p1 = ForwardAlgo(A,B,Pi,O)
        beta,p2 = BackwardAlgo(A,B,Pi,O)
        for h in range(H):
            for i in range(N):
                Gamma[i,h] = alpha[i,h] * beta[i,h] / p1
        return Gamma
        
    def GetXi(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        M = A.shape[1]#数组A的列数
        H = O.shape[1]#数组O的列数
        Xi = np.zeros((H-1,N,M))
        alpha,p1 = ForwardAlgo(A,B,Pi,O)
        beta,p2 = BackwardAlgo(A,B,Pi,O)
        for h in range(H-1):
            for i in range(N):
                for j in range(M):
                    Xi[h,i,j] = alpha[i,h] * A[i,j] * B[j,O[0,h+1]-1] * beta[j,h+1] / p1
        #print("Xi矩阵: \n %r" % Xi)
        return Xi
        
    def BaumWelchAlgo(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        M = A.shape[1]#数组A的列数
        Y = B.shape[1]#数组B的列数
        H = O.shape[1]#数组O的列数
        c = 0
        Gamma = GetGamma(A,B,Pi,O)
        Xi = GetXi(A,B,Pi,O)
        Xi_1 = Xi.sum(0)
        a = np.zeros((N,M))
        b = np.zeros((M,Y))
        pi = np.zeros((1,N))
        a_1 = np.subtract(Gamma.sum(1),Gamma[:,H-1]).reshape(1,N)
        for i in range(N):
            for j in range(M):
                a[i,j] = Xi_1[i,j] / a_1[0,i]
        #print(a)
        for y in range(Y):
            for j in range(M):
                for h in range(H):
                    if O[0,h]-1 == y:
                        c = c + Gamma[j,h]
                gamma = Gamma.sum(1).reshape(1,N)
                b[j,y] = c / gamma[0,j]
                c = 0
        #print(b)
        for i in range(N):
            pi[0,i] = Gamma[i,0]
        #print(pi)
        return a,b,pi
        
    def BaumWelchAlgo_n(A,B,Pi,O,n):#计算迭代次数为n的BaumWelch算法
        for i in range(n):
            A,B,Pi = BaumWelchAlgo(A,B,Pi,O)
        return A,B,Pi
        
    def viterbi(A,B,Pi,O):
        N = A.shape[0]#数组A的行数
        M = A.shape[1]#数组A的列数
        H = O.shape[1]#数组O的列数
        Delta = np.zeros((M,H))
        Psi = np.zeros((M,H))
        Delta_1 = np.zeros((N,1))
        I = np.zeros((1,H))
        
        for i in range(N):
            Delta[i,0] = Pi[0,i] * B[i,O[0,0]-1]
            
        for h in range(1,H):
            for j in range(M):
                for i in range(N):
                    Delta_1[i,0] = Delta[i,h-1] * A[i,j] * B[j,O[0,h]-1]
                Delta[j,h] = np.amax(Delta_1)
                Psi[j,h] = np.argmax(Delta_1)+1
        print("Delta矩阵: \n %r" % Delta)
        print("Psi矩阵: \n %r" % Psi)
        P_best = np.amax(Delta[:,H-1])
        psi = np.argmax(Delta[:,H-1])
        I[0,H-1] = psi + 1
        for h in range(H-1,0,-1):
            I[0,h-1] = Psi[int(I[0,h])-1,h]  # 下标转成整数,避免浮点下标的告警/报错
        print("最优路径概率: \n %r" % P_best)
        print("最优路径: \n %r" % I)

    其实代码就是把书中的公式直接翻译过来,李航博士的书中有比较详细的推理过程,或者去找一些专业的论文文献进一步了解,这里仅仅实现了最简单的应用,其应用实例如下:
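
    把上面注释里给出的输入格式整理成一段可以直接运行的调用示例(数据即书中例10.2,仅作演示):

    import numpy as np

    A = np.array([[.5, .2, .3], [.3, .5, .2], [.2, .3, .5]])
    B = np.array([[.5, .5], [.4, .6], [.7, .3]])
    Pi = np.array([[.2, .4, .4]])
    O = np.array([[1, 2, 1]])                  # 观测用从1开始的编号,函数内部会减1作为下标

    alpha, p = ForwardAlgo(A, B, Pi, O)        # 前向算法,p 为观测序列概率
    beta, p2 = BackwardAlgo(A, B, Pi, O)       # 后向算法
    a, b, pi = BaumWelchAlgo_n(A, B, Pi, O, 5) # 迭代5次的 Baum-Welch
    viterbi(A, B, Pi, O)                       # 维特比解码,直接打印结果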

    输入数据格式:

    In[117]:A
    Out[117]: 
    array([[ 0.5,  0.2,  0.3],
           [ 0.3,  0.5,  0.2],
           [ 0.2,  0.3,  0.5]])

    In[118]:B
    Out[118]: 
    array([[ 0.5,  0.5],
           [ 0.4,  0.6],
           [ 0.7,  0.3]])

    In[119]:Pi
    Out[119]: array([[ 0.2,  0.4,  0.4]])

    In[120]:O
    Out[120]: array([[1, 2, 1]])


     
    

    输出结果为:

    In[101]:alpha,p = ForwardAlgo(A,B,Pi,O)

    In[102]:alpha
    Out[102]: 
    array([[ 0.1     ,  0.077   ,  0.04187 ],
           [ 0.16    ,  0.1104  ,  0.035512],
           [ 0.28    ,  0.0606  ,  0.052836]])

    In[103]:p
    Out[103]: 0.130218

    In[104]:beta,p1 = BackwardAlgo(A,B,Pi,O)

    In[105]:beta
    Out[105]: 
    array([[ 0.2451,  0.54  ,  1.    ],
           [ 0.2622,  0.49  ,  1.    ],
           [ 0.2277,  0.57  ,  1.    ]])

    In[106]:p1
    Out[106]: 0.130218

    In[107]:gamma = GetGamma(A,B,Pi,O)

    In[108]:gamma
    Out[108]: 
    array([[ 0.18822283,  0.31931069,  0.32153773],
           [ 0.32216744,  0.41542644,  0.27271191],
           [ 0.48960973,  0.26526287,  0.40575036]])

    In[109]:xi = GetXi(A,B,Pi,O)

    In[110]:xi
    Out[110]: 
    array([[[ 0.1036723 ,  0.04515505,  0.03939548],
            [ 0.09952541,  0.18062019,  0.04202184],
            [ 0.11611298,  0.1896512 ,  0.18384555]],


           [[ 0.14782903,  0.04730529,  0.12417638],
            [ 0.12717136,  0.16956181,  0.11869327],
            [ 0.04653735,  0.05584481,  0.16288071]]])

    In[111]:a,b,pi = BaumWelchAlgo_n(A,B,Pi,O,5)

    In[112]:a
    Out[112]: 
    array([[ 0.43972438,  0.15395857,  0.40631705],
           [ 0.309058  ,  0.45055446,  0.24038754],
           [ 0.3757005 ,  0.50361975,  0.12067975]])

    In[113]:b
    Out[113]: 
    array([[ 0.50277235,  0.49722765],
           [ 0.49524289,  0.50475711],
           [ 0.91925551,  0.08074449]])

    In[114]:pi
    Out[114]: array([[ 0.08435301,  0.18040718,  0.73523981]])

    In[115]:viterbi(A,B,Pi,O)
    Delta矩阵: 
     array([[ 0.1    ,  0.028  ,  0.00756],
           [ 0.16   ,  0.0504 ,  0.01008],
           [ 0.28   ,  0.042  ,  0.0147 ]])
    Psi矩阵: 
     array([[ 0.,  3.,  2.],
           [ 0.,  3.,  2.],
           [ 0.,  3.,  3.]])
    最优路径概率: 
     0.014699999999999998
    最优路径: 
     array([[ 3.,  3.,  3.]])



    HMM

    隐马尔可夫模型(hidden Markov model, HMM)是可用于标注问题的统计学模型,是生成模型。

    本章节内容参考李航博士的《统计学习方法》
    本章节添加了一些结论性结果的推导过程。

    1. 从一个自然语言处理的例子开始

    例如有三个句子:
    句子一:我/名词 看见/动词 猫/名词
    句子二:猫/名词 是/动词 可爱的/形容词
    句子三:我/名词 是/动词 可爱的/形容词
    一般只能观察到具体的词,所以像"我 看见 猫 …"是观测序列,而词性序列如"名词 动词 形容词 …"是状态序列。

    $Q$ 是所有可能的状态集合,$V$ 是所有可能的观测集合:

    $$Q = \{q_1, q_2, ..., q_N\}, \quad V=\{v_1, v_2, ..., v_M\}$$

    其中, N是可能的状态数,M是可能的观测数。

    例如:$Q=\{名词,动词,形容词\}, V=\{我, 看见, 猫, 是, 可爱的\}, N=3, M=5$

    $I$ 是长度为 $T$ 的状态序列,$O$ 是对应的观测序列:

    $$I = (i_1, i_2,..., i_T), \quad O=(o_1, o_2,..., o_T)$$

    例如:$I=(名词,动词,名词), O=(我,看见,猫)$

    $A$ 是状态转移矩阵:

    $$A=[a_{ij}]_{N \times N} \tag{1}$$

    其中,

    $$a_{ij} = p(i_{t+1}=q_j|i_t=q_i), \quad i=1,2,...,N; \ j=1,2,...,N \tag{2}$$

    例如:

    | 状态转移概率 | 名词 | 动词 | 形容词 |
    | --- | --- | --- | --- |
    | 名词 | 0 | 1 | 0 |
    | 动词 | 1/3 | 0 | 2/3 |
    | 形容词 | 1/3 | 1/3 | 1/3 |

    $B$ 是观测概率矩阵,也就是发射矩阵:

    $$B=[b_j(k)]_{N \times M} \tag{3}$$

    其中,

    $$b_j(k) = p(o_t=v_k|i_t=q_j), \quad k=1,2,...,M; \ j=1,2,...,N \tag{4}$$

    例如:

    | 观测概率矩阵 | 我 | 看见 | 猫 | 是 | 可爱的 |
    | --- | --- | --- | --- | --- | --- |
    | 名词 | 1 | 0 | 1 | 0 | 0 |
    | 动词 | 0 | 1 | 0 | 1 | 0 |
    | 形容词 | 0 | 0 | 0 | 0 | 1 |

    $\pi$ 是初始状态概率向量:

    $$\pi = (\pi_i) \tag{5}$$

    其中,

    $$\pi_i = p(i_1 = q_i), \quad i = 1,2,...,N \tag{6}$$

    $A$、$B$ 和 $\pi$ 是HMM的参数,用 $\lambda$ 表示:

    $$\lambda = (A,B,\pi) \tag{7}$$

    例如:

    | 名词 | 动词 | 形容词 |
    | --- | --- | --- |
    | 1 | 0 | 0 |

    隐马尔可夫的三个基本问题:
    1.概率计算问题。给定模型 $\lambda=(A,B,\pi)$ 和观测序列 $O=(o_1,o_2,...,o_T)$,计算在已知模型参数的情况下观测序列出现的概率,即 $p(O|\lambda)$。
    2.学习问题。已知观测序列 $O=(o_1,o_2,...,o_T)$,估计模型参数 $\lambda=(A,B,\pi)$,使 $p(O|\lambda)$ 最大。
    3.预测问题,也称解码问题。已知模型 $\lambda=(A,B,\pi)$ 和 $O=(o_1,o_2,...,o_T)$,求使条件概率 $p(I|O)$ 最大的状态序列 $I=(i_1,i_2,...,i_T)$。

    2. 概率计算问题

    概率计算问题若用直接计算法,计算复杂度很高,可以采用动态规划形式的前向和后向算法来降低计算复杂度。
    为了表示方便,记:

    $$o_{1:t} = (o_1,o_2,...,o_t); \quad o_{t:T}=(o_t,o_{t+1},...,o_T)$$

    2.1 前向算法

    接下来求前向概率 $p(i_t,o_{1:t}|\lambda)$:

    $$\begin{aligned} p(i_t,o_{1:t}|\lambda) &=\sum_{i_{t-1}} p(i_{t-1},i_t,o_{1:t-1},o_t|\lambda) \\ &=\sum_{i_{t-1}} p(o_t|i_{t-1},i_t,o_{1:t-1},\lambda)p(i_t|i_{t-1},o_{1:t-1},\lambda)p(i_{t-1},o_{1:t-1}|\lambda) \end{aligned}$$

    由隐马尔科夫的条件独立性假设可得:

    $$p(o_t|i_{t-1},i_t,o_{1:t-1},\lambda) = p(o_t|i_t,\lambda)$$

    $$p(i_t|i_{t-1},o_{1:t-1},\lambda)=p(i_t|i_{t-1},\lambda)$$

    于是:

    $$p(i_t,o_{1:t}|\lambda)=\sum_{i_{t-1}} p(o_t|i_t,\lambda) p(i_t|i_{t-1},\lambda)p(i_{t-1},o_{1:t-1}|\lambda)=\Big[\sum_{i_{t-1}} p(i_{t-1},o_{1:t-1}|\lambda) p(i_t|i_{t-1},\lambda)\Big] p(o_t|i_t,\lambda)$$

    设:

    $$\alpha_{t+1}(i) = p(o_{1:t+1},i_{t+1}=q_i|\lambda) \tag{8}$$

    且:

    $$p(i_{t+1}=q_i|i_t=q_j,\lambda) = a_{ji}, \qquad p(o_{t+1}|i_{t+1}=q_i,\lambda)=b_i(o_{t+1})$$

    则:

    $$\alpha_{t+1}(i)=\Big[\sum_{j=1}^N \alpha_t(j)a_{ji}\Big]b_i(o_{t+1}) \tag{9}$$

    所以前向算法就可以迭代进行。

    前向算法:
    1.初值

    $$\alpha_1(i) = \pi_i b_i(o_1)$$

    2.递推:对 $t=1,2,...,T-1$,

    $$\alpha_{t+1}(i)=\Big[\sum_{j=1}^N \alpha_t(j)a_{ji}\Big]b_i(o_{t+1})$$

    3.终止

    $$p(O|\lambda) = \sum_{i=1}^N \alpha_T(i)$$
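
    按照上面的递推式,可以写出一个很短的前向算法实现(非原文代码,仅作演示;O 为观测下标序列,从 0 开始编号,记号与上文一致):

    import numpy as np

    def forward(A, B, pi, O):
        """前向算法:返回 alpha,形状 (N, T),以及 p(O|λ)"""
        N, T = A.shape[0], len(O)
        alpha = np.zeros((N, T))
        alpha[:, 0] = pi * B[:, O[0]]                       # 初值:α_1(i) = π_i b_i(o_1)
        for t in range(1, T):
            alpha[:, t] = (alpha[:, t-1] @ A) * B[:, O[t]]  # 递推:先对 j 求和再乘发射概率
        return alpha, alpha[:, -1].sum()                    # 终止:p(O|λ) = Σ_i α_T(i)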

    2.2 后向算法

    后向算法处理的是后向概率 $p(o_{t+1:T}|i_t, \lambda)$:

    $$\begin{aligned} p(o_{t+1:T}|i_t, \lambda) &= \sum_{i_{t+1}} p(i_{t+1},o_{t+1},o_{t+2:T} | i_t, \lambda) \\ &= \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda) p(o_{t+1}|i_{t+1}, i_t, \lambda) p(i_{t+1}|i_t,\lambda) \end{aligned}$$

    由隐马尔科夫的条件独立假设得:

    $$p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda)=p(o_{t+2:T}|i_{t+1}, \lambda)$$

    $$p(o_{t+1}|i_{t+1}, i_t, \lambda) = p(o_{t+1}|i_{t+1}, \lambda)$$

    设:

    $$\beta_t(i) = p(o_{t+1:T}|i_t=q_i, \lambda) \tag{10}$$

    又:

    $$p(i_{t+1}=q_j|i_t=q_i,\lambda) = a_{ij}, \qquad p(o_{t+1}|i_{t+1}=q_j, \lambda) = b_j(o_{t+1})$$

    则:

    $$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j) \tag{11}$$

    后向算法:
    (1)

    $$\beta_T (i) = 1$$

    (2)对 $t=T-1,T-2,...,1$:

    $$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$$

    (3)

    $$p(O|\lambda) = \sum_{i=1}^N \pi_i b_i(o_1) \beta_1(i)$$
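
    对应的后向算法实现如下(同样只是一个演示性的草稿,沿用上面示例里的 numpy 导入):

    def backward(A, B, pi, O):
        """后向算法:返回 beta,形状 (N, T),以及 p(O|λ)"""
        N, T = A.shape[0], len(O)
        beta = np.zeros((N, T))
        beta[:, -1] = 1.0                                   # β_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[:, t] = A @ (B[:, O[t+1]] * beta[:, t+1])  # β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
        return beta, np.sum(pi * B[:, O[0]] * beta[:, 0])   # p(O|λ) = Σ_i π_i b_i(o_1) β_1(i)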

    2.3 一些概率与期望值

    这两个量都是后面EM算法用到的中间参量。
    1.计算 $t$ 时刻处于状态 $q_i$ 的概率。
    概率计算问题是计算 $p(O|\lambda)$,则有:

    $$p(O|\lambda)=\sum_{i_t}p(O,i_t|\lambda)$$

    依据隐马尔科夫的独立性假设:

    $$p(o_{t+1:T}|i_t,o_{1:t}, \lambda) = p(o_{t+1:T}|i_t, \lambda)$$

    所以:

    $$\begin{aligned} p(O|\lambda) &=\sum_{i_t}p(O,i_t|\lambda) \\ &=\sum_{i_t} p(o_{t+1:T}|i_t,o_{1:t}, \lambda) p(i_t,o_{1:t}|\lambda) \\ &=\sum_{i_t} p(o_{t+1:T}|i_t, \lambda) p(i_t,o_{1:t}|\lambda) \end{aligned}$$

    又有:

    $$\alpha_t(i) = p(o_{1:t},i_t=q_i|\lambda) \tag{12}$$

    $$\beta_t(i) = p(o_{t+1:T}|i_t=q_i, \lambda) \tag{13}$$

    故:

    $$p(O,i_t=q_i|\lambda) = p(o_{t+1:T}|i_t=q_i, \lambda) p(i_t=q_i,o_{1:t}|\lambda) = \alpha_t(i) \beta_t(i)$$

    $$p(O|\lambda) = \sum_{i=1}^N \alpha_t(i) \beta_t(i)$$

    设:

    $$\gamma_t(i) = p(i_t=q_i|O,\lambda)$$

    于是可以得到:

    $$\gamma_t(i) = p(i_t=q_i|O,\lambda) = \frac {p(i_t=q_i,O|\lambda)}{p(O|\lambda)} = \frac {\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)} \tag{14}$$

    2.计算 $t$ 时刻处于状态 $q_i$ 且 $t+1$ 时刻处于状态 $q_j$ 的概率。

    $$\begin{aligned} p(O|\lambda) &=\sum_{i_t} \sum_{i_{t+1}} p(O,i_t, i_{t+1}|\lambda) \\ &=\sum_{i_t} \sum_{i_{t+1}} p(o_{1:t},o_{t+1},o_{t+2:T},i_t, i_{t+1}|\lambda) \\ &=\sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|o_{1:t},o_{t+1},i_t, i_{t+1},\lambda)p(o_{t+1}|o_{1:t},i_t,i_{t+1},\lambda) p(i_{t+1}|i_t,o_{1:t},\lambda) p(i_t,o_{1:t}|\lambda) \end{aligned}$$

    由隐马尔科夫的独立性假设可得:

    $$p(O|\lambda) = \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}| i_{t+1},\lambda)p(o_{t+1}|i_{t+1},\lambda) p(i_{t+1}|i_t,\lambda) p(i_t,o_{1:t}|\lambda)$$

    设:

    $$\xi_t(i,j)=p(i_t=q_i,i_{t+1}=q_j|O,\lambda)$$

    又有公式(2)(4)(12)(13),得:

    $$\xi_t(i,j) = \frac {p(i_t=q_i,i_{t+1}=q_j,O|\lambda)}{p(O|\lambda)} =\frac {\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} {\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} \tag{15}$$
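
    式(14)(15)可以直接由 alpha、beta 算出来,下面是一个小草稿(沿用上面示例里的 forward / backward,仅作演示):

    def gamma_xi(A, B, pi, O):
        """返回式(14)的 gamma,形状 (N, T),和式(15)的 xi,形状 (N, N, T-1)"""
        alpha, p = forward(A, B, pi, O)
        beta, _ = backward(A, B, pi, O)
        T = len(O)
        gamma = alpha * beta / p
        xi = np.zeros((A.shape[0], A.shape[0], T - 1))
        for t in range(T - 1):
            xi[:, :, t] = (alpha[:, t][:, None] * A
                           * B[:, O[t+1]][None, :] * beta[:, t+1][None, :]) / p
        return gamma, xi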

    3. 学习问题
    3.1 监督学习

    如果有标注好状态序列的样本,那就好办了:直接按 $A$、$B$、$\pi$ 各自的定义做频数统计再归一化即可,统计过程中注意概率之和为 1 的约束。

    3.2 无监督学习

    如果没有标注状态序列的样本,可以用Baum-Welch算法(EM算法)实现。

    已知:包含 $S$ 个长度为 $T$ 的观测序列的样本集 $\{O_1,O_2,...,O_S \}$。
    目标:学习隐马尔可夫模型的参数 $\lambda=(A,B,\pi)$。

    记观测数据为 $O$,隐数据为 $I$,那么隐马尔可夫模型可以表示为:

    $$p(O|\lambda) = \sum_I p(O|I,\lambda) p(I|\lambda)$$

    E步:

    因为对 $\lambda$ 而言,$1/p(O| \overline \lambda)$ 是常数项,所以

    $$\begin{aligned} Q(\lambda,\overline \lambda) &= E_I[\log p(O,I|\lambda)|O, \overline \lambda] \\ &= \sum_I \log p(O,I|\lambda) p(I|O,\overline \lambda) \\ &= \sum_I \log p(O,I|\lambda) \frac {p(I,O|\overline \lambda)}{p(O| \overline \lambda)} \\ &= \sum_I \log p(O,I|\lambda) p(I,O|\overline \lambda) \end{aligned}$$

    将概率计算问题2.1小节中前向算法的递归公式展开就可以得到:

    $$p(O,I|\lambda) = \pi_{i_1} b_{i_1}(o_1) a_{i_1i_2} b_{i_2}(o_2) ... a_{i_{T-1}i_T} b_{i_T}(o_T) = \pi_{i_1} \Big[\prod_{t=1}^{T-1} a_{i_ti_{t+1}}\Big]\Big[\prod_{t=1}^T b_{i_t}(o_t)\Big]$$

    于是:

    $$Q(\lambda, \overline \lambda) = \sum_I \log \pi_{i_1} p(O, I| \overline \lambda) + \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_ti_{t+1}}\Big) p(O, I| \overline \lambda) + \sum_I \Big(\sum_{t=1}^T \log b_{i_t}(o_t)\Big) p(O, I| \overline \lambda) \tag{16}$$

    关于隐变量的说明:
    隐马尔可夫模型的隐变量就是观测序列对应的状态序列,所以隐变量的后验可以用(14)式的变量表示。
    后面在M步中更新模型参数时也用到了(15)式,是不是说明隐变量有两个?其实不是,这里只是为了表示和计算的方便。
    也就是说,E步中用 $\gamma$ 和 $\xi$ 表示关于隐变量的后验信息,只是为了编程和表示的便利,这两个变量在E步中的信息是重复的。

    M步:

    1.求解 $\pi_i$:
    由(16)式可得:

    $$L(\pi_{i_1}) = \sum_I \log \pi_{i_1} p(O, I| \overline \lambda) = \sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda)$$

    又因为 $\pi_i$ 满足约束条件 $\sum_{i=1}^N \pi_{i}=1$,利用拉格朗日乘子法,写出拉格朗日函数(乘子记作 $\gamma$):

    $$\sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda) + \gamma\Big(\sum_{i=1}^N \pi_{i} - 1\Big)$$

    对其求偏导并令结果为0:

    $$\frac {\partial} {\partial \pi_i} \Big[\sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda) + \gamma\Big(\sum_{i=1}^N \pi_{i} - 1\Big)\Big]=0 \tag{17}$$

    得:

    $$p(O, i_1=i| \overline \lambda) + \gamma \pi_i=0$$

    得到:

    $$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {-\gamma}$$

    代入 $\sum_{i=1}^N \pi_{i}=1$ 得:

    $$-\gamma = \sum_{i=1}^N p(O, i_1=i| \overline \lambda) = p(O|\overline \lambda)$$

    求得(并由公式(14)):

    $$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {p(O|\overline \lambda)} = \gamma_1(i) \tag{18}$$

    2.求解 $a_{ij}$:

    $$L(a_{ij})=\sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_ti_{t+1}}\Big) p(O, I| \overline \lambda) = \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda)$$

    应用约束条件 $\sum_{j=1}^N a_{ij} = 1$,写出拉格朗日函数(乘子记作 $\eta$):

    $$\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)$$

    对上式求偏导并令其等于0:

    $$\frac {\partial}{\partial a_{ij}} \Big[\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)\Big] = 0$$

    得到:

    $$\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\, a_{ij} = 0$$

    所以:

    $$a_{ij} = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda)}{- \eta}$$

    将上式代入 $\sum_{j=1}^N a_{ij} = 1$ 得:

    $$- \eta = \sum_{j=1}^N \sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) = \sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda)$$

    故得:

    $$a_{ij} = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda)}{\sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda) } = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) / p(O|\overline \lambda)} {\sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda) / p(O|\overline \lambda) }$$

    将(14)和(15)代入得:

    $$a_{ij} = \frac {\sum_{t=1}^{T-1} \xi_t(i,j)} {\sum_{t=1}^{T-1} \gamma_t(i) } \tag{19}$$

    3.求解 $b_j(k)$:

    $$L(b_j(k)) = \sum_I \Big(\sum_{t=1}^T \log b_{i_t}(o_t)\Big) p(O, I| \overline \lambda) = \sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda)$$

    在约束条件 $\sum_{k=1}^M b_j(k) = 1$ 下写出拉格朗日函数(乘子记作 $\eta$):

    $$\sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)$$

    对其求偏导并令其等于0:

    $$\frac {\partial}{\partial b_j(k)} \Big[\sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)\Big] = 0$$

    因为只有在 $o_t=v_k$ 时偏导才不为0,以 $I(o_t=v_k)$ 表示,则:

    $$\sum_{t=1}^T p(O, i_t=j| \overline \lambda) I(o_t=v_k) + \eta\, b_{j}(o_t)I(o_t=v_k) = 0$$

    $b_{j}(o_t)I(o_t=v_k)$ 可以写作 $b_{j}(k)$,故:

    $$b_{j}(k) = \frac {\sum_{t=1}^T p(O, i_t=j| \overline \lambda) I(o_t=v_k)} {- \eta}$$

    将上式代入 $\sum_{k=1}^M b_j(k) = 1$ 得:

    $$- \eta = \sum_{k=1}^M \sum_{t=1}^T p(O, i_t=j| \overline \lambda) I(o_t=v_k) = \sum_{t=1}^T p(O, i_t=j| \overline \lambda)$$

    得到:

    $$b_{j}(k) = \frac {\sum_{t=1}^T p(O, i_t=j| \overline \lambda) I(o_t=v_k)} {\sum_{t=1}^T p(O, i_t=j| \overline \lambda)}$$

    又由(14)式可得:

    $$b_{j}(k) = \frac {\sum_{t=1,o_t=v_k}^T \gamma_t(j)} {\sum_{t=1}^T \gamma_t(j)} \tag{20}$$

    EM算法总结:
    E步:

    $$\gamma_t(i) = p(i_t=q_i|O,\lambda) = \frac {p(i_t=q_i,O|\lambda)}{p(O|\lambda)} = \frac {\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)}$$

    $$\xi_t(i,j) = \frac {p(i_t=q_i,i_{t+1}=q_j,O|\lambda)}{p(O|\lambda)} =\frac {\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} {\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}$$

    M步:

    $$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {p(O|\overline \lambda)} = \gamma_1(i)$$

    $$a_{ij} = \frac {\sum_{t=1}^{T-1} \xi_t(i,j)} {\sum_{t=1}^{T-1} \gamma_t(i) }$$

    $$b_{j}(k) = \frac {\sum_{t=1,o_t=v_k}^T \gamma_t(j)} {\sum_{t=1}^T \gamma_t(j)}$$
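
    把 E 步和 M 步合起来,就是一次 Baum-Welch 迭代。下面给出单条观测序列、单次迭代的示意实现(沿用上面示例里的 gamma_xi,仅作演示,实际使用时需要循环迭代直到参数收敛):

    def baum_welch_step(A, B, pi, O):
        """按式(18)(19)(20)做一次参数更新,返回新的 (A, B, pi)"""
        N, M = B.shape
        O = np.asarray(O)
        gamma, xi = gamma_xi(A, B, pi, O)
        new_pi = gamma[:, 0]                                               # 式(18)
        new_A = xi.sum(axis=2) / gamma[:, :-1].sum(axis=1, keepdims=True)  # 式(19)
        new_B = np.zeros_like(B)
        for k in range(M):
            new_B[:, k] = gamma[:, O == k].sum(axis=1) / gamma.sum(axis=1) # 式(20)
        return new_A, new_B, new_pi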

    4. Prediction problem (decoding problem)

    This problem is solved with the Viterbi algorithm.
    Given: the model $\lambda=(A,B,\pi)$ and the observation sequence $O=(o_1,o_2,...,o_T)$.
    Find: the state sequence $I=(i_1,i_2,...,i_T)$ that maximizes the conditional probability $p(I|O,\lambda)$.
    Since $p(O|\lambda)$ is fixed for a given observation sequence,

    $$\max_I p(I|O,\lambda) = \max_I \frac{p(I, O|\lambda)}{p(O|\lambda)}$$

    so maximizing $p(I|O,\lambda)$ over $I$ is equivalent to maximizing the joint probability $p(I, O|\lambda)$.

    Define the maximum probability over all single paths $(i_1,i_2,...,i_t)$ that end in state $i$ at time $t$:

    $$\delta_t(i) = \max_{i_1,i_2,...,i_{t-1}} p(i_t=i, i_{t-1:1},o_{t:1}|\lambda)$$

    Derivation of the recursion:

    $$\begin{aligned} &p(i_{t+1}=i,i_{t:1},o_{t+1:1}| \lambda) \\ &= p(i_{t+1}=i,i_t,i_{t-1:1},o_{t+1},o_{t:1}| \lambda) \\ &= p(o_{t+1}|i_{t+1}=i,i_t,o_{t:1},\lambda)\, p(i_{t+1}=i|i_t,i_{t-1:1},o_{t:1}, \lambda)\, p(i_t,i_{t-1:1},o_{t:1}|\lambda) \\ &= p(o_{t+1}|i_{t+1}=i,\lambda)\, p(i_{t+1}=i|i_t,\lambda)\, p(i_t,i_{t-1:1},o_{t:1}|\lambda) \end{aligned}$$

    where the last step uses the two basic HMM assumptions. Hence:

    $$\delta_{t+1}(i) = \max_{i_1,i_2,...,i_{t}} p(i_{t+1}=i,i_{t:1},o_{t+1:1}| \lambda) = \max_{1 \le j \le N} [\delta _t(j) a_{ji}] b_i(o_{t+1}) \tag{21}$$

    Define the $(t-1)$-th node of the highest-probability single path $(i_1,i_2,...,i_{t-1},i)$ ending in state $i$ at time $t$:

    $$\psi_t(i) = \arg \max_{1 \le j \le N}[\delta _{t-1}(j) a_{ji}] \tag{22}$$
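
    Before turning to the full class, here is a compact standalone sketch of the Viterbi recursions (21)/(22) in log space. It assumes all entries of pi, A and B are strictly positive (otherwise add a small epsilon before taking logs); the function name viterbi and the variable names are illustrative only.

    import numpy as np

    def viterbi(O, A, B, pi):
        """Return the most likely state sequence for observation indices O."""
        N, T = A.shape[0], len(O)
        log_delta = np.zeros((N, T))          # delta in log space
        psi = np.zeros((N, T), dtype=int)
        log_delta[:, 0] = np.log(pi) + np.log(B[:, O[0]])
        for t in range(1, T):
            for i in range(N):
                scores = log_delta[:, t - 1] + np.log(A[:, i])            # delta_{t-1}(j) a_{ji}, in logs
                psi[i, t] = np.argmax(scores)                             # formula (22)
                log_delta[i, t] = scores[psi[i, t]] + np.log(B[i, O[t]])  # formula (21)
        # backtrack the optimal path from the best final state
        I = np.zeros(T, dtype=int)
        I[T - 1] = np.argmax(log_delta[:, T - 1])
        for t in range(T - 2, -1, -1):
            I[t] = psi[I[t + 1], t + 1]
        return I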

    5. Python implementation of the model
    5.1 Parameter correspondence

    The list below maps the symbols in the formulas above to the names used in the model code (the symbols in the formulas are kept consistent with the code):

    :param N: $N$, the number of states
    :param M: $M$, the number of distinct observations
    :param V: $V$, the observation set, shape $(M,)$
    :param A: $A$, the state transition matrix, shape $(N, N)$
    :param B: $B$, the observation probability (emission) matrix, shape $(N, M)$
    :param pi: $\pi$, the initial state probability vector, shape $(N,)$
    :param S: $S$, the number of input sentences
    :param T: $T$, the length of each sentence
    :param gamma: $\gamma$, latent variable, the per-state probability matrix, shape $(S,N,T)$
    :param xi: $\xi$, latent variable, the state-pair (transition) probability matrix, shape $(S,N,N,T)$
    :param alpha: $\alpha$, result of the forward algorithm, shape $(N,T)$
    :param beta: $\beta$, result of the backward algorithm, shape $(N,T)$
    :param delta: $\delta$, maximum path probabilities stored by the Viterbi algorithm, shape $(N,T)$
    :param psi: $\psi$, indices of those maxima stored by the Viterbi algorithm, shape $(N,T)$
    :param I: $I$, the output state sequence, shape $(T,)$

    Tip: to avoid numerical underflow from multiplying many small probabilities, the computation is done in log space and converted back with an exponential at the end.
    logsumexp() # http://bayesjumping.net/log-sum-exp-trick/
    $$\log\sum_i \exp(x_i) = b + \log \sum_i \exp(x_i-b)$$ (where $b$ is typically taken to be $\max_i x_i$)

    def logSumExp(ns):
        max = np.max(ns)
        ds = ns - max
        sumOfExp = np.exp(ds).sum()
        return max + np.log(sumOfExp)
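
    A quick (hypothetical) sanity check: on values around -1000 the naive np.log(np.sum(np.exp(ns))) underflows to -inf because exp(-1000) evaluates to 0.0, while the shifted version stays finite:

    xs = np.array([-1000.0, -1001.0, -1002.0])
    print(logSumExp(xs))  # approximately -999.59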
    
    5.2 Python implementation of the HMM
    import numpy as np
    
    
    class MyHMM(object):
        def __init__(self, N=None, A=None, B=None, pi=None):
            """
            HMM model.
            The three basic problems of hidden Markov models:
            1. Probability computation: given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,...,o_T)$,
                compute the probability of the observation sequence under the known parameters, i.e. $p(O|\lambda)$.
                Solved with the forward or the backward algorithm.
            2. Learning: given an observation sequence $O=(o_1,o_2,...,o_T)$, estimate the model parameters $\lambda=(A,B,\pi)$
                that maximize $p(O|\lambda)$. Solved with the Baum-Welch (BW) algorithm.
            3. Prediction (decoding): given the model $\lambda=(A,B,\pi)$ and $O=(o_1,o_2,...,o_T)$, find the state sequence
                $I=(i_1,i_2,...,i_T)$ that maximizes the conditional probability $p(I|O)$. Solved with the Viterbi algorithm.

            :param N: $N$, the number of states
            :param M: $M$, the number of distinct observations
            :param V: $V$, the observation set, shape $(M,)$
            :param A: $A$, the state transition matrix, shape $(N, N)$
            :param B: $B$, the observation probability (emission) matrix, shape $(N, M)$
            :param pi: $\pi$, the initial state probability vector, shape $(N,)$
            :param S: $S$, the number of input sentences
            :param T: $T$, the length of each sentence
            :param gamma: $\gamma$, latent variable, the per-state probability matrix, shape $(S,N,T)$
            :param xi: $\xi$, latent variable, the state-pair (transition) probability matrix, shape $(S,N,N,T)$
            :param alpha: $\alpha$, result of the forward algorithm, shape $(N,T)$
            :param beta: $\beta$, result of the backward algorithm, shape $(N,T)$
            :param delta: $\delta$, maximum path probabilities stored by the Viterbi algorithm, shape $(N,T)$
            :param psi: $\psi$, indices of those maxima stored by the Viterbi algorithm, shape $(N,T)$
            :param I: $I$, the output state sequence, shape $(T,)$
            """
            self.N = N  # number of states
            self.params = {
                'A': A,
                'B': B,
                'pi': pi,
                'gamma': None,
                'xi': None
            }
    
            self.M = None  # number of distinct observations
    
            self.S = None  # number of sentences
            self.T = None  # length of each sentence
    
            self.V = None  # observation set
    
            self.eps = np.finfo(float).eps
    
            np.random.seed(2)
    
        def _init_params(self):
            """
            Initialize the model parameters (random pi, A, B; zero gamma and xi).
            :return:
            """
            def generate_random_n_data(N):
                ret = np.random.rand(N)
                return ret / np.sum(ret)
    
            pi = generate_random_n_data(self.N)
            A = np.array([generate_random_n_data(self.N) for _ in range(self.N)])
            B = np.array([generate_random_n_data(self.M) for _ in range(self.N)])
    
            gamma = np.zeros((self.S, self.N, self.T))
            xi = np.zeros((self.S, self.N, self.N, self.T))
    
            self.params = {
                'A': A,
                'B': B,
                'pi': pi,
                'gamma': gamma,
                'xi': xi
            }
    
        def logSumExp(self, ns, axis=None):
            max = np.max(ns)
            ds = ns - max
            sumOfExp = np.exp(ds).sum(axis=axis)
            return max + np.log(sumOfExp)
    
        def _forward(self, O_s):
            """
            Forward algorithm; see equation (9) in the post.
            :param O_s: a single observation sequence, shape (T,)
            :return:
            """
            A = self.params['A']
            B = self.params['B']
            pi = self.params['pi']
            T = len(O_s)
    
            log_alpha = np.zeros((self.N, T))
    
            for i in range(self.N):
                log_alpha[i, 0] = np.log(pi[i] + self.eps) + np.log(B[i, O_s[0]] + self.eps)
    
            for t in range(1, T):
                for i in range(self.N):
                    log_alpha[i, t] = self.logSumExp(np.array(
                        [log_alpha[_i, t-1] + np.log(A[_i, i] + self.eps) + np.log(B[i, O_s[t]] + self.eps)
                         for _i in range(self.N)]))
            return log_alpha
    
        def _backward(self, O_s):
            """
            Backward algorithm; see equation (11) in the post.
            :param O_s: a single observation sequence, shape (T,)
            :return:
            """
            A = self.params['A']
            B = self.params['B']
            pi = self.params['pi']
            T = len(O_s)
    
            log_beta = np.zeros((self.N, T))
    
            for i in range(self.N):
                log_beta[i, T-1] = 0
    
            for t in range(T-2, -1, -1):
                for i in range(self.N):
                    log_beta[i, t] = self.logSumExp(np.array([
                        log_beta[_i, t+1] + np.log(A[i, _i] + self.eps) + np.log(B[_i, O_s[t+1]] + self.eps)
                    for _i in range(self.N)]))
            return log_beta
    
        def _E_step(self, O):
            """
            E-step of the Baum-Welch algorithm.
            Compute the latent variables gamma and xi from the forward/backward results (equations (9) and (11) in the post).
            :param O:
            :return:
            """
            A = self.params['A']
            B = self.params['B']
            pi = self.params['pi']
            # process each of the S sentences in turn
            for s in range(self.S):
                O_s = O[s]
                log_alpha = self._forward(O_s)
                log_beta = self._backward(O_s)
    
                # log-likelihood obtained from the forward algorithm
                log_likelihood = self.logSumExp(log_alpha[:, self.T -1])  # log p(O|lambda)
                # # log-likelihood obtained from the backward algorithm (the two results should be equal)
                # log_likelihood = self.logSumExp(np.array([np.log(pi[_i] + self.eps) + np.log(B[_i, O_s[0]] + self.eps) + log_beta[_i, 0] for _i in range(self.N)]))
    
                for i in range(self.N):
                    self.params['gamma'][s, i, self.T-1] = log_alpha[i, self.T-1] + log_beta[i, self.T-1] - log_likelihood
    
                for t in range(self.T - 1):
                    for i in range(self.N):