
    Model Introduction

    The basic idea of the TransE model is that the head vector plus the relation vector should be as close as possible to the tail vector, and we measure this closeness with the L1 or L2 norm.

    The loss function is a max-margin loss with negative sampling:

    L(y, y') = max(0, margin - y + y')

    where y is the score of a positive sample and y' the score of a negative sample. We minimize this loss; a pair contributes nothing once the gap between the two scores exceeds the margin (a value we choose, usually 1).

    Because we use a distance as the score, a minus sign appears in the formula, and the loss function for knowledge representation becomes:

    L = \sum_{(h,r,t)} \sum_{(h',r',t')} [margin + d(h + r, t) - d(h' + r', t')]_+

    where d is:

    d(h + r, t) = \|h + r - t\|

    i.e. the L1 or L2 norm. Negative samples are obtained by replacing the head or the tail entity of a triple with a random entity.
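
    As a quick sanity check of these formulas (a toy example with made-up vectors, not part of the repository code), the distance and the margin loss can be computed directly with numpy:

    import numpy as np

    h = np.array([0.1, 0.2, 0.3])        # head entity vector
    r = np.array([0.4, 0.1, 0.0])        # relation vector
    t = np.array([0.5, 0.3, 0.3])        # tail of the positive triple
    t_neg = np.array([0.9, -0.2, 0.7])   # tail of a corrupted triple

    margin = 1.0
    d_pos = np.linalg.norm(h + r - t)      # L2 distance of the positive triple
    d_neg = np.linalg.norm(h + r - t_neg)  # L2 distance of the negative triple
    loss = max(0.0, margin + d_pos - d_neg)
    print(d_pos, d_neg, loss)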


    Code implementation:

    The full code and the datasets (YAGO, umls, FB15K, WN18) are available on GitHub:
    https://github.com/Colinasda/TransE.git

    import codecs
    import numpy as np
    import copy
    import time
    import random
    
    entities2id = {}
    relations2id = {}
    
    
    def dataloader(file1, file2, file3):
        print("load file...")
    
        entity = []
        relation = []
        with open(file2, 'r') as f1, open(file3, 'r') as f2:
            lines1 = f1.readlines()
            lines2 = f2.readlines()
            for line in lines1:
                line = line.strip().split('\t')
                if len(line) != 2:
                    continue
                entities2id[line[0]] = line[1]
                entity.append(line[1])
    
            for line in lines2:
                line = line.strip().split('\t')
                if len(line) != 2:
                    continue
                relations2id[line[0]] = line[1]
                relation.append(line[1])
    
    
        triple_list = []
    
        with codecs.open(file1, 'r') as f:
            content = f.readlines()
            for line in content:
                triple = line.strip().split("\t")
                if len(triple) != 3:
                    continue
    
                h_ = entities2id[triple[0]]
                r_ = relations2id[triple[1]]
                t_ = entities2id[triple[2]]
    
    
                triple_list.append([h_, r_, t_])
    
        print("Complete load. entity : %d , relation : %d , triple : %d" % (
        len(entity), len(relation), len(triple_list)))
    
        return entity, relation, triple_list
    
    
    def norm_l1(h, r, t):
        return np.sum(np.fabs(h + r - t))
    
    
    def norm_l2(h, r, t):
        return np.sum(np.square(h + r - t))
    
    
    class TransE:
        def __init__(self, entity, relation, triple_list, embedding_dim=50, lr=0.01, margin=1.0, norm=1):
            self.entities = entity
            self.relations = relation
            self.triples = triple_list
            self.dimension = embedding_dim
            self.learning_rate = lr
            self.margin = margin
            self.norm = norm
            self.loss = 0.0
    
        def data_initialise(self):
            entityVectorList = {}
            relationVectorList = {}
            for entity in self.entities:
                entity_vector = np.random.uniform(-6.0 / np.sqrt(self.dimension), 6.0 / np.sqrt(self.dimension),
                                                  self.dimension)
                entityVectorList[entity] = entity_vector
    
            for relation in self.relations:
                relation_vector = np.random.uniform(-6.0 / np.sqrt(self.dimension), 6.0 / np.sqrt(self.dimension),
                                                    self.dimension)
                relation_vector = self.normalization(relation_vector)
                relationVectorList[relation] = relation_vector
    
            self.entities = entityVectorList
            self.relations = relationVectorList
    
        def normalization(self, vector):
            return vector / np.linalg.norm(vector)
    
        def training_run(self, epochs=1, nbatches=100, out_file_title = ''):
    
            batch_size = int(len(self.triples) / nbatches)
            print("batch size: ", batch_size)
            for epoch in range(epochs):
                start = time.time()
                self.loss = 0.0
                # Normalise the embedding of the entities to 1
                for entity in self.entities.keys():
                    self.entities[entity] = self.normalization(self.entities[entity])
    
                for batch in range(nbatches):
                    batch_samples = random.sample(self.triples, batch_size)
    
                    Tbatch = []
                    for sample in batch_samples:
                        corrupted_sample = copy.deepcopy(sample)
                        pr = np.random.random(1)[0]
                        if pr > 0.5:
                            # change the head entity
                            corrupted_sample[0] = random.sample(list(self.entities.keys()), 1)[0]
                            while corrupted_sample[0] == sample[0]:
                                corrupted_sample[0] = random.sample(list(self.entities.keys()), 1)[0]
                        else:
                            # change the tail entity
                            corrupted_sample[2] = random.sample(list(self.entities.keys()), 1)[0]
                            while corrupted_sample[2] == sample[2]:
                                corrupted_sample[2] = random.sample(list(self.entities.keys()), 1)[0]
    
                        if (sample, corrupted_sample) not in Tbatch:
                            Tbatch.append((sample, corrupted_sample))
    
                    self.update_triple_embedding(Tbatch)
                end = time.time()
                print("epoch: ", epoch, "cost time: %s" % (round((end - start), 3)))
                print("running loss: ", self.loss)
    
            with codecs.open(out_file_title +"TransE_entity_" + str(self.dimension) + "dim_batch" + str(batch_size), "w") as f1:
    
                for e in self.entities.keys():
                    # f1.write("\t")
                    # f1.write(e + "\t")
                    f1.write(str(list(self.entities[e])))
                    f1.write("\n")
    
            with codecs.open(out_file_title +"TransE_relation_" + str(self.dimension) + "dim_batch" + str(batch_size), "w") as f2:
                for r in self.relations.keys():
                    # f2.write("\t")
                    # f2.write(r + "\t")
                    f2.write(str(list(self.relations[r])))
                    f2.write("\n")
    
        def update_triple_embedding(self, Tbatch):
            # deepcopy guarantees that nested lists are copied at every level, so copy_entity
            # shares no elements with self.entities
            copy_entity = copy.deepcopy(self.entities)
            copy_relation = copy.deepcopy(self.relations)
    
            for correct_sample, corrupted_sample in Tbatch:
    
                correct_copy_head = copy_entity[correct_sample[0]]
                correct_copy_tail = copy_entity[correct_sample[2]]
                relation_copy = copy_relation[correct_sample[1]]
    
                corrupted_copy_head = copy_entity[corrupted_sample[0]]
                corrupted_copy_tail = copy_entity[corrupted_sample[2]]
    
                correct_head = self.entities[correct_sample[0]]
                correct_tail = self.entities[correct_sample[2]]
                relation = self.relations[correct_sample[1]]
    
                corrupted_head = self.entities[corrupted_sample[0]]
                corrupted_tail = self.entities[corrupted_sample[2]]
    
                # calculate the distance of the triples
                if self.norm == 1:
                    correct_distance = norm_l1(correct_head, relation, correct_tail)
                    corrupted_distance = norm_l1(corrupted_head, relation, corrupted_tail)
    
                else:
                    correct_distance = norm_l2(correct_head, relation, correct_tail)
                    corrupted_distance = norm_l2(corrupted_head, relation, corrupted_tail)
    
                loss = self.margin + correct_distance - corrupted_distance
    
                if loss > 0:
                    self.loss += loss
                    print(loss)
                    correct_gradient = 2 * (correct_head + relation - correct_tail)
                    corrupted_gradient = 2 * (corrupted_head + relation - corrupted_tail)
    
                    if self.norm == 1:
                        for i in range(len(correct_gradient)):
                            if correct_gradient[i] > 0:
                                correct_gradient[i] = 1
                            else:
                                correct_gradient[i] = -1
    
                            if corrupted_gradient[i] > 0:
                                corrupted_gradient[i] = 1
                            else:
                                corrupted_gradient[i] = -1
    
                    correct_copy_head -= self.learning_rate * correct_gradient
                    relation_copy -= self.learning_rate * correct_gradient
                    correct_copy_tail -= -1 * self.learning_rate * correct_gradient
    
                    relation_copy -= -1 * self.learning_rate * corrupted_gradient
                    if correct_sample[0] == corrupted_sample[0]:
                        # if corrupted_triples replaces the tail entity, the head entity's embedding need to be updated twice
                        correct_copy_head -= -1 * self.learning_rate * corrupted_gradient
                        corrupted_copy_tail -= self.learning_rate * corrupted_gradient
                    elif correct_sample[2] == corrupted_sample[2]:
                        # if corrupted_triples replaces the head entity, the tail entity's embedding need to be updated twice
                        corrupted_copy_head -= -1 * self.learning_rate * corrupted_gradient
                        correct_copy_tail -= self.learning_rate * corrupted_gradient
    
                    # normalising these new embedding vector, instead of normalising all the embedding together
                    copy_entity[correct_sample[0]] = self.normalization(correct_copy_head)
                    copy_entity[correct_sample[2]] = self.normalization(correct_copy_tail)
                    if correct_sample[0] == corrupted_sample[0]:
                        # if corrupted_triples replace the tail entity, update the tail entity's embedding
                        copy_entity[corrupted_sample[2]] = self.normalization(corrupted_copy_tail)
                    elif correct_sample[2] == corrupted_sample[2]:
                        # if corrupted_triples replace the head entity, update the head entity's embedding
                        copy_entity[corrupted_sample[0]] = self.normalization(corrupted_copy_head)
                    # the paper mentions that the relation embeddings do not need to be normalised
                    copy_relation[correct_sample[1]] = relation_copy
                    # copy_relation[correct_sample[1]] = self.normalization(relation_copy)
    
            self.entities = copy_entity
            self.relations = copy_relation
    
    
    if __name__ == '__main__':
        file1 = "/umls/train.txt"
        file2 = "/umls/entity2id.txt"
        file3 = "/umls/relation2id.txt"
    
        entity_set, relation_set, triple_list = dataloader(file1, file2, file3)
        
        # modify by yourself
        transE = TransE(entity_set, relation_set, triple_list, embedding_dim=30, lr=0.01, margin=1.0, norm=2)
        transE.data_initialise()
        transE.training_run(out_file_title="umls_")
    
    
  • Knowledge Graphs: Principles of the TransE Model


    Knowledge Graphs: Principles of the TransE Model

    1 Introduction to the TransE Model

    1.1 Why TransE

    In an earlier article we noted that knowledge representation learning is a prerequisite for the task of knowledge graph completion. The most classic model for knowledge representation learning is TransE, whose core purpose is to translate the triples of a knowledge graph into embedding vectors.

    1.2 The Idea Behind TransE

    To simplify the notation, we first fix some symbols:

    1. h denotes the vector of a head entity in the knowledge graph.
    2. t denotes the vector of a tail entity in the knowledge graph.
    3. r denotes the vector of a relation in the knowledge graph.

    The TransE model makes the following assumption:

    t = h + r

    That is, for a correct triple, the tail entity vector equals the head entity vector plus the relation vector. Pictured as vectors:

    [figure: the relation r translates the head vector h onto the tail vector t]

    If a triple does not satisfy this relation, we can regard it as an incorrect triple.
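
    To make the idea concrete, here is a tiny sketch (toy vectors chosen only for illustration, not from the original article) that ranks candidate tails by how close they are to h + r; under the TransE assumption the closest candidate is the most plausible tail:

    import numpy as np

    h = np.array([0.2, 0.1])             # head entity vector
    r = np.array([0.3, 0.4])             # relation vector
    candidates = {                        # candidate tail vectors
        "t1": np.array([0.5, 0.5]),
        "t2": np.array([-0.4, 0.9]),
    }
    scores = {name: np.linalg.norm(h + r - v) for name, v in candidates.items()}
    best = min(scores, key=scores.get)    # "t1": closest to h + r, hence most plausible
    print(scores, best)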

    1.3 The TransE Objective Function

    First, two mathematical concepts:

    L1 norm

    Also called the Manhattan distance. For a vector X, the L1 norm is:

    \|X\|_{L1} = \sum_{i=1}^{n} |x_i|

    where x_i is the i-th component of X and we take its absolute value. The L1 norm can also measure how different two vectors are, i.e. their distance:

    Distance_{L1}(X_1, X_2) = \sum_{i=1}^{n} |X_{1i} - X_{2i}|

    L2 norm

    Also called the Euclidean distance. For a vector X, the L2 norm is:

    \|X\|_{L2} = \sqrt{\sum_{i=1}^{n} X_i^2}

    Likewise, the L2 norm can measure the difference between two vectors (written here in its squared form, which is what the derivation below uses):

    Distance_{L2}(X_1, X_2) = \sum_{i=1}^{n} (X_{1i} - X_{2i})^2
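
    The two distances translate directly into numpy (a small illustrative sketch, not part of the original article):

    import numpy as np

    def l1_distance(x1, x2):
        # Manhattan distance: sum of absolute component-wise differences
        return np.sum(np.abs(x1 - x2))

    def l2_distance_squared(x1, x2):
        # squared Euclidean distance, the form used in the derivation below
        return np.sum((x1 - x2) ** 2)

    x1 = np.array([1.0, 2.0, 3.0])
    x2 = np.array([2.0, 0.0, 3.0])
    print(l1_distance(x1, x2), l2_distance_squared(x1, x2))   # 3.0 5.0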

    From the TransE assumption above, the closer h + r is to t, the more likely the triple is correct, and the larger the gap, the more likely the triple is wrong. We can therefore use the L1 or L2 norm to measure the gap between the three vectors. Our goal is to make the distance of correct triples as small as possible and the distance of incorrect triples as large as possible, i.e. to make its negation as small as possible. Formally:

    \min \sum_{(h,r,t) \in G} \sum_{(h',r',t') \in G'} [\gamma + distance(h + r, t) - distance(h' + r', t')]_+

    where:

    (h, r, t): a correct triple
    (h', r', t'): an incorrect (corrupted) triple
    \gamma: the margin between positive and negative samples, a constant
    [x]_+: denotes max(0, x)

    To briefly explain the objective: we want the distance of each positive example to be small, i.e. min(distance(h + r, t)), and the negated distance of each negative example to be small, i.e. min(-distance(h' + r', t')). Summing this over every pair of a positive and a negative sample, plus the constant margin, gives the quantity we minimize, which is our objective function.
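
    For instance (with purely illustrative numbers), take \gamma = 1, distance(h + r, t) = 0.3 for a correct triple and distance(h' + r', t') = 1.6 for its corrupted version. The corresponding term is [1 + 0.3 - 1.6]_+ = [-0.3]_+ = 0, so this pair contributes no loss. If the corrupted distance were only 0.9 instead, the term would be [1 + 0.3 - 0.9]_+ = 0.4, and the optimizer would keep pushing the two distances apart.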

    1.4 Deriving the Gradients of the Objective

    Using the squared Euclidean distance as the distance function, the objective can be rewritten as:

    \min \sum_{(h,r,t) \in G} \sum_{(h',r',t') \in G'} [\gamma + (h + r - t)^2 - (h' + r' - t')^2]_+

    The loss function is therefore (writing r in both terms, since a corrupted triple keeps the relation of the original triple):

    Loss = \sum_{(h,r,t) \in G} \sum_{(h',r,t') \in G'} [\gamma + (h + r - t)^2 - (h' + r - t')^2]_+

    The parameters of this loss are {h, r, t, h', t'}. We now derive the gradient with respect to each of them.

    1. First, the gradient with respect to h. For a particular h_i:
      \frac{\partial Loss}{\partial h_i} = \sum_{(h_i,r,t) \in G} \sum_{(h',r,t') \in G'} \frac{\partial [\gamma + (h_i + r - t)^2 - (h' + r - t')^2]_+}{\partial h_i}
      Only the terms containing h_i contribute to the sum, and for one such term:
      \frac{\partial [\gamma + (h_i + r - t)^2 - (h' + r - t')^2]_+}{\partial h_i} = \begin{cases} 2(h_i + r - t) & \gamma + (h_i + r - t)^2 - (h' + r - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      so:
      \frac{\partial Loss}{\partial h_i} = \sum_{(h_i,r,t) \in G} \sum_{(h',r,t') \in G'} \begin{cases} 2(h_i + r - t) & \gamma + (h_i + r - t)^2 - (h' + r - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      Similarly, for t_i, h_i' and t_i':
      \frac{\partial Loss}{\partial t_i} = \sum_{(h,r,t_i) \in G} \sum_{(h',r,t') \in G'} \begin{cases} -2(h + r - t_i) & \gamma + (h + r - t_i)^2 - (h' + r - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      \frac{\partial Loss}{\partial h_i'} = \sum_{(h,r,t) \in G} \sum_{(h_i',r,t') \in G'} \begin{cases} -2(h' + r - t') & \gamma + (h + r - t)^2 - (h_i' + r - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      \frac{\partial Loss}{\partial t_i'} = \sum_{(h,r,t) \in G} \sum_{(h',r,t_i') \in G'} \begin{cases} 2(h' + r - t') & \gamma + (h + r - t)^2 - (h' + r - t_i')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      Finally, for r_i (and, formally, for \gamma, although \gamma is a fixed hyperparameter that is not actually trained):
      \frac{\partial Loss}{\partial r_i} = \sum_{(h,r_i,t) \in G} \sum_{(h',r_i,t') \in G'} \begin{cases} 2(h + r_i - t) - 2(h' + r_i - t') & \gamma + (h + r_i - t)^2 - (h' + r_i - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
      \frac{\partial Loss}{\partial \gamma} = \sum_{(h,r,t) \in G} \sum_{(h',r,t') \in G'} \begin{cases} 1 & \gamma + (h + r - t)^2 - (h' + r - t')^2 > 0 \\ 0 & \text{otherwise} \end{cases}
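
    These case-by-case gradients map directly onto a stochastic update step. Below is a minimal numpy sketch (vectors, margin and learning rate are illustrative, not from the original article) of one update on a single pair of a correct and a corrupted triple, using the squared L2 distance:

    import numpy as np

    def sgd_step(h, r, t, h_c, t_c, gamma=1.0, lr=0.01):
        """One SGD step on a (correct, corrupted) pair; the corrupted triple keeps the same relation r."""
        pos = h + r - t          # residual of the correct triple
        neg = h_c + r - t_c      # residual of the corrupted triple
        loss = gamma + np.sum(pos ** 2) - np.sum(neg ** 2)
        if loss <= 0:            # hinge inactive: all gradients are zero
            return h, r, t, h_c, t_c, 0.0
        h = h - lr * (2 * pos)               # dLoss/dh   =  2(h + r - t)
        t = t - lr * (-2 * pos)              # dLoss/dt   = -2(h + r - t)
        r = r - lr * (2 * pos - 2 * neg)     # dLoss/dr   =  2(h + r - t) - 2(h' + r - t')
        h_c = h_c - lr * (-2 * neg)          # dLoss/dh'  = -2(h' + r - t')
        t_c = t_c - lr * (2 * neg)           # dLoss/dt'  =  2(h' + r - t')
        return h, r, t, h_c, t_c, loss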

    1.5 Generating Negative Samples

    The algorithm above needs negative samples, but a knowledge graph stores only positive triples. So how do we obtain negatives?

    Concretely, we create an incorrect triple by randomly replacing the head entity of a real triple, or by randomly replacing its tail entity.

    To avoid the replacement accidentally producing a triple that also exists in the knowledge graph, we filter the corrupted triples after replacement.
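
    A minimal sketch of such a corruption routine with the filtering step (function and variable names are hypothetical):

    import random

    def corrupt(triple, all_entities, known_triples):
        """Replace the head or the tail with a random entity; redraw if the result is a known triple."""
        h, r, t = triple
        while True:
            if random.random() < 0.5:
                candidate = (random.choice(all_entities), r, t)   # replace the head
            else:
                candidate = (h, r, random.choice(all_entities))   # replace the tail
            if candidate != triple and candidate not in known_triples:
                return candidate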

  • TransE Model: Data Preprocessing


    TransE Model: Data Preprocessing

    Source Code

    The source code follows a reference project.

    Data Description

    The dataset is FB15K. The files used in the code below are:

    file1: the training set, in the format (head, relation, tail)

    Example:

    /m/027rn	/location/country/form_of_government	/m/06cx9
    /m/017dcd	/tv/tv_program/regular_cast./tv/regular_tv_appearance/actor	/m/06v8s0
    /m/07s9rl0	/media_common/netflix_genre/titles	/m/0170z3
    

    file2: entity2id.txt, in the format (entity, id)

    Example:

    /m/06rf7	0
    /m/0c94fn	1
    /m/016ywr	2
    

    file3: relation2id.txt, in the format (relation, id)

    Example:

    /people/appointed_role/appointment./people/appointment/appointed_by	0
    /location/statistical_region/rent50_2./measurement_unit/dated_money_value/currency	1
    /tv/tv_series_episode/guest_stars./tv/tv_guest_role/actor	2
    

    file4: the validation set, in the format (head, relation, tail)

    Example:

    /m/07pd_j	/film/film/genre	/m/02l7c8
    /m/06wxw	/location/location/time_zones	/m/02fqwt
    /m/0d4fqn	/award/award_winner/awards_won./award/award_honor/award_winner	/m/03wh8kl
    

    Code Walkthrough

    def dataloader(file1, file2, file3, file4):
        print("load file...")
        entity = []
        relation = []
        with open(file2, 'r') as f1, open(file3, 'r') as f2:
            lines1 = f1.readlines()
            lines2 = f2.readlines()
            for line in lines1:
                # strip leading/trailing whitespace and split on '\t'
                line = line.strip().split('\t')
                if len(line) != 2:
                    continue
                # line[0]: the entity string
                # line[1]: its id
                # entities2id: {entity: id}
                entities2id[line[0]] = line[1]
                entity.append(int(line[1]))
    
            for line in lines2:
                line = line.strip().split('\t')
                if len(line) != 2:
                    continue
                # line[0]: the relation string
                # line[1]: its id
                # relations2id: {relation: id}
                relations2id[line[0]] = line[1]
                relation.append(int(line[1]))
    
        triple_list = []
        relation_head = {}
        relation_tail = {}
    
        with codecs.open(file1, 'r') as f:
            content = f.readlines()
            for line in content:
                triple = line.strip().split("\t")
                if len(triple) != 3:
                    continue
                # h_ : head entity id
                # r_ : relation id
                # t_ : tail entity id
                h_ = int(entities2id[triple[0]])
                r_ = int(relations2id[triple[1]])
                t_ = int(entities2id[triple[2]])
    
                triple_list.append([h_, r_, t_])
                # relation_head: {relation: {head_entity: count}}
                # for each relation, the head entities and how many tails each head connects to
                if r_ in relation_head:
                    if h_ in relation_head[r_]:
                        relation_head[r_][h_] += 1
                    else:
                        relation_head[r_][h_] = 1
                else:
                    relation_head[r_] = {}
                    relation_head[r_][h_] = 1
                # relation_tail: {relation: {tail_entity: count}}
                # for each relation, the tail entities and how many heads each tail connects to
                if r_ in relation_tail:
                    if t_ in relation_tail[r_]:
                        relation_tail[r_][t_] += 1
                    else:
                        relation_tail[r_][t_] = 1
                else:
                    relation_tail[r_] = {}
                    relation_tail[r_][t_] = 1
    
        for r_ in relation_head:
            sum1, sum2 = 0, 0
            # sum1: number of distinct head entities for this relation
            # sum2: total number of (head, tail) pairs, i.e. the tails summed over those heads
            for head in relation_head[r_]:
                sum1 += 1
                sum2 += relation_head[r_][head]
            # tph: for this relation, the average number of tail entities per head entity
            tph = sum2 / sum1
            # relation_tph and relation_hpt are module-level dicts defined elsewhere in the project
            relation_tph[r_] = tph
    
        for r_ in relation_tail:
            sum1, sum2 = 0, 0
            # sum1: number of distinct tail entities for this relation
            # sum2: total number of (head, tail) pairs, i.e. the heads summed over those tails
            for tail in relation_tail[r_]:
                sum1 += 1
                sum2 += relation_tail[r_][tail]
            hpt = sum2 / sum1
            relation_hpt[r_] = hpt
    
        valid_triple_list = []
        with codecs.open(file4, 'r') as f:
            content = f.readlines()
            for line in content:
                triple = line.strip().split("\t")
                if len(triple) != 3:
                    continue
    
                h_ = int(entities2id[triple[0]])
                r_ = int(relations2id[triple[1]])
                t_ = int(entities2id[triple[2]])
    
                valid_triple_list.append([h_, r_, t_])
    
        print("Complete load. entity : %d , relation : %d , train triple : %d, , valid triple : %d" % (
        len(entity), len(relation), len(triple_list), len(valid_triple_list)))
    
        return entity, relation, triple_list, valid_triple_list
    
    

    This computes, for each relation, the average number of tail entities per head entity (tph) and the average number of head entities per tail entity (hpt); both are used when constructing negative samples.

    For example, suppose a knowledge graph has 10 entities and n relations, and for one of the relations 2 head entities connect to 5 tail entities in total; then tph = 5/2 = 2.5 and hpt = 2/5 = 0.4. If we corrupt a triple of this relation by replacing the head, the probability that the result is a genuinely false triple is about (10-2)/(10-1) = 8/9; if we replace the tail, it is about (10-5)/(10-1) = 5/9. So for this relation it is safer to replace the head when building negative samples.
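
    tph and hpt are typically combined into a per-relation probability of corrupting the head rather than the tail (the "bern" sampling strategy introduced with TransH); a sketch of that usage, with hypothetical names, assuming the dicts built above:

    import random

    def corrupt_bernoulli(triple, entities, relation_tph, relation_hpt):
        """Corrupt the head with probability tph / (tph + hpt), otherwise corrupt the tail."""
        h, r, t = triple
        tph, hpt = relation_tph[r], relation_hpt[r]
        if random.random() < tph / (tph + hpt):
            # many tails per head: replacing the head is less likely to yield a true triple
            return (random.choice(entities), r, t)
        else:
            return (h, r, random.choice(entities))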


    Generally speaking, training a model involves roughly four steps: data processing, model construction, training, and testing. The rest of this article follows these four steps.

    Data Processing

    FB15K is a collection of triples (synset, relation type, triple) extracted from Freebase (http://www.freebase.com). The dataset can be viewed as a 3-mode tensor describing ternary relations between synsets.

    It contains three kinds of files:

    • train.txt 36M

    • valid.txt 3.7K

    • test.txt 4.4M

    First, I defined a config.py file to hold the dataset-related paths.

    class Config(object):
        def __init__(self):
            super()
            self.train_fb15k = "./datasets/fb15k/train.txt"     # training set path
            self.test_fb15k = "./datasets/fb15k/test.txt"       # test set path
            self.valid_fb15k = "./datasets/fb15k/valid.txt"     # validation set path
            self.entity2id_train_file = "./datasets/fb15k/entity2id_train.txt"      # entity-to-index map (train)
            self.relation2id_train_file = "./datasets/fb15k/relation2id_train.txt"  # relation-to-index map (train)
            self.entity2id_test_file = "./datasets/fb15k/entity2id_test.txt"        # entity-to-index map (test)
            self.relation2id_test_file = "./datasets/fb15k/relation2id_test.txt"    # relation-to-index map (test)
            self.entity2id_valid_file = "./datasets/fb15k/entity2id_valid.txt"      # entity-to-index map (valid)
            self.relation2id_valid_file = "./datasets/fb15k/relation2id_valid.txt"  # relation-to-index map (valid)
            self.entity_50dim_batch400 = "./datasets/fb15k/entity_50dim_batch400"       # trained 50-dim entity embeddings, 400 batches
            self.relation_50dim_batch400 = "./datasets/fb15k/relation_50dim_batch400"   # trained 50-dim relation embeddings, 400 batches

    Then I defined a data_process.py file whose Datasets class handles the data processing.

    import os
    from config import Config

    class Datasets(object):
        def __init__(self, config):
            super()
            self.config = config
            self.entity2id = {}
            self.relation2id = {}

        def load_data(self, file_path):
            '''
            Load data.
            :param file_path: path of the data file
            :return: the raw lines of the file as a list
            '''
            with open(file_path, "r", encoding="utf-8") as f:
                lines = f.readlines()
            return lines

        def build_data2id(self, is_test=False):
            '''
            Convert the string data to integer indices and save the mappings to the configured paths.
            :param is_test: whether to use the test set
            :return: None
            '''
            # load data
            lines = []
            if not is_test:
                lines = self.load_data(self.config.train_fb15k)
                print("load train data completely.")
            else:
                lines = self.load_data(self.config.test_fb15k)
                print("load test data completely.")
            # process line by line
            idx_e = 0
            idx_r = 0
            for line in lines:
                line = line.strip().split("\t")
                self.entity2id.setdefault(line[0], idx_e)
                idx_e += 1
                self.entity2id.setdefault(line[2], idx_e)
                idx_e += 1
                self.relation2id.setdefault(line[1], idx_r)
                idx_r += 1
            # save entity2id
            if not os.path.exists(self.config.entity2id_train_file):
                with open(self.config.entity2id_train_file, "a+", encoding="utf-8") as f:
                    for k, v in self.entity2id.items():
                        entry = k + " " + str(v) + "\n"
                        f.write(entry)
            # save relation2id
            if not os.path.exists(self.config.relation2id_train_file):
                with open(self.config.relation2id_train_file, "a+", encoding="utf-8") as f:
                    for k, v in self.relation2id.items():
                        entry = k + " " + str(v) + "\n"
                        f.write(entry)

        def build_data(self):
            '''
            Convert the string triples into their index representation.
            :return:
                entity_set: set of entity indices
                relation_set: set of relation indices
                triple_list: list of triples as [h, r, t] indices
            '''
            # save entities
            entity_set = set()
            # save relations
            relation_set = set()
            # save triples
            triple_list = []
            # load data
            lines = self.load_data(self.config.train_fb15k)
            # build data
            for line in lines:
                triple = line.strip().split("\t")
                # h, r, t of a triple
                h_ = self.entity2id[triple[0]]
                r_ = self.relation2id[triple[1]]
                t_ = self.entity2id[triple[2]]
                entity_set.add(h_)
                entity_set.add(t_)
                relation_set.add(r_)
                triple_list.append([h_, r_, t_])
            return entity_set, relation_set, triple_list

    With this script in place we can walk through the data processing.

    The raw data can be read with load_data and printed:

    config = Config()
    datasets = Datasets(config)
    lines = datasets.load_data(config.train_fb15k)
    print(lines[:1])

    The console then shows something like this:

    [console output: the first raw training triple]

    The line above is actually one triple in the format (head entity, relation, tail entity), with the elements separated by \t; written out, it is (/m/027rn, /location/country/form_of_government, /m/06cx9), which is the RDF representation used in knowledge graphs. Knowing the input format, we can process it. Computers do not handle string data well, and strings also take more memory to store, so we convert them to an index-based (one-hot style) encoding by calling build_data2id.

    datasets.build_data2id()
    print("entity to index:")
    print(list(datasets.entity2id.items())[:1])
    print("relation to index:")
    print(list(datasets.relation2id.items())[:1])

    This prints the first entry of the entity-to-index map and the first entry of the relation-to-index map:

    [console output: the first entity-to-index and relation-to-index entries]

    As said above, computers do not work well with strings, so the data fed into the model must also use this index representation. Take the following triple as an example:

    [the same raw triple as above]

    After conversion it becomes [0, 0, 1], meaning that the entity with index 0 is connected to the entity with index 1 by the relation with index 0. This conversion from raw data to index representation is implemented by the build_data method.

    entity_set, relation_set, triple_list = datasets.build_data()

    With this, the data is in the format the model needs.

    Model Construction

    This part explains how TransE is implemented. The paper [1] gives pseudocode for the TransE training algorithm:

    [figure: TransE training pseudocode from the paper]

    The pseudocode has two parts: an initialization step and the training loop. Following it, the model is implemented in a file TransE.py.

    from utils import distanceL1, distanceL2import numpy as npimport randomimport timeimport copyimport codecsclass TranSE(object):    def __init__(self, entity_set, relation_set, triple_list, embedding_dim=50, epochs = 400, learning_rate=0.01, margin=1, norm="L1"):        super()        self.entities = entity_set  # after embedding init, for each elem in entities = index: [embedding_dim]        self.relations = relation_set   # as same as entities        self.triples = triple_list        self.embedding_dim = embedding_dim        self.epochs = epochs        self.learning_rate = learning_rate        self.margin = margin        self.norm = norm    def embedding_init(self):        entity_dic = {}        relation_dic = {}        for relation in self.relations:            l = np.random.uniform(-6.0 / np.sqrt(self.embedding_dim), 6.0 / np.sqrt(self.embedding_dim), self.embedding_dim)            relation_dic[relation] = l / np.linalg.norm(l, ord=2)        self.relations = relation_dic        for entity in self.entities:            e = np.random.uniform(-6.0 / np.sqrt(self.embedding_dim), 6.0 / np.sqrt(self.embedding_dim), self.embedding_dim)            entity_dic[entity] = e        self.entities = entity_dic    def train(self):        nbatches = 400        batch_size = len(self.triples) // nbatches        print("batch size: %d" % batch_size)        for epoch in range(self.epochs):            start_time = time.time()            self.loss = 0.0            for k in range(nbatches):                Sbatch = random.sample(self.triples, batch_size)                Tbatch = []                for triple in Sbatch:                    corrupted_triple = self.corrupt_triple(triple)                    if corrupted_triple not in Tbatch:                        Tbatch.append((triple, corrupted_triple))                self.update_embeddings(Tbatch)            end_time = time.time()            print("Epoch: %d, cost time: %d second." % (epoch, end_time - start_time))            print("loss: %.6f" % self.loss)            # 保存训练临时结果            if epoch % 20 == 0:                with codecs.open("./datasets/fb15k/entities_temp", "w", encoding="utf-8") as f:                    for e in self.entities.keys():                        f.write(str(e) + "\t")                        f.write(str(list(self.entities[e])))                        f.write("\n")                with codecs.open("./datasets/fb15k/relations_temp", "w", encoding="utf-8") as f:                    for r in self.relations.keys():                        f.write(str(r) + "\t")                        f.write(str(list(self.relations[r])))                        f.write("\n")        # 保存最终结果        print("保存训练好的向量...")        with codecs.open("./datasets/fb15k/entity_50dim_batch400", "w") as f:            for e in self.entities.keys():                f.write(str(e) + "\t")                f.write(str(list(self.entities[e])))                f.write("\n")        with codecs.open("./datasets/fb15k/relation_50dim_batch400", "w") as f:            for r in self.relations.keys():                f.write(str(r) + "\t")                f.write(str(list(self.relations[r])))                f.write("\n")        print("向量保存结束")    def corrupt_triple(self, triple):        '''        The set of corrupted triplets is composed of training triplets        with either the head or tail replaced by a random entity        (but not both at the same time).        
:param triple: a triple        :return: corrupt_triple        '''        corrupted_triple = copy.deepcopy(triple)        seed = random.random()        if seed > 0.5:            # replace head            rand_head = triple[0]            while rand_head == triple[0]:                rand_head = random.sample([_ for _ in self.entities.keys()], 1)[0]            corrupted_triple[0] = rand_head        else:            # replace tail            rand_tail = triple[2]            while rand_tail == triple[2]:                rand_tail = random.sample([_ for _ in self.entities.keys()], 1)[0]            corrupted_triple[2] = rand_tail        return corrupted_triple    def update_embeddings(self, Tbatch):        copy_entities = copy.deepcopy(self.entities)        copy_relations = copy.deepcopy(self.relations)        for triple, corrupted_triple in Tbatch:            # 取copy里的vector累积更新            h_correct_update = copy_entities[triple[0]]            relation_update = copy_relations[triple[1]]            t_correct_update = copy_entities[triple[2]]            h_corrupt_update = copy_entities[corrupted_triple[0]]            t_corrupt_update = copy_entities[corrupted_triple[2]]            # 取原始的vector计算梯度            h_correct = copy_entities[triple[0]]            relation = copy_relations[triple[1]]            t_correct = copy_entities[triple[2]]            h_corrupt = self.entities[corrupted_triple[0]]            t_corrupt = self.entities[corrupted_triple[2]]            if self.norm == "L1":                dist_correct = distanceL1(h_correct, relation, t_correct)                dist_corrupt = distanceL2(h_corrupt, relation, t_corrupt)            else:                dist_correct = distanceL2(h_correct, relation, t_correct)                dist_corrupt = distanceL2(h_corrupt, relation, t_corrupt)            err = self.hinge_loss(dist_correct, dist_corrupt)            if err > 0:                self.loss += err                # 计算两个距离的梯度                # 问题: 更新梯度的时候 不应该针对h或者t分别进行更新吗?                # 这样的话grad应该进行区分 而不是直接对d内进行求导吧?                
grad_pos = 2 * (h_correct + relation - t_correct)                grad_neg = 2 * (h_corrupt + relation - t_corrupt)                if self.norm == "L1":                    # L1正则化, 该距离的梯度为±1                    for i in range(len(grad_pos)):                        if grad_pos[i] > 0:                            grad_pos[i] = 1                        else:                            grad_pos[i] = -1                    for i in range(len(grad_neg)):                        if grad_neg[i] > 0:                            grad_neg[i] = 1                        else:                            grad_neg[i] = -1                # 更新梯度                # 减去+d的梯度                # 在求距离公式中, h相关系数为正, 要减去                h_correct_update -= self.learning_rate * grad_pos                # 在求距离公式中, t相关系数为负, 要减去                t_correct_update -= (-1) * self.learning_rate * grad_pos                # 减去-d的梯度                # 这部分需要考虑是替换的h还是t, 需要区别更新                if triple[0] == corrupted_triple[0]: # 两个三元组的h相同, 替换了t                    h_correct_update -= (-1) * self.learning_rate * grad_neg                    t_corrupt_update -= self.learning_rate * grad_neg                if triple[2] == corrupted_triple[2]: # 两个三元组的t相同, 替换了h                    h_corrupt_update -= (-1) * self.learning_rate * grad_neg                    t_correct_update -= self.learning_rate * grad_neg                # 更新r的梯度                relation_update -= self.learning_rate * grad_pos                relation_update -= (-1) * self.learning_rate * grad_neg        # batch norm        for i in copy_entities.keys():            copy_entities[i] /= np.linalg.norm(copy_entities[i])        for i in copy_relations.keys():            copy_relations[i] /= np.linalg.norm(copy_relations[i])        # 批处理更新向量        self.entities = copy_entities        self.relations = copy_relations    def hinge_loss(self, dist_correct, dist_corrupt):        '''        loss function        :param dist_correct:        :param dist_corrupt:        :return:        '''        return max(0, self.margin + dist_correct - dist_corrupt)

    The initialization step corresponds to the embedding_init method of the TranSE class, which assigns every entity and every relation a random vector of dimension embedding_dim, so that in the end self.relations = {"0": [embedding]} and self.entities has the same form. Then comes the training loop: in every epoch, a minibatch Sbatch of batch_size triples is drawn by random sampling; this is the number of samples the model consumes per batch, corresponding to this part of the paper:

    [excerpt from the paper: sample a minibatch Sbatch and build the set Tbatch]

    Here Tbatch is the set of pairs made of a correct triple (positive example) and its corrupted counterpart (negative example). When reading the paper, how to corrupt a positive example is a key point; the paper says:

    [excerpt from the paper: a corrupted triplet replaces either the head or the tail with a random entity, but never both at once]

    So each positive triple is corrupted in either its head or its tail, which corresponds to the corrupt_triple method of the TranSE class.

        def corrupt_triple(self, triple):
            '''
            The set of corrupted triplets is composed of training triplets
            with either the head or tail replaced by a random entity
            (but not both at the same time).
            :param triple: a triple
            :return: corrupt_triple
            '''
            corrupted_triple = copy.deepcopy(triple)
            seed = random.random()
            if seed > 0.5:
                # replace head
                rand_head = triple[0]
                while rand_head == triple[0]:
                    rand_head = random.sample([_ for _ in self.entities.keys()], 1)[0]
                corrupted_triple[0] = rand_head
            else:
                # replace tail
                rand_tail = triple[2]
                while rand_tail == triple[2]:
                    rand_tail = random.sample([_ for _ in self.entities.keys()], 1)[0]
                corrupted_triple[2] = rand_tail
            return corrupted_triple

    Here a random draw decides whether the head or the tail of the input triple gets replaced.

    update embeddings

    The core of TransE is learning the embedding vectors of the entities and relations. Given the corrupted triples, the embeddings are updated by minimizing the loss function used in the paper:

    L = \sum_{(h,r,t) \in S} \sum_{(h',r,t') \in S'_{(h,r,t)}} [\gamma + d(h + r, t) - d(h' + r, t')]_+

    The [x]_+ in this formula denotes the positive part, so this is a hinge loss; accordingly the TranSE class defines an update_embeddings method and a hinge_loss method. Updating the embeddings means differentiating the loss and following the gradients, and note that the gradient coming from the correct triple and the gradient coming from the corrupted triple have to be applied separately.

        def update_embeddings(self, Tbatch):        copy_entities = copy.deepcopy(self.entities)        copy_relations = copy.deepcopy(self.relations)        for triple, corrupted_triple in Tbatch:            # 取copy里的vector累积更新            h_correct_update = copy_entities[triple[0]]            relation_update = copy_relations[triple[1]]            t_correct_update = copy_entities[triple[2]]            h_corrupt_update = copy_entities[corrupted_triple[0]]            t_corrupt_update = copy_entities[corrupted_triple[2]]            # 取原始的vector计算梯度            h_correct = copy_entities[triple[0]]            relation = copy_relations[triple[1]]            t_correct = copy_entities[triple[2]]            h_corrupt = self.entities[corrupted_triple[0]]            t_corrupt = self.entities[corrupted_triple[2]]            if self.norm == "L1":                dist_correct = distanceL1(h_correct, relation, t_correct)                dist_corrupt = distanceL2(h_corrupt, relation, t_corrupt)            else:                dist_correct = distanceL2(h_correct, relation, t_correct)                dist_corrupt = distanceL2(h_corrupt, relation, t_corrupt)            err = self.hinge_loss(dist_correct, dist_corrupt)            if err > 0:                self.loss += err                # 计算两个距离的梯度                # 问题: 更新梯度的时候 不应该针对h或者t分别进行更新吗?                # 这样的话grad应该进行区分 而不是直接对d内进行求导吧?                grad_pos = 2 * (h_correct + relation - t_correct)                grad_neg = 2 * (h_corrupt + relation - t_corrupt)                if self.norm == "L1":                    # L1正则化, 该距离的梯度为±1                    for i in range(len(grad_pos)):                        if grad_pos[i] > 0:                            grad_pos[i] = 1                        else:                            grad_pos[i] = -1                    for i in range(len(grad_neg)):                        if grad_neg[i] > 0:                            grad_neg[i] = 1                        else:                            grad_neg[i] = -1                # 更新梯度                # 减去+d的梯度                # 在求距离公式中, h相关系数为正, 要减去                h_correct_update -= self.learning_rate * grad_pos                # 在求距离公式中, t相关系数为负, 要减去                t_correct_update -= (-1) * self.learning_rate * grad_pos                # 减去-d的梯度                # 这部分需要考虑是替换的h还是t, 需要区别更新                if triple[0] == corrupted_triple[0]: # 两个三元组的h相同, 替换了t                    h_correct_update -= (-1) * self.learning_rate * grad_neg                    t_corrupt_update -= self.learning_rate * grad_neg                if triple[2] == corrupted_triple[2]: # 两个三元组的t相同, 替换了h                    h_corrupt_update -= (-1) * self.learning_rate * grad_neg                    t_correct_update -= self.learning_rate * grad_neg                # 更新r的梯度                relation_update -= self.learning_rate * grad_pos                relation_update -= (-1) * self.learning_rate * grad_neg        # batch norm        for i in copy_entities.keys():            copy_entities[i] /= np.linalg.norm(copy_entities[i])        for i in copy_relations.keys():            copy_relations[i] /= np.linalg.norm(copy_relations[i])        # 批处理更新向量        self.entities = copy_entities        self.relations = copy_relations    def hinge_loss(self, dist_correct, dist_corrupt):        '''        loss function        :param dist_correct:        :param dist_corrupt:        :return:        '''        return max(0, self.margin + dist_correct - dist_corrupt)

    The paper measures the dissimilarity of vectors with the L1-norm or the L2-norm. I defined these two functions in utils.py to compute the dissimilarity between embeddings; the code uses the L1-norm by default, and you can switch to the L2-norm by setting the norm parameter to "L2" when constructing the model.
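
    utils.py itself is not shown in the article; a plausible minimal version of the two helpers, matching how they are called above (this is an assumption, not the author's exact code), would be:

    import numpy as np

    def distanceL1(h, r, t):
        # L1 dissimilarity between h + r and t
        return np.sum(np.fabs(h + r - t))

    def distanceL2(h, r, t):
        # squared L2 dissimilarity between h + r and t
        return np.sum(np.square(h + r - t))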

    Training

    On top of all this, training is driven by a train.py script that ties the pieces together.

    from config import Config
    from data_process import Datasets
    from TranSE import TranSE

    if __name__ == "__main__":
        config = Config()
        datasets = Datasets(config=config)
        print("build data to id...")
        datasets.build_data2id()
        print("data to id built. load data...")
        entity_set, relation_set, triple_list = datasets.build_data()
        print("load entities: %d, relations: %d, triples: %d" % (len(entity_set), len(relation_set), len(triple_list)))
        print("entities:")
        model = TranSE(entity_set=entity_set, relation_set=relation_set, triple_list=triple_list, embedding_dim=50)
        # initialise the embeddings
        model.embedding_init()
        # start training
        model.train()

    This only produces the trained entity and relation embeddings; their quality still has to be evaluated. For that I use the link prediction task, with hits@10 and mean rank as the metrics; the evaluation code is in test.py.
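
    To make the two metrics concrete, here is a small illustrative sketch (simplified; the real test.py ranks every entity of the graph as a candidate head and as a candidate tail for every test triple):

    def rank_of_correct(distances, correct_id):
        """1-based rank of the correct entity when candidates are sorted by distance."""
        order = sorted(distances, key=distances.get)   # candidate ids, closest first
        return order.index(correct_id) + 1

    # hypothetical distances of three candidate tails for one test triple
    distances = {"e1": 0.9, "e2": 0.2, "e3": 1.4}
    r = rank_of_correct(distances, "e1")   # rank 2
    hits_at_10 = int(r <= 10)              # this triple counts towards hits@10
    # mean rank is simply the average of r over all test triples
    print(r, hits_at_10)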

    from config import Configfrom data_process import Datasetsimport numpy as npimport timeimport operatorimport codecsdef load_data2id(config):    # entity to index    entity2id = {}    # relation to index    relation2id = {}    # load entity to index    with codecs.open(config.entity2id_train_file, encoding="utf-8") as f:        lines = f.readlines()    for line in lines:        entity, idx_e = line.strip().split(" ")        entity2id.setdefault(entity, int(idx_e))    # load relation to index    with codecs.open(config.relation2id_train_file, encoding="utf-8") as f:        lines = f.readlines()    for line in lines:        relation, idx_r = line.strip().split(" ")        relation2id.setdefault(relation, int(idx_r))    return entity2id, relation2iddef data_loader(entity_file, relation_file, test_file, entity2id, relation2id):    entity_dic = {}    relation_dic = {}    test_triple = []    # load entity to embedding    with codecs.open(entity_file, encoding="utf-8") as f:        lines = f.readlines()    for line in lines:        entity, emb = line.strip().split("\t")        entity_dic.setdefault(int(entity), eval(emb))    # load relation to embedding    with codecs.open(relation_file, encoding="utf-8") as f:        lines = f.readlines()    for line in lines:        relation, emb = line.strip().split("\t")        relation_dic.setdefault(int(relation), eval(emb))    # convert test data to the representation of index    with codecs.open(test_file, encoding="utf-8") as f:        lines = f.readlines()    for line in lines:        triple = line.strip().split("\t")        h_ = int(entity2id[triple[0]])        r_ = int(relation2id[triple[1]])        t_ = int(entity2id[triple[2]])        test_triple.append(tuple((h_, r_, t_)))    return entity_dic, relation_dic, test_tripleclass Test(object):    def __init__(self, entity_dic, relation_dic, test_triple, train_triple, is_Fit=False):        super()        self.entity_dic = entity_dic        self.relation_dic = relation_dic        self.test_triple = test_triple        self.train_triple = train_triple        self.is_Fit = is_Fit        self.mean_rank = 0        self.hits10 = 0        self.relation_hits10 = 0        self.relation_mean_rank = 0    def distance(self, h, r, t):        '''        This function is aimed to compute the dissimilarity between h + l and t, which is based on L2-norm.        
:param h: head entity        :param r: relation        :param t: tail entity        :return: a value of dissimilarity        '''        h = np.array(h)        r = np.array(r)        t = np.array(t)        s = h + r - t        return np.linalg.norm(s)    def rank(self):        hits = 0        rank_sum = 0        step = 0        for t in self.test_triple:            rank_head_dict = {}            rank_tail_dict = {}            for e in self.entity_dic.keys():                corrupted_head_triple = [e, t[1], t[2]]                corrupted_tail_triple = [t[0], t[1], e]                # corrupt head of test triple                if self.is_Fit:                    if corrupted_head_triple not in self.train_triple:                        h_emb = self.entity_dic[corrupted_head_triple[0]]                        r_emb = self.relation_dic[corrupted_head_triple[1]]                        t_emb = self.entity_dic[corrupted_head_triple[2]]                        rank_head_dict[tuple(corrupted_head_triple)] = self.distance(h_emb, r_emb, t_emb)                else:                    h_emb = self.entity_dic[corrupted_head_triple[0]]                    r_emb = self.relation_dic[corrupted_head_triple[1]]                    t_emb = self.entity_dic[corrupted_head_triple[2]]                    rank_head_dict[tuple(corrupted_head_triple)] = self.distance(h_emb, r_emb, t_emb)                # corrupt tail of test triple                if self.is_Fit:                    if corrupted_tail_triple not in self.train_triple:                        h_emb = self.entity_dic[corrupted_tail_triple[0]]                        r_emb = self.relation_dic[corrupted_tail_triple[1]]                        t_emb = self.entity_dic[corrupted_tail_triple[2]]                        rank_tail_dict[tuple(corrupted_tail_triple)] = self.distance(h_emb, r_emb, t_emb)                else:                    h_emb = self.entity_dic[corrupted_tail_triple[0]]                    r_emb = self.relation_dic[corrupted_tail_triple[1]]                    t_emb = self.entity_dic[corrupted_tail_triple[2]]                    rank_tail_dict[tuple(corrupted_tail_triple)] = self.distance(h_emb, r_emb, t_emb)            # sort head & tail dict            rank_head_sorted_dict = sorted(rank_head_dict.items(), key=operator.itemgetter(1))            rank_tail_sorted_dict = sorted(rank_tail_dict.items(), key=operator.itemgetter(1))            # save the rank of correct head entity, compute hits10 and mean rank.            for i in range(len(rank_head_sorted_dict)):                if t[0] == rank_head_sorted_dict[i][0][0]:                    if i < 10:                        hits += 1                    # save the rank of correct head entity                    rank_sum = rank_sum + i + 1                    break            # save the rank of correct tail entity, compute hits10 and mean rank.            
for i in range(len(rank_tail_sorted_dict)):                if t[2] == rank_head_sorted_dict[i][0][2]:                    if i < 10:                        hits += 1                    # save the rank of head entity                    rank_sum = rank_sum + i + 1                    break            # print result            step += 1            if step % 5000 == 0:                print("hits: ", hits, "rank sum: ", rank_sum)        # save final result        self.hits10 = hits / (2 * len(self.test_triple))        self.mean_rank = rank_sum / len(self.test_triple)        print("hits10: ",self.hits10, "mean rank: ", self.mean_rank)    def rank_relation(self):        hits = 0        rank_sum = 0        step = 0        for t in self.test_triple:            rank_relation_dic = {}            for r in self.relation_dic.keys():                corrupted_relation_triple = [t[0], r, t[2]]                if self.is_Fit:                    if corrupted_relation_triple not in self.train_triple:                        h_emb = self.entity_dic[corrupted_relation_triple[0]]                        r_emb = self.relation_dic[corrupted_relation_triple[1]]                        t_emb = self.entity_dic[corrupted_relation_triple[2]]                        rank_relation_dic[tuple(h_emb, r_emb, t_emb)] = self.distance(h_emb, r_emb, t_emb)                else:                    h_emb = self.entity_dic[corrupted_relation_triple[0]]                    r_emb = self.relation_dic[corrupted_relation_triple[1]]                    t_emb = self.entity_dic[corrupted_relation_triple[2]]                    rank_relation_dic[tuple(h_emb,r_emb,t_emb)] = self.distance(h_emb, r_emb, t_emb)            # sorted relation triple            rank_relation_sorted_triple = sorted(rank_relation_dic.items(), key=operator.itemgetter(1))            # compute hits10 and rank sum            for i in range(len(rank_relation_sorted_triple)):                if t[1] == rank_relation_sorted_triple[i][0][1]:                    if i < 10:                        hits += 1                    rank_sum = rank_sum + i + 1                    break            step += 1            if step % 5000 == 0:                print("hits: ", hits, "rank sum: ", rank_sum)        # print results        self.hits10 = hits / len(self.test_triple)        self.mean_rank = hits / len(self.test_triple)        print("hits10: %.6f, mean rank: %.6f" % (self.hits10, self.mean_rank))if __name__=="__main__":    print("load test config....")    config = Config()    print("load config success.")    print("load data to index...")    entity2id, relation2id = load_data2id(config)    print("load success. load %d of entities, %d types of relation." % (len(entity2id), len(relation2id)))    print("load train triple...")    datasets = Datasets(config)    datasets.build_data2id()    _, _, train_triple = datasets.build_data()    print("load train triple success.")    print("load entity_dic, relation_dic, test triple...")    entity_dic, relation_dic, test_triple = data_loader(config.entity_50dim_batch400, config.relation_50dim_batch400, config.test_fb15k, entity2id, relation2id)    print("load success. %d of entity" % (len(entity_dic)))    print("begin test with is_Fit = False...")    test = Test(entity_dic, relation_dic, test_triple, train_triple)    start_time = time.time()    print("rank entity...")    test.rank()    print("rank relation...")    test.rank_relation()    end_time = time.time()    print("ended. 
cost time: %d second" % (end_time - start_time))        print("begin test with is_Fit = True...")    test.is_Fit = True    start_time = time.time()    print("rank entity...")    test.rank()    print("rank relation...")    test.rank_relation()    end_time = time.time()    print("ended. cost time: %d second" % (end_time - start_time))

    Training for 400 epochs gives the following results:

    hits:  5720 rank sum:  3262746

    hits:  11407 rank sum:  6664410

    hits:  17174 rank sum:  10058771

    hits:  22906 rank sum:  13428817

    hits:  28650 rank sum:  16678083

    hits:  34355 rank sum:  19921455

    hits:  40085 rank sum:  23137841

    hits:  45819 rank sum:  26537245

    hits:  51561 rank sum:  29760948

    hits:  57313 rank sum:  32927922

    hits:  63045 rank sum:  36246251

    hits10:  0.5731746542296559 mean rank:  658.1611619915018

    cost time: 19170 second

    These numbers are still far from the results reported in the paper, which may be partly because the model was not trained for enough epochs. Training is also very slow, so the next step is to rewrite it in PyTorch.

    References

    [1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. Neural Information Processing Systems (NIPS), Dec 2013, South Lake Tahoe, United States. pp.1-9. ⟨hal-00920777⟩
