  • distributed_skipgram_mixture — Distributed Multisense Word Embedding (DMWE): a tool that parallelizes the skip-gram mixture [1] algorithm on the DMTK parameter server, providing multi-sense word embeddings...
  • Implementing skip-gram in TensorFlow

    2019-11-02 15:39:50

    This post implements skip-gram following the skip-gram material in the Natural Language Processing with TensorFlow book.

    skip gram: word_list -> embedding -> softmax(embedding) -> optimize loss -> select the best embedding & weight & bias

    Main idea:

    1. Given a batch of data, generate the (target word, context word) pairs within the given window_size; in this code, train_dataset holds the target words and labels holds the context words (a short toy sketch of this pairing follows the list below).

    2. Initialize the embeddings (looked up according to train_dataset), the softmax weights, and the biases.

    3. Keep optimizing the embeddings, weights, and biases with the loss function and optimizer; the goal is to find the best embedding for each word.

    4. Compute word similarities. In this code, 10 high-frequency words and 10 low-frequency words are chosen as the valid_dataset; their embeddings are looked up, cosine similarities are computed, the most similar words for each of these 20 words are found from those similarities, and the similarities are recomputed as the embeddings keep being optimized.
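    As a toy illustration of step 1, here is a minimal sketch of turning a token list into (target word, context word) pairs for a fixed window; the sentence and window_size are made up for illustration, and the actual script below uses the deque-based generat_batch_skip_gram instead.

    def toy_pairs(words, window_size=2):
        # pair each center word with every word inside its +/- window_size neighbourhood
        pairs = []
        for i, target in enumerate(words):
            lo = max(0, i - window_size)
            hi = min(len(words), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, words[j]))
        return pairs

    print(toy_pairs(['我', '爱', '自然', '语言', '处理']))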

    Notes:

    The data used here is the 1998 People's Daily corpus; message me if you need it.

    The code cannot produce the plot, because pylab.cm.spectral no longer exists (probably a matplotlib version change), so if anyone knows how to deal with it, feel free to get in touch.
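    For what it's worth, recent matplotlib releases removed the 'spectral' colormap and 'nipy_spectral' is the usual replacement; below is only a minimal sketch of the colour list built in plot() with that substitution (n_clusters is the value used in plot() further down):

    from matplotlib import pylab

    n_clusters = 20
    # 'spectral' was removed from newer matplotlib; 'nipy_spectral' covers the same colour range
    label_colors = [pylab.cm.nipy_spectral(float(i) / n_clusters) for i in range(n_clusters)]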

    To actually use this as a model, the main job is to train an embedding for each word; once you have those embeddings you can build whatever you need on top of them (a rough usage sketch follows).
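    Since the script below saves the normalized embeddings to skip_embeddings.npy, one possible way to use them afterwards looks like this sketch; it assumes you also keep reverse_dictionary around (or rebuild it the same way) to map ids back to words.

    import numpy as np

    # normalized embeddings saved by the training script below
    emb = np.load('skip_embeddings.npy')

    def nearest(word_id, top_k=8):
        # rows are unit-norm, so cosine similarity is just a dot product
        sims = emb @ emb[word_id]
        return (-sims).argsort()[1:top_k + 1]

    print(nearest(42))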

     

    The results are shown below; honestly, they feel rather mediocre...

     

    from matplotlib import pylab
    import tensorflow as tf
    import collections
    import math
    import numpy as np
    import os
    import random
    import bz2
    from six.moves import range
    from six.moves.urllib.request import urlretrieve
    from sklearn.cluster import KMeans
    import operator
    import csv
    from sklearn.manifold import TSNE
    from math import ceil
    
    
    # https://www.evanjones.ca/software/wikipedia2text.tar.bz2
    class Skip_gram:
        def __init__(self):
            self.url = 'http://www.evanjones.ca/software/'
            self.word_list = []
            self.voca_size = 20000
            self.data_index = 0
            self.batch_size = 128
            self.embedding_size = 128
            self.window_size = 4
            self.valid_size = 10
            self.valid_window = 50
            # randomly sample 10 ids from 0-49
            self.valid_examples = np.array(random.sample(range(self.valid_window),self.valid_size))
            # then sample another 10 ids from 1000-1049
            # e.g. [  46   17   16   19   34   41    2    4   43   28 1016 1035 1002 1005 1024 1034 1037 1041 1032 1028]
            self.valid_examples = np.append(self.valid_examples,random.sample(range(1000,1000+self.valid_window),self.valid_size),axis=0)
    
    
        def maybe_download(self,filename,expected_bytes):
            if not os.path.exists(filename):
                print('Downloading file')
                filename,_ = urlretrieve(self.url + filename,filename)
            statInfo = os.stat(filename)
            if statInfo.st_size == expected_bytes:
                print('Found and verified %s' % filename)
            else:
                print(statInfo.st_size)
                raise Exception(
                    'Failed to verify ' + filename + '. Can you get to it with a browser?')
            return filename
    
        def read_data(self,filename):
            for line in open(filename,encoding='utf8'):
                if line:
                    sub_word_list = line.split(' ')
                    for word in sub_word_list:
                        if word not in ['\n',',','“','”','、','。',',','的','了','在','和','有','是']:
                            self.word_list.append(word)
    
    
        #dictionary:{voca1:0,voca2:1,voca3:2}
        #reverse_dictionary:{0:voca1,1:voca2,2:voca3}
        #count:{voca1:2,voca2:3,voca3:5}
        # data: the text we read, with each word replaced by its word ID
        def build_dataset(self,words):
            count = [['UNK',-1]]
            count.extend(collections.Counter(words).most_common(self.voca_size-1))
            word_dictionary = dict()
            for word,_ in count:
                word_dictionary[word] = len(word_dictionary)
    
    
            data = list()
            unk_count = 0
            # frequent words keep their own index
            # infrequent words get the UNK index (0)
            for word in words:
                if word in word_dictionary:
                    index = word_dictionary[word]
                else:
                    index = 0
                    unk_count = unk_count + 1
    
                data.append(index)
    
            count[0][1] = unk_count
            reverse_dictionary = dict(zip(word_dictionary.values(),word_dictionary.keys()))
            assert len(word_dictionary) == self.voca_size
    
            return data,word_dictionary,reverse_dictionary,count
    
        def generat_batch_skip_gram(self,batch_size,window_size):
            # stores the target words
            batch = np.ndarray(shape=(batch_size),dtype=np.int32)
    
            # stores the context words
            label = np.ndarray(shape=(batch_size,1),dtype=np.int32)
    
            # number of context words + the target word itself
            span = 2*window_size+1
    
            # holds the indices of all words in the span, including the target
            buffer = collections.deque(maxlen=span)
    
            for _ in range(span):
                buffer.append(data[self.data_index])
                self.data_index = (self.data_index+1)%len(data)
    
            # number of context words sampled for a single target word
            num_sample = 2 * window_size
    
            # reading one batch is split into two loops:
            # the inner loop fills batch and labels with num_sample entries taken from the span
            # the outer loop repeats this batch_size//num_sample times to produce a complete batch
            for i in range(batch_size//num_sample):
                k=0
                # the target word itself is not predicted
                for j in list(range(window_size)) + list(range(window_size+1,2*window_size+1)):
                    # store the target word
                    #[0,1,2,3,target word,5,6,7,8]
                    #batch[0] = buffer[4]  batch[1] = buffer[4]
                    batch[i*num_sample + k] = buffer[window_size]
                    # label[0...7] = buffer[0,1,2,3] and buffer[5,6,7,8], i.e. the target word's context words
                    label[i*num_sample + k,0] = buffer[j]
                    k += 1
                # after one span is consumed, the span (buffer) slides right by one position: buffer is a deque
                # whose maxlen was set to 9 above, so appending one more element pops the head and pushes the new element onto the tail,
                # i.e. the span goes from [0,1,2,3,4,5,6,7,8] to [1,2,3,4,5,6,7,8,9] and the target word becomes 5
                # so each batch holds 16 target words matched with 128 context words
                buffer.append(data[self.data_index])
                self.data_index = (self.data_index + 1) % len(data)
            return batch,label
    
        def skip_gram(self):
            tf.reset_default_graph()
            train_dataset = tf.placeholder(tf.int32, shape=[self.batch_size])
            train_labels = tf.placeholder(tf.int32, shape=[self.batch_size, 1])
            valid_dataset = tf.constant(self.valid_examples, dtype=tf.int32)
            # initialize the embeddings
            embeddings = tf.Variable(tf.random_uniform([self.voca_size, self.embedding_size], -1.0, 1.0))
            # initialize the softmax weights; softmax maps the raw scores into [0,1], so they can be read as probabilities
            # and used for classification / prediction
            # the flow is: a word passes through the embedding layer and becomes a word vector, then softmax turns it into a prediction a,
            # logits = wx + b (in this code the embedding vectors are randomly initialized and then optimized)   softmax(logits) --> a
            # cross entropy is used as the loss, Loss = -Σ y*ln(a), to optimize the weights; y is the true label, a is the softmax output
            softmax_weights = tf.Variable(
                tf.truncated_normal([self.voca_size, self.embedding_size],
                                    stddev=0.5 / math.sqrt(self.embedding_size))
            )
            softmax_biases = tf.Variable(tf.random_uniform([self.voca_size], 0.0, 0.01))
            # tf.nn.embedding_lookup selects the rows of a tensor at the given indices,
            # i.e. it looks up the embeddings for the ids in train_dataset
            embed = tf.nn.embedding_lookup(embeddings, train_dataset)
            num_sampled = 2*self.window_size
            # define the loss function
            loss = tf.reduce_mean(
                tf.nn.sampled_softmax_loss(
                    weights=softmax_weights, biases=softmax_biases, inputs=embed,
                    labels=train_labels, num_sampled=num_sampled, num_classes=self.voca_size)
            )
            # compute the similarity between the minibatch examples and all embeddings using cosine distance
            # normalization factor
            norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
            normalized_embeddings = embeddings / norm
            valid_embeddings = tf.nn.embedding_lookup(
                normalized_embeddings, valid_dataset)
            similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
    
            # optimizer
            optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    
            num_steps = 100001
            skip_losses = []
            # ConfigProto is a way of providing various configuration settings
            # required to execute the graph
            with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
                # Initialize the variables in the graph
                tf.global_variables_initializer().run()
                print('Initialized')
                average_loss = 0
    
                # Train the Word2vec model for num_step iterations
                for step in range(num_steps):
                    # Generate a single batch of data
                    batch_data, batch_labels = self.generat_batch_skip_gram(
                        self.batch_size, self.window_size)
    
                    # Populate the feed_dict and run the optimizer (minimize loss)
                    # and compute the loss
                    feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
                    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    
                    # Update the average loss variable
                    average_loss += l
    
                    if (step + 1) % 2000 == 0:
                        if step > 0:
                            average_loss = average_loss / 2000
    
                        skip_losses.append(average_loss)
                        # The average loss is an estimate of the loss over the last 2000 batches.
                        print('Average loss at step %d: %f' % (step + 1, average_loss))
                        average_loss = 0
    
                    # Evaluating validation set word similarities
                    if (step + 1) % 10000 == 0:
                        sim = similarity.eval()
                        # Here we compute the top_k closest words for a given validation word
                        # in terms of the cosine distance
                        # We do this for all the words in the validation set
                        # Note: This is an expensive step
                        for i in range(self.valid_size):
                            valid_word = reverse_dictionary[self.valid_examples[i]]
                            top_k = 8  # number of nearest neighbors
                            nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                            log = 'Nearest to %s:' % valid_word
                            for k in range(top_k):
                                close_word = reverse_dictionary[nearest[k]]
                                log = '%s %s,' % (log, close_word)
                            print(log)
                skip_gram_final_embeddings = normalized_embeddings.eval()
    
            # We will save the word vectors learned and the loss over time
            # as this information is required later for comparisons
            np.save('skip_embeddings', skip_gram_final_embeddings)
    
            with open('skip_losses.csv', 'wt') as f:
                writer = csv.writer(f, delimiter=',')
                writer.writerow(skip_losses)
    
            num_points = 1000
            tsne = TSNE(perplexity=30,n_components=2,init='pca',n_iter=5000)
            print('Fitting embeddings to T-SNE. This can take some time ...')
            selected_embeddings = skip_gram_final_embeddings[:num_points,:]
            two_d_embeddings = tsne.fit_transform(selected_embeddings)
    
            print('Pruning the T-SNE embeddings')
            # prune the embeddings by getting ones only more than n-many sample above the similarity threshold
            # this unclutters the visualization
            selected_ids = self.find_cluster_embeddings(selected_embeddings, .25, 10)
            two_d_embeddings = two_d_embeddings[selected_ids, :]
    
            print('Out of ', num_points, ' samples, ', selected_ids.shape[0], ' samples were selected by pruning')
    
            # words = [reverse_dictionary[i] for i in selected_ids]
            # self.plot(two_d_embeddings,words)
    
        def find_cluster_embeddings(self,embeddings,distance_threshold,sample_threshold):
            consine_sim = np.dot(embeddings,np.transpose(embeddings))
            norm = np.dot(np.sum(embeddings ** 2, axis=1).reshape(-1, 1),
                          np.sum(np.transpose(embeddings) ** 2, axis=0).reshape(1, -1))
            assert consine_sim.shape == norm.shape
            consine_sim /= norm
            np.fill_diagonal(consine_sim,-1,0)
            argmax_cos_sim = np.argmax(consine_sim,axis=1)
            mod_cos_sim = consine_sim
            for _ in range(sample_threshold - 1):
                argmax_cos_sim = np.argmax(consine_sim, axis=1)
                mod_cos_sim[np.arange(mod_cos_sim.shape[0]), argmax_cos_sim] = -1
    
            max_cosine_sim = np.max(mod_cos_sim, axis=1)
    
            return np.where(max_cosine_sim>distance_threshold)[0]
    
        def plot(self,embeddings, labels):
            n_clusters = 20  # number of clusters
    
            # automatically build a discrete set of colors, each for cluster
            label_colors = [pylab.cm.spectral(float(i) / n_clusters) for i in range(n_clusters)]
    
            assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
    
            # Define K-Means
            kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(embeddings)
            kmeans_labels = kmeans.labels_
    
            pylab.figure(figsize=(15, 15))  # in inches
    
            # plot all the embeddings and their corresponding words
            for i, (label, klabel) in enumerate(zip(labels, kmeans_labels)):
                x, y = embeddings[i, :]
                pylab.scatter(x, y, c=label_colors[klabel])
    
                pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                               ha='right', va='bottom', fontsize=10)
    
            # use for saving the figure if needed
            # pylab.savefig('word_embeddings.png')
            pylab.show()
    
    if __name__ == '__main__':
        filename = 'C:/Users/dell/Desktop/WordSegment-master/data/train.txt'
        sg = Skip_gram()
        sg.read_data(filename)
        # print(len(sg.word_list))
        data,word_dictionary,reverse_dictionary,count = sg.build_dataset(sg.word_list)
        # batch,label = sg.generat_batch_skip_gram(8,1)
        sg.skip_gram()

     

  • SkipGram Model -Formulation

    2020-07-06 19:35:32
  • SkipGram_Negative_Sampling: a Pytorch implementation of the skipgram model with negative sampling. word2vec concepts. Supported features: skip-gram, batch updates, negative sampling, visualization. As shown below, the embeddings and words are saved to TSV files, and the two TSV files are then uploaded to ... for more...
  • A minimal PyTorch word2vec: a SkipGram module (plus a CBOW module) and a training script with negative sampling; the full code follows...
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    
    
    class SkipGram(nn.Module):
        def __init__(self, vocab_size, embd_size):
            super(SkipGram, self).__init__()
            self.embeddings = nn.Embedding(vocab_size, embd_size)
    
        def forward(self, focus, context):
            embed_focus = self.embeddings(focus)
            embed_ctx = self.embeddings(context)
            # score = torch.mm(embed_focus, torch.t(embed_ctx))
            score = torch.mul(embed_focus, embed_ctx).sum(dim=1)
            log_probs = score #F.logsigmoid(score)
    
            return log_probs
    
        def loss(self, log_probs, target):
            loss_fn = nn.BCEWithLogitsLoss()
            # loss_fn = nn.NLLLoss()
            loss = loss_fn(log_probs, target)
            return loss
    
    
    class CBOW(nn.Module):
        def __init__(self, vocab_size, embd_size, context_size, hidden_size):
            super(CBOW, self).__init__()
            self.embeddings = nn.Embedding(vocab_size, embd_size)
            self.linear1 = nn.Linear(2 * context_size * embd_size, hidden_size)
            self.linear2 = nn.Linear(hidden_size, vocab_size)
    
        def forward(self, inputs):
            embedded = self.embeddings(inputs).view((1, -1))
            hid = F.relu(self.linear1(embedded))
            out = self.linear2(hid)
            log_probs = F.log_softmax(out)
            return log_probs
    
    import random
    import re
    
    import torch
    import torch.optim as optim
    from tqdm import tqdm
    from pytorch_word2vec_model import SkipGram
    
    epochs = 50
    negative_sampling = 4
    window = 2
    vocab_size = 1
    embd_size = 300
    
    
    def batch_data(x, batch_size=128):
        in_w = []
        out_w = []
        target = []
        for text in x:
            for i in range(window, len(text) - window):
                # ids inside the current window: the positive pairs use them, and negative sampling must avoid them
                word_set = {text[i - 2], text[i - 1], text[i], text[i + 1], text[i + 2]}
                # 4 positive pairs: the center word paired with each context word in the +/-2 window
                in_w.append(text[i])
                in_w.append(text[i])
                in_w.append(text[i])
                in_w.append(text[i])
    
                out_w.append(text[i - 2])
                out_w.append(text[i - 1])
                out_w.append(text[i + 1])
                out_w.append(text[i + 2])
    
                target.append(1)
                target.append(1)
                target.append(1)
                target.append(1)
                # negative sampling
                count = 0
                while count < negative_sampling:
                    rand_id = random.randint(0, vocab_size-1)
                    if not rand_id in word_set:
                        in_w.append(text[i])
                        out_w.append(rand_id)
                        target.append(0)
                        count += 1
    
                if len(out_w) >= batch_size:
                    yield [in_w, out_w, target]
                    in_w = []
                    out_w = []
                    target = []
        if out_w:
            yield [in_w, out_w, target]
    
    
    def train(train_text_id, model,opt):
        model.train()  # enable dropout / batch normalization (training mode)
        ave_loss = 0
        pbar = tqdm()
        cnt=0
        for x_batch in batch_data(train_text_id):
            in_w, out_w, target = x_batch
            in_w_var = torch.tensor(in_w)
            out_w_var = torch.tensor(out_w)
            target_var = torch.tensor(target,dtype=torch.float)
    
            model.zero_grad()
            log_probs = model(in_w_var, out_w_var)
            loss = model.loss(log_probs, target_var)
            loss.backward()
            opt.step()
            ave_loss += loss.item()
            pbar.update(1)
            cnt += 1
            pbar.set_description('< loss: %.5f >' % (ave_loss / cnt))
        pbar.close()
    text_id = []
    vocab_dict = {}
    
    with open('corpus.txt',encoding='utf-8') as fp:
        for line in fp:
            lines = re.sub("[^A-Za-z0-9']+", ' ', line).lower().split()
            line_id = []
            for s in lines:
                if not s:
                    continue
                if s not in vocab_dict:
                    vocab_dict[s] = len(vocab_dict)
                id = vocab_dict[s]
                line_id.append(id)
                if id == 11500:
                    # debug check on one specific word id
                    print(id, s)
            text_id.append(line_id)
    vocab_size = len(vocab_dict)
    print('vocab_size', vocab_size)
    model = SkipGram(vocab_size, embd_size)
    
    for epoch in range(epochs):
        print('epoch', epoch)
        opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                               lr=0.001, weight_decay=0)
        train(text_id, model, opt)
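    After training, the learned vectors live in the model's nn.Embedding layer; one possible way to pull them out and query neighbours is sketched below (only a sketch: the word-id argument is arbitrary, and mapping ids back to words would need the inverse of vocab_dict).

    import numpy as np

    # pull the trained vectors out of the embedding layer
    vectors = model.embeddings.weight.detach().numpy()
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def most_similar(word_id, top_k=8):
        # cosine similarity against every word, highest first, skipping the word itself
        sims = unit @ unit[word_id]
        return (-sims).argsort()[1:top_k + 1]

    print(most_similar(0))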
    
    
  • A SkipGram model for Ancient Greek: my implementation of a Word2Vec model (skip-gram with negative sampling) in Pytorch 1.7.0, trained on local texts. The main goal for me was to improve my own programming skills, play with Ancient Greek text, and try to better understand how this kind of model...
  • SkipGram.py

    2020-06-14 22:07:38
    A Python implementation of the skip-gram word-embedding method. Prediction-based word embeddings are the focus of today's walkthrough: CBOW & Skip-gram, GloVe
  • A Python implementation of skipgram word-vector training

    2019-05-21 22:41:40

    I won't go into the theory and derivations behind skipgram in detail here; this post mainly records the first neural network whose forward and backward passes I wrote entirely by myself, and I finally got to experience the astonishing speed-up that negative sampling gives word-vector training. Moving! The overall time complexity is still fairly high, though, and I'm still looking into why gensim, which also uses Python, is so fast!

    (When I find time tomorrow) I'll put the data and code on my GitHub; it's written rather crudely and still needs improvement...

    1. Setup

    Python: 3.6

    Machine: run locally on a Mac

    Dataset: the English text8 corpus

    2. Data preprocessing

    • Replace special symbols in the text
    • Tokenize the text
    • Remove low-frequency words from the text
    # imports used by the code fragments in this post
    import random
    import time
    from collections import Counter

    import numpy as np


    def preprocess(text, freq=5):
        '''
        Preprocess the text

        Parameters
        ---
        text: the text data
        freq: word-frequency threshold
        '''
        # replace special symbols in the text
        text = text.lower()
        text = text.replace('.', ' <PERIOD> ')
        text = text.replace(',', ' <COMMA> ')
        text = text.replace('"', ' <QUOTATION_MARK> ')
        text = text.replace(';', ' <SEMICOLON> ')
        text = text.replace('!', ' <EXCLAMATION_MARK> ')
        text = text.replace('?', ' <QUESTION_MARK> ')
        text = text.replace('(', ' <LEFT_PAREN> ')
        text = text.replace(')', ' <RIGHT_PAREN> ')
        text = text.replace('--', ' <HYPHENS> ')
        text = text.replace('?', ' <QUESTION_MARK> ')
        text = text.replace(':', ' <COLON> ')
        words = text.split()
    
        # drop low-frequency words to reduce noise
        word_counts = Counter(words)
        trimmed_words = [word for word in words if word_counts[word] > freq]
    
        return trimmed_words

    3. Building the training samples

    • Build the vocabulary, i.e. the id->word and word->id mappings.
    • Convert the text sequence into a sequence of ids.
    • Drop stop-word-like words: stop words tend to have very high frequency, so the following formula computes the probability with which each word is dropped.

        P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}

    where f(w_i) is the occurrence frequency of word w_i and t is a threshold, generally between 1e-5 and 1e-3; if P(w_i) exceeds a chosen threshold, w_i is removed.
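    As a quick sanity check of the formula (numbers chosen only for illustration): with t = 1e-5, a word that makes up 1% of the corpus has f(w_i) = 1e-2, so

        P(w_i) = 1 - \sqrt{10^{-5} / 10^{-2}} \approx 1 - 0.032 \approx 0.97,

    which is above the 0.8 threshold used in the main function below, so the word is almost always dropped; a word with f(w_i) = 1e-5 gets P(w_i) = 0 and is always kept.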

    def get_train_words(path, t, threshold, freq):
        with open(path) as f:
            text = f.read()
        words = preprocess(text, freq)
        vocab = set(words)
        vocab_to_int = {w: c for c, w in enumerate(vocab)}
        int_to_vocab = {c: w for c, w in enumerate(vocab)}
    
    
        # convert the original text from vocab to int ids
        int_words = [vocab_to_int[w] for w in words]
    
        # count word occurrences
        int_word_counts = Counter(int_words)
        total_count = len(int_words)
        # compute word frequencies
        word_freqs = {w: c/total_count for w, c in int_word_counts.items()}
        # compute the drop probability
        prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
        # subsample the words
        train_words = [w for w in int_words if prob_drop[w] < threshold]
        return int_to_vocab, train_words

    4. Generating the skipgram input word pairs (center word, context word)

    Here the size of the context window is sampled randomly, so that words closer to the center word get sampled more often; after all, the closer a word is to the center word, the more tightly it is related to it!

    def get_targets(words, idx, window_size):
        '''
        Get the list of context words for a center word

        Parameters
        ---
        words: the word list
        idx: index of the input word
        window_size: window size
        '''
        target_window = np.random.randint(1, window_size + 1)
        # handle the case where there are not enough words before the input word
        start_point = idx - target_window if (idx - target_window) > 0 else 0
        end_point = idx + target_window
        # output words (the context words inside the window)
        targets = set(words[start_point: idx] + words[idx + 1: end_point + 1])
        return list(targets)
    
    
    def get_batches(words, window_size):
        '''
        Pair each of the center word's context words with the center word
        '''
        for idx in range(0, len(words)):
            targets = get_targets(words, idx, window_size)
            for y in targets:
                yield words[idx], y

    5. Some basic helper functions

    sigmoid_grad computes the gradient of the sigmoid function (it takes the sigmoid output as its input).

    def softmax(vector):
        res = np.exp(vector)
        e_sum = np.sum(res)
        res /= e_sum
        return res
    
    
    def sigmoid(inp):
        return 1.0 / (1.0 + 1.0 / np.exp(inp))
    
    
    def sigmoid_grad(inp):
        return inp * (1 - inp)

    6. Building the skipgram model

    def forward_backword(input_vectors, output_vectors, in_idx, out_idx, sigma, vector_dimension, vocabulary_size):
        hidden = input_vectors[in_idx]
        output = np.dot(output_vectors, hidden)
        output_p = softmax(output)
        loss = -np.log(output_p[out_idx])
        output_grad = output_p.copy()
        output_grad[out_idx] -= 1.0
        hidden_grad = np.dot(output_vectors.T, output_grad)
        hidden = hidden.reshape(vector_dimension, 1)
        output_grad = output_grad.reshape(vocabulary_size, 1)
        output_vectors_grad = np.dot(output_grad, hidden.T)
        output_vectors -= sigma * output_vectors_grad
        input_vectors[in_idx] -= sigma * hidden_grad
        return loss

    Note, though, that this is the most basic skipgram forward/backward pass, and it is far too slow, so slow it is basically unusable! So the negative-sampling version below replaces it.
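    For reference, here is a short sketch (my notation, matching what the code applies) of the per-pair negative-sampling objective that neg_forward_backword below optimizes; v is the input (hidden) vector of the center word, u_o the output vector of the true context word, and u_k the output vectors of the K sampled negatives:

        L = -\log \sigma(u_o^\top v) - \sum_{k=1}^{K} \log\bigl(1 - \sigma(u_k^\top v)\bigr)

        \frac{\partial L}{\partial v} = \bigl(\sigma(u_o^\top v) - 1\bigr)\,u_o + \sum_{k=1}^{K} \sigma(u_k^\top v)\,u_k,\qquad
        \frac{\partial L}{\partial u_o} = \bigl(\sigma(u_o^\top v) - 1\bigr)\,v,\qquad
        \frac{\partial L}{\partial u_k} = \sigma(u_k^\top v)\,v

    These match the SGD updates applied to input_vectors and output_vectors in the code (the code only adds a small epsilon inside the logs for numerical stability).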

    def neg_forward_backword(input_vectors, output_vectors, in_idx, out_idx, sigma, vocabulary_size, K=10):
        epsilon = 1e-5
        hidden = input_vectors[in_idx]
        neg_idxs = neg_sample(vocabulary_size, out_idx, K)
        tmp = sigmoid(np.dot(output_vectors[out_idx], hidden))
        hidden_grad = (tmp - 1.0) * output_vectors[out_idx]
        output_vectors[out_idx] -= sigma * (tmp - 1.0) * hidden
        loss = -np.log(tmp + epsilon)
        for idx in neg_idxs:
            tmp = sigmoid(np.dot(output_vectors[idx], hidden))
            loss -= np.log(1.0 - tmp + epsilon)
            hidden_grad += tmp * output_vectors[idx]
            output_vectors[idx] -= sigma * tmp * hidden
        input_vectors[in_idx] -= sigma * hidden_grad
        return loss
    
    
    def neg_sample(vocabulary_size, out_idx, K):
        res = [None] * K
        for i in range(K):
            tmp = np.random.randint(0, vocabulary_size)
            while tmp == out_idx:
                tmp = np.random.randint(0, vocabulary_size)
            res[i] = tmp
        return np.array(res)

    7. Finding the K most similar words for a few words

    To check how well the word vectors are trained, we look at whether each word's K most similar words really are similar to it; this function randomly picks some high-frequency words and finds the K most similar words for each of them.

    def get_simi(input_vectors):
        valid_size = 16
        valid_window = 100
        # pick 8 words from each of the two ranges
        valid_examples = np.array(random.sample(range(valid_window), valid_size // 2))
        valid_examples = np.append(valid_examples,
                                   random.sample(range(1000, 1000 + valid_window), valid_size // 2))
    
        valid_size = len(valid_examples)
    
        # compute each word vector's norm and normalize to unit length
        norm = np.sqrt(np.square(input_vectors).sum(axis=1)).reshape(len(input_vectors), 1)
        normalized_embedding = input_vectors / norm
        # look up the vectors of the validation words
        valid_embedding = normalized_embedding[valid_examples]
        # compute cosine similarity
        similarity = np.dot(valid_embedding, normalized_embedding.T)
        return similarity, valid_size, valid_examples

    8. The main function

    • Set the parameters
    • Lay out the overall code flow (i.e. call the functions above in order)
    • Verify the results (i.e. look at the K most similar words)
    if __name__ == "__main__":
        path = './text8.txt'
        t = 1e-5
        threshold = 0.8  # drop-probability threshold
        freq = 5
        windows = 10
        int_to_vocab, train_words = get_train_words(path, t, threshold, freq)
        np.save('int_to_vocab', int_to_vocab)
        vocabulary_size = len(int_to_vocab)
        vector_dimension = 200
        input_vectors = np.random.random([vocabulary_size, vector_dimension])
        output_vectors = np.random.random([vocabulary_size, vector_dimension])
        epochs = 10  # number of epochs
        sigma = 0.01
        K = 10
        
        iter = 1
        for e in range(1, epochs + 1):
            # decay the learning rate: 0.0001 after the third epoch, otherwise 0.001 after the first
            if e > 3:
                sigma = 0.0001
            elif e > 1:
                sigma = 0.001
            loss = 0
            batches = get_batches(train_words, windows)
            start = time.time()
            for x, y in batches:
                loss += neg_forward_backword(input_vectors, output_vectors, x, y, sigma, vocabulary_size, K)
                if iter % 100000 == 0:
                    end = time.time()
                    print("Epoch {}/{}".format(e, epochs),
                          "Iteration: {}".format(iter),
                          "Avg. Training loss: {:.4f}".format(loss / 100000),
                          "{:.4f} sec/100000".format((end - start)))
                    loss = 0
                    start = time.time()
                if iter % 4000000 == 0:
                    np.save('input_vectors', input_vectors)
                    similarity, valid_size, valid_examples = get_simi(input_vectors)
                    for i in range(valid_size):
                        valid_word = int_to_vocab[valid_examples[i]]
                        top_k = 8  # take the 8 most similar words
                        nearest = (-similarity[i, :]).argsort()[1:top_k + 1]
                        log = 'Nearest to [%s]:' % valid_word
                        for k in range(top_k):
                            close_word = int_to_vocab[nearest[k]]
                            log = '%s %s,' % (log, close_word)
                        print(log)
                iter += 1
    

    9. Results

    You can see it does work to some extent, but because the time complexity is high, I didn't tune hyperparameters, didn't run enough epochs, and used a fairly small amount of data, so the results aren't great. Still, it's more than enough for getting familiar with the inner workings of the skipgram model and with negative sampling! I'm also still looking into why gensim, which likewise uses Python, is so fast; I plan to borrow from it and implement hierarchical softmax myself as well.

    References:
    https://www.leiphone.com/news/201706/QprrvzsrZCl4S2lw.html
    https://zhuanlan.zhihu.com/p/33625794

  • skip gram is one of the two, going from the center word to the surrounding words (figure hand-drawn by Xiao Zhou). In the figure above, the one-hot encoding of the center word is first fed in and multiplied by the matrix W1 to get the center word's word vector; that vector is then multiplied by the matrix W2, which amounts to taking the dot product of the center word's vector with every other word's vector...
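    A minimal numpy sketch of that forward pass (vocabulary size, dimensions and the word index below are made up for illustration):

    import numpy as np

    V, D = 10, 4                                  # toy vocabulary size and embedding dimension
    W1 = np.random.rand(V, D)                     # input->hidden weights: row i is word i's vector
    W2 = np.random.rand(D, V)                     # hidden->output weights

    center = 3                                    # index of the center word
    onehot = np.zeros(V)
    onehot[center] = 1.0

    h = onehot @ W1                               # equals W1[center]: the center word's vector
    scores = h @ W2                               # dot products with every output word vector
    probs = np.exp(scores) / np.exp(scores).sum() # softmax over the vocabulary
    print(probs.round(3))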
  • Hand-written SkipGram (without negative sampling) and CBOW in numpy and tensorflow: http://www.claudiobellei.com/2018/01/07/backprop-word2vec-python/ Both implementations require working out the gradients by hand and implementing gradient descent manually, and neither uses negative ...
  • Implementing Chinese word2vec with skipgram in TensorFlow
  • Graph neural networks: core code of the graph-walk algorithms SkipGram and Node2Vec. 1. The DeepWalk sampling algorithm: for a given node, DeepWalk picks the next adjacent node with equal probability and appends it to the path, until the maximum path length is reached...
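    A minimal sketch of that uniform random-walk sampling (the toy adjacency list and walk length are made up for illustration; this is not the original post's code):

    import random

    # toy undirected graph as an adjacency list
    graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

    def deepwalk_walk(graph, start, walk_length):
        # pick each next node uniformly at random among the current node's neighbours
        walk = [start]
        while len(walk) < walk_length:
            neighbours = graph[walk[-1]]
            if not neighbours:
                break
            walk.append(random.choice(neighbours))
        return walk

    print(deepwalk_walk(graph, start=0, walk_length=6))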
  • Word2Vec Tutorial - The Skip-Gram Model 19 Apr 2016 This tutorial covers the skip gram neural network architecture for Word2Vec. My intention with this tutorial was to skip over the usual introduc
  • Bangla word2vec using the skipgram method. Dataset: a toy Bangla corpus is used to train the word vectors. We created this data by extracting all the words from a famous Bangla natok named কোথাও কেউ নেই. In the early 90s it sparked ... among people
  • 【Keras】word2vec_skipgram

    2018-08-14 21:56:36
    For every sentence, each run of 3 consecutive words is extracted into a tuple = (left, center, right); the skipgram model (assuming a window size of 3) aims to predict left from center and right from center. So for each tuple = (left, center, right) of data, ...
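    A minimal sketch of that data preparation (the sentence is made up for illustration; this is plain Python rather than the Keras generator from the original post):

    sentence = ['I', 'really', 'like', 'natural', 'language', 'processing']

    # every 3 consecutive words -> (left, center, right)
    triples = [tuple(sentence[i:i + 3]) for i in range(len(sentence) - 2)]

    # skipgram with window 3: predict left and right from the center word
    pairs = []
    for left, center, right in triples:
        pairs.append((center, left))
        pairs.append((center, right))

    print(pairs)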
  • CS224n, vector representations of words, word2vec, skipgram: word2vec is an NLP tool from Google that turns words into vectors and mines the relations between words. This example uses the Skip-Gram model with softmax cross-entropy to compute the loss (cost) and the gradients of the weights W (inputVectors) and W'...
  • CS224n, vector representations of words, word2vec, skipgram (negative sampling):   #!/usr/bin/env python import numpy as np import random from q1_softmax import softmax from q2_gradcheck import gradcheck_naive ...
  • An intuitive look at word2vec skip gram. Goal: given a word, predict the probability of every other word appearing in the same window. Objective (loss) function: for every context word (every non-target word) in each window, build C(\theta) = -\sum_{w_I} \sum_{c=1}^{C} \log p(w_{O,c} \mid w_I) ...
  • skip gram takes the middle word as given and maximizes the probability of its neighbouring words. Difference from CBOW: with CBOW a single negative-sampling draw is made per example, whereas here a separate negative-sampling draw is needed for every word in the middle word's context. Now look at the conditional probability: broadly the same as the earlier CBOW ...
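    For reference, the conditional probability being referred to is the standard skip-gram softmax (standard formulation, not quoted from the original post), with v_{w_c} the center-word vector and u_w the output vector of word w over a vocabulary of size V:

        p(w_o \mid w_c) = \frac{\exp\left(u_{w_o}^{\top} v_{w_c}\right)}{\sum_{w=1}^{V} \exp\left(u_{w}^{\top} v_{w_c}\right)}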
  • PyTorch SGNS: SkipGramNegativeSampling for Word2Vec in Python. Yet another, but very general, implementation of ... It can be used with any embedding scheme! I bet. vocab_size = 20000 word2vec = Word2Vec ( vocab_size = vocab_size ,...
  • Pros and cons of skip gram and cbow

    2020-01-14 21:31:35
    In skip-gram, each time a word acts as the center word it is effectively 1 student vs. K teachers: the K teachers (the surrounding words) each give the student (the center word) "professional" training, so the student's (center word's) "ability" (its resulting vector) ends up relatively solid (accurate), but...
  • The word2vec skipgram model

    2021-03-04 19:30:16
    Skip-gram model diagram. The Word2Vec model actually has two parts: the first builds the model, and the second uses the model to obtain the embedded word vectors. First we pick a word in the middle of a sentence as our input word; then we define a parameter called skip_window, which represents...
  • How to train word vectors: the Skip_Gram algorithm #3.3 (莫烦Python NLP tutorial)
  • I had only ever known the idea behind the Skip gram model, predicting the context words within a window from the center word, but had never implemented it by hand. Implementing it from scratch this time surfaced a lot of practical issues. Key point 1: the form of the training samples. One question I agonized over at the start was: each training...
  • Word2vec tutorial - the skip gram. 1. Overview: build a simple neural network with an input layer, a hidden layer and an output layer; all we need are the trained weights of the hidden layer. 2. Building the data: use word pairs...
  • 6.4 Implementing the Skipgram model in PyTorch

    2020-02-15 19:45:50
    : the corpus is used both to build the vocabulary and to be preprocessed into training pairs (x, y) that the model can read; skipgram takes the center word as input and predicts the context word, so its data pairs are (center_word, contenx_word), and they also need to be converted into index form (center_word_ix, contenx...
