
    import torch
    import torchtext
    from torchtext import data
    from torchtext import datasets
    from torchtext.vocab import GloVe
    import spacy
    from spacy.lang.en import English
    import random
    import torch.nn as nn
    import torch.nn.functional as F
    
    SEED = 1234
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
    
    

    1. Data preparation

    nlp = English() # choose the tokenizer
    TEXT = data.Field(tokenize=nlp)
    # TEXT = data.Field(lower= True)
    LABEL = data.LabelField(dtype=torch.float)
    train_data, test_data = datasets.IMDB.splits(TEXT, LABEL) # load the dataset

    train_data, valid_data = train_data.split(split_ratio=0.7, random_state=random.seed(SEED)) # split a validation set off the training set
    
    print(f'Number of training examples: {len(train_data)}')
    print(f'Number of validation examples: {len(valid_data)}')
    print(f'Number of testing examples: {len(test_data)}')
    
    Number of training examples: 17500
    Number of validation examples: 7500
    Number of testing examples: 25000
    
    print(vars(train_data[0]),'\n',len(vars(train_data[0])['text']))
    print(vars(train_data[0])['text'][0],'\n',type((vars(train_data[0])['text'][0])))
    
    {'text': This movie has got to be one of the worst I have ever seen make it to DVD!!! The story line might have clicked if the film had more funding and writers that would have cut the nonsense and sickly scenes that I highly caution parents on.... But the story line is like a loose cannon. If there was such a thing as a drive thru movie maker-this one would have sprung from that.It reminded me a lot of the quickie films that were put out in the 1960's, poor script writing and filming. <br /><br />The only sensible characters in the whole movie was the bartender and beaver. The rest of the film, could have easily been made by middle school children. I give this film a rating of 1 as it is truly awful and left my entire family with a sense of being cheated. My advice-Don't Watch It!!!, 'label': 'neg'} 
     173
    This 
     <class 'spacy.tokens.token.Token'>
    
    # As shown above, tokenizing with English() does not produce plain strings:
    # each item is a spacy Token, so we convert every token to str.
    # (The loop variable is named ds so it does not shadow the torchtext `data` module imported above.)
    for ds in [train_data, valid_data, test_data]:
        for i in range(len(ds)):
            example = ds[i]
            example.text = [str(j) for j in example.text]
    
    # Build the vocabulary (and load the GloVe vectors)
    TEXT.build_vocab(train_data, max_size=25000,vectors='glove.6B.100d',unk_init=torch.Tensor.normal_)
    LABEL.build_vocab(train_data)
    
    print(len(TEXT.vocab),len(LABEL.vocab))
    print(TEXT.vocab.freqs.most_common(20))
    
    25002 2
    [('the', 203566), (',', 192495), ('.', 165539), ('and', 109443), ('a', 109116), ('of', 100702), ('to', 93766), ('is', 76328), ('in', 61255), ('I', 54004), ('it', 53508), ('that', 49187), ('"', 44285), ("'s", 43329), ('this', 42445), ('-', 37165), ('/><br', 35752), ('was', 35034), ('as', 30384), ('with', 29774)]
    
    # Build the iterators (essentially dataloaders).
    # Note: batch_size and device are defined in section 2.2 below, so define them before running this cell.
    train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
                                                    (train_data, valid_data, test_data),
                                                    batch_size=batch_size,
                                                    device=device)
    

    2. Word-averaging (WordAvg) model

    2.1 Define the model

    class WordAvgModel(nn.Module):
        def __init__(self, vocab_size, embed_size, output_size, pad_idx):
            super(WordAvgModel, self).__init__()
            self.embed = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)
            self.linear = nn.Linear(embed_size, output_size)

        def forward(self, text):
            # text has size (seq_length, batch_size): each column is one sentence
            # (sentences shorter than seq_length are filled up with <pad>)
            embedded = self.embed(text)           # (seq_length, batch_size, embed_size)
            embedded = embedded.permute(1, 0, 2)  # swap the first two dims -> (batch_size, seq_length, embed_size)
            # average-pool with a kernel of (seq_length, 1): the result has size (batch_size, 1, embed_size),
            # which squeeze turns into (batch_size, embed_size)
            pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze()
            return self.linear(pooled)
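
    The avg_pool2d trick in forward is simply a mean over the sequence dimension. A minimal sketch to check the equivalence (the tensor sizes here are made-up illustrations):

    import torch
    import torch.nn.functional as F

    embedded = torch.randn(64, 50, 100)   # (batch_size, seq_length, embed_size)
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)   # (batch_size, embed_size)
    assert torch.allclose(pooled, embedded.mean(dim=1), atol=1e-6)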
    

    2.2 Set the hyperparameters

    batch_size = 64
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    vocab_size = len(TEXT.vocab)
    embed_size = 100
    output_size = 1
    pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
    unk_idx = TEXT.vocab.stoi[TEXT.unk_token]
    

    2.3 Initialize the model

    avg_model = WordAvgModel(vocab_size=vocab_size,embed_size=embed_size,
                             output_size=output_size,pad_idx=pad_idx)
    avg_model.to(device)
    
    WordAvgModel(
      (embed): Embedding(25002, 100, padding_idx=1)
      (linear): Linear(in_features=100, out_features=1, bias=True)
    )
    
    num_parameters = sum(p.numel() for p in avg_model.parameters() if p.requires_grad)
    print(num_parameters) # number of trainable parameters
    
    2500301
    

    2.4 Initialize the embedding layer with GloVe

    pretrained_embed = TEXT.vocab.vectors
    # pretrained_embed has size (25002, 100): 25002 words in the vocabulary, 100 dimensions because we built the vocabulary with vectors='glove.6B.100d'
    avg_model.embed.weight.data.copy_(pretrained_embed)
    
    tensor([[-0.6946,  0.0269,  0.0063,  ...,  1.2692, -1.3969, -0.4796],
            [-2.2822,  0.1412, -1.3277,  ..., -0.0465, -1.0185, -0.1024],
            [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
            ...,
            [-0.3617,  0.6201,  0.1105,  ...,  0.2994, -0.5920,  1.0949],
            [-0.3312,  0.9364, -0.1638,  ...,  0.9859, -1.0950, -1.1516],
            [-0.1954,  0.5692, -0.0671,  ...,  0.2170,  0.7001, -0.1479]],
           device='cuda:0')
    
    avg_model.embed.weight.data.size() 
    # embed has size (25002, 100); each row is the vector of one word in the vocabulary.
    # The first word is <unk> and the second is <pad>;
    # we usually initialize the weights of these two rows to zero.
    
    torch.Size([25002, 100])
    
    avg_model.embed.weight.data[pad_idx] = torch.zeros(embed_size)
    avg_model.embed.weight.data[unk_idx] = torch.zeros(embed_size)
    # avg_model.embed.weight.data[pad_idx] is the row that corresponds to <pad>
    

    2.5 Define the training and evaluation functions

    def train(model,dataset,optimizer,loss_fn):
        epoch_loss,epoch_count,epoch_acc_count=0.,0.,0.
        model.train()
        total_len = 0
        for batch in dataset:
            preds = model(batch.text).squeeze() # model output has size (batch_size, 1); squeeze it to (batch_size,)
            loss = loss_fn(preds,batch.label)
            acc = binary_accuracy(preds,batch.label)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
            epoch_loss += loss.item()*len(batch.label)      # accumulate loss over the whole epoch
            epoch_count += len(batch.label)                 # number of samples seen this epoch
            epoch_acc_count += acc.item()*len(batch.label)  # number of correctly classified samples this epoch
        
        return epoch_loss/epoch_count,epoch_acc_count/epoch_count
    
     
    def evaluate(model,dataset,loss_fn):
        epoch_loss,epoch_count,epoch_acc_count=0.,0.,0.
        model.eval()
        total_len = 0
        for batch in dataset:
            preds = model(batch.text).squeeze() # model output has size (batch_size, 1); squeeze it to (batch_size,)
            loss = loss_fn(preds,batch.label)
            acc = binary_accuracy(preds,batch.label)
    
            epoch_loss += loss.item()*len(batch.label)      # accumulate loss over the whole epoch
            epoch_count += len(batch.label)                 # number of samples seen this epoch
            epoch_acc_count += acc.item()*len(batch.label)  # number of correctly classified samples this epoch

        return epoch_loss/epoch_count,epoch_acc_count/epoch_count
    
    def binary_accuracy(preds,y):
        rounded_preds = torch.round(torch.sigmoid(preds))
        num_correct = (rounded_preds==y).float()
        acc = num_correct.sum()/len(y)
        return acc
        
    

    2.6 Train the model

    optimizer = torch.optim.Adam(avg_model.parameters(),lr=0.005)
    loss_fn = nn.BCEWithLogitsLoss()
    
    epochs = 10
    best_valid_acc = 0.
    for epoch in range(epochs):
        train_loss,train_acc = train(avg_model,train_iterator,optimizer,loss_fn)
        valid_loss,valid_acc = evaluate(avg_model,valid_iterator,loss_fn)
        
        if valid_acc>best_valid_acc:
            best_valid_acc = valid_acc
            best_epoch = epoch
            torch.save(avg_model.state_dict(),'./wordavg_model.txt')
    #         print('Validation accuracy ({}) improved; model saved.'.format(valid_acc))
        
    
        print("Epoch:", epoch, "Train_Loss:", train_loss, "Train_Acc:", train_acc,
              "Valid_Loss", valid_loss, "Valid_Acc", valid_acc)
        
    print("training has finished,the best epoch is {},the best valid_acc is {}".format(best_epoch,best_valid_acc))
           
    
    Epoch: 0 Train_Loss: 0.5956396281242371 Train_Acc: 0.694228571496691 Valid_Loss 0.4051449816385905 Valid_Acc 0.8417333333969116
    Epoch: 1 Train_Loss: 0.3593949766363416 Train_Acc: 0.8733142857960292 Valid_Loss 0.46664917748769125 Valid_Acc 0.8840000000317891
    Epoch: 2 Train_Loss: 0.2551696341242109 Train_Acc: 0.913428571510315 Valid_Loss 0.5249438627560934 Valid_Acc 0.8950666667302449
    Epoch: 3 Train_Loss: 0.196742424092974 Train_Acc: 0.9325142858232771 Valid_Loss 0.6135396106402079 Valid_Acc 0.8957333333969116
    Epoch: 4 Train_Loss: 0.15810192627225603 Train_Acc: 0.9501142857687814 Valid_Loss 0.6637696914672852 Valid_Acc 0.9009333333969116
    Epoch: 5 Train_Loss: 0.1267459169966834 Train_Acc: 0.9622285714830671 Valid_Loss 0.7350258693695069 Valid_Acc 0.9008000000635783
    Epoch: 6 Train_Loss: 0.10385001053469521 Train_Acc: 0.9716 Valid_Loss 0.835720943514506 Valid_Acc 0.8982666667302449
    Epoch: 7 Train_Loss: 0.08529832897612026 Train_Acc: 0.9776 Valid_Loss 0.8945791959762573 Valid_Acc 0.8969333333969116
    Epoch: 8 Train_Loss: 0.0711212798680578 Train_Acc: 0.9828571428843907 Valid_Loss 0.9895696968078613 Valid_Acc 0.8968000000635783
    Epoch: 9 Train_Loss: 0.05655052126603467 Train_Acc: 0.9883428571428572 Valid_Loss 1.065309889539083 Valid_Acc 0.8962666667302449
    training has finished,the best epoch is 4,the best valid_acc is 0.9009333333969116
    
    best_model = WordAvgModel(vocab_size=vocab_size,embed_size=embed_size,
                             output_size=output_size,pad_idx=pad_idx)
    best_model.load_state_dict(torch.load('./wordavg_model.txt'))
    best_model.to(device)
    
    WordAvgModel(
      (embed): Embedding(25002, 100, padding_idx=1)
      (linear): Linear(in_features=100, out_features=1, bias=True)
    )
    

    2.7 Check the classifier on new sentences

    def predict_sentiment(sentence):
        tokenized = [str(tok) for tok in TEXT.tokenize(sentence)]
        print(tokenized)
        indexed = torch.LongTensor([TEXT.vocab.stoi[t] for t in tokenized]).to(device).unsqueeze(1)
        pred = torch.sigmoid(best_model(indexed))
        return pred.item()
    
    sentence = input('please input the sentence you want to predict(in English):')
    print('Probability that the input expresses positive sentiment: {}'.format(predict_sentiment(sentence)))
    
    please input the sentence you want to predict(in English): this is a good movie
    
    
    ['this', 'is', 'a', 'good', 'movie']
    Probability that the input expresses positive sentiment: 1.0
    
    sentence = input('please input the sentence you want to predict(in English):')
    print('Probability that the input expresses positive sentiment: {}'.format(predict_sentiment(sentence)))
    
    please input the sentence you want to predict(in English): the film is great while the stars are awful
    
    
    ['the', 'film', 'is', 'great', 'while', 'the', 'stars', 'are', 'awful']
    Probability that the input expresses positive sentiment: 3.232804579589299e-10
    
    sentence = input('please input the sentence you want to predict(in English):')
    print('Probability that the input expresses positive sentiment: {}'.format(predict_sentiment(sentence)))
    
    please input the sentence you want to predict(in English):  the film is great and the stars are good
    
    
    [' ', 'the', 'film', 'is', 'great', 'and', 'the', 'stars', 'are', 'good']
    Probability that the input expresses positive sentiment: 1.0
    

    3. LSTM model

    class LstmModel(nn.Module):
        def __init__(self,vocab_size,embed_size,output_size,pad_idx,hidden_size,dropout_ratio):
            super(LstmModel,self).__init__()
            self.embed = nn.Embedding(vocab_size,embed_size,padding_idx=pad_idx)
            self.lstm = nn.LSTM(embed_size,hidden_size,bidirectional=True,num_layers=1)
            self.linear = nn.Linear(hidden_size*2,output_size)
            self.dropout = nn.Dropout(dropout_ratio)
            
        def forward(self,text):
            embedded = self.dropout(self.embed(text))
            output,(hidden,cell) = self.lstm(embedded)
            # output size: (seq_length, batch_size, num_directions*hidden_size)
            # hidden and cell size: (num_layers*num_directions, batch_size, hidden_size)

            hidden = torch.cat([hidden[-1],hidden[-2]],dim=1)
            # hidden[-1] and hidden[-2] each have size (batch_size, hidden_size);
            # after cat, hidden has size (batch_size, hidden_size*2)
    #         print(hidden.size())
            hidden = self.dropout(hidden.squeeze())
            return self.linear(hidden)
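
    As a sanity check on the shape comments above, here is a minimal sketch with illustrative sizes showing that, for a 1-layer bidirectional LSTM, hidden[-2] and hidden[-1] are the final forward and backward states:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=100, hidden_size=100, num_layers=1, bidirectional=True)
    x = torch.randn(35, 64, 100)                      # (seq_length, batch_size, embed_size)
    output, (hidden, cell) = lstm(x)
    print(output.shape)   # torch.Size([35, 64, 200]) = (seq_length, batch_size, num_directions*hidden_size)
    print(hidden.shape)   # torch.Size([2, 64, 100])  = (num_layers*num_directions, batch_size, hidden_size)
    print(torch.cat([hidden[-1], hidden[-2]], dim=1).shape)   # torch.Size([64, 200])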
            
    
    vocab_size = len(TEXT.vocab)
    embed_size = 100
    output_size = 1
    pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
    hidden_size = 100
    dropout_ratio= 0.5
    lstm_model = LstmModel(vocab_size,embed_size,output_size,pad_idx,hidden_size,dropout_ratio).to(device)
    
    num_parameters = sum(p.numel() for p in lstm_model.parameters() if p.requires_grad)
    print(num_parameters)
    
    2662001
    
    # lstm_model.to(device)
    pretrained_embed = TEXT.vocab.vectors
    lstm_model.embed.weight.data.copy_(pretrained_embed)
    
    unk_idx = TEXT.vocab.stoi[TEXT.unk_token]
    lstm_model.embed.weight.data[pad_idx] = torch.zeros(embed_size)
    lstm_model.embed.weight.data[unk_idx] = torch.zeros(embed_size)
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    optimizer = torch.optim.Adam(lstm_model.parameters(),lr=0.001)
    loss_fn = nn.BCEWithLogitsLoss()
    # lstm_model.to(device)
    # loss_fn.to(device)
    
    epochs = 2
    best_valid_acc = 0.
    for epoch in range(epochs):
        train_loss,train_acc = train(lstm_model,train_iterator,optimizer,loss_fn)
        valid_loss,valid_acc = evaluate(lstm_model,valid_iterator,loss_fn)
        
        if valid_acc>best_valid_acc:
            best_valid_acc = valid_acc
            best_epoch = epoch
            torch.save(lstm_model.state_dict(),'./lstm_model.txt')
    #         print('Validation accuracy ({}) improved; model saved.'.format(valid_acc))
        
    
        print("Epoch:", epoch, "Train_Loss:", train_loss, "Train_Acc:", train_acc,
              "Valid_Loss", valid_loss, "Valid_Acc", valid_acc)
        
    print("training has finished,the best epoch is {},the best valid_acc is {}".format(best_epoch,best_valid_acc))
    

    4. CNN model

    class CNNModel(nn.Module):
        def __init__(self,vocab_size,embedding_size,output_size,pad_idx,num_filters,filter_size,dropout_ratio):
            super(CNNModel,self).__init__()
            self.embed = nn.Embedding(vocab_size,embedding_size,padding_idx=pad_idx)
            self.conv = nn.Conv2d(in_channels=1,out_channels=num_filters,kernel_size=(filter_size,embedding_size))
            self.dropout = nn.Dropout(dropout_ratio)
            self.linear = nn.Linear(num_filters, output_size)
            
        def forward(self,text):
            text = text.permute(1,0)    # move batch_size to the first dimension
            embedded = self.embed(text) # (batch_size, seq_length, embed_size)
            embedded = embedded.unsqueeze(1) # (batch_size, 1, seq_length, embed_size): Conv2d expects
                                             # (batch_size, c_in, h_in, w_in), where c_in is the number of input
                                             # channels (1 for grayscale images, 3 for RGB images)
            conved = F.relu(self.conv(embedded)) # (batch_size, num_filters, seq_length-filter_size+1, 1)
            conved = conved.squeeze(3)           # squeeze away the trailing 1 -> (batch_size, num_filters, seq_length-filter_size+1)
            pooled = F.max_pool1d(conved,conved.shape[2])
            # max_pool1d with kernel size seq_length-filter_size+1 takes the maximum of each filter's
            # feature map for every sample, so the result has size (batch_size, num_filters, 1)
            pooled = pooled.squeeze(2)  # squeeze away the trailing 1
            pooled = self.dropout(pooled)

            return self.linear(pooled)
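
    The original post ends here without training the CNN. A minimal sketch of how this model could be instantiated and trained with the helpers defined above; num_filters, filter_size, dropout_ratio and the epoch count are assumptions, not values from the post:

    num_filters = 100
    filter_size = 3
    dropout_ratio = 0.5
    cnn_model = CNNModel(vocab_size, embed_size, output_size, pad_idx,
                         num_filters, filter_size, dropout_ratio).to(device)
    cnn_model.embed.weight.data.copy_(TEXT.vocab.vectors)
    cnn_model.embed.weight.data[pad_idx] = torch.zeros(embed_size)
    cnn_model.embed.weight.data[unk_idx] = torch.zeros(embed_size)

    optimizer = torch.optim.Adam(cnn_model.parameters(), lr=0.001)
    loss_fn = nn.BCEWithLogitsLoss()
    for epoch in range(5):
        train_loss, train_acc = train(cnn_model, train_iterator, optimizer, loss_fn)
        valid_loss, valid_acc = evaluate(cnn_model, valid_iterator, loss_fn)
        print("Epoch:", epoch, "Train_Acc:", train_acc, "Valid_Acc:", valid_acc)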
    

    PyTorch + CNN + LSTM + word embeddings

    Video classification with word embeddings

    # -*- coding: utf-8 -*-
    """
    Created on Fri Nov  6 12:53:02 2020
    
    @author: HUANGYANGLAI
    """
    
    import os
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.models as models
    import torchvision.transforms as transforms
    import torch.utils.data as data
    import torchvision
    from torch.autograd import Variable
    import matplotlib.pyplot as plt
    from functions import *
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    from sklearn.metrics import accuracy_score
    import pickle
    
    data_path='I:\\test\\image'
    save_model_path = "I:\\test\\CRNN_ckpt\\"
    
    # EncoderCNN architecture
    #cnn编码器构架
    CNN_fc_hidden1, CNN_fc_hidden2 = 1024, 768#编码器第一隐藏层,第二隐藏层参数
    CNN_embed_dim = 512      # latent dim extracted by 2D CNN
    img_x, img_y = 256, 342  # resize video 2d frame size(可能更改图片尺寸)
    dropout_p = 0.3         # dropout probability(随机失活)
    
    
    # DecoderRNN architecture
    #RNN解码器框架
    RNN_hidden_layers = 3#单层网络层数
    RNN_hidden_nodes = 512#隐藏节点
    RNN_FC_dim = 256
    
    
    # training parameters
    #训练参数
    k = 2             # number of target categories (needed below when DecoderRNN is instantiated)
    epochs = 10    # training epochs(迭代次数)
    batch_size = 1     #(批处理)
    learning_rate = 1e-2    #(学习精度)
    log_interval = 10   # interval for displaying training info(显示训练信息的时间间隔)
    
    # Select which frame to begin & end in videos
    #选择视频中开始和结束的帧
    begin_frame, end_frame, skip_frame = 1, 90, 2#跳过帧
    
    #用来固定第一次的词向量
    number1=0
    
    class EncoderCNN1(nn.Module):
        def __init__(self, img_x=90, img_y=120, fc_hidden1=512, fc_hidden2=512, drop_p=0.3, CNN_embed_dim=300):
            super(EncoderCNN1, self).__init__()
    
            self.img_x = img_x
            self.img_y = img_y
            self.CNN_embed_dim = CNN_embed_dim
    
            # CNN architechtures
            self.ch1, self.ch2, self.ch3, self.ch4 = 32, 64, 128, 256
            self.k1, self.k2, self.k3, self.k4 = (5, 5), (3, 3), (3, 3), (3, 3)      # 2d kernal size
            self.s1, self.s2, self.s3, self.s4 = (2, 2), (2, 2), (2, 2), (2, 2)      # 2d strides
            self.pd1, self.pd2, self.pd3, self.pd4 = (0, 0), (0, 0), (0, 0), (0, 0)  # 2d padding
    
            # conv2D output shapes
            self.conv1_outshape = conv2D_output_size((self.img_x, self.img_y), self.pd1, self.k1, self.s1)  # Conv1 output shape
            self.conv2_outshape = conv2D_output_size(self.conv1_outshape, self.pd2, self.k2, self.s2)
            self.conv3_outshape = conv2D_output_size(self.conv2_outshape, self.pd3, self.k3, self.s3)
            self.conv4_outshape = conv2D_output_size(self.conv3_outshape, self.pd4, self.k4, self.s4)
    
            # fully connected layer hidden nodes
            self.fc_hidden1, self.fc_hidden2 = fc_hidden1, fc_hidden2
            self.drop_p = drop_p
    
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_channels=3, out_channels=self.ch1, kernel_size=self.k1, stride=self.s1, padding=self.pd1),
                nn.BatchNorm2d(self.ch1, momentum=0.01),
                nn.ReLU(inplace=True),                      
                # nn.MaxPool2d(kernel_size=2),
            )
            self.conv2 = nn.Sequential(
                nn.Conv2d(in_channels=self.ch1, out_channels=self.ch2, kernel_size=self.k2, stride=self.s2, padding=self.pd2),
                nn.BatchNorm2d(self.ch2, momentum=0.01),
                nn.ReLU(inplace=True),
                # nn.MaxPool2d(kernel_size=2),
            )
    
            self.conv3 = nn.Sequential(
                nn.Conv2d(in_channels=self.ch2, out_channels=self.ch3, kernel_size=self.k3, stride=self.s3, padding=self.pd3),
                nn.BatchNorm2d(self.ch3, momentum=0.01),
                nn.ReLU(inplace=True),
                # nn.MaxPool2d(kernel_size=2),
            )
    
            self.conv4 = nn.Sequential(
                nn.Conv2d(in_channels=self.ch3, out_channels=self.ch4, kernel_size=self.k4, stride=self.s4, padding=self.pd4),
                nn.BatchNorm2d(self.ch4, momentum=0.01),
                nn.ReLU(inplace=True),
                # nn.MaxPool2d(kernel_size=2),
            )
    
            self.drop = nn.Dropout2d(self.drop_p)
            self.pool = nn.MaxPool2d(2)
            self.fc1 = nn.Linear(self.ch4 * self.conv4_outshape[0] * self.conv4_outshape[1], self.fc_hidden1)   # fully connected layer, output k classes
            self.fc2 = nn.Linear(self.fc_hidden1, self.fc_hidden2)
            self.fc3 = nn.Linear(self.fc_hidden2, self.CNN_embed_dim)   # output = CNN embedding latent variables
            ###########用来将3维变成2维
            self.fc4= nn.Sequential(
                nn.Linear(23040,100),
                nn.Tanh(),
                )
    
        def forward(self, x_3d):
            cnn_embed_seq = []
            print('x_3d的形状',x_3d.size())#([1, 28, 3, 256, 342])
            print('x_3d的形状1',x_3d.size(1))
            for t in range(x_3d.size(1)):
                # CNNs
                #print('x_3d[:, t, :, :, :]',x_3d[:, t, :, :, :].size())#torch.Size([1, 3, 256, 342])
                x = self.conv1(x_3d[:, t, :, :, :])#torch.Size([1, 32, 126, 129])
                #print('self.conv1(x_3d[:, t, :, :, :])',x.size())
                x = self.conv2(x)
                #print('self.conv2(x_3d[:, t, :, :, :])',x.size())#torch.Size([1, 64, 62, 84])
                x = self.conv3(x)
                #print('self.conv3(x_3d[:, t, :, :, :])',x.size())#torch.Size([1, 128, 30, 41])
                x = self.conv4(x)
                #print('x的形状',x.size())# torch.Size([1, 256, 14, 20])
                x = x.view(x.size(0), -1)           # flatten the output of conv
                #print('拉直的x的形状',x.size())# torch.Size([1, 71680])
                # FC layers
                x = F.relu(self.fc1(x))
                #print('F.relu(self.fc1(x))', x.size())#torch.Size([1, 1024])
                # x = F.dropout(x, p=self.drop_p, training=self.training)
                x = F.relu(self.fc2(x))
                #print('x = F.relu(self.fc2(x))',x.size())#torch.Size([1, 768])
                x = F.dropout(x, p=self.drop_p, training=self.training)
                x = self.fc3(x)
                #print('x = self.fc3(x)',x.size())#torch.Size([1, 512])
                cnn_embed_seq.append(x)
    
            # swap time and sample dim such that (sample dim, time dim, CNN latent dim)
            #print('cnn_embed_seq',cnn_embed_seq)
            cnn_embed_seq = torch.stack(cnn_embed_seq, dim=0).transpose_(0, 1)
            print('cnn_embed_seq1',cnn_embed_seq.size())#torch.Size([1, 28, 512])
            cnn_embed_seq=cnn_embed_seq.reshape(1,23040)
            cnn_embed_seq=F.relu(self.fc4(cnn_embed_seq))
            print('cnn_embed_seq2',cnn_embed_seq.size())
            # cnn_embed_seq: shape=(batch, time_step, input_size)
    
            return cnn_embed_seq
    
    
    def train(log_interval, model, device, train_loader, optimizer, epoch):
        # set model as training mode
        #(设置模式作为训练模式)
        cnn_encoder, rnn_decoder = model
        cnn_encoder.train()#训练模式
        rnn_decoder.train()#训练模式
    
        losses = []
        scores = []#分数
        N_count = 0   # counting total trained sample in one epoch(计算一次训练内训练的样本数)
        for batch_idx, (X, y) in enumerate(train_loader):
            print('迭代次数',batch_idx)
            print('看看是啥',X.size())#torch.Size([1, 28, 3, 256, 342])
            print('看看标签',y)#tensor([[1]])
            ####################################################构造词向量
            yy=y.clone()
            yy=yy.squeeze()
            yy=yy.numpy().tolist()
            print('yyyyyyyy',yy)
            if(yy==0):
                y_embed=yd0
            if(yy==1):
                y_embed=yd1
                
            ################################################################
            # distribute data to device#使用设备训练可以是显卡或者cpu
            X, y = X.to(device), y.to(device).view(-1, )
            
    
            N_count += X.size(0)
    
    
            output = cnn_encoder(X)  
            
            loss = loss_func(output,y_embed)
            print('outpt是啥',output.size())
            '''
            output = rnn_decoder(cnn_encoder(X))   # output has dim = (batch, number of classes),批数分类数
            # print('outpt是啥',output)# tensor([[-0.0016, -0.0368]],
            # print('cnn_encoder(X)',cnn_encoder(X).size())#(1,28,512)
            loss = F.cross_entropy(output, y)#交叉熵函数
            '''
            losses.append(loss.item())#误差
            '''
            # to compute accuracy(计算精确度)
            # print('呦西呦西qq',output.size())#torch.Size([1, 2])
            y_pred = torch.max(output, 1)[1]  # y_pred != output
            # print('呦西呦西',y_pred)#tensor([0])
            # print('柯基柯基',y.size())#torch.Size([1])
            # print('呦西呦西1',y.cpu().data)#tensor([0])
            # print('呦西呦西2',y_pred.cpu().data)#tensor([0])
            #step_score = accuracy_score(y.cpu().data.squeeze().numpy(), y_pred.cpu().data.squeeze().numpy())
            step_score = accuracy_score(y.cpu().data, y_pred.cpu().data)
            # print('八嘎',step_score)
            #accuracy_score中normalize:默认值为True,返回正确分类的比例
            scores.append(step_score)         # computed on CPU
            '''
      
            optimizer.zero_grad()   # the usual three steps: zero the gradients,
            loss.backward(retain_graph=True)   # backpropagate,
            optimizer.step()   # and update the parameters

        # show information (accuracy is not reported: the accuracy_score block above is commented out)
        if (batch_idx + 1) % log_interval == 0:   # print every log_interval batches
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch + 1, N_count, len(train_loader.dataset), 100. * (batch_idx + 1) / len(train_loader), loss.item()))
    
        return losses, scores
    
    
    
    def validation(model, device, optimizer, test_loader):#测试
        # set model as testing mode
        cnn_encoder, rnn_decoder = model
        cnn_encoder.eval()#测试模式
        rnn_decoder.eval()#测试模式
    
        test_loss =0
        all_y = []
        all_y_pred = []
        ko=0
        with torch.no_grad():#在测试中不需要进行梯度翻传等操作
            for X, y in test_loader:
                # distribute data to device
                X, y = X.to(device), y.to(device).view(-1, )
    
                output = rnn_decoder(cnn_encoder(X))
                # print('预测的output',output)
                #print('cnn_encoder(x)',cnn_encoder(x))
                loss = F.cross_entropy(output, y, reduction='sum')#对n个样本的loss进行求
                
                print('实际loss',loss)
                ko=ko+1
                print('ko',ko)
                test_loss+=loss.item()                # sum up batch loss
                y_pred = output.max(1, keepdim=True)[1]  # (y_pred != output) get the index of the max log-probability
                # print('预测的标签',y_pred)
                # print('真实标签',y)
                # collect all y and y_pred in all batches
                all_y.extend(y)
                all_y_pred.extend(y_pred)
    
        print('test_loss',test_loss)
        test_loss = test_loss/len(test_loader.dataset)
        
        # print('len(test_loader.dataset)',len(test_loader.dataset))
        # # compute accuracy
        # print('all1y',all_y)
        all_y = torch.stack(all_y, dim=0)#所有元素相加
    #    print('ally',all_y)
        all_y_pred = torch.stack(all_y_pred, dim=0)
        print('all_y_pred',all_y_pred)
        print('test_loss',test_loss)
        print('len(test_loader.dataset)',len(test_loader.dataset))
        test_score = accuracy_score(all_y.cpu().data, all_y_pred.cpu().data)
        #accuracy_score中normalize:默认值为True,返回正确分类的比例
    
        # show information
        print('\nTest set ({:d} samples): Average loss: {:.4f}, Accuracy: {:.2f}%\n'.format(len(all_y), loss.item(), 100* test_score))
    
        # save Pytorch models of best record
        #torch.save(cnn_encoder.state_dict(), os.path.join(save_model_path, 'cnn_encoder_epoch{}.pth'.format(epoch + 1)))  # save spatial_encoder
        #存放torch.nn.Module模块中的state_dict只包含卷积层和全连接层的参数,当网络中存在batchnorm时,例如vgg网络结构,torch.nn.Module模块中的state_dict也会存放batchnorm's running_mean
        #torch.save(rnn_decoder.state_dict(), os.path.join(save_model_path, 'rnn_decoder_epoch{}.pth'.format(epoch + 1)))  # save motion_encoder
        #torch.save(optimizer.state_dict(), os.path.join(save_model_path, 'optimizer_epoch{}.pth'.format(epoch + 1)))      # save optimizer
        print("Epoch {} model saved!".format(epoch + 1))
    
        return test_loss, test_score
    
    # Detect devices
    use_cuda = torch.cuda.is_available()                   # check if GPU exists
    device = torch.device("cuda" if use_cuda else "cpu")   # use CPU or GPU
    
    # Data loading parameters
    #params = {'batch_size': batch_size, 'shuffle': False, 'num_workers': 4, 'pin_memory': True} if use_cuda else {}
    params = {'batch_size': batch_size, 'shuffle': True, 'num_workers': 0, 'pin_memory': True} 
    
    '''
    # load UCF101 actions names
    #加装类别名字
    with open(action_name_path, 'rb') as f:
        action_names = pickle.load(f)
    '''
    
    
    action_names=['ApplyEyeMakeup','BandMarching']
    
    
    
    # convert labels -> category
    le = LabelEncoder()
    le.fit(action_names)
    print("jjjjjjjjjj",le.fit(action_names))
    # show how many classes there are(列表中显示标签名称)
    list(le.classes_)
    print('list(le.classes_)',list(le.classes_))
    
    # convert category -> 1-hot
    action_category = le.transform(action_names).reshape(-1, 1)#(将字符串标签给编号0-100print('action_category',action_category)
    enc = OneHotEncoder()#实现将分类特征的每一个数值转化为一个可以用来计算的值
    enc.fit(action_category)#这里的作用是为后面enc.transfrom中生成自动编码做准备
    
    print("kkkkkkkkkkkkkkkkkk",enc.fit(action_category))
    actions = []
    fnames = os.listdir(data_path)#(得到数据路径下的所有文件,返回以列表的形式)
    
    all_names = []
    for f in fnames:
        loc1 = f.find('v_')
        loc2 = f.find('_g')
        actions.append(f[(loc1 + 2): loc2])
    
        all_names.append(f)
    
    
    # list all data files(列出所有文件数据)
    all_X_list = all_names                  # all video file names(所有视频文件名)
    all_y_list = labels2cat(le, actions)    # all video labels(即每个视频文件夹对应的标签)
    print('\n')
    print(all_X_list)
    print(all_y_list)
    print(actions)
    print('\n')
    train_list, test_list, train_label, test_label = train_test_split(all_X_list, all_y_list, test_size=0.5, random_state=42)
    print('train_list是啥',train_list)#['v_ApplyEyeMakeup_g01_c01', 'v_BandMarching_g01_c01']
    print('test_list是啥',test_list)#['v_ApplyEyeMakeup_g01_c02', 'v_BandMarching_g01_c02']
    print('train_label是啥',train_label)#[0,1]
    print('test_label是啥',test_label)# [0,1]
    
    transform = transforms.Compose([transforms.Resize([img_x, img_y]),#改变形状
                                    transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
    
    
    selected_frames = np.arange(begin_frame, end_frame, skip_frame).tolist()#挑选帧
    print('selected_frames',selected_frames)
    
    
    
    train_set, valid_set = Dataset_CRNN(data_path, train_list, train_label, selected_frames, transform=transform), \
                           Dataset_CRNN(data_path, test_list, test_label, selected_frames, transform=transform)
    #print('train_set的形状',train_set)# <functions.Dataset_CRNN object at 0x0000027A97384488>
    train_loader = data.DataLoader(train_set, **params)
    #print('train_loader的形状',train_loader)
    valid_loader = data.DataLoader(valid_set, **params)
    
    cnn_encoder = EncoderCNN1(img_x=img_x, img_y=img_y, fc_hidden1=CNN_fc_hidden1, fc_hidden2=CNN_fc_hidden2,
                             drop_p=dropout_p, CNN_embed_dim=CNN_embed_dim).to(device)
    
    rnn_decoder = DecoderRNN(CNN_embed_dim=CNN_embed_dim, h_RNN_layers=RNN_hidden_layers, h_RNN=RNN_hidden_nodes, 
                             h_FC_dim=RNN_FC_dim, drop_p=dropout_p, num_classes=k).to(device)
    
    #crnn_params = list(cnn_encoder.parameters())
    #optimizer = torch.optim.Adam(cnn_encoder.parameters(), lr=learning_rate)#优化cnn编码器和rnn解码器的参数
    optimizer = torch.optim.SGD(cnn_encoder.parameters(), lr=learning_rate)#优化cnn编码器和rnn解码器的参数
    #loss_func=torch.nn.MSELoss()
    loss_func=torch.nn.SmoothL1Loss()
    
    epoch_train_losses = []
    epoch_train_scores = []
    
    #############################################词向量
    word_to_ix={'ApplyEyeMakeup':0,'BandMarching':1}
    idex_to_word={word_to_ix[word]:word for word in word_to_ix}
    embeds = torch.nn.Embedding(2,100)
    
    y_idx0=torch.LongTensor([word_to_ix['ApplyEyeMakeup']])
    print('y_idx',y_idx0.size())
    y_embed00 = embeds(y_idx0)
    print('y_embed00',y_embed00.size())
    
    yd0=y_embed00
    
    y_idx1=torch.LongTensor([word_to_ix['BandMarching']])   # index for class 1
    print('y_idx',y_idx1.size())
    y_embed1 = embeds(y_idx1)
    print('y_embed1',y_embed1.size())
    
    yd1=y_embed1
    ###########################################
    for epoch in range(epochs):
        train_losses, train_scores = train(log_interval, [cnn_encoder, rnn_decoder], device, train_loader, optimizer, epoch)
        epoch_train_losses.append(train_losses)
        epoch_train_scores.append(train_scores)
        A = np.array(epoch_train_losses)
        B = np.array(epoch_train_scores)
        
        
    fig = plt.figure(figsize=(10, 4))
    plt.subplot(121)
    plt.plot(np.arange(1, epochs + 1), A[:, -1])  # train loss (on epoch end)
    #plt.plot(np.arange(1, epochs + 1), C)         #  test loss (on epoch end)
    plt.title("model loss")
    plt.xlabel('epochs')
    plt.ylabel('loss')
    #plt.legend(['train', 'test'], loc="upper left")
    plt.legend(['train'], loc="upper left")
    '''
    # 2nd figure
    plt.subplot(122)
    plt.plot(np.arange(1, epochs + 1), B[:, -1])  # train accuracy (on epoch end)
    #plt.plot(np.arange(1, epochs + 1), D)         #  test accuracy (on epoch end)
    plt.title("training scores")
    plt.xlabel('epochs')
    plt.ylabel('accuracy')
    #plt.legend(['train', 'test'], loc="upper left")
    title = "./fig_UCF101_CRNN.png"
    plt.savefig(title, dpi=600)
    # plt.close(fig)
    plt.show()
    '''
    


    Predicting flight passengers with an LSTM

    Adapted from: LSTM; thanks to the original author, Usman Malik.

    (A side recommendation: NN-SVG, a tool for drawing neural-network architectures, https://github.com/alexlenail/NN-SVG)


    This section introduces another commonly used gated recurrent network: long short-term memory (LSTM). Its structure is slightly more complex than that of the gated recurrent unit (GRU).

    1.1 Dataset and problem definition

    import 

    Let's print the list of all the datasets built into the Seaborn library:

    [

    Let's load the dataset into our application:

    flight_data 


    The dataset has three columns: year, month, and passengers. The passengers column contains the total number of passengers traveling in the given month. Let's look at the shape of the dataset:

    flight_data

    You can see that the dataset has 144 rows and 3 columns, i.e. it contains 12 years of monthly passenger records.

    The task is to predict the number of passengers in the last 12 months based on the first 132 months. Remember that we have 144 months of records: the first 132 months will be used to train our LSTM model, and model performance will be evaluated on the values of the last 12 months.

    Let's plot the monthly frequency of passenger travel.

    The following script plots the monthly passenger counts:

    fig_size 


    The output shows that the average number of passengers traveling by air has increased over the years. The number of passengers fluctuates within a year, which makes sense: during summer or winter holidays the number of travelers rises compared with the rest of the year.

    1.2 Data preprocessing

    The columns in the dataset are of type object, as the following code shows:

    flight_data

    The first step is to change the type of the passengers column to float.

    all_data 

    Next, we split the dataset into a training set and a test set. The LSTM will be trained on the training set, and the trained model will then be used to make predictions on the test set. The predictions are compared against the actual values in the test set to evaluate the trained model's performance.

    The first 132 records will be used to train the model and the last 12 records will serve as the test set. The following script splits the data accordingly.

    test_data_size 

    Our dataset is not yet normalized: the total passenger counts in the early years are far smaller than in the later years. Normalizing the data is very important for time-series forecasting; we scale the data between a minimum and maximum value, using the MinMaxScaler class from the sklearn.preprocessing module.

    The following code normalizes the data to the range -1 to 1.

    from 
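
    The code above is collapsed in this copy; a minimal sketch of that normalization step, assuming train_data is the 132-value NumPy array produced by the split above:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler(feature_range=(-1, 1))
    train_data_normalized = scaler.fit_transform(train_data.reshape(-1, 1))
    print(train_data_normalized[:5])
    print(train_data_normalized[-5:])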

    You can see that the dataset values are now between -1 and 1.

    It is important to mention that normalization should be applied only to the training data, not to the test data. If the test data were normalized as well, some information could leak from the training set into the test set.

    The next step is to convert our dataset into tensors, since PyTorch models are trained with tensors. To convert the dataset, we can simply pass it to the constructor of FloatTensor, as shown below.

    train_data_normalized 

    The final preprocessing step is to convert our training data into sequences and corresponding labels.

    You can use any sequence length, depending on domain knowledge. In our dataset, a sequence length of 12 is convenient because we have monthly data and there are 12 months in a year. If we had daily data, a better sequence length would be 365, the number of days in a year. We therefore set the input sequence length for training to 12.

    Next, we define a function named create_inout_sequences. It accepts the raw input data and returns a list of tuples. In each tuple, the first element is a list of 12 items corresponding to the passengers traveling over 12 months; the second element is a single item, the passenger count for month 12 + 1.

    train_window 

    Run the following script to create the sequences and corresponding labels for training (see the sketch below):

    train_inout_seq 
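
    The snippet is collapsed above; a minimal sketch of create_inout_sequences based on the description, assuming train_data_normalized has already been converted to a 1-D FloatTensor as described earlier:

    train_window = 12

    def create_inout_sequences(input_data, tw):
        inout_seq = []
        for i in range(len(input_data) - tw):
            train_seq = input_data[i:i + tw]              # 12 consecutive monthly values
            train_label = input_data[i + tw:i + tw + 1]   # the value for month 12 + 1
            inout_seq.append((train_seq, train_label))
        return inout_seq

    train_inout_seq = create_inout_sequences(train_data_normalized, train_window)
    print(len(train_inout_seq))   # 120 for the 132-month training set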

    If you print the length of the train_inout_seq list, you will see that it contains 120 items. This is because although the training set contains 132 elements, the sequence length is 12: the first sequence consists of the first 12 items, and the 13th item is the label for the first sequence. Likewise, the second sequence starts from the second item and ends at the 13th item, with the 14th item as its label, and so on.

    Now let's print the first 5 items of the train_inout_seq list:

    train_inout_seq

    You will see that each item is a tuple: the first element consists of the 12 items of the sequence, and the second element contains the corresponding label.

    1.3 Creating the LSTM model

    Let me summarize what happens in the code above. The constructor of the LSTM class accepts three parameters:

    input_size: the number of features in the input. Although our sequence length is 12, we only have 1 value per month (the total passenger count), so the input size is 1. hidden_layer_size: the number of hidden layers together with the number of neurons in each layer; we will have one layer of 100 neurons. output_size: the number of items in the output; since we predict the passenger count 1 month ahead, the output size is 1. Next, in the constructor we create the variables hidden_layer_size, lstm, linear, and hidden_cell. The LSTM takes three inputs: the previous hidden state, the previous cell state, and the current input. The hidden_cell variable holds the previous hidden and cell state, while the lstm and linear variables create the LSTM and linear layers.

    Inside the forward method, input_seq is passed as a parameter and is first fed to the lstm layer. The output of the lstm layer is the hidden and cell state for the current time step, together with the output; this output is passed on to the linear layer. The predicted passenger count is stored as the last item of the predictions list and is returned to the calling function.

    The next step is to create an object of the LSTM() class and to define the loss function and the optimizer.

    Let's print the model (a sketch follows below):

    class 
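
    The class definition is collapsed above; a minimal sketch consistent with the description (input_size=1, hidden_layer_size=100, output_size=1; the MSE loss is an assumption for this regression setup, not necessarily the original post's choice):

    import torch
    import torch.nn as nn

    class LSTM(nn.Module):
        def __init__(self, input_size=1, hidden_layer_size=100, output_size=1):
            super().__init__()
            self.hidden_layer_size = hidden_layer_size
            self.lstm = nn.LSTM(input_size, hidden_layer_size)
            self.linear = nn.Linear(hidden_layer_size, output_size)
            # previous hidden state and previous cell state
            self.hidden_cell = (torch.zeros(1, 1, hidden_layer_size),
                                torch.zeros(1, 1, hidden_layer_size))

        def forward(self, input_seq):
            lstm_out, self.hidden_cell = self.lstm(input_seq.view(len(input_seq), 1, -1),
                                                   self.hidden_cell)
            predictions = self.linear(lstm_out.view(len(input_seq), -1))
            return predictions[-1]   # the last prediction is the forecast for the next month

    model = LSTM()
    loss_function = nn.MSELoss()     # assumption: a standard choice for this regression task
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    print(model)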

    1.4 Training the model

    epochs 
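
    The training loop is collapsed above; a minimal sketch of one way to train this model (the epoch count and print interval are assumptions):

    epochs = 150

    for i in range(epochs):
        for seq, labels in train_inout_seq:
            optimizer.zero_grad()
            # reset the hidden and cell state before each sequence
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                                 torch.zeros(1, 1, model.hidden_layer_size))
            y_pred = model(seq)
            single_loss = loss_function(y_pred, labels)
            single_loss.backward()
            optimizer.step()
        if i % 25 == 0:
            print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')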

    1.5 Making predictions

    Now that our model is trained, we can start making predictions.

    fut_pred 

    You can compare the values above with the last 12 values of the train_data_normalized list.

    The test_inputs list will initially contain 12 items. Inside the for loop, these 12 items are used to predict the first item of the test set, i.e. item number 133; the predicted value is then appended to the test_inputs list. On the second iteration the last 12 items are again used as input and a new prediction is made, which is again appended to test_inputs. Since the test set has 12 elements, the loop runs 12 times. At the end of the loop the test_inputs list contains 24 items; the last 12 items are the predicted values for the test set.

    The following script makes the predictions:

    If you print the length of the test_inputs list, you will see that it contains 24 items. The last 12 predicted items can be printed as follows (a sketch of the whole prediction step is given below):

    model
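
    The prediction code is collapsed above; a minimal sketch of the loop described in the preceding paragraph, assuming the model, train_window and train_data_normalized defined earlier:

    fut_pred = 12
    test_inputs = train_data_normalized[-train_window:].tolist()   # last 12 normalized training values

    model.eval()
    for i in range(fut_pred):
        seq = torch.FloatTensor(test_inputs[-train_window:])
        with torch.no_grad():
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                                 torch.zeros(1, 1, model.hidden_layer_size))
            test_inputs.append(model(seq).item())

    print(len(test_inputs))        # 24
    print(test_inputs[fut_pred:])  # the 12 (still normalized) predictions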

    Since we normalized the training dataset, the predicted values are normalized as well. We need to convert the normalized predictions back into actual predicted values.

    actual_predictions 

    Now let's plot the predicted values against the actual values. Look at the following code:

    x 

    In the script above we create a list that contains the numeric indices of the last 12 months. The index of the first month is 0, so the index of the last month is 143.

    In the script below we plot the total passenger counts for all 144 months, together with the predicted passenger counts for the last 12 months.

    plt


    The predictions made by our LSTM are shown by the orange line. You can see that the algorithm is not very accurate, but it still captures the upward trend in the total number of passengers over the last 12 months, as well as the occasional fluctuations. You can try a larger number of epochs and more neurons in the LSTM layer to see whether you can get better performance.

    For a better view of the output, we can plot the actual and predicted passenger counts for just the last 12 months, as follows:

    plt



    Preface

    This post uses PyTorch and two kinds of neural network (an LSTM and a CNN) to perform sentiment classification of Chinese text. The code is commented in detail. The corpus is Tan Songbo's hotel-review corpus, which contains 3,000 negative reviews and 7,000 positive reviews.


    1. Data processing and Word2vec word-vector training

    The raw corpus data is shown in the figure below.
    Figure 1: raw corpus data
    The text files were cleaned so that the leading "1" label and the following space were stripped from each review; the cleaned text (figure omitted) is what the program takes as input.
    The input text is then preprocessed and segmented into words with the jieba library.

    def del_stop_words(text): #分词
    	word_ls = jieba.lcut(text)
    	#word_ls = [i for i in word_ls if i not in stopwords]
    	return word_ls
    
    with open("F:/python_data/practice/tansongbo/neg.txt", "r", encoding='UTF-8') as e:     # 加载负面语料
        neg_data1 = e.readlines()
    
    with open("F:/python_data/practice/tansongbo/pos.txt", "r", encoding='UTF-8') as s:     # 加载正面语料
        pos_data1 = s.readlines()
    
    neg_data = sorted(set(neg_data1), key=neg_data1.index)  #列表去重 保持原来的顺序
    pos_data = sorted(set(pos_data1), key=pos_data1.index)
    
    neg_data = [del_stop_words(data.replace("\n", "")) for data in neg_data]   # 处理负面语料
    pos_data = [del_stop_words(data.replace("\n", "")) for data in pos_data]
    all_sentences = neg_data + pos_data  # 全部语料 用于训练word2vec
    

    训练词向量,创建词向量词典

    ####训练过一次后可以不再训练词向量模型####
    
    ####用于训练词向量模型###
    
    model = Word2Vec(all_sentences,     # 上文处理过的全部语料
                     size=100,  # 词向量维度 默认100维
                     min_count=1,  # 词频阈值 词出现的频率 小于这个频率的词 将不予保存
                     window=5  # 窗口大小 表示当前词与预测词在一个句子中的最大距离是多少
                     )
    model.save('f.model')  # 保存模型
    
    #加载模型,提取出词索引和词向量
    def create_dictionaries(model):
    	
        gensim_dict = Dictionary()    # 创建词语词典
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
    
        w2indx = {v: k + 1 for k, v in gensim_dict.items()}  # 词语的索引,从1开始编号
        w2vec = {word: model[word] for word in w2indx.keys()}  # 词语的词向量
        return w2indx, w2vec
    
    model = Word2Vec.load('F:/python_data/practice/tansongbo/f.model')         # 加载模型
    index_dict, word_vectors= create_dictionaries(model)  # 索引字典、词向量字典
    
    #使用pickle进行字典索引与词向量的存储
    output = open('F:/python_data/practice/tansongbo/dict.txt' + ".pkl", 'wb')      
    pickle.dump(index_dict, output)  # 索引字典
    pickle.dump(word_vectors, output)  # 词向量字典
    output.close()
    

    2. Building the input batches for the network

    Convert the text sentences into multi-dimensional word-vector matrices and build the batches that are fed into the network.

    # parameter settings
    vocab_dim = 100   # word-vector dimension
    maxlen = 28       # maximum text length to keep
    n_epoch = 10      # number of epochs
    batch_size = 64   # number of sentences fed to the network per batch
    
    #加载词向量数据,填充词向量矩阵
    f = open("F:/python_data/practice/tansongbo/dict.txt.pkl", 'rb')  # 预先训练好的
    index_dict = pickle.load(f)    # 索引字典,{单词: 索引数字}
    word_vectors = pickle.load(f)  # 词向量, {单词: 词向量(100维长的数组)}
    
    n_symbols = len(index_dict) + 1  # 索引数字的个数,因为有的词语索引为0,所以+1
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # 创建一个n_symbols * 100的0矩阵
    
    for w, index in index_dict.items():  # 从索引为1的词语开始,用词向量填充矩阵
        embedding_weights[index, :] = word_vectors[w]  # 词向量矩阵,第一行是0向量(没有索引为0的词语,未被填充)
        
    #将文本数据映射成数字(是某个词的编号,不是词向量)    
    def text_to_index_array(p_new_dic, p_sen): 
        
        ##文本或列表转换为索引数字
        
        if type(p_sen) == list:
            new_sentences = []
            for sen in p_sen:
                new_sen = []
                for word in sen:
                    try:
                        new_sen.append(p_new_dic[word])  # 单词转索引数字
                    except:
                        new_sen.append(0)  # 索引字典里没有的词转为数字0
                new_sentences.append(new_sen)
            return np.array(new_sentences)   # 转numpy数组
        else:
            new_sentences = []
            sentences = []
            p_sen = p_sen.split(" ")
            for word in p_sen:
                try:
                    sentences.append(p_new_dic[word])  # 单词转索引数字
                except:
                    sentences.append(0)  # 索引字典里没有的词转为数字0
            new_sentences.append(sentences)
            return new_sentences
    
    #将数据切割成一样的指定长度    
    def text_cut_to_same_long(sents):
        data_num = len(sents)
        new_sents = np.zeros((data_num,maxlen)) #构建一个矩阵来装修剪好的数据
        se = []
        for i in range(len(sents)):
            new_sents[i,:] = sents[i,:maxlen]        
        new_sents = np.array(new_sents)
        return new_sents
    
    #将每个句子的序号矩阵替换成词向量矩阵
    def creat_wordvec_tensor(embedding_weights,X_T):
        X_tt = np.zeros((len(X_T),maxlen,vocab_dim))
        num1 = 0
        num2 = 0
        for j in X_T:
            for i in j:
                X_tt[num1,num2,:] = embedding_weights[int(i),:]
                num2 = num2+1
            num1 = num1+1
            num2 = 0
        return X_tt
        
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print ('正在使用计算的是:%s'%device)    
    data = all_sentences  #获取之前分好词的数据
    # 读取语料类别标签
    label_list = ([0] * len(neg_data) + [1] * len(pos_data))
    
    # 划分训练集和测试集,此时都是list列表
    X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(data, label_list, test_size=0.2)
    #print (X_train_l[0])
    # 转为数字索引形式
    
    # token = Tokenizer(num_words=3000)   #字典数量
    # token.fit_on_texts(train_text)
    
    X_train = text_to_index_array(index_dict, X_train_l)
    X_test = text_to_index_array(index_dict, X_test_l)
    #print("训练集shape: ", X_train[0])
    
    
    y_train = np.array(y_train_l)  # 转numpy数组
    y_test = np.array(y_test_l)
    
    ##将数据切割成一样的指定长度
    from torch.nn.utils.rnn import pad_sequence
    #将数据补长变成和最长的一样长
    X_train = pad_sequence([torch.from_numpy(np.array(x)) for x in X_train],batch_first=True).float() 
    X_test = pad_sequence([torch.from_numpy(np.array(x)) for x in X_test],batch_first=True).float()
    #将数据切割成需要的样子
    X_train = text_cut_to_same_long(X_train)
    X_test = text_cut_to_same_long(X_test)
    
    #将词向量字典序号转换为词向量矩阵
    X_train = creat_wordvec_tensor(embedding_weights,X_train)
    X_test = creat_wordvec_tensor(embedding_weights,X_test)
    
    #print("训练集shape: ", X_train.shape)
    #print("测试集shape: ", X_test.shape)
    
    ####Datloader和创建batch#### 
    from torch.utils.data import TensorDataset, DataLoader
     
    # 创建Tensor datasets
    train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
    test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))
     
    # shuffle是打乱数据顺序
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
    

    3. Neural network models

    1.LSTM

    class lstm(nn.Module):
        def __init__(self):
            super(lstm, self).__init__()
            self.lstm = nn.LSTM(
                input_size=vocab_dim,
                hidden_size=128,
                batch_first=True)     #batch_first 是因为DataLoader所读取的数据与lstm所需的输入input格式是不同的,
                                      #所在的位置不同,故通过batch_first进行修改
            self.fc = nn.Linear(128, 2)#连接层的输入维数是hidden_size的大小
            
        def forward(self, x):
            out, (h_0, c_0) = self.lstm(x)
            out = out[:, -1, :]
            out = self.fc(out)
            out = F.softmax(out, dim= 1)
            return out, h_0
    
    model = lstm()
    optimizer = torch.optim.Adam(model.parameters())
    model = model.to(device)    #将模型放入GPU
    
    

    2.CNN

    class CNN(nn.Module):
        def __init__(self, embedding_dim, n_filters, filter_sizes, dropout):
            super(CNN, self).__init__()
    
            self.convs = nn.ModuleList([
                nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
                for fs in filter_sizes])   #.ModuleList将模块放入一个列表
    
            self.fc = nn.Linear(n_filters * len(filter_sizes), 2)
    
            self.dropout = nn.Dropout(dropout)  #防止过拟合
    
        def forward(self, text):
    
            # text = [batch_size, sent_len, emb_dim]
    
            embedded = text.unsqueeze(1)
    
            # embedded = [batch_size, 1, sent_len, emb_dim]
    
            convd = [conv(embedded).squeeze(3) for conv in self.convs]
    
            # conv_n = [batch_size, n_filters, sent_len - fs + 1]
    
            pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in convd]
    
            # pooled_n = [batch_size, n_filters]
    
            cat = self.dropout(torch.cat(pooled, dim=1))  #torch.cat使张量进行拼接
    
            # cat = [batch_size, n_filters * len(filter_sizes)]
    
            return self.fc(cat)
    
    n_filters = 100
    filter_sizes = [2, 3, 4]
    dropout = 0.5
    
    model = CNN(vocab_dim, n_filters, filter_sizes, dropout)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    

    4. Training and testing

    The code below is for the LSTM model; the CNN version is essentially the same, the main difference being how the output is handled. See the complete code at the end for the exact differences.

    ####训练train data####
    from sklearn.metrics import accuracy_score, classification_report
    print ('————————进行训练集训练————————')
    for epoch in range(n_epoch):
        correct = 0
        total = 0
        epoch_loss = 0
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):        
            #print (data.shape)
           
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()   ##要保证label的数据类型是long
            optimizer.zero_grad()
            data,target = data.cuda(),target.cuda()  #将数据放入GPU
            output, h_state = model(data)
            #labels = output.argmax(dim= 1)
            #acc = accuracy_score(target, labels)
            
            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)
            
            #梯度清零;反向传播;
            optimizer.zero_grad()
            loss = F.cross_entropy(output, target) #交叉熵损失函数;
            epoch_loss += loss.item()
            loss.backward() 
            optimizer.step()
        
        loss = epoch_loss / (batch_idx + 1)
        print ('epoch:%s'%epoch, 'accuracy:%.3f%%'%(correct *100 / total), 'loss = %s'%loss)
        
    #### Evaluate on the test set ####
    print ('———————— evaluating on the test set ————————')
    correct = 0
    total = 0
    epoch_loss = 0
    model.eval()                     # evaluation mode: no dropout and no parameter updates
    with torch.no_grad():            # gradients are not needed during evaluation
        for batch_idx, (data, target) in enumerate(test_loader):
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()   # labels must be of type long
            data, target = data.to(device), target.to(device)
            output, h_state = model(data)

            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)

            loss = F.cross_entropy(output, target)   # cross-entropy loss
            epoch_loss += loss.item()

    loss = epoch_loss / (batch_idx + 1)
    print ('accuracy:%.3f%%'%(correct *100 / total), 'loss = %s'%loss)
    

    5. Experimental results

    1. LSTM
    Trained for 40 epochs; the final accuracy is around 83%.
    2. CNN
    Trained for 10 epochs; the accuracy is around 78%.

    6. Complete code

    1.LSTM

    # -*- coding: utf-8 -*-
    ####数据预处理####
    #分词
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import numpy as np
    import jieba
    from sklearn.model_selection import train_test_split
    
    #f = open('./stop_words.txt', encoding='utf-8')         # 加载停用词
    #stopwords = [i.replace("\n", "") for i in f.readlines()]    # 停用词表
    
    def del_stop_words(text): #分词
    	word_ls = jieba.lcut(text)
    	#word_ls = [i for i in word_ls if i not in stopwords]
    	return word_ls
    
    with open("F:/python_data/practice/tansongbo/neg.txt", "r", encoding='UTF-8') as e:     # 加载负面语料
        neg_data1 = e.readlines()
    
    with open("F:/python_data/practice/tansongbo/pos.txt", "r", encoding='UTF-8') as s:     # 加载正面语料
        pos_data1 = s.readlines()
    
    neg_data = sorted(set(neg_data1), key=neg_data1.index)  #列表去重 保持原来的顺序
    pos_data = sorted(set(pos_data1), key=pos_data1.index)
    
    neg_data = [del_stop_words(data.replace("\n", "")) for data in neg_data]   # 处理负面语料
    pos_data = [del_stop_words(data.replace("\n", "")) for data in pos_data]
    all_sentences = neg_data + pos_data  # 全部语料 用于训练word2vec
    
    ####文本向量化####
    #创建word2vec词向量模型
    from gensim.models.word2vec import Word2Vec
    from gensim.corpora.dictionary import Dictionary
    import pickle
    import logging
    
    #logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # 将日志输出到控制台
    
    ####训练过一次后可以不再训练词向量模型####
    
    ####用于训练词向量模型###
    
    model = Word2Vec(all_sentences,     # 上文处理过的全部语料
                     size=100,  # 词向量维度 默认100维
                     min_count=1,  # 词频阈值 词出现的频率 小于这个频率的词 将不予保存
                     window=5  # 窗口大小 表示当前词与预测词在一个句子中的最大距离是多少
                     )
    model.save('f.model')  # 保存模型
    
    #加载模型,提取出词索引和词向量
    def create_dictionaries(model):
    	
        gensim_dict = Dictionary()    # 创建词语词典
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
    
        w2indx = {v: k + 1 for k, v in gensim_dict.items()}  # 词语的索引,从1开始编号
        w2vec = {word: model[word] for word in w2indx.keys()}  # 词语的词向量
        return w2indx, w2vec
    
    model = Word2Vec.load('F:/python_data/practice/tansongbo/f.model')         # 加载模型
    index_dict, word_vectors= create_dictionaries(model)  # 索引字典、词向量字典
    
    #使用pickle进行字典索引与词向量的存储
    output = open('F:/python_data/practice/tansongbo/dict.txt' + ".pkl", 'wb')      
    pickle.dump(index_dict, output)  # 索引字典
    pickle.dump(word_vectors, output)  # 词向量字典
    output.close()
    
    
    ####LSTM训练####
    #参数设置
    vocab_dim = 100 # 向量维度
    maxlen = 50 # 文本保留的最大长度
    n_epoch = 40   # 迭代次数
    batch_size = 64    #每次送入网络的句子数
    
    #加载词向量数据,填充词向量矩阵
    f = open("F:/python_data/practice/tansongbo/dict.txt.pkl", 'rb')  # 预先训练好的
    index_dict = pickle.load(f)    # 索引字典,{单词: 索引数字}
    word_vectors = pickle.load(f)  # 词向量, {单词: 词向量(100维长的数组)}
    
    n_symbols = len(index_dict) + 1  # 索引数字的个数,因为有的词语索引为0,所以+1
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # 创建一个n_symbols * 100的0矩阵
    
    for w, index in index_dict.items():  # 从索引为1的词语开始,用词向量填充矩阵
        embedding_weights[index, :] = word_vectors[w]  # 词向量矩阵,第一行是0向量(没有索引为0的词语,未被填充)
        
    #将文本数据映射成数字(是某个词的编号,不是词向量)    
    def text_to_index_array(p_new_dic, p_sen): 
        
        ##文本或列表转换为索引数字
        
        if type(p_sen) == list:
            new_sentences = []
            for sen in p_sen:
                new_sen = []
                for word in sen:
                    try:
                        new_sen.append(p_new_dic[word])  # 单词转索引数字
                    except:
                        new_sen.append(0)  # 索引字典里没有的词转为数字0
                new_sentences.append(new_sen)
            return np.array(new_sentences)   # 转numpy数组
        else:
            new_sentences = []
            sentences = []
            p_sen = p_sen.split(" ")
            for word in p_sen:
                try:
                    sentences.append(p_new_dic[word])  # 单词转索引数字
                except:
                    sentences.append(0)  # 索引字典里没有的词转为数字0
            new_sentences.append(sentences)
            return new_sentences
    
    #将数据切割成一样的指定长度    
    def text_cut_to_same_long(sents):
        data_num = len(sents)
        new_sents = np.zeros((data_num,maxlen)) #构建一个矩阵来装修剪好的数据
        se = []
        for i in range(len(sents)):
            new_sents[i,:] = sents[i,:maxlen]        
        new_sents = np.array(new_sents)
        return new_sents
        
    #加载数据特征与标签,将数据特征映射成数字,分割训练集与测试集
    
    with open("F:/python_data/practice/tansongbo/neg.txt", "r", encoding='UTF-8') as f:
                neg_data1 = f.readlines()
    with open("F:/python_data/practice/tansongbo/pos.txt", "r", encoding='UTF-8') as g:
        pos_data1 = g.readlines()
    neg_data = sorted(set(neg_data1), key=neg_data1.index)  #列表去重 保持原来的顺序
    pos_data = sorted(set(pos_data1), key=pos_data1.index)
    
    neg_data = [del_stop_words(data) for data in neg_data]
    pos_data = [del_stop_words(data) for data in pos_data]
    data = neg_data + pos_data
    
    #将每个句子的序号矩阵替换成词向量矩阵
    def creat_wordvec_tensor(embedding_weights,X_T):
        X_tt = np.zeros((len(X_T),maxlen,vocab_dim))
        num1 = 0
        num2 = 0
        for j in X_T:
            for i in j:
                X_tt[num1,num2,:] = embedding_weights[int(i),:]
                num2 = num2+1
            num1 = num1+1
            num2 = 0
        return X_tt
        
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Computing on device: %s' % device)
    data = all_sentences  # the corpus tokenised earlier
    # Build the label list: 0 = negative, 1 = positive
    label_list = [0] * len(neg_data) + [1] * len(pos_data)
    
    # Split into training and test sets (still plain Python lists at this point)
    X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(data, label_list, test_size=0.2)
    
    # Convert the texts to integer index sequences
    # (a Keras Tokenizer, e.g. Tokenizer(num_words=3000), could do the same job but is not used here)
    X_train = text_to_index_array(index_dict, X_train_l)
    X_test = text_to_index_array(index_dict, X_test_l)
    
    y_train = np.array(y_train_l)  # convert the labels to numpy arrays
    y_test = np.array(y_test_l)
    
    ## Make every sequence exactly `maxlen` long
    from torch.nn.utils.rnn import pad_sequence
    # First pad every sequence to the length of the longest one
    X_train = pad_sequence([torch.from_numpy(np.array(x)) for x in X_train], batch_first=True).float()
    X_test = pad_sequence([torch.from_numpy(np.array(x)) for x in X_test], batch_first=True).float()
    # Then truncate everything to the fixed length
    X_train = text_cut_to_same_long(X_train)
    X_test = text_cut_to_same_long(X_test)
    
    # Replace the word indices with the actual word vectors
    X_train = creat_wordvec_tensor(embedding_weights, X_train)
    X_test = creat_wordvec_tensor(embedding_weights, X_test)
    
    # print("training set shape: ", X_train.shape)
    # print("test set shape: ", X_test.shape)
    
    #### DataLoader and batch creation ####
    from torch.utils.data import TensorDataset, DataLoader
    
    # Create Tensor datasets
    train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
    test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))
    
    # shuffle=True randomises the sample order each epoch
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
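    # Optional check (not in the original script): peek at one batch to confirm the shapes
    # the network will receive: (batch_size, maxlen, vocab_dim) features and (batch_size,) labels.
    xb, yb = next(iter(train_loader))
    print(xb.shape, yb.shape)  # e.g. torch.Size([64, 50, 100]) torch.Size([64])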
        
    class lstm(nn.Module):
        def __init__(self):
            super(lstm, self).__init__()
            self.lstm = nn.LSTM(
                input_size=vocab_dim,
                hidden_size=64,
                batch_first=True)      # batch_first=True because the DataLoader yields (batch, seq_len, feature),
                                       # while nn.LSTM defaults to (seq_len, batch, feature)
            self.fc = nn.Linear(64, 2) # the linear layer's input size equals hidden_size
    
        def forward(self, x):
            out, (h_n, c_n) = self.lstm(x)  # out: (batch, seq_len, hidden_size)
            out = out[:, -1, :]             # keep only the last time step
            out = self.fc(out)              # raw logits: F.cross_entropy applies log-softmax itself,
                                            # so no extra sigmoid/softmax is needed here
            return out, h_n
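    # Optional shape check with random data (a sketch, not part of the training run):
    # one forward pass through a fresh lstm instance makes the tensor shapes explicit.
    _dummy = torch.randn(2, maxlen, vocab_dim)   # (batch=2, seq_len=maxlen, input_size=vocab_dim)
    _logits, _h_n = lstm()(_dummy)
    print(_logits.shape, _h_n.shape)             # torch.Size([2, 2]) torch.Size([1, 2, 64])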
    
    model = lstm()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    
    #### Train on the training set ####
    from sklearn.metrics import accuracy_score, classification_report  # available for more detailed metrics
    print('======== Training on the training set ========')
    for epoch in range(n_epoch):
        correct = 0
        total = 0
        epoch_loss = 0
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()                              # labels must be long for cross_entropy
            data, target = data.to(device), target.to(device)   # move the batch to the chosen device
            output, h_state = model(data)
    
            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)
    
            optimizer.zero_grad()                    # clear old gradients
            loss = F.cross_entropy(output, target)   # cross-entropy loss
            epoch_loss += loss.item()
            loss.backward()                          # back-propagate
            optimizer.step()
    
        loss = epoch_loss / (batch_idx + 1)
        print('epoch:%s' % epoch, 'accuracy:%.3f%%' % (correct * 100 / total), 'loss = %s' % loss)
        
    #### Evaluate on the test set ####
    print('======== Evaluating on the test set ========')
    model.eval()                        # evaluation mode: the test set must not update the weights
    correct = 0
    total = 0
    epoch_loss = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()
            data, target = data.to(device), target.to(device)
            output, h_state = model(data)
    
            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)
    
            loss = F.cross_entropy(output, target)
            epoch_loss += loss.item()
    
    loss = epoch_loss / (batch_idx + 1)
    print('test accuracy:%.3f%%' % (correct * 100 / total), 'loss = %s' % loss)
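    # Optional (a sketch, not in the original script): since classification_report is imported above,
    # per-class precision/recall can be computed by collecting predictions over the test loader once more.
    all_preds, all_targets = [], []
    model.eval()
    with torch.no_grad():
        for data, target in test_loader:
            data = torch.as_tensor(data, dtype=torch.float32).to(device)
            output, _ = model(data)
            all_preds.extend(torch.argmax(output, dim=1).cpu().tolist())
            all_targets.extend(target.tolist())
    print(classification_report(all_targets, all_preds, target_names=['neg', 'pos']))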
    

    2.CNN

    # -*- coding: utf-8 -*-
    #### Data preprocessing ####
    # Tokenisation
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import numpy as np
    import jieba
    from sklearn.model_selection import train_test_split
    
    # f = open('./stop_words.txt', encoding='utf-8')              # load stop words
    # stopwords = [i.replace("\n", "") for i in f.readlines()]    # stop-word list
    
    def del_stop_words(text):  # tokenise with jieba (stop-word removal is commented out)
        word_ls = jieba.lcut(text)
        # word_ls = [i for i in word_ls if i not in stopwords]
        return word_ls
    
    with open("F:/python_data/practice/tansongbo/neg.txt", "r", encoding='UTF-8') as e:  # load the negative corpus
        neg_data1 = e.readlines()
    
    with open("F:/python_data/practice/tansongbo/pos.txt", "r", encoding='UTF-8') as s:  # load the positive corpus
        pos_data1 = s.readlines()
    
    neg_data = sorted(set(neg_data1), key=neg_data1.index)  # de-duplicate while keeping the original order
    pos_data = sorted(set(pos_data1), key=pos_data1.index)
    
    neg_data = [del_stop_words(data.replace("\n", "")) for data in neg_data]  # tokenise the negative corpus
    pos_data = [del_stop_words(data.replace("\n", "")) for data in pos_data]
    all_sentences = neg_data + pos_data  # the full corpus, used to train word2vec
    
    #### Text vectorisation ####
    # Train a word2vec model. The API below follows gensim 3.x: in gensim 4.x,
    # `size` became `vector_size` and `model.wv.vocab` became `model.wv.key_to_index`.
    from gensim.models.word2vec import Word2Vec
    from gensim.corpora.dictionary import Dictionary
    import pickle
    import logging
    
    # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # log to the console
    
    #### The word-vector model only needs to be trained once; later runs can simply load it ####
    
    #### Train the word-vector model ####
    model = Word2Vec(all_sentences,  # the full tokenised corpus from above
                     size=100,       # word-vector dimension (default 100)
                     min_count=1,    # frequency threshold: words rarer than this are dropped
                     window=5        # window size: maximum distance between the current and the predicted word
                     )
    model.save('f.model')  # save the model
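    # Once trained, the embeddings can be sanity-checked before building the dictionaries.
    # This is a minimal sketch, not part of the original pipeline; it assumes gensim 3.x
    # (in 4.x, iterate model.wv.key_to_index instead of model.wv.vocab).
    some_word = next(iter(model.wv.vocab))                       # pick an arbitrary word from the learned vocabulary
    print(some_word, model.wv.most_similar(some_word, topn=5))   # its five nearest neighbours in embedding space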
    
    # Load the model and extract the word-index and word-vector dictionaries
    def create_dictionaries(model):
        gensim_dict = Dictionary()  # build a word dictionary
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
    
        w2indx = {v: k + 1 for k, v in gensim_dict.items()}    # word indices, numbered from 1
        w2vec = {word: model[word] for word in w2indx.keys()}  # word vectors
        return w2indx, w2vec
    
    model = Word2Vec.load('F:/python_data/practice/tansongbo/f.model')  # load the model
    index_dict, word_vectors = create_dictionaries(model)  # index dictionary and word-vector dictionary
    
    # Store the index dictionary and the word vectors with pickle
    with open('F:/python_data/practice/tansongbo/dict.txt' + ".pkl", 'wb') as output:
        pickle.dump(index_dict, output)    # index dictionary
        pickle.dump(word_vectors, output)  # word-vector dictionary
    
    
    #### CNN training ####
    # Hyperparameters
    vocab_dim = 100    # word-vector dimension
    maxlen = 28        # maximum number of tokens kept per text
    n_epoch = 10       # number of training epochs
    batch_size = 64    # number of sentences fed to the network per step
    
    # Load the pre-trained word vectors and fill the embedding matrix
    with open("F:/python_data/practice/tansongbo/dict.txt.pkl", 'rb') as f:  # produced by the word2vec step above
        index_dict = pickle.load(f)    # index dictionary, {word: integer index}
        word_vectors = pickle.load(f)  # word vectors, {word: 100-dim array}
    
    n_symbols = len(index_dict) + 1  # number of indices; +1 because index 0 is reserved for unknown words
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # an n_symbols x 100 zero matrix
    
    for w, index in index_dict.items():  # word indices start at 1, so row 0 stays all zeros
        embedding_weights[index, :] = word_vectors[w]  # fill row `index` with the vector of word w
        
    # Map texts to integer word indices (IDs, not word vectors)
    def text_to_index_array(p_new_dic, p_sen):
        ## Convert a list of tokenised sentences, or a single space-separated string, to index sequences
        if isinstance(p_sen, list):
            new_sentences = []
            for sen in p_sen:
                new_sen = []
                for word in sen:
                    new_sen.append(p_new_dic.get(word, 0))  # words missing from the dictionary map to 0
                new_sentences.append(new_sen)
            return np.array(new_sentences, dtype=object)   # sequences have different lengths, so use an object array
        else:
            p_sen = p_sen.split(" ")
            sentence = [p_new_dic.get(word, 0) for word in p_sen]  # words missing from the dictionary map to 0
            return [sentence]
    
    # Truncate every (already padded) sequence to the fixed length `maxlen`
    def text_cut_to_same_long(sents):
        data_num = len(sents)
        new_sents = np.zeros((data_num, maxlen))  # matrix that will hold the truncated data
        for i in range(data_num):
            # assumes the padded length is at least `maxlen`, i.e. the longest text has >= maxlen tokens
            new_sents[i, :] = sents[i, :maxlen]
        return new_sents
    
    # Replace each sentence's index sequence with the corresponding word-vector matrix
    def creat_wordvec_tensor(embedding_weights, X_T):
        X_tt = np.zeros((len(X_T), maxlen, vocab_dim))
        for row, sentence in enumerate(X_T):
            for col, idx in enumerate(sentence):
                X_tt[row, col, :] = embedding_weights[int(idx), :]  # look up the vector for word index idx
        return X_tt
        
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Computing on device: %s' % device)
    data = all_sentences  # the corpus tokenised earlier
    # Build the label list: 0 = negative, 1 = positive
    label_list = [0] * len(neg_data) + [1] * len(pos_data)
    
    # Split into training and test sets (still plain Python lists at this point)
    X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(data, label_list, test_size=0.2)
    
    # Convert the texts to integer index sequences
    # (a Keras Tokenizer, e.g. Tokenizer(num_words=3000), could do the same job but is not used here)
    X_train = text_to_index_array(index_dict, X_train_l)
    X_test = text_to_index_array(index_dict, X_test_l)
    
    y_train = np.array(y_train_l)  # convert the labels to numpy arrays
    y_test = np.array(y_test_l)
    
    ## Make every sequence exactly `maxlen` long
    from torch.nn.utils.rnn import pad_sequence
    # First pad every sequence to the length of the longest one
    X_train = pad_sequence([torch.from_numpy(np.array(x)) for x in X_train], batch_first=True).float()
    X_test = pad_sequence([torch.from_numpy(np.array(x)) for x in X_test], batch_first=True).float()
    # Then truncate everything to the fixed length
    X_train = text_cut_to_same_long(X_train)
    X_test = text_cut_to_same_long(X_test)
    
    # Replace the word indices with the actual word vectors
    X_train = creat_wordvec_tensor(embedding_weights, X_train)
    X_test = creat_wordvec_tensor(embedding_weights, X_test)
    
    #### DataLoader and batch creation ####
    from torch.utils.data import TensorDataset, DataLoader
    
    # Create Tensor datasets
    train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
    test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))
    
    # shuffle=True randomises the sample order each epoch
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
    
    
        
    class CNN(nn.Module):
        def __init__(self, embedding_dim, n_filters, filter_sizes, dropout):
            super(CNN, self).__init__()
    
            self.convs = nn.ModuleList([
                nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
                for fs in filter_sizes])   # nn.ModuleList holds one convolution per filter size
    
            self.fc = nn.Linear(n_filters * len(filter_sizes), 2)
    
            self.dropout = nn.Dropout(dropout)  # dropout to reduce overfitting
    
        def forward(self, text):
            # text = [batch_size, sent_len, emb_dim]
            embedded = text.unsqueeze(1)
            # embedded = [batch_size, 1, sent_len, emb_dim]
    
            # note: many TextCNN implementations apply F.relu to the convolution output here
            convd = [conv(embedded).squeeze(3) for conv in self.convs]
            # conv_n = [batch_size, n_filters, sent_len - fs + 1]
    
            pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in convd]
            # pooled_n = [batch_size, n_filters]
    
            cat = self.dropout(torch.cat(pooled, dim=1))  # torch.cat concatenates along the filter dimension
            # cat = [batch_size, n_filters * len(filter_sizes)]
    
            return self.fc(cat)
    
    n_filters = 100
    filter_sizes = [2, 3, 4]
    dropout = 0.5
    
    model = CNN(vocab_dim, n_filters, filter_sizes, dropout)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
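    # Optional shape check (a sketch, not part of the training run): one dummy forward pass
    # shows how the three filter sizes combine into a 300-dim feature vector and then 2 logits.
    _dummy = torch.randn(2, maxlen, vocab_dim).to(device)  # (batch=2, sent_len=maxlen, emb_dim=vocab_dim)
    print(model(_dummy).shape)                              # torch.Size([2, 2])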
    
    #### Train on the training set ####
    from sklearn.metrics import accuracy_score, classification_report  # available for more detailed metrics
    print('======== Training on the training set ========')
    for epoch in range(n_epoch):
        correct = 0
        total = 0
        epoch_loss = 0
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()                              # labels must be long for cross_entropy
            data, target = data.to(device), target.to(device)   # move the batch to the chosen device
            output = model(data)
    
            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)
    
            optimizer.zero_grad()                    # clear old gradients
            loss = F.cross_entropy(output, target)   # cross-entropy loss
            epoch_loss += loss.item()
            loss.backward()                          # back-propagate
            optimizer.step()
    
        loss = epoch_loss / (batch_idx + 1)
        print('epoch:%s' % epoch, 'accuracy:%.3f%%' % (correct * 100 / total), 'loss = %s' % loss)
        
    #### Evaluate on the test set ####
    print('======== Evaluating on the test set ========')
    model.eval()                        # evaluation mode: the test set must not update the weights
    correct = 0
    total = 0
    epoch_loss = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(test_loader):
            data = torch.as_tensor(data, dtype=torch.float32)
            target = target.long()
            data, target = data.to(device), target.to(device)
            output = model(data)
    
            correct += int(torch.sum(torch.argmax(output, dim=1) == target))
            total += len(target)
    
            loss = F.cross_entropy(output, target)
            epoch_loss += loss.item()
    
    loss = epoch_loss / (batch_idx + 1)
    print('test accuracy:%.3f%%' % (correct * 100 / total), 'loss = %s' % loss)
    