
    Contents

    After obtaining the text

    Convert all letters to lowercase or uppercase

    Convert numbers to words or remove numbers

    Remove punctuation, accents, and other diacritics

    Remove whitespace

    Expand contractions

    Remove stop words, sparse words, and specific words

    Text normalization: reducing a word to its stem or root form

    Part-of-speech tagging

    Chunking: identifying the constituents of a sentence (nouns, verbs, adjectives, etc.) and linking them into larger, sequentially ordered units with discrete grammatical meaning (noun groups or phrases, verb groups, etc.)

    Named entity recognition

    Collocation extraction


    1. Convert text to lowercase

    Python code:

    input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
    input_str = input_str.lower()
    print(input_str)

    Output:

    the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

    Convert numbers to words or remove numbers

    2. Remove numbers

    Remove digits that are irrelevant to the analysis; regular expressions are usually used for this.

    Python code:

    import re
    input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
    result = re.sub(r'\d+', '', input_str)
    print(result)

    Output:

    Box A contains red and white balls, while Box B contains red and blue balls.

     

    Remove punctuation, accents, and other diacritics

    3. Remove punctuation

    The following code removes this set of symbols: [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:

    Python code:

    import string
    input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # sample string
    result = input_str.translate(str.maketrans('', '', string.punctuation))
    print(result)

    Output:

    This is an example of string with punctuation

    4. Remove whitespace

    To remove leading and trailing whitespace, use the strip() function.

    Python code:

    input_str = " \t a string example\t "
    input_str = input_str.strip()
    input_str

    Output:

    'a string example'

    Expand contractions
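
    Contractions such as "don't" or "we're" are expanded to their full forms ("do not", "we are") so that later steps see a single canonical spelling. The excerpt gives no code for this step; a minimal dictionary-based sketch (the mapping below is illustrative only; dedicated libraries cover far more cases) could look like this:

    import re

    # Small illustrative mapping; a real application would use a fuller list
    # or a dedicated library.
    CONTRACTION_MAP = {
        "can't": "cannot",
        "won't": "will not",
        "n't": " not",
        "'re": " are",
        "'ve": " have",
        "'ll": " will",
    }

    def expand_contractions(text, mapping=CONTRACTION_MAP):
        # Replace longer patterns first so "can't" is handled before the generic "n't".
        for pattern in sorted(mapping, key=len, reverse=True):
            text = re.sub(re.escape(pattern), mapping[pattern], text)
        return text

    print(expand_contractions("We've seen that they can't come and won't call."))
    # We have seen that they cannot come and will not call.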

     

    5. Remove stop words, sparse words, and specific words

    "Stop words" are the most common words in a language, such as "the", "a", "on", "is", "all". They carry little meaning and are usually removed from the text. This can be done with the Natural Language Toolkit (NLTK), a suite of libraries for symbolic and statistical natural language processing.

    Python code:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    input_str = "NLTK is a leading platform for building Python programs to work with human language data."
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(input_str)
    result = [i for i in tokens if not i in stop_words]
    print(result)

    Output:

    ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
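
    If the NLTK data packages are not installed yet, the stop-word list and the tokenizer models used above can be fetched once beforehand:

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')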

    Text normalization: reducing a word to its stem or root form

    6. Stemming

    Stemming is the process of reducing a word to its stem or root form (for example, books - book, looked - look). The two main algorithms are the Porter stemming algorithm (which removes common morphological endings from words) and the Lancaster stemming algorithm (a more aggressive stemmer).

    Python code:

    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    stemmer = PorterStemmer()
    input_str = "There are several types of stemming algorithms."
    input_str = word_tokenize(input_str)
    for word in input_str:
        print(stemmer.stem(word))

    Output:

    There are sever type of stem algorithm.

     

    7. Lemmatization

     

    Like stemming, lemmatization reduces inflected forms to a common base form, but it does not simply chop off endings. Instead it uses lexical knowledge bases to obtain the correct base form of each word.

    Python code:

    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    lemmatizer = WordNetLemmatizer()
    input_str = "been had done languages cities mice"
    input_str = word_tokenize(input_str)
    for word in input_str:
        print(lemmatizer.lemmatize(word))

    Output:

     

    be have do language city mouse
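
    Note that WordNetLemmatizer treats every token as a noun unless told otherwise, so by default only the noun forms (languages, cities, mice) are reduced; to obtain the verb forms shown above, the part of speech usually has to be passed explicitly, for example:

    print(lemmatizer.lemmatize('been', pos='v'))  # be
    print(lemmatizer.lemmatize('had', pos='v'))   # have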

    8. Part-of-speech tagging

    Part-of-speech tagging aims to assign a part of speech (noun, verb, adjective, etc.) to each word of a given text based on its definition and its context.

    Python code:

    input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
    from textblob import TextBlob
    result = TextBlob(input_str)
    print(result.tags)

    Output:

    [('Parts', u'NNS'), ('of', u'IN'), ('speech', u'NN'), ('examples', u'NNS'), ('an', u'DT'), ('article', u'NN'), ('to', u'TO'), ('write', u'VB'), ('interesting', u'VBG'), ('easily', u'RB'), ('and', u'CC'), ('of', u'IN')]

     

    9. Chunking: noun groups or phrases, verb groups, etc.

    Chunking is a natural-language processing step that identifies the constituents of a sentence (nouns, verbs, adjectives, etc.) and links them into larger, sequentially ordered units that carry discrete grammatical meaning (noun groups or phrases, verb groups, etc.).

    Python code:

    input_str = "A black television and a white stove were bought for the new apartment of John."
    from textblob import TextBlob
    result = TextBlob(input_str)
    print(result.tags)

    Output:

    [('A', u'DT'), ('black', u'JJ'), ('television', u'NN'), ('and', u'CC'), ('a', u'DT'), ('white', u'JJ'), ('stove', u'NN'), ('were', u'VBD'), ('bought', u'VBN'), ('for', u'IN'), ('the', u'DT'), ('new', u'JJ'), ('apartment', u'NN'), ('of', u'IN'), ('John', u'NNP')]

    Python code:

    import nltk
    reg_exp = "NP: {<DT>?<JJ>*<NN>}"
    rp = nltk.RegexpParser(reg_exp)
    result = rp.parse(result.tags)
    print(result)

    Output:

    (S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

     

    10. Named entity recognition

    Named entity recognition (NER) aims to find named entities in text and classify them into predefined categories (person names, locations, organizations, times, etc.).

    Python code:

    from nltk import word_tokenize, pos_tag, ne_chunk
    input_str = "Bill works for Apple so he went to Boston for a conference."
    print(ne_chunk(pos_tag(word_tokenize(input_str))))

    Output:

    (S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
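
    As with the stop-word example, the NLTK resources behind pos_tag and ne_chunk have to be present; if they are missing, they can be downloaded once with nltk.download('averaged_perceptron_tagger'), nltk.download('maxent_ne_chunker') and nltk.download('words').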

     

    11. Collocation extraction

    Collocations are word combinations that frequently occur together, for example "break the rules," "free time," "draw a conclusion," "keep in mind," "get ready," and so on.

    Python code:

    input = ["he and Chazz duel with all keys on the line."]
    from ICE import CollocationExtractor
    extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
    print(extractor.get_collocations_of_length(input, length=3))

    Output:

    ["on the line"]
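
    The ICE collocation extractor used above needs an external service key. If only NLTK is available, a rough sketch of frequency/PMI-based collocation extraction (my own substitution, not the method from the excerpt) could look like this:

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.tokenize import word_tokenize

    text = "he and Chazz duel with all keys on the line ."
    tokens = word_tokenize(text.lower())

    # Rank bigrams by pointwise mutual information; on a real corpus a
    # frequency filter (finder.apply_freq_filter) keeps only repeated pairs.
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    print(finder.nbest(bigram_measures.pmi, 5))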


    This solution mainly uses the lightgbm library: it merges the two ATEC training CSV files, segments the sentence pairs with jieba, trains a word2vec model on the segmented corpus, builds pair-level features (word2vec sentence vectors and their cosine similarity, Jaccard overlap, a tf-idf cosine, and length features), and finally trains a LightGBM binary classifier whose thresholded probabilities are written to the output file.

    
    # coding: utf-8
    
    # In[1]:
    
    
    # -*- coding: utf-8 -*-
    import sys
    stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr 
    reload(sys)
    sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde 
    # 合并两个csv文件到
    
    
    def combine(combine_file, filename1, filename2):
        f = open("true_value.txt",'w')
        len_merge_sum = 0
        with open(combine_file, 'w') as fout:
            with open(filename1, 'r') as f1:
                for eachLine in f1:
                    lineno, sen1, sen2, label = eachLine.strip().split('\t')
                    fout.write(lineno + '\t' + sen1 + '\t' + sen2 + '\t' + label + '\n')
                    if int(label) == 1:
                        f.write(sen1 + '\t' + sen2 + '\t' + label + '\n')
                    len_merge_sum += 1
            with open(filename2, 'r') as f1:
                for eachLine in f1:
                    lineno, sen1, sen2, label = eachLine.strip().split('\t')
                    fout.write(lineno + '\t' + sen1 + '\t' + sen2 + '\t' + label + '\n')
                    if int(label) == 1:
                        f.write(sen1 + '\t' + sen2 + '\t' + label + '\n')
                    len_merge_sum += 1
        fout.close()
        f.close()
        return combine_file, len_merge_sum
    
    
    # In[2]:
    
    
    # -*- coding: utf-8 -*-
    from gensim.models import word2vec
    import pandas as pd
    import numpy as np
    import sys
    import time
    import re
    import jieba
    import io
    stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr 
    reload(sys)
    sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde 
    sys.setdefaultencoding('utf-8')
    
    
    def process_simi_stop(simiwords, stopwords, line):
        for word, subword in simiwords.iteritems():
            if word in line:
                # print line
                #line = re.sub(word, subword, line)
                line = line.replace(word,subword)
                # print subword
        words1 = [w for w in jieba.cut(line) if w.strip()]
        word1 = []
        for i in words1:
            if i not in stopwords:
                word1.append(i)
        return word1,line
    
    
    def splitSentence(inputFile, inpath, segment, submit):
        print u'分词开始!', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
        start = time.clock()
        jieba.load_userdict("jieba_dict.txt")
        corpus = []
    
        simiwords = {}
        with io.open("simiwords.txt", encoding='utf-8') as fr:
            for line in fr:
                words = re.split(",", line.strip())
                simiwords[words[0]] = words[1]
    
        stopwords = []  # 停用词
        fstop = open('chinese_stopwords.txt', 'r')
        for eachWord in fstop:
            stopwords.append(eachWord.strip())
            # print eachWord.strip()
    
        fin = open(inputFile, 'r')  # 以读的方式打开文件 inputFile
        fin1 = open('sentences_1.txt','w')
        for eachLine in fin:
            # line = eachLine.strip()  # 去除每行首尾可能出现的空格
            # line = re.sub("[0-9\s+\.\!\/_,$%^*()?;;:-【】+\"\']+|[+——!,;:。?、~@#¥%……&*()]+", "", eachLine)
            eachLine = re.sub("\*", " ", eachLine)
            # jieba.del_word('年')
            lineno, sen1, sen2, label = eachLine.strip().split('\t')
            word1,sen_1 = process_simi_stop(simiwords, stopwords, sen1)
            word2,sen_2 = process_simi_stop(simiwords, stopwords, sen2)
            fin1.write(sen_1)
            fin1.write("\n")
            fin1.write(sen_2)
            fin1.write("\n")
    #         sen_11 = ' '.join(sen_1.decode('utf8'))
    #         sen_12 = sen_11.split(" ")
    #         for s in sen_12:
    #             if s != sen_12[-1]:
    #                 fin1.write(s+" ")
    #             else:
    #                 fin1.write(s)
    #         fin1.write('\n')
    #         sen_21 = ' '.join(sen_2.decode('utf8'))
    #         sen_22 = sen_21.split(" ") 
    #         for s in sen_22:
    #             if s != sen_22[-1]:
    #                 fin1.write(s+" ")
    #             else:
    #                 fin1.write(s)
    #         fin1.write('\n')
            corpus.append(word1)
            corpus.append(word2)
        print len(corpus)
        with open(inpath, 'r') as fin2:  # inpath
            for eachLine in fin2:
                eachLine = re.sub("\*", " ", eachLine)
                if submit:
                    lineno, sen1, sen2 = eachLine.strip().split('\t')
                    #print "ceshijieshisha:", sen1, sen2
                else:
                    lineno, sen1, sen2, label = eachLine.strip().split('\t')   # 无label
                    #print "ceshijieshisha:", sen1, sen2
                word1,sen_1 = process_simi_stop(simiwords, stopwords, sen1)
                word2,sen_2 = process_simi_stop(simiwords, stopwords, sen2)
                fin1.write(sen_1)
                fin1.write("\n")
                fin1.write(sen_2)
                fin1.write("\n")
    #             sen_11 = ' '.join(sen_1.decode('utf8'))
    #             sen_12 = sen_11.split(" ")
    #             for s in sen_12:
    #                 if s != sen_12[-1]:
    #                     fin1.write(s+" ")
    #                 else:
    #                     fin1.write(s)
    #             fin1.write('\n')
    #             sen_21 = ' '.join(sen_2.decode('utf8'))
    #             sen_22 = sen_21.split(" ") 
    #             for s in sen_22:
    #                 if s != sen_22[-1]:
    #                     fin1.write(s+" ")
    #                 else:
    #                     fin1.write(s)
    #             fin1.write('\n')
                corpus.append(word1)
                corpus.append(word2)
        print len(corpus)  # 204954
        fin1.close()
        with open(segment, 'w') as fs:
            for word in corpus:
                # print type(word)
                for w in word:
                    # print w
                    fs.write(w)  # 将分词好的结果写入到输出文件
                    fs.writelines(' ')
                fs.write('\n')
        end = time.clock()
        print u'分词实际用时:', end - start
        return corpus
    
    
    def filter_word_in_model(model, filename):
        a = []
        with open(filename, 'r') as file_to_read:
            for line in file_to_read:
                if True:
                    if not line:
                        break
                    a.append(line)
        sentences = []  # 读sentences 里面的词
        for i in range(len(a)):
            b = a[i].strip().split()
            sentences.append(b)
        print 'sentences length:', len(sentences)
        new_sentences = []  # 完成获取模型训练,剩余含有词向量序列
        for i in range(len(sentences)):
            new_sentence = []
            for j in range(len(sentences[i])):
                if sentences[i][j].decode('utf8') in model:
                    new_sentence.append(sentences[i][j])
            new_sentences.append(new_sentence)
        print 'new_sentences length: ', len(new_sentences)
        # print(np.array(new_sentences).shape)
        print u'new_sentences,用时', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
        with open('new_sentences.txt', 'w') as fs:  # 写入new_sentences
            for word in new_sentences:
                for w in word:
                    fs.write(w)  # 将分词好的结果写入到输出文件
                    fs.writelines(' ')
                fs.write('\n')
        return new_sentences
    
    
    def eval_file(label1, pre):
        tp, tn, fp, fn = 0.0000001, 0.0000001, 0.0000001, 0.00000001
        for la, pr in zip(label1, pre):
            if la == 1 and pr == 1:
                tp += 1
            elif la == 1 and pr == 0:
                fn += 1
            elif la == 0 and pr == 0:
                tn += 1
            elif la == 0 and pr == 1:
                fp += 1
        recall = float(tp)/float(tp+fn)
        precision = float(tp)/float(tp+fp)
        f11 = 2*recall*precision/(recall+precision)
        return f11
    
    
    def cos_Vector(x, y):  # 用cos求夹角
        if len(x) != len(y):
            print u'error input,x and y is not in the same space'
            return
        x = np.array(x)
        y = np.array(y)
        num = (x * y.T)
        num = float(num.sum())
        if num == 0:
            return 0
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        if denom == 0:
            return 0
        cos = num / denom  # 余弦值
        sim = 0.5 + 0.5 * cos  # 归一化
        return sim
    
    
    def vec_minus(x, y):  # 相减
        if len(x) != len(y):
            print u'error input,x and y is not in the same space'
            return
        x = np.array(x)
        y = np.array(y)
        sim = abs(x-y)
        return sim
    
    
    def vec_multi(x, y):  # 相乘
        if len(x) != len(y):
            print u'error input,x and y is not in the same space'
            return
        x = np.array(x)
        y = np.array(y)
        sim1 = x * y
        return sim1
    
    
    def calEuclideanDistance(x, y):
        if len(x) != len(y):
            print u'error input,x and y is not in the same space'
            return
        dist = np.sqrt(np.sum(np.square(x - y)))
        return dist
    
    #相同数组中相同元素的数量统计
    def cal_jaccard(list1, list2):
        set1 = set(list1)
        set2 = set(list2)
        avg_len = (len(set1) + len(set2)) / 2
        min_len = min(len(set1), len(set2))
        # return len(set1 & set2) * 1.0 / (len(set1) + len(set2) - len(set1 & set2))
        if min_len == 0:
            return 0
        else:
            return len(set1 & set2) * 1.0 / min_len
    
    
    def zishu(X,useStatus):  
        if useStatus:  
            return 1.0 / (1 + np.exp(-(X)));  
        else:  
            return (X);  
    
    
    # In[3]:
    
    
    def fenge(input_file,out_file,out_file1):
        f = open(input_file,'r')
        f_1 = open(out_file,'w')
        f_2 = open(out_file1,'w')
        lines = f.readlines()
        Row = len(lines)
        D = int(Row*0.85)
        for i in range(Row):
            if i < D:
                lineno, sen1, sen2, label = lines[i].strip().split('\t')
                f_1.write(lineno + '\t' + sen1 + '\t' + sen2 + '\t' + label + '\n')
            else:
                lineno, sen1, sen2, label = lines[i].strip().split('\t')
                f_2.write(lineno + '\t' + sen1 + '\t' + sen2 + '\t' + label + '\n')
        f.close()
        f_1.close()
        f_2.close()
        return D
    def feature_extraction(new_sentences,model,size):
        vec_titles = []  # 获取句子的向量
        for val in range(len(new_sentences)):
            vec = np.zeros(shape=(1, size))
            for i in range(len(new_sentences[val])):
                vec += model[new_sentences[val][i].decode('utf8')]
            if len(new_sentences[val]):
                vec = vec/len(new_sentences[val])
            vec_titles.append(vec)
        return vec_titles
                
    
    
    # In[4]:
    
    
    '''
    import threading
    import time
    gl_num = 0
    vec_titles = []
    lock = threading.RLock()
    def feature_extraction(new_sentences,model,size,val_start):
        lock.acquire()
        global vec_titles
        for val in range(val_start,len(new_sentences),4):
            vec = np.zeros(shape=(1, size))
            for i in range(len(new_sentences[val])):
                vec += model[new_sentences[val][i].decode('utf8')]
            if len(new_sentences[val]):
                vec = vec/len(new_sentences[val])
            vec_titles.append(vec)
        time.sleep(1)
        print len(vec_titles)
        lock.release()
    thread_list = [] 
    for i in range(4):
        t = threading.Thread(target=feature_extraction,args = (new_sentences,model,size,i))
        thread_list.append(t)
    for t in thread_list:
        t.start()
    '''
    
    
    # In[5]:
    
    
    # import jieba.analyse
    # def get_keyword(model, num_keywords, new_sentences):
    #     # 获取关键词
    #     content = open('new_sentences.txt', 'rb').read()
    #     jieba.analyse.set_stop_words('chinese_stopwords.txt')
    #     keywords = jieba.analyse.extract_tags(content, topK=num_keywords, withWeight=False, allowPOS=())
    #     print u'keywords长度:', len(keywords)
    
    #     # 获取在模型中的关键词
    #     keywords_in_model = []
    #     for i in range(len(keywords)):
    #         if keywords[i].decode('utf8') in model:
    #             keywords_in_model.append(keywords[i])
    #     print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    
    #     # 计算每一个keywords中的一个词,与句子中所有词语最大的值
    #     keywords_indexes = []  # 20w * 2k
    #     for i in range(len(new_sentences)):
    #         keywords_million_value = []
    #         for val in range(num_keywords):  # key_words 2k
    #             similar_values = []
    #             for j in range(len(new_sentences[i])):  # 每一个title里的词
    #                 try:
    #                     value = model.similarity(new_sentences[i][j].decode('utf-8'), keywords_in_model[val].decode('utf-8'))
    #                     similar_values.append(max(value, 0))
    #                 except:
    #                     print new_sentences[i][j]
    #             try:
    #                 keywords_one_value = max(similar_values)  # 得到第一个句子与第一词的相似度最大值
    #             except:
    #                 keywords_one_value = 0
    #                 print i
    #             keywords_million_value.append(keywords_one_value)  # 1w个
    #         keywords_indexes.append(keywords_million_value)
    #     print np.array(keywords_indexes).shape
    #     # 每个标题 1w维向量!
    #     np.save("train_data_similar_vec.npy", keywords_indexes)
    #     print u'词袋生成完毕:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    #     return keywords_indexes
    # keywords_indexes = get_keyword(model,80,new_sentences)
    
    
    # In[6]:
    
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer  
    def get_tiidf_vec(filename):
        # 把new_sentences 写成 corpus要求的类型
        corpus = [' '.join(a) for a in filename]
        stopword = [u' ']
        #vectorizer = CountVectorizer(min_df=0,stop_words=stopword,token_pattern='(?u)\\b\\w+\\b')  # 词频矩阵,矩阵元素a[i][j] 表示j词在i类文本下的词频
        #vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
        vectorizer = CountVectorizer(min_df=0,token_pattern=r"(?u)\W{1}|(?u)\b\w+\b")
        result = vectorizer.fit_transform(corpus)  # 文本转为词频矩阵
        transformer = TfidfTransformer()  # 统计每个词语的tf-idf权值
        tfidf = transformer.fit_transform(result)  # fit_transform是计算tf-idf
        vecs = []  # 每一个值的tfidf值
    #     hz = result.toarray()
        weight=tfidf.toarray()
        word=vectorizer.get_feature_names()#获取词袋模型中的所有词语
        print len(word)
    #     print hz.shape
        print weight.shape 
       # return word,weight
    #     for i in range(len(weight)):#打印每类文本的tf-idf词语权重,第一个for遍历所有文本,第二个for便利某一类文本下的词语权重  
    #         print u"-------这里输出第",i,u"类文本的词语tf-idf权重------"  
    #         for j in range(len(word)):  
    #             print word[j],weight[i][j] 
        tfidf_cos = []
        hz_cos = []
        jk_cos = []
        for i in range(weight.shape[0]/2):
    #         numerator = np.sum(np.min(hz[2*i:2*i+2,:], axis=0))
    #         denominator = np.sum(np.max(hz[2*i:2*i+2,:], axis=0))
            value = np.dot(weight[2*i], weight[2*i+1])
    #         value_hz = np.dot(hz[2*i], hz[2*i+1]) / (norm(hz[2*i]) * norm(hz[2*i+1]))
            value = 0.5 + 0.5 * value
    #         value_hz = 0.5 + 0.5 * value_hz
            tfidf_cos.append([value])
    #         hz_cos.append([value_hz])
    #         jk_cos.append([1.0 * numerator / denominator])
        return tfidf_cos,hz_cos,jk_cos
    # word_list,tfidf = get_tiidf_vec(new_sentences)
    
    
    # In[5]:
    
    
    # # -*- coding: utf-8 -*-
    # import numpy as np
    # import sys
    # import time
    # from gensim.models import word2vec
    # import lightgbm as lgb
    # from sklearn.model_selection import train_test_split   # 随机分割
    # from scipy.linalg import norm
    # # import bm25
    # stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr 
    # reload(sys)
    # sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde 
    # sys.setdefaultencoding('utf-8')
    # # 合并两个csv文件到
    # filename1 = 'atec_nlp_sim_train.csv'
    # filename2 = 'atec_nlp_sim_train_add.csv'
    # combine_file, len_merge_sum = combine('merge_sum.csv', filename1, filename2)
    # if __name__ == '__main__':
    #     SUBMIT = False
    #     if SUBMIT:
    #         inpath, outpath = sys.argv[1], sys.argv[2]
    #         testpath = combine_file
    #         test_num = len_merge_sum
    #     else:
    #         num = fenge("merge_sum.csv","merge_train.csv","merge_test.csv")
    #         inpath, outpath = 'merge_test.csv', 'output.csv'
    #         testpath = 'merge_train.csv'
    #         #test_num = 92228
    #         test_num = 87105
    #         # inpath, outpath = 'empty.csv', 'output.csv'
    #         #         # testpath = 'merge_sum.csv'
    #         #         # test_num = 92228
    #     filename = 'sentences.txt'
    #     corpus = splitSentence(testpath, inpath, filename, SUBMIT)  # jieba 分词
    
    #     print u'语料corpus生成完毕:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    
    # # # 训练词向量模型
    
    #     sentences = word2vec.Text8Corpus('sentences.txt')
    #     model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
    #     model.save('result')  # save
    #     size = 100  # model_train size
    #     print u'词向量训练完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
    #     # 导入model
    #     model = word2vec.Word2Vec.load('result')
    #     new_sentences = filter_word_in_model(model, filename)
        
        
    #     print "开始计算特征向量"
    #     tfidf_cos,hz_cos,jk_cos = get_tiidf_vec(new_sentences)
    # #     feature_2 = np.hstack((tfidf_cos,hz_cos,jk_cos))
    #     vec_titles = []  # 获取句子的向量
    #     max_titles = []
    
    #     for val in range(len(new_sentences)):
    #         vec = np.zeros(shape=(1, size))
    #         mat = np.zeros(shape=(30, size))
    #         for i in range(len(new_sentences[val])):
    #             print len(new_sentences[val])
    #             if i < 30:
    #                 vec += model[new_sentences[val][i].decode('utf8')]
    #                 mat[i] = model[new_sentences[val][i].decode('utf8')]
    #         if len(new_sentences[val]):
    #             vec = vec/len(new_sentences[val])
    #         vec_titles.append(vec)
    #         max_titles.append(mat)
    #     print "计算特征向量完毕"
    
    #     #vec_titles = feature_extraction(new_sentences,model,size)
    #     vec_titles = list(map(lambda x: x[0], vec_titles))  # 去掉外部的[], 获得title 的向量形式
    #     print(np.array(max_titles).shape)
    #     np.save("train_data_title_vec.npy", vec_titles)
    #     np.save("train_data_title_max.npy", max_titles)
    #     print u'生成train_data_title_vec完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出时间
    #     trains = np.load('train_data_title_max.npy')
    
    
    # In[7]:
    
    
    # # -*- coding: utf-8 -*-
    # import numpy as np
    # import sys
    # import time
    # from gensim.models import word2vec
    # import lightgbm as lgb
    # from sklearn.model_selection import train_test_split   # 随机分割
    # from scipy.linalg import norm
    
    # # from keras.datasets import mnist  
    # # from keras.models import Sequential  
    # # from keras.layers import Dense, Dropout, Activation, Flatten  
    # # from keras.layers import Convolution2D, MaxPooling2D  
    # # from keras.utils import np_utils  
    # # from keras import backend as K  
    # # import bm25
    # stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr 
    # reload(sys)
    # sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde 
    # sys.setdefaultencoding('utf-8')
    # # 合并两个csv文件到
    # filename1 = 'atec_nlp_sim_train.csv'
    # filename2 = 'atec_nlp_sim_train_add.csv'
    # combine_file, len_merge_sum = combine('merge_sum.csv', filename1, filename2)
    # if __name__ == '__main__':
    #     SUBMIT = False
    #     if SUBMIT:
    #         inpath, outpath = sys.argv[1], sys.argv[2]
    #         testpath = combine_file
    #         test_num = len_merge_sum
    #     else:
    #         num = fenge("merge_sum.csv","merge_train.csv","merge_test.csv")
    #         inpath, outpath = 'merge_test.csv', 'output.csv'
    #         testpath = 'merge_train.csv'
    #         #test_num = 92228
    #         test_num = 87105
    #         # inpath, outpath = 'empty.csv', 'output.csv'
    #         #         # testpath = 'merge_sum.csv'
    #         #         # test_num = 92228
    #     filename = 'sentences.txt'
    #     corpus = splitSentence(testpath, inpath, filename, SUBMIT)  # jieba 分词
    
    #     print u'语料corpus生成完毕:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    
    # # # 训练词向量模型
    
    #     sentences = word2vec.Text8Corpus('sentences.txt')
    #     model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
    #     model.save('result')  # save
    #     size = 100  # model_train size
    #     print u'词向量训练完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
    #     # 导入model
    #     model = word2vec.Word2Vec.load('result')
    #     new_sentences = filter_word_in_model(model, filename)
        
        
    #     print "开始计算特征向量"
    # #     tfidf_cos,hz_cos,jk_cos = get_tiidf_vec(new_sentences)
    # #     feature_2 = np.hstack((tfidf_cos,hz_cos,jk_cos))
    #     vec_titles = []  # 获取句子的向量
    #     max_titles = []
    
    #     for val in range(len(new_sentences)/2):
    #         mat = np.zeros(shape=(50, size))
    #         for i in range(len(new_sentences[2*val])):
    #             if i < 25:
    #                 mat[i] = model[new_sentences[2*val][i].decode('utf8')]
    #         for i in range(len(new_sentences[2*val+1])):
    #             if i < 25:
    #                 mat[i+25] = model[new_sentences[2*val+1][i].decode('utf8')]        
    #         max_titles.append(mat)
    #     print "计算特征向量完毕"
    
    #     #vec_titles = feature_extraction(new_sentences,model,size)
    # #     vec_titles = list(map(lambda x: x[0], vec_titles))  # 去掉外部的[], 获得title 的向量形式
    #     print (np.array(max_titles).shape)
    #     np.save("train_data_title_max.npy", max_titles)
    #     print u'生成train_data_title_vec完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出时间
    #     trains = np.load('train_data_title_max.npy')    
    # #     nb_filters = 28  
    # #     # size of pooling area for max pooling  
    # #     pool_size = (2, 2)  
    # #     # convolution kernel size  
    # #     kernel_size = (3, 100)  
    # #     input_shape = (img_rows, img_cols, 1)  
    # #     model = Sequential()  
    # # """ 
    # # model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1], 
    # #                         border_mode='same', 
    # #                         input_shape=input_shape)) 
    # # """  
    # #     model.add(Convolution2D(nb_filters, (kernel_size[0], kernel_size[1]),  
    # #                         padding='same',  
    # #                         input_shape=input_shape)) # 卷积层1  
    # #     model.add(Activation('relu')) #激活层  
    # #     model.add(Convolution2D(nb_filters, (kernel_size[0], kernel_size[1]))) #卷积层2  
    # #     model.add(Activation('relu')) #激活层  
    # #     model.add(MaxPooling2D(pool_size=pool_size)) #池化层  
    # #     model.add(Dropout(0.25)) #神经元随机失活  
    # #     model.add(Flatten()) #拉成一维数据  
    # #     model.add(Dense(128)) #全连接层1  
    # #     model.add(Activation('relu')) #激活层  
    # #     model.add(Dropout(0.5)) #随机失活  
    # #     model.add(Dense(nb_classes)) #全连接层2  
    # #     model.add(Activation('softmax')) #Softmax评分  
      
    # #     #编译模型  
    # #     model.compile(loss='categorical_crossentropy',  
    # #               optimizer='adadelta',  
    # #               metrics=['accuracy'])  
    # #     #训练模型  
    # #     model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs,  
    # #           verbose=1, validation_data=(X_test, Y_test))  
    
    
    # In[8]:
    
    
    # print trains[0]
    
    
    # In[28]:
    
    
    # -*- coding: utf-8 -*-
    import numpy as np
    import sys
    import time
    from gensim.models import word2vec
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split   # 随机分割
    from scipy.linalg import norm
    # import bm25
    stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr 
    reload(sys)
    sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde 
    sys.setdefaultencoding('utf-8')
    # 合并两个csv文件到
    filename1 = 'atec_nlp_sim_train.csv'
    filename2 = 'atec_nlp_sim_train_add.csv'
    combine_file, len_merge_sum = combine('merge_sum.csv', filename1, filename2)
    if __name__ == '__main__':
        SUBMIT = True
        if SUBMIT:
            inpath, outpath = sys.argv[1], sys.argv[2]
            testpath = combine_file
            test_num = len_merge_sum
        else:
            num = fenge("merge_sum.csv","merge_train.csv","merge_test.csv")
            inpath, outpath = 'merge_test.csv', 'output.csv'
            testpath = 'merge_train.csv'
            #test_num = 92228
            test_num = 87105
            # inpath, outpath = 'empty.csv', 'output.csv'
            #         # testpath = 'merge_sum.csv'
            #         # test_num = 92228
        filename = 'sentences.txt'
        corpus = splitSentence(testpath, inpath, filename, SUBMIT)  # jieba 分词
    
        print u'语料corpus生成完毕:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    
    # # 训练词向量模型
    
        sentences = word2vec.Text8Corpus('sentences.txt')
        model = word2vec.Word2Vec(sentences, sg=1, size=120, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)
        model.save('result')  # save
        size = 120  # model_train size
        print u'词向量训练完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
        # 导入model
        model = word2vec.Word2Vec.load('result')
        new_sentences = filter_word_in_model(model, filename)
        
        
        print "开始计算特征向量"
        tfidf_cos,hz_cos,jk_cos = get_tiidf_vec(new_sentences)
    #     feature_2 = np.hstack((tfidf_cos,hz_cos,jk_cos))
        vec_titles = []  # 获取句子的向量
        value_eig = []
        v_max = []
        v_v = [0,0,0]
        for val in range(len(new_sentences)):
            vec = np.zeros(shape=(1, size))
    #         mat = np.zeros(shape=(len(new_sentences[val]), size))
            for i in range(len(new_sentences[val])):
                vec += model[new_sentences[val][i].decode('utf8')]
    #             mat[i] = model[new_sentences[val][i].decode('utf8')]
            if len(new_sentences[val]):
                vec = vec/len(new_sentences[val])
    #             a,b,c = np.linalg.svd(mat)
    #             b_l = list(b)
    #             b_s = sorted(b_l)
    #             b_1 = np.max(b)
    #         v_max.append(b_s[-1])
    #         if len(b_s) < 2:
    #             v_max.append(0)
    #         else:
    #             v_max.append(b_s[-2])
    #         if len(b_s) < 3:
    #             v_max.append(0)
    #         else:
    #             v_max.append(b_s[-3])
            vec_titles.append(vec)
    #         if len(v_max) == 6:
    #             v_v[0] = abs(v_max[3]-v_max[0])
    #             v_v[1] = abs(v_max[4]-v_max[1])
    #             v_v[2] = abs(v_max[5]-v_max[2])
    #             value_eig.append(v_v)
    #             v_max = []
        print "计算特征向量完毕"
    
    #     vec_titles = []
    #     for val in range(len(new_sentences)):
    #         vec = np.zeros(shape=(len(new_sentences[val]), size))
    #         for i in range(len(new_sentences[val])):
    #             vec[i] = model[new_sentences[val][i].decode('utf8')]
    #         if len(new_sentences[val]):
    #             V = np.dot(vec.transpose(),vec)
    #             a,b = np.linalg.eig(V)
    #         vec1 = np.zeros(shape = (size,1))
    #         for k in range(a.shape[0]):
    #             if a[k] > 0.1:
    #                 vec1 = vec1+(a[k]*b[:,k]).reshape(size,1)
    #         print len(vec_titles)
    #         vec_titles.append(abs(vec1).reshape(1,size))
    
    
        #vec_titles = feature_extraction(new_sentences,model,size)
        vec_titles = list(map(lambda x: x[0], vec_titles))  # 去掉外部的[], 获得title 的向量形式
        print(np.array(vec_titles).shape)
        np.save("train_data_title_vec.npy", vec_titles)
        print u'生成train_data_title_vec完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出时间
        trains = np.load('train_data_title_vec.npy')
        # bm_score = np.load('bm_score.npy')
        # size = 100
        new_sentences = []
        with open('new_sentences.txt', 'r') as f:
            for eachLine in f:
                word = eachLine.decode('utf8').strip().split(' ')
                new_sentences.append(word)
    # 一维特征
        # 6  杰拉德距离
            J_dist = []
            for val in range(len(new_sentences) / 2):
                j = cal_jaccard(new_sentences[2 * val], new_sentences[2 * val + 1])
                J_dist.append(j)
        #  1向量夹角,2向量距离(最小) 3 bm_score
    
        juzi = []
        f1 = open('sentences_1.txt','r')
        for eachLine in f1:
            word = eachLine.decode('utf8').strip()
            juzi.append(list(word))
        distance = []
        for i in range(len(juzi)/2):
            j = cal_jaccard(juzi[2 * i], juzi[2 * i + 1])
            distance.append([j])
        f1.close()    
        
        cos_val = []
        E_Dist = []
        print 'train', len(trains)
        for i in range(len(trains) / 2):
            score1 = cos_Vector(trains[2 * i], trains[2 * i + 1])
            cos_val += [score1]
            score2 = calEuclideanDistance(trains[2 * i], trains[2 * i + 1])
            E_Dist += [score2]
        #添加一个二维特征
        combine_feature1 = np.vstack((cos_val, J_dist)).transpose()  # 2个
        np.save("cos_val.npy", cos_val), np.save('E_Dist.npy', E_Dist)
        # 4,5 各自的长度(2维),6 长度差异(删)
        len_ = []
        dif_length = []
        for val in range(len(new_sentences) / 2):
            a = [0.1*len(new_sentences[2 * val]), 0.1*len(new_sentences[2 * val + 1])]
            #长度特征
            len_.append(a)
            b = abs(len(new_sentences[2 * val]) - len(new_sentences[2 * val + 1]))
            #长度差特征
            dif_length.append(b)
        # combine_feature2 = np.vstack((np.transpose(len_), J_dist)).transpose()  # 3个
        # combine_feature = np.hstack((combine_feature1, combine_feature2))  # 5个
        # combine_feature2 = len_ #二维特征
        combine_feature3 = tfidf_cos
        combine_feature4 = distance
        # combine_feature = np.vstack((np.transpose(combine_feature), J_dist)).transpose()  # 6个
        combine_feature = np.hstack((combine_feature1, combine_feature3, combine_feature4))  # 4个
        print combine_feature.shape
        # 编辑距离
        dim_1_num = 4  # 一维特征列数
        # dim_1_num = 6  # 一维特征列数
        print 'combine_feature length:', len(combine_feature)
    # 产生新特征(size*num 维度)
        feature1_val = []
        feature2_val = []
        feature3_val = []
        feature4_val = []
        feature5_val = []
        feature6_val = []
        print 'train', len(trains)
        #train就是句子向量对应的矩阵
        for i in range(len(trains)/2):
            vec1 = trains[2*i]
            vec2 = trains[2*i+1]
            feature1_val.append(vec1)
            feature2_val.append(vec2)
        for i in range(len(trains)/2):
            vec3 = vec_minus(trains[2*i], trains[2*i+1])
            vec4 = vec_multi(trains[2*i], trains[2*i+1])
            feature3_val.append(vec3)
            feature4_val.append(vec4)
        print 'feature1_val length:', len(feature3_val)
        # feature_val = np.hstack((feature1_val, feature2_val, feature3_val, feature4_val, combine_feature))  # 400列///406
        #合成总特征feature_val
        feature_val = np.hstack((feature1_val, feature2_val,combine_feature))  # 300列///406
        #feature_val = combine_feature
        np.save("feature_val.npy", feature_val)
        # feature_num = 4
        feature_num = 2 #几个feature1_val向量组成的特征
        print u'特征生成完毕'
    
        y_true = []
        with open(testpath, 'r') as f, open(inpath, 'r') as fin:
            for line in f.readlines():
                pair_id, sen1, sen2, label = line.strip().split('\t')
                label = int(label)
                y_true += [label]
        np.save('y_true.npy', y_true)
        print 'y_true length', len(y_true)
    
        # 读取数据
        y_true = np.load('y_true.npy')
        # bm_score = np.load('bm_score.npy')
        # bm_score_train = bm_score[:test_num]
        # print 'bm_score_train length', len(bm_score_train)
        # bm_score_test = bm_score[test_num:]
        # bm_score = pd.Series([bm_score], index=['bm_score'])
        cos_val = np.load('cos_val.npy')
        cos_val_train = cos_val[:test_num]
        cos_val_test = cos_val[test_num:]
        feature_val = np.load('feature_val.npy')
        feature_val_train = feature_val[:test_num]
        feature_val_test = feature_val[test_num:]
        #每一行对应一个训练集,最后一列是标号
        trains = np.vstack((np.transpose(feature_val_train), y_true)).transpose()
        print np.array(trains).shape
        print u"数据拆分"
        train, val = train_test_split(trains, test_size=0.2, random_state=21)
        print 'train length', len(train)
        print 'val length', len(val)
        print u"训练集"
        y = [train[i][feature_num*size+dim_1_num] for i in range(len(train))]  # 训练集标签
        X = [train[i][:feature_num*size+dim_1_num] for i in range(len(train))]  # 训练集特征矩阵
        print u"验证集"
        val_y = [val[i][feature_num*size+dim_1_num] for i in range(len(val))]  # 验证集标签
        val_X = [val[i][:feature_num*size+dim_1_num] for i in range(len(val))]  # 验证集特征矩阵
        print u"测试集"
        tests = feature_val_test
        # 数据转换
        lgb_train = lgb.Dataset(X, y, free_raw_data=False)
        lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train, free_raw_data=False)
        # 开始训练
        print u'设置参数'
        params = {
            'num_threads' : '4',
            'boosting_type': 'gbdt',
            'boosting': 'gbdt',
            'objective': 'binary',
            'metric': 'binary_logloss',
    
            'learning_rate': 0.1,
            'num_leaves': 25,
            'max_depth': 3,
    
            'max_bin': 10,
            'min_data_in_leaf': 8,
    
            'feature_fraction': 1,
            'bagging_fraction': 0.7,
            'bagging_freq': 5,
    
            'lambda_l1': 0,
            'lambda_l2': 0,
            'min_split_gain': 0
            
        }
        print u"开始训练"
        gbm = lgb.train(params,  # 参数字典
                        lgb_train,  # 训练集
                        num_boost_round=3000,  # 迭代次数
                        valid_sets=lgb_eval,  # 验证集
                        early_stopping_rounds=30)  # 早停系数
        # 保存模型
        from sklearn.externals import joblib
        joblib.dump(gbm, 'gbm.pkl')
        print u"预测,并输出在 outpath"
        preds_offline = gbm.predict(tests, num_iteration=gbm.best_iteration)  # 输出概率
        np.save('preds.npy', preds_offline)
    
        if not SUBMIT:
            N = 200
            score_best = 0
            with open('merge_test.csv', 'r') as f1:
                y_true_10 = []
                for eachLine in f1:
                    lineno, sen1, sen2, label = eachLine.strip().split('\t')
                    a = int(label)
                    y_true_10.append(a)
    
            for thred in range(1,N):  # 阈值的选取,如何找到最好的阈值
                thred = thred * (np.max(preds_offline) - np.min(preds_offline)) / N + np.min(preds_offline)
                pred = []
                for i in range(len(preds_offline)):
                    if preds_offline[i] > thred:
                        pred.append(1)
                    else:
                        pred.append(0)
                score = eval_file(y_true_10, pred)
                if score > score_best:
                    score_best = score
                    thred_best = thred
            print u'最优阈值:', thred_best
    
            for i in range(len(preds_offline)):
                if preds_offline[i] > thred_best:
                    preds_offline[i] = 1
                else:
                    preds_offline[i] = 0
            print len(preds_offline)
            f1_score = eval_file(y_true_10, preds_offline)
            print 'F1 score is :' + str(f1_score)
            fout = open(outpath,'w')
            for t in preds_offline:
                fout.write(str(t))
                fout.write('\n')
            fout.write('F1 score is :' + str(f1_score))
            fout.close()
        else:
            with open(inpath, 'r') as fin, open(outpath, 'w') as fout:
                line_id = []
                for line in fin:
                    lineno, sen1, sen2 = line.strip().split('\t')
                    line_id.append(lineno)
                for i in range(len(line_id)):
                    if preds_offline[i] >= 0.246:
                        fout.write(line_id[i] + '\t1\n')
                    else:
                        fout.write(line_id[i] + '\t0\n')
        print u'运行完毕', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))  # 输出当前时间
    
    

     


    CNN classification model architecture

    How it works

    Each convolution step of the CNN reads several character vectors at a time, which amounts to computing word-level features: the text is lifted from character-level features to higher-level word features.

    When a CNN is applied to text, the convolution layer can be viewed as a kind of n-gram operation, and each sentence can be treated much like an image.

    The convolution therefore extracts features from the context around each word.
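
    To make the n-gram analogy concrete, here is a minimal shape check (a sketch only, using the same hyperparameters as the config below and the TensorFlow 1.x API): a kernel of size 5 sliding over a 600-step sequence of 64-dimensional embeddings yields one 256-dimensional feature vector per 5-token window.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 600, 64])          # [batch, seq_len, embedding_dim]
    conv = tf.layers.conv1d(x, filters=256, kernel_size=5)   # 'valid' padding by default
    print(conv.shape)  # (?, 596, 256): one feature vector per 5-token window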

     

    Python implementation

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    import tensorflow as tf
    class TCNNConfig(object):
        """CNN configuration parameters"""
    
        embedding_dim = 64      # word embedding dimension
        seq_length = 600        # sequence length
        num_classes = 2         # number of classes
        num_filters = 256       # number of convolution filters
        kernel_size = 5         # convolution kernel size
        vocab_size = 5000       # vocabulary size
    
        hidden_dim = 128        # neurons in the fully connected layer
    
        dropout_keep_prob = 0.5 # dropout keep probability
        learning_rate = 1e-3    # learning rate
    
        batch_size = 128        # batch size
        num_epochs = 1          # total number of epochs
    
        print_per_batch = 10    # print results every N batches
        save_per_batch = 10     # write to tensorboard every N batches
    
    
    class TextCNN(object):
        """Text classification, CNN model"""
        def __init__(self, config):
            self.config = config
    
            # the three inputs to be fed
            self.input_x = tf.placeholder(tf.int32, [None, self.config.seq_length], name='input_x')
            self.input_y = tf.placeholder(tf.float32, [None, self.config.num_classes], name='input_y')
            self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
            self.cnn()
    
        def cnn(self):
            """CNN model"""
            # word embedding lookup
            with tf.device('/cpu:0'):
                embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
                embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)
    
            with tf.name_scope("cnn"):
                # CNN layer
                conv = tf.layers.conv1d(embedding_inputs, self.config.num_filters, self.config.kernel_size, name='conv')
                # global max pooling layer
                gmp = tf.reduce_max(conv, reduction_indices=[1], name='gmp')
    
            with tf.name_scope("score"):
                # fully connected layer followed by dropout and ReLU activation
                fc = tf.layers.dense(gmp, self.config.hidden_dim, name='fc1')
                fc = tf.contrib.layers.dropout(fc, self.keep_prob)
                fc = tf.nn.relu(fc)
    
                # classifier
                self.logits = tf.layers.dense(fc, self.config.num_classes, name='fc2')
                self.props = tf.nn.softmax(self.logits)
                self.y_pred_cls = tf.argmax(tf.nn.softmax(self.logits), 1)  # predicted class
    
            with tf.name_scope("optimize"):
                # loss function: cross entropy
                cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
                self.loss = tf.reduce_mean(cross_entropy)
                # optimizer
                self.optim = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate).minimize(self.loss)
    
            with tf.name_scope("accuracy"):
                # accuracy
                correct_pred = tf.equal(tf.argmax(self.input_y, 1), self.y_pred_cls)
                self.acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
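
    A minimal usage sketch (my own, assuming TensorFlow 1.x and random dummy data rather than the real preprocessing pipeline) for a single training step:

    import numpy as np

    config = TCNNConfig()
    model = TextCNN(config)

    # dummy token-id batch and one-hot labels, only to exercise the graph
    x_batch = np.random.randint(0, config.vocab_size, (config.batch_size, config.seq_length))
    y_batch = np.eye(config.num_classes)[np.random.randint(0, config.num_classes, config.batch_size)]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _, loss, acc = sess.run([model.optim, model.loss, model.acc],
                                feed_dict={model.input_x: x_batch,
                                           model.input_y: y_batch,
                                           model.keep_prob: config.dropout_keep_prob})
        print(loss, acc)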
    

     


    BLEU [1] was proposed by IBM in 2002 as an evaluation metric for machine translation.

    BLEU has many variants. Depending on the n-gram order it splits into several metrics; the common ones are BLEU-1, BLEU-2, BLEU-3, and BLEU-4, where the n-gram is a run of n consecutive words. BLEU-1 measures word-level accuracy, while higher-order BLEU scores measure sentence fluency.

    The core idea is precision. As a reminder, the precision formula (quoted in the original post from Zhou Zhihua's Machine Learning, the "watermelon book") is $P = \frac{TP}{TP + FP}$:
    precision is the fraction of the samples predicted positive that are true positives.

    Suppose we are given a reference translation and the sentence generated by the network is the candidate, of length n.
    The candidate plays the role of the predicted positives.
    If m words of the candidate appear in the reference, those words correspond to the true positives,
    so the 1-gram BLEU precision is simply m/n.

    改进的BLEU算法 — 分子截断计数

    在这里插入图片描述
    分子释义

    • Countclip=min(Count,Max_Ref_Count)Count_{clip}= min(Count, Max\_Ref\_Count)
    • min表示如果有必要,我们会截断每个单词的计数
    • Max_Ref_CountMax\_Ref\_Count表示n−gram在任何一个reference中出现的最大次数
    • CountCount表示n-gram在reference中出现的次数
    • 第一个求和符号统计的是所有的candidate,因为计算时可能有多个句子
    • 第二个求和符号统计的是一条candidate中所有的n−gram

    分母释义

    • 前两个求和符号和分子中的含义一样
    • Count(ngram)Count(n-gram')表示ngramn−gram′在candidate中的个数,分母是获得所有的candidate中n-gram的个数。

    For example:
    Candidate: the the the the the the the.
    Reference: The cat is on the mat.
    Modified unigram precision = 2/7
    Without the clipping operation the score would be 7/7, because every word of the candidate appears in the reference.
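
    The clipped count is easy to reproduce by hand; the sketch below (my own helper, not the NLTK implementation) computes the modified n-gram precision for this example:

    from collections import Counter

    def modified_precision(candidate, references, n=1):
        # n-gram counts in the candidate
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        clipped = 0
        for ngram, count in cand_ngrams.items():
            # clip by the maximum count of this n-gram in any single reference
            max_ref = max(
                sum(1 for i in range(len(ref) - n + 1) if tuple(ref[i:i + n]) == ngram)
                for ref in references
            )
            clipped += min(count, max_ref)
        return clipped / max(1, sum(cand_ngrams.values()))

    candidate = "the the the the the the the".split()
    references = ["the cat is on the mat".split()]
    print(modified_precision(candidate, references, n=1))  # 0.2857... = 2/7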

    A second example, with the actual numbers:
    candidate: the cat sat on the mat
    reference: the cat is on the mat
    The n-gram precisions are:
    $p_1 = 5/6 \approx 0.833$
    $p_2 = 3/5 = 0.6$
    $p_3 = 1/4 = 0.25$
    $p_4 = 0/3 = 0$

    Adding a brevity penalty for sentence length

    The original paper is terse here, so this part also draws on the Wikipedia article.
    Very short translations tend to receive high BLEU values, so a brevity penalty (BP) for short output is introduced.
    If we computed the brevity penalty sentence by sentence and averaged the penalties, length deviations on short sentences would be punished harshly.
    Instead, we compute the brevity penalty over the entire corpus to allow some freedom at the sentence level.

    Let r be the total length of the reference corpus and c the total length of the translation corpus. Then

    $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

    With multiple references, r is the sum of the lengths of the references that are closest in length to the candidate sentences.

    If there are three references of 12, 15, and 17 words and the candidate translation is 12 words long, we want BP to be 1.
    The length of the closest reference is called the "best match length".
    Summing the best match length of every candidate sentence in the corpus gives the test corpus' effective reference length r.

    However, in the version of the metric used by NIST evaluations before 2009, the shortest reference sentence was used instead [4].

    The overall score combines the n-gram precisions geometrically:

    $$BLEU = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big)$$

    where $\exp(\sum_{n=1}^{N} w_n \log p_n)$ is the weighted combination of the $p_n$ for the different n-gram orders.
    In our baseline, we use N = 4 and uniform weights $w_n = 1/N$.

    Below is a worked instance that includes the brevity penalty.
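
    For the candidate/reference pair used above, a minimal hand calculation gives: both sentences are 6 words long, so $c = r = 6$ and $BP = e^{1 - 6/6} = 1$. With uniform weights $w_n = 1/4$, the unsmoothed score is $BLEU = BP \cdot \exp\big(\tfrac{1}{4}(\log p_1 + \log p_2 + \log p_3 + \log p_4)\big)$; because $p_4 = 0$, the BLEU-4 score collapses to 0, which is why smoothing or a lower-order BLEU is usually applied when scoring single sentences.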

    Example code:

    Looking at the $p_n$ produced by each n-gram order separately:

    import jieba
    from nltk.translate.bleu_score import sentence_bleu
    
    source = r'猫坐在垫子上'  # source
    target = 'the cat sat on the mat'  # target
    inference = 'the cat is on the mat'  # inference
    
    # word segmentation
    source_fenci = ' '.join(jieba.cut(source))
    target_fenci = ' '.join(jieba.cut(target))
    inference_fenci = ' '.join(jieba.cut(inference))
    
    # reference holds the gold translations: a list that may contain several references,
    # each of which is a token list produced with split()
    # # e.g.
    # reference = [['this', 'is', 'a', 'duck']]
    reference = []  # gold reference translation(s)
    candidate = []  # sentence generated by the model
    # compute BLEU
    reference.append(target_fenci.split())
    candidate = (inference_fenci.split())
    score1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    score2 = sentence_bleu(reference, candidate, weights=(0, 1, 0, 0))
    score3 = sentence_bleu(reference, candidate, weights=(0, 0, 1, 0))
    score4 = sentence_bleu(reference, candidate, weights=(0, 0, 0, 1))
    reference.clear()
    print('Individual 1-gram :%f' % score1)
    print('Individual 2-gram :%f' % score2)
    print('Individual 3-gram :%f' % score3)
    print('Individual 4-gram :%f' % score4)
    

    References

    [1] Papineni K., Roukos S., Ward T., et al. BLEU: a Method for Automatic Evaluation of Machine Translation. 2002.
    [2] 机器翻译评测——BLEU算法详解 (blog post on the BLEU algorithm, with an online BLEU calculator)
    [3] BLEU详解 (blog post)
    [4] Wikipedia: BLEU
