  • LDA model hands-on practice

    2021-08-10 15:33:09

    A brief introduction to LDA (for the theory, see the companion post on LDA theory)

    1. LDA is an unsupervised Bayesian model:
    P(word | document) = Σ_topic P(word | topic) × P(topic | document)
    That is, the probability of a word appearing in a document is obtained by multiplying, for each topic, the probability of the word under that topic by the probability of that topic in the document, and summing over the topics (a tiny numeric illustration follows this list).
    2. LDA is used to infer the topic distribution of documents. It expresses the topics of every document in a collection as a probability distribution; once these topic distributions have been extracted, they can be used for topic clustering or for text classification.
    3. LDA uses the bag-of-words model. In a bag-of-words representation we only record whether a word occurs in a document and ignore word order; under this model "我喜欢你" ("I like you") and "你喜欢我" ("you like me") are equivalent. In theory tf-idf weights are not appropriate input for LDA, although they can still be tried; models such as n-grams and word2vec, which do take word order into account, stand in contrast to the bag-of-words model.
    4. To a certain extent LDA can handle both polysemy and synonymy:
    one word may appear in several topics: polysemy;
    several words may appear in the same topic: synonymy.
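
    To make the decomposition in point 1 concrete, here is a tiny numeric sketch (the two distributions below are made-up illustrative numbers, not the output of any trained model):

    # Hypothetical two-topic example of P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)
    p_word_given_topic = {'sports': 0.05, 'finance': 0.001}   # P("足球" | topic), made-up values
    p_topic_given_doc = {'sports': 0.9, 'finance': 0.1}       # P(topic | doc), made-up values
    p_word_given_doc = sum(p_word_given_topic[t] * p_topic_given_doc[t] for t in p_topic_given_doc)
    print(p_word_given_doc)  # 0.05*0.9 + 0.001*0.1 = 0.0451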

    First, segment the text into words

    # -*- coding:utf-8 -*-
    # @time  : 11:20
    # @Author:xxma
    import jieba
    import jieba.analyse as ana
    import re
    def cn(text):
        '''
        Strip special characters and digits, keeping only Chinese characters.
        :param text: raw text
        :return: list of runs of Chinese characters
        '''
        return re.findall(u"[\u4e00-\u9fa5]+", text)

    # Load the stop-word list
    with open(r'E:\xxma\文本分析\停用词.txt', 'r', encoding='utf-8') as f:
        stopwords = [s.strip() for s in f.readlines()]

    def jiebacut(text):
        """
        Plain jieba segmentation; the output feeds the sklearn LDA pipeline.
        :param text: raw text
        :return: the segmented words joined by single spaces
        """
        text = ''.join(cn(text))  # join the Chinese runs back into one string
        # jieba.load_userdict(r'dict.txt')  # load a custom dictionary (not used here)
        words = [w for w in jieba.lcut(text) if w not in stopwords]  # drop stop words
        # print(words, len(words))
        words = [w for w in words if len(words) > 2]  # keep words only if the document has more than 2 tokens (drops very short documents)
        return ' '.join(words)


    def ldajiebacut(text):
        """
        Plain jieba segmentation; the output feeds the gensim LDA pipeline.
        :param text: raw text
        :return: the segmented words as a list
        """
        text = ''.join(cn(text))  # join the Chinese runs back into one string
        # jieba.load_userdict(r'dict.txt')  # load a custom dictionary (not used here)
        words = [w for w in jieba.lcut(text) if w not in stopwords]  # drop stop words
        # print(words, len(words))
        words = [w for w in words if len(words) > 2]  # keep words only if the document has more than 2 tokens (drops very short documents)
        return words
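
    As a quick sanity check, the two helpers can be called directly (a minimal sketch; the sample sentence is made up and the actual output depends on your stop-word file):

    sample = "我今天很喜欢这辆红色的汽车"      # hypothetical sample sentence
    print(jiebacut(sample))      # e.g. "今天 喜欢 红色 汽车"  -> space-joined string for sklearn vectorizers
    print(ldajiebacut(sample))   # e.g. ['今天', '喜欢', '红色', '汽车']  -> token list for gensim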
    

    LDA in practice with sklearn

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from 文本分析.分词 import jiebacut  # the segmentation helper defined above

    # Load the data
    df = pd.read_csv(r'contentLDA.txt').drop_duplicates()
    # df = df.iloc[0:1000,:]
    # Segment the text
    df['cut'] = df['content'].apply(jiebacut)
    df['len'] = df['cut'].apply(len)
    df = df[df['len'] > 0].reset_index(drop=True)

    # Word vectors from raw term counts
    Vectorizer = CountVectorizer()
    vector = Vectorizer.fit_transform(df['cut'])
    # Word vectors from tf-idf weights
    Vectorizer_tfidf = TfidfVectorizer()
    vector_tfidf = Vectorizer_tfidf.fit_transform(df['cut'])

    # LDA topic model; the hyperparameters need tuning
    # In practice K (the number of topics), alpha and eta all have to be tuned; with the "online"
    # method some of its specific parameters may need adjusting as well.
    lda = LatentDirichletAllocation(n_components=2,            # number of topics
                                    max_iter=1000,
                                    doc_topic_prior=0.01,
                                    learning_method='online',  # update the topic-word distribution in mini-batches; lower memory footprint
                                    verbose=False,             # whether to print progress during the iterations
                                    random_state=0)
    result = lda.fit_transform(vector_tfidf)

    # Extract the words associated with each topic
    topic_words = []
    def print_top_words(model, feature_names, n_top_words):
        # collect the highest-weighted terms of every topic
        for topic_idx, topic in enumerate(model.components_):
            # print("Topic #%d:" % topic_idx)
            words = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
            # print(words)
            topic_words.append(words)
        return topic_words
        # to inspect the full topic-word distribution matrix:
        # print(model.components_)


    n_top_words = 10
    tf_feature_names = Vectorizer_tfidf.get_feature_names()
    topic_detail = print_top_words(lda, tf_feature_names, n_top_words)

    # Get the dominant topic of every document
    df_result = pd.DataFrame(result)
    df_result['max'] = df_result.max(axis=1)
    df_result['max_index'] = np.argmax(result, axis=1)
    df_result['topic_words'] = df_result['max_index'].apply(lambda x: topic_detail[x])

    # Save the results
    df_result = df_result[['max', 'max_index', 'topic_words']]
    df = pd.DataFrame(df['content'])
    df_save = df.join(df_result)
    df_save.to_csv(r'contentLDA_sklearn_result.csv', sep='\t', index=False)

    # Save the model for later use
    # import joblib
    # joblib.dump(lda, 'sklearn_lda.pkl')
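
    The commented-out joblib call above only persists the LDA model itself. A minimal sketch of reusing it later (it assumes the fitted TfidfVectorizer was saved as well; the file names are placeholders):

    import joblib
    # joblib.dump(Vectorizer_tfidf, 'sklearn_tfidf.pkl')   # persist the fitted vectorizer too
    lda_loaded = joblib.load('sklearn_lda.pkl')
    vec_loaded = joblib.load('sklearn_tfidf.pkl')
    new_docs = ['这是一条新的待预测文本']                    # hypothetical new document
    new_vec = vec_loaded.transform([jiebacut(d) for d in new_docs])
    print(lda_loaded.transform(new_vec))                  # topic distribution of the new document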
    

    LDA in practice with gensim

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')  # silence warnings
    from 文本分析.分词 import ldajiebacut  # the segmentation helper defined above
    from gensim import corpora, models

    # Load the data
    df = pd.read_csv(r'contentLDA.txt', sep='\t')
    # Segment the text
    df['cut'] = df['content'].apply(ldajiebacut)
    df['len'] = df['cut'].apply(len)
    df = df[df['len'] > 0].reset_index(drop=True)

    # Word vectors
    dictionary = corpora.Dictionary(df['cut'])  # build the dictionary
    corpus = [dictionary.doc2bow(text) for text in df['cut']]  # bag-of-words vector of every document
    # tf-idf weighting
    tfidf_model = models.TfidfModel(corpus)
    corpus_tfidf = tfidf_model[corpus]

    # Train the LDA model (user-set parameters: number of topics num_topics, number of passes)
    topics_num = 2
    ldamodel = models.ldamodel.LdaModel(corpus_tfidf
                                        , id2word=dictionary
                                        , num_topics=topics_num
                                        , passes=200
                                        , iterations=1000
                                        , alpha=0.01
                                        , random_state=0)

    # Get the words associated with each topic
    topics_num_list = []
    for i in range(topics_num):
        topics = ldamodel.print_topic(topicno=i)
        topics_num_list.append(topics)
    # print(topics_num_list)

    # Get the dominant topic of every document and its words
    corpus_lda = ldamodel[corpus_tfidf]
    topic_id_list = []
    topic_pro_list = []
    for doc in corpus_lda:
        # print(doc)
        # doc is a list of (topic_id, probability) pairs; keep the most probable topic
        topic_id, topic_pro = max(doc, key=lambda x: x[1])
        topic_id_list.append(topic_id)
        topic_pro_list.append(topic_pro)

    df['topic'] = topic_id_list
    df['topic_pro'] = topic_pro_list
    df['topic_words'] = df['topic'].apply(lambda x: topics_num_list[x])

    # Save the results
    df = df[['content', 'topic', 'topic_pro', 'topic_words']]
    df.to_csv(r'contentLDA_gensim_result.csv', sep='\t', index=False, encoding='utf-8')

    # Save the model
    # ldamodel.save('gensim_lda.pkl')

    # Load the saved LDA model later (file name matches the save call above)
    # lda = models.ldamodel.LdaModel.load('gensim_lda.pkl')
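
    To score a brand-new document with a reloaded gensim model, the dictionary (and, because the model above was trained on tf-idf vectors, the TfidfModel) has to be saved and reloaded too. A minimal sketch under those assumptions (file names are placeholders):

    # dictionary.save('gensim_lda.dict'); tfidf_model.save('gensim_tfidf.model')
    loaded_dict = corpora.Dictionary.load('gensim_lda.dict')
    loaded_tfidf = models.TfidfModel.load('gensim_tfidf.model')
    loaded_lda = models.ldamodel.LdaModel.load('gensim_lda.pkl')
    new_bow = loaded_dict.doc2bow(ldajiebacut('这是一条新的待预测文本'))   # hypothetical new document
    print(loaded_lda.get_document_topics(loaded_tfidf[new_bow]))        # [(topic_id, probability), ...]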
    
  • Common LDA-model techniques for real projects

    Viewed 1,000+ times · 2019-07-10 16:02:55

    The 2019 Stata & Python summer workshop on empirical econometrics and web scraping starts in just a few days. I have shared LDA topic models on this public account several times before, but the problems covered were fairly simple. This time, in this notebook, I will work through the following tasks hands-on:

    • Extract the keywords of each topic

    • Use grid search to find the best model parameters

    • Visualize the topic model

    • Predict the topic of newly input text

    • How to inspect the feature words of a topic

    • How to get the n most important feature words of each topic

    1. Load the data

    Here we use the 20 Newsgroups dataset.

    import pandas as pd	
    df = pd.read_json('newsgroups.json')	
    df.head()


    Check which categories appear in target_names

    df.target_names.unique()

    Run

    array(['rec.autos', 'comp.sys.mac.hardware', 'rec.motorcycles',	
           'misc.forsale', 'comp.os.ms-windows.misc', 'alt.atheism',	
           'comp.graphics', 'rec.sport.baseball', 'rec.sport.hockey',	
           'sci.electronics', 'sci.space', 'talk.politics.misc', 'sci.med',	
           'talk.politics.mideast', 'soc.religion.christian',	
           'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns',	
           'talk.religion.misc', 'sci.crypt'], dtype=object)

    2. Clean the English text

    1. Use regular expressions to strip e-mail addresses, line breaks and other extra whitespace

    2. Tokenize with gensim's simple_preprocess to get a list of words

    3. Keep only words with certain part-of-speech tags https://www.guru99.com/pos-tagging-chunking-nltk.html

    Note:

    Installing and configuring nltk and spacy can be fiddly; see this article:

    自然语言处理库nltk、spacy安装及配置方法 (how to install and configure the nltk and spacy NLP libraries). The nltk corpora and the spacy English model are both included in the tutorial folder.

    import nltk
    import gensim
    import spacy
    from nltk import pos_tag
    import re
    from nltk.corpus import stopwords
    # load the spacy English model
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    def clean_text(text, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        text = re.sub('\S*@\S*\s?', '', text)  # strip e-mail addresses
        text = re.sub('\s+', ' ', text)        # collapse consecutive spaces, newlines and tabs into a single space
        # deacc=True transliterates some accented letters to plain ASCII, e.g.
        # "Šéf chomutovských komunistů dostal poštou bílý prášek" becomes
        # u'Sef chomutovskych komunistu dostal postou bily prasek'
        words = gensim.utils.simple_preprocess(text, deacc=True)
        # stop-word removal could be added here
        stpwords = stopwords.words('english')
        # keep only words whose part of speech is 'NOUN', 'ADJ', 'VERB' or 'ADV'
        doc = nlp(' '.join(words))
        text = " ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else ''
                         for token in doc
                         if token.pos_ in allowed_postags])
        return text
    test = "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"	
    clean_text(test)

    Run

    'where thing subject car be nntp post host rac wam umd edu organization university maryland college park line be wonder anyone out there could enlighten car see other day be door sport car look be late early be call bricklin door be really small addition front bumper be separate rest body be know anyone can tellme model name engine spec year production where car be make history info have funky look car mail thank bring neighborhood lerxst'

    Apply the cleaning function clean_text to every row of the content column

    df.content = df.content.apply(clean_text)	
    df.head()


    3. Build the document-word matrix

    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    # vectorizer = TfidfVectorizer(min_df=10)  # a word must appear in at least 10 documents
    vectorizer = CountVectorizer(analyzer='word',
                                 min_df=10,                        # minimum required occurrences of a word
                                 lowercase=True,                   # convert all words to lowercase
                                 token_pattern='[a-zA-Z0-9]{3,}',  # keep only tokens of 3 or more characters
                                 # max_features=50000,             # max number of unique words
                                )
    data_vectorized = vectorizer.fit_transform(df.content)

    Check how sparse the document-word matrix is. Note that the code below actually prints the percentage of non-zero cells, i.e. the density (the rest of the matrix is zeros).

    data_dense = data_vectorized.todense()
    # compute the percentage of non-zero cells
    print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, '%')

    Run

    Sparsicity:  0.9138563473570427 %
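
    As a side note, the same figure can be computed directly on the sparse matrix without calling todense(), which matters when the corpus is large (a minimal sketch):

    # fraction of non-zero cells, computed on the sparse matrix itself
    print("Sparsicity: ", data_vectorized.nnz / (data_vectorized.shape[0] * data_vectorized.shape[1]) * 100, '%')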

    4. Build the LDA model

    Using LatentDirichletAllocation from sklearn.

    from sklearn.decomposition import LatentDirichletAllocation
    # build the LDA topic model
    lda_model = LatentDirichletAllocation(n_components=20)              # number of topics
    lda_output = lda_model.fit_transform(data_vectorized)

    Model performance

    # approximate log-likelihood; higher is better
    print(lda_model.score(data_vectorized))
    # parameters of the fitted model
    print(lda_model.get_params())

    Run

    -11868684.751381714	
    {'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'batch', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 10, 'mean_change_tol': 0.001, 'n_components': 20, 'n_jobs': None, 'perp_tol': 0.1, 'random_state': None, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}

    5. How to find the best number of topics

    LatentDirichletAllocation has many parameters, and changing them changes the results. To train a better model we take the two parameters n_components and learning_decay as an example and define a range of candidate values for each.

    Running time: about half an hour.

    from sklearn.model_selection import GridSearchCV
    # parameter grid to search over
    search_params = {'n_components': [10, 15, 20, 25, 30],
                     'learning_decay': [.5, .7, .9]}
    # initialize the LDA model
    lda = LatentDirichletAllocation()
    # initialize GridSearchCV
    model = GridSearchCV(lda, param_grid=search_params)
    # run the grid search
    model.fit(data_vectorized)

    Inspect the grid-search results

    model.cv_results_

    Run

    {'mean_fit_time': array([76.23844155, 78.47619971, 75.65877469, 92.04278994, 92.47375035,	
            70.50102162, 77.17208759, 77.42245611, 78.51173854, 80.36060111,	
            64.35273759, 80.74369526, 78.33191927, 97.60522366, 91.52556197]),	
     'std_fit_time': array([ 1.90773724,  6.00546298,  2.90480388, 10.82104708,  2.15837996,	
             0.91492716,  1.78299082,  0.99124146,  0.88202007,  2.52887488,	
             1.42895102,  3.4966494 ,  4.10921772,  8.57965772,  2.97772162]),	
     'mean_score_time': array([3.03948617, 3.12327973, 3.17385236, 4.1181256 , 4.14796472,	
            2.80464379, 3.00497603, 3.18396346, 3.29176935, 3.34573205,	
            2.60685007, 3.05136299, 3.39874609, 3.77345729, 4.19327569]),	
     'std_score_time': array([0.29957093, 0.0616576 , 0.13170509, 0.4152717 , 0.58759639,	
            0.05777709, 0.17347846, 0.06664403, 0.13021069, 0.12982755,	
            0.06256295, 0.13255927, 0.43057235, 0.29978059, 0.44248399]),	
     'param_learning_decay': masked_array(data=[0.5, 0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7, 0.7, 0.9,	
                        0.9, 0.9, 0.9, 0.9],	
                  mask=[False, False, False, False, False, False, False, False,	
                        False, False, False, False, False, False, False],	
            fill_value='?',	
                 dtype=object),	
     'param_n_components': masked_array(data=[10, 15, 20, 25, 30, 10, 15, 20, 25, 30, 10, 15, 20, 25,	
                        30],	
                  mask=[False, False, False, False, False, False, False, False,	
                        False, False, False, False, False, False, False],	
            fill_value='?',	
                 dtype=object),	
     'params': [{'learning_decay': 0.5, 'n_components': 10},	
      {'learning_decay': 0.5, 'n_components': 15},	
      {'learning_decay': 0.5, 'n_components': 20},	
      {'learning_decay': 0.5, 'n_components': 25},	
      {'learning_decay': 0.5, 'n_components': 30},	
      {'learning_decay': 0.7, 'n_components': 10},	
      {'learning_decay': 0.7, 'n_components': 15},	
      {'learning_decay': 0.7, 'n_components': 20},	
      {'learning_decay': 0.7, 'n_components': 25},	
      {'learning_decay': 0.7, 'n_components': 30},	
      {'learning_decay': 0.9, 'n_components': 10},	
      {'learning_decay': 0.9, 'n_components': 15},	
      {'learning_decay': 0.9, 'n_components': 20},	
      {'learning_decay': 0.9, 'n_components': 25},	
      {'learning_decay': 0.9, 'n_components': 30}],	
     'split0_test_score': array([-3874856.42190824, -3881092.28265286, -3905854.25463761,	
            -3933237.60526826, -3945083.8541135 , -3873412.75021688,	
            -3873882.90565526, -3911751.31895979, -3921171.68942096,	
            -3949413.2598192 , -3876577.95159756, -3886340.65539402,	
            -3896362.39547871, -3926181.21965185, -3950533.84046263]),	
     'split1_test_score': array([-4272638.34477011, -4294980.87988645, -4310841.4440567 ,	
            -4336244.55854965, -4341014.91687451, -4279229.66282939,	
            -4302326.23456232, -4317599.83998105, -4325020.1483235 ,	
            -4338663.90026249, -4284095.2173055 , -4294941.56802545,	
            -4299746.08581904, -4331262.03558289, -4338027.82208097]),	
     'split2_test_score': array([-4200870.80494405, -4219318.82663835, -4222122.82436968,	
            -4237003.85511169, -4258352.71194228, -4192824.54480934,	
            -4200329.40329793, -4231613.93138699, -4258255.99302186,	
            -4270014.58888107, -4199499.64459735, -4209918.86599275,	
            -4230265.99859102, -4247913.06952193, -4256046.3237088 ]),	
     'mean_test_score': array([-4116100.53270373, -4131775.17089196, -4146251.59136724,	
            -4168807.85000785, -4181462.93317874, -4115134.28591336,	
            -4125490.60725673, -4153633.64919084, -4168127.44754368,	
            -4186009.66931221, -4120036.0842904 , -4130378.79165891,	
            -4142103.10465406, -4168430.69488042, -4181515.57804474]),	
     'std_test_score': array([173105.26046897, 179953.68165447, 173824.10245002, 171450.68036995,	
            170539.38663682, 174546.8275931 , 182743.94823856, 174623.71594324,	
            176761.14575071, 169651.81366214, 175603.01769822, 176039.50084949,	
            176087.37700361, 174665.17839821, 166743.56843518]),	
     'rank_test_score': array([ 2,  6,  8, 12, 13,  1,  4,  9, 10, 15,  3,  5,  7, 11, 14],	
           dtype=int32)}

    Plot the log-likelihood scores from the grid search to see how each parameter combination performs.

    import matplotlib.pyplot as plt
    # Get log-likelihoods from the grid search output
    n_topics = [10, 15, 20, 25, 30]
    log_likelyhoods_5 = model.cv_results_['mean_test_score'][model.cv_results_['param_learning_decay']==0.5]
    log_likelyhoods_7 = model.cv_results_['mean_test_score'][model.cv_results_['param_learning_decay']==0.7]
    log_likelyhoods_9 = model.cv_results_['mean_test_score'][model.cv_results_['param_learning_decay']==0.9]
    # Show graph
    plt.figure(figsize=(12, 8))
    plt.plot(n_topics, log_likelyhoods_5, label='0.5')
    plt.plot(n_topics, log_likelyhoods_7, label='0.7')
    plt.plot(n_topics, log_likelyhoods_9, label='0.9')
    plt.title("Choosing Optimal LDA Model")
    plt.xlabel("Num Topics")
    plt.ylabel("Log Likelihood Scores")
    plt.legend(title='Learning decay', loc='best')
    plt.show()



    #最佳话题模型	
    best_lda_model = model.best_estimator_	
    print("Best Model's Params: ", model.best_params_)	
    print("Best Log Likelihood Score: ", model.best_score_)

    Run

    Best Model's Params:  {'learning_decay': 0.7, 'n_components': 10}	
    Best Log Likelihood Score:  -4115134.285913357

    6. How to inspect the topic information of each document

    LDA assigns a topic distribution to every document; the topic with the largest probability is the one that best represents the document.

    import numpy as np
    # document-topic matrix of the best model
    lda_output = best_lda_model.transform(data_vectorized)
    # column names
    topicnames = ["Topic" + str(i)
                  for i in range(best_lda_model.n_components)]
    # row index names
    docnames = ["Doc" + str(i)
                for i in range(len(df.content))]
    # wrap in a pd.DataFrame
    df_document_topic = pd.DataFrame(np.round(lda_output, 2),
                                     columns=topicnames,
                                     index=docnames)
    # Get dominant topic for each document
    dominant_topic = np.argmax(df_document_topic.values, axis=1)
    df_document_topic['dominant_topic'] = dominant_topic
    # Styling
    def color_green(val):
        color = 'green' if val > .1 else 'black'
        return 'color: {col}'.format(col=color)
    def make_bold(val):
        weight = 700 if val > .1 else 400
        return 'font-weight: {weight}'.format(weight=weight)
    # Apply Style
    df_document_topics = df_document_topic.sample(10).style.applymap(color_green).applymap(make_bold)
    df_document_topics


    Check how the documents are distributed over the topics
    df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")	
    df_topic_distribution.columns = ['Topic Num', 'Num Documents']	
    df_topic_distribution


    7. How to visualize LDA

    Visualize the topics with pyLDAvis.

    import pyLDAvis
    import pyLDAvis.sklearn
    # render inside the notebook
    pyLDAvis.enable_notebook()
    panel = pyLDAvis.sklearn.prepare(best_lda_model,   # the fitted LDA model
                                     data_vectorized,  # the document-term matrix produced by the vectorizer
                                     vectorizer)       # the fitted CountVectorizer / TfidfVectorizer
    panel


    Because of network issues I cannot embed a GIF here, so here is a link to my earlier post, 手把手教你学会LDA话题模型可视化pyLDAvis库, where you can see what the visualization looks like.
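
    If the interactive panel needs to be shared outside the notebook, pyLDAvis can also write it to a standalone HTML file (a minimal sketch; the output file name is arbitrary):

    # save the interactive visualization as an HTML file that opens in any browser
    pyLDAvis.save_html(panel, 'lda_visualization.html')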

    8. How to inspect the feature words of a topic

    Every topic is characterized by a set of weighted words, i.e. one row of a two-dimensional topic-word matrix.

    # topic-keyword matrix
    df_topic_keywords = pd.DataFrame(best_lda_model.components_)
    # reassign the column names and row index of the dataframe
    df_topic_keywords.columns = vectorizer.get_feature_names()  # vocabulary of the training corpus
    df_topic_keywords.index = topicnames
    df_topic_keywords


    9. How to get the n most important feature words of each topic

    # show the n most important words of each topic
    def show_topics(vectorizer=vectorizer, lda_model=lda_model, top_n=20):
        keywords = np.array(vectorizer.get_feature_names())
        topic_keywords = []
        # topic-word weight matrix
        for topic_weights in lda_model.components_:
            # indices of the top_n words with the largest weights
            top_keyword_locs = (-topic_weights).argsort()[:top_n]
            # look up the corresponding keywords
            topic_keywords.append(keywords.take(top_keyword_locs))
        return topic_keywords
    topic_keywords = show_topics(vectorizer=vectorizer,
                                 lda_model=best_lda_model,
                                 top_n=10)     # the 10 most important words
    df_topic_keywords = pd.DataFrame(topic_keywords)
    df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
    df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
    df_topic_keywords


    10. How to predict the topic of new text

    Feed new text into the trained model to predict its topic.

    # Define function to predict topic for a given text document.
    # nlp = spacy.load('en', disable=['parser', 'ner'])
    def predict_topic(texts, nlp=nlp):
        # clean the data: strip whitespace and e-mail addresses, drop uninformative words, keep only informative POS tags
        cleaned_texts = []
        for text in texts:
            cleaned_texts.append(clean_text(text))
        doc_term_matrix = vectorizer.transform(cleaned_texts)
        # LDA transform: document-topic probability distribution
        topic_term_prob_matrix = best_lda_model.transform(doc_term_matrix)
        # dominant topic
        topic_index = np.argmax(topic_term_prob_matrix)
        topic_word = df_topic_keywords.iloc[topic_index, :].values.tolist()
        return topic_index, topic_word, topic_term_prob_matrix
    # prediction
    mytext = ["Some text about christianity and bible"]
    topic_index, topic_word, topic_term_prob_matrix = predict_topic(mytext)
    print("The text belongs to Topic", topic_index)
    print("Feature words of that topic: ", topic_word)
    print("Topic probability distribution of the document: ", topic_term_prob_matrix)

    Run

    The text belongs to Topic 5
    Feature words of that topic:  ['not', 'have', 'max', 'god', 'say', 'can', 'there', 'write', 'christian', 'would']
    Topic probability distribution of the document:  [[0.02500225 0.025      0.02500547 0.02500543 0.02500001 0.7749855
      0.02500082 0.02500052 0.025      0.025     ]]



  • Python LDA topic model hands-on

    Viewed 10,000+ times · 2019-07-11 22:04:39

    Import the relevant packages. This example uses the standalone lda library (install it with pip install lda).
     
    
    import numpy as np
    import lda

    X = lda.datasets.load_reuters()
    X.shape

    (395, 4258)

    • X is a 395 × 4258 matrix, i.e. there are 395 training documents

    vocab = lda.datasets.load_reuters_vocab()
    len(vocab)  # the full vocabulary

    4258

    • So there are 4258 distinct words in total
    
    • Look at the first ten training samples
    title = lda.datasets.load_reuters_titles()
    title[:10]
    ('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
     '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
     "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
     '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
     '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25',
     "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25",
     '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26',
     "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25",
     '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26',
     '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
    
    • Start training; the number of topics is set to 20 and the number of iterations to 1500
    model = lda.LDA(n_topics = 20, n_iter = 1500, random_state = 1)  # initialize the model; n_iter is the number of iterations
    model.fit(X)
    

    Console output:

    
    INFO:lda:n_documents: 395
    INFO:lda:vocab_size: 4258
    INFO:lda:n_words: 84010
    INFO:lda:n_topics: 20
    INFO:lda:n_iter: 1500
    INFO:lda:<0> log likelihood: -1051748
    INFO:lda:<10> log likelihood: -719800
    INFO:lda:<20> log likelihood: -699115
    INFO:lda:<30> log likelihood: -689370
    INFO:lda:<40> log likelihood: -684918
    ...
    INFO:lda:<1450> log likelihood: -654884
    INFO:lda:<1460> log likelihood: -655493
    INFO:lda:<1470> log likelihood: -655415
    INFO:lda:<1480> log likelihood: -655192
    INFO:lda:<1490> log likelihood: -655728
    INFO:lda:<1499> log likelihood: -655858
    
    
    
    
    
    <lda.lda.LDA at 0x7effa0508550>
    
    • Inspect the word distribution of the 20 topics
    topic_word = model.topic_word_
    print(topic_word.shape)
    topic_word
    

    Output:

    (20, 4258)
    
    
    
    
    
    array([[3.62505347e-06, 3.62505347e-06, 3.62505347e-06, ...,
            3.62505347e-06, 3.62505347e-06, 3.62505347e-06],
           [1.87498968e-02, 1.17916463e-06, 1.17916463e-06, ...,
            1.17916463e-06, 1.17916463e-06, 1.17916463e-06],
           [1.52206232e-03, 5.05668544e-06, 4.05040504e-03, ...,
            5.05668544e-06, 5.05668544e-06, 5.05668544e-06],
           ...,
           [4.17266923e-02, 3.93610908e-06, 9.05698699e-03, ...,
            3.93610908e-06, 3.93610908e-06, 3.93610908e-06],
           [2.37609835e-06, 2.37609835e-06, 2.37609835e-06, ...,
            2.37609835e-06, 2.37609835e-06, 2.37609835e-06],
           [3.46310752e-06, 3.46310752e-06, 3.46310752e-06, ...,
            3.46310752e-06, 3.46310752e-06, 3.46310752e-06]])
    
    • Get the top 8 words of each topic
    for i, topic_dist in enumerate(topic_word):
        print(np.array(vocab)[np.argsort(topic_dist)][:-9:-1])
    ['british' 'churchill' 'sale' 'million' 'major' 'letters' 'west' 'britain']
    ['church' 'government' 'political' 'country' 'state' 'people' 'party'
     'against']
    ['elvis' 'king' 'fans' 'presley' 'life' 'concert' 'young' 'death']
    ['yeltsin' 'russian' 'russia' 'president' 'kremlin' 'moscow' 'michael'
     'operation']
    ['pope' 'vatican' 'paul' 'john' 'surgery' 'hospital' 'pontiff' 'rome']
    ['family' 'funeral' 'police' 'miami' 'versace' 'cunanan' 'city' 'service']
    ['simpson' 'former' 'years' 'court' 'president' 'wife' 'south' 'church']
    ['order' 'mother' 'successor' 'election' 'nuns' 'church' 'nirmala' 'head']
    ['charles' 'prince' 'diana' 'royal' 'king' 'queen' 'parker' 'bowles']
    ['film' 'french' 'france' 'against' 'bardot' 'paris' 'poster' 'animal']
    ['germany' 'german' 'war' 'nazi' 'letter' 'christian' 'book' 'jews']
    ['east' 'peace' 'prize' 'award' 'timor' 'quebec' 'belo' 'leader']
    ["n't" 'life' 'show' 'told' 'very' 'love' 'television' 'father']
    ['years' 'year' 'time' 'last' 'church' 'world' 'people' 'say']
    ['mother' 'teresa' 'heart' 'calcutta' 'charity' 'nun' 'hospital'
     'missionaries']
    ['city' 'salonika' 'capital' 'buddhist' 'cultural' 'vietnam' 'byzantine'
     'show']
    ['music' 'tour' 'opera' 'singer' 'israel' 'people' 'film' 'israeli']
    ['church' 'catholic' 'bernardin' 'cardinal' 'bishop' 'wright' 'death'
     'cancer']
    ['harriman' 'clinton' 'u.s' 'ambassador' 'paris' 'president' 'churchill'
     'france']
    ['city' 'museum' 'art' 'exhibition' 'century' 'million' 'churches' 'set']
    - Get the topic distribution of every document and its dominant topic

    doc_topic = model.doc_topic_
    print(doc_topic.shape)  # the document-topic matrix has 395 rows and 20 columns; each row is one training sample's distribution over the 20 topics
    print("The topic distribution of the first sample is", doc_topic[0])  # print the first sample's topic distribution
    print("The dominant topic of the first sample is", doc_topic[0].argmax())
    (395, 20)
    The topic distribution of the first sample is [4.34782609e-04 3.52173913e-02 4.34782609e-04 9.13043478e-03
     4.78260870e-03 4.34782609e-04 9.13043478e-03 3.08695652e-02
     5.04782609e-01 4.78260870e-03 4.34782609e-04 4.34782609e-04
     3.08695652e-02 2.17826087e-01 4.34782609e-04 4.34782609e-04
     4.34782609e-04 3.95652174e-02 4.34782609e-04 1.09130435e-01]
    The dominant topic of the first sample is 8
    

    Reposted from: https://blog.csdn.net/jiangzhenkang/article/details/84335646

  • LDA topic model hands-on

    Viewed 1,000+ times · 2018-11-22 10:59:28

    Import the relevant packages. Documentation for the lda package is at https://github.com/lda-project/lda.
     
    
    import numpy as np
    import lda 
    
    X = lda.datasets.load_reuters()
    X.shape
    
    (395, 4258)
    
    • X is a 395 × 4258 matrix, i.e. there are 395 training documents
    vocab = lda.datasets.load_reuters_vocab()
    len(vocab)  # the full vocabulary
    
    4258
    
    • So there are 4258 distinct words in total
    
    
    • Look at the first ten training samples
    title = lda.datasets.load_reuters_titles()
    title[:10]
    
    ('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
     '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
     "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
     '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
     '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25',
     "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25",
     '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26',
     "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25",
     '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26',
     '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
    
    
    
    • Start training; the number of topics is set to 20 and the number of iterations to 1500
    model = lda.LDA(n_topics = 20, n_iter = 1500, random_state = 1)  # initialize the model; n_iter is the number of iterations
    model.fit(X)
    
    INFO:lda:n_documents: 395
    INFO:lda:vocab_size: 4258
    INFO:lda:n_words: 84010
    INFO:lda:n_topics: 20
    INFO:lda:n_iter: 1500
    INFO:lda:<0> log likelihood: -1051748
    INFO:lda:<10> log likelihood: -719800
    INFO:lda:<20> log likelihood: -699115
    INFO:lda:<30> log likelihood: -689370
    INFO:lda:<40> log likelihood: -684918
    ...
    INFO:lda:<1450> log likelihood: -654884
    INFO:lda:<1460> log likelihood: -655493
    INFO:lda:<1470> log likelihood: -655415
    INFO:lda:<1480> log likelihood: -655192
    INFO:lda:<1490> log likelihood: -655728
    INFO:lda:<1499> log likelihood: -655858
    
    
    
    
    
    <lda.lda.LDA at 0x7effa0508550>
    
    
    
    topic_word = model.topic_word_
    print(topic_word.shape)
    topic_word
    
    (20, 4258)
    
    
    
    
    
    array([[3.62505347e-06, 3.62505347e-06, 3.62505347e-06, ...,
            3.62505347e-06, 3.62505347e-06, 3.62505347e-06],
           [1.87498968e-02, 1.17916463e-06, 1.17916463e-06, ...,
            1.17916463e-06, 1.17916463e-06, 1.17916463e-06],
           [1.52206232e-03, 5.05668544e-06, 4.05040504e-03, ...,
            5.05668544e-06, 5.05668544e-06, 5.05668544e-06],
           ...,
           [4.17266923e-02, 3.93610908e-06, 9.05698699e-03, ...,
            3.93610908e-06, 3.93610908e-06, 3.93610908e-06],
           [2.37609835e-06, 2.37609835e-06, 2.37609835e-06, ...,
            2.37609835e-06, 2.37609835e-06, 2.37609835e-06],
           [3.46310752e-06, 3.46310752e-06, 3.46310752e-06, ...,
            3.46310752e-06, 3.46310752e-06, 3.46310752e-06]])
    
    • Get the top 8 words of each topic
    for i, topic_dist in enumerate(topic_word):
        print(np.array(vocab)[np.argsort(topic_dist)][:-9:-1])
    
    ['british' 'churchill' 'sale' 'million' 'major' 'letters' 'west' 'britain']
    ['church' 'government' 'political' 'country' 'state' 'people' 'party'
     'against']
    ['elvis' 'king' 'fans' 'presley' 'life' 'concert' 'young' 'death']
    ['yeltsin' 'russian' 'russia' 'president' 'kremlin' 'moscow' 'michael'
     'operation']
    ['pope' 'vatican' 'paul' 'john' 'surgery' 'hospital' 'pontiff' 'rome']
    ['family' 'funeral' 'police' 'miami' 'versace' 'cunanan' 'city' 'service']
    ['simpson' 'former' 'years' 'court' 'president' 'wife' 'south' 'church']
    ['order' 'mother' 'successor' 'election' 'nuns' 'church' 'nirmala' 'head']
    ['charles' 'prince' 'diana' 'royal' 'king' 'queen' 'parker' 'bowles']
    ['film' 'french' 'france' 'against' 'bardot' 'paris' 'poster' 'animal']
    ['germany' 'german' 'war' 'nazi' 'letter' 'christian' 'book' 'jews']
    ['east' 'peace' 'prize' 'award' 'timor' 'quebec' 'belo' 'leader']
    ["n't" 'life' 'show' 'told' 'very' 'love' 'television' 'father']
    ['years' 'year' 'time' 'last' 'church' 'world' 'people' 'say']
    ['mother' 'teresa' 'heart' 'calcutta' 'charity' 'nun' 'hospital'
     'missionaries']
    ['city' 'salonika' 'capital' 'buddhist' 'cultural' 'vietnam' 'byzantine'
     'show']
    ['music' 'tour' 'opera' 'singer' 'israel' 'people' 'film' 'israeli']
    ['church' 'catholic' 'bernardin' 'cardinal' 'bishop' 'wright' 'death'
     'cancer']
    ['harriman' 'clinton' 'u.s' 'ambassador' 'paris' 'president' 'churchill'
     'france']
    ['city' 'museum' 'art' 'exhibition' 'century' 'million' 'churches' 'set']
    
    - Get the topic distribution of every document and its dominant topic

    doc_topic = model.doc_topic_
    print(doc_topic.shape)  # the document-topic matrix has 395 rows and 20 columns; each row is one training sample's distribution over the 20 topics
    print("The topic distribution of the first sample is", doc_topic[0])  # print the first sample's topic distribution
    print("The dominant topic of the first sample is", doc_topic[0].argmax())
    
    (395, 20)
    The topic distribution of the first sample is [4.34782609e-04 3.52173913e-02 4.34782609e-04 9.13043478e-03
     4.78260870e-03 4.34782609e-04 9.13043478e-03 3.08695652e-02
     5.04782609e-01 4.78260870e-03 4.34782609e-04 4.34782609e-04
     3.08695652e-02 2.17826087e-01 4.34782609e-04 4.34782609e-04
     4.34782609e-04 3.95652174e-02 4.34782609e-04 1.09130435e-01]
    The dominant topic of the first sample is 8
    
  • (1) Assumptions of the LDA model. The figure above shows LDA in plate notation as a probabilistic graphical model, from which its basic assumptions can be read off: the topics at the different positions of a text are mutually independent, satisfying $P(\textbf{z}_m|\psi) = \prod_{n=1}^{N_m} P(z_{mn}|\psi)$ ...
  • Note: this is a hands-on machine learning / data mining project (with data + code + documentation + video walkthrough); to get the data, code, documentation and video, go straight to the end of the article. Preface: in the 21st-century era of artificial intelligence and big data, online shopping has become an important part of ...
  • In machine learning, LDA is the abbreviation of two commonly used models: Linear Discriminant Analysis and Latent Dirichlet Allocation. In this article LDA means Latent Dirichlet Allocation, which occupies a very important place among topic models and is often used for text classification. LDA was ...
  • Dataset download: ... Step 1: load the necessary libraries. We use the LDA model from gensim, so the gensim library must be installed. import pandas as pd import re from gensim.models import doc2vec, ldamodel from gensim i...
  • LDA applications
  • Notes on the LDA topic model

    Viewed 1,000+ times · 2018-10-31 12:44:18
    1. Preface ... the one covered here is Latent Dirichlet Allocation, a probabilistic topic model mainly used for text classification, with important applications in NLP. LDA was proposed in 2003 by Blei, David M., Ng, Andrew Y. and Jordan ...
  • This article walks through building a topic-analysis case with an LDA model; for the details of LDA itself please consult other references. 2. Case study: LDA is fitted on 5 hand-made documents, with the number of topics set to 3. Specifically: # from nltk import stopwords import ...
  • LDA topic model: the Python implementation

    Viewed 10,000+ times · many upvotes · 2018-06-08 10:07:39
    I have recently been working on a motivation-analysis project, which naturally led me to the LDA topic model. This post first walks through the modelling workflow; the theory will come later. There are many open-source LDA implementations; here gensim is used. 1. Text preprocessing: a quick look at what the text looks like. LDA is an unsupervised model, which means no labels are needed, ...
  • Data-mining case study: extracting topics from JD.com review data with an LDA topic model. Online shopping has become an important part of everyday life. Browsing and buying on e-commerce platforms generates massive user-behaviour data, and users' product reviews are of great value to merchants. Making good use of ...
  • Training LDA with gensim in practice

    Viewed 1,000+ times · 2020-01-08 18:22:42
    Using the LDA model in gensim. 1. Training the model. 1.1 Data format: before training an LDA model with gensim you first build a bag-of-words model, whose input is the list of segmented words; with several documents it is a list of lists, e.g. [[想,买辆,汽车]]. 1.2 Build the dictionary ...
  • Linear discriminant analysis (LDA) is very widely used in pattern recognition (for example face recognition and other image-recognition tasks). LDA is a supervised dimensionality-reduction technique, which means every sample in its dataset has a class label; this is different from PCA, which ignores class labels ...
  • A study summary of the LDA topic model

    Viewed 10,000+ times · 2018-09-29 14:01:47
    This post summarizes Latent Dirichlet Allocation (LDA below). 1. Bayes' theorem: Bayes' theorem concerns the conditional probabilities of random events A and B. In Bayes' theorem, P(A|B) is the conditional probability that event A occurs given that event B has occurred ...
  • Implementing an LDA topic model with the gensim framework

    Viewed 1,000+ times · 2019-05-13 22:53:56
    The overall workflow is: take the document collection, segment it with a tokenizer, ... after the IDs are assigned, count the frequency of each word, form sparse vectors of the form "word ID: frequency", and train the LDA model on them. Reference: https://www.jianshu.com/p/22c2334cf601 ...
  • Machine learning - a simple hands-on exercise with the LDA topic model

    Viewed 1,000+ times · 2018-04-26 11:56:26
    What is a topic? Since LDA is a topic model, we first have to be clear about how LDA views topics. For a news report about yesterday's NBA basketball game, it is obvious at a glance that its topic is sports. Why is it so obvious to us? That is when ...
  • LDA topic model: from word segmentation to word clouds to the model

    Viewed 1,000+ times · 2019-11-16 10:56:50
    LDA model: # coding=utf-8 # read the corpus, one document per line print('读取语料') corpus = [] for line in open('all_jieba_output_2.txt', 'r', encoding='utf-8').readlines(): corpus.append(line.strip()) # ...
  • Today we talk about the topic model (Latent Dirichlet Allocation). Topic models are generative models, whereas the machine learning models we use most often, such as decision trees, support vector machines and CNNs, are discriminative models, so the author first gives a brief introduction to discriminative versus generative models. Below the author ...
  • The LDA topic model (algorithm explained in detail)

    Viewed 10,000+ times · 2018-01-24 13:56:26
    The LDA topic model (algorithm explained in detail) http://blog.csdn.net/weixin_41090915/article/details/79058768?%3E 1. A brief introduction to the LDA topic model: LDA (Latent Dirichlet Allocation) translates into Chinese as 潜在狄利克雷分布 (latent Dirichlet allocation). The LDA topic model is a ...
