  • Expands the topic words beyond the text under analysis through background-vocabulary clustering and topic-word association, attempting to mine the thematic content of the text. Model fitting is based on a fast Gibbs sampling algorithm. Experiments show that the fast Gibbs algorithm is roughly 5 times faster than traditional Gibbs sampling, and its accuracy and...
  • Uses an LDA model to extract keywords from a single article
  • Implements an LDA topic model with Python's gensim package to extract topics from text, with nltk used for preprocessing

     Latent Dirichlet Allocation (LDA) is currently the most popular topic-model algorithm and is widely used. LDA turns a collection of documents into a set of topics, each with its own word probabilities. Note that LDA handles the topic vocabulary purely statistically: it is a bag-of-words method. For example:
     Input:

    第一段:“Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this.”
    第二段:‘Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.’
    第三段:"We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live. "

     Output:

    (0, u'0.032*"conceive" + 0.032*"dedicate" + 0.032*"nation" + 0.032*"life"')
    (1, u'0.059*"conceive" + 0.059*"score" + 0.059*"seven" + 0.059*"proposition"')
    (2, u'0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"')
    (3, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"rest"')
    (4, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"battle"')
    

     This article gives a short walkthrough of how to implement LDA with Python's nltk, spacy and gensim packages, including the preprocessing pipeline.

    1. Preprocessing

    1.1 Tokenization

    #On first use, download the English model first:
    #python -m spacy download en_core_web_sm
    import spacy
    spacy.load('en_core_web_sm')
    from spacy.lang.en import English
    parser = English()
    #Clean the text and lowercase every token
    def tokenize(text):
        lda_tokens = []
        tokens = parser(text)
        for token in tokens:
            if token.orth_.isspace():
                continue
            elif token.like_url:
                lda_tokens.append('URL')
            elif token.orth_.startswith('@'):
                lda_tokens.append('SCREEN_NAME')
            else:
                lda_tokens.append(token.lower_)
        return lda_tokens
    

    1.2 Lemmatization

     Lemmatization (lemma) and stemming (stem) are both common word-level operations in NLP:
    lemma: reduce an inflected word to its base form, e.g. "dictionaries" -> "dictionary"
    stem: strip a word down to its root, e.g. "dictionaries" -> "dict"
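
     For a quick side-by-side comparison, here is a small sketch (not part of this article's pipeline; it uses nltk's PorterStemmer and WordNetLemmatizer instead of the wn.morphy call used below):

    from nltk.stem import PorterStemmer, WordNetLemmatizer
    # nltk.download('wordnet') may be needed on first use
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    for word in ["dictionaries", "studies"]:
        # show the stem and the lemma of each word side by side
        print(word, "-> stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word))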

    #Import WordNet, nltk's lexical database of synonyms, related words and antonyms
    import nltk
    #On first use, download the wordnet data first
    # nltk.download('wordnet')
    
    from nltk.corpus import wordnet as wn
    def get_lemma(word):
        #dogs->dog
        #aardwolves->aardwolf
        #sichuan->sichuan
        lemma = wn.morphy(word)
        if lemma is None:
            return word
        else:
            return lemma
    

    1.3 Stop-word handling with nltk's English stop-word list

    #On first use, download the stop-word data first
    # nltk.download('stopwords')
    
    en_stop = set(nltk.corpus.stopwords.words('english'))
    

    1.4 The preprocessing pipeline

    The preprocessing pipeline combines the tokenization, lemmatization and stop-word removal described above.

    #Define the preprocessing function
    def prepare_text_for_lda(text):
        #tokenize
        tokens = tokenize(text)
        #keep only tokens longer than 4 characters
        tokens = [token for token in tokens if len(token) > 4]
        #remove stop words
        tokens = [token for token in tokens if token not in en_stop]
        #lemmatize each token
        tokens = [get_lemma(token) for token in tokens]
        return tokens
    

    2. The LDA algorithm

    2.1 Preprocessing the document collection

     Load the document collection through the preprocessing function. Note that gensim.models.ldamodel works on a collection of documents rather than a single document, so its input must have the nested structure [[...], ..., [...]] rather than a flat [...].

        text_1 = u"Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
        text_2 = u'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
        text_3 = u"We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live. "
        text_data_1 = prepare_text_for_lda(text_1)
        text_data_2 = prepare_text_for_lda(text_2)
        text_data_3 = prepare_text_for_lda(text_3)
        text_data =[]
        text_data.append(text_data_1)
        text_data.append(text_data_2)
        text_data.append(text_data_3)
        print "text_data :",text_data
    

     After preprocessing the three strings and combining them into one list, the data look like this:

    [[u'engage', u'great', u'civil', u'testing', u'whether', u'nation', u'nation', u'conceive', u'dedicate', u'endure', u'altogether', u'fitting', u'proper'], [u'score', u'seven', u'years', u'father', u'bring', u'forth', u'continent', u'nation', u'conceive', u'liberty', u'dedicate', u'proposition', u'create', u'equal'], [u'great', u'battle', u'field', u'dedicate', u'portion', u'field', u'final', u'rest', u'place', u'life', u'nation', u'might']]
    

    2.2 Extracting topic words with LDA

     Note that the gensim.models.ldamodel.LdaModel() call below is tied directly to the corpus and dictionary generated here.

        #Load gensim
        import gensim
        from gensim import corpora

        #Use gensim's Dictionary to build a bag-of-words vocabulary from text_data
        dictionary = corpora.Dictionary(text_data)
        corpus = [dictionary.doc2bow(text) for text in text_data]
    
        #Use LDA to learn the top 5 topics;
        #the learned topics will also be used later to decide which topic a document belongs to
        
        NUM_TOPICS = 5  #number of topics to learn
        ldamodel = gensim.models.ldamodel.LdaModel(corpus,              
        	                                       num_topics = NUM_TOPICS,
        	                                       id2word=dictionary,
        	                                       passes=15)
        ldamodel.save('model5.gensim')
        topics = ldamodel.print_topics(num_words=4)
        for topic in topics:
            print(topic)
    

    3. Appendix: problems encountered and their fixes

    3.1 An error from spacy

    import spacy
    spacy.load('en')
    
    Traceback (most recent call last):
      File "topial_LDA.py", line 13, in <module>
        spacy.load('en')
      File "C:\Python27\lib\site-packages\spacy\__init__.py", line 15, in load
        return util.load_model(name, **overrides)
      File "C:\Python27\lib\site-packages\spacy\util.py", line 119, in load_model
        raise IOError(Errors.E050.format(name=name))
    IOError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
    

     This error occurs because the call does not tell spacy exactly which English language model to load; spacy provides several English packages.

    Modify the code as follows to make it work:

    import spacy
    spacy.load('en_core_web_sm')
    

    3.2 An error from Dictionary

    This error is the situation described in section 2.1: corpora.Dictionary() was given a single token list instead of a collection of token lists.

    C:\Python27\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
      warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    Traceback (most recent call last):
      File "topial_LDA.py", line 122, in <module>
        dictionary = corpora.Dictionary(text_data_1)
      File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 81, in __init__
        self.add_documents(documents, prune_at=prune_at)
      File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 198, in add_documents
        self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
      File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 236, in doc2bow
        raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
    TypeError: doc2bow expects an array of unicode tokens on input, not a single string
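
     The fix follows from the note in section 2.1: pass a collection of token lists, not a single token list. A minimal sketch (with hypothetical tokens):

    from gensim import corpora
    
    text_data_1 = [u'engage', u'great', u'civil']    # one tokenized document
    # corpora.Dictionary(text_data_1)                # raises the TypeError shown above
    dictionary = corpora.Dictionary([text_data_1])   # wrap the single document in a list
    print(dictionary.doc2bow(text_data_1))           # [(0, 1), (1, 1), (2, 1)]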

  • jieba word segmentation and LDA topic extraction (Python)


    1. Environment setup

    Before running the segmentation, first make sure Python is installed correctly. I installed Python 3.9, but I recommend going one version lower, e.g. Python 3.8, because some packages do not yet support the newest version when installed with pip.
    This article also relies mainly on the lda, jieba, numpy and wordcloud packages. If pip install fails, you can manually install a .whl from the unofficial Windows binaries page (use Ctrl+F on that page to find the package quickly). If a package is not there, for example lda, another site offers the official tar.gz source archives; search online for the installation details, which mostly come down to running python setup.py install.

    2. jieba segmentation and data preprocessing

    jieba is used for the word segmentation; the code is shown below. The data and code files can be downloaded from the linked repository, and when reproducing you need to change the file names below to match your own files.
    (Added 3.31: some readers reported that GitHub is sometimes unreachable, so here is a netdisk link as well: one version with only 500 short texts for quick reproduction, and the complete 30 MB long-text version, which takes longer to finish.) Text download, extraction code 1111

    import jieba
    from os import path  #used to get file paths
    import jieba.analyse as anls
    from PIL import Image
    import numpy as np
    import matplotlib.pyplot as plt
    #word-cloud generator
    from wordcloud import WordCloud,ImageColorGenerator
    #font manager, needed to render Chinese text
    import matplotlib.font_manager as fm
    
    #background (mask) image
    bg=np.array(Image.open("boy.png"))
    
    #path of the current project folder
    d=path.dirname(__file__)
    #read the stop-word list
    stopwords = [line.strip() for line in open('cn_stopwords.txt', encoding='UTF-8').readlines()]  
    #text file to analyze
    text_path="answers.txt"
    #read the text to analyze
    text=open(path.join(d,text_path),encoding="utf8").read()
    
    text_split = jieba.cut(text)  # segmentation result, stop words not yet removed
    
    #segmentation result with stop words removed (list)
    text_split_no = []
    for word in text_split:
        if word not in stopwords:
            text_split_no.append(word)
    #print(text_split_no)
    fW = open('fencioutput.txt','w',encoding = 'UTF-8')
    fW.write(' '.join(text_split_no))
    fW.close()
    
    text_split_no_str =' '.join(text_split_no)  #join the list into a str
    
    with open('fencioutput.txt',"r",encoding = 'UTF-8') as r:
        lines = r.readlines()
    with open('fencioutput.txt',"w",encoding = 'UTF-8') as w:
        for line in lines:
            if len(line) > 2:
                w.write(line)
    
    fW = open('fencioutput1.txt','w',encoding = 'UTF-8')
    fW.write(' '.join(text_split_no))
    fW.close()
    
    text_split_no_str =' '.join(text_split_no)  #join the list into a str
    
    #keyword extraction based on TF-IDF
    print("TF-IDF keyword extraction results:")
    keywords = []
    for x, w in anls.extract_tags(text_split_no_str, topK=200, withWeight=True):
        keywords.append(x)   #list of the top 200 keywords
    keywords = ' '.join(keywords)   #convert to str
    print(keywords)
    print("Word-frequency statistics:")
    txt = open("fencioutput1.txt", "r", encoding="UTF-8").read()
    words = jieba.cut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)
    for i in range(33):
        word, count=items[i]
        print((word),count)
    #generate the word cloud
    wc=WordCloud(
        background_color="white",
        max_words=200,
        mask=bg,            #word-cloud shape mask; use mask=None for the default rectangular cloud
        max_font_size=60,
        scale=16,
        random_state=42,
        font_path='simhei.ttf'   #Chinese font (SimHei); if your system lacks it, put the font file in the code directory
        ).generate(keywords)
    #font for the figure
    my_font=fm.FontProperties(fname='simhei.ttf')
    #color generator based on the background image
    image_colors=ImageColorGenerator(bg)
    #draw the word cloud
    plt.imshow(wc,interpolation="bilinear")
    #hide the axes of the word cloud
    plt.axis("off")
    #show the word cloud
    #plt.figure()
    plt.show()
    #hide the axes of the background image
    plt.axis("off")
    plt.imshow(bg,cmap=plt.cm.gray)
    #plt.show()
    
    #save the word cloud
    wc.to_file("ciyun.png")
    print("The word-cloud image has been saved")
    

    Here are the results before and after processing:
    [Text before processing]
    [Text after processing]
    The example code also exports a word cloud, shown below:
    [Word-cloud image]

    3. LDA topic extraction

    Based on the jieba segmentation from step 2, we now have the segmented file; next comes the LDA topic extraction.
    When fitting the LDA model, the parameters that mainly need adjusting are num_topics and alpha. num_topics, the number of topics, is found by repeated trial; values from 10 to 100 are all reasonable. alpha is usually set to the reciprocal of the number of topics, e.g. 0.1 for 10 topics; smaller values generally work better.

    import numpy as np
    from gensim import corpora, models
    
    
    if __name__ == '__main__':
        # read in the text data
        f = open('fencioutput.txt', encoding='utf-8')  # the already preprocessed (segmented) text
        texts = [[word for word in line.split()] for line in f]
        f.close()
        M = len(texts)
        print('Number of documents: %d' % M)
    
        # build the dictionary
        dictionary = corpora.Dictionary(texts)
        V = len(dictionary)
        print('Number of words: %d' % V)
    
        # convert each document to a vector
        corpus = [dictionary.doc2bow(text) for text in texts]  # sparse bag-of-words vector of each text
    
        # compute the documents' TF-IDF
        corpus_tfidf = models.TfidfModel(corpus)[corpus]
    
        # fit the LDA model
        num_topics = 10  # number of topics
        lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                              alpha=0.01, eta=0.01, minimum_probability=0.001,
                              update_every=1, chunksize=100, passes=1)
    
        # topics of every document
        doc_topic = [a for a in lda[corpus_tfidf]]
        print('Document-Topic:')
        print(doc_topic)
    
        # print each document's topic distribution
        num_show_topic = 5  # how many top topics to show per document
        print('Topic distribution per document:')
        doc_topics = lda.get_document_topics(corpus_tfidf)  # topic distributions of all documents
        idx = np.arange(M)  # M is the number of documents; indices 0 .. M-1
        for i in idx:
            topic = np.array(doc_topics[i])
            topic_distribute = np.array(topic[:, 1])
            topic_idx = topic_distribute.argsort()[:-num_show_topic - 1:-1]  # sort by probability, descending
            print('Document %d, top %d topics:' % (i, num_show_topic))
            print(topic_idx)
            print(topic_distribute[topic_idx])
    
        # word distribution of each topic
        num_show_term = 15  # how many words to show per topic
        for topic_id in range(num_topics):
            print('Topic #%d:\t' % topic_id)
            term_distribute_all = lda.get_topic_terms(topicid=topic_id)  # (word id, probability) pairs of this topic
            term_distribute = term_distribute_all[:num_show_term]  # only show the first few words
            term_distribute = np.array(term_distribute)
            term_id = term_distribute[:, 0].astype(int)
            print('Words: ', end="")
            for t in term_id:
                print(dictionary.id2token[t], end=' ')
            print('Probabilities: ', end="")
            print(term_distribute[:, 1])
    
        # write the topic-word lists to ldatopic.txt, 20 words per topic
        with open('ldatopic.txt', 'w', encoding='utf-8') as tw:
            for topic_id in range(num_topics):
                term_distribute_all = lda.get_topic_terms(topicid=topic_id, topn=20)
                term_distribute = np.array(term_distribute_all)
                term_id = term_distribute[:, 0].astype(int)
                for t in term_id:
                    tw.write(dictionary.id2token[t] + " ")
                tw.write("\n")
    

    The results are as follows:
    [LDA topic-extraction output]
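
    The num_topics used above was chosen by trial and error. One optional way to compare candidate values (an extra step, not part of the original post) is gensim's CoherenceModel; the sketch below assumes the texts, dictionary and corpus_tfidf variables from the script above:

    from gensim.models import CoherenceModel, LdaModel
    
    for k in (5, 10, 20):
        lda_k = LdaModel(corpus_tfidf, num_topics=k, id2word=dictionary, passes=1)
        cm = CoherenceModel(model=lda_k, texts=texts, dictionary=dictionary, coherence='c_v')
        # higher coherence usually means more interpretable topics
        print('num_topics=%d  coherence=%.4f' % (k, cm.get_coherence()))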

    References

    『LDA主题模型』用Python实现主题模型LDA
    用WordCloud词云+LDA主题模型,带你读一读《芳华》(python实现)
    python–对文本分词去停用词提取关键词并词云展示完整代码示例

  • An LDA approach to topic extraction


    Here the fetch_20newsgroups data is used for training.
    Reference:
    https://blog.csdn.net/scotfield_msn/article/details/72904651

    import gensim
    from sklearn.datasets import fetch_20newsgroups
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS
    from gensim.corpora import Dictionary
    import os
    from pprint import pprint

    Prepare the data for training and testing

    news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    documents = news_dataset.data
    print("In the dataset there are", len(documents), "textual documents")
    print("And this is the first one:\n", documents[0])

    Result:

    In the dataset there are 18846 textual documents
    And this is the first one:
    
    
    I am sure some bashers of Pens fans are pretty confused about the lack
    of any kind of posts about the recent Pens massacre of the Devils. Actually,
    I am  bit puzzled too and a bit relieved. However, I am going to put an end
    to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
    are killing those Devils worse than I thought. Jagr just showed you why
    he is much better than his regular season stats. He is also a lot
    fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
    fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
    regular season game.          PENS RULE!!!

    Usage notes for this training set:
    https://blog.csdn.net/panghaomingme/article/details/53486252
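
    A quick way to inspect the data set (a sketch; data, target and target_names are the standard fields of the sklearn Bunch returned by fetch_20newsgroups, and news_dataset is the variable loaded above):

    print(news_dataset.target_names)        # the 20 newsgroup category names
    print(news_dataset.target[0])           # numeric label of the first document
    print(news_dataset.target_names[news_dataset.target[0]])  # its category name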

    Preprocessing

    1. Tokenize and remove stop words

    def tokenize(text):
        return [token for token in simple_preprocess(text) if token not in STOPWORDS]   
    print("After the tokenizer, the previous document becomes:\n", tokenize(documents[0]))
    processed_docs = [tokenize(doc) for doc in documents]

    Result:

     ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'going', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going', 'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']

    2. Build the dictionary

    #associate each unique word with a unique ID
    word_count_dict = Dictionary(processed_docs)
    print("In the corpus there are", len(word_count_dict), "unique tokens")
    

    Result:

    In the corpus there are 95507 unique tokens
    
     Dictionary(95507 unique tokens: ['jet', 'transfer', 'stratus', 'moicheia', 'multiplies']...)

    3. Filter the dictionary, removing very frequent and very rare words

    See this post for the details:
    https://blog.csdn.net/u014595019/article/details/52218249

    dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
    1. Drop tokens that appear in fewer than no_below documents.
    2. Drop tokens that appear in more than no_above of the documents; note that this value is a fraction, not a count.
    3. After 1 and 2, keep only the keep_n most frequent tokens.
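
    A toy illustration of these three rules (hypothetical documents, separate from the news corpus; the real call on word_count_dict follows below):

    from gensim.corpora import Dictionary
    
    toy = Dictionary([["apple", "banana"], ["apple", "cherry"], ["apple", "banana", "durian"]])
    print(len(toy))                        # 4 unique tokens
    # keep tokens that occur in at least 2 documents and in at most 90% of documents
    toy.filter_extremes(no_below=2, no_above=0.9)
    print(len(toy), list(toy.token2id))    # only "banana" survives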

    word_count_dict.filter_extremes(no_below=5, no_above=0.05)  
    print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")

    Result:

    After filtering, in the corpus there are only 21757 unique tokens
    
    Some words have been removed: the number of tokens in the dictionary drops from 95507 to 21757.

    4. Represent each document as a bag-of-words vector

    bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
    print(bag_of_words_corpus[0])

    Result:

    [(104, 1), (941, 1), (1016, 1), (1366, 1), (1585, 1), (1699, 1), (1730, 2), (2122, 1), (2359, 1), (3465, 1), (3811, 1), (4121, 1), (4570, 1), (5336, 2), (5739, 1), (5885, 1), (6533, 1), (6856, 1), (8175, 1), (8519, 1), (8707, 1), (8834, 1), (9126, 2), (9746, 1), (9807, 1), (11553, 1), (11775, 1), (11930, 1), (12398, 1), (12855, 1), (13529, 5), (13958, 1), (14521, 1), (14740, 1), (14928, 2), (15185, 1), (15415, 1), (18229, 1), (18361, 1), (18915, 2), (20936, 1)]
    In each tuple of the list, the first element is the word's ID in the dictionary and the second is how many times that word occurs in the document.
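
    To make the IDs readable, each ID can be mapped back to its token through the dictionary (a small sketch using the word_count_dict built above):

    # translate the (id, count) pairs of the first document back into (word, count) pairs
    readable = [(word_count_dict[word_id], count) for word_id, count in bag_of_words_corpus[0]]
    print(readable[:10])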

    The LDA topic model

    See this post for the details:
    https://blog.csdn.net/u014595019/article/details/52218249

    1. Train the LDA model

    #the first argument is the document vectors, num_topics is the number of topics,
    #id2word (optional) is the dictionary; save the model so it does not have to be retrained next time
    model_name = "./model.lda"  
    if os.path.exists(model_name):
        lda_model = gensim.models.LdaModel.load(model_name)  
        print("loaded from old")
    else:
        # preprocess()  
        lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=80, id2word=word_count_dict)  # num_topics: the number of topics to learn
        lda_model.save(model_name)  
        print("loaded from new")

    Take a look at the word distributions under the different topics of the trained LDA model:

    #print the word distributions of 20 topics
    lda.print_topics(20)
    #print the word distribution of the topic whose id is 20
    lda.print_topic(20)

    #print 5 of the topics
    print(lda_model.print_topics(5))

    Result:

    [(20, '0.021*"yeah" + 0.020*"sc" + 0.019*"darren" + 0.018*"dream" + 0.017*"homosexuality" + 0.015*"skin" + 0.015*"car" + 0.015*"gays" + 0.013*"weather" + 0.012*"greatly"'), (14, '0.041*"ax" + 0.009*"ub" + 0.009*"mm" + 0.008*"mk" + 0.008*"cx" + 0.008*"pl" + 0.007*"yx" + 0.006*"mp" + 0.005*"mr" + 0.005*"max"'), (12, '0.046*"sleeve" + 0.024*"ss" + 0.021*"picture" + 0.020*"dave" + 0.018*"ed" + 0.015*"plutonium" + 0.014*"cox" + 0.014*"netcom" + 0.014*"boys" + 0.011*"frank"'), (11, '0.010*"cyl" + 0.010*"heads" + 0.008*"mfm" + 0.007*"fame" + 0.007*"exclusive" + 0.007*"club" + 0.007*"phillies" + 0.007*"hall" + 0.007*"eric" + 0.006*"mask"'), (6, '0.042*"card" + 0.032*"video" + 0.024*"cd" + 0.020*"monitor" + 0.018*"vga" + 0.014*"cards" + 0.012*"disks" + 0.011*"nec" + 0.011*"includes" + 0.010*"sale"')]

    The trained lda_model contains num_topics topics, each with an id and a word distribution; five topic ids appear here: 20, 14, 12, 11 and 6.
    Looking at the word distributions, some filler words that are not real topic words still remain, such as yeah and mm.
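
    One possible tweak (my own suggestion, not from the original post) is to extend the stop-word set with these leftover filler words and redo the preprocessing:

    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS
    
    # hypothetical extra stop words spotted in the topics above
    custom_stopwords = STOPWORDS.union({'yeah', 'mm', 'ax', 'max'})
    
    def tokenize_filtered(text):
        return [token for token in simple_preprocess(text) if token not in custom_stopwords]
    
    # documents is the list loaded from fetch_20newsgroups earlier
    processed_docs = [tokenize_filtered(doc) for doc in documents]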

    2. Testing

    Test the trained LDA model on an unseen document:
    1. Tokenize the document and represent it as a bag-of-words vector

    test_dataset=fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
    test_documents=test_dataset.data[0]
    test=word_count_dict.doc2bow(tokenize(test_documents))

    2. Run the document through the trained LDA model

    result=lda_model[test]
    print(result)
    for topic in result:
        #print_topic(x, y): x is the topic id, y is how many of that topic's top words to print, sorted by weight
        print(lda_model.print_topic(topic[0],2))
    

    Result:

    #each entry of result is (topic id, weight)
    [(0, 0.055915479531474528), (1, 0.043632309200769978), (4, 0.039731045163527434), (9, 0.080996565212587676), (17, 0.10949904163764483), (28, 0.052307996592365257), (37, 0.041910355481773388), (41, 0.042178320776285916), (48, 0.054036399139048161), (53, 0.12185345365187586), (55, 0.12552429387152431), (56, 0.037461188219746117), (60, 0.070012626709205633), (63, 0.042189064087661758), (76, 0.048897694057843777)]
    Below are the word distributions of the document's topics
    0.025*"scsi" + 0.007*"director"
    0.031*"keyboard" + 0.030*"key"
    0.013*"usr" + 0.011*"apr"
    0.016*"captain" + 0.014*"myers"
    0.009*"went" + 0.007*"thought"
    0.015*"government" + 0.007*"clinton"
    0.011*"colors" + 0.010*"shell"
    0.045*"cover" + 0.030*"marriage"
    0.035*"disk" + 0.018*"jumper"
    0.016*"nuclear" + 0.014*"water"
    0.015*"server" + 0.014*"software"
    0.055*"apple" + 0.025*"modem"
    0.009*"israel" + 0.007*"israeli"
    0.055*"greek" + 0.038*"greece"
    0.012*"mhz" + 0.012*"speed"

    Because the topics are not sorted by weight, the output above is in arbitrary order; what we really want is the top-n topics by weight, together with their word distributions.

    ##test
    result=lda_model[test]
    #sort by the second element (the weight), descending
    result_sort=sorted(result,key=lambda tup: -1 * tup[1])
    print("Word distributions of the document's top topics")
    #number of top topics to print for this document
    topic_num=5
    count=1
    for topic in result_sort:
        if count>topic_num:
            break
        print(lda_model.print_topic(topic[0],2),topic[1])
        count=count+1
    #each line shows the topic's word distribution followed by the topic's weight
    Word distributions of the document's top topics
    0.009*"faq" + 0.007*"random" 0.152984684444
    0.012*"mhz" + 0.012*"speed" 0.101706532617
    0.008*"probe" + 0.007*"earth" 0.088976368815
    0.019*"lib" + 0.018*"memory" 0.0841998334088
    0.016*"israel" + 0.013*"iran" 0.0832577504835
  • Topic extraction with LDA

    Learning materials for using LDA; feel free to download them and get your text topic extraction working.
  • LDA (Java, com.hankcs.lda): an Eclipse project archive whose listing includes the Corpus, LdaGibbsSampler, LdaUtil and Vocabulary classes plus a mini IT-news corpus (LDA\data\mini\IT_*.txt).

  • (0, '0.018*"风" + 0.015*"设计" + 0.011*"酒店" + 0.011*"独特" + 0.009*"房间" + 0.009*"空调" + 0.009*"感觉" + 0.008*"年代" + 0.008*"民国" + 0.008*"送"')(1, '0.030*"酒店\n" + 0.019*"停车场" + 0.018*"酒店...
  • Extracting keywords from text with an LDA topic model

    Topic model + TF-IDF keyword extraction. Background: too lazy to write the background, I will fill it in later if I remember! My hard drive failed and I also forgot whose work this code was based on; if the original author sees this, contact me and I will add the reference right away! Import Dependency Jar import gensim import math import...
  • Keyword extraction with TF-IDF: keyword extraction based on the TF-IDF algorithm — import jieba.analyse; jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=()), where sentence is the text to analyse and topK is the number of results to...
  • LDA topic model | detailed theory and hands-on code

    One is the topic of this article, Latent Dirichlet Allocation, a probabilistic topic model mainly used for text classification, with important applications in NLP. LDA was proposed in 2003 by Blei, David M., Ng, Andrew Y. and Jordan...
  • If so, there is actually still a solution: P(z|d) can also be expressed in terms of P(z|w), but P(z|w) is not part of the raw topic-model output, so obtaining P(z|w) becomes the key to this calculation. By Bayes' rule it can be computed, where P(w) is the word frequency and P(w|z) is already known from the topic model (a reconstruction of the missing formula appears after this list)....
  • Some of the basic text-analysis techniques used in the paper (word segmentation, stop-word removal, word2vec, TF-IDF, word clouds, name extraction, POS tagging, LDA topic models)
  • A PyTorch implementation of Moody's lda2vec, a topic-modeling approach that uses embeddings. Original paper: . Warning: I personally find it very hard to make the lda2vec algorithm work. Sometimes it finds a few topics, sometimes it does not, and often many of the topics it finds are a mess. The algorithm...
  • Original by Steve Shao: Contextual Topic Identification; translated by litf. Content: on a Steam review data set, compares how well four models identify topics (LDA, TF-IDF + clustering, BERT + clustering, and BERT + LDA + clustering), evaluated with the topic models'...
  • LDA主题模型.zip (an LDA topic-model archive)

    LDA (Latent Dirichlet Allocation), in Chinese... The generative process of a document: first pick a topic with some probability, then pick a word from that topic with some probability; this produces the first word of the document. Repeating the process over and over generates the whole article.
  • The corpus is a short text about cars; below, LDA-based keyword extraction is done with the Gensim library. The pipeline is: load the files -> jieba segmentation -> remove stop words -> build the bag-of-words model -> train the LDA model -> visualise the results: # -*- coding...
  • A brief look at LDA topic extraction

    The LDA topic model builds a shallow-semantics topic model from the relative document-topic, topic-term and document-term distributions. For example, take the two sentences "乔布斯离我们而去了。" (Steve Jobs has left us.) and "苹果价格会不会降?" (Will the price of 苹果 — apples, or Apple — come down?). You can see that the above...
  • Extracting tags with LDA

    ...(or decided from the length of the test text) topics; for each topic, take the N words with the highest probability in its topic-word vector (N = 30?) to form a candidate-keyword set (optionally weighted, e.g. weight = topic probability * word probability * the word's TF-IDF in the test text). Words of the test text that fall inside the candidate-keyword set, and...
  • Use the jiebaR package in R to segment Chinese text, compute word frequencies, draw word clouds and run LDA topic modelling
  • Unsupervised document-topic extraction with the LDA model: 1.1 Preparation 1.2 Building the model via the API. Interactive visual analysis of LDA with pyLDAvis: 2.1 Installing pyLDAvis 2.2 Calling the API together with gensim for the visualisation. p.s. Saving the result as a standalone web page. p.p.s. Speeding up prepare? 2.3 ...
  • 5.3 LDA topic models. A topic model tries to answer two basic questions: how to decide whether a word belongs to a particular topic, and how often a topic appears in a document. LDA topic modelling with a Gibbs sampler is faster than other approaches and very useful on large corpora. Gibbs sampling is...
  • Function notes: 1. LDA(n_topics, max_iters, random_state) builds an LDA topic model that splits the texts into different topics. Parameters: n_topics is the number of topics, max_iters the maximum number of iterations, random_state the random seed. 2. LDA....
  • Python: LDA-based topic extraction and classification of long texts, with category prediction

    Python: LDA-based long-text topic extraction and category prediction. The theory behind LDA is explained very thoroughly on Zhihu, so I won't waste words on it here. Data preparation: this stage is mainly about processing and cleaning your text. Chinese text preprocessing mainly includes removing spaces, removing form symbols, ...
  • Topic analysis and extraction with LDA in Python

    My data-science teacher assigned a task: use LDA to write a topic-extraction experiment on web pages. Below I paste the code and upload the files it needs. # !/usr/bin/python # -*- coding:utf-8 -*- import numpy as np from gensim import corpora, models, ...
  • LDA (1): text keyword extraction

    1. Algorithm: using gensim's built-in ... each document belongs to k topics; assigning the words contained in those k topics to the document gives the document's candidate keywords. If a word obtained by segmenting the document appears among the candidate keywords, it is extracted as a keyword. (The candidate keywords...
  • A topic model models the latent topics in documents, taking contextual semantic relations into account. A topic is like a "bucket" that holds a number of words with high occurrence probability. Those words are strongly related to the topic; put another way, it is exactly those words that jointly define the topic. ...
  • For example: LDA yields the document topic "音乐 旋律 节奏 乐器" (music, melody, rhythm, instrument), and the goal is to label that topic with the broad category "music"; similarly, "PM2.5, 净化, 污染, 空气" (PM2.5, purification, pollution, air) should be mapped to the label "smog". How can this be done? Any advice from the experts would be appreciated! ...
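
    The Q&A item above about computing P(z|w) refers to a formula whose image did not survive; a plausible reconstruction from the surrounding text (an assumption, not the original figure) is:

    P(z \mid w) = \frac{P(w \mid z)\, P(z)}{P(w)}, \qquad
    P(z \mid d) = \sum_{w \in d} P(z \mid w)\, P(w \mid d)

    where P(w) is the corpus word frequency and P(w|z) is known from the trained topic model.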
