    Gensim

    Gensim learns latent topic-level vector representations from raw, unstructured text in an unsupervised manner.

    It supports topic-model algorithms including LDA, TF-IDF, LSA, and word2vec.

    Official website: https://radimrehurek.com/gensim/

    Basic concepts (tied together in the sketch below)

    • Corpus
    • Vector
    • SparseVector
    • Model
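    A minimal sketch tying these four concepts together (a toy example: the documents below are made up for illustration):

    from gensim import corpora, models

    documents = [['human', 'computer', 'interface'],
                 ['computer', 'system', 'human']]          # corpus: a collection of documents
    dictionary = corpora.Dictionary(documents)             # assigns each token an integer id
    bow = [dictionary.doc2bow(doc) for doc in documents]   # sparse vectors: (token_id, count) pairs
    tfidf = models.TfidfModel(bow)                         # model: a transformation between vector spaces
    print(tfidf[bow[0]])                                   # the first document as a TF-IDF sparse vector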

    Installation

    Environment

    • Ubuntu18.04
    • Anaconda3-5.3.1
    !pip install gensim
    
    !conda list | grep gensim
    
    gensim                    3.8.3                     <pip>
    
    !pip install PyHamcrest
    !pip show PyHamcrest
    
    import gensim
    
    gensim.__version__
    
    '3.8.3'
    

    Preprocessing the training corpus

    Raw strings -> sparse vectors

    Raw text -> tokenization, stop-word removal, etc. -> document feature lists
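    A minimal preprocessing sketch following the pipeline above (assumes whitespace-tokenizable English text; a Chinese corpus would need a word segmenter such as jieba, and the stop-word list here is made up):

    stopwords = {'the', 'a', 'of'}
    raw_docs = ['the cat sat', 'a dog barked']
    texts = [[w for w in doc.lower().split() if w not in stopwords]
             for doc in raw_docs]
    print(texts)  # [['cat', 'sat'], ['dog', 'barked']]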

    Bag-of-words model: the document features are words

    from gensim import corpora
    
    texts = [['a', 'b', 'c'],
             ['a', 'd', 'b']]
    dictionary = corpora.Dictionary(texts)  # map each token to an integer id
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words via doc2bow
    print(corpus)
    print()
    print(corpus[0])
    print(corpus[1])
    
    [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
    
    [(0, 1), (1, 1), (2, 1)]
    [(0, 1), (1, 1), (3, 1)]
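    The dictionary's token2id mapping shows which id stands for which token (continuing the example; gensim assigns ids to each document's unseen tokens in sorted order, which is why the ids come out as below):

    print(dictionary.token2id)

    {'a': 0, 'b': 1, 'c': 2, 'd': 3}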
    
    help(corpora.Dictionary)
    
    Help on class Dictionary in module gensim.corpora.dictionary:
    
    class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
     |  Dictionary(documents=None, prune_at=2000000)
     |  
     |  Dictionary encapsulates the mapping between normalized words and their integer ids.
     |  
     |  Notable instance attributes:
     |  
     |  Attributes
     |  ----------
     |  token2id : dict of (str, int)
     |      token -> tokenId.
     |  id2token : dict of (int, str)
     |      Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
     |  cfs : dict of (int, int)
     |      Collection frequencies: token_id -> how many instances of this token are contained in the documents.
     |  dfs : dict of (int, int)
     |      Document frequencies: token_id -> how many documents contain this token.
     |  num_docs : int
     |      Number of documents processed.
     |  num_pos : int
     |      Total number of corpus positions (number of processed words).
     |  num_nnz : int
     |      Total number of non-zeroes in the BOW matrix (sum of the number of unique
     |      words per document over the entire corpus).
     |  
     |  Method resolution order:
     |      Dictionary
     |      gensim.utils.SaveLoad
     |      collections.abc.Mapping
     |      collections.abc.Collection
     |      collections.abc.Sized
     |      collections.abc.Iterable
     |      collections.abc.Container
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, tokenid)
     |      Get the string token that corresponds to `tokenid`.
     |      
     |      Parameters
     |      ----------
     |      tokenid : int
     |          Id of token.
     |      
     |      Returns
     |      -------
     |      str
     |          Token corresponding to `tokenid`.
     |      
     |      Raises
     |      ------
     |      KeyError
     |          If this Dictionary doesn't contain such `tokenid`.
     |  
     |  __init__(self, documents=None, prune_at=2000000)
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str, optional
     |          Documents to be used to initialize the mapping and collect corpus statistics.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> texts = [['human', 'interface', 'computer']]
     |          >>> dct = Dictionary(texts)  # initialize a Dictionary
     |          >>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more document (extend the vocabulary)
     |          >>> dct.doc2bow(["dog", "computer", "non_existent_word"])
     |          [(0, 1), (6, 1)]
     |  
     |  __iter__(self)
     |      Iterate over all tokens.
     |  
     |  __len__(self)
     |      Get number of stored tokens.
     |      
     |      Returns
     |      -------
     |      int
     |          Number of stored tokens.
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  add_documents(self, documents, prune_at=2000000)
     |      Update dictionary from a collection of `documents`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus. All tokens should be already **tokenized and normalized**.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = ["máma mele maso".split(), "ema má máma".split()]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
     |          >>> len(dct)
     |          10
     |  
     |  compactify(self)
     |      Assign new word ids to all words, shrinking any gaps.
     |  
     |  doc2bow(self, document, allow_update=False, return_missing=False)
     |      Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document.
     |      allow_update : bool, optional
     |          Update self, by adding new tokens from `document` and updating internal corpus statistics.
     |      return_missing : bool, optional
     |          Return missing tokens (tokens present in `document` but not in self) with frequencies?
     |      
     |      Return
     |      ------
     |      list of (int, int)
     |          BoW representation of `document`.
     |      list of (int, int), dict of (str, int)
     |          If `return_missing` is True, return BoW representation of `document` + dictionary with missing
     |          tokens and their frequencies.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
     |          >>> dct.doc2bow(["this", "is", "máma"])
     |          [(2, 1)]
     |          >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
     |          ([(2, 1)], {u'this': 1, u'is': 1})
     |  
     |  doc2idx(self, document, unknown_word_index=-1)
     |      Convert `document` (a list of words) into a list of indexes = list of `token_id`.
     |      Replace all unknown words i.e, words not in the dictionary with the index as set via `unknown_word_index`.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document
     |      unknown_word_index : int, optional
     |          Index to use for words not in the dictionary.
     |      
     |      Returns
     |      -------
     |      list of int
     |          Token ids for tokens in `document`, in the same order.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["a", "a", "b"], ["a", "c"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
     |          [0, 0, 2, -1, 2]
     |  
     |  filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
     |      Filter out tokens in the dictionary by their frequency.
     |      
     |      Parameters
     |      ----------
     |      no_below : int, optional
     |          Keep tokens which are contained in at least `no_below` documents.
     |      no_above : float, optional
     |          Keep tokens which are contained in no more than `no_above` documents
     |          (fraction of total corpus size, not an absolute number).
     |      keep_n : int, optional
     |          Keep only the first `keep_n` most frequent tokens.
     |      keep_tokens : iterable of str
     |          Iterable of tokens that **must** stay in dictionary after filtering.
     |      
     |      Notes
     |      -----
     |      This removes all tokens in the dictionary that are:
     |      
     |      #. Less frequent than `no_below` documents (absolute number, e.g. `5`) or 
     |      
     |      #. More frequent than `no_above` documents (fraction of the total corpus size, e.g. `0.3`).
     |      #. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `keep_n=None`).
     |      
     |      After the pruning, resulting gaps in word ids are shrunk.
     |      Due to this gap shrinking, **the same word may have a different word id before and after the call
     |      to this function!**
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
     |          >>> len(dct)
     |          1
     |  
     |  filter_n_most_frequent(self, remove_n)
     |      Filter out the 'remove_n' most frequent tokens that appear in the documents.
     |      
     |      Parameters
     |      ----------
     |      remove_n : int
     |          Number of the most frequent tokens that will be removed.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_n_most_frequent(2)
     |          >>> len(dct)
     |          3
     |  
     |  filter_tokens(self, bad_ids=None, good_ids=None)
     |      Remove the selected `bad_ids` tokens from :class:`~gensim.corpora.dictionary.Dictionary`.
     |      
     |      Alternatively, keep selected `good_ids` in :class:`~gensim.corpora.dictionary.Dictionary` and remove the rest.
     |      
     |      Parameters
     |      ----------
     |      bad_ids : iterable of int, optional
     |          Collection of word ids to be removed.
     |      good_ids : collection of int, optional
     |          Keep selected collection of word ids and remove the rest.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> 'ema' in dct.token2id
     |          True
     |          >>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
     |          >>> 'ema' in dct.token2id
     |          False
     |          >>> len(dct)
     |          4
     |          >>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
     |          >>> len(dct)
     |          1
     |  
     |  iteritems(self)
     |  
     |  iterkeys = __iter__(self)
     |  
     |  itervalues(self)
     |  
     |  keys(self)
     |      Get all stored ids.
     |      
     |      Returns
     |      -------
     |      list of int
     |          List of all token ids.
     |  
     |  merge_with(self, other)
     |      Merge another dictionary into this dictionary, mapping the same tokens to the same ids
     |      and new tokens to new ids.
     |      
     |      Notes
     |      -----
     |      The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
     |      `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
     |      
     |      Return a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
     |      from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
     |      
     |      Parameters
     |      ----------
     |      other : {dict, :class:`~gensim.corpora.dictionary.Dictionary`}
     |          Other dictionary.
     |      
     |      Return
     |      ------
     |      :class:`gensim.models.VocabTransform`
     |          Transformation object.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
     |          >>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1)]
     |          >>> transformer = dct_1.merge_with(dct_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1), (3, 2)]
     |  
     |  patch_with_special_tokens(self, special_token_dict)
     |      Patch token2id and id2token using a dictionary of special tokens.
     |      
     |      
     |      **Usecase:** when doing sequence modeling (e.g. named entity recognition), one may  want to specify
     |      special tokens that behave differently than others.
     |      One example is the "unknown" token, and another is the padding token.
     |      It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
     |      would be one way to specify this.
     |      
     |      Parameters
     |      ----------
     |      special_token_dict : dict of (str, int)
     |          dict containing the special tokens as keys and their wanted indices as values.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>>
     |          >>> special_tokens = {'pad': 0, 'space': 1}
     |          >>> print(dct.token2id)
     |          {'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
     |          >>>
     |          >>> dct.patch_with_special_tokens(special_tokens)
     |          >>> print(dct.token2id)
     |          {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
     |  
     |  save_as_text(self, fname, sort_by_word=True)
     |      Save :class:`~gensim.corpora.dictionary.Dictionary` to a text file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to output file.
     |      sort_by_word : bool, optional
     |          Sort words in lexicographical order before writing them out?
     |      
     |      Notes
     |      -----
     |      Format::
     |      
     |          num_docs
     |          id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
     |          id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
     |          ....
     |          id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
     |      
     |      This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable
     |      to other tools and frameworks. For better performance and to store the entire object state,
     |      including collected corpus statistics, use :meth:`~gensim.corpora.dictionary.Dictionary.save` and
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load` instead.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load_from_text`
     |          Load :class:`~gensim.corpora.dictionary.Dictionary` from text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  from_corpus(corpus, id2word=None)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from an existing corpus.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, number)
     |          Corpus in BoW format.
     |      id2word : dict of (int, object)
     |          Mapping id -> word. If None, the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Notes
     |      -----
     |      This can be useful if you only have a term-document BOW matrix (represented by `corpus`), but not the original
     |      text corpus. This method will scan the term-document count matrix for all word ids that appear in it,
     |      then construct :class:`~gensim.corpora.dictionary.Dictionary` which maps each `word_id -> id2word[word_id]`.
     |      `id2word` is an optional dictionary that maps the `word_id` to a token.
     |      In case `id2word` isn't specified the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Inferred dictionary from corpus.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
     |          >>> dct = Dictionary.from_corpus(corpus)
     |          >>> len(dct)
     |          3
     |  
     |  from_documents(documents)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
     |      
     |      Equivalent to `Dictionary(documents=documents)`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Dictionary initialized from `documents`.
     |  
     |  load_from_text(fname)
     |      Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
     |      
     |      Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      Parameters
     |      ----------
     |      fname: str
     |          Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
     |          Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __abstractmethods__ = frozenset()
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed, all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevent memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from gensim.utils.SaveLoad:
     |  
     |  load(fname, mmap=None) from abc.ABCMeta
     |      Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains needed object.
     |      mmap : str, optional
     |          Memory-map option.  If the object was saved with large arrays stored separately, you can load these arrays
     |          via mmap (shared memory) using `mmap='r'.
     |          If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.save`
     |          Save object to file.
     |      
     |      Returns
     |      -------
     |      object
     |          Object loaded from `fname`.
     |      
     |      Raises
     |      ------
     |      AttributeError
     |          When called on an object instance instead of class (this is a class method).
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from collections.abc.Mapping:
     |  
     |  __contains__(self, key)
     |  
     |  __eq__(self, other)
     |      Return self==value.
     |  
     |  get(self, key, default=None)
     |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
     |  
     |  items(self)
     |      D.items() -> a set-like object providing a view on D's items
     |  
     |  values(self)
     |      D.values() -> an object providing a view on D's values
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from collections.abc.Mapping:
     |  
     |  __hash__ = None
     |  
     |  __reversed__ = None
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from collections.abc.Collection:
     |  
     |  __subclasshook__(C) from abc.ABCMeta
     |      Abstract classes can override this to customize issubclass().
     |      
     |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
     |      It should return True, False or NotImplemented.  If it returns
     |      NotImplemented, the normal algorithm is used.  Otherwise, it
     |      overrides the normal algorithm (and the outcome is cached).
    
    help(corpora.Dictionary.doc2bow)
    
    Help on function doc2bow in module gensim.corpora.dictionary:
    
    doc2bow(self, document, allow_update=False, return_missing=False)
        Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
        
        Parameters
        ----------
        document : list of str
            Input document.
        allow_update : bool, optional
            Update self, by adding new tokens from `document` and updating internal corpus statistics.
        return_missing : bool, optional
            Return missing tokens (tokens present in `document` but not in self) with frequencies?
        
        Return
        ------
        list of (int, int)
            BoW representation of `document`.
        list of (int, int), dict of (str, int)
            If `return_missing` is True, return BoW representation of `document` + dictionary with missing
            tokens and their frequencies.
        
        Examples
        --------
        .. sourcecode:: pycon
        
            >>> from gensim.corpora import Dictionary
            >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
            >>> dct.doc2bow(["this", "is", "máma"])
            [(2, 1)]
            >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
            ([(2, 1)], {u'this': 1, u'is': 1})
    

    Topic vector transformations

    Mine the semantic structure hidden in the corpus -> text vectors

    TF-IDF model

    from gensim import models
    tfidf = models.TfidfModel(corpus)
    doc_bow = [(0, 1), (1, 1), (2, 1)]
    print(tfidf[doc_bow])
    
    [(2, 1.0)]
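    Tokens 0 and 1 appear in both documents, so their idf is log2(2/2) = 0 and they drop out of the result; token 2 appears in only one document and, after unit-length normalization, keeps weight 1.0. The model can also transform the whole corpus in one pass (continuing the example above):

    corpus_tfidf = tfidf[corpus]
    for doc in corpus_tfidf:
        print(doc)

    [(2, 1.0)]
    [(3, 1.0)]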
    
    help(models.TfidfModel)
    
    Help on class TfidfModel in module gensim.models.tfidfmodel:
    
    class TfidfModel(gensim.interfaces.TransformationABC)
     |  TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
     |  
     |  Objects of this class realize the transformation between word-document co-occurrence matrix (int)
     |  into a locally/globally weighted TF-IDF matrix (positive floats).
     |  
     |  Examples
     |  --------
     |  .. sourcecode:: pycon
     |  
     |      >>> import gensim.downloader as api
     |      >>> from gensim.models import TfidfModel
     |      >>> from gensim.corpora import Dictionary
     |      >>>
     |      >>> dataset = api.load("text8")
     |      >>> dct = Dictionary(dataset)  # fit dictionary
     |      >>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
     |      >>>
     |      >>> model = TfidfModel(corpus)  # fit model
     |      >>> vector = model[corpus[0]]  # apply model to the first corpus document
     |  
     |  Method resolution order:
     |      TfidfModel
     |      gensim.interfaces.TransformationABC
     |      gensim.utils.SaveLoad
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, bow, eps=1e-12)
     |      Get the tf-idf representation of an input vector and/or corpus.
     |      
     |      bow : {list of (int, int), iterable of iterable of (int, int)}
     |          Input document in the `sparse Gensim bag-of-words format
     |          <https://radimrehurek.com/gensim/intro.html#core-concepts>`_,
     |          or a streamed corpus of such documents.
     |      eps : float
     |          Threshold value, will remove all position that have tfidf-value less than `eps`.
     |      
     |      Returns
     |      -------
     |      vector : list of (int, float)
     |          TfIdf vector, if `bow` is a single document
     |      :class:`~gensim.interfaces.TransformedCorpus`
     |          TfIdf corpus, if `bow` is a corpus.
     |  
     |  __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
     |      Compute TF-IDF by multiplying a local component (term frequency) with a global component
     |      (inverse document frequency), and normalizing the resulting documents to unit length.
     |      Formula for non-normalized weight of term :math:`i` in document :math:`j` in a corpus of :math:`D` documents
     |      
     |      .. math:: weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}
     |      
     |      or, more generally
     |      
     |      .. math:: weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)
     |      
     |      so you can plug in your own custom :math:`wlocal` and :math:`wglobal` functions.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, int), optional
     |          Input corpus
     |      id2word : {dict, :class:`~gensim.corpora.Dictionary`}, optional
     |          Mapping token - id, that was used for converting input data to bag of words format.
     |      dictionary : :class:`~gensim.corpora.Dictionary`
     |          If `dictionary` is specified, it must be a `corpora.Dictionary` object and it will be used.
     |          to directly construct the inverse document frequency mapping (then `corpus`, if specified, is ignored).
     |      wlocals : callable, optional
     |          Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
     |          (other options: :func:`numpy.sqrt`, `lambda tf: 0.5 + (0.5 * tf / tf.max())`, etc.).
     |      wglobal : callable, optional
     |          Function for global weighting, default is :func:`~gensim.models.tfidfmodel.df2idf`.
     |      normalize : {bool, callable}, optional
     |          Normalize document vectors to unit euclidean length? You can also inject your own function into `normalize`.
     |      smartirs : str, optional
     |          SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System,
     |          a mnemonic scheme for denoting tf-idf weighting variants in the vector space model.
     |          The mnemonic for representing a combination of weights takes the form XYZ,
     |          for example 'ntc', 'bpn' and so on, where the letters represents the term weighting of the document vector.
     |      
     |          Term frequency weighing:
     |              * `b` - binary,
     |              * `t` or `n` - raw,
     |              * `a` - augmented,
     |              * `l` - logarithm,
     |              * `d` - double logarithm,
     |              * `L` - log average.
     |      
     |          Document frequency weighting:
     |              * `x` or `n` - none,
     |              * `f` - idf,
     |              * `t` - zero-corrected idf,
     |              * `p` - probabilistic idf.
     |      
     |          Document normalization:
     |              * `x` or `n` - none,
     |              * `c` - cosine,
     |              * `u` - pivoted unique,
     |              * `b` - pivoted character length.
     |      
     |          Default is 'nfc'.
     |          For more information visit `SMART Information Retrieval System
     |          <https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System>`_.
     |      pivot : float or None, optional
     |          In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
     |          normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
     |          slope) * pivot`.
     |      
     |          You can either set the `pivot` by hand, or you can let Gensim figure it out automatically with the following
     |          two steps:
     |      
     |              * Set either the `u` or `b` document normalization in the `smartirs` parameter.
     |              * Set either the `corpus` or `dictionary` parameter. The `pivot` will be automatically determined from
     |                the properties of the `corpus` or `dictionary`.
     |      
     |          If `pivot` is None and you don't follow steps 1 and 2, then pivoted document length normalization will be
     |          disabled. Default is None.
     |      
     |          See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
     |      slope : float, optional
     |          In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
     |          normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
     |          slope) * pivot`.
     |      
     |          Setting the `slope` to 0.0 uses only the `pivot` as the norm, and setting the `slope` to 1.0 effectively
     |          disables pivoted document length normalization. Singhal [2]_ suggests setting the `slope` between 0.2 and
     |          0.3 for best results. Default is 0.25.
     |      
     |          See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
     |      
     |      See Also
     |      --------
     |      ~gensim.sklearn_api.tfidf.TfIdfTransformer : Class that also uses the SMART scheme.
     |      resolve_weights : Function that also uses the SMART scheme.
     |      
     |      References
     |      ----------
     |      .. [1] Singhal, A., Buckley, C., & Mitra, M. (1996). `Pivoted Document Length
     |         Normalization <http://singhal.info/pivoted-dln.pdf>`_. *SIGIR Forum*, 51, 176–184.
     |      .. [2] Singhal, A. (2001). `Modern information retrieval: A brief overview <http://singhal.info/ieee2001.pdf>`_.
     |         *IEEE Data Eng. Bull.*, 24(4), 35–43.
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  initialize(self, corpus)
     |      Compute inverse document weights, which will be used to modify term frequencies for documents.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, int)
     |          Input corpus.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods defined here:
     |  
     |  load(*args, **kwargs) from builtins.type
     |      Load a previously saved TfidfModel class. Handles backwards compatibility from
     |      older TfidfModel versions which did not use pivoted document normalization.
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed, all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevent memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
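    A small, hedged illustration of the smartirs parameter documented above, reusing corpus from the bag-of-words section ('ltc' = logarithmic term frequency, zero-corrected idf, cosine normalization):

    tfidf_ltc = models.TfidfModel(corpus, smartirs='ltc')
    print(tfidf_ltc[corpus[0]])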
    

    Modules API Reference

    Document similarity computation

    from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
    from gensim.models import LsiModel
    
    model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
    vector = model[common_corpus[4]]  # apply model to BoW document
    model.add_documents(common_corpus[4:])  # update model with new documents
    tmp_fname = get_tmpfile("lsi.model")
    model.save(tmp_fname)  # save model
    loaded_model = LsiModel.load(tmp_fname)  # load model
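    The snippet above trains, updates, and persists the LSI model but stops short of an actual similarity query. A sketch of that final step (an assumed continuation, not from the original notebook), using gensim's MatrixSimilarity index over the LSI-transformed corpus:

    from gensim import similarities

    index = similarities.MatrixSimilarity(model[common_corpus])  # dense cosine-similarity index
    sims = index[vector]  # similarity of the earlier query vector against every document
    print(list(enumerate(sims)))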
    
    help(common_corpus)
    
    Help on list object:
    
    class list(object)
     |  list(iterable=(), /)
     |  
     |  Built-in mutable sequence.
     |  
     |  If no argument is given, the constructor creates a new empty list.
     |  The argument must be an iterable if specified.
     |  
     |  Methods defined here:
     |  
     |  __add__(self, value, /)
     |      Return self+value.
     |  
     |  __contains__(self, key, /)
     |      Return key in self.
     |  
     |  __delitem__(self, key, /)
     |      Delete self[key].
     |  
     |  __eq__(self, value, /)
     |      Return self==value.
     |  
     |  __ge__(self, value, /)
     |      Return self>=value.
     |  
     |  __getattribute__(self, name, /)
     |      Return getattr(self, name).
     |  
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |  
     |  __gt__(self, value, /)
     |      Return self>value.
     |  
     |  __iadd__(self, value, /)
     |      Implement self+=value.
     |  
     |  __imul__(self, value, /)
     |      Implement self*=value.
     |  
     |  __init__(self, /, *args, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __iter__(self, /)
     |      Implement iter(self).
     |  
     |  __le__(self, value, /)
     |      Return self<=value.
     |  
     |  __len__(self, /)
     |      Return len(self).
     |  
     |  __lt__(self, value, /)
     |      Return self<value.
     |  
     |  __mul__(self, value, /)
     |      Return self*value.
     |  
     |  __ne__(self, value, /)
     |      Return self!=value.
     |  
     |  __repr__(self, /)
     |      Return repr(self).
     |  
     |  __reversed__(self, /)
     |      Return a reverse iterator over the list.
     |  
     |  __rmul__(self, value, /)
     |      Return value*self.
     |  
     |  __setitem__(self, key, value, /)
     |      Set self[key] to value.
     |  
     |  __sizeof__(self, /)
     |      Return the size of the list in memory, in bytes.
     |  
     |  append(self, object, /)
     |      Append object to the end of the list.
     |  
     |  clear(self, /)
     |      Remove all items from list.
     |  
     |  copy(self, /)
     |      Return a shallow copy of the list.
     |  
     |  count(self, value, /)
     |      Return number of occurrences of value.
     |  
     |  extend(self, iterable, /)
     |      Extend list by appending elements from the iterable.
     |  
     |  index(self, value, start=0, stop=9223372036854775807, /)
     |      Return first index of value.
     |      
     |      Raises ValueError if the value is not present.
     |  
     |  insert(self, index, object, /)
     |      Insert object before index.
     |  
     |  pop(self, index=-1, /)
     |      Remove and return item at index (default last).
     |      
     |      Raises IndexError if list is empty or index is out of range.
     |  
     |  remove(self, value, /)
     |      Remove first occurrence of value.
     |      
     |      Raises ValueError if the value is not present.
     |  
     |  reverse(self, /)
     |      Reverse *IN PLACE*.
     |  
     |  sort(self, /, *, key=None, reverse=False)
     |      Stable sort *IN PLACE*.
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  __new__(*args, **kwargs) from builtins.type
     |      Create and return a new object.  See help(type) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __hash__ = None
    
    from pprint import pprint
    
    pprint(common_corpus)
    
    [[(0, 1), (1, 1), (2, 1)],
     [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
     [(2, 1), (5, 1), (7, 1), (8, 1)],
     [(1, 1), (5, 2), (8, 1)],
     [(3, 1), (6, 1), (7, 1)],
     [(9, 1)],
     [(9, 1), (10, 1)],
     [(9, 1), (10, 1), (11, 1)],
     [(4, 1), (10, 1), (11, 1)]]
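    common_corpus is the bag-of-words form of gensim's bundled common_texts toy documents, built with common_dictionary; the integer ids can be mapped back to tokens with (no particular output claimed):

    print(common_dictionary.token2id)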
    
    help(common_dictionary)
    
    Help on Dictionary in module gensim.corpora.dictionary object:
    
    class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
     |  Dictionary(documents=None, prune_at=2000000)
     |  
     |  Dictionary encapsulates the mapping between normalized words and their integer ids.
     |  
     |  Notable instance attributes:
     |  
     |  Attributes
     |  ----------
     |  token2id : dict of (str, int)
     |      token -> tokenId.
     |  id2token : dict of (int, str)
     |      Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
     |  cfs : dict of (int, int)
     |      Collection frequencies: token_id -> how many instances of this token are contained in the documents.
     |  dfs : dict of (int, int)
     |      Document frequencies: token_id -> how many documents contain this token.
     |  num_docs : int
     |      Number of documents processed.
     |  num_pos : int
     |      Total number of corpus positions (number of processed words).
     |  num_nnz : int
     |      Total number of non-zeroes in the BOW matrix (sum of the number of unique
     |      words per document over the entire corpus).
     |  
     |  Method resolution order:
     |      Dictionary
     |      gensim.utils.SaveLoad
     |      collections.abc.Mapping
     |      collections.abc.Collection
     |      collections.abc.Sized
     |      collections.abc.Iterable
     |      collections.abc.Container
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, tokenid)
     |      Get the string token that corresponds to `tokenid`.
     |      
     |      Parameters
     |      ----------
     |      tokenid : int
     |          Id of token.
     |      
     |      Returns
     |      -------
     |      str
     |          Token corresponding to `tokenid`.
     |      
     |      Raises
     |      ------
     |      KeyError
     |          If this Dictionary doesn't contain such `tokenid`.
     |  
     |  __init__(self, documents=None, prune_at=2000000)
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str, optional
     |          Documents to be used to initialize the mapping and collect corpus statistics.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> texts = [['human', 'interface', 'computer']]
     |          >>> dct = Dictionary(texts)  # initialize a Dictionary
     |          >>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more document (extend the vocabulary)
     |          >>> dct.doc2bow(["dog", "computer", "non_existent_word"])
     |          [(0, 1), (6, 1)]
     |  
     |  __iter__(self)
     |      Iterate over all tokens.
     |  
     |  __len__(self)
     |      Get number of stored tokens.
     |      
     |      Returns
     |      -------
     |      int
     |          Number of stored tokens.
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  add_documents(self, documents, prune_at=2000000)
     |      Update dictionary from a collection of `documents`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus. All tokens should be already **tokenized and normalized**.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = ["máma mele maso".split(), "ema má máma".split()]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
     |          >>> len(dct)
     |          10
     |  
     |  compactify(self)
     |      Assign new word ids to all words, shrinking any gaps.
     |  
     |  doc2bow(self, document, allow_update=False, return_missing=False)
     |      Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document.
     |      allow_update : bool, optional
     |          Update self, by adding new tokens from `document` and updating internal corpus statistics.
     |      return_missing : bool, optional
     |          Return missing tokens (tokens present in `document` but not in self) with frequencies?
     |      
     |      Return
     |      ------
     |      list of (int, int)
     |          BoW representation of `document`.
     |      list of (int, int), dict of (str, int)
     |          If `return_missing` is True, return BoW representation of `document` + dictionary with missing
     |          tokens and their frequencies.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
     |          >>> dct.doc2bow(["this", "is", "máma"])
     |          [(2, 1)]
     |          >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
     |          ([(2, 1)], {u'this': 1, u'is': 1})
     |  
     |  doc2idx(self, document, unknown_word_index=-1)
     |      Convert `document` (a list of words) into a list of indexes = list of `token_id`.
     |      Replace all unknown words i.e, words not in the dictionary with the index as set via `unknown_word_index`.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document
     |      unknown_word_index : int, optional
     |          Index to use for words not in the dictionary.
     |      
     |      Returns
     |      -------
     |      list of int
     |          Token ids for tokens in `document`, in the same order.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["a", "a", "b"], ["a", "c"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
     |          [0, 0, 2, -1, 2]
     |  
     |  filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
     |      Filter out tokens in the dictionary by their frequency.
     |      
     |      Parameters
     |      ----------
     |      no_below : int, optional
     |          Keep tokens which are contained in at least `no_below` documents.
     |      no_above : float, optional
     |          Keep tokens which are contained in no more than `no_above` documents
     |          (fraction of total corpus size, not an absolute number).
     |      keep_n : int, optional
     |          Keep only the first `keep_n` most frequent tokens.
     |      keep_tokens : iterable of str
     |          Iterable of tokens that **must** stay in dictionary after filtering.
     |      
     |      Notes
     |      -----
     |      This removes all tokens in the dictionary that are:
     |      
     |      #. Less frequent than `no_below` documents (absolute number, e.g. `5`) or 
     |      
     |      #. More frequent than `no_above` documents (fraction of the total corpus size, e.g. `0.3`).
     |      #. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `keep_n=None`).
     |      
     |      After the pruning, resulting gaps in word ids are shrunk.
     |      Due to this gap shrinking, **the same word may have a different word id before and after the call
     |      to this function!**
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
     |          >>> len(dct)
     |          1
     |  
     |  filter_n_most_frequent(self, remove_n)
     |      Filter out the 'remove_n' most frequent tokens that appear in the documents.
     |      
     |      Parameters
     |      ----------
     |      remove_n : int
     |          Number of the most frequent tokens that will be removed.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_n_most_frequent(2)
     |          >>> len(dct)
     |          3
     |  
     |  filter_tokens(self, bad_ids=None, good_ids=None)
     |      Remove the selected `bad_ids` tokens from :class:`~gensim.corpora.dictionary.Dictionary`.
     |      
     |      Alternatively, keep selected `good_ids` in :class:`~gensim.corpora.dictionary.Dictionary` and remove the rest.
     |      
     |      Parameters
     |      ----------
     |      bad_ids : iterable of int, optional
     |          Collection of word ids to be removed.
     |      good_ids : collection of int, optional
     |          Keep selected collection of word ids and remove the rest.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> 'ema' in dct.token2id
     |          True
     |          >>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
     |          >>> 'ema' in dct.token2id
     |          False
     |          >>> len(dct)
     |          4
     |          >>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
     |          >>> len(dct)
     |          1
     |  
     |  iteritems(self)
     |  
     |  iterkeys = __iter__(self)
     |  
     |  itervalues(self)
     |  
     |  keys(self)
     |      Get all stored ids.
     |      
     |      Returns
     |      -------
     |      list of int
     |          List of all token ids.
     |  
     |  merge_with(self, other)
     |      Merge another dictionary into this dictionary, mapping the same tokens to the same ids
     |      and new tokens to new ids.
     |      
     |      Notes
     |      -----
     |      The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
     |      `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
     |      
     |      Return a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
     |      from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
     |      
     |      Parameters
     |      ----------
     |      other : {dict, :class:`~gensim.corpora.dictionary.Dictionary`}
     |          Other dictionary.
     |      
     |      Return
     |      ------
     |      :class:`gensim.models.VocabTransform`
     |          Transformation object.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
     |          >>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1)]
     |          >>> transformer = dct_1.merge_with(dct_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1), (3, 2)]
     |  
     |  patch_with_special_tokens(self, special_token_dict)
     |      Patch token2id and id2token using a dictionary of special tokens.
     |      
     |      
     |      **Usecase:** when doing sequence modeling (e.g. named entity recognition), one may  want to specify
     |      special tokens that behave differently than others.
     |      One example is the "unknown" token, and another is the padding token.
     |      It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
     |      would be one way to specify this.
     |      
     |      Parameters
     |      ----------
     |      special_token_dict : dict of (str, int)
     |          dict containing the special tokens as keys and their wanted indices as values.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>>
     |          >>> special_tokens = {'pad': 0, 'space': 1}
     |          >>> print(dct.token2id)
     |          {'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
     |          >>>
     |          >>> dct.patch_with_special_tokens(special_tokens)
     |          >>> print(dct.token2id)
     |          {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
     |  
     |  save_as_text(self, fname, sort_by_word=True)
     |      Save :class:`~gensim.corpora.dictionary.Dictionary` to a text file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to output file.
     |      sort_by_word : bool, optional
     |          Sort words in lexicographical order before writing them out?
     |      
     |      Notes
     |      -----
     |      Format::
     |      
     |          num_docs
     |          id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
     |          id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
     |          ....
     |          id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
     |      
     |      This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable
     |      to other tools and frameworks. For better performance and to store the entire object state,
     |      including collected corpus statistics, use :meth:`~gensim.corpora.dictionary.Dictionary.save` and
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load` instead.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load_from_text`
     |          Load :class:`~gensim.corpora.dictionary.Dictionary` from text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  from_corpus(corpus, id2word=None)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from an existing corpus.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, number)
     |          Corpus in BoW format.
     |      id2word : dict of (int, object)
     |          Mapping id -> word. If None, the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Notes
     |      -----
     |      This can be useful if you only have a term-document BOW matrix (represented by `corpus`), but not the original
     |      text corpus. This method will scan the term-document count matrix for all word ids that appear in it,
     |      then construct :class:`~gensim.corpora.dictionary.Dictionary` which maps each `word_id -> id2word[word_id]`.
     |      `id2word` is an optional dictionary that maps the `word_id` to a token.
     |      In case `id2word` isn't specified the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Inferred dictionary from corpus.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
     |          >>> dct = Dictionary.from_corpus(corpus)
     |          >>> len(dct)
     |          3
     |  
     |  from_documents(documents)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
     |      
     |      Equivalent to `Dictionary(documents=documents)`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Dictionary initialized from `documents`.
     |  
     |  load_from_text(fname)
     |      Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
     |      
     |      Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
     |          Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __abstractmethods__ = frozenset()
     |  
     |  __slotnames__ = []
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed; all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevents memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from gensim.utils.SaveLoad:
     |  
     |  load(fname, mmap=None) from abc.ABCMeta
     |      Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains needed object.
     |      mmap : str, optional
     |          Memory-map option.  If the object was saved with large arrays stored separately, you can load these arrays
     |          via mmap (shared memory) using `mmap='r'`.
     |          If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.save`
     |          Save object to file.
     |      
     |      Returns
     |      -------
     |      object
     |          Object loaded from `fname`.
     |      
     |      Raises
     |      ------
     |      AttributeError
     |          When called on an object instance instead of class (this is a class method).
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from collections.abc.Mapping:
     |  
     |  __contains__(self, key)
     |  
     |  __eq__(self, other)
     |      Return self==value.
     |  
     |  get(self, key, default=None)
     |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
     |  
     |  items(self)
     |      D.items() -> a set-like object providing a view on D's items
     |  
     |  values(self)
     |      D.values() -> an object providing a view on D's values
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from collections.abc.Mapping:
     |  
     |  __hash__ = None
     |  
     |  __reversed__ = None
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from collections.abc.Collection:
     |  
     |  __subclasshook__(C) from abc.ABCMeta
     |      Abstract classes can override this to customize issubclass().
     |      
     |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
     |      It should return True, False or NotImplemented.  If it returns
     |      NotImplemented, the normal algorithm is used.  Otherwise, it
     |      overrides the normal algorithm (and the outcome is cached).
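
    A minimal sketch tying several of the Dictionary methods documented above together: filtering a token out by id, reserving a special padding index, and round-tripping the dictionary through the plain-text format. The file suffix "toy_dictionary" is an arbitrary name chosen for this example.

    from gensim.corpora import Dictionary
    from gensim.test.utils import get_tmpfile

    texts = [['a', 'b', 'c'], ['a', 'd', 'b']]
    dct = Dictionary(texts)

    # drop the token 'c' by its id; the remaining ids are compacted
    dct.filter_tokens(bad_ids=[dct.token2id['c']])
    print(len(dct))  # 3

    # reserve index 0 for a padding token
    dct.patch_with_special_tokens({'<PAD>': 0})

    # round-trip through the human-readable text format
    tmp_fname = get_tmpfile("toy_dictionary")
    dct.save_as_text(tmp_fname)
    loaded_dct = Dictionary.load_from_text(tmp_fname)
    assert dct.token2id == loaded_dct.token2id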
    
    from gensim.test.utils import get_tmpfile
    help(get_tmpfile)
    
    Help on function get_tmpfile in module gensim.test.utils:
    
    get_tmpfile(suffix)
        Get full path to file `suffix` in temporary folder.
        This function doesn't create the file (it only generates a unique name).
        Also, it may return different paths on consecutive calls.
        
        Parameters
        ----------
        suffix : str
            Suffix of file.
        
        Returns
        -------
        str
            Path to `suffix` file in temporary folder.
        
        Examples
        --------
        Using this function we may get a path to a temporary file and use it, for example, to store a temporary model.
        
        .. sourcecode:: pycon
        
            >>> from gensim.models import LsiModel
            >>> from gensim.test.utils import get_tmpfile, common_dictionary, common_corpus
            >>>
            >>> tmp_f = get_tmpfile("toy_lsi_model")
            >>>
            >>> model = LsiModel(common_corpus, id2word=common_dictionary)
            >>> model.save(tmp_f)
            >>>
            >>> loaded_model = LsiModel.load(tmp_f)
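
    As a quick sanity check, a hypothetical call (the suffix "throwaway" is arbitrary, and the exact path is platform-dependent):

    from gensim.test.utils import get_tmpfile

    path = get_tmpfile("throwaway")
    print(path)  # e.g. /tmp/throwaway -- the file itself is not created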
    
    from gensim import models
    help(models.LsiModel)
    
    Help on class LsiModel in module gensim.models.lsimodel:
    
    class LsiModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
     |  LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
     |  
     |  Model for `Latent Semantic Indexing
     |  <https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing>`_.
     |  
     |  The decomposition algorithm is described in `"Fast and Faster: A Comparison of Two Streamed
     |  Matrix Decomposition Algorithms" <https://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.
     |  
     |  Notes
     |  -----
     |  * :attr:`gensim.models.lsimodel.LsiModel.projection.u` - left singular vectors,
     |  * :attr:`gensim.models.lsimodel.LsiModel.projection.s` - singular values,
     |  * ``model[training_corpus]`` - right singular vectors (can be reconstructed if needed).
     |  
     |  See Also
     |  --------
     |  `FAQ about LSI matrices
     |  <https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi>`_.
     |  
     |  Examples
     |  --------
     |  .. sourcecode:: pycon
     |  
     |      >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
     |      >>> from gensim.models import LsiModel
     |      >>>
     |      >>> model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
     |      >>> vector = model[common_corpus[4]]  # apply model to BoW document
     |      >>> model.add_documents(common_corpus[4:])  # update model with new documents
     |      >>> tmp_fname = get_tmpfile("lsi.model")
     |      >>> model.save(tmp_fname)  # save model
     |      >>> loaded_model = LsiModel.load(tmp_fname)  # load model
     |  
     |  Method resolution order:
     |      LsiModel
     |      gensim.interfaces.TransformationABC
     |      gensim.utils.SaveLoad
     |      gensim.models.basemodel.BaseTopicModel
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, bow, scaled=False, chunksize=512)
     |      Get the latent representation for `bow`.
     |      
     |      Parameters
     |      ----------
     |      bow : {list of (int, int), iterable of list of (int, int)}
     |          Document or corpus in BoW representation.
     |      scaled : bool, optional
     |          If True - topics will be scaled by the inverse of singular values.
     |      chunksize :  int, optional
     |          Number of documents to be used in each applying chunk.
     |      
     |      Returns
     |      -------
     |      list of (int, float)
     |          Latent representation of topics in BoW format for document **OR**
     |      :class:`gensim.matutils.Dense2Corpus`
     |          Latent representation of corpus in BoW format if `bow` is corpus.
     |  
     |  __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
     |      Construct an `LsiModel` object.
     |      
     |      Either `corpus` or `id2word` must be supplied in order to train the model.
     |      
     |      Parameters
     |      ----------
     |      corpus : {iterable of list of (int, float), scipy.sparse.csc}, optional
     |          Stream of document vectors or sparse matrix of shape (`num_documents`, `num_terms`).
     |      num_topics : int, optional
     |          Number of requested factors (latent dimensions).
     |      id2word : dict of {int: str}, optional
     |          ID to word mapping, optional.
     |      chunksize :  int, optional
     |          Number of documents to be used in each training chunk.
     |      decay : float, optional
     |          Weight of existing observations relative to new ones.
     |      distributed : bool, optional
     |          If True - distributed mode (parallel execution on several machines) will be used.
     |      onepass : bool, optional
     |          Whether the one-pass algorithm should be used for training.
     |          Pass `False` to force a multi-pass stochastic algorithm.
     |      power_iters : int, optional
     |          Number of power iteration steps to be used.
     |          Increasing the number of power iterations improves accuracy, but lowers performance.
     |      extra_samples : int, optional
     |          Extra samples to be used besides the rank `k`. Can improve accuracy.
     |      dtype : type, optional
     |          Enforces a type for elements of the decomposed matrix.
     |  
     |  __str__(self)
     |      Get a human readable representation of model.
     |      
     |      Returns
     |      -------
     |      str
     |          A human-readable string of the current object's parameters.
     |  
     |  add_documents(self, corpus, chunksize=None, decay=None)
     |      Update model with new `corpus`.
     |      
     |      Parameters
     |      ----------
     |      corpus : {iterable of list of (int, float), scipy.sparse.csc}
     |          Stream of document vectors or sparse matrix of shape (`num_terms`, `num_documents`).
     |      chunksize : int, optional
     |          Number of documents to be used in each training chunk, will use `self.chunksize` if not specified.
     |      decay : float, optional
     |          Weight of existing observations relative to new ones; will use `self.decay` if not specified.
     |      
     |      Notes
     |      -----
     |      Training proceeds in chunks of `chunksize` documents at a time. The size of `chunksize` is a tradeoff
     |      between increased speed (bigger `chunksize`) vs. lower memory footprint (smaller `chunksize`).
     |      If the distributed mode is on, each chunk is sent to a different worker/computer.
     |  
     |  get_topics(self)
     |      Get the topic vectors.
     |      
     |      Notes
     |      -----
     |      The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
     |      in the matrix (real rank of input matrix smaller than `self.num_topics`).
     |      
     |      Returns
     |      -------
     |      np.ndarray
     |          The term topic matrix with shape (`num_topics`, `vocabulary_size`)
     |  
     |  print_debug(self, num_topics=5, num_words=10)
     |      Print (to log) the most salient words of the first `num_topics` topics.
     |      
     |      Unlike :meth:`~gensim.models.lsimodel.LsiModel.print_topics`, this looks for words that are significant for
     |      a particular topic *and* not for others. This *should* result in a
     |      more human-interpretable description of topics.
     |      
     |      Alias for :func:`~gensim.models.lsimodel.print_debug`.
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |  
     |  save(self, fname, *args, **kwargs)
     |      Save the model to a file.
     |      
     |      Notes
     |      -----
     |      Large internal arrays may be stored into separate files, with `fname` as prefix.
     |      
     |      Warnings
     |      --------
     |      Do not save as a compressed file if you intend to load the file back with `mmap`.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to output file.
     |      *args
     |          Variable length argument list, see :meth:`gensim.utils.SaveLoad.save`.
     |      **kwargs
     |          Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.save`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.models.lsimodel.LsiModel.load`
     |  
     |  show_topic(self, topicno, topn=10)
     |      Get the words that define a topic along with their contribution.
     |      
     |      This is actually the left singular vector of the specified topic.
     |      
     |      The most important words in defining the topic (greatest absolute value) are included
     |      in the output, along with their contribution to the topic.
     |      
     |      Parameters
     |      ----------
     |      topicno : int
     |          The topic id number.
     |      topn : int
     |          Number of words to be included in the result.
     |      
     |      Returns
     |      -------
     |      list of (str, float)
     |          Topic representation in BoW format.
     |  
     |  show_topics(self, num_topics=-1, num_words=10, log=False, formatted=True)
     |      Get the most significant topics.
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |      log : bool, optional
     |          If True - log topics with logger.
     |      formatted : bool, optional
     |          If True - each topic represented as string, otherwise - in BoW format.
     |      
     |      Returns
     |      -------
     |      list of (int, str)
     |          If `formatted=True`, return sequence with (topic_id, string representation of topics) **OR**
     |      list of (int, list of (str, float))
     |          Otherwise, return sequence with (topic_id, [(word, value), ... ]).
     |  
     |  ----------------------------------------------------------------------
     |  Class methods defined here:
     |  
     |  load(fname, *args, **kwargs) from builtins.type
     |      Load a previously saved object using :meth:`~gensim.models.lsimodel.LsiModel.save` from file.
     |      
     |      Notes
     |      -----
     |      Large arrays can be memmap'ed back as read-only (shared memory) by setting the `mmap='r'` parameter.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains LsiModel.
     |      *args
     |          Variable length argument list, see :meth:`gensim.utils.SaveLoad.load`.
     |      **kwargs
     |          Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.load`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.models.lsimodel.LsiModel.save`
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.models.lsimodel.LsiModel`
     |          Loaded instance.
     |      
     |      Raises
     |      ------
     |      IOError
     |          When called on an object instance instead of the class (this is a class method).
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __slotnames__ = []
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.models.basemodel.BaseTopicModel:
     |  
     |  print_topic(self, topicno, topn=10)
     |      Get a single topic as a formatted string.
     |      
     |      Parameters
     |      ----------
     |      topicno : int
     |          Topic id.
     |      topn : int
     |          Number of words from topic that will be used.
     |      
     |      Returns
     |      -------
     |      str
     |          String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + ... '.
     |  
     |  print_topics(self, num_topics=20, num_words=10)
     |      Get the most significant topics (alias for `show_topics()` method).
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |      
     |      Returns
     |      -------
     |      list of (int, list of (str, float))
     |          Sequence with (topic_id, [(word, value), ... ]).
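
    To tie the pieces together, here is a minimal sketch that trains an LSI model on the toy bag-of-words corpus built earlier; num_topics=2 is an arbitrary choice for such a small corpus.

    from gensim import corpora, models

    texts = [['a', 'b', 'c'], ['a', 'd', 'b']]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # train a 2-topic LSI model on the BoW corpus
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    print(lsi.print_topics(num_topics=2, num_words=4))

    # fold a new document into the latent space
    new_doc = dictionary.doc2bow(['a', 'c'])
    print(lsi[new_doc])  # list of (topic_id, weight) pairs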
    