
    Gensim

    Gensim learns latent topic-vector representations of text from raw, unstructured documents in an unsupervised way.

    It supports algorithms such as TF-IDF, LSA, LDA, and word2vec.

    Official website: https://radimrehurek.com/gensim/

    Basic concepts

    • Corpus: a collection of documents; in gensim, typically an iterable of sparse vectors
    • Vector: a document represented as a list of features
    • SparseVector: a vector that stores only the non-zero (feature_id, value) pairs
    • Model: a transformation from one vector representation to another (e.g. bag-of-words -> TF-IDF)

    Installation

    Installation environment

    • Ubuntu 18.04
    • Anaconda3-5.3.1
    !pip install gensim
    
    !conda list | grep gensim
    
    gensim                    3.8.3                     <pip>
    
    !pip install PyHamcrest
    !pip show PyHamcrest
    
    import gensim
    
    gensim.__version__
    
    '3.8.3'
    

    Preprocessing the training corpus

    Raw strings -> sparse vectors

    Raw text -> tokenization, stop-word removal, etc. -> list of document features

    Bag-of-words model: each document feature is a word
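
    Before building the bag-of-words representation below, raw strings must first be tokenized and cleaned. Here is a minimal sketch of that step; the sample sentences and the stop-word list are invented for illustration, and gensim.utils.simple_preprocess is used as a convenient English tokenizer (for Chinese text you would substitute a tokenizer such as jieba):

    from gensim.utils import simple_preprocess

    raw_documents = [
        "Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
    ]
    stop_words = {"for", "a", "of", "the"}  # toy stop-word list

    # tokenize + lowercase, then drop stop words -> list of document feature lists
    tokenized_docs = [
        [token for token in simple_preprocess(doc) if token not in stop_words]
        for doc in raw_documents
    ]
    print(tokenized_docs)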

    from gensim import corpora
    
    texts = [['a', 'b', 'c'],
             ['a', 'd', 'b']]
    dictionary = corpora.Dictionary(texts)  # build the token -> integer id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words via doc2bow
    print(corpus)
    print()
    print(corpus[0])
    print(corpus[1])
    
    [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)]]
    
    [(0, 1), (1, 1), (2, 1)]
    [(0, 1), (1, 1), (3, 1)]
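
    To see which token each integer id stands for, print the dictionary's token2id mapping. New tokens are assigned ids in sorted order, so for this toy corpus the output should be:

    print(dictionary.token2id)
    # expected: {'a': 0, 'b': 1, 'c': 2, 'd': 3}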
    
    help(corpora.Dictionary)
    
    Help on class Dictionary in module gensim.corpora.dictionary:
    
    class Dictionary(gensim.utils.SaveLoad, collections.abc.Mapping)
     |  Dictionary(documents=None, prune_at=2000000)
     |  
     |  Dictionary encapsulates the mapping between normalized words and their integer ids.
     |  
     |  Notable instance attributes:
     |  
     |  Attributes
     |  ----------
     |  token2id : dict of (str, int)
     |      token -> tokenId.
     |  id2token : dict of (int, str)
     |      Reverse mapping for token2id, initialized in a lazy manner to save memory (not created until needed).
     |  cfs : dict of (int, int)
     |      Collection frequencies: token_id -> how many instances of this token are contained in the documents.
     |  dfs : dict of (int, int)
     |      Document frequencies: token_id -> how many documents contain this token.
     |  num_docs : int
     |      Number of documents processed.
     |  num_pos : int
     |      Total number of corpus positions (number of processed words).
     |  num_nnz : int
     |      Total number of non-zeroes in the BOW matrix (sum of the number of unique
     |      words per document over the entire corpus).
     |  
     |  Method resolution order:
     |      Dictionary
     |      gensim.utils.SaveLoad
     |      collections.abc.Mapping
     |      collections.abc.Collection
     |      collections.abc.Sized
     |      collections.abc.Iterable
     |      collections.abc.Container
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, tokenid)
     |      Get the string token that corresponds to `tokenid`.
     |      
     |      Parameters
     |      ----------
     |      tokenid : int
     |          Id of token.
     |      
     |      Returns
     |      -------
     |      str
     |          Token corresponding to `tokenid`.
     |      
     |      Raises
     |      ------
     |      KeyError
     |          If this Dictionary doesn't contain such `tokenid`.
     |  
     |  __init__(self, documents=None, prune_at=2000000)
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str, optional
     |          Documents to be used to initialize the mapping and collect corpus statistics.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> texts = [['human', 'interface', 'computer']]
     |          >>> dct = Dictionary(texts)  # initialize a Dictionary
     |          >>> dct.add_documents([["cat", "say", "meow"], ["dog"]])  # add more document (extend the vocabulary)
     |          >>> dct.doc2bow(["dog", "computer", "non_existent_word"])
     |          [(0, 1), (6, 1)]
     |  
     |  __iter__(self)
     |      Iterate over all tokens.
     |  
     |  __len__(self)
     |      Get number of stored tokens.
     |      
     |      Returns
     |      -------
     |      int
     |          Number of stored tokens.
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  add_documents(self, documents, prune_at=2000000)
     |      Update dictionary from a collection of `documents`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus. All tokens should be already **tokenized and normalized**.
     |      prune_at : int, optional
     |          Dictionary will try to keep no more than `prune_at` words in its mapping, to limit its RAM
     |          footprint, the correctness is not guaranteed.
     |          Use :meth:`~gensim.corpora.dictionary.Dictionary.filter_extremes` to perform proper filtering.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = ["máma mele maso".split(), "ema má máma".split()]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.add_documents([["this", "is", "sparta"], ["just", "joking"]])
     |          >>> len(dct)
     |          10
     |  
     |  compactify(self)
     |      Assign new word ids to all words, shrinking any gaps.
     |  
     |  doc2bow(self, document, allow_update=False, return_missing=False)
     |      Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document.
     |      allow_update : bool, optional
     |          Update self, by adding new tokens from `document` and updating internal corpus statistics.
     |      return_missing : bool, optional
     |          Return missing tokens (tokens present in `document` but not in self) with frequencies?
     |      
     |      Return
     |      ------
     |      list of (int, int)
     |          BoW representation of `document`.
     |      list of (int, int), dict of (str, int)
     |          If `return_missing` is True, return BoW representation of `document` + dictionary with missing
     |          tokens and their frequencies.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
     |          >>> dct.doc2bow(["this", "is", "máma"])
     |          [(2, 1)]
     |          >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
     |          ([(2, 1)], {u'this': 1, u'is': 1})
     |  
     |  doc2idx(self, document, unknown_word_index=-1)
     |      Convert `document` (a list of words) into a list of indexes = list of `token_id`.
     |      Replace all unknown words i.e, words not in the dictionary with the index as set via `unknown_word_index`.
     |      
     |      Parameters
     |      ----------
     |      document : list of str
     |          Input document
     |      unknown_word_index : int, optional
     |          Index to use for words not in the dictionary.
     |      
     |      Returns
     |      -------
     |      list of int
     |          Token ids for tokens in `document`, in the same order.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["a", "a", "b"], ["a", "c"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.doc2idx(["a", "a", "c", "not_in_dictionary", "c"])
     |          [0, 0, 2, -1, 2]
     |  
     |  filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
     |      Filter out tokens in the dictionary by their frequency.
     |      
     |      Parameters
     |      ----------
     |      no_below : int, optional
     |          Keep tokens which are contained in at least `no_below` documents.
     |      no_above : float, optional
     |          Keep tokens which are contained in no more than `no_above` documents
     |          (fraction of total corpus size, not an absolute number).
     |      keep_n : int, optional
     |          Keep only the first `keep_n` most frequent tokens.
     |      keep_tokens : iterable of str
     |          Iterable of tokens that **must** stay in dictionary after filtering.
     |      
     |      Notes
     |      -----
     |      This removes all tokens in the dictionary that are:
     |      
     |      #. Less frequent than `no_below` documents (absolute number, e.g. `5`) or 
     |      
     |      #. More frequent than `no_above` documents (fraction of the total corpus size, e.g. `0.3`).
     |      #. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `keep_n=None`).
     |      
     |      After the pruning, resulting gaps in word ids are shrunk.
     |      Due to this gap shrinking, **the same word may have a different word id before and after the call
     |      to this function!**
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=1)
     |          >>> len(dct)
     |          1
     |  
     |  filter_n_most_frequent(self, remove_n)
     |      Filter out the 'remove_n' most frequent tokens that appear in the documents.
     |      
     |      Parameters
     |      ----------
     |      remove_n : int
     |          Number of the most frequent tokens that will be removed.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> len(dct)
     |          5
     |          >>> dct.filter_n_most_frequent(2)
     |          >>> len(dct)
     |          3
     |  
     |  filter_tokens(self, bad_ids=None, good_ids=None)
     |      Remove the selected `bad_ids` tokens from :class:`~gensim.corpora.dictionary.Dictionary`.
     |      
     |      Alternatively, keep selected `good_ids` in :class:`~gensim.corpora.dictionary.Dictionary` and remove the rest.
     |      
     |      Parameters
     |      ----------
     |      bad_ids : iterable of int, optional
     |          Collection of word ids to be removed.
     |      good_ids : collection of int, optional
     |          Keep selected collection of word ids and remove the rest.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>> 'ema' in dct.token2id
     |          True
     |          >>> dct.filter_tokens(bad_ids=[dct.token2id['ema']])
     |          >>> 'ema' in dct.token2id
     |          False
     |          >>> len(dct)
     |          4
     |          >>> dct.filter_tokens(good_ids=[dct.token2id['maso']])
     |          >>> len(dct)
     |          1
     |  
     |  iteritems(self)
     |  
     |  iterkeys = __iter__(self)
     |  
     |  itervalues(self)
     |  
     |  keys(self)
     |      Get all stored ids.
     |      
     |      Returns
     |      -------
     |      list of int
     |          List of all token ids.
     |  
     |  merge_with(self, other)
     |      Merge another dictionary into this dictionary, mapping the same tokens to the same ids
     |      and new tokens to new ids.
     |      
     |      Notes
     |      -----
     |      The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
     |      `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
     |      
     |      Return a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
     |      from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
     |      
     |      Parameters
     |      ----------
     |      other : {dict, :class:`~gensim.corpora.dictionary.Dictionary`}
     |          Other dictionary.
     |      
     |      Return
     |      ------
     |      :class:`gensim.models.VocabTransform`
     |          Transformation object.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus_1, corpus_2 = [["a", "b", "c"]], [["a", "f", "f"]]
     |          >>> dct_1, dct_2 = Dictionary(corpus_1), Dictionary(corpus_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1)]
     |          >>> transformer = dct_1.merge_with(dct_2)
     |          >>> dct_1.doc2bow(corpus_2[0])
     |          [(0, 1), (3, 2)]
     |  
     |  patch_with_special_tokens(self, special_token_dict)
     |      Patch token2id and id2token using a dictionary of special tokens.
     |      
     |      
     |      **Usecase:** when doing sequence modeling (e.g. named entity recognition), one may  want to specify
     |      special tokens that behave differently than others.
     |      One example is the "unknown" token, and another is the padding token.
     |      It is usual to set the padding token to have index `0`, and patching the dictionary with `{'<PAD>': 0}`
     |      would be one way to specify this.
     |      
     |      Parameters
     |      ----------
     |      special_token_dict : dict of (str, int)
     |          dict containing the special tokens as keys and their wanted indices as values.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>> dct = Dictionary(corpus)
     |          >>>
     |          >>> special_tokens = {'pad': 0, 'space': 1}
     |          >>> print(dct.token2id)
     |          {'maso': 0, 'mele': 1, 'máma': 2, 'ema': 3, 'má': 4}
     |          >>>
     |          >>> dct.patch_with_special_tokens(special_tokens)
     |          >>> print(dct.token2id)
     |          {'maso': 6, 'mele': 7, 'máma': 2, 'ema': 3, 'má': 4, 'pad': 0, 'space': 1}
     |  
     |  save_as_text(self, fname, sort_by_word=True)
     |      Save :class:`~gensim.corpora.dictionary.Dictionary` to a text file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to output file.
     |      sort_by_word : bool, optional
     |          Sort words in lexicographical order before writing them out?
     |      
     |      Notes
     |      -----
     |      Format::
     |      
     |          num_docs
     |          id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
     |          id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
     |          ....
     |          id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
     |      
     |      This text format is great for corpus inspection and debugging. As plaintext, it's also easily portable
     |      to other tools and frameworks. For better performance and to store the entire object state,
     |      including collected corpus statistics, use :meth:`~gensim.corpora.dictionary.Dictionary.save` and
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load` instead.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.load_from_text`
     |          Load :class:`~gensim.corpora.dictionary.Dictionary` from text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  from_corpus(corpus, id2word=None)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from an existing corpus.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, number)
     |          Corpus in BoW format.
     |      id2word : dict of (int, object)
     |          Mapping id -> word. If None, the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Notes
     |      -----
     |      This can be useful if you only have a term-document BOW matrix (represented by `corpus`), but not the original
     |      text corpus. This method will scan the term-document count matrix for all word ids that appear in it,
     |      then construct :class:`~gensim.corpora.dictionary.Dictionary` which maps each `word_id -> id2word[word_id]`.
     |      `id2word` is an optional dictionary that maps the `word_id` to a token.
     |      In case `id2word` isn't specified the mapping `id2word[word_id] = str(word_id)` will be used.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Inferred dictionary from corpus.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
     |          >>> dct = Dictionary.from_corpus(corpus)
     |          >>> len(dct)
     |          3
     |  
     |  from_documents(documents)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
     |      
     |      Equivalent to `Dictionary(documents=documents)`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Dictionary initialized from `documents`.
     |  
     |  load_from_text(fname)
     |      Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
     |      
     |      Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      Parameters
     |      ----------
     |      fname: str
     |          Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
     |          Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __abstractmethods__ = frozenset()
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed, all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevent memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from gensim.utils.SaveLoad:
     |  
     |  load(fname, mmap=None) from abc.ABCMeta
     |      Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains needed object.
     |      mmap : str, optional
     |          Memory-map option.  If the object was saved with large arrays stored separately, you can load these arrays
     |          via mmap (shared memory) using `mmap='r'.
     |          If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.save`
     |          Save object to file.
     |      
     |      Returns
     |      -------
     |      object
     |          Object loaded from `fname`.
     |      
     |      Raises
     |      ------
     |      AttributeError
     |          When called on an object instance instead of class (this is a class method).
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from collections.abc.Mapping:
     |  
     |  __contains__(self, key)
     |  
     |  __eq__(self, other)
     |      Return self==value.
     |  
     |  get(self, key, default=None)
     |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
     |  
     |  items(self)
     |      D.items() -> a set-like object providing a view on D's items
     |  
     |  values(self)
     |      D.values() -> an object providing a view on D's values
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from collections.abc.Mapping:
     |  
     |  __hash__ = None
     |  
     |  __reversed__ = None
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from collections.abc.Collection:
     |  
     |  __subclasshook__(C) from abc.ABCMeta
     |      Abstract classes can override this to customize issubclass().
     |      
     |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
     |      It should return True, False or NotImplemented.  If it returns
     |      NotImplemented, the normal algorithm is used.  Otherwise, it
     |      overrides the normal algorithm (and the outcome is cached).
    
    help(corpora.Dictionary.doc2bow)
    
    Help on function doc2bow in module gensim.corpora.dictionary:
    
    doc2bow(self, document, allow_update=False, return_missing=False)
        Convert `document` into the bag-of-words (BoW) format = list of `(token_id, token_count)` tuples.
        
        Parameters
        ----------
        document : list of str
            Input document.
        allow_update : bool, optional
            Update self, by adding new tokens from `document` and updating internal corpus statistics.
        return_missing : bool, optional
            Return missing tokens (tokens present in `document` but not in self) with frequencies?
        
        Return
        ------
        list of (int, int)
            BoW representation of `document`.
        list of (int, int), dict of (str, int)
            If `return_missing` is True, return BoW representation of `document` + dictionary with missing
            tokens and their frequencies.
        
        Examples
        --------
        .. sourcecode:: pycon
        
            >>> from gensim.corpora import Dictionary
            >>> dct = Dictionary(["máma mele maso".split(), "ema má máma".split()])
            >>> dct.doc2bow(["this", "is", "máma"])
            [(2, 1)]
            >>> dct.doc2bow(["this", "is", "máma"], return_missing=True)
            ([(2, 1)], {u'this': 1, u'is': 1})
    

    Topic vector transformations

    Mine the latent semantic structure hidden in the corpus -> text vectors

    TF-IDF model

    from gensim import models
    tfidf = models.TfidfModel(corpus)   # fit TF-IDF on the bag-of-words corpus built above
    doc_bow = [(0, 1), (1, 1), (2, 1)]  # a bag-of-words vector (same as corpus[0] above)
    print(tfidf[doc_bow])               # transform it into TF-IDF space
    
    [(2, 1.0)]
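
    Why does [(0, 1), (1, 1), (2, 1)] collapse to [(2, 1.0)]? Token ids 0 and 1 ('a' and 'b') occur in both training documents, so with the default global weight log2(D / df) their idf is log2(2/2) = 0 and they are dropped; only token 2 ('c') remains and is normalized to 1.0. As a quick check, a fitted TfidfModel keeps its document frequencies and idf weights in the dfs and idfs attributes (the values in the comments are what this toy corpus should yield):

    print(tfidf.dfs)   # document frequencies, expected: {0: 2, 1: 2, 2: 1, 3: 1}
    print(tfidf.idfs)  # idf weights, expected: {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0}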
    
    help(models.TfidfModel)
    
    Help on class TfidfModel in module gensim.models.tfidfmodel:
    
    class TfidfModel(gensim.interfaces.TransformationABC)
     |  TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
     |  
     |  Objects of this class realize the transformation between word-document co-occurrence matrix (int)
     |  into a locally/globally weighted TF-IDF matrix (positive floats).
     |  
     |  Examples
     |  --------
     |  .. sourcecode:: pycon
     |  
     |      >>> import gensim.downloader as api
     |      >>> from gensim.models import TfidfModel
     |      >>> from gensim.corpora import Dictionary
     |      >>>
     |      >>> dataset = api.load("text8")
     |      >>> dct = Dictionary(dataset)  # fit dictionary
     |      >>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
     |      >>>
     |      >>> model = TfidfModel(corpus)  # fit model
     |      >>> vector = model[corpus[0]]  # apply model to the first corpus document
     |  
     |  Method resolution order:
     |      TfidfModel
     |      gensim.interfaces.TransformationABC
     |      gensim.utils.SaveLoad
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, bow, eps=1e-12)
     |      Get the tf-idf representation of an input vector and/or corpus.
     |      
     |      bow : {list of (int, int), iterable of iterable of (int, int)}
     |          Input document in the `sparse Gensim bag-of-words format
     |          <https://radimrehurek.com/gensim/intro.html#core-concepts>`_,
     |          or a streamed corpus of such documents.
     |      eps : float
     |          Threshold value, will remove all position that have tfidf-value less than `eps`.
     |      
     |      Returns
     |      -------
     |      vector : list of (int, float)
     |          TfIdf vector, if `bow` is a single document
     |      :class:`~gensim.interfaces.TransformedCorpus`
     |          TfIdf corpus, if `bow` is a corpus.
     |  
     |  __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=<function identity at 0x7fffc30e60d0>, wglobal=<function df2idf at 0x7fffc0973a60>, normalize=True, smartirs=None, pivot=None, slope=0.25)
     |      Compute TF-IDF by multiplying a local component (term frequency) with a global component
     |      (inverse document frequency), and normalizing the resulting documents to unit length.
     |      Formula for non-normalized weight of term :math:`i` in document :math:`j` in a corpus of :math:`D` documents
     |      
     |      .. math:: weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}
     |      
     |      or, more generally
     |      
     |      .. math:: weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)
     |      
     |      so you can plug in your own custom :math:`wlocal` and :math:`wglobal` functions.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, int), optional
     |          Input corpus
     |      id2word : {dict, :class:`~gensim.corpora.Dictionary`}, optional
     |          Mapping token - id, that was used for converting input data to bag of words format.
     |      dictionary : :class:`~gensim.corpora.Dictionary`
     |          If `dictionary` is specified, it must be a `corpora.Dictionary` object and it will be used.
     |          to directly construct the inverse document frequency mapping (then `corpus`, if specified, is ignored).
     |      wlocals : callable, optional
     |          Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
     |          (other options: :func:`numpy.sqrt`, `lambda tf: 0.5 + (0.5 * tf / tf.max())`, etc.).
     |      wglobal : callable, optional
     |          Function for global weighting, default is :func:`~gensim.models.tfidfmodel.df2idf`.
     |      normalize : {bool, callable}, optional
     |          Normalize document vectors to unit euclidean length? You can also inject your own function into `normalize`.
     |      smartirs : str, optional
     |          SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System,
     |          a mnemonic scheme for denoting tf-idf weighting variants in the vector space model.
     |          The mnemonic for representing a combination of weights takes the form XYZ,
     |          for example 'ntc', 'bpn' and so on, where the letters represents the term weighting of the document vector.
     |      
     |          Term frequency weighing:
     |              * `b` - binary,
     |              * `t` or `n` - raw,
     |              * `a` - augmented,
     |              * `l` - logarithm,
     |              * `d` - double logarithm,
     |              * `L` - log average.
     |      
     |          Document frequency weighting:
     |              * `x` or `n` - none,
     |              * `f` - idf,
     |              * `t` - zero-corrected idf,
     |              * `p` - probabilistic idf.
     |      
     |          Document normalization:
     |              * `x` or `n` - none,
     |              * `c` - cosine,
     |              * `u` - pivoted unique,
     |              * `b` - pivoted character length.
     |      
     |          Default is 'nfc'.
     |          For more information visit `SMART Information Retrieval System
     |          <https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System>`_.
     |      pivot : float or None, optional
     |          In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
     |          normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
     |          slope) * pivot`.
     |      
     |          You can either set the `pivot` by hand, or you can let Gensim figure it out automatically with the following
     |          two steps:
     |      
     |              * Set either the `u` or `b` document normalization in the `smartirs` parameter.
     |              * Set either the `corpus` or `dictionary` parameter. The `pivot` will be automatically determined from
     |                the properties of the `corpus` or `dictionary`.
     |      
     |          If `pivot` is None and you don't follow steps 1 and 2, then pivoted document length normalization will be
     |          disabled. Default is None.
     |      
     |          See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
     |      slope : float, optional
     |          In information retrieval, TF-IDF is biased against long documents [1]_. Pivoted document length
     |          normalization solves this problem by changing the norm of a document to `slope * old_norm + (1.0 -
     |          slope) * pivot`.
     |      
     |          Setting the `slope` to 0.0 uses only the `pivot` as the norm, and setting the `slope` to 1.0 effectively
     |          disables pivoted document length normalization. Singhal [2]_ suggests setting the `slope` between 0.2 and
     |          0.3 for best results. Default is 0.25.
     |      
     |          See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
     |      
     |      See Also
     |      --------
     |      ~gensim.sklearn_api.tfidf.TfIdfTransformer : Class that also uses the SMART scheme.
     |      resolve_weights : Function that also uses the SMART scheme.
     |      
     |      References
     |      ----------
     |      .. [1] Singhal, A., Buckley, C., & Mitra, M. (1996). `Pivoted Document Length
     |         Normalization <http://singhal.info/pivoted-dln.pdf>`_. *SIGIR Forum*, 51, 176–184.
     |      .. [2] Singhal, A. (2001). `Modern information retrieval: A brief overview <http://singhal.info/ieee2001.pdf>`_.
     |         *IEEE Data Eng. Bull.*, 24(4), 35–43.
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  initialize(self, corpus)
     |      Compute inverse document weights, which will be used to modify term frequencies for documents.
     |      
     |      Parameters
     |      ----------
     |      corpus : iterable of iterable of (int, int)
     |          Input corpus.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods defined here:
     |  
     |  load(*args, **kwargs) from builtins.type
     |      Load a previously saved TfidfModel class. Handles backwards compatibility from
     |      older TfidfModel versions which did not use pivoted document normalization.
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed, all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevent memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    

    Modules API Reference

    Document similarity computation

    from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
    from gensim.models import LsiModel
    
    model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
    vector = model[common_corpus[4]]  # apply model to BoW document
    model.add_documents(common_corpus[4:])  # update model with new documents
    tmp_fname = get_tmpfile("lsi.model")
    model.save(tmp_fname)  # save model
    loaded_model = LsiModel.load(tmp_fname)  # load model
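
    The snippet above only trains, updates, and saves the LSI model; it does not yet compute any similarities. Below is a minimal sketch of the actual similarity step, using gensim's similarities.MatrixSimilarity over the LSI-projected corpus (the concrete scores depend on the trained model, so none are shown here):

    from gensim import similarities

    # Index all documents of the corpus in LSI space, then query with one document.
    index = similarities.MatrixSimilarity(model[common_corpus], num_features=model.num_topics)
    query_vec = model[common_corpus[0]]   # project a query document into LSI space
    sims = index[query_vec]               # cosine similarity against every indexed document
    print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar documents first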
    
    help(common_corpus)
    
    Help on list object:
    
    class list(object)
     |  list(iterable=(), /)
     |  
     |  Built-in mutable sequence.
     |  
     |  If no argument is given, the constructor creates a new empty list.
     |  The argument must be an iterable if specified.
     |  
     |  Methods defined here:
     |  
     |  __add__(self, value, /)
     |      Return self+value.
     |  
     |  __contains__(self, key, /)
     |      Return key in self.
     |  
     |  __delitem__(self, key, /)
     |      Delete self[key].
     |  
     |  __eq__(self, value, /)
     |      Return self==value.
     |  
     |  __ge__(self, value, /)
     |      Return self>=value.
     |  
     |  __getattribute__(self, name, /)
     |      Return getattr(self, name).
     |  
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |  
     |  __gt__(self, value, /)
     |      Return self>value.
     |  
     |  __iadd__(self, value, /)
     |      Implement self+=value.
     |  
     |  __imul__(self, value, /)
     |      Implement self*=value.
     |  
     |  __init__(self, /, *args, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __iter__(self, /)
     |      Implement iter(self).
     |  
     |  __le__(self, value, /)
     |      Return self<=value.
     |  
     |  __len__(self, /)
     |      Return len(self).
     |  
     |  __lt__(self, value, /)
     |      Return self<value.
     |  
     |  __mul__(self, value, /)
     |      Return self*value.
     |  
     |  __ne__(self, value, /)
     |      Return self!=value.
     |  
     |  __repr__(self, /)
     |      Return repr(self).
     |  
     |  __reversed__(self, /)
     |      Return a reverse iterator over the list.
     |  
     |  __rmul__(self, value, /)
     |      Return value*self.
     |  
     |  __setitem__(self, key, value, /)
     |      Set self[key] to value.
     |  
     |  __sizeof__(self, /)
     |      Return the size of the list in memory, in bytes.
     |  
     |  append(self, object, /)
     |      Append object to the end of the list.
     |  
     |  clear(self, /)
     |      Remove all items from list.
     |  
     |  copy(self, /)
     |      Return a shallow copy of the list.
     |  
     |  count(self, value, /)
     |      Return number of occurrences of value.
     |  
     |  extend(self, iterable, /)
     |      Extend list by appending elements from the iterable.
     |  
     |  index(self, value, start=0, stop=9223372036854775807, /)
     |      Return first index of value.
     |      
     |      Raises ValueError if the value is not present.
     |  
     |  insert(self, index, object, /)
     |      Insert object before index.
     |  
     |  pop(self, index=-1, /)
     |      Remove and return item at index (default last).
     |      
     |      Raises IndexError if list is empty or index is out of range.
     |  
     |  remove(self, value, /)
     |      Remove first occurrence of value.
     |      
     |      Raises ValueError if the value is not present.
     |  
     |  reverse(self, /)
     |      Reverse *IN PLACE*.
     |  
     |  sort(self, /, *, key=None, reverse=False)
     |      Stable sort *IN PLACE*.
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  __new__(*args, **kwargs) from builtins.type
     |      Create and return a new object.  See help(type) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __hash__ = None
    
    from pprint import pprint
    
    pprint(common_corpus)
    
    [[(0, 1), (1, 1), (2, 1)],
     [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
     [(2, 1), (5, 1), (7, 1), (8, 1)],
     [(1, 1), (5, 2), (8, 1)],
     [(3, 1), (6, 1), (7, 1)],
     [(9, 1)],
     [(9, 1), (10, 1)],
     [(9, 1), (10, 1), (11, 1)],
     [(4, 1), (10, 1), (11, 1)]]
    
    help(common_dictionary)
    
    Help on Dictionary in module gensim.corpora.dictionary object:
    
     (identical to the Dictionary class documentation shown earlier; the remainder of the output is omitted)
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Inferred dictionary from corpus.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>>
     |          >>> corpus = [[(1, 1.0)], [], [(0, 5.0), (2, 1.0)], []]
     |          >>> dct = Dictionary.from_corpus(corpus)
     |          >>> len(dct)
     |          3
     |  
     |  from_documents(documents)
     |      Create :class:`~gensim.corpora.dictionary.Dictionary` from `documents`.
     |      
     |      Equivalent to `Dictionary(documents=documents)`.
     |      
     |      Parameters
     |      ----------
     |      documents : iterable of iterable of str
     |          Input corpus.
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.corpora.dictionary.Dictionary`
     |          Dictionary initialized from `documents`.
     |  
     |  load_from_text(fname)
     |      Load a previously stored :class:`~gensim.corpora.dictionary.Dictionary` from a text file.
     |      
     |      Mirror function to :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      Parameters
     |      ----------
     |      fname: str
     |          Path to a file produced by :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.corpora.dictionary.Dictionary.save_as_text`
     |          Save :class:`~gensim.corpora.dictionary.Dictionary` to text file.
     |      
     |      Examples
     |      --------
     |      .. sourcecode:: pycon
     |      
     |          >>> from gensim.corpora import Dictionary
     |          >>> from gensim.test.utils import get_tmpfile
     |          >>>
     |          >>> tmp_fname = get_tmpfile("dictionary")
     |          >>> corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
     |          >>>
     |          >>> dct = Dictionary(corpus)
     |          >>> dct.save_as_text(tmp_fname)
     |          >>>
     |          >>> loaded_dct = Dictionary.load_from_text(tmp_fname)
     |          >>> assert dct.token2id == loaded_dct.token2id
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __abstractmethods__ = frozenset()
     |  
     |  __slotnames__ = []
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.utils.SaveLoad:
     |  
     |  save(self, fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)
     |      Save the object to a file.
     |      
     |      Parameters
     |      ----------
     |      fname_or_handle : str or file-like
     |          Path to output file or already opened file-like object. If the object is a file handle,
     |          no special array handling will be performed, all attributes will be saved to the same file.
     |      separately : list of str or None, optional
     |          If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store
     |          them into separate files. This prevent memory errors for large objects, and also allows
     |          `memory-mapping <https://en.wikipedia.org/wiki/Mmap>`_ the large arrays for efficient
     |          loading and sharing the large arrays in RAM between multiple processes.
     |      
     |          If list of str: store these attributes into separate files. The automated size check
     |          is not performed in this case.
     |      sep_limit : int, optional
     |          Don't store arrays smaller than this separately. In bytes.
     |      ignore : frozenset of str, optional
     |          Attributes that shouldn't be stored at all.
     |      pickle_protocol : int, optional
     |          Protocol number for pickle.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.load`
     |          Load object from file.
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from gensim.utils.SaveLoad:
     |  
     |  load(fname, mmap=None) from abc.ABCMeta
     |      Load an object previously saved using :meth:`~gensim.utils.SaveLoad.save` from a file.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains needed object.
     |      mmap : str, optional
     |          Memory-map option.  If the object was saved with large arrays stored separately, you can load these arrays
     |          via mmap (shared memory) using `mmap='r'.
     |          If the file being loaded is compressed (either '.gz' or '.bz2'), then `mmap=None` **must be** set.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.utils.SaveLoad.save`
     |          Save object to file.
     |      
     |      Returns
     |      -------
     |      object
     |          Object loaded from `fname`.
     |      
     |      Raises
     |      ------
     |      AttributeError
     |          When called on an object instance instead of class (this is a class method).
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from collections.abc.Mapping:
     |  
     |  __contains__(self, key)
     |  
     |  __eq__(self, other)
     |      Return self==value.
     |  
     |  get(self, key, default=None)
     |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
     |  
     |  items(self)
     |      D.items() -> a set-like object providing a view on D's items
     |  
     |  values(self)
     |      D.values() -> an object providing a view on D's values
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from collections.abc.Mapping:
     |  
     |  __hash__ = None
     |  
     |  __reversed__ = None
     |  
     |  ----------------------------------------------------------------------
     |  Class methods inherited from collections.abc.Collection:
     |  
     |  __subclasshook__(C) from abc.ABCMeta
     |      Abstract classes can override this to customize issubclass().
     |      
     |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
     |      It should return True, False or NotImplemented.  If it returns
     |      NotImplemented, the normal algorithm is used.  Otherwise, it
     |      overrides the normal algorithm (and the outcome is cached).
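
    A quick cell to exercise the Dictionary methods listed above (a sketch: it round-trips the toy dictionary through save_as_text / load_from_text using a temporary file):

    from gensim.corpora import Dictionary
    from gensim.test.utils import get_tmpfile
    
    dct = Dictionary([['a', 'b', 'c'], ['a', 'd', 'b']])
    tmp_fname = get_tmpfile("toy_dictionary")
    dct.save_as_text(tmp_fname)                    # plain text: num_docs, then id, word, document frequency
    loaded = Dictionary.load_from_text(tmp_fname)
    assert dct.token2id == loaded.token2id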
    
    help(get_tmpfile)
    
    Help on function get_tmpfile in module gensim.test.utils:
    
    get_tmpfile(suffix)
        Get full path to file `suffix` in temporary folder.
        This function doesn't creates file (only generate unique name).
        Also, it may return different paths in consecutive calling.
        
        Parameters
        ----------
        suffix : str
            Suffix of file.
        
        Returns
        -------
        str
            Path to `suffix` file in temporary folder.
        
        Examples
        --------
        Using this function we may get path to temporary file and use it, for example, to store temporary model.
        
        .. sourcecode:: pycon
        
            >>> from gensim.models import LsiModel
            >>> from gensim.test.utils import get_tmpfile, common_dictionary, common_corpus
            >>>
            >>> tmp_f = get_tmpfile("toy_lsi_model")
            >>>
            >>> model = LsiModel(common_corpus, id2word=common_dictionary)
            >>> model.save(tmp_f)
            >>>
            >>> loaded_model = LsiModel.load(tmp_f)
    
    help(models.LsiModel)
    
    Help on class LsiModel in module gensim.models.lsimodel:
    
    class LsiModel(gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel)
     |  LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
     |  
     |  Model for `Latent Semantic Indexing
     |  <https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing>`_.
     |  
     |  The decomposition algorithm is described in `"Fast and Faster: A Comparison of Two Streamed
     |  Matrix Decomposition Algorithms" <https://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.
     |  
     |  Notes
     |  -----
     |  * :attr:`gensim.models.lsimodel.LsiModel.projection.u` - left singular vectors,
     |  * :attr:`gensim.models.lsimodel.LsiModel.projection.s` - singular values,
     |  * ``model[training_corpus]`` - right singular vectors (can be reconstructed if needed).
     |  
     |  See Also
     |  --------
     |  `FAQ about LSI matrices
     |  <https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi>`_.
     |  
     |  Examples
     |  --------
     |  .. sourcecode:: pycon
     |  
     |      >>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
     |      >>> from gensim.models import LsiModel
     |      >>>
     |      >>> model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
     |      >>> vector = model[common_corpus[4]]  # apply model to BoW document
     |      >>> model.add_documents(common_corpus[4:])  # update model with new documents
     |      >>> tmp_fname = get_tmpfile("lsi.model")
     |      >>> model.save(tmp_fname)  # save model
     |      >>> loaded_model = LsiModel.load(tmp_fname)  # load model
     |  
     |  Method resolution order:
     |      LsiModel
     |      gensim.interfaces.TransformationABC
     |      gensim.utils.SaveLoad
     |      gensim.models.basemodel.BaseTopicModel
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, bow, scaled=False, chunksize=512)
     |      Get the latent representation for `bow`.
     |      
     |      Parameters
     |      ----------
     |      bow : {list of (int, int), iterable of list of (int, int)}
     |          Document or corpus in BoW representation.
     |      scaled : bool, optional
     |          If True - topics will be scaled by the inverse of singular values.
     |      chunksize :  int, optional
     |          Number of documents to be used in each applying chunk.
     |      
     |      Returns
     |      -------
     |      list of (int, float)
     |          Latent representation of topics in BoW format for document **OR**
     |      :class:`gensim.matutils.Dense2Corpus`
     |          Latent representation of corpus in BoW format if `bow` is corpus.
     |  
     |  __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
     |      Construct an `LsiModel` object.
     |      
     |      Either `corpus` or `id2word` must be supplied in order to train the model.
     |      
     |      Parameters
     |      ----------
     |      corpus : {iterable of list of (int, float), scipy.sparse.csc}, optional
     |          Stream of document vectors or sparse matrix of shape (`num_documents`, `num_terms`).
     |      num_topics : int, optional
     |          Number of requested factors (latent dimensions)
     |      id2word : dict of {int: str}, optional
     |          ID to word mapping, optional.
     |      chunksize :  int, optional
     |          Number of documents to be used in each training chunk.
     |      decay : float, optional
     |          Weight of existing observations relatively to new ones.
     |      distributed : bool, optional
     |          If True - distributed mode (parallel execution on several machines) will be used.
     |      onepass : bool, optional
     |          Whether the one-pass algorithm should be used for training.
     |          Pass `False` to force a multi-pass stochastic algorithm.
     |      power_iters: int, optional
     |          Number of power iteration steps to be used.
     |          Increasing the number of power iterations improves accuracy, but lowers performance
     |      extra_samples : int, optional
     |          Extra samples to be used besides the rank `k`. Can improve accuracy.
     |      dtype : type, optional
     |          Enforces a type for elements of the decomposed matrix.
     |  
     |  __str__(self)
     |      Get a human readable representation of model.
     |      
     |      Returns
     |      -------
     |      str
     |          A human readable string of the current objects parameters.
     |  
     |  add_documents(self, corpus, chunksize=None, decay=None)
     |      Update model with new `corpus`.
     |      
     |      Parameters
     |      ----------
     |      corpus : {iterable of list of (int, float), scipy.sparse.csc}
     |          Stream of document vectors or sparse matrix of shape (`num_terms`, num_documents).
     |      chunksize : int, optional
     |          Number of documents to be used in each training chunk, will use `self.chunksize` if not specified.
     |      decay : float, optional
     |          Weight of existing observations relatively to new ones,  will use `self.decay` if not specified.
     |      
     |      Notes
     |      -----
     |      Training proceeds in chunks of `chunksize` documents at a time. The size of `chunksize` is a tradeoff
     |      between increased speed (bigger `chunksize`) vs. lower memory footprint (smaller `chunksize`).
     |      If the distributed mode is on, each chunk is sent to a different worker/computer.
     |  
     |  get_topics(self)
     |      Get the topic vectors.
     |      
     |      Notes
     |      -----
     |      The number of topics can actually be smaller than `self.num_topics`, if there were not enough factors
     |      in the matrix (real rank of input matrix smaller than `self.num_topics`).
     |      
     |      Returns
     |      -------
     |      np.ndarray
     |          The term topic matrix with shape (`num_topics`, `vocabulary_size`)
     |  
     |  print_debug(self, num_topics=5, num_words=10)
     |      Print (to log) the most salient words of the first `num_topics` topics.
     |      
     |      Unlike :meth:`~gensim.models.lsimodel.LsiModel.print_topics`, this looks for words that are significant for
     |      a particular topic *and* not for others. This *should* result in a
     |      more human-interpretable description of topics.
     |      
     |      Alias for :func:`~gensim.models.lsimodel.print_debug`.
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |  
     |  save(self, fname, *args, **kwargs)
     |      Save the model to a file.
     |      
     |      Notes
     |      -----
     |      Large internal arrays may be stored into separate files, with `fname` as prefix.
     |      
     |      Warnings
     |      --------
     |      Do not save as a compressed file if you intend to load the file back with `mmap`.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to output file.
     |      *args
     |          Variable length argument list, see :meth:`gensim.utils.SaveLoad.save`.
     |      **kwargs
     |          Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.save`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.models.lsimodel.LsiModel.load`
     |  
     |  show_topic(self, topicno, topn=10)
     |      Get the words that define a topic along with their contribution.
     |      
     |      This is actually the left singular vector of the specified topic.
     |      
     |      The most important words in defining the topic (greatest absolute value) are included
     |      in the output, along with their contribution to the topic.
     |      
     |      Parameters
     |      ----------
     |      topicno : int
     |          The topics id number.
     |      topn : int
     |          Number of words to be included to the result.
     |      
     |      Returns
     |      -------
     |      list of (str, float)
     |          Topic representation in BoW format.
     |  
     |  show_topics(self, num_topics=-1, num_words=10, log=False, formatted=True)
     |      Get the most significant topics.
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |      log : bool, optional
     |          If True - log topics with logger.
     |      formatted : bool, optional
     |          If True - each topic represented as string, otherwise - in BoW format.
     |      
     |      Returns
     |      -------
     |      list of (int, str)
     |          If `formatted=True`, return sequence with (topic_id, string representation of topics) **OR**
     |      list of (int, list of (str, float))
     |          Otherwise, return sequence with (topic_id, [(word, value), ... ]).
     |  
     |  ----------------------------------------------------------------------
     |  Class methods defined here:
     |  
     |  load(fname, *args, **kwargs) from builtins.type
     |      Load a previously saved object using :meth:`~gensim.models.lsimodel.LsiModel.save` from file.
     |      
     |      Notes
     |      -----
     |      Large arrays can be memmap'ed back as read-only (shared memory) by setting the `mmap='r'` parameter.
     |      
     |      Parameters
     |      ----------
     |      fname : str
     |          Path to file that contains LsiModel.
     |      *args
     |          Variable length argument list, see :meth:`gensim.utils.SaveLoad.load`.
     |      **kwargs
     |          Arbitrary keyword arguments, see :meth:`gensim.utils.SaveLoad.load`.
     |      
     |      See Also
     |      --------
     |      :meth:`~gensim.models.lsimodel.LsiModel.save`
     |      
     |      Returns
     |      -------
     |      :class:`~gensim.models.lsimodel.LsiModel`
     |          Loaded instance.
     |      
     |      Raises
     |      ------
     |      IOError
     |          When methods are called on instance (should be called from class).
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  __slotnames__ = []
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from gensim.utils.SaveLoad:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from gensim.models.basemodel.BaseTopicModel:
     |  
     |  print_topic(self, topicno, topn=10)
     |      Get a single topic as a formatted string.
     |      
     |      Parameters
     |      ----------
     |      topicno : int
     |          Topic id.
     |      topn : int
     |          Number of words from topic that will be used.
     |      
     |      Returns
     |      -------
     |      str
     |          String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + ... '.
     |  
     |  print_topics(self, num_topics=20, num_words=10)
     |      Get the most significant topics (alias for `show_topics()` method).
     |      
     |      Parameters
     |      ----------
     |      num_topics : int, optional
     |          The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
     |      num_words : int, optional
     |          The number of words to be included per topics (ordered by significance).
     |      
     |      Returns
     |      -------
     |      list of (int, list of (str, float))
     |          Sequence with (topic_id, [(word, value), ... ]).
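
    Following the docstring above, a quick cell trying LsiModel on gensim's bundled test corpus (a sketch using gensim.test.utils; num_topics is chosen arbitrarily):

    from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
    from gensim.models import LsiModel
    
    model = LsiModel(common_corpus, id2word=common_dictionary, num_topics=2)
    vector = model[common_corpus[0]]   # project one BoW document into the 2-D LSI space
    print(vector)
    
    tmp_fname = get_tmpfile("lsi.model")
    model.save(tmp_fname)              # save / load round-trip
    loaded_model = LsiModel.load(tmp_fname)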
    
  • gensim

    2020-09-08 22:30:13

    Gensim Chinese documentation: https://gensim.apachecn.org/#/blog/Introduction/README

    Core concepts

    1. Document: some text.

    2. Corpus: a collection of documents.

    3. Vector: a mathematically convenient representation of a document.   original vector representation

    4. Model: an algorithm for transforming vectors from one representation to another.  eg. tf-idf

    The difference between a Document and a Vector is that the former is text, while the latter is a mathematically convenient representation of that text. People sometimes use the terms interchangeably: for a given document D, instead of saying "the vector that corresponds to document D" they say "the vector D" or "document D". This buys brevity at the cost of ambiguity, which is acceptable as long as you remember that Documents live in document space and Vectors live in vector space.

    1. Document

    In Gensim, a Document is an object of the text-sequence type (usually called str in Python 3). A Document can be anything from a short 140-character tweet to a single paragraph (e.g. a journal-article abstract), a news article, or a book. For example:

    document = "Human machine interface for lab abc computer applications"

    2. Corpus 

    A Corpus is a collection of Document objects. A corpus plays two roles in Gensim:

    (1) Input for training a model. During training, the model uses this training corpus to find common themes and topics, initialising its internal model parameters. Gensim focuses on unsupervised models, so no human intervention such as costly annotation or manual labelling of documents is needed.

    (2) Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such a corpus can then be indexed for similarity queries, semantic similarity, clustering, and so on.

    Here is an example corpus. It consists of 9 documents, each of which is a string consisting of a single sentence.

    text_corpus = [
        "Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey",
    ]

    Note: the example above loads the whole corpus into memory. In practice, corpora may be very large and impossible to load into memory. Gensim handles such corpora intelligently by streaming one document at a time; see Corpus Streaming – One Document at a Time for details.

    2.1 Preprocessing

    After collecting the corpus, a number of preprocessing steps are usually needed. We will keep it simple and only remove some common English words (such as "the") and words that occur only once in the corpus. Along the way we tokenize the data: tokenization breaks the documents up into words (here, using whitespace as the delimiter).

    import pprint
    
    # Create a set of frequent words
    stoplist = set('for a of the and to in'.split(' '))
    # Lowercase each document, split it by white space and filter out stopwords
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in text_corpus]
    
    # Count word frequencies
    from collections import defaultdict
    frequency = defaultdict(int)     # every key's value starts at 0, i.e. frequency[word] = 0
    for text in texts:
        for token in text:
            frequency[token] += 1
    
    # Only keep words that appear more than once
    processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
    pprint.pprint(processed_corpus)

    Output:

    [['human', 'interface', 'computer'], 
     ['survey', 'user', 'computer', 'system', 'response', 'time'], 
     ['eps', 'user', 'interface', 'system'], 
     ['system', 'human', 'system', 'eps'], 
     ['user', 'response', 'time'], ['trees'], 
     ['graph', 'trees'], 
     ['graph', 'minors', 'trees'], 
     ['graph', 'minors', 'survey']
    ]

    There is a better way to do this preprocessing: the gensim.utils.simple_preprocess() function.
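
    For instance (a small sketch; simple_preprocess lowercases and tokenizes, but does not remove stopwords):

    from gensim.utils import simple_preprocess
    
    print(simple_preprocess("Human machine interface for lab abc computer applications"))
    # roughly: ['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']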

    2.2 Building the vocabulary: gensim.corpora.Dictionary

    Before going on, we want to associate each word in the corpus with a unique integer id. The mapping between words and ids is called a dictionary, and the gensim.corpora.Dictionary class does exactly this. The dictionary defines the vocabulary of all words our processing knows about.

    from gensim import corpora
    
    dictionary = corpora.Dictionary(processed_corpus)
    dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
    print(dictionary)

    Here we used the gensim.corpora.dictionary.Dictionary class to assign a unique integer id to every word that occurs in the corpus. This sweeps over all the texts, collecting word counts and related statistics. In the end we see there are 12 distinct words in the processed corpus, which means each document will be represented by 12 numbers (i.e. by a 12-dimensional vector).

    Output:

    Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

    Because our corpus is small, there are only 12 distinct tokens in this gensim.corpora.Dictionary. For larger corpora, dictionaries containing hundreds of thousands of tokens are quite common.

    To see the mapping between words and their ids:

    2.3 dictionary.token2id: the token -> id mapping

    toid = dictionary.token2id
    print(toid)

    Output: "token": id key-value pairs

    {'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

    3. Vector

    To infer the latent structure in our corpus, we need a way to represent documents that we can manipulate mathematically.

    One approach is to represent each document as a feature vector:

    • dense vector: (1, 0.0), (2, 2.0), (3, 5.0)
    • sparse vector or bag-of-words vector: (2, 2.0), (3, 5.0)      the gensim way

    The other approach is the bag-of-words model:

    Each document is represented by a vector containing the frequency count of every word in the dictionary.

    dictionary:['coffee', 'milk', 'sugar', 'spoon'].

    A document: "coffee milk coffee" 

    Represented by the vector: [2, 1, 0, 0]
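
    A minimal sketch (plain Python counting, not a gensim API) of how that dense count vector comes about:

    from collections import Counter
    
    vocabulary = ['coffee', 'milk', 'sugar', 'spoon']
    counts = Counter("coffee milk coffee".split())
    print([counts[word] for word in vocabulary])   # [2, 1, 0, 0]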

    One of the main properties of the bag-of-words model is that it completely ignores the order of the tokens in the encoded document, which is where the name "bag of words" comes from.

    Our processed corpus has 12 unique words, so under the bag-of-words model each document is represented by a 12-dimensional vector. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors.

    3.1 dictionary.doc2bow: converting a tokenized document into a vector

    Creating the bag-of-words representation of a document: the function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.

    For example, suppose we want to vectorize the phrase "Human computer interaction" (note that this phrase is not in our original corpus). We can create a bag-of-words representation for it with the dictionary's doc2bow method, which returns a sparse representation of the word counts:

    new_doc = "Human computer minors interaction minors"
    new_vec = dictionary.doc2bow(new_doc.lower().split())
    print(new_vec)

    Resulting vectorization, sorted by the id values in dictionary.token2id:

    [(0, 1), (1, 1), (11, 2)]      The first value in each tuple is the id of a token in the dictionary; the second is the count of that token.

    Note that "interaction" did not occur in the original corpus, so it is not included in the vectorization. Also note that this vector only contains entries for words that actually appear in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear in the vectorization are implicitly represented as zero, to save space.
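
    If you ever need the dense form explicitly, gensim.matutils.sparse2full can expand the implicit zeros (a sketch, reusing new_vec and dictionary from above):

    from gensim import matutils
    
    dense = matutils.sparse2full(new_vec, len(dictionary))   # 12-dimensional numpy array
    print(dense)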

    4. Model

    Now that we have vectorized our corpus, we can start transforming it with models. We use "model" as an abstract term referring to a transformation from one document representation to another. In gensim documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus during training.

    4.1 The tf-idf model

    A simple example of a model is tf-idf, which transforms vectors from the bag-of-words representation into a vector space of tf-idf weights:

    Here is a simple example: we initialize the tf-idf model, train it on the corpus, and transform the string "system minors".

    from gensim import models
    
    # Training: the input is the bag-of-words representation of the training set
    bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]   # bag-of-words vectors
    pprint.pprint(bow_corpus)
    corpora.MmCorpus.serialize('/tmp/deerwester.mm', bow_corpus)  # store to disk, for later use
    
    tfidf = models.TfidfModel(bow_corpus)    # bag-of-words vectors -> tf-idf vectors
    
    # Testing
    words = "system minors".lower().split()   # ['system', 'minors']
    bow_test = dictionary.doc2bow(words)      # [(5, 1), (11, 1)]
    result = tfidf[bow_test]                  # [(5, 0.5898341626740045), (11, 0.8075244024440723)]

    The tfidf model again returns a list of tuples, where the first element is the token id and the second element is the tf-idf weight.

    The bag-of-words representation of the training set:

    [[(0, 1), (1, 1), (2, 1)],
     [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
     [(2, 1), (5, 1), (7, 1), (8, 1)],
     [(1, 1), (5, 2), (8, 1)],
     [(3, 1), (6, 1), (7, 1)],
     [(9, 1)],
     [(9, 1), (10, 1)],
     [(9, 1), (10, 1), (11, 1)],
     [(4, 1), (10, 1), (11, 1)]] 

    Note that the id corresponding to "system" (which occurred 4 times in the original corpus) has been weighted lower than the id corresponding to "minors" (which occurred only twice).

    You can save a trained model to disk and load it back later, either to continue training it on new documents or to transform new documents. gensim offers many different models/transformations; see Topics and Transformations for more information.

    Calling model[corpus] only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the moment of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence. If you will be iterating over the transformed corpus_transformed multiple times and the transformation is costly, serialize the resulting corpus to disk first and keep using that.
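
    For example (a sketch; the output path is arbitrary):

    # Serialize the transformed corpus once, then stream it from disk on later passes
    corpora.MmCorpus.serialize('/tmp/corpus_tfidf.mm', tfidf[bow_corpus])
    corpus_tfidf = corpora.MmCorpus('/tmp/corpus_tfidf.mm')   # lazily streams documents from disk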

    4.2 Corpus Streaming

    Note that the corpus above resides fully in memory as a plain Python list. In this simple example that hardly matters, but to make things clear, assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:

     

    This is where Gensim's flexibility shows: a corpus does not have to be a list, a NumPy array, a Pandas dataframe, or anything of the sort. Gensim accepts any object that, when iterated over, successively yields documents.

    This flexibility lets you create your own corpus classes that stream documents directly from disk, the network, a database, and so on. Models in Gensim do not require all vectors to be held in RAM at once; you can even generate documents on the fly!

    The assumption that each document occupies one line in a single file is not important; you can shape the __iter__ function to fit whatever your input format is. Walking directories, parsing XML, crawling the network... just parse your input into a list of tokens per document, convert the tokens to ids via the dictionary, and yield the sparse vector inside __iter__, as in the sketch below.
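
    A minimal sketch of such a streaming corpus (assuming a hypothetical file mycorpus.txt with one whitespace-tokenized document per line, and the dictionary built above):

    class MyCorpus:
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary
    
        def __iter__(self):
            with open(self.path, encoding='utf-8') as f:
                for line in f:
                    # one document per line, tokens separated by whitespace
                    yield self.dictionary.doc2bow(line.lower().split())
    
    corpus_memory_friendly = MyCorpus('mycorpus.txt', dictionary)   # never loads the whole corpus into RAM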

     

    2. After the model is created — gensim.similarities

    Once a model has been created, you can do all sorts of interesting things with it. For example, to transform the whole corpus via TfIdf and index it, in preparation for similarity queries:

    # similarity queries
    from gensim import similarities
    
    index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
    
    # query the similarity of our document query_document against every document in the corpus:
    query_document = 'system engineering'.split()
    query_bow = dictionary.doc2bow(query_document)
    sims = index[tfidf[query_bow]]  # ndarray
    print(list(enumerate(sims)))
    # [(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
    # Document 3 has a similarity score of 0.718=72%, document 2 has a similarity score of 42% etc.
    
    for doc, score in sorted(enumerate(sims), key=lambda x:x[1], reverse=True):
        print(doc, ' ', score)

    Output:

    3   0.7184812
    2   0.41707572
    1   0.32448703
    0   0.0
    4   0.0
    5   0.0
    6   0.0
    7   0.0
    8   0.0

     

  • GENSIM

    2016-07-17 09:57:07

    Trying out gensim

    gensim: http://radimrehurek.com/gensim/index.html

    Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

    Installing gensim

    sudo apt-get install python-numpy python-scipy

    pip install gensim

    Computing document similarity with LSI

    First, prepare the data. I crawled roughly 20,000 Douban diary posts as the data for this experiment; the data and code can be found at https://github.com/largetalk/yaseg.

    The main code is as follows:

    import jieba
    from gensim import corpora, models, similarities
    import os
    import random
    from pprint import pprint
    import re
    
    RESULT_DIR = 'douban_result'
    # keep only CJK characters, latin letters and digits
    regex = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]")
    
    class DoubanDoc(object):
        def __init__(self, root_dir='douban'):
            self.root_dir = root_dir
    
        def __iter__(self):
            for name in os.listdir(self.root_dir):
                if os.path.isfile(os.path.join(self.root_dir, name)):
                    data = open(os.path.join(self.root_dir, name),
                                encoding='utf-8', errors='ignore', newline='').read()
                    title = data[:data.find('\r\n')]
                    yield (name, title, data)
    
    
    class DoubanCorpus(object):
        def __init__(self, root_dir, dictionary):
            self.root_dir = root_dir
            self.dictionary = dictionary
    
        def __iter__(self):
            for name, title, data in DoubanDoc(self.root_dir):
                yield self.dictionary.doc2bow(list(jieba.cut(data, cut_all=False)))
    
    def random_doc():
        name = random.choice(os.listdir('douban'))
        data = open('douban/%s' % name, encoding='utf-8', errors='ignore').read()
        print('random choice', name)
        return name, data
    
    texts = []
    for name, title, data in DoubanDoc():
        def etl(s):  # remove punctuation and special characters
            return regex.sub('', s)
    
        seg = [w for w in map(etl, jieba.cut(data, cut_all=False)) if len(w) > 0]
        texts.append(seg)
    
    # remove words that appear only once
    all_tokens = sum(texts, [])
    token_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in token_once] for text in texts]
    dictionary = corpora.Dictionary(texts)
    
    corpus = list(DoubanCorpus('douban', dictionary))
    
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=30)
    
    for i, t in lsi.show_topics(num_topics=30):
        print('[topic #%s]: ' % i, t)
    
    index = similarities.MatrixSimilarity(lsi[corpus])
    
    _, doc = random_doc()
    vec_bow = dictionary.doc2bow(list(jieba.cut(doc, cut_all=False)))
    vec_lsi = lsi[vec_bow]
    print('topic probability:')
    pprint(vec_lsi)
    sims = sorted(enumerate(index[vec_lsi]), key=lambda item: -item[1])
    print('top 10 similar notes:')
    pprint(sims[:10])
    

    The overall steps are:

    • compute the bag of words, i.e. the dictionary here
    • compute the corpus
    • train the TF-IDF model
    • compute the tf-idf vectors
    • train the LSI model
    • project the documents with the LSI model and build the index
    • query

    Output

    [topic #0]:  0.277*"我" + 0.268*"你" + 0.196*"的" + 0.165*"他" + 0.146*"了" + 0.138*"她" + 0.124*"是" + 0.116*"自己" + 0.111*"在" + 0.107*"人"
    [topic #1]:  0.504*"the" + 0.303*"to" + 0.268*"and" + 0.265*"of" + 0.235*"I" + 0.235*"a" + 0.219*"you" + 0.178*"in" + 0.175*"is" + 0.139*"that"
    [topic #2]:  -0.732*"你" + -0.172*"我" + -0.119*"爱" + 0.107*"的" + 0.088*"中国" + 0.076*"和" + 0.075*"年" + 0.065*"与" + 0.061*"他们" + 0.061*"中"
    [topic #3]:  -0.620*"她" + -0.288*"他" + 0.281*"你" + -0.160*"我" + -0.099*"说" + -0.098*"了" + -0.092*"啊" + 0.089*"与" + 0.080*"的" + 0.067*"中国"
    [topic #4]:  0.524*"她" + -0.264*"我" + 0.246*"他" + 0.186*"你" + -0.160*"啊" + -0.138*"了" + 0.110*"女人" + 0.097*"爱" + 0.095*"男人" + 0.093*"与"
    [topic #5]:  -0.741*"他" + 0.459*"她" + 0.155*"你" + 0.097*"月" + 0.076*"日" + 0.072*"啊" + 0.068*"1" + 0.067*"2" + 0.062*"年" + -0.062*"我"
    [topic #6]:  -0.367*"他" + -0.331*"你" + 0.188*"自己" + 0.140*"她" + 0.130*"生活" + -0.128*"啊" + -0.128*"月" + -0.119*"日" + -0.117*"1" + 0.116*"我"
    [topic #7]:  0.162*"自己" + -0.153*"着" + -0.138*"在" + 0.120*"做" + -0.116*"它" + 0.113*"别人" + -0.112*"我们" + -0.112*"里" + 0.109*"工作" + 0.104*"啊"
    [topic #8]:  0.521*"I" + 0.445*"you" + -0.386*"the" + -0.253*"of" + 0.190*"me" + 0.160*"my" + 0.144*"t" + 0.128*"love" + -0.113*"and" + 0.092*"your"
    [topic #9]:  0.302*"說" + 0.198*"我們" + 0.193*"對" + 0.187*"來" + 0.181*"一個" + 0.166*"會" + 0.164*"於" + 0.156*"後" + 0.145*"沒" + 0.136*"為"
    [topic #10]:  -0.300*"月" + -0.287*"日" + -0.215*"年" + -0.176*"爱" + -0.141*"2012" + 0.140*"啊" + -0.132*"2011" + -0.129*"他" + 0.124*"你" + -0.119*"11"
    [topic #11]:  -0.547*"我" + 0.202*"爱情" + 0.189*"男人" + 0.186*"女人" + 0.174*"吃" + -0.141*"中国" + 0.125*"爱" + 0.123*"啊" + -0.107*"企业" + 0.092*"不要"
    [topic #12]:  -0.376*"爱" + -0.290*"啊" + -0.240*"爱情" + 0.194*"孩子" + 0.183*"妈妈" + -0.153*"或者" + -0.140*"我" + 0.131*"你" + -0.127*"女人" + -0.124*"男人"
    [topic #13]:  0.264*"啊" + -0.245*"爱" + -0.231*"或者" + -0.188*"妈妈" + -0.178*"吃" + -0.177*"那里" + -0.176*"孩子" + -0.167*"我" + -0.119*"不念" + -0.118*"不增"
    [topic #14]:  -0.349*"孩子" + -0.300*"妈妈" + -0.244*"我们" + -0.220*"啊" + -0.206*"你们" + 0.204*"喜欢" + -0.179*"他们" + -0.131*"父母" + -0.130*"爸爸" + 0.119*"他"
    [topic #15]:  0.322*"我们" + -0.210*"孩子" + 0.161*"爱情" + -0.152*"日" + 0.148*"企业" + -0.145*"月" + 0.138*"客户" + 0.133*"元" + 0.126*"产品" + -0.123*"或者"
    [topic #16]:  0.347*"我" + -0.249*"我们" + -0.212*"或者" + 0.188*"女人" + 0.165*"男人" + -0.165*"那里" + -0.116*"工作" + -0.111*"不见" + -0.110*"不念" + -0.109*"不增"
    [topic #17]:  0.281*"妈妈" + -0.257*"女人" + -0.251*"男人" + 0.239*"豆瓣" + 0.239*"爱" + 0.231*"孩子" + 0.212*"喜欢" + 0.130*"啊" + 0.128*"电影" + -0.125*"月"
    [topic #18]:  0.404*"啊" + -0.325*"男人" + -0.324*"女人" + -0.202*"喜欢" + -0.165*"豆瓣" + -0.136*"电影" + 0.116*"她" + -0.109*"孩子" + -0.104*"妈妈" + 0.100*"他"
    [topic #19]:  -0.357*"我们" + 0.254*"啊" + -0.192*"你们" + 0.163*"女人" + 0.152*"企业" + 0.146*"男人" + -0.139*"喜欢" + -0.136*"吃" + 0.120*"自己" + -0.113*"他们"
    [topic #20]:  -0.312*"豆瓣" + 0.259*"爱情" + 0.219*"妈妈" + -0.218*"你们" + 0.179*"中国" + -0.169*"男人" + -0.168*"女人" + 0.160*"爱" + -0.153*"您" + -0.138*"我们"
    [topic #21]:  0.395*"爱情" + -0.341*"喜欢" + 0.231*"豆瓣" + -0.171*"啊" + -0.143*"中国" + -0.143*"元" + -0.135*"人" + -0.112*"你们" + 0.110*"阅读" + 0.106*"了"
    [topic #22]:  -0.304*"你们" + 0.296*"爱情" + 0.288*"孩子" + -0.240*"吃" + -0.220*"2012" + -0.167*"爱" + -0.158*"豆瓣" + -0.135*"一年" + 0.113*"他们" + 0.092*"元"
    [topic #23]:  0.305*"我们" + 0.261*"妈妈" + -0.237*"爱" + -0.189*"爱情" + 0.188*"女人" + -0.160*"他们" + -0.159*"工作" + 0.140*"男人" + -0.126*"孩子" + -0.123*"我"
    [topic #24]:  0.275*"爱" + -0.269*"啊" + 0.240*"豆瓣" + 0.231*"中国" + -0.213*"爱情" + -0.182*"工作" + -0.159*"喜欢" + -0.155*"我" + -0.123*"生活" + -0.109*"2012"
    [topic #25]:  0.355*"你们" + -0.210*"我们" + 0.205*"孩子" + -0.166*"妈妈" + 0.142*"2012" + -0.139*"我" + -0.134*"啊" + -0.128*"爱" + -0.110*"电影" + -0.109*"人生"
    [topic #26]:  -0.304*"豆瓣" + 0.277*"孩子" + -0.270*"妈妈" + -0.168*"日" + -0.166*"他们" + 0.150*"2012" + -0.132*"您" + -0.130*"月" + -0.126*"元" + -0.113*"生活"
    [topic #27]:  -0.361*"元" + -0.214*"您" + 0.188*"豆瓣" + 0.172*"啊" + 0.167*"喜欢" + 0.141*"他们" + 0.117*"月" + 0.115*"日" + -0.114*"原价" + 0.112*"你们"
    [topic #28]:  -0.340*"2012" + 0.321*"你们" + -0.315*"您" + -0.226*"爱" + 0.195*"爱情" + -0.168*"我们" + 0.163*"中国" + 0.151*"妈妈" + -0.133*"孩子" + -0.115*"它"
    [topic #29]:  0.276*"你们" + 0.245*"妈妈" + 0.219*"2012" + -0.186*"孩子" + -0.162*"豆瓣" + -0.156*"吃" + -0.154*"中国" + -0.143*"生活" + 0.131*"电影" + -0.113*"啊"

    These are the 30 topics produced. The separation between them does not look great, which is probably related to the nature of Douban content itself.

  • python | Training word2vec with gensim, and understanding the related functions

    2017-04-09 11:23:56


    I. Introduction to gensim

    gensim is a powerful natural-language-processing toolkit that ships with many common models:

    • basic corpus-processing tools
    • LSI
    • LDA
    • HDP
    • DTM
    • DIM
    • TF-IDF
    • word2vec、paragraph2vec
      .

    II. Training a model

    1. Training

    The simplest way to train:

    # the simplest start
    import gensim
    sentences = [['first', 'sentence'], ['second', 'sentence','is']]
    
    # train the model
    model = gensim.models.Word2Vec(sentences, min_count=1)
        # min_count: frequency threshold; words with a count >= 1 are kept
        # size: number of units in the NN layer, which also controls the degrees of freedom of the training algorithm
        # workers=4; default = 1 worker = no parallelization. Only takes effect if Cython is installed; without Cython it runs on a single core.
    
    

    The second way to train:

    # the second way to train
    new_model = gensim.models.Word2Vec(min_count=1)  # start with an empty model
    new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator     
    new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.iter)                       
    # can be a non-repeatable, 1-pass generator
    

    Example:

    #encoding=utf-8
    from gensim.models import word2vec
    sentences=word2vec.Text8Corpus(u'分词后的爽肤水评论.txt')
    model=word2vec.Word2Vec(sentences, size=50)
    
    y2=model.similarity(u"好", u"还行")
    print(y2)
    
    for i in model.most_similar(u"滋润"):
        print(i[0], i[1])
    

    The txt file contains 50,000 reviews that have already been word-segmented; training the model takes a single line:

    model=word2vec.Word2Vec(sentences,min_count=5,size=50)
    

    The first argument is the training corpus; the second (min_count) drops words occurring fewer times than it, with a default of 5;
    the third (size) is the number of hidden-layer units, with a default of 100.

    2. Using the model

    # similarity from word vectors
    model.similarity('first','is')    # similarity between two words
    model.most_similar(positive=['first', 'second'], negative=['sentence'], topn=1)  # analogy query
    model.doesnt_match("input is lunch he sentence cat".split())                   # find the word that doesn't belong
    

    To inspect the word vectors stored inside the model:

    # look up a word vector
    model['first'] 
    

    .
    3. Saving and loading models

    The simplest save and load:

    # saving and loading a model
    model.save('/tmp/mymodel')
    new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
    model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # load a .txt file
    # using gzipped/bz2 input works too, no need to unzip:
    model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file
    
    
    word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
    word2vec.save('word2vec_wx')
    

    word2vec.save exports the model file; here it was not exported as .bin.
    .

    model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
    pd.Series(model.most_similar(u'微信',topn = 360000))
    

    Load it back with gensim.models.Word2Vec.load.

    The underlying NumPy arrays can be loaded with numpy.load:

    import numpy
    word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')
    

    There are other ways to load:

    import gensim
    word_vectors = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
    word_vectors = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format
    

    This loads the txt and bin formats.

    Other ways to save:

    from gensim.models import KeyedVectors
    # save
    model.save(fname) # only a model saved this way can be trained further!
    model.wv.save_word2vec_format(outfile + '.model.bin', binary=True)  # C binary format, about half the disk space of the method above
    model.wv.save_word2vec_format(outfile + '.model.txt', binary=False) # C text format, larger on disk, same vectors as above
    
    # load
    model = gensim.models.Word2Vec.load(fname)  
    word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
    word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)
    
    # the most memory-efficient way to load
    model = gensim.models.Word2Vec.load('model path')
    word_vectors = model.wv
    del model
    word_vectors.init_sims(replace=True)
    

    Source: Jianshu. Note: if you do not plan to train the model any further, calling init_sims will make the model's storage more efficient.

    .

    4. Incremental training

    model = gensim.models.Word2Vec.load('/tmp/mymodel')
    model.train(more_sentences)
    

    Models produced by the original C tool cannot be trained further.
    .

    # incremental training
    model = gensim.models.Word2Vec.load(temp_path)
    more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
    model.build_vocab(more_sentences, update=True)
    model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)
    

    .

    5. bow2vec + the TFIDF model

    5.1 Bow2vec

    The main steps are:
    split the sentences into word tokens (tokenization);
    build the dictionary;
    build the sparse document matrix.

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",              
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]
    

    The resulting token lists:

    [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
     ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
     ['eps', 'user', 'interface', 'management', 'system'],
     ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
     ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
     ['generation', 'random', 'binary', 'unordered', 'trees'],
     ['intersection', 'graph', 'paths', 'trees'],
     ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
     ['graph', 'minors', 'survey']]
    

    Building the dictionary:

    # build the dictionary
    import os, tempfile
    TEMP_FOLDER = tempfile.gettempdir()   # assumed here; the original snippet relies on a TEMP_FOLDER defined elsewhere
    dictionary = corpora.Dictionary(texts)
    dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  # store the dictionary, for future reference
    print(dictionary)
    print(dictionary.token2id)  # show all words in the dictionary
    

    Building the sparse document matrix:

    # bag-of-words for a single sentence
    new_doc = "Human computer interaction Human"
    new_vec = dictionary.doc2bow(new_doc.lower().split())
    print(new_vec)
        # the word "interaction" does not appear in the dictionary and is ignored
        # the result pairs each dictionary word id with its count in this sentence:
        # here "human" appears twice and "computer" once
        
    # bag-of-words for multiple sentences
    [dictionary.doc2bow(text) for text in texts]  # word ids + word counts for each sentence
    

    5.2 tfidf

    from gensim import corpora, models, similarities
    corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
              [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
              [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
              [(0, 1.0), (4, 2.0), (7, 1.0)],
              [(3, 1.0), (5, 1.0), (6, 1.0)],
              [(9, 1.0)],
              [(9, 1.0), (10, 1.0)],
              [(9, 1.0), (10, 1.0), (11, 1.0)],
              [(8, 1.0), (10, 1.0), (11, 1.0)]]
    tfidf = models.TfidfModel(corpus)
    
    # bag-of-words vectors in practice
    vec = [(0, 1), (4, 1),(9, 1)]
    print(tfidf[vec])
    >>>  [(0, 0.695546419520037), (4, 0.5080429008916749), (9, 0.5080429008916749)]
    

    This looks up the TF-IDF values of the three words 0, 4 and 9 in vec. At the same time it converts the representation: the term frequencies in the document matrix above become TF-IDF values.

    Using tf-idf for similarity queries:

    # similarity query
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
    vec = [(0, 1), (4, 1),(9, 1)]
    sims = index[tfidf[vec]]
    print(list(enumerate(sims)))
    >>>[(0, 0.40157393), (1, 0.16485332), (2, 0.21189235), (3, 0.70710677), (4, 0.0), (5, 0.5080429), (6, 0.35924056), (7, 0.25810757), (8, 0.0)]
    

    This builds a document-level index over the 9 documents of corpus; vec is the bag-of-words content of a new document, and sims is the similarity of that vec vector to each of the nine documents in corpus.

    Saving and loading the index:

    index.save('/tmp/deerwester.index')
    index = similarities.SparseMatrixSimilarity.load('/tmp/deerwester.index')
    

    5.3 Further transformations

    Latent Semantic Indexing (LSI) maps the Tf-Idf corpus into a latent 2-D space:

    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2) # initialize an LSI transformation
    corpus_lsi = lsi[tfidf[corpus]] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
    

    We set num_topics=2.
    Use models.LsiModel.print_topics() to check what this process actually produced:

    lsi.print_topics(2)
    

    According to LSI, "tree", "graph" and "minors" are all related words (and contribute the most in the direction of the first topic), while the second topic is effectively concerned with all the words. As expected, the first five documents are more strongly associated with the second topic, and the remaining four documents with the first topic:

    >>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    ...     print(doc)
    [(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
    [(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
    [(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
    [(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
    [(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
    [(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
    [(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
    [(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
    [(0, -0.617), (1, 0.054)] # "Graph minors A survey"
    

    Related transformations

    Reference: "Gensim official tutorial translation (3) — Topics and Transformations"

    Term Frequency * Inverse Document Frequency (Tf-Idf)

    It is initialized with a training corpus in bag-of-words (integer-valued) form.

    model = tfidfmodel.TfidfModel(bow_corpus, normalize=True)
    

    Latent Semantic Indexing (LSI, or sometimes LSA)

    It transforms documents from the bag-of-words or (preferably) TfIdf-weighted space into a lower-dimensional latent space. For real corpora, a target dimensionality of 200-500 is recommended as the "golden standard".

    >>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
    

    A unique feature of LSI training is that we can continue "training" at any point, simply by providing more training documents. This is done through incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream can even be infinite: just keep feeding new documents to LSI as they arrive, while using the computed model as read-only at the same time!

    >>> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
    >>> lsi_vec = model[tfidf_vec] # converting a new document into the LSI space does not affect the model
    >>> ...
    >>> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
    >>> lsi_vec = model[tfidf_vec]
    

    Random Projections (RP)

    RP aims to reduce the dimensionality of the vector space. It is a very efficient (both CPU- and memory-friendly) approach that approximates the TfIdf distances between documents by throwing in a little randomness. The recommended target dimensionality is again in the hundreds or thousands, depending on the size of your dataset.

    
    >>> model = rpmodel.RpModel(tfidf_corpus, num_topics=500)
    

    Latent Dirichlet Allocation (LDA)

    LDA is yet another transformation from bag-of-words counts into a lower-dimensional topic space. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA's topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from the training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

    >>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, num_topics=100)
    

    gensim uses a fast implementation of online LDA parameter estimation based on [2], modified to also run in distributed mode on a cluster of computers.

    Hierarchical Dirichlet Process (HDP)
    HDP is a non-parametric Bayesian method (note the missing num_topics parameter):

    >>> model = hdpmodel.HdpModel(bow_corpus, id2word=dictionary)
    

    gensim uses a fast, online implementation based on [3]. The algorithm is new in gensim and still rough around the academic edges — use with care.
    Adding new VSM transformations (such as different weighting schemes) is rather trivial; see the API reference or go directly to the source code for more info and help.


    III. Using a trained word2vec model

    1. Similarity

    Several word-similarity tasks are supported:
    similar words with similarity scores (model.most_similar), model.doesnt_match, and model.similarity (pairwise similarity).

    model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    [('queen', 0.50882536)]
    
    model.most_similar(positive='woman', topn=topn, restrict_vocab=restrict_vocab)  # pass in a word directly
    model.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)  # pass in a vector directly
    
    model.doesnt_match("breakfast cereal dinner lunch".split())
    'cereal'
    
    model.similarity('woman', 'man')
    0.73723527
    
    
    model.n_similarity(['word1','word2'],['word3'])
    0.999
    

    Here n_similarity computes the similarity between two lists of words.

    .
    2. Word vectors

    Get the vector of a word as follows:

    model['computer']  # raw NumPy vector of a word
    array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
    

    3. The vocabulary table

    model.wv.vocab.keys()
    
    model.wv.index2word
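
    For example, you can dump the whole vocabulary into a NumPy matrix (a sketch against the gensim 3.x attributes used in this post):

    import numpy as np
    
    vocab = model.wv.index2word                        # words ordered by their index
    embedding_matrix = np.stack([model.wv[w] for w in vocab])
    print(embedding_matrix.shape)                      # (vocabulary size, vector size)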
    

    .


    Case study 1: training on an 8-million-article WeChat corpus

    Source: "The incredible Word2Vec, part 2: pre-trained models"

    Training process:

    import gensim, logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    import pymongo
    import hashlib
    
    db = pymongo.MongoClient('172.16.0.101').weixin.text_articles_words
    md5 = lambda s: hashlib.md5(s).hexdigest()
    class sentences:
        def __iter__(self):
            texts_set = set()
            for a in db.find(no_cursor_timeout=True):
                if md5(a['text'].encode('utf-8')) in texts_set:
                    continue
                else:
                    texts_set.add(md5(a['text'].encode('utf-8')))
                    yield a['words']
            print('Processed %s articles in total' % len(texts_set))
    
    word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
    word2vec.save('word2vec_wx')
    

    hashlib.md5 is introduced here to deduplicate the articles (originally 10 million articles, 8 million after deduplication); this step is not strictly necessary.
    .


    Case study 2: training character vectors and word vectors

    Source (GitHub): https://github.com/nlpjoe/daguan-classify-2018/blob/master/src/preprocess/EDA.ipynb

    # train word vectors
    def train_w2v_model(type='article', min_freq=5, size=100):
        sentences = []
    
        if type == 'char':
            corpus = pd.concat((train_df['article'], test_df['article']))
        elif type == 'word':
            corpus = pd.concat((train_df['word_seg'], test_df['word_seg']))
        for e in tqdm(corpus):
            sentences.append([i for i in e.strip().split() if i])
        print('training corpus size:', len(corpus))
        print('total sentences:', len(sentences))
        model = Word2Vec(sentences, size=size, window=5, min_count=min_freq)
        model.itos = {}
        model.stoi = {}
        model.embedding = {}
        
        print('saving the model...')
        for k in tqdm(model.wv.vocab.keys()):
            model.itos[model.wv.vocab[k].index] = k
            model.stoi[k] = model.wv.vocab[k].index
            model.embedding[model.wv.vocab[k].index] = model.wv[k]
    
        model.save('../../data/word2vec-models/word2vec.{}.{}d.mfreq{}.model'.format(type, size, min_freq))
        return model
    model = train_w2v_model(type='char', size=100)
    model = train_w2v_model(type='word', size=100)
    # model.wv.save_word2vec_format('../../data/laozhu-word-300d', binary=False)
    # train_df[:3]
    print('OK')
    

    References:

    Training word vectors with gensim word2vec in Python
    Gensim Word2vec tutorial
    Official tutorial: http://radimrehurek.com/gensim/models/word2vec.html


    Error:

    ImportError: DLL load failed: The specified module could not be found.
    

    Problem description:

    The C extension could not be loaded.

    gensim relies on gcc/g++ to build its C extension for speed.

    Most of the earlier answers suggest

    conda install libpython mingw

    but this complained about missing channels; after finding and adding a suitable mirror channel it worked.

    Solution:
    https://blog.csdn.net/weixin_39056447/article/details/100770804

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
     
    # add the mirror channel
     
    conda install mingw libpython
     
    # libpython and mingw have an install order, but conda installs them in the correct dependency order automatically
    

  • python gensim

    2018-04-17 00:57:03
    Simply put the gensim files from the unpacked archive into the lib directory under python27.