
Elasticsearch 7 Analyzers (Built-in and Custom)

    analysis

Overview

    "settings":{
        "analysis": { # 自定义分词
          "filter": {
          	"自定义过滤器": {
                "type": "edge_ngram",  # 过滤器类型
                "min_gram": "1",  # 最小边界 
                "max_gram": "6"  # 最大边界
            }
          },  # 过滤器
          "char_filter": {},  # 字符过滤器
          "tokenizer": {},   # 分词
          "analyzer": {
          	"自定义分词器名称": {
              "type": "custom",
              "tokenizer": "上述自定义分词名称或自带分词",
              "filter": [
                "上述自定义过滤器名称或自带过滤器"
              ],
              "char_filter": [
              	"上述自定义字符过滤器名称或自带字符过滤器"
              ]
            }
          }  # 分词器
        }
    }
    

Testing analyzer output:

1. Test an analyzer in the context of a specific index
POST /discovery-user/_analyze
{
  "analyzer": "analyzer_ngram", 
  "text":"i like cats"
}
2. Test a built-in analyzer (not tied to any index)
POST _analyze
{
  "analyzer": "standard",  # or english, ik_max_word, ik_smart
      "text":"i like cats"
    }
    

    char_filter

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
For example, it can strip HTML elements or map the digits 0123 to 零一二三.

An analyzer may have zero or more character filters, which are applied in order.

Character filters built into ES 7:

• HTML Strip Character Filter: html_strip
The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.

• Mapping Character Filter: mapping
The mapping character filter replaces any occurrences of the specified strings with the specified replacements.

• Pattern Replace Character Filter: pattern_replace
The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.
    

    html_strip

html_strip accepts an escaped_tags parameter:

    "char_filter": {
            "my_char_filter": {
              "type": "html_strip",
              "escaped_tags": ["b"]
            }
    }
escaped_tags: an array of HTML tags which should not be stripped from the original text.
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
I'm so <b>happy</b>!  # the <b> tag is preserved
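For completeness, a minimal index definition that wires this char_filter into the my_analyzer used above (the keyword tokenizer is just an illustrative choice that keeps the whole text as a single token):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}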
    

    mapping

    The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces them with the value associated with that key.
Replacements are allowed to be the empty string.

The mapping character filter accepts the following parameters (one of the two is required):
mappings

An array of mappings, with each element having the form key => value.


mappings_path

A path, either absolute or relative to the config directory, to a UTF-8 encoded text mappings file containing a key => value mapping per line.
    
    "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "一 => 0",
                "二 => 1",
                "# => ",  # 映射值可以为空
                "一二三 => 老虎"  # 映射可以多个字符
              ]
            }
    }
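A quick way to try this mapping without creating an index is an inline char_filter in _analyze (the keyword tokenizer keeps the text as one token so only the character replacements are visible):

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["一 => 0", "二 => 1", "# => "]
    }
  ],
  "text": "一二#"
}
Expected single token: 01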
    

    pattern_replace

    The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.

    Beware of Pathological Regular Expressions
An inefficient pattern can trigger a StackOverflowError. ES 7 regular expressions follow Java's Pattern syntax.

The pattern_replace character filter has the following parameters:
pattern (required)

A Java regular expression. Required.


replacement

The replacement string, which can reference capture groups using the $1..$9 syntax.
    

    flags:

    Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".
    
    123-456-789 → 123_456_789:
    "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
    }
    

Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting.

    filter

    A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like the from the token stream, and a synonym token filter introduces synonyms into the token stream.

    Token filters are not allowed to change the position or character offsets of each token.

    An analyzer may have zero or more token filters, which are applied in order.

    asciifolding

A token filter of type asciifolding converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

Accepts a preserve_original setting which defaults to false but, if true, keeps the original token as well as emitting the folded token.

    length

Filters out tokens that are shorter or longer than the configured bounds.

min    The minimum token length. Defaults to 0.

max    The maximum token length. Defaults to Integer.MAX_VALUE, which is 2^31-1 or 2147483647.
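A quick way to try it is _analyze with an inline filter definition (the bounds below are illustrative):

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "length", "min": 3, "max": 5 }
  ],
  "text": "a ab abc abcd abcdef"
}
Expected tokens: [ abc, abcd ]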
    

    lowercase

Normalizes token text to lowercase.
A language parameter selects a language-specific lowercaser other than the default (English).

uppercase

Normalizes token text to uppercase.
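For example, applied on top of the standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "The QUICK Brown"
}
Expected tokens: [ the, quick, brown ]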

    ngram

The ngram filter splits each token into n-grams; for example, it can provide substring matching for English terms inside a Chinese analyzer.

min_gram    Defaults to 1.

max_gram    Defaults to 2.

index.max_ngram_diff
The index level setting index.max_ngram_diff controls the maximum allowed difference between max_gram and min_gram.
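A minimal illustration with an inline ngram filter (1 and 2 are also the defaults):

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "abc"
}
Expected tokens: [ a, ab, b, bc, c ]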
    

    edge_ngram

Edge n-gram filtering: 123 yields 1, 12, 123, but never 2 or 23.

min_gram    Defaults to 1.

max_gram    Defaults to 2.

side        Deprecated. Either front or back. Defaults to front.
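The 123 example above, with max_gram raised to 3 so the full token is also emitted:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
  ],
  "text": "123"
}
Expected tokens: [ 1, 12, 123 ]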
    

    decimal_digit

The decimal_digit filter converts Unicode decimal digits into their ASCII equivalents 0-9.
For example, the Arabic-Indic digit ٢ (\u0662) becomes 2.
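For instance:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["decimal_digit"],
  "text": "٢"
}
Expected token: 2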

    tokenizer

    A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

    The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.

    An analyzer must have exactly one tokenizer.

Testing a tokenizer:

POST _analyze
{
  "tokenizer": "tokenizer name",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    

    Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words.

    Standard tokenizer

Configuration parameters:
max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
For example, with a length of 3, abcd becomes abc, d.
    
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "standard",
              "max_token_length": 5
            }
          }
        }
      }
    }
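Assuming the settings above were applied to an index called my_index, the effect of max_token_length: 5 can be checked like this (the expected output follows the official example for this configuration):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Expected tokens: [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]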
    

    Partial Word Tokenizers

These tokenizers break up text or words into small fragments.

    NGram Tokenizer

    The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
Example output:

    POST _analyze
    {
      "tokenizer": "ngram",
      "text": "Quick Fox"
    }
    The above sentence would produce the following terms:
    
    [ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
    

With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2.

    Configuration
    min_gram

Minimum length of characters in a gram. Defaults to 1.
    

    max_gram

Maximum length of characters in a gram. Defaults to 2.
    

    token_chars

Character classes that should be included in a token. Elasticsearch will split on characters that don't belong to the classes specified. Defaults to [] (keep all characters).
    
    Character classes may be any of the following:
    letter —  for example a, b, ï or 京
    digit —  for example 3 or 7
    whitespace —  for example " " or "\n"
    punctuation — for example ! or "
    symbol —  for example $ or √
    
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 3,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        }
      }
    }
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "2 Quick Foxes."
    }
Grams are built only from letter and digit characters; whitespace and punctuation act as split points, and the standalone 2 produces no gram because it is shorter than min_gram:
    [ Qui, uic, ick, Fox, oxe, xes ]
    
    Edge NGram Tokenizer

    The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Edge n-grams are anchored to the start of each word: abc can yield a, ab, abc, but never bc.
Its parameters are the same as those of the ngram tokenizer.
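With the default settings (minimum length 1, maximum length 2):

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
Expected tokens: [ Q, Qu ]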

    Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text.

    analyzer

Built-in analyzers:

    • Standard Analyzer:standard
      The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
    • Simple Analyzer:simple
      The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
    • Whitespace Analyzer:whitespace
      The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
    • Stop Analyzer:stop
      The stop analyzer is like the simple analyzer, but also supports removal of stop words.
    • Keyword Analyzer:keyword
      The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
    • Pattern Analyzer:pattern
      The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.
• Language Analyzers: english, etc.
      Elasticsearch provides many language-specific analyzers like english or french.
    • Fingerprint Analyzer:fingerprint
      The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

Custom analyzer:
    If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.

    The built-in analyzers can be used directly without any configuration.
Some of them, however, support configuration options to alter their behaviour.

Example: the standard analyzer with a stopwords parameter:
    "analysis": {
          "analyzer": {
            "自定义分词器名称": { 
              "type":      "standard",
              "stopwords": "_english_"  # 支持英语停用词 即分词忽略the a等
            }
          }
    }
    

    standard /Standard Tokenizer;Lower Case Token Filter,Stop Token Filter

The standard analyzer is the default analyzer which is used if none is specified.
Example output:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following terms:
    
    [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
    

standard parameters:
max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
For example, with a value of 5, jumped becomes jumpe, d.
    

    stopwords

A pre-defined stop words list like _english_ or an array of stop words such as ["a", "the"]. Defaults to _none_.


stopwords_path

The path to a file containing stop words.
    
    "analyzer": {
            "my_english_analyzer": {
              "type": "standard",
              "max_token_length": 5,  # token最长为5
              "stopwords": "_english_"  # 忽略英语停用词
            }
    }
    

Definition
The standard analyzer consists of:

• Tokenizer
  • Standard Tokenizer
• Token Filters
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)

If you need to customize the standard analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters.
In other words, rebuild it with type set to custom and add the filters you need:

     "analysis": {
          "analyzer": {
            "rebuilt_standard": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase"       
              ]
            }
          }
     }
A rebuilt standard analyzer cannot take the max_token_length or stopwords parameters directly; add the corresponding token filters (Lower Case Token Filter, Stop Token Filter) yourself.
    

    simple /Lower Case Tokenizer

    The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
Example output:

    POST _analyze
    {
      "analyzer": "simple",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following terms:
    
    [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
    

The simple analyzer is not configurable.

    definition:

    • Tokenizer
      • Lower Case Tokenizer

    whitespace /Whitespace Tokenizer

    The whitespace analyzer breaks text into terms whenever it encounters a whitespace character
Example output:

    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following terms:
    
    [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
    

It has no configuration options.
    Definition

    • Tokenizer
      • Whitespace Tokenizer

    stop /Lower Case Tokenizer;Stop Token Filter

    The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the english stop words.
Example output:

    POST _analyze
    {
      "analyzer": "stop",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following terms:
    
    [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
    

Optional parameters:
    stopwords

    A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
    

    stopwords_path

    The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
    
    "analyzer": {
            "my_stop_analyzer": {
              "type": "stop",
              "stopwords": ["the", "over"]
            }
     }
    

    definition:

    • Tokenizer
      • Lower Case Tokenizer
    • Token filters
      • Stop Token Filter

Rebuilding the stop analyzer as a custom analyzer:

    "settings": {
        "analysis": {
          "filter": {
            "english_stop": {
              "type":       "stop",
              "stopwords":  "_english_" 
            }
          },
          "analyzer": {
            "rebuilt_stop": {
              "tokenizer": "lowercase",
              "filter": [
                "english_stop"          
              ]
            }
          }
        }
      }
    

    keyword /Keyword Tokenizer

    The keyword analyzer is a “noop” analyzer which returns the entire input string as a single token
Example output:

    POST _analyze
    {
      "analyzer": "keyword",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following single term:
    
    [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
    

The keyword analyzer has no configuration options.

    definition

    • Tokenizer
      • Keyword Tokenizer

    pattern /Pattern Tokenizer;Lower Case Token Filter,Stop Token Filter

    The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators not the tokens themselves. The regular expression defaults to \W+ (or all non-word characters).
By default it splits on all non-word characters (\W+).
Example output:

    POST _analyze
    {
      "analyzer": "pattern",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    The above sentence would produce the following terms:
    
    [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
    

Configuration (optional parameters):
pattern

A Java regular expression, defaults to \W+.
    

    flags

    Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".
    

    lowercase

Should terms be lowercased or not. Defaults to true.
    

    stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
    

    stopwords_path

    The path to a file containing stop words.
    
    "analyzer": {
            "my_email_analyzer": {
              "type":      "pattern",
              "pattern":   "\\W|_", 
              "lowercase": true
            }
     }
    

Example: CamelCase tokenization

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "camel": {
              "type": "pattern",
              "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
            }
          }
        }
      }
    }
    
    GET my_index/_analyze
    {
      "analyzer": "camel",
      "text": "MooseX::FTPClass2_beta"
    }
    
    [ moose, x, ftp, class, 2, beta ]
    

The same regular expression written out in free-spacing form:

      ([^\p{L}\d]+)                 # swallow non letters and numbers,
    | (?<=\D)(?=\d)                 # or non-number followed by number,
    | (?<=\d)(?=\D)                 # or number followed by non-number,
    | (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
      (?=\p{Lu})                    #   followed by upper case,
    | (?<=\p{Lu})                   # or upper case
      (?=\p{Lu}                     #   followed by upper case
        [\p{L}&&[^\p{Lu}]]          #   then lower case
      )
    

    definition

    • Tokenizer
      • Pattern Tokenizer
    • Token Filters
      • Lower Case Token Filter
      • Stop Token Filter (disabled by default)

Rebuilding the pattern analyzer as a custom analyzer:

    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "split_on_non_word": {
              "type":       "pattern",
              "pattern":    "\\W+" 
            }
          },
          "analyzer": {
            "rebuilt_pattern": {
              "tokenizer": "split_on_non_word",
              "filter": [
                "lowercase"       
              ]
            }
          }
        }
      }
    }
    

    Language Analyzers

Analyzers for specific languages.

Optional configuration parameters:
stopwords: the stop word list
    stem_exclusion:The stem_exclusion parameter allows you to specify an array of lowercase words that should not be stemmed. Internally, this functionality is implemented by adding the keyword_marker token filter with the keywords set to the value of the stem_exclusion parameter
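A sketch of configuring one of them with both options (index and analyzer names here are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": "_english_",
          "stem_exclusion": ["example"]
        }
      }
    }
  }
}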

    english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:

    PUT /english_example
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_stop": {
              "type":       "stop",
              "stopwords":  "_english_" 
            },
            "english_keywords": {
              "type":       "keyword_marker",
              "keywords":   ["example"] 
            },
            "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
            },
            "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
            }
          },
          "analyzer": {
            "rebuilt_english": {
              "tokenizer":  "standard",
              "filter": [
                "english_possessive_stemmer",
                "lowercase",
                "english_stop",
                "english_keywords",
                "english_stemmer"
              ]
            }
          }
        }
      }
    }
    

    fingerprint /Standard Tokenizer;Lower Case Token Filter,ASCII Folding Token Filter,Stop Token Filter,Fingerprint Token Filter

    Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
Example output:

    POST _analyze
    {
      "analyzer": "fingerprint",
      "text": "Yes yes, Gödel said this sentence is consistent and."
    }
    The above sentence would produce the following single term:
    
    [ and consistent godel is said sentence this yes ]
    

    configuration
    separator

The character to use to concatenate the terms. Defaults to a space.
    

    max_output_size

The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.
    

    stopwords
    stopwords_path

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_fingerprint_analyzer": {
              "type": "fingerprint",
              "stopwords": "_english_",
              "max_output_size": 222,
              "separator": ","
            }
          }
        }
      }
    }
    

    Definition

    • Tokenizer
      • Standard Tokenizer
    • Token Filters (in order)
      • Lower Case Token Filter
      • ASCII Folding Token Filter
      • Stop Token Filter (disabled by default)
      • Fingerprint Token Filter

custom analyzer

    When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

    • zero or more character filters
    • a tokenizer
    • zero or more token filters.

In short, if the built-in analyzers do not meet your needs, combine character filters, a tokenizer, and token filters into your own analyzer.

Configuration:
tokenizer (required)

    	A built-in or customised tokenizer. (Required)
    

    char_filter

    An optional array of built-in or customised character filters.
    

    filter

    An optional array of built-in or customised token filters.
    

    position_increment_gap

    When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100
    
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type":      "custom",  # 自定义的analyzer其type固定custom
              "tokenizer": "standard",
              "char_filter": [
                "html_strip"
              ],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    }
    

Note: test environment: CentOS Linux release 7.6.1810 (Core)
JDK: 1.8
Elasticsearch: 6.8.2, single node

ES installation: https://blog.csdn.net/qq_33743572/article/details/108175092
Creating an index in ES: https://blog.csdn.net/qq_33743572/article/details/108231558

Mind map (for summary and review)

Note: GET _analyze runs an analyzer and shows its output, for example:

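(A sketch of the request in the original screenshot; the exact text may differ.)

GET _analyze
{
  "analyzer": "english",
  "text": "My name is Lilei, I like eating apples and running"
}
The english analyzer stems the terms, e.g. apples becomes appl, eating becomes eat, running becomes run.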

The example above uses analyzer to pick the english analyzer explicitly. If field names a field of an index instead, the analyzer configured for that field is used.

Now on to the tests.

The default analyzer

If no analyzer is specified, the default standard analyzer is used.

When creating the index, use analyzer to set a field's analyzer.

Test data:

#1. Delete the existing test data under /test
DELETE /test/

#2. Define the field types; analyzer sets each field's analyzer (standard by default, or e.g. english for the English analyzer)
    PUT /test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      },
      "mappings": {
        "_doc":{
          "properties":{
            "name":{"type":"text"},
            "age":{"type":"integer"},
            "introduce":{"type":"text","analyzer":"english"}
          }
        }
      }
    }
    
#3. Add a test document: 李雷
    PUT /test/_doc/1
    {
      "name":"李雷",
      "age":12,
      "engname":"Lilei",
      "introduce":"My name is Lilei, I like eating apples and running"
    }
    
#4. Search with "李小雷"
    GET /test/_search
    {
      "query": {
        "match": {
          "name": "李小雷"
        }
      }
    }


Run the scripts above in order: the final query for "李小雷" still returns the document whose name is "李雷". That is because name uses the default standard analyzer. You can inspect the analyzer output like this:

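(A sketch of the request shown in the omitted screenshot:)

GET /test/_analyze
{
  "field": "name",
  "text": "李雷"
}
Expected tokens: [ 李, 雷 ]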

With "field": "name", the analyzer of the name field is used, i.e. the default standard analyzer.

As you can see, the standard analyzer splits Chinese text into individual characters, so a query only needs to match a single character to hit the document; that is why "李小雷" finds "李雷". This analyzer is generally not recommended for Chinese, otherwise search quality suffers; Chinese analyzers are covered below.

The english analyzer

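(A sketch of the request shown in the omitted screenshot:)

GET /test/_analyze
{
  "field": "introduce",
  "text": "I like eating apples"
}
Expected tokens (stemmed): [ i, like, eat, appl ]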

With "field": "introduce", the analyzer of the introduce field is used, i.e. english.

The english analyzer reduces each word to its stem. At search time the query terms are stemmed the same way and matched against the indexed stems, so a query matches whenever its stems match the indexed ones.

Searching with "appl" or "apples" finds the document, because both are analyzed to "appl". Searching with "app" finds nothing, because "app" does not match the indexed stem "appl".

Chinese analyzers

Chinese analysis requires the analysis-ik plugin:

    网址:https://github.com/medcl/elasticsearch-analysis-ik

     

Installation:

1. Go to the ES home directory: cd /usr/lib/elasticsearch/elasticsearch-0/

2. Run the install command: ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.2/elasticsearch-analysis-ik-6.8.2.zip

Make sure the plugin version matches your ES version exactly.

     

After installation:

cd plugins

Seeing the analysis-ik directory there means the plugin is installed.

You can then inspect the bundled dictionary:

cd /usr/lib/elasticsearch/elasticsearch-0/config/analysis-ik/

vi main.dic

The dictionary is too large to show here. On to the tests.

     

Test script (delete the old data and recreate the index):

#1. Delete the existing test data under /test
DELETE /test/

#2. Define the field types; analyzer sets each field's analyzer (standard by default, english, or the IK analyzers below)
    PUT /test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      },
      "mappings": {
        "_doc":{
          "properties":{
            "name":{"type":"text"},
            "age":{"type":"integer"},
            "introduce":{"type":"text","analyzer":"english"},
            "address":{"type":"text","analyzer":"ik_max_word"},
            "address2":{"type":"text","analyzer":"ik_smart"}
          }
        }
      }
    }
    
#3. Add a test document: 李雷
    PUT /test/_doc/1
    {
      "name":"李雷",
      "age":12,
      "engname":"Lilei",
      "introduce":"My name is Lilei, I like eating apples and running",
      "address":"我家住在南京市长江大桥",
      "address2":"我家住在南京市长江大桥"
    }
    
#4. Test Chinese search
    GET /test/_search
    {
      "query": {
        "match": {
          "address": "南京市"
        }
      }
    }
    
    
#Inspect the ik_max_word output
GET /test/_analyze
{
  "field": "address",
  "text" : "我家住在南京市长江大桥"
}
#Inspect the ik_smart output
    GET /test/_analyze
    {
      "field": "address2",
      "text" : "我家住在南京市长江大桥"
    }

The analysis-ik plugin provides two analyzers:

The first is the ik_max_word analyzer.

It produces nearly every possible word segmentation. For example:

Note: use GET _analyze to check the result; with "field": "address", the analyzer configured for address, ik_max_word, is used.

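(A sketch of the request in the omitted screenshot; the exact token list depends on the IK dictionary version:)

GET /test/_analyze
{
  "field": "address",
  "text": "我家住在南京市长江大桥"
}
Typical tokens: [ 我家, 我, 家住, 住在, 南京市, 南京, 市长, 长江大桥, 长江, 大桥, ... ]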

A search only needs to match one of these tokens to return the document, for example:

The keyword "南京" is in the token list, so it finds the document; so do tokens such as "我", "我家" and "住在". "住" alone finds nothing, because "住" is not in the token list, while "我家住" works because it contains "我"/"我家".

The second is the ik_smart analyzer.

Note: with "field": "address2", the analyzer configured for address2, ik_smart, is used.

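(Again as a sketch of the omitted screenshot:)

GET /test/_analyze
{
  "field": "address2",
  "text": "我家住在南京市长江大桥"
}
Typical tokens: [ 我家, 住在, 南京市, 长江大桥 ]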

ik_smart segments at a much coarser granularity. A search that matches any of these tokens returns the document.

How should the two analyzers be combined?

Use ik_max_word when indexing, so the indexed tokens are as fine-grained as possible.

Use ik_smart when searching; the coarser query tokens mean fewer terms to look up, which is more efficient.

For example:

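(The omitted screenshot shows a field mapping along these lines, with analyzer for indexing and search_analyzer for queries:)

PUT /test
{
  "mappings": {
    "_doc": {
      "properties": {
        "address": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}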

With this mapping, ik_max_word is used when indexing and ik_smart when searching.


Using Chinese and pinyin analyzers in Elasticsearch, and defining a custom analyzer

1. Download the analyzers from GitHub

Pre-built packages are available there. After downloading, create ik and pinyin directories under plugins/ in the ES install directory and unzip the packages into them; restart ES and they take effect. Usage instructions are in the readme on GitHub. Download the release matching your ES version; mine is 5.6.16, so I downloaded the 5.6.16 analyzers.

ik Chinese analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases

pinyin analyzer: https://github.com/medcl/elasticsearch-analysis-pinyin/releases

You can also download the source and package it yourself with mvn, but it is very slow; building the pinyin package took me over two hours, possibly also because of network restrictions.

2. Using the analyzers

Unzip, restart ES, and they are ready to use. Analyzers are configured per index, so specify which index when testing them.

ik_smart: performs the coarsest-grained segmentation, e.g. "中华人民共和国国歌" is split into "中华人民共和国", "国歌"; suitable for Phrase queries.

    get http://localhost:9200/user_index/_analyze?analyzer=ik_smart&text=张三李四
    

Response:

    {
        "tokens": [
            {
                "token": "张三李四",
                "start_offset": 0,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 0
            }
        ]
    }
    

ik_max_word: performs the finest-grained segmentation, e.g. "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination; suitable for Term queries.

    get http://localhost:9200/user_index/_analyze?analyzer=ik_max_word&text=张三李四
    

Response:

    {
        "tokens": [
            {
                "token": "张三李四",
                "start_offset": 0,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "张三",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "三",
                "start_offset": 1,
                "end_offset": 2,
                "type": "TYPE_CNUM",
                "position": 2
            },
            {
                "token": "李四",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "四",
                "start_offset": 3,
                "end_offset": 4,
                "type": "TYPE_CNUM",
                "position": 4
            }
        ]
    }
    

    get http://localhost:9200/user_index/_analyze?analyzer=pinyin&text=张三李四
    

Response:

    {
        "tokens": [
            {
                "token": "zhang",
                "start_offset": 0,
                "end_offset": 1,
                "type": "word",
                "position": 0
            },
            {
                "token": "zsls",
                "start_offset": 0,
                "end_offset": 4,
                "type": "word",
                "position": 0
            },
            {
                "token": "san",
                "start_offset": 1,
                "end_offset": 2,
                "type": "word",
                "position": 1
            },
            {
                "token": "li",
                "start_offset": 2,
                "end_offset": 3,
                "type": "word",
                "position": 2
            },
            {
                "token": "si",
                "start_offset": 3,
                "end_offset": 4,
                "type": "word",
                "position": 3
            }
        ]
    }
    

3. Custom analyzer: combining ik and pinyin

The ik analyzer has essentially no configurable options; use it as is.

The pinyin analyzer has many options you can set. Out of the box it produces the full pinyin of the text, the joined first letters, and the full pinyin of each single character; the per-character pinyin is too fine-grained to be very useful. What I want is for the tokens produced by the Chinese analyzer to be further analyzed into pinyin, which the configuration below achieves: the tokenizer is ik_max_word, and a pinyin token filter is applied on top of it.

    {
      "index": {
        "number_of_replicas" : "0",
        "number_of_shards" : "1",
        "analysis": {
          "analyzer": {
            "ik_pinyin_analyzer": {
              "tokenizer": "my_ik_pinyin",
              "filter": "pinyin_first_letter_and_full_pinyin_filter"
            },
            "pinyin_analyzer": {
              "tokenizer": "my_pinyin"
            }
          },
          "tokenizer": {
            "my_ik_pinyin": {
              "type": "ik_max_word"
            },
            "my_pinyin": {
              "type": "pinyin",
              "keep_first_letter": true,
              "keep_separate_first_letter": false,
              "keep_full_pinyin": false,
              "keep_joined_full_pinyin": true,
              "keep_none_chinese": true,
              "none_chinese_pinyin_tokenize": false,
              "keep_none_chinese_in_joined_full_pinyin": true,
              "keep_original": false,
              "limit_first_letter_length": 16,
              "lowercase": true,
              "trim_whitespace": true,
              "remove_duplicated_term": true
            }
          },
          "filter": {
            "pinyin_first_letter_and_full_pinyin_filter": {
              "type": "pinyin",
              "keep_first_letter": true,
              "keep_separate_first_letter": false,
              "keep_full_pinyin": false,
              "keep_joined_full_pinyin": true,
              "keep_none_chinese": true,
              "none_chinese_pinyin_tokenize": false,
              "keep_none_chinese_in_joined_full_pinyin": true,
              "keep_original": false,
              "limit_first_letter_length": 16,
              "lowercase": true,
              "trim_whitespace": true,
              "remove_duplicated_term": true
            }
          }
        }
      }
    }
    

Let's test it:

    http://localhost:9200/drug_index/_analyze?analyzer=ik_pinyin_analyzer&text=阿莫西林胶囊
    

The result is the ik_max_word segmentation of the Chinese text, further analyzed according to the pinyin rules.

    {
        "tokens": [
            {
                "token": "amoxilin",
                "start_offset": 0,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "amxl",
                "start_offset": 0,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "moxi",
                "start_offset": 1,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "mx",
                "start_offset": 1,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "xilin",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "xl",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "jiaonang",
                "start_offset": 4,
                "end_offset": 6,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "jn",
                "start_offset": 4,
                "end_offset": 6,
                "type": "CN_WORD",
                "position": 3
            }
        ]
    }
    

4. Testing from code

    package com.boot.es.model;
    
    import lombok.Data;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.elasticsearch.annotations.Document;
    import org.springframework.data.elasticsearch.annotations.Field;
    import org.springframework.data.elasticsearch.annotations.FieldType;
    import org.springframework.data.elasticsearch.annotations.InnerField;
    import org.springframework.data.elasticsearch.annotations.MultiField;
    import org.springframework.data.elasticsearch.annotations.Setting;
    
    /**
     * Author:   susq
     * Date:     2019-06-30 10:12
     */
    @Data
    @Document(indexName = "drug_index", type = "drug")
    @Setting(settingPath = "settings.json")
    public class Drug {
    
        @Id
        private Long id;
    
        @Field(type = FieldType.Keyword)
        private String price;
    
        @MultiField(
                mainField = @Field(type = FieldType.Keyword),
                otherFields = {
                        @InnerField(type = FieldType.Text, suffix = "ik", analyzer = "ik_max_word", searchAnalyzer = "ik_max_word"),
                        @InnerField(type = FieldType.Text, suffix = "ik_pinyin", analyzer = "ik_pinyin_analyzer", searchAnalyzer = "ik_pinyin_analyzer"),
                        @InnerField(type = FieldType.Text, suffix = "pinyin", analyzer = "pinyin_analyzer", searchAnalyzer = "pinyin_analyzer")
                }
        )
        private String name;
    
        @MultiField(
                mainField = @Field(type = FieldType.Keyword),
                otherFields = {
                        @InnerField(type = FieldType.Text, suffix = "ik", analyzer = "ik_max_word", searchAnalyzer = "ik_smart"),
                        @InnerField(type = FieldType.Text, suffix = "ik_pinyin", analyzer = "ik_pinyin_analyzer", searchAnalyzer = "ik_pinyin_analyzer"),
                        @InnerField(type = FieldType.Text, suffix = "pinyin", analyzer = "pinyin_analyzer", searchAnalyzer = "pinyin_analyzer")
                }
        )
        private String effect;
    }
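// Assumed (not shown in the original post): the Spring Data Elasticsearch
// repository used by the tests below. A minimal sketch:
package com.boot.es.repository;

import com.boot.es.model.Drug;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface DrugRepository extends ElasticsearchRepository<Drug, Long> {
}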
    
    @Test
    public void drugSaveTest() {
      Drug drug = new Drug();
      drug.setId(1L);
      drug.setName("阿莫西林胶囊");
      drug.setPrice("10");
      drug.setEffect("阿莫西林适用于敏感菌(不产β内酰胺酶菌株)所致的感染");
    
      Drug drug1 = new Drug();
      drug1.setId(3L);
      drug1.setName("阿莫西林");
      drug1.setPrice("10");
      drug1.setEffect("阿莫西林适用于敏感菌(不产β内酰胺酶菌株)所致的感染");
    
      Drug drug2 = new Drug();
      drug2.setId(2L);
      drug2.setName("999感冒灵颗粒");
      drug2.setPrice("20");
      drug2.setEffect("本品解热镇痛。用于感冒引起的头痛,发热,鼻塞,流涕,咽痛等");
    
      drugRepository.saveAll(Lists.newArrayList(drug, drug1, drug2));
    
      List<Drug> drugs = Lists.newArrayList(drugRepository.findAll());
      log.info("以保存的drugs: {}", drugs);
    }
    
    
/**
 * In this test, name (without a suffix) is a Keyword field, i.e. not analyzed; a hit
 * on it is an exact match and should score higher, so its boost is twice that of the
 * match query on the analyzed field.
 */
    @Test
    public void drugIkSearchTest() {
      NativeSearchQueryBuilder builder = new NativeSearchQueryBuilder();
      NativeSearchQuery query = builder.withQuery(QueryBuilders.boolQuery()
                                                  .should(QueryBuilders.matchQuery("name", "阿莫西林")).boost(2)
                                                  .should(QueryBuilders.matchQuery("name.ik", "阿莫西林")).boost(1))
        .build();
      log.info("DSL:{}", query.getQuery().toString());
      Iterable<Drug> iterable = drugRepository.search(query);
      List<Drug> drugs = Lists.newArrayList(iterable);
      log.info("result: {}", drugs);
    }
    
/**
 * In this test, name.pinyin only contains the full pinyin of the whole name and the
 * joined first letters, so a hit on it is an exact match and should score higher.
 */
    @Test
    public void drugPinyinSearchTest() {
      NativeSearchQueryBuilder builder = new NativeSearchQueryBuilder();
      NativeSearchQuery query = builder.withQuery(QueryBuilders.boolQuery()
                                                  .should(QueryBuilders.matchQuery("name.ik_pinyin", "阿莫西林").boost(1))
                                                  .should(QueryBuilders.matchQuery("name.pinyin", "阿莫西林").boost(2))
                                                 )
        .withSort(SortBuilders.scoreSort())
        .build();
      log.info("DSL:{}", query.getQuery().toString());
      Iterable<Drug> iterable = drugRepository.search(query);
      List<Drug> drugs = Lists.newArrayList(iterable);
      log.info("result: {}", drugs);
    }
    
1. Analyzers

An analyzer takes a stream of text, splits it into individual terms, and normalizes each term. It has three parts:

• character filter: pre-processing before tokenization, e.g. stripping HTML tags or converting special symbols (& to and, | to or).
• tokenizer: splits the text into tokens
• token filter: normalizes the tokens
2. Built-in analyzers
• standard (the default): lowercases terms, removes punctuation, and supports stop-word removal (a, an, the, etc.); Chinese is handled by splitting into single characters (e.g. '你好' becomes '你' and '好').
• simple: splits the text on non-letter characters and lowercases the terms; numeric tokens are dropped.
• whitespace: only splits on whitespace; no lowercasing or other normalization; not suitable for Chinese.
• language analyzers: analyzers for specific languages; none of them support Chinese.
3. Installing the Chinese analyzer (analysis-ik)
# Download the Chinese analyzer: https://github.com/medcl/elasticsearch-analysis-ik
git clone https://github.com/medcl/elasticsearch-analysis-ik

# Or unzip a downloaded elasticsearch-analysis-ik-master.zip
unzip elasticsearch-analysis-ik-master.zip

# Enter elasticsearch-analysis-ik-master and build the source with maven
# (maven must be installed; -Dmaven.test.skip=true skips the tests)
mvn clean install -Dmaven.test.skip=true

# Create an ik directory under the es plugins directory
mkdir ik

# Move the generated elasticsearch-analysis-ik-<version>.zip into the ik directory and unzip it
cp elasticsearch-analysis-ik-<version>.zip /opt/elasticsearch/plugins/ik
unzip elasticsearch-analysis-ik-<version>.zip
    

Building and installing maven on CentOS 7 Minimal
