Note 13: The Multi-Field Feature and Configuring a Custom Analyzer in Mapping
Multi-Field Types
- Multi-field feature (see the mapping sketch after this list)
  - Exact matching on a manufacturer name: add a keyword sub-field
  - Use different analyzers on sub-fields
    - Different languages
    - Searching on a pinyin sub-field
    - Different analyzers can also be specified for indexing and for search
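A minimal mapping sketch of these ideas (the index name `products`, the field names, and the sub-field names are assumptions; the `pinyin` analyzer comes from the separate analysis-pinyin plugin and must be installed for that sub-field to work):

```
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",                      // full-text search on the manufacturer name
        "fields": {
          "keyword": { "type": "keyword" },  // sub-field for exact matching (company.keyword)
          "english": {
            "type": "text",
            "analyzer": "english"            // sub-field analyzed for a different language
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"             // assumption: requires the analysis-pinyin plugin
          }
        }
      },
      "comment": {
        "type": "text",
        "analyzer": "english",               // analyzer used at index time
        "search_analyzer": "standard"        // a different analyzer applied to the query string
      }
    }
  }
}
```

A term query on `company.keyword` then matches the stored manufacturer name exactly, while queries on `company` go through full-text analysis.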
Exact Values vs. Full Text
- Exact values: numbers / dates / a specific string (e.g. "Apple Store")
  - the keyword type in Elasticsearch
- Full text: unstructured text data
  - the text type in Elasticsearch
- Elasticsearch creates an inverted index for every field
  - Exact values do not need to be analyzed (tokenized) at index time (see the sketch after this list)
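A small sketch of the difference, with made-up index and field names: the keyword field keeps the exact value as a single term, while the text field is analyzed into lower-cased terms.

```
PUT shops
{
  "mappings": {
    "properties": {
      "name":     { "type": "text"    },   // full text: analyzed into the terms "apple", "store"
      "name_raw": { "type": "keyword" }    // exact value: indexed as the single term "Apple Store"
    }
  }
}

PUT shops/_doc/1
{ "name": "Apple Store", "name_raw": "Apple Store" }

# Matches: the keyword field keeps the exact, unanalyzed value
POST shops/_search
{ "query": { "term": { "name_raw": "Apple Store" } } }

# Does not match: "Apple Store" is not a single term in the analyzed text field
POST shops/_search
{ "query": { "term": { "name": "Apple Store" } } }
```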
- When the analyzers that ship with Elasticsearch do not meet your needs, you can define a custom analyzer by combining different components (an ad-hoc test is sketched after this list):
  - Character Filter
  - Tokenizer
  - Token Filter
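As a rough sketch, the three kinds of components can also be combined ad hoc in a single `_analyze` request to experiment before wiring them into index settings (the sample text is made up):

```
POST _analyze
{
  "char_filter": ["html_strip"],          // 1. character filter: strip HTML tags
  "tokenizer": "standard",                // 2. tokenizer: split the text into terms
  "filter": ["lowercase", "stop"],        // 3. token filters: lowercase, then drop stopwords
  "text": "<b>The Quick Brown Foxes</b>"
}
// Expected terms, roughly: "quick", "brown", "foxes"
```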
Character Filters
- Process the text before it reaches the Tokenizer, for example to add, delete, or replace characters. Multiple Character Filters can be configured; they affect the position and offset information produced by the Tokenizer.
- Some built-in Character Filters:
  - HTML strip: removes HTML tags
  - Mapping: string replacement
  - Pattern replace: regex-based replacement
Tokenizer
- Splits the original text into words (terms or tokens) according to certain rules
- Tokenizers built into Elasticsearch: whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy (a quick uax_url_email test is sketched after this list)
- You can also develop a plugin in Java to implement your own Tokenizer
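For instance, a quick test of the uax_url_email tokenizer (sample text made up): unlike standard, it keeps e-mail addresses and URLs together as single tokens.

```
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "email me at someone@example.com or visit https://www.elastic.co"
}
# someone@example.com and https://www.elastic.co each come back as one token
```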
Token Filters
- Add, modify, or delete the terms output by the Tokenizer
- Built-in Token Filters: Lowercase | stop | synonym (adds synonyms; see the sketch after this list)
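A sketch of the synonym token filter, defined inline in an `_analyze` call for experimenting (the synonym pair is an assumption):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "synonym", "synonyms": ["elasticsearch, es"] }   // inline synonym definition
  ],
  "text": "Elasticsearch is powerful"
}
// "elasticsearch" and "es" are emitted at the same position, so either term matches at query time
```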
Demo: char_filter
Strip HTML tags with html_strip
```
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}

// Result
{
  "tokens" : [
    { "token" : "hello world", "start_offset" : 3, "end_offset" : 18, "type" : "word", "position" : 0 }
  ]
}
```
Replace characters with a mapping char filter
```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

// Response
{
  "tokens" : [
    { "token" : "123_456",      "start_offset" : 0,  "end_offset" : 7,  "type" : "<NUM>",      "position" : 0 },
    { "token" : "I_test",       "start_offset" : 9,  "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "test_990",     "start_offset" : 17, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "650_555_1234", "start_offset" : 26, "end_offset" : 38, "type" : "<NUM>",      "position" : 3 }
  ]
}
```
Replace emoticons with a mapping char filter
```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am felling :)", "Feeling :( today"]
}

// Response
{
  "tokens" : [
    { "token" : "I",       "start_offset" : 0,  "end_offset" : 1,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "am",      "start_offset" : 2,  "end_offset" : 4,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "felling", "start_offset" : 5,  "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "happy",   "start_offset" : 13, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "Feeling", "start_offset" : 16, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 104 },
    { "token" : "sad",     "start_offset" : 24, "end_offset" : 26, "type" : "<ALPHANUM>", "position" : 105 },
    { "token" : "today",   "start_offset" : 27, "end_offset" : 32, "type" : "<ALPHANUM>", "position" : 106 }
  ]
}
```
Regex replacement with pattern_replace
```
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

// Response
{
  "tokens" : [
    { "token" : "www.elastic.co", "start_offset" : 0, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 0 }
  ]
}
```
Demo: tokenizer
- Split a path with path_hierarchy
```
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a"
}

// Response
{
  "tokens" : [
    { "token" : "/user",          "start_offset" : 0, "end_offset" : 5,  "type" : "word", "position" : 0 },
    { "token" : "/user/ymruan",   "start_offset" : 0, "end_offset" : 12, "type" : "word", "position" : 0 },
    { "token" : "/user/ymruan/a", "start_offset" : 0, "end_offset" : 14, "type" : "word", "position" : 0 }
  ]
}
```
- Token filters: stop and snowball
```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"],   // stop removes stopwords such as "on", "the", "a"
  "text": ["The gilrs in China are playing this game!"]
}

// Response
{
  "tokens" : [
    { "token" : "The",   "start_offset" : 0,  "end_offset" : 3,  "type" : "word", "position" : 0 },  // the uppercase "The" is not filtered out
    { "token" : "gilr",  "start_offset" : 4,  "end_offset" : 9,  "type" : "word", "position" : 1 },
    { "token" : "China", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
    { "token" : "play",  "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
    { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
  ]
}
```
- After adding lowercase, "The" is treated as a stopword and removed
```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The gilrs in China are playing this game!"]
}

// Response
{
  "tokens" : [
    { "token" : "gilr",  "start_offset" : 4,  "end_offset" : 9,  "type" : "word", "position" : 1 },
    { "token" : "china", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
    { "token" : "play",  "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
    { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
  ]
}
```
Custom analyzer
- The standard format for a custom analyzer from the official documentation
According to the official documentation, the standard format for a custom analyzer is:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ... },
      "tokenizer":   { ... custom tokenizers ... },
      "filter":      { ... custom token filters ... },
      "analyzer":    { ... custom analyzers ... }
    }
  }
}
```
- Defining a custom analyzer
```
# Define your own analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```
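The custom analyzer can then be tested directly against the index (the sample text is made up):

```
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
// Roughly: ":)" becomes "happy", the text is split on " .,!?", lower-cased,
// and English stopwords such as "a" and "and" are removed
```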