Search ranking problems caused by using different analyzers in Elasticsearch

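  • Many of us, when building Chinese-language search, have found the ik Chinese analysis plugin on GitHub.
  • Then, when creating the mapping, it feels natural to use parameters like these (following the example in the plugin's official documentation):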
    {
          "properties": {
              "title": {
                  "type": "text",
                  "analyzer": "ik_max_word",
                  "search_analyzer": "ik_smart"
              }
          }
    }

  • First, let's look at all the data (two documents, 打火车 and 火车):
curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}
  • Now we search for 打火车:
curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.21110919,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.160443,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      }
    ]
  }
}
  • Surprisingly, 火车 scores 0.21110919, higher than 打火车 at 0.160443, even though the query was 打火车.

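  • Some digging followed (thanks first of all to the github.com/mobz/elasticsearch-head plugin, which saved a lot of work when inspecting the data).
  • Looking at how the document had been tokenized then gave the answer:
curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq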
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打火": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}
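  • Surprisingly, 打火车 was split at index time into the two terms 打火 and 火车, so something is clearly off here (although from the search engine's point of view nothing is wrong).
  • The 火车 term in the 打火车 document does earn a score, but the extra 打火 term drags that score down: the query is analyzed with ik_smart, so it never matches 打火, and 打火 only makes the document longer, which the scoring penalizes. As a result the 火车 document ranks first.
  • So I decided to set both analyzers to the same one: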
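The two scores above can be reproduced by hand, which makes the length penalty visible. The following is a minimal sketch, not the engine's actual code: it assumes the index uses the default BM25 similarity with k1 = 1.2 and b = 0.75, that the reported scores include a constant factor of (k1 + 1), and that only the 火车 term of the query matches. The function names are made up for illustration; the inputs come from the term vectors response (doc_count = 2, sum_ttf = 3).

```python
import math

K1 = 1.2   # assumed BM25 default
B = 0.75   # assumed BM25 default


def idf(doc_count, doc_freq):
    """Inverse document frequency in the BM25 form used by Lucene."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def term_score(freq, doc_len, avg_len, doc_count, doc_freq):
    """Score of one query term against one document."""
    length_norm = K1 * (1 - B + B * doc_len / avg_len)
    # The (K1 + 1) factor is an assumption that matches the scores in this post.
    return (K1 + 1) * idf(doc_count, doc_freq) * freq / (freq + length_norm)


# Field statistics from the term vectors response.
doc_count = 2
avg_len = 3 / 2            # sum_ttf / doc_count

# 火车 occurs once in both documents, so doc_freq = 2.
score_video_2 = term_score(1, 1, avg_len, doc_count, 2)  # title has 1 term
score_video_1 = term_score(1, 2, avg_len, doc_count, 2)  # title has 2 terms

print(round(score_video_2, 4))  # 0.2111
print(round(score_video_1, 4))  # 0.1604
```

Both documents match the same single term with the same frequency, so the only difference between the two results is the document length: two terms for Video_1 against one for Video_2.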
{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
}
  • Then look at the tokens again (this time they are what we expected):
curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}
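Term vectors show what was indexed for one stored document. To compare what the two analyzers emit for arbitrary text without indexing anything, Elasticsearch's `_analyze` API can be called once per analyzer. This is a minimal sketch, assuming the same node (127.0.0.1:9200) and index (test) as in the commands above; the helper names are hypothetical and no output is shown because it depends on the installed ik dictionary.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9200"  # assumption: same node as the curl commands
INDEX = "test"                      # assumption: same index as above


def build_analyze_request(analyzer, text):
    """Build a POST request for the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Return the token strings from an _analyze response (JSON text)."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


def analyze(analyzer, text):
    """Send the request to the node and return the tokens it reports."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return extract_tokens(resp.read().decode("utf-8"))


# Usage (needs a running node with the ik plugin installed):
#   for name in ("ik_max_word", "ik_smart"):
#       print(name, analyze(name, "打火车"))
```

Running the commented loop against the same text makes any mismatch between the index-time and search-time analyzers visible before a mapping is created.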
  • Searching once more, the scores and the ranking are now what we wanted:
curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.77041256,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.77041256,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}
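The new scores can be checked the same way as a sum over the query terms. This is a minimal sketch under the same assumptions as the scores in this post suggest: default BM25 with k1 = 1.2 and b = 0.75, a constant (k1 + 1) factor, and a query analyzed by ik_smart into 打 and 火车. The function names are hypothetical; the statistics are taken from the second term vectors response (doc_count = 2, sum_ttf = 3).

```python
import math

K1 = 1.2   # assumed BM25 default
B = 0.75   # assumed BM25 default


def idf(doc_count, doc_freq):
    """Inverse document frequency in the BM25 form used by Lucene."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def term_score(freq, doc_len, avg_len, doc_count, doc_freq):
    """Score of one query term against one document."""
    length_norm = K1 * (1 - B + B * doc_len / avg_len)
    # The (K1 + 1) factor is an assumption that matches the scores in this post.
    return (K1 + 1) * idf(doc_count, doc_freq) * freq / (freq + length_norm)


doc_count = 2
avg_len = 3 / 2  # sum_ttf / doc_count

# Video_1 ("打火车") now holds the terms 打 and 火车, and both match the query.
# 打 appears in one document (doc_freq = 1), 火车 in both (doc_freq = 2).
video_1 = (term_score(1, 2, avg_len, doc_count, 1)
           + term_score(1, 2, avg_len, doc_count, 2))

# Video_2 ("火车") holds only 火车, so only that term contributes.
video_2 = term_score(1, 1, avg_len, doc_count, 2)

print(round(video_1, 4))  # 0.7704
print(round(video_2, 4))  # 0.2111
```

The 火车 contribution to Video_1 is unchanged from before; the ranking flips because the rarer term 打 now matches as well and adds most of the score.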
This work is released under a CC license; reposts must credit the author and link to this article.