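Note 38: Bucket & Metric Aggregations and Nested Aggregations
Bucket & Metric Aggregation
- Metric: a family of statistical calculations run over a set of documents
- Bucket: a group of documents that satisfy a given condition
Aggregation syntax
- An aggregation is part of a Search request. When only the aggregation results are needed, it is generally recommended to set size to 0 so that no hits are returned
Example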
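As a rough sketch of that shape, the request below puts an aggregation next to an ordinary query inside one Search call. It assumes the employees index and its salary and age fields, which the demo later in this note creates; the aggregation name avg_salary_30_plus is an arbitrary label chosen for illustration.

```console
# "size": 0 suppresses the hits array, so only "aggregations" comes back
POST employees/_search
{
  "size": 0,
  "query": {
    "range": { "age": { "gte": 30 } }
  },
  "aggs": {
    "avg_salary_30_plus": {
      "avg": { "field": "salary" }
    }
  }
}
```

The aggregation only sees documents matched by the query, so here the average is computed over employees aged 30 or older.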
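Metric Aggregation
- Single-value analysis: outputs a single result
  - min, max, avg, sum
  - cardinality (similar to a distinct count)
- Multi-value analysis: outputs multiple results
  - stats, extended stats
  - percentile, percentile rank
  - top hits (the top-ranked sample documents)
Metric aggregation demo
- Find the lowest salary
- Find the highest salary
- One aggregation that outputs multiple values
- One query that contains multiple aggregations
- Find the lowest, highest, and average salary at the same time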
# Create the index with an explicit mapping
PUT /employees/
{
  "mappings": {
    "properties": {
      "age": { "type": "integer" },
      "gender": { "type": "keyword" },
      "job": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 50 }
        }
      },
      "name": { "type": "keyword" },
      "salary": { "type": "integer" }
    }
  }
}

# Load 20 sample employees
PUT /employees/_bulk
{ "index": { "_id": "1" } }
{ "name": "Emma", "age": 32, "job": "Product Manager", "gender": "female", "salary": 35000 }
{ "index": { "_id": "2" } }
{ "name": "Underwood", "age": 41, "job": "Dev Manager", "gender": "male", "salary": 50000 }
{ "index": { "_id": "3" } }
{ "name": "Tran", "age": 25, "job": "Web Designer", "gender": "male", "salary": 18000 }
{ "index": { "_id": "4" } }
{ "name": "Rivera", "age": 26, "job": "Web Designer", "gender": "female", "salary": 22000 }
{ "index": { "_id": "5" } }
{ "name": "Rose", "age": 25, "job": "QA", "gender": "female", "salary": 18000 }
{ "index": { "_id": "6" } }
{ "name": "Lucy", "age": 31, "job": "QA", "gender": "female", "salary": 25000 }
{ "index": { "_id": "7" } }
{ "name": "Byrd", "age": 27, "job": "QA", "gender": "male", "salary": 20000 }
{ "index": { "_id": "8" } }
{ "name": "Foster", "age": 27, "job": "Java Programmer", "gender": "male", "salary": 20000 }
{ "index": { "_id": "9" } }
{ "name": "Gregory", "age": 32, "job": "Java Programmer", "gender": "male", "salary": 22000 }
{ "index": { "_id": "10" } }
{ "name": "Bryant", "age": 20, "job": "Java Programmer", "gender": "male", "salary": 9000 }
{ "index": { "_id": "11" } }
{ "name": "Jenny", "age": 36, "job": "Java Programmer", "gender": "female", "salary": 38000 }
{ "index": { "_id": "12" } }
{ "name": "Mcdonald", "age": 31, "job": "Java Programmer", "gender": "male", "salary": 32000 }
{ "index": { "_id": "13" } }
{ "name": "Jonthna", "age": 30, "job": "Java Programmer", "gender": "female", "salary": 30000 }
{ "index": { "_id": "14" } }
{ "name": "Marshall", "age": 32, "job": "Javascript Programmer", "gender": "male", "salary": 25000 }
{ "index": { "_id": "15" } }
{ "name": "King", "age": 33, "job": "Java Programmer", "gender": "male", "salary": 28000 }
{ "index": { "_id": "16" } }
{ "name": "Mccarthy", "age": 21, "job": "Javascript Programmer", "gender": "male", "salary": 16000 }
{ "index": { "_id": "17" } }
{ "name": "Goodwin", "age": 25, "job": "Javascript Programmer", "gender": "male", "salary": 16000 }
{ "index": { "_id": "18" } }
{ "name": "Catherine", "age": 29, "job": "Javascript Programmer", "gender": "female", "salary": 20000 }
{ "index": { "_id": "19" } }
{ "name": "Boone", "age": 30, "job": "DBA", "gender": "male", "salary": 30000 }
{ "index": { "_id": "20" } }
{ "name": "Kathy", "age": 29, "job": "DBA", "gender": "female", "salary": 20000 }

# Query: several single-value metric aggregations in one request
POST employees/_search
{
  "size": 0,
  "aggs": {
    "min": {
      "min": { "field": "salary" }
    },
    "max": {
      "max": { "field": "salary" }
    },
    "avg": {
      "avg": { "field": "salary" }
    }
  }
}

# Response
{
  "took": 111,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": { "value": 20, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "avg": { "value": 24700.0 },
    "min": { "value": 9000.0 },
    "max": { "value": 50000.0 }
  }
}

# One aggregation, multiple output values
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": { "field": "salary" }
    }
  }
}

# Response (aggregations section only)
"aggregations": {
  "stats_salary": {
    "count": 20,
    "min": 9000.0,
    "max": 50000.0,
    "avg": 24700.0,
    "sum": 494000.0
  }
}
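The demo above covers min, max, avg, and stats. The other multi-value metrics listed earlier can be requested the same way; the following is a minimal sketch against the same employees index. The aggregation names and the chosen percents and values are illustrative assumptions, not part of the original demo.

```console
# Multi-value metrics on salary in a single request
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_extended_stats": {
      "extended_stats": { "field": "salary" }
    },
    "salary_percentiles": {
      "percentiles": {
        "field": "salary",
        "percents": [50, 95, 99]
      }
    },
    "salary_percentile_ranks": {
      "percentile_ranks": {
        "field": "salary",
        "values": [20000, 30000]
      }
    }
  }
}
```

percentiles answers "what salary sits at the Nth percentile", while percentile_ranks answers the reverse question, "what percentage of salaries fall at or below this value". Both are computed approximately by Elasticsearch.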
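Bucket
- Documents are assigned to different buckets according to a rule, which classifies them. Some common Bucket aggregations provided by ES:
  - Terms
  - Numeric types
    - Range / Date Range
    - Histogram / Date Histogram
- Nesting is supported: the documents in a bucket can be split again into sub-buckets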
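The employees index in this note has no date field, so Date Range and Date Histogram are not exercised by the demos. The sketch below uses a separate, hypothetical index employees_dated with a made-up hire_date field and two made-up documents, purely to show the request shape. It assumes Elasticsearch 7.2 or later, where calendar_interval is available; older versions use interval instead.

```console
# Hypothetical index with a date field (not part of the employees demo)
PUT /employees_dated
{
  "mappings": {
    "properties": {
      "name":      { "type": "keyword" },
      "hire_date": { "type": "date" }
    }
  }
}

# Made-up sample documents; ?refresh makes them searchable immediately
PUT /employees_dated/_bulk?refresh
{ "index": { "_id": "1" } }
{ "name": "Emma", "hire_date": "2017-03-01" }
{ "index": { "_id": "2" } }
{ "name": "Tran", "hire_date": "2019-08-15" }

# One bucket per calendar year, plus two custom date ranges
POST employees_dated/_search
{
  "size": 0,
  "aggs": {
    "hires_per_year": {
      "date_histogram": {
        "field": "hire_date",
        "calendar_interval": "year"
      }
    },
    "hires_before_after_2019": {
      "date_range": {
        "field": "hire_date",
        "format": "yyyy-MM-dd",
        "ranges": [
          { "to": "2019-01-01" },
          { "from": "2019-01-01" }
        ]
      }
    }
  }
}
```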
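Terms Aggregation
- A field needs doc_values or fielddata before it can be used in a Terms aggregation
  - keyword fields support this by default through doc_values
  - text fields need fielddata enabled in the mapping, and the buckets are then built from the analyzed tokens
- Demo
  - Aggregate on job and job.keyword
  - Run a Terms aggregation on gender
  - Specify the bucket size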
# Terms aggregation on the keyword sub-field
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": { "field": "job.keyword" }
    }
  }
}

# Response (aggregations section only)
"aggregations": {
  "jobs": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "Java Programmer", "doc_count": 7 },
      { "key": "Javascript Programmer", "doc_count": 4 },
      { "key": "QA", "doc_count": 3 },
      { "key": "DBA", "doc_count": 2 },
      { "key": "Web Designer", "doc_count": 2 },
      { "key": "Dev Manager", "doc_count": 1 },
      { "key": "Product Manager", "doc_count": 1 }
    ]
  }
}

# Enable fielddata on the text field so that it supports terms aggregation
PUT employees/_mapping
{
  "properties": {
    "job": {
      "type": "text",
      "fielddata": true
    }
  }
}

# Terms aggregation on the text field: the buckets are the analyzed tokens
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": { "field": "job" }
    }
  }
}

# Aggregating on job.keyword and on job yields a different number of buckets;
# cardinality returns the number of distinct values
POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": { "field": "job.keyword" }
    }
  }
}

# Terms aggregation on gender (a keyword field)
POST employees/_search
{
  "size": 0,
  "aggs": {
    "gender": {
      "terms": { "field": "gender" }
    }
  }
}
Cardinality
- Similar to DISTINCT in SQL: it counts the number of distinct values of a field
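To compare the two fields directly, both distinct counts can be requested in one call. This is a sketch that assumes fielddata has already been enabled on the job text field, as done in the mapping update earlier in this note; the aggregation names are illustrative.

```console
# Distinct whole job titles vs. distinct analyzed tokens
POST employees/_search
{
  "size": 0,
  "aggs": {
    "distinct_job_titles": {
      "cardinality": { "field": "job.keyword" }
    },
    "distinct_job_tokens": {
      "cardinality": { "field": "job" }
    }
  }
}
```

The two values differ because job.keyword stores each title as a single term, while job is split into tokens at index time. Cardinality is an approximate count, although on a data set this small the approximation should not be noticeable.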
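Bucket Size & Top Hits Demo
- Use case: after bucketing, retrieve the list of top-matching documents inside each bucket
- Size: bucket by age and return only the specified number of buckets
- Top Hits: for each job type, show the 3 oldest employees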
# Specify the bucket size: return only 3 age buckets
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field": "age",
        "size": 3
      }
    }
  }
}

# For each job type, return the details of the 3 oldest employees
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": { "field": "job.keyword" },
      "aggs": {
        "old_employee": {
          "top_hits": {
            "size": 3,
            "sort": [
              { "age": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
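Optimizing Terms aggregation performance
- Worth considering when aggregation queries run frequently, aggregation performance requirements are high, and new documents are continuously being written to the index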
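The note does not name a specific technique here. One option commonly used in this situation is the eager_global_ordinals mapping parameter, which builds the global ordinals used by Terms aggregations at refresh time instead of on the first aggregation request. Treat the following as an assumption about what is meant, sketched on the gender field of the employees index:

```console
# Assumption: eager_global_ordinals is the optimization intended here
PUT employees/_mapping
{
  "properties": {
    "gender": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```

The trade-off is that the cost moves to the indexing side: each refresh does more work, so this tends to pay off only for fields that are aggregated on often.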
Range & Histogram
Bucket documents by numeric range
In a Range aggregation, you can define a custom key for each bucket
Demo:
Bucket by salary range
Bucket by salary interval (Histogram)
# Bucket by salary range; a custom key can be defined for a bucket
# ("from" is inclusive, "to" is exclusive)
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          { "to": 10000 },
          { "from": 10000, "to": 20000 },
          { "key": ">20000", "from": 20000 }
        ]
      }
    }
  }
}

# Response (aggregations section only)
"aggregations": {
  "salary_range": {
    "buckets": [
      { "key": "*-10000.0", "to": 10000.0, "doc_count": 1 },
      { "key": "10000.0-20000.0", "from": 10000.0, "to": 20000.0, "doc_count": 4 },
      { "key": ">20000", "from": 20000.0, "doc_count": 15 }
    ]
  }
}

# Salary histogram: salaries from 0 to 100,000, one bucket per 10,000
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histogram": {
      "histogram": {
        "field": "salary",
        "interval": 10000,
        "extended_bounds": {
          "min": 0,
          "max": 100000
        }
      }
    }
  }
}
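Bucket + Metric Aggregation
A Bucket aggregation can be analyzed further by adding sub-aggregations. A sub-aggregation can be either of the following:
- Bucket
- Metric
Demo
- Bucket by job type and compute salary statistics
- Bucket by job type first, then by gender, and compute salary statistics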
# Nested aggregation 1: bucket by job type and compute salary statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_salary_stats": {
      "terms": { "field": "job.keyword" },
      "aggs": {
        "salary": {
          "stats": { "field": "salary" }
        }
      }
    }
  }
}

# Multiple levels of nesting: bucket by job type, then by gender,
# then compute salary statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_gender_stats": {
      "terms": { "field": "job.keyword" },
      "aggs": {
        "gender_stats": {
          "terms": { "field": "gender" },
          "aggs": {
            "salary_stats": {
              "stats": { "field": "salary" }
            }
          }
        }
      }
    }
  }
}
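Sub-aggregations are not limited to Terms buckets. As a small sketch of the same idea under a Range bucket, the request below attaches one Bucket sub-aggregation and one Metric sub-aggregation side by side. The bucket keys and aggregation names are illustrative choices, not taken from the original demo.

```console
# Range buckets, each with a bucket sub-aggregation and a metric sub-aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          { "key": "<20000", "to": 20000 },
          { "key": ">=20000", "from": 20000 }
        ]
      },
      "aggs": {
        "jobs_in_range": {
          "terms": { "field": "job.keyword" }
        },
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}
```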
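Summary
Aggregation syntax
- One aggregation query can contain multiple aggregations, and each Bucket aggregation can contain multiple sub-aggregations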
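Both points can be seen in a single request. The following sketch, again on the employees index with illustrative aggregation names, has two top-level aggregations, and the Bucket aggregation among them carries three sub-aggregations:

```console
# Two top-level aggregations; by_gender has three sub-aggregations
POST employees/_search
{
  "size": 0,
  "aggs": {
    "overall_avg_salary": {
      "avg": { "field": "salary" }
    },
    "by_gender": {
      "terms": { "field": "gender" },
      "aggs": {
        "min_salary": { "min": { "field": "salary" } },
        "max_salary": { "max": { "field": "salary" } },
        "oldest": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "age": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
```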
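Metric
- Single-value output & multi-value output
Bucket
- Terms & numeric ranges
This work is licensed under a CC license. Reposts must credit the author and link to this article.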