Aggregation is used to analyze the statistical indicators of the query result set, we take the View log analysis as an example to introduce a variety of commonly used elasticsearch aggregation operations.
Directory:
- Query the number of users to watch and watch the length of video
- Aggregation Page Splitter
- Querying video Uvs
- Single Video UV
- Bulk query for video Uvs
- Having query
- Filtering based on Count
- Filtering according to other indicators
Let's start by showing the structure of the document we're analyzing:
{ "video_id": 1289643545120062253, // 视频id "video_uid": 3931482202390368051, // 视频发布者id "uid": 47381776787453866, // 观看用户id "time": 1533891263224, // 时间发生时间 "watch_duration": 30 // 观看时长}
Each document records a watch event, and we analyze the user's viewing behavior through aggregation.
Elasticsearch introduces two related concepts:
- Bucket (Buckets): A collection of documents that meet certain criteria
- Indicator (Metrics): The statistical value of a document in a bucket, such as the average of a specific field
Query the number of users to watch and watch the length of video
First describe the query with an SQL statement:
SELECT uid, count(*) as view_count, avg(watch_duration) as avg_duration FROM view_logWHERE time >= #{since} AND time <= #{to} GROUP BY uid;
GET /view_log/_search{ "size" : 0, "query": { "range": { "time": { "gte": 0, // since "lte": 0 // to } } }, "aggs": { "agg": { // agg为聚合的名称 "terms": { // 聚合的条件为 uid 相同 "field": "uid" }, "aggs": { // 添加统计指标(Metrics) "avg_duration": { "avg": { // 统计 watch_duration 的平均值 "field": "watch_duration" } } } } }}
Response
{ "took": 10, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 100000, "max_score": 0, "hits": [] }, "aggregations": { "agg": { "buckets": [ { "key": 21836334489858688, "doc_count": 4026, "avg_duration": { "value": 12778.882352941177 } }, { "key": 31489302390368051, "doc_count": 2717, "avg_duration": { "value": 2652.5714285714284 } } ] }}
The Result.aggregations.agg.buckets list contains the results of the query.
Because we aggregate according to Terms:uid, each bucket is the same set of documents with the UID, the key field is the UID.
The Doc_count field indicates that the number of files in the bucket is the SQL statement count(*) as view_count
.
Avg_duration.value represents the average length of time that the user is viewing for watch_duration.
Aggregation Page Splitter
The number of users in real-world applications is staggering, and it is impossible to get all the results through a single query so we need to retrieve the page in batches:
GET /view_log/_search{ "size" : 0, "query": { "range": { "time": { "gte": 0, // since "lte": 0 // to } } }, "aggs": { "agg": { "terms": { "field": "uid", "size": 10000, // bucket 的最大个数 "include": { // 将聚合结果分为10页,序号为[0,9], 取第一页 "partition": 0, "num_partitions": 10 } }, "aggs": { "avg_duration": { "avg": { "field": "watch_duration" } } } } }}
The query above is almost identical to the query in the previous section, except that the include field is added to the Aggs.agg.terms field for paging.
Query Video UV Single video
UV refers to the number of users who watch a video (user view), as opposed to the number of views viewed as PV (Page view).
Use SQL statements to describe:
SELECT video_id, count(*) as pv, count(distinct uid) as uvFROM view_logWHERE video_id = #{video_id};
The Elasticsearch can be conveniently used for count (distinct) queries:
GET /view_log/_search{ "aggs": { "uv": { "cardinality": { "field": "uid" } } }}
Response
{ "took": 255, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 17579, "max_score": 0, "hits": [] }, "aggregations": { "uv": { "value": 11 } }}
Bulk query for video Uvs
Elasticsearch can also bulk query count (DISTINCT), first described with sql:
SELECT video_id, count(*) as pv, count(distinct uid) as uvFROM view_logGROUP BY video_id;
Inquire:
GET /view_log/_search{ "size": 0, "aggs": { "video": { "terms": { "field": "video_id" }, "aggs": { "uv": { "cardinality": { "field": "uid" } } } } }}
Response
{ "took": 313, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 16940, "max_score": 0, "hits": [] }, "aggregations": { "video": { "buckets": [ { "key": 25417499722062, // 视频id "doc_count": 427, // 视频观看次数 pv "uv": { "value": 124 // 观看视频的用户数 uv } }, { "key": 72446898144, "doc_count": 744, "uv": { "value":233 } } ] } }}
Having query
SQL can use the having statement to filter based on aggregated results, Elasticsearch can use pipeline aggregations to achieve this effect, but the syntax is cumbersome.
Filtering based on Count
Use SQL queries to view videos over 200 times:
SELECT video_id, count(*) as view_countFROM view_logGROUP BY video_idHAVING count(*) > 200;
GET /view_log/_search{ "size": 0, "aggs": { "view_count": { "terms": { "field": "video_id" }, "aggs": { "having": { "bucket_selector": { "buckets_path": { // 选择 view_count 聚合的 doc_count 进行过滤 "view_count": "_count" }, "script": { "source": "params.view_count > 200" } } } } } }}
Response
{ "took": 83, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 775, "max_score": 0, "hits": [] }, "aggregations": { "view_count": { "buckets": [ { "key": 35025417499764062, "doc_count": 529 }, { "key": 19913672446898144, "doc_count": 759 } ] } }}
The key to implementing a elasticsearch-like query is to use [Bucket_selector] to select aggregate results for filtering.
Filtering according to other indicators
Next we try to query the video with an average of 5 minutes of viewing, describing the query in sql:
SELECT video_id FROM view_logGROUP BY video_idHAVING avg(watch_duration) > 300;
GET /view_log/_search{ "size": 0, "aggs": { "video": { "terms": { "field": "video_id" }, "aggs": { "avg_duration": { "avg": { "field": "watch_duration" } }, "avg_duration_filter": { "bucket_selector": { "buckets_path": { "avg_duration": "avg_duration" }, "script": { "source": "params.avg_duration > 200" } } } } } }}
Response
{"Took": 137, "timed_out": false, "_shards": {"Total": 5, "successful": 5, "skipped" : 0, "failed": 0}, "hits": {"Total": 255, "Max_score": 0, "hits": []}, "aggregations": {"video": { "Buckets": [{"Key": 5417499764062, "Doc_count": 91576, "avg_duration": { "Value": 103}, {"Key": 19913672446898144, "Doc_count": 15771, "av G_duration ": {" value ": 197}}]}}