An urgent task came up tonight: I needed to run some data statistics against an es (Elasticsearch) database. When I originally set up this es instance it was bundled straight into the rest of the stack, so I never really figured out the DSL syntax. This is a good opportunity to get a general understanding of this now widely used database.
This post covers bucket aggregations in es, mainly the aggs clause, and along the way introduces a few basic DSL concepts. Time was limited, so mistakes are inevitable; please go easy on me.
Basic concepts in es
Here I will go over some basic concepts in es, including index, type, aggs, query and so on. Concepts alone are hard to grasp, so I will explain them against a concrete example.
A single JSON document looks like this:
{
  "_index": "logstash-2019.03.21",
  "_type": "Cowrie",
  "_id": "AWmf-pxufLcCNFbYf3vg",
  "_version": 1,
  "_score": null,
  "_source": {
    "eventid": "cowrie.login.failed",
    "t-pot_hostname": "structuralwallet",
    "geoip": {
      "timezone": "Asia/Singapore",
      "ip": "180.255.15.211",
      "latitude": 1.2854999999999999,
      "continent_code": "AS",
      "as_org": "SINGTEL MOBILE INTERNET SERVICE PROVIDER Singapore",
      "city_name": "Singapore",
      "country_name": "Singapore",
      "country_code2": "SG",
      "country_code3": "SG",
      "region_name": "Central Singapore Community Development Council",
      "location": {
        "lon": 103.8565,
        "lat": 1.2854999999999999
      },
      "asn": 45143,
      "region_code": "01",
      "longitude": 103.8565
    },
    "session": "224ecd858279",
    "t-pot_ip_int": "172.24.106.97",
    "message": "login attempt [root/88888888] failed",
    "type": "Cowrie",
    "src_ip": "180.255.15.211",
    "t-pot_ip_ext": "39.104.64.89",
    "path": "/data/cowrie/log/cowrie.json",
    "password": "88888888",
    "system": "CowrieTelnetTransport,1830,180.255.15.211",
    "isError": 0,
    "@timestamp": "2019-03-21T11:19:55.102Z",
    "@version": "1",
    "host": "8929fa3b68bd",
    "sensor": "33618e9c9386",
    "username": "root",
    "timestamp": "2019-03-21T11:19:55.102483Z"
  },
  "fields": {
    "@timestamp": [
      1553167195102
    ],
    "timestamp": [
      1553167195102
    ]
  },
  "sort": [
    1553167195102
  ]
}
Hmm, those in the know can probably tell what generated this data, so I won't go into detail here.
As you can see, the document above is a nested JSON object: the actual log fields such as src_ip, username and password live under _source, while _index, _type and _id identify where the document is stored.
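To make these concepts concrete, here is a minimal DSL request against this kind of data, written in Kibana console syntax like the rest of this post. It is only a sketch: it assumes the default logstash dynamic mapping, where string fields such as eventid and src_ip get a .keyword sub-field that can be filtered and aggregated on exactly.

# query filters the documents, aggs summarises them;
# "size": 1 limits how many raw hits come back alongside the aggregation
GET logstash-2019.03.21/Cowrie/_search
{
  "size": 1,
  "query": {
    "term": { "eventid.keyword": "cowrie.login.failed" }
  },
  "aggs": {
    "by_src_ip": {
      "terms": { "field": "src_ip.keyword", "size": 10 }
    }
  }
}

Roughly speaking, the index (logstash-2019.03.21) plays the role of a database, the type (Cowrie) that of a table, query picks out documents, and aggs computes statistics over whatever the query matched.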
The aggregation itself
Below is a concrete look at the requirement and how it was implemented.
Task requirement
The statistics boil down to bucketing the Cowrie records by the attacker's source IP and counting how many log entries each IP produced.
Code for the aggregation
GET _all/Cowrie/_search
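Only the request line survived in my notes here; the body of the request is not shown. Judging from the response below, it contained a terms aggregation named IP on the source address, a cardinality sub-aggregation named distinct_IP, and a metric named sum_of_rul. The sketch below is a reconstruction along those lines, not the original query: the field names src_ip.keyword and isError are assumptions taken from the sample document, and the field actually behind sum_of_rul cannot be recovered from the output.

# a reconstruction of what the request body may have looked like
GET _all/Cowrie/_search
{
  "size": 1,
  "aggs": {
    "IP": {
      "terms": { "field": "src_ip.keyword", "size": 100 },
      "aggs": {
        "distinct_IP": {
          "cardinality": { "field": "src_ip.keyword" }
        }
      }
    },
    "sum_of_rul": {
      "sum": { "field": "isError" }
    }
  }
}

A terms aggregation like IP produces one bucket per distinct value of the field, which is why every bucket's distinct_IP comes back as 1 in the result below.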
The result of the aggregation:
{
  "took": 2805,
  "timed_out": false,
  "_shards": {
    "total": 93,
    "successful": 93,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1158960,
    "max_score": 1,
    "hits": [
      {
        "_index": "logstash-2018.12.21",
        "_type": "Cowrie",
        "_id": "AWfOtwlqs4bscJpC8YlL",
        "_score": 1
      }
    ]
  },
  "aggregations": {
    "IP": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "193.112.88.114",
          "doc_count": 82161,
          "distinct_IP": {
            "value": 1
          }
        },
        {
          "key": "124.23.134.142",
          "doc_count": 29807,
          "distinct_IP": {
            "value": 1
          }
        }
      ]
    },
    "sum_of_rul": {
      "value": 8036
    }
  }
}
I have trimmed a lot of repetitive content from the result: hits actually lists many sample JSON documents, and aggregations contains many more buckets than the two shown here.
A few of the fields in the result deserve an explanation:
- took, timed_out: how many milliseconds the request took on the server side, and whether it hit its time limit.
- hits: the documents that matched the query; total is the overall number of matches (1,158,960 here) and the inner hits array lists a sample of them.
- doc_count_error_upper_bound: an upper bound on how far off the doc_count of any returned bucket could be, since each shard only reports its own top terms (see the sketch after this list).
- sum_other_doc_count: the number of matching documents that fell outside the buckets returned in the response.
- buckets: the groups themselves; each bucket here is one source IP (key) together with the number of log entries it produced (doc_count) and the distinct_IP sub-aggregation.
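Both doc_count_error_upper_bound and sum_other_doc_count are 0 in the result above, so the per-IP counts are exact. When they are not, the terms aggregation is only approximate, because each shard reports just its own top terms. A common way to tighten the counts (again only a sketch, assuming the src_ip.keyword field) is to raise shard_size and ask for per-term error bounds:

GET _all/Cowrie/_search
{
  "size": 0,
  "aggs": {
    "IP": {
      "terms": {
        "field": "src_ip.keyword",
        "size": 100,
        "shard_size": 1000,
        "show_term_doc_count_error": true
      }
    }
  }
}

Setting "size": 0 at the top level drops the raw hits entirely, which is usually what you want when only the aggregation matters.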
Reference links
Elasticsearch: The Definitive Guide (Chinese translation)
https://es.xiaoleilu.com/
Elasticsearch Aggregations: counting the number of keys in buckets
https://blog.csdn.net/greenappple/article/details/79728395
Elasticsearch: The Definitive Guide (official Chinese edition)
https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html
Elasticsearch: partially inaccurate aggregation results
https://www.jianshu.com/p/f650f76f21e2