An urgent task came up tonight: I needed to run some data statistics against an es (Elasticsearch) database. When I originally set up this es instance it was bundled straight into the rest of the stack, so I never really figured out the DSL syntax. This is a good opportunity to get a general understanding of this now widely used database.
This post covers bucket aggregations in es, mainly the aggs clause, and along the way introduces a few basic DSL concepts. Time was limited, so mistakes are inevitable; please go easy on me.
Basic concepts in es
Here I will go over some basic concepts in es, including index, type, aggs, query and so on. Concepts alone are hard to grasp, so I will explain them against a concrete example.
A single JSON document looks like this:
{
  "_index": "logstash-2019.03.21",
  "_type": "Cowrie",
  "_id": "AWmf-pxufLcCNFbYf3vg",
  "_version": 1,
  "_score": null,
  "_source": {
    "eventid": "cowrie.login.failed",
    "t-pot_hostname": "structuralwallet",
    "geoip": {
      "timezone": "Asia/Singapore",
      "ip": "180.255.15.211",
      "latitude": 1.2854999999999999,
      "continent_code": "AS",
      "as_org": "SINGTEL MOBILE INTERNET SERVICE PROVIDER Singapore",
      "city_name": "Singapore",
      "country_name": "Singapore",
      "country_code2": "SG",
      "country_code3": "SG",
      "region_name": "Central Singapore Community Development Council",
      "location": {
        "lon": 103.8565,
        "lat": 1.2854999999999999
      },
      "asn": 45143,
      "region_code": "01",
      "longitude": 103.8565
    },
    "session": "224ecd858279",
    "t-pot_ip_int": "172.24.106.97",
    "message": "login attempt [root/88888888] failed",
    "type": "Cowrie",
    "src_ip": "180.255.15.211",
    "t-pot_ip_ext": "39.104.64.89",
    "path": "/data/cowrie/log/cowrie.json",
    "password": "88888888",
    "system": "CowrieTelnetTransport,1830,180.255.15.211",
    "isError": 0,
    "@timestamp": "2019-03-21T11:19:55.102Z",
    "@version": "1",
    "host": "8929fa3b68bd",
    "sensor": "33618e9c9386",
    "username": "root",
    "timestamp": "2019-03-21T11:19:55.102483Z"
  },
  "fields": {
    "@timestamp": [
      1553167195102
    ],
    "timestamp": [
      1553167195102
    ]
  },
  "sort": [
    1553167195102
  ]
}
Hmm, those in the know can probably tell what generated this data, so I won't go into detail here.
As you can see, the document above is a nested JSON object: the actual log fields such as src_ip, username and password live under _source, while _index, _type and _id identify where the document is stored.
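To make these concepts concrete, here is a minimal DSL request against this kind of data, written in Kibana console syntax like the rest of this post. It is only a sketch: it assumes the default logstash dynamic mapping, where string fields such as eventid and src_ip get a .keyword sub-field that can be filtered and aggregated on exactly.

# query filters the documents, aggs summarises them;
# "size": 1 limits how many raw hits come back alongside the aggregation
GET logstash-2019.03.21/Cowrie/_search
{
  "size": 1,
  "query": {
    "term": { "eventid.keyword": "cowrie.login.failed" }
  },
  "aggs": {
    "by_src_ip": {
      "terms": { "field": "src_ip.keyword", "size": 10 }
    }
  }
}

Roughly speaking, the index (logstash-2019.03.21) plays the role of a database, the type (Cowrie) that of a table, query picks out documents, and aggs computes statistics over whatever the query matched.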
The aggregation itself
Below is a concrete look at the requirement and how it was implemented.
Task requirement
The statistics boil down to bucketing the Cowrie records by the attacker's source IP and counting how many log entries each IP produced.
Code for the aggregation
GET _all/Cowrie/_search
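Only the request line survived in my notes here; the body of the request is not shown. Judging from the response below, it contained a terms aggregation named IP on the source address, a cardinality sub-aggregation named distinct_IP, and a metric named sum_of_rul. The sketch below is a reconstruction along those lines, not the original query: the field names src_ip.keyword and isError are assumptions taken from the sample document, and the field actually behind sum_of_rul cannot be recovered from the output.

# a reconstruction of what the request body may have looked like
GET _all/Cowrie/_search
{
  "size": 1,
  "aggs": {
    "IP": {
      "terms": { "field": "src_ip.keyword", "size": 100 },
      "aggs": {
        "distinct_IP": {
          "cardinality": { "field": "src_ip.keyword" }
        }
      }
    },
    "sum_of_rul": {
      "sum": { "field": "isError" }
    }
  }
}

A terms aggregation like IP produces one bucket per distinct value of the field, which is why every bucket's distinct_IP comes back as 1 in the result below.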
The result of the aggregation:
{
  "took": 2805,
  "timed_out": false,
  "_shards": {
    "total": 93,
    "successful": 93,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1158960,
    "max_score": 1,
    "hits": [
      {
        "_index": "logstash-2018.12.21",
        "_type": "Cowrie",
        "_id": "AWfOtwlqs4bscJpC8YlL",
        "_score": 1
      }
    ]
  },
  "aggregations": {
    "IP": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "193.112.88.114",
          "doc_count": 82161,
          "distinct_IP": {
            "value": 1
          }
        },
        {
          "key": "124.23.134.142",
          "doc_count": 29807,
          "distinct_IP": {
            "value": 1
          }
        }
      ]
    },
    "sum_of_rul": {
      "value": 8036
    }
  }
}
I have trimmed a lot of repetitive content from the result: hits actually lists many sample JSON documents, and aggregations contains many more buckets than the two shown here.
A few of the fields in the result deserve an explanation:
- took, timed_out: how many milliseconds the request took on the server side, and whether it hit its time limit.
- hits: the documents that matched the query; total is the overall number of matches (1,158,960 here) and the inner hits array lists a sample of them.
- doc_count_error_upper_bound: an upper bound on how far off the doc_count of any returned bucket could be, since each shard only reports its own top terms (see the sketch after this list).
- sum_other_doc_count: the number of matching documents that fell outside the buckets returned in the response.
- buckets: the groups themselves; each bucket here is one source IP (key) together with the number of log entries it produced (doc_count) and the distinct_IP sub-aggregation.
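Both doc_count_error_upper_bound and sum_other_doc_count are 0 in the result above, so the per-IP counts are exact. When they are not, the terms aggregation is only approximate, because each shard reports just its own top terms. A common way to tighten the counts (again only a sketch, assuming the src_ip.keyword field) is to raise shard_size and ask for per-term error bounds:

GET _all/Cowrie/_search
{
  "size": 0,
  "aggs": {
    "IP": {
      "terms": {
        "field": "src_ip.keyword",
        "size": 100,
        "shard_size": 1000,
        "show_term_doc_count_error": true
      }
    }
  }
}

Setting "size": 0 at the top level drops the raw hits entirely, which is usually what you want when only the aggregation matters.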
Reference links
Elasticsearch: The Definitive Guide (Chinese translation)
https://es.xiaoleilu.com/
Elasticsearch Aggregations: counting the number of keys in buckets
https://blog.csdn.net/greenappple/article/details/79728395
Elasticsearch: The Definitive Guide (official Chinese edition)
https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html
Elasticsearch: partially inaccurate aggregation results
https://www.jianshu.com/p/f650f76f21e2