When monitoring the internal performance of Zabbix, we usually use the following metrics to measure service performance:
NVPS, queue, update percent, process busy, pending send data, cache.
With the corresponding monitoring in place, Zabbix performance problems can be found effectively and optimized in a targeted way.
1. NVPS: the number of new values processed per second. The figure computed here is a theoretical value derived from the configured item intervals. SQL to obtain the value:
Whole cluster:
SELECT SUM(1.0/i.delay) AS qps
FROM items i, hosts h
WHERE i.status='0'
  AND i.hostid=h.hostid
  AND h.status='0'
  AND i.delay<>0;
Broken down by proxy:
SELECT h.proxy_hostid, SUM(1.0/i.delay) AS qps
FROM items i, hosts h
WHERE i.status='0'
  AND i.hostid=h.hostid
  AND h.status='0'
  AND i.delay<>0
  AND h.proxy_hostid IS NOT NULL
GROUP BY h.proxy_hostid;
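The arithmetic behind the two queries above can be sketched in a few lines of Python. This is only an illustration: the sample rows and the helper names `theoretical_nvps` and `nvps_by_proxy` are hypothetical, and the sketch assumes `delay` is a plain number of seconds, as the SQL does.

```python
from collections import defaultdict

# Hypothetical sample rows: (proxy_hostid or None, item delay in seconds).
ITEMS = [
    (None, 60),
    (None, 30),
    (10100, 60),
    (10100, 0),    # delay 0 is skipped, like i.delay<>0 in the SQL
    (10101, 120),
]

def theoretical_nvps(delays):
    """Sum 1/delay over all items with a non-zero delay."""
    return sum(1.0 / d for d in delays if d != 0)

def nvps_by_proxy(items):
    """Same sum grouped by proxy_hostid, skipping hosts without a proxy."""
    totals = defaultdict(float)
    for proxy_hostid, delay in items:
        if proxy_hostid is not None and delay != 0:
            totals[proxy_hostid] += 1.0 / delay
    return dict(totals)

cluster_nvps = theoretical_nvps(delay for _, delay in ITEMS)
per_proxy = nvps_by_proxy(ITEMS)
print(cluster_nvps, per_proxy)
```

An item polled every 60 s contributes 1/60 of a value per second, so many short-interval items dominate the total.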
2. Queue (data delay): for example, if the interval of an item is set to 60 s but the item is actually updated after about 70 s, the delay is 10 s. A larger queue value indicates that Zabbix has internal performance problems. The most common cause is busy poller and trapper processes. This is an internal check. You can create the following items:
zabbix[queue]
zabbix[queue,5m]
zabbix[queue,10m]
3. Update percent: measures how many item values are being updated. A low percentage indicates that data is delayed or that some agents are returning abnormal data.
1) The entire cluster:
select a.aa/b.bb from
  (select count(*) as aa from items
   where lastclock > UNIX_TIMESTAMP()-1800
     and delay < 900
     and hostid in (select hostid from hosts where status=0)
     and status = 0
  ) a,
  (select count(*) as bb from items
   where delay < 900
     and status = 0
     and hostid in (select hostid from hosts where status=0)
  ) b;
2) Per proxy:
select a.aa/b.bb from
  (select count(*) as aa from items
   where lastclock > UNIX_TIMESTAMP()-1800
     and delay < 900
     and hostid in (select hostid from hosts where status=0 and proxy_hostid = 10100)
     and status = 0
  ) a,
  (select count(*) as bb from items
   where delay < 900
     and status = 0
     and hostid in (select hostid from hosts where status=0 and proxy_hostid = 10100)
  ) b;
Here proxy_hostid is the ID of the corresponding proxy (10100 in this example).
3) Per host: this locates the hosts whose value updates are abnormal (more accurate than the unreachable alarm):
select b.hostname, c.ip, a.update_percent as uppercent from
  (select a.hostid, round(a.aa*100/b.bb, 2) as update_percent from
     (select hostid, count(*) as aa from items
      where lastclock > UNIX_TIMESTAMP()-1800
        and delay < 900
        and hostid in (select hostid from hosts where status=0)
        and status = 0
      group by hostid
     ) a,
     (select hostid, count(*) as bb from items
      where delay < 900
        and status = 0
        and hostid in (select hostid from hosts where status=0)
      group by hostid
     ) b
   where a.hostid=b.hostid
  ) a,
  (select hostid, lower(host) as hostname from hosts where status=0) b,
  (select hostid, ip from interface where type='1') c
where a.hostid=b.hostid and b.hostid=c.hostid
having (a.update_percent) < 80
order by uppercent;
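To make the update-percent calculation concrete, here is a minimal Python sketch of the same per-host logic. It uses the 1800 s window, the 900 s delay cutoff, and the 80 percent threshold from the queries above. The sample rows and the function names are hypothetical, and, unlike the join in the SQL, the sketch also reports a host with no recent values at all, as 0 percent.

```python
WINDOW = 1800     # seconds, matches UNIX_TIMESTAMP()-1800
MAX_DELAY = 900   # only items with delay < 900 are counted
THRESHOLD = 80    # report hosts below 80 percent

def update_percent_by_host(items, now):
    """items: iterable of (hostid, delay, lastclock). Returns {hostid: percent}."""
    total = {}
    fresh = {}
    for hostid, delay, lastclock in items:
        if delay >= MAX_DELAY:
            continue  # long-interval items are excluded from both counts
        total[hostid] = total.get(hostid, 0) + 1
        if lastclock > now - WINDOW:
            fresh[hostid] = fresh.get(hostid, 0) + 1
    return {h: round(fresh.get(h, 0) * 100 / n, 2) for h, n in total.items()}

def lagging_hosts(percents, threshold=THRESHOLD):
    """Hosts below the threshold, lowest percentage first."""
    return sorted((p, h) for h, p in percents.items() if p < threshold)

now = 1_000_000  # fixed timestamp so the example is reproducible
SAMPLE = [
    (1, 60, now - 30),
    (1, 60, now - 45),
    (2, 60, now - 20),
    (2, 60, now - 4000),    # stale: older than the window
    (2, 300, now - 5000),   # stale: older than the window
    (2, 3600, now - 10),    # ignored: delay >= 900
]
percents = update_percent_by_host(SAMPLE, now)
lagging = lagging_hosts(percents)
print(percents, lagging)
```

In this sample host 1 has all of its counted items updated within the window, while host 2 has only one of three, so only host 2 is reported.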
4. Busy rate of internal processes: monitoring how busy the Zabbix worker processes are quickly locates internal performance bottlenecks. These are internal checks. For example: zabbix[process,housekeeper,avg,busy], zabbix[process,http poller,avg,busy], zabbix[process,poller,avg,busy], etc.
5. Pending send data of the proxy: measures data transmission from the proxy to the server. The smaller the value, the faster the data is being transmitted.
SQL to obtain the value:
SELECT ((SELECT MAX(proxy_history.id) FROM proxy_history) - nextid)
FROM ids
WHERE field_name='history_lastid';
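The query above is a single subtraction: the highest id stored in proxy_history minus the id recorded under history_lastid. A small Python sketch of that arithmetic follows. The function name and the sample numbers are hypothetical, and returning 0 for an empty table is a choice made for this sketch (the SQL itself would return NULL in that case).

```python
def pending_rows(max_history_id, nextid):
    """MAX(proxy_history.id) - ids.nextid, i.e. rows not yet sent to the server."""
    if max_history_id is None:   # empty proxy_history: MAX() yields NULL
        return 0                 # sketch choice; the SQL would return NULL
    return max_history_id - nextid

# Hypothetical values sampled at one moment on a proxy.
backlog = pending_rows(max_history_id=152_340, nextid=151_900)
print(backlog)
```

A value that keeps growing between samples suggests the proxy is collecting data faster than it can send it.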
6. Cache: internal check.
For example: zabbix[wcache,history,pfree], zabbix[wcache,text,pfree]