Ganglia is a monitoring system open-sourced by UC Berkeley, designed from the start for monitoring distributed clusters. It covers both the resource level (CPU, memory, disk, I/O, network load, and so on) and the business level: because users can add custom metrics very easily, it can also track things like service performance, load, and error rates, for example a web service's QPS or its rate of HTTP error statuses. In addition, if integrated with Nagios, it can trigger alerts when a metric crosses a threshold.

Ganglia's advantage over Zabbix is that its collection agent (gmond) imposes very little overhead on the client machine, so it does not affect the performance of the services being monitored.

Ganglia has a few main modules:

  • gmond: deployed on every monitored machine; periodically collects metrics and sends them out via multicast or unicast.
  • gmetad: deployed on the server side; periodically pulls the data collected by gmond from the hosts listed in its configured data_source entries (a minimal example follows this list).
  • ganglia-web: renders the collected monitoring data as web pages.
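
For reference, gmetad discovers which hosts to poll through data_source lines in gmetad.conf. The sketch below is a minimal example; the cluster name matches the test_cluster used later in this article, while the polling interval and node names are placeholders, not values from the original text:

# /etc/ganglia/gmetad.conf
# data_source "<cluster name>" [polling interval in seconds] <host:port> ...
data_source "test_cluster" 15 node01:8649 node02:8649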

This article does not cover installing Ganglia; for that, see: http://www.it165.net/admin/html/201302/770.html

Instead, it focuses on how to develop custom metrics, so you can monitor the indicators you care about.

There are three main approaches:

1. Use gmetric directly

Every machine that has gmond installed also gets /usr/bin/gmetric, a command-line tool that broadcasts a metric's name, value, and related attributes, for example:

/usr/bin/gmetric -c /etc/ganglia/gmond.conf --name=test --type=int32 --units=sec --value=2    
For the full list of gmetric options, see: http://manpages.ubuntu.com/manpages/hardy/man1/gmetric.1.html


Besides invoking gmetric on the command line, you can also use bindings for common languages such as Go, Ruby, Java, and Python; they are all available on GitHub and just need to be imported:

Go      https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-go

Ruby    https://github.com/igrigorik/gmetric/blob/master/lib/gmetric.rb

Java    https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-java

Python  https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-python
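
If you would rather not pull in a binding at all, a thin wrapper around the gmetric binary is often enough. The following is a minimal sketch, assuming /usr/bin/gmetric is installed as described above and gmond.conf is at its usual path; it uses only the gmetric flags shown in the earlier example:

#!/usr/bin/env python
# Minimal sketch: publish a metric by shelling out to the gmetric CLI.
# Assumes /usr/bin/gmetric exists and gmond.conf is at the path below.
import subprocess

GMETRIC = '/usr/bin/gmetric'
GMOND_CONF = '/etc/ganglia/gmond.conf'

def publish(name, value, mtype='int32', units=''):
    '''Send one metric sample via gmetric.'''
    cmd = [GMETRIC,
           '-c', GMOND_CONF,
           '--name=%s' % name,
           '--type=%s' % mtype,
           '--units=%s' % units,
           '--value=%s' % value]
    subprocess.check_call(cmd)

if __name__ == '__main__':
    # equivalent to the command-line example above
    publish('test', 2, mtype='int32', units='sec')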

2. Use third-party tools built on gmetric

This article takes ganglia-logtailer as the example: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer

The tool builds on the logtail (Debian) / logcheck (CentOS) package to tail a log file periodically. You pass a class name on the command line, and the corresponding class parses the log, computes custom metrics from the fields you care about, and broadcasts them via gmetric.

For example, to match our own service's nginx log format, we modified NginxLogtailer.py as follows:

# -*- coding: utf-8 -*-
###
###  This plugin for logtailer will crunch nginx logs and produce these metrics:
###    * hits per second
###    * GETs per second
###    * average query processing time
###    * ninetieth percentile query processing time
###    * number of HTTP 200, 300, 400, and 500 responses per second
###
###  Note that this plugin depends on a certain nginx log format, documented in
##   __init__.
import time
import threading
import re
# local dependencies
from ganglia_logtailer_helper import GangliaMetricObject
from ganglia_logtailer_helper import LogtailerParsingException, LogtailerStateException
class NginxLogtailer(object):
    # only used in daemon mode
    period = 30

    def __init__(self):
        '''This function should initialize any data structures or variables
        needed for the internal state of the line parser.'''
        self.reset_state()
        self.lock = threading.RLock()
        # this is what will match the nginx lines
        # log_format ganglia-logtailer
        #    '$host '
        #    '$server_addr '
        #    '$remote_addr '
        #    '- '
        #    '"$time_iso8601" '
        #    '$status '
        #    '$body_bytes_sent '
        #    '$request_time '
        #    '"$http_referer" '
        #    '"$request" '
        #    '"$http_user_agent" '
        #    '$pid';
        # NOTE: nginx 0.7 doesn't support $time_iso8601, use $time_local instead
        # original apache log format string:
        # %v %A %a %u %{%Y-%m-%dT%H:%M:%S}t %c %s %>s %B %D \"%{Referer}i\" \"%r\" \"%{User-Agent}i\" %P
        # host.com 127.0.0.1 127.0.0.1 - "2008-05-08T07:34:44" - 200 200 371 103918 - "-" "GET /path HTTP/1.0" "-" 23794
        # match keys: server_name, local_ip, remote_ip, date, status, size,
        #               req_time, referrer, request, user_agent, pid
        self.reg = re.compile('^(?P<remote_ip>[^ ]+) (?P<server_name>[^ ]+) '
                              '(?P<hit>[^ ]+) \[(?P<date>[^\]]+)\] '
                              '"(?P<request>[^"]+)" (?P<status>[^ ]+) '
                              '(?P<size>[^ ]+) "(?P<referrer>[^"]+)" '
                              '"(?P<user_agent>[^"]+)" "(?P<forward_to>[^"]+)" '
                              '"(?P<req_time>[^"]+)"')
        # assume we're in daemon mode unless set_check_duration gets called
        self.dur_override = False

    # example function for parse line
    # takes one argument (text) line to be parsed
    # returns nothing
    def parse_line(self, line):
        '''This function should digest the contents of one line at a time,
        updating the internal state variables.'''
        self.lock.acquire()
        try:
            regMatch = self.reg.match(line)
            if regMatch:
                linebits = regMatch.groupdict()
                if '-' == linebits['request'] or 'file2get' in linebits['request']:
                    self.lock.release()
                    return
                self.num_hits += 1
                # capture GETs
                if 'GET' in linebits['request']:
                    self.num_gets += 1
                # capture HTTP response code
                rescode = float(linebits['status'])
                if (rescode >= 200) and (rescode < 300):
                    self.num_two += 1
                elif (rescode >= 300) and (rescode < 400):
                    self.num_three += 1
                elif (rescode >= 400) and (rescode < 500):
                    self.num_four += 1
                elif (rescode >= 500) and (rescode < 600):
                    self.num_five += 1
                # capture request duration
                dur = float(linebits['req_time'])
                self.req_time += dur
                # store for 90th % calculation
                self.ninetieth.append(dur)
            else:
                raise LogtailerParsingException, "regmatch failed to match"
        except Exception, e:
            self.lock.release()
            raise LogtailerParsingException, "regmatch or contents failed with %s" % e
        self.lock.release()

    # example function for deep copy
    # takes no arguments
    # returns one object
    def deep_copy(self):
        '''This function should return a copy of the data structure used to
        maintain state.  This copy should be different from the object that is
        currently being modified so that the other thread can deal with it
        without fear of it changing out from under it.  The format of this
        object is internal to the plugin.'''
        myret = dict( num_hits=self.num_hits,
                      num_gets=self.num_gets,
                      req_time=self.req_time,
                      num_two=self.num_two,
                      num_three=self.num_three,
                      num_four=self.num_four,
                      num_five=self.num_five,
                      ninetieth=self.ninetieth )
        return myret

    # example function for reset_state
    # takes no arguments
    # returns nothing
    def reset_state(self):
        '''This function resets the internal data structure to 0 (saving
        whatever state it needs).  This function should be called
        immediately after deep copy with a lock in place so the internal
        data structures can't be modified in between the two calls.  If the
        time between calls to get_state is necessary to calculate metrics,
        reset_state should store now() each time it's called, and get_state
        will use the time since that now() to do its calculations'''
        self.num_hits = 0
        self.num_gets = 0
        self.req_time = 0
        self.num_two = 0
        self.num_three = 0
        self.num_four = 0
        self.num_five = 0
        self.ninetieth = list()
        self.last_reset_time = time.time()

    # example for keeping track of runtimes
    # takes no arguments
    # returns float number of seconds for this run
    def set_check_duration(self, dur):
        '''This function only used if logtailer is in cron mode.  If it is
        invoked, get_check_duration should use this value instead of
        calculating it.'''
        self.duration = dur
        self.dur_override = True

    def get_check_duration(self):
        '''This function should return the time since the last check.  If
        called from cron mode, this must be set using set_check_duration().
        If in daemon mode, it should be calculated internally.'''
        if self.dur_override:
            duration = self.duration
        else:
            cur_time = time.time()
            duration = cur_time - self.last_reset_time
            # the duration should be within 10% of period
            acceptable_duration_min = self.period - (self.period / 10.0)
            acceptable_duration_max = self.period + (self.period / 10.0)
            if (duration < acceptable_duration_min or duration > acceptable_duration_max):
                raise LogtailerStateException, "time calculation problem - duration (%s) > 10%% away from period (%s)" % (duration, self.period)
        return duration

    # example function for get_state
    # takes no arguments
    # returns a dictionary of (metric => metric_object) pairs
    def get_state(self):
        '''This function should acquire a lock, call deep copy, get the
        current time if necessary, call reset_state, then do its
        calculations.  It should return a list of metric objects.'''
        # get the data to work with
        self.lock.acquire()
        try:
            mydata = self.deep_copy()
            check_time = self.get_check_duration()
            self.reset_state()
            self.lock.release()
        except LogtailerStateException, e:
            # if something went wrong with deep_copy or the duration, reset and continue
            self.reset_state()
            self.lock.release()
            raise e

        # crunch data to how you want to report it
        hits_per_second = mydata['num_hits'] / check_time
        gets_per_second = mydata['num_gets'] / check_time
        if mydata['num_hits'] != 0:
            avg_req_time = mydata['req_time'] / mydata['num_hits']
        else:
            avg_req_time = 0
        two_per_second = mydata['num_two'] / check_time
        three_per_second = mydata['num_three'] / check_time
        four_per_second = mydata['num_four'] / check_time
        five_per_second = mydata['num_five'] / check_time

        # calculate 90th % request time
        ninetieth_list = mydata['ninetieth']
        ninetieth_list.sort()
        num_entries = len(ninetieth_list)
        if num_entries != 0:
            ninetieth_element = ninetieth_list[int(num_entries * 0.9)]
        else:
            ninetieth_element = 0

        # package up the data you want to submit
        hps_metric = GangliaMetricObject('nginx_hits', hits_per_second, units='hps')
        gps_metric = GangliaMetricObject('nginx_gets', gets_per_second, units='hps')
        avgdur_metric = GangliaMetricObject('nginx_avg_dur', avg_req_time, units='sec')
        ninetieth_metric = GangliaMetricObject('nginx_90th_dur', ninetieth_element, units='sec')
        twops_metric = GangliaMetricObject('nginx_200', two_per_second, units='hps')
        threeps_metric = GangliaMetricObject('nginx_300', three_per_second, units='hps')
        fourps_metric = GangliaMetricObject('nginx_400', four_per_second, units='hps')
        fiveps_metric = GangliaMetricObject('nginx_500', five_per_second, units='hps')
        # return a list of metric objects
        return [ hps_metric, gps_metric, avgdur_metric, ninetieth_metric,
                 twops_metric, threeps_metric, fourps_metric, fiveps_metric ]
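When adapting the class to a different log format, the part that has to change is the regular expression built in __init__, which must mirror your nginx log_format directive field for field (here it additionally captures forward_to and req_time as the last two quoted fields); the counting in parse_line and the calculations in get_state are format-independent and can stay as they are.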

After deploying ganglia-logtailer on the monitored machine, create a cron job with a line like the following:

*/1 * * * * root   /usr/local/bin/ganglia-logtailer --classname NginxLogtailer --log_file /usr/local/nginx-video/logs/access.log  --mode cron --gmetric_options '-C test_cluster -g nginx_status'
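
In cron mode, ganglia-logtailer measures the interval between runs itself and hands it to the parser through set_check_duration(), which is why the class above keeps the dur_override flag; in daemon mode the duration is instead derived internally from last_reset_time.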

Reload the crond service; after about a minute, the corresponding metrics show up in the ganglia web UI.

For details on deploying ganglia-logtailer, see: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer

3. Write your own module in a supported language; this article uses Python as the example

Ganglia supports user-written Python modules; the brief introduction from GitHub reads:

Writing a Python module is very simple. You just need to write it following a template and put the resulting Python module (.py) in /usr/lib(64)/ganglia/python_modules.

A corresponding Python Configuration (.pyconf) file needs to reside in /etc/ganglia/conf.d/.

For example, here is a sample Python module that checks the machine's temperature:

acpi_file = "/proc/acpi/thermal_zone/THRM/temperature"

def temp_handler(name):
    try:
        f = open(acpi_file, 'r')
    except IOError:
        return 0

    for l in f:
        line = l.split()

    return int(line[1])

def metric_init(params):
    global descriptors, acpi_file

    if 'acpi_file' in params:
        acpi_file = params['acpi_file']

    d1 = {'name': 'temp',
          'call_back': temp_handler,
          'time_max': 90,
          'value_type': 'uint',
          'units': 'C',
          'slope': 'both',
          'format': '%u',
          'description': 'Temperature of host',
          'groups': 'health'}

    descriptors = [d1]
    return descriptors

def metric_cleanup():
    '''Clean up the metric module.'''
    pass

# This code is for debugging and unit testing
if __name__ == '__main__':
    metric_init({})
    for d in descriptors:
        v = d['call_back'](d['name'])
        print 'value for %s is %u' % (d['name'], v)
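
Thanks to the __main__ block at the end, the module can be sanity-checked standalone before gmond ever loads it: running the file directly with the Python interpreter on the target host prints a line of the form "value for temp is ..." (assuming the ACPI temperature file exists on that machine).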

Alongside the module file, you also need a matching configuration file (placed at /etc/ganglia/conf.d/temp.pyconf), in the following format:

modules {
  module {
    name = "temp"
    language = "python"
    # The following params are examples only
    #  They are not actually used by the temp module
    param RandomMax {
      value = 600
    }
    param ConstantValue {
      value = 112
    }
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "temp"
    title = "Temperature"
    value_threshold = 70
  }
}

With these two files in place, the module has been added successfully.
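
To double-check that gmond has picked the module up, restart gmond and inspect the XML state dump it serves. The sketch below is one way to do this, assuming gmond's default tcp_accept_channel on port 8649 is reachable; the host and metric name are just the values from this example:

#!/usr/bin/env python
# Sketch: check whether gmond reports a given metric, assuming the default
# tcp_accept_channel on port 8649 (gmond dumps its state as XML on connect).
import socket

def gmond_reports(metric, host='localhost', port=8649):
    s = socket.create_connection((host, port))
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    xml = b''.join(chunks)
    # gmond emits <METRIC NAME="..." ...> elements in its XML dump
    return ('NAME="%s"' % metric).encode() in xml

if __name__ == '__main__':
    print(gmond_reports('temp'))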

For many more user-contributed modules, see https://github.com/ganglia/gmond_python_modules

It includes modules for common services such as elasticsearch, filecheck, nginx_status, and MySQL; they are very practical and usually need only slight modification to fit your own needs.

Some other useful user-contributed tools:
  • ganglia-alert: reads gmetad data and raises alerts, https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-alert
  • ganglia-docker: runs ganglia inside Docker, https://github.com/ganglia/ganglia_contrib/tree/master/docker
  • gmetad-health-check: watches the gmetad service and restarts it if it goes down, https://github.com/ganglia/ganglia_contrib/tree/master/gmetad_health_checker
  • chef-ganglia: deploys ganglia with Chef, https://github.com/ganglia/chef-ganglia
  • ansible-ganglia: automates ganglia deployment with Ansible, https://github.com/remysaissy/ansible-ganglia
  • ganglia-nagios: integrates Nagios with ganglia, https://github.com/ganglia/ganglios
  • ganglia-api: exposes a REST API that returns the data gmetad has collected in a specific format, https://github.com/guardian/ganglia-api

If you have questions, feel free to leave a comment.
