Prometheus High Availability and Fault Tolerance Strategy, Long-Term Storage with VictoriaMetrics

The “Why” of This Article

Prometheus is a great tool for monitoring small, medium, and large infrastructures.


Prometheus, and the development team behind it, are focused on scraping metrics. It is a particularly good solution for short-term retention of metrics. Long-term retention is another story, unless Prometheus is used to collect only a small number of metrics. This is understandable: most of the time, when we investigate a problem using metrics scraped by Prometheus, we look at data no older than 10 days. But this is not always the case, especially when the statistics we are looking for are correlations between different periods, such as different weeks of a month or different months, or when we want to keep a historical summary.


In fact, Prometheus is perfectly able to collect metrics and store them even for a long time, but storage becomes extremely expensive because Prometheus needs fast storage. Prometheus is also not known as a solution that achieves High Availability (HA) and Fault Tolerance (FT) in a sophisticated way (as we are going to explain, there is a way, not so sophisticated, but it exists). In this article we explain how to achieve HA and FT for Prometheus, and why long-term storage of metrics is better achieved with another tool.


That said, over the past years many tools have started competing, and many are still competing, to solve these problems and others besides.


The common components of a Prometheus installation are:


Prometheus
Blackbox
Exporters
AlertManager
PushGateway

HA and FT of Prometheus

Prometheus can use federation (hierarchical and cross-service), which lets you configure a Prometheus instance to scrape selected metrics from other Prometheus instances (https://prometheus.io/docs/prometheus/latest/federation/). This kind of solution works well when you want to expose only a subset of selected metrics to tools like Grafana, or when you want to aggregate cross-functional metrics (for example, business metrics from one Prometheus and a subset of service metrics from another, federated one). This is perfectly fine and works in many use cases, but it satisfies neither High Availability nor Fault Tolerance: we are still talking about a subset of metrics, and if one of the Prometheus instances goes down, its metrics will not be collected during the downtime. Making Prometheus HA and FT must be done differently: the Prometheus project itself offers no native solution.
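
To make the federation setup more concrete, here is a minimal sketch of a scrape job on the federating instance, modeled on the example in the Prometheus federation documentation linked above. The target hostnames and the `match[]` selectors are assumptions chosen for illustration; they would need to be adapted to the real source instances and to the metrics you actually want to pull.

```yaml
# Fragment of prometheus.yml on the federating (global) instance.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    # Keep the labels exposed by the source instances instead of overwriting them.
    honor_labels: true
    # Federation endpoint exposed by every Prometheus server.
    metrics_path: '/federate'
    params:
      # Only series matching these selectors are pulled (illustrative selectors).
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          # Hypothetical source instances.
          - 'prometheus-business.example.internal:9090'
          - 'prometheus-services.example.internal:9090'
```

Because only the series matched by `match[]` are copied, the federating instance holds a subset of the data, which is why this setup alone does not provide HA or FT.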


Prometheus can achieve HA and FT in a very easy way, without the need for complex clusters or consensus strategies.


What we have to do is duplicate the same configuration file, prometheus.yml, on two different instances configured in the same manner, which scrape the same metrics from the same sources. The only difference is that instance A also monitors instance B, and vice versa. The good old concept of redundancy is easy to implement and solid, and if we use IaC (Infrastructure as Code, like Terraform) and a CM (Configuration Manager, like Ansible), it is also extremely easy to manage and maintain. You do not want to duplicate an extremely big and expensive instance; it is better to duplicate a small instance and keep only short-term metrics on it. This also makes the instances quick to recreate.
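
As an illustration of the redundant pair, the following is a minimal sketch of what prometheus.yml could look like on instance A. All hostnames, the job names, and the scrape interval are hypothetical; the shared jobs are meant to be byte-for-byte identical on both instances, and only the peer job differs.

```yaml
# prometheus.yml on instance A (all hostnames are hypothetical).
global:
  scrape_interval: 15s

scrape_configs:
  # Identical on both instances: the same sources are scraped twice,
  # once by A and once by B.
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app-1.example.internal:9100'
          - 'app-2.example.internal:9100'

  # The only part that differs between the two files: each instance
  # scrapes its peer. On instance B this target would be
  # 'prometheus-a.example.internal:9090'.
  - job_name: 'prometheus-peer'
    static_configs:
      - targets:
          - 'prometheus-b.example.internal:9090'
```

With a configuration management tool, the peer target could be the single templated variable, so both files are generated from one source.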
