Don't Fear the Filesystem!

Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow", which makes people skeptical that a persistent structure can offer competitive performance. In fact, disks are both much slower and much faster than people expect, depending on how they are used, and a properly designed disk structure can often be as fast as the network.

The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result, the performance of linear writes on a JBOD configuration with a six-disk 7200rpm SATA RAID-5 array is about 600MB/sec, but the performance of random writes is only about 100k/sec, a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in an ACM Queue article, which actually finds that sequential disk access can in some cases be faster than random memory access!
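The "over 6000X" figure follows directly from the two throughput numbers quoted above. The short calculation below reproduces it, assuming that "100k/sec" means 100 KB/sec and that 1 MB is 1024 KB; both unit choices are assumptions, since the text does not spell them out.

```python
# Throughput figures quoted in the text. The units are an assumption:
# "100k/sec" is read as 100 KB/sec, and 1 MB is taken as 1024 KB.
LINEAR_WRITE_MB_PER_SEC = 600
RANDOM_WRITE_KB_PER_SEC = 100

linear_kb_per_sec = LINEAR_WRITE_MB_PER_SEC * 1024
ratio = linear_kb_per_sec / RANDOM_WRITE_KB_PER_SEC

print(f"linear writes are about {ratio:.0f}x faster than random writes")
```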

To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching, with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in the OS pagecache, effectively storing everything twice.

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:

  1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
  2. Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

As a result of these factors, using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure. We at least double the available cache by having automatic access to all free memory, and likely double it again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code, as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads, then read-ahead is effectively pre-populating this cache with useful data on each disk read.
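The gap between "individual objects" and "a compact byte structure" can be illustrated outside the JVM as well. The sketch below is a CPython analogue, not a measurement of Java object overhead: the record layout (two 64-bit integers per message) is hypothetical, and `sys.getsizeof` reports CPython-specific sizes.

```python
import struct
import sys

# Hypothetical records: (message id, timestamp) pairs that fit in 64 bits.
records = [(i + 10**12, 1_700_000_000_000 + i) for i in range(1000)]

# Object form: a list of tuples, each holding two separately boxed integers.
object_bytes = sys.getsizeof(records) + sum(
    sys.getsizeof(rec) + sum(sys.getsizeof(field) for field in rec)
    for rec in records
)

# Compact form: every record packed as two big-endian signed 64-bit values.
packed = b"".join(struct.pack(">qq", *rec) for rec in records)
packed_bytes = len(packed)

print(f"as objects: {object_bytes} bytes, packed: {packed_bytes} bytes")
```

The packed form costs exactly 16 bytes per record, while the object form also pays for per-object headers and pointers, which is the same kind of overhead the list above attributes to Java objects.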

This suggests a design which is very simple: rather than maintain as much as possible in memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect, this just means that it is transferred into the kernel's pagecache.
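The difference between "written to the filesystem" and "flushed to disk" can be sketched in a few lines. This is a minimal illustration, not Kafka's implementation: `append_record` and its `force_to_disk` parameter are hypothetical names. `flush()` hands the bytes to the kernel, where they sit in the pagecache, and only `os.fsync` asks the OS to push them to the storage device.

```python
import os
import tempfile


def append_record(log_file, payload: bytes, force_to_disk: bool = False) -> None:
    """Append payload to the log; optionally force it to the device."""
    log_file.write(payload)
    log_file.flush()  # user-space buffer -> kernel pagecache
    if force_to_disk:
        os.fsync(log_file.fileno())  # pagecache -> storage device


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.log")
    with open(path, "ab") as log_file:
        append_record(log_file, b"first message\n")  # pagecache only
        append_record(log_file, b"second message\n", force_to_disk=True)
    with open(path, "rb") as reader:
        contents = reader.read()
```

Both records are visible to readers as soon as they reach the pagecache; the `fsync` call only affects when they become durable on the device.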

This style of pagecache-centric design is described in an article on the design of Varnish (along with a healthy dose of arrogance).

Constant Time Suffices

The persistent data structure used in messaging systems is often a per-consumer queue with an associated BTree or other general-purpose random access data structure to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time, so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache, i.e. doubling your data makes things much worse than twice as slow.
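A rough model makes the seek cost concrete. The sketch below is a simplification under stated assumptions: the fanout of 100 and the number of cached tree levels are hypothetical, every uncached level is charged one 10 ms seek (the figure quoted above), and cached levels are treated as free.

```python
SEEK_MS = 10  # per-seek cost quoted in the text


def btree_height(num_keys: int, fanout: int) -> int:
    """Number of levels a tree with this fanout needs to index num_keys keys."""
    height, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height


def lookup_ms(num_keys: int, fanout: int, cached_levels: int) -> int:
    """Seek time for one lookup when only the top cached_levels are in memory."""
    uncached_levels = max(0, btree_height(num_keys, fanout) - cached_levels)
    return uncached_levels * SEEK_MS


for num_keys in (1_000_000, 1_000_000_000):
    cost = lookup_ms(num_keys, fanout=100, cached_levels=2)
    print(f"{num_keys:>13,} keys: {cost} ms per lookup")
```

With a fixed cache, the per-lookup cost in this model grows as the data grows, and because a disk serves one seek at a time, a 30 ms lookup caps that disk at roughly 33 lookups per second.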

Intuitively, a persistent queue could be built on simple reads and appends to files, as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages, since the performance is completely decoupled from the data size: one server can now take full advantage of a number of cheap, low-rotational-speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.
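A file-backed queue of this kind can be sketched briefly. This is an illustration of the idea, not Kafka's log format: the `AppendOnlyLog` class and its 4-byte length prefix are assumptions made for the example. An append is a single write at the end of the file, and a read is a single positioned read through its own file handle, so neither depends on how much data the log already holds.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix (an assumption)


class AppendOnlyLog:
    """Length-prefixed records in one file; a record's byte position is its offset."""

    def __init__(self, path: str):
        self._path = path
        self._writer = open(path, "ab")
        self._end = os.path.getsize(path)  # where the next record will start

    def append(self, payload: bytes) -> int:
        """Write one record at the end of the file and return its offset."""
        offset = self._end
        self._writer.write(HEADER.pack(len(payload)) + payload)
        self._writer.flush()
        self._end += HEADER.size + len(payload)
        return offset

    def read(self, offset: int):
        """Return (payload, next_offset) using an independent read handle."""
        with open(self._path, "rb") as reader:
            reader.seek(offset)
            (length,) = HEADER.unpack(reader.read(HEADER.size))
            return reader.read(length), offset + HEADER.size + length

    def close(self) -> None:
        self._writer.close()


with tempfile.TemporaryDirectory() as tmp:
    log = AppendOnlyLog(os.path.join(tmp, "queue.log"))
    first = log.append(b"hello")
    second = log.append(b"world")
    payload, next_offset = log.read(first)
    log.close()
```

Because readers open their own handles and never modify the file, a read does not have to wait for the writer or for other readers.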

Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.
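The retention policy described above can be sketched as follows. This is an in-memory toy rather than Kafka's on-disk mechanism: `RetainedLog`, its methods, and the explicit `now` argument are all hypothetical. The point is that reading never deletes anything, so a consumer can re-read from any offset it likes, and messages only disappear when they age out of the retention window.

```python
from collections import deque

RETENTION_SECONDS = 7 * 24 * 60 * 60  # "say a week"


class RetainedLog:
    def __init__(self):
        self._entries = deque()  # (offset, timestamp, payload), oldest first
        self._next_offset = 0

    def append(self, payload: bytes, now: float) -> None:
        self._entries.append((self._next_offset, now, payload))
        self._next_offset += 1

    def read_from(self, offset: int) -> list:
        """Reading removes nothing; each consumer tracks its own offset."""
        return [p for (o, _, p) in self._entries if o >= offset]

    def expire(self, now: float) -> None:
        """Drop entries older than the retention window, consumed or not."""
        while self._entries and now - self._entries[0][1] > RETENTION_SECONDS:
            self._entries.popleft()


log = RetainedLog()
log.append(b"a", now=0)
log.append(b"b", now=3600)

first_pass = log.read_from(0)
second_pass = log.read_from(0)  # a consumer can rewind and read again

log.expire(now=RETENTION_SECONDS + 1)  # only the first message has aged out
after_expiry = log.read_from(0)
```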
