Project GitHub repo: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

When you want to take some data, say objects or values of other types, and store it in a file or send it over the network, the problem you face is serialization.
Every language ships its own serialization facility, e.g. Java serialization, Ruby's marshal, or Python's pickle.

That is all fine until you care about crossing platforms and languages, at which point you can switch to JSON or XML.
If you cannot stand the verbosity of JSON or XML and the cost of parsing them, you might be tempted to invent a binary encoding for JSON.

There is no need to reinvent that wheel, though: Thrift, Protocol Buffers and Avro all provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

1. Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.
2. Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).
3. Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

In practice data is always changing, so schemas keep evolving. Thrift, Protobuf and Avro all support this: when the schema changes on the client or the server, the service can keep running with as little disruption as possible.

In real life, data is always in flux. The moment you think you have finalised a schema, someone will come up with a use case that wasn’t anticipated, and wants to “just quickly add a field”. Fortunately Thrift, Protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you’re dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.

The focus of this post is to compare how Thrift, Protobuf and Avro actually serialize data into binary and how each of them supports schema evolution. It follows Martin Kleppmann's article at http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html.

The example I will use is a little object describing a person. In JSON I would write it like this:

{"userName": "Martin","favouriteNumber": 1337,"interests": ["daydreaming", "hacking"]
}

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.
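
As a quick sanity check on the baseline, a minimal Python snippet (standard library only) reproduces the 82-byte figure:

```python
import json

person = {
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") strips all whitespace from the encoding
encoded = json.dumps(person, separators=(",", ":")).encode("utf-8")
print(len(encoded))  # -> 82, matching the baseline above
```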

##1.Protocol Buffers
The Protocol Buffers schema for the person object might look something like this:

message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}

First, Protocol Buffers describes the Person schema in its IDL.
Every field carries a unique tag as its identifier, so = 1, = 2, = 3 are not assignments: they declare each field's tag. A field is additionally marked optional, required or repeated. When we encode the data above using this schema, it uses 33 bytes, as follows:

The byte-layout diagram in the original post shows clearly how 82 bytes of JSON shrink to 33 bytes of binary.
The serialized form records only each field's tag, never its name, so a field name can be changed freely while its tag must never change.
The first byte of each field packs the tag together with the wire type; the actual data follows, and a string is additionally prefixed with its length.
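
To make the layout concrete, here is a hedged sketch that assembles those 33 bytes by hand, following the documented Protocol Buffers wire format (key byte = field number << 3 | wire type, varints for integers, length-prefixed strings); it is an illustration, not the official library:

```python
def varint(n: int) -> bytes:
    # Encode a non-negative integer as a protobuf varint
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def key(field_number: int, wire_type: int) -> bytes:
    # The key byte packs the field tag together with the wire type
    return varint(field_number << 3 | wire_type)

def string_field(field_number: int, s: str) -> bytes:
    data = s.encode("utf-8")
    # Wire type 2 = length-delimited: key, then length, then the bytes
    return key(field_number, 2) + varint(len(data)) + data

msg = (
    string_field(1, "Martin")          # user_name, tag 1
    + key(2, 0) + varint(1337)         # favourite_number, tag 2, wire type 0 = varint
    + string_field(3, "daydreaming")   # interests, tag 3: a repeated field is
    + string_field(3, "hacking")       # simply emitted once per element
)
print(len(msg))  # -> 33 bytes, matching the count above
```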

Notice that the encoding does not explicitly record optional, required or repeated.
At decoding time a validation check is run for required fields, but an optional or repeated field without a value can be entirely absent from the encoded data.
Optional and repeated fields can therefore simply be dropped from a schema, for instance on the client side; the caveat is that a removed field's tag must never be reused.
Changing a required field is more dangerous: if a client removes a required field, the server's validation check will fail.

Adding a field causes no problems at all as long as it uses a fresh tag; an evolved schema is sketched below.
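
For illustration, an evolved schema might add a hypothetical email field under a previously unused tag; old readers simply skip the unknown field:

```
message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
    optional string email            = 4;  // hypothetical new field under a fresh tag
}
```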

##2.Thrift
Thrift is a much bigger project than Avro or Protocol Buffers, as it’s not just a data serialization library, but also an entire RPC framework.
It also has a somewhat different culture: whereas Avro and Protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls “protocols”).

Thrift is fairly powerful: it is not just a data serialization library but an entire RPC framework with a complete protocol stack.
Its protocol abstraction also means it is not tied to a single binary encoding: other encodings can be supported by implementing different protocols.

Thrift's IDL looks very much like Protocol Buffers'. The differences are that a field tag is written 1: (rather than = 1) and that there is no repeated modifier; container types such as list take its place (the example below does still use optional).
All the encodings share the same schema definition, in Thrift IDL:

struct Person {
    1: string       userName,
    2: optional i64 favouriteNumber,
    3: list<string> interests
}

The BinaryProtocol encoding is very straightforward, but also fairly wasteful (it takes 59 bytes to encode our example record):

The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes:

As mentioned above, Thrift can wrap different encodings behind its protocol abstraction, and for binary encodings there are two choices.
The first is the plain binary encoding, which makes no attempt to save space; it is fairly wasteful and needs 59 bytes.
The second is the compact binary encoding, which is quite close to Protocol Buffers' encoding. One difference is that Thrift is more flexible here: it directly supports containers, such as the list in this example, whereas Protocol Buffers can only model simple collections through repeated fields (Thrift defines an explicit list type rather than Protobuf's repeated field approach).
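
A minimal Python sketch of producing both encodings, assuming a Person class generated by the Thrift compiler from the IDL above (the module name person.ttypes is hypothetical):

```python
from thrift.protocol import TBinaryProtocol, TCompactProtocol
from thrift.transport import TTransport
from person.ttypes import Person  # hypothetical module from `thrift --gen py`

person = Person(userName="Martin", favouriteNumber=1337,
                interests=["daydreaming", "hacking"])

def encode(protocol_factory):
    # Serialize into an in-memory transport using the chosen protocol
    buf = TTransport.TMemoryBuffer()
    person.write(protocol_factory.getProtocol(buf))
    return buf.getvalue()

print(len(encode(TBinaryProtocol.TBinaryProtocolFactory())))    # 59 bytes
print(len(encode(TCompactProtocol.TCompactProtocolFactory())))  # 34 bytes
```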

##3.Avro
Avro schemas can be written in two ways, either in a JSON format:

{"type": "record","name": "Person","fields": [{"name": "userName",        "type": "string"},{"name": "favouriteNumber", "type": ["null", "long"]},{"name": "interests",       "type": {"type": "array", "items": "string"}}]
}

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}

Notice that there are no tag numbers in the schema! So how does it work?

Here is the same example data encoded in just 32 bytes:

Avro is the newest of the three and not yet widely used, appearing mainly in the Hadoop ecosystem. Its design is also quite distinctive compared with Thrift and Protocol Buffers.
First, the schema can be defined either in the IDL or in JSON. Note that the binary encoding stores neither field tags nor field types, which means:
1. A reader must have a schema matching the writer's in order to parse the data.
2. With no field tags, the field name is the only identifier. Avro does support renaming a field, but all readers must be told first, as follows:

Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader’s schema). Then you can update the writer’s schema to use the new field name.

3. Data is read back in exactly the order the fields are declared in the schema, so an optional field needs special handling; the example uses union { null, long }.

if you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
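
A hedged Python sketch of the encoding using the avro library (API spellings vary slightly between avro releases; avro.schema.parse is the lowercase form in recent ones):

```python
import io
import avro.schema
from avro.io import BinaryEncoder, DatumWriter

SCHEMA = avro.schema.parse("""
{"type": "record", "name": "Person", "fields": [
    {"name": "userName",        "type": "string"},
    {"name": "favouriteNumber", "type": ["null", "long"]},
    {"name": "interests",       "type": {"type": "array", "items": "string"}}
]}
""")

person = {"userName": "Martin", "favouriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

buf = io.BytesIO()
# No tags or types are written: the encoder walks the schema's field order,
# and the union emits one branch-index byte (null vs long) before the value.
DatumWriter(SCHEMA).write(person, BinaryEncoder(buf))
print(len(buf.getvalue()))  # -> 32 bytes, matching the count above
```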

4. The schema itself can live as plain JSON, whereas Thrift and Protocol Buffers only turn an IDL into generated code. Avro can therefore power a generic client and server: when the schema changes, only the JSON is updated and nothing is recompiled.

When the schema changes, Avro handles it more simply: just distribute the new schema to all readers.

With Thrift or Protocol Buffers, a schema change means recompiling client and server code, although both tolerate version mismatches between the two sides reasonably well.
5. The writer's schema and the reader's schema do not have to match exactly; the Avro parser can use resolution rules to translate the data between them.

So how does Avro support schema evolution?
Well, although you need to know the exact schema with which the data was written (the writer’s schema), that doesn’t have to be the same as the schema the consumer is expecting (the reader’s schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.
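
Continuing the sketch above, the two-schema read looks roughly like this (again using the avro Python library; READER_SCHEMA stands for a hypothetical evolved variant of SCHEMA):

```python
from avro.io import BinaryDecoder, DatumReader

buf.seek(0)
# DatumReader takes both schemas and resolves the writer's fields
# into the shape the reader expects
reader = DatumReader(writers_schema=SCHEMA, readers_schema=READER_SCHEMA)
decoded = reader.read(BinaryDecoder(buf))
```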

6. Adding and removing fields is straightforward:

You can add a field to a record, provided that you also give it a default value (e.g. null if the field’s type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.

Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.
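
For instance, a field added with a default could look like this in the JSON schema (the email field is hypothetical); a reader holding this schema fills in null for old records that lack it. Note that the default must match the first branch of the union:

```
{"name": "email", "type": ["null", "string"], "default": null}
```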

One important question has not been discussed yet: Avro relies on the JSON schema, so when and how is the schema passed between client and server?

The answer is that it depends on the scenario: in a file header, during the connection handshake, and so on.

This leaves us with the problem of knowing the exact schema with which a given record was written.
The best solution depends on the context in which your data is being used:

In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
In an RPC context, it’s probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
If you’re storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number (a minimal fingerprint sketch follows below).
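
As an illustration of the registry-key idea, one simple, non-normative approach is to key each record by a digest of its schema JSON (Avro itself standardizes fingerprints over a canonical schema form; this sketch only conveys the idea):

```python
import hashlib
import json

def schema_fingerprint(schema_json: str) -> str:
    # Normalize the JSON before hashing so that formatting differences
    # do not yield different keys (a simplification of Avro's canonical form)
    canonical = json.dumps(json.loads(schema_json), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Each stored record is annotated with this key; a schema registry maps
# the key back to the full schema definition at read time.
```
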
Compared with Thrift and Protocol Buffers, Avro is more complex and harder to use, but it does bring the following advantages:

At first glance it may seem that Avro’s approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas.
However, I am beginning to think that Avro’s approach also has some distinct advantages:

Object container files are wonderfully self-describing: the writer schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it Just Works™ without any configuration.
As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can’t parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it’s not required for operation, it’ll end up being on a best-effort basis.

##4.Summary comparison
Google Protocol Buffers:

Pros
Binary messages with good performance and high efficiency (in both space and time)
Target code is generated from .proto files, which makes it simple to use
Serialization and deserialization map directly onto the program's data classes, with no separate parse-then-map step (the way XML and JSON work)
Supports forward compatibility (new fields fall back to default values) and backward compatibility (unknown new fields are ignored), which simplifies upgrades
Supports many languages (the .proto file can be treated as an IDL)
Integrated by frameworks such as Netty

Cons
Official bindings only for C++, Java and Python
Poor human readability of the binary (though a TextFormat facility appears to be provided)
The binary is not self-describing
No dynamic behavior by default (achievable via dynamically defined message types or dynamic compilation)
Covers only serialization and deserialization, no RPC functionality (similar to an XML or JSON parser)

Apache Thrift:

Pros
Bindings for a very large number of languages
Target code is generated from .thrift files, which makes it simple to use
Message definition files support comments
Data structures are decoupled from their wire representation, supporting multiple message formats
Ships a complete client/server stack, so an RPC service can be built quickly
Supports both synchronous and asynchronous communication

Cons
Like protobuf, no dynamic behavior

Apache Avro:

Pros
Binary messages with good performance and high efficiency
Schemas are described in JSON
Schema and data are stored together, so messages are self-describing and no stub code needs to be generated (generating from an IDL is also supported)
RPC calls exchange schema definitions during the handshake
Ships a complete client/server stack, so an RPC service can be built quickly
Supports both synchronous and asynchronous communication
Supports dynamic messages
The schema definition can specify a sort order for the data (serialization follows this order)
Provides services built on the Jetty kernel as well as on Netty

Cons
Only supports Avro's own serialization format
