Comparison between Apache Avro and Thrift

Source: Internet
Author: User
Tags: comparison, json, serialization, static, class, advantage

http://www.tbdata.org/archives/1307


Comparison between Apache Avro and Thrift

Avro and Thrift are both cross-language, binary, high-performance communication middleware. Both provide data serialization and RPC services. Their general functionality is similar, but their philosophies differ. Thrift came out of Facebook for communication among its back-end services, and its design emphasizes a unified programming interface for a multi-language communication framework. Avro was created by Doug Cutting, the father of Hadoop, after Thrift had already become quite popular. Its goal is not merely to provide communication middleware similar to Thrift, but to build a new, standard protocol for data exchange and storage in the cloud. The philosophies differ: Thrift assumes there is no perfect solution to every problem, so it tries to stay a neutral framework into which different implementations can be plugged and made to interoperate. Avro leans toward practicality, rejecting the confusion that multiple competing schemes can bring, and advocates a single unified standard that does not shy away from specific optimizations. Avro's innovation is the combination of an explicit, declarative schema with an efficient binary data representation; it emphasizes self-describing data and overcomes the shortcomings of earlier plain XML or binary systems. Avro's dynamic schema loading is something the Thrift programming interface does not offer, and it suits applications such as Hadoop Hive/Pig and NoSQL stores that need both ad hoc querying and high performance.

Language Binding

At this stage, Thrift supports a richer set of languages than Avro.

Thrift: C++, C#, Cocoa, Erlang, Haskell, Java, OCaml, Perl, PHP, Python, Ruby, Smalltalk.

Avro: C, C++, Java, Python, Ruby, PHP.

Data Types

Judged by the common data types, Avro and Thrift are very close; there is no real difference in functionality.

Avro              Thrift            Description

Basic types
boolean           bool              true or false
N/A               byte              8-bit signed integer
N/A               i16               16-bit signed integer
int               i32               32-bit signed integer
long              i64               64-bit signed integer
float             N/A               32-bit floating point
double            double            64-bit floating point
bytes             binary            byte sequence
string            string            character sequence

Complex types
record            struct            user-defined type
enum              enum              enumeration
array<T>          list<T>
N/A               set<T>
map<string, T>    map<T1, T2>       an Avro map's keys must be strings
union             union
fixed             N/A               fixed-size byte array, e.g. fixed MD5(16)

RPC service
protocol          service           RPC service type
error             exception         RPC exception type
namespace         namespace         namespace (domain name)

Development Process

From the developer's point of view, Avro and Thrift are quite similar.

1) The same service, described in Avro IDL and Thrift IDL respectively.

avro.idl:

protocol SimpleService {

  record Message {
    string topic;
    bytes content;
    long createdTime;
    string id;
    string ipAddress;
    map<string> props;
  }

  int publish(string context, array<Message> messages);
}

thrift.idl:

struct Message {
  1: string topic
  2: binary content
  3: i64 createdTime
  4: string id
  5: string ipAddress
  6: map<string, string> props
}

service SimpleService {
  i32 publish(1: string context, 2: list<Message> messages);
}

2) Both Avro and Thrift support code generation from their IDL (the Avro commands below assume the Avro tool classes are on the classpath):

java Idl avro.idl idl.avro
java org.apache.avro.specific.SpecificCompiler idl.avro avro-gen

The target directory contains the generated Message.java and SimpleService.java.

thrift --gen java thrift.idl

Similarly, the target directory contains the generated Message.java and SimpleService.java.

3) Client Code

Avro client:

URL url = new URL("http", HOST, PORT, "/");
Transceiver trans = new HttpTransceiver(url);
SimpleService proxy =
    (SimpleService) SpecificRequestor.getClient(SimpleService.class, trans);
...

Thrift client:

TTransport transport = new TFramedTransport(new TSocket(HOST, PORT));
TProtocol protocol = new TCompactProtocol(transport);
transport.open();
SimpleService.Client client = new SimpleService.Client(protocol);
...

4) Server side. Both Avro and Thrift generate interfaces that need to be implemented:

Avro server:

public static class ServiceImpl implements SimpleService {
  ...
}

Responder responder = new SpecificResponder(SimpleService.class, new ServiceImpl());
Server server = new HttpServer(responder, PORT);

Thrift server:

public static class ServerImpl implements SimpleService.Iface {
  ...
}

TServerTransport serverTransport = new TServerSocket(PORT);
TServer server = new TSimpleServer(processor, serverTransport,
    new TFramedTransport.Factory(), new TCompactProtocol.Factory());
server.serve();

Schema Processing

Avro and Thrift handle schemas differently.

Thrift is a programming-oriented system that relies entirely on IDL-to-binding-language code generation. The schema is "hidden" in the generated code and is completely static. For the system to recognize and process a new data source, you must edit the IDL, regenerate the code, recompile, and reload.

In contrast, although Avro also supports IDL-based schema descriptions, its internal schema is explicit and lives in JSON-formatted files; Avro can convert an IDL schema into the JSON format.

Avro supports two different modes. The Avro-specific mode is similar to Thrift: it relies on code generation to produce specific classes with the JSON schema embedded in them. The Avro-generic mode supports dynamic schema loading and uses a common structure (a map) to represent data objects, so new data sources can be handled directly, without generating, compiling, or loading code.
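As an illustration of the generic mode, here is a minimal sketch, assuming the Avro Java libraries and a hypothetical cut-down JSON schema for the Message record above; it loads the schema at runtime and builds a record without any generated classes:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericModeExample {
    public static void main(String[] args) {
        // JSON schema loaded at runtime; it could equally come from a file or a handshake.
        String json = "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                + "{\"name\":\"topic\",\"type\":\"string\"},"
                + "{\"name\":\"content\",\"type\":\"bytes\"},"
                + "{\"name\":\"createdTime\",\"type\":\"long\"}]}";
        Schema schema = new Schema.Parser().parse(json);

        // No generated Message class: a GenericRecord is essentially a schema-aware map.
        GenericRecord msg = new GenericData.Record(schema);
        msg.put("topic", "PV");
        msg.put("content", java.nio.ByteBuffer.wrap(new byte[100]));
        msg.put("createdTime", System.nanoTime());
        System.out.println(msg);
    }
}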

Serialization

Avro defines a standard serialization protocol, whereas Thrift's design goal is to be a framework that does not enforce any particular serialization format.

Avro specifies a standard serialization format in which the data schema (in JSON) precedes the data, whether the data is stored in a file or transmitted over the network. The data itself carries no per-field metadata (tags). For file storage, the schema appears in the file header; for network transmission, the schema is exchanged during the initial handshake. The benefits are, first, that the data becomes self-describing, which improves its transparency and interoperability, and second, that the data itself carries less information, which improves storage efficiency.
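A minimal sketch of the file-storage case, assuming the Avro Java libraries and a hypothetical one-field schema: DataFileWriter places the JSON schema in the file header, and a reader recovers it without any out-of-band information.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SchemaInHeaderExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                + "{\"name\":\"topic\",\"type\":\"string\"}]}");

        // Write: the schema goes into the file header, the records carry no tags.
        File file = new File("messages.avro");
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        GenericRecord msg = new GenericData.Record(schema);
        msg.put("topic", "PV");
        writer.append(msg);
        writer.close();

        // Read: the schema is recovered from the header, so the file is self-describing.
        DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        System.out.println("Schema from header: " + reader.getSchema());
        for (GenericRecord r : reader) {
            System.out.println(r);
        }
        reader.close();
    }
}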

Avro's protocol leaves room for many optimizations. One is projection: by scanning the schema, only the fields of interest need to be deserialized. Another is schema versioning and mapping: readers and writers with different schema versions can exchange data by resolving their schemas against each other (schema aliases support the mapping), which is considerably more flexible than Thrift's approach of numbering each field.
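A sketch of projection via schema resolution, assuming the Avro Java libraries; the reader schema below is a hypothetical subset of the writer schema, so only the topic field is materialized when decoding:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ProjectionExample {
    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                + "{\"name\":\"topic\",\"type\":\"string\"},"
                + "{\"name\":\"ipAddress\",\"type\":\"string\"}]}");
        // The reader schema keeps only the field we care about.
        Schema readerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                + "{\"name\":\"topic\",\"type\":\"string\"}]}");

        // Encode a record with the writer schema.
        GenericRecord full = new GenericData.Record(writerSchema);
        full.put("topic", "PV");
        full.put("ipAddress", "127.0.0.1");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(full, encoder);
        encoder.flush();

        // Decode with (writer schema, reader schema): only "topic" is resolved.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord projected =
                new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                        .read(null, decoder);
        System.out.println(projected); // only the topic field remains
    }
}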

An Avro schema also lets you define a sort order for the data, which is respected during serialization. This makes it possible to sort serialized data directly, without deserializing it, which works very well in Hadoop.
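A small sketch of comparing two serialized records without deserialization, assuming the Avro Java libraries and the hypothetical one-field schema used above; BinaryData.compare walks the binary encodings according to the schema's sort order:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BinaryCompareExample {
    static byte[] encode(Schema schema, GenericRecord r) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(r, enc);
        enc.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                + "{\"name\":\"topic\",\"type\":\"string\"}]}");
        GenericRecord a = new GenericData.Record(schema);
        a.put("topic", "PV");
        GenericRecord b = new GenericData.Record(schema);
        b.put("topic", "click");

        byte[] ba = encode(schema, a);
        byte[] bb = encode(schema, b);
        // Compare the serialized bytes directly; no deserialization happens.
        int cmp = BinaryData.compare(ba, 0, bb, 0, schema);
        System.out.println(cmp < 0 ? "a < b" : cmp > 0 ? "a > b" : "a == b");
    }
}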

Another feature of Avro is its block-based list structure, which removes the limit that a single integer count would impose. An array or map, for example, consists of a series of blocks, each with a count followed by that many elements; a count of 0 marks the end.

Thrift provides a variety of serialization implementations:

TCompactProtocol: the most compact and efficient binary serialization protocol, but not supported by every language binding.

TBinaryProtocol: the default, simple binary serialization protocol.

Unlike Avro, Thrift stores a tag in front of every field; the tag identifies the field's type and its numeric ID (defined in the IDL and used for versioning). Within a batch of records the tag information is identical for every record, so when the number of records is large this is clearly wasteful.
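A sketch of measuring that per-record overhead, assuming the Thrift-generated Java Message class from the IDL above (the class and setter names are therefore assumptions) and the libthrift TSerializer utility:

import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TCompactProtocol;

public class ThriftSizeExample {
    public static void main(String[] args) throws Exception {
        // Message is assumed to be the class generated from the Thrift IDL above.
        Message msg = new Message();
        msg.setTopic("PV");
        msg.setIpAddress("127.0.0.1");
        msg.setCreatedTime(System.nanoTime());

        TSerializer binary = new TSerializer(new TBinaryProtocol.Factory());
        TSerializer compact = new TSerializer(new TCompactProtocol.Factory());

        // Every serialized record repeats the per-field type and ID tags.
        System.out.println("binary bytes:  " + binary.serialize(msg).length);
        System.out.println("compact bytes: " + compact.serialize(msg).length);
    }
}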

RPC Service

Avro provides:

HttpServer: the default; a service based on the Jetty kernel.

NettyServer: the newer, Netty-based service.

Thrift offers:

TThreadPoolServer: multi-threaded blocking service.

TNonblockingServer: single-threaded, non-blocking service.

THsHaServer: multi-threaded, non-blocking service.
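For reference, a minimal sketch of constructing the two servers used in the benchmark below, assuming the Avro Netty IPC classes and a reasonably recent libthrift with the Args builder; SimpleService, ServiceImpl and ServerImpl come from the earlier examples and are assumptions here:

import java.net.InetSocketAddress;
import org.apache.avro.ipc.NettyServer;
import org.apache.avro.ipc.Server;
import org.apache.avro.ipc.specific.SpecificResponder;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.server.THsHaServer;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TNonblockingServerSocket;

public class ServerSetupExample {
    public static void main(String[] args) throws Exception {
        // Avro NettyServer wrapping the same SpecificResponder used with HttpServer.
        Server avroServer = new NettyServer(
                new SpecificResponder(SimpleService.class, new ServiceImpl()),
                new InetSocketAddress(9090));

        // Thrift THsHaServer: half-sync/half-async, framed transport, compact protocol.
        TNonblockingServerSocket socket = new TNonblockingServerSocket(9091);
        THsHaServer.Args serverArgs = new THsHaServer.Args(socket)
                .processor(new SimpleService.Processor(new ServerImpl()))
                .transportFactory(new TFramedTransport.Factory())
                .protocolFactory(new TCompactProtocol.Factory());
        new THsHaServer(serverArgs).serve();
    }
}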

Benchmarking

Test environment: two machines, each with a 4-core Intel Xeon 2.66 GHz CPU and 8 GB of memory, running Linux; one acts as the client and the other as the server.

Object definition:

record Message {
  string topic;
  bytes payload;
  long createdTime;
  string id;
  string ipAddress;
  map<string, string> props;
}

Actual instance:

msg.createdTime: System.nanoTime();
msg.ipAddress:   "127.0.0.1";
msg.topic:       "PV";
msg.payload:     byte[100]
msg.id:          UUID.randomUUID().toString();
msg.props:       new HashMap<String, String>();
msg.props.put("author", "tjerry");
msg.props.put("date", new Date().toString());
msg.props.put("status", "new");

Serialization size

Avro produces the smallest serialized output.

Serialization speed

Thrift-binary appears fastest, presumably because its serialization is the simplest.

Deserialization speed

Thrift is very fast here, which is probably related to the zero-copy optimizations in its internal implementation. However, this advantage does not seem to show up in the end-to-end RPC test.

The serialization measurements were collected with the framework provided by http://code.google.com/p/thrift-protobuf-compare/.

Original output:

                 Object create   Serialize   Same object   Deserialize   + check media   + check all   Total time   Serialized size
Avro-generic     8751.305        10938.0     1696.5        16825.0       16825.0         16825.0       27763.0      221
Avro-specific    8566.88         10534.5     1242.5        18157.0       18157.0         18157.0       28691.5      221
Thrift-compact   6784.615        11665.0     4214.0        1799.0        1799.0          1799.0        13464.0      227
Thrift-binary    6721.195        12386.5     4478.0        1692.0        1692.0          1692.0        14078.5      273

RPC Test Cases:

The client sends a fixed-length batch of messages to the server; so that both serialization and deserialization are exercised, the server echoes the original messages back to the client.

array<Message> publish(string context, array<Message> messages);

The tests used the Avro Netty server and the Thrift HsHa server, because both are based on asynchronous I/O and suited to high-concurrency environments.

Results

In this test, before the network becomes the bottleneck, the Avro Netty server provides higher throughput and faster responses than the Thrift HsHa server, but Avro consumes more memory.

Further experiments showed that neither the Avro nor the Thrift service is absolutely faster; which one wins depends on the test case and on how the application uses the framework. The current test cases run in batch mode, sending large numbers of fine-grained objects (close to how the back-end TT and Hadoop use it), and in this scenario Avro has the advantage. For chatty clients that pass only one object at a time, the result reverses and Thrift becomes more efficient. And as the blobs in the data structure grow larger, the difference between Avro and Thrift shrinks.

Conclusion

Thrift is suited to static, program-to-program data exchange where the schema is known in advance and relatively fixed. Avro adds dynamic-schema support on top of what Thrift offers, without giving up performance. Avro's explicit schema design makes it better suited to building common tools and platforms for data exchange and storage, especially in back-end systems. For now, Thrift's advantages are broader language support and relative maturity.

by Fankong | Cloud computing, high-performance servers | Comments (9)

Raymond said (December 28, 2010, 9:26 PM):
Interesting review. Which one did Taobao select?

Fankong said (December 29, 2010, 3:55 PM):
Thrift, for now.

Peter said (January 6, 2011, 10:33):
Very good summary. One small doubt: the Thrift HsHa server should not perform that poorly; with small payloads and 100 threads it should reach 40,000+.

Hadoop RPC mechanism && introducing Avro into the Hadoop RPC mechanism - Webguo said (February 11, 2011, 3:56 PM):
[...] On the comparison between Avro and Thrift, http://www.tbdata.org/archives/1307 did a detailed analysis; this section mainly introduces some details of Avro. 3.1 [...]
