Before we get started, let's take a look at what Apache Avro is and what it can be used for.
Apache Avro is a data serialization system. Serialization is the conversion of objects into a binary stream, and deserialization is the reverse: converting a binary stream back into the corresponding objects. Avro is therefore used to convert objects into a binary stream before the data is transmitted, and then to convert that stream back into objects once it reaches the target address.
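The round trip described above can be sketched in plain Java using only the standard library. This is not how Avro encodes data; RoundTripDemo and its methods are hypothetical names, used only to show an object's fields being turned into bytes and read back again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: a generic serialization round trip, not Avro's format.
public class RoundTripDemo {

    // Serialization: write the fields of a "user" into a byte array.
    static byte[] serialize(String name, int favoriteNumber) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);            // 2-byte length prefix + string bytes
            out.writeInt(favoriteNumber);  // 4 bytes, big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialization: read the fields back in the same order they were written.
    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int favoriteNumber = in.readInt();
            return name + ":" + favoriteNumber;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize("Alyssa", 256);
        System.out.println(wire.length + " bytes on the wire");
        System.out.println(deserialize(wire));
    }
}
```

Note that the reader has to know the field order and types in advance. In Avro, that knowledge is what the schema provides.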
Next, let's look at what the official website says.
Apache Avro is a data serialization system.
Avro provides: rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
As you know, JSON is a lightweight data interchange format, but its weaknesses show on large datasets. Because JSON is made of key:value pairs, every record must carry the names of its keys, and sometimes the keys alone take up more space than the values. For large datasets this is a serious waste: the repeated key information not only wastes storage space but also increases the volume of data to transfer, which adds load to the cluster and lowers its overall throughput.
Avro's serialization system handles this problem better. An Avro data file consists of a schema and the actual content. The schema is only the metadata of the data, the equivalent of the key information in JSON. It is written in JSON and stored only once (an Avro data file keeps it in its header) instead of being repeated in every record. Compared with the same data in JSON format, this reduces the storage required considerably and lets an Avro file organize data more compactly.
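To make the space argument concrete, here is a minimal sketch in plain Java, without the Avro library. It hand-encodes one user record following the rules of the Avro binary encoding (zig-zag variable-length integers, length-prefixed UTF-8 strings, a branch index before each union value) and compares the size with the equivalent JSON text. The class EncodingSizeDemo and its methods are hypothetical names made up for this illustration, and the field layout assumes the User schema used later in this article; real code should use Avro's own writers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Avro: hand-rolled encoding for illustration only.
public class EncodingSizeDemo {

    // Avro writes int/long values as zig-zag encoded variable-length integers.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long v = (n << 1) ^ (n >> 63);            // zig-zag maps small magnitudes to small values
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // 7 payload bits, high bit means "more follows"
            v >>>= 7;
        }
        out.write((int) v);
    }

    // A string is its UTF-8 length (written as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Assumed field order: name (string), favorite_number ["int","null"],
    // favorite_color ["string","null"]. A union is the branch index, then the value;
    // the null branch (index 1) has no value bytes.
    static byte[] encodeUser(String name, Integer favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        if (favoriteNumber != null) {
            writeLong(out, 0);
            writeLong(out, favoriteNumber);
        } else {
            writeLong(out, 1);
        }
        if (favoriteColor != null) {
            writeLong(out, 0);
            writeString(out, favoriteColor);
        } else {
            writeLong(out, 1);
        }
        return out.toByteArray();
    }

    // The same record as JSON text; the key names are repeated in every record.
    // No escaping is done, so this assumes plain values without quotes or backslashes.
    static String toJson(String name, Integer favoriteNumber, String favoriteColor) {
        String color = favoriteColor == null ? "null" : "\"" + favoriteColor + "\"";
        return "{\"name\":\"" + name + "\",\"favorite_number\":" + favoriteNumber
                + ",\"favorite_color\":" + color + "}";
    }

    public static void main(String[] args) {
        byte[] binary = encodeUser("Ben", 7, "Red");
        String json = toJson("Ben", 7, "Red");
        System.out.println("binary bytes: " + binary.length);
        System.out.println("JSON bytes:   " + json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For this record the sketch reports 11 bytes for the binary form against 57 for the JSON text, most of which is key names. A real Avro data file also carries a header (with the schema) and block markers, so the saving shows up across many records rather than in a single one.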
Next, we start using Avro.
Download
Taking Maven as an example, add the Avro dependency and the Avro plugin to the POM. The advantage of the plugin is that it automatically generates Java classes from .avsc schema files.
<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.1</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.8.1</version>
      <executions>
        <execution>
          <phase>generate-sources</phase>
          <goals>
            <goal>schema</goal>
          </goals>
          <configuration>
            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>
Note that the POM above configures two paths for automatic class generation: ${project.basedir}/src/main/avro/ (the source directory) and ${project.basedir}/src/main/java/ (the output directory). With this configuration, when the mvn command is executed, the plugin automatically generates classes for the .avsc schemas found in the first directory and places them in the second.
Define schema
Avro schemas are defined in JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed). For example, the following defines a User schema. Create an avro directory under src/main, and then create a new file user.avsc in that directory:
{"namespace": "lancoo.ecbdc.pre",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
Serializing and deserializing with code generation
Compiling the schema
Because the Avro Maven plugin is configured, Maven generates the classes for us when we run the following command:
mvn clean install
The corresponding classes are then generated in the output directory configured above.
If you do not use the plugin, you can also generate the classes with avro-tools:
java -jar /path/to/avro-tools-1.8.1.jar compile schema <schema file> <destination>
Create a user
Now that the class has been generated, you can use it to create users:
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);
// Leave favorite color null

// Alternate constructor
User user2 = new User("Ben", 7, "Red");

// Construct via builder
User user3 = User.newBuilder()
        .setName("Charlie")
        .setFavoriteColor("Blue")
        .setFavoriteNumber(null)
        .build();
Serialization
Serialize the users created above and store them in a file on disk:
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("Users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
Here we serialize the users to the file Users.avro.
Deserialization
Next, we deserialize the serialized data:
// Deserialize users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(new File("Users.avro"), userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse the user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
dataFileReader.close();
The whole process (creating the Avro schema, generating code, creating users, serializing the user objects, deserializing them, and printing the result) can be put together as the following complete code (here I use JUnit):
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

/**
 * Created by Yang on 12/23/16.
 */
// Assumes this test lives in the same package as the generated User class.
public class TestUser {
    @Test
    public void testCreateUserClass() throws IOException {
        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        // Leave favorite color null

        // Alternate constructor
        User user2 = new User("Ben", 7, "Red");

        // Construct via builder
        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("Blue")
                .setFavoriteNumber(null)
                .build();

        // Serialize user1, user2 and user3 to disk
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
        dataFileWriter.create(user1.getSchema(), new File("Users.avro"));
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.append(user3);
        dataFileWriter.close();

        // Deserialize users from disk
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        DataFileReader<User> dataFileReader = new DataFileReader<User>(new File("Users.avro"), userDatumReader);
        User user = null;
        while (dataFileReader.hasNext()) {
            // Reuse the user object by passing it to next(). This saves us from
            // allocating and garbage collecting many objects for files with
            // many items.
            user = dataFileReader.next(user);
            System.out.println(user);
        }
        dataFileReader.close();
    }
}
After the code runs, you will find that the file Users.avro has been created.
The output is:
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
{"name": "Ben", "favorite_number": 7, "favorite_color": "Red"}
{"name": "Charlie", "favorite_number": null, "favorite_color": "Blue"}
That's it. Pretty simple, isn't it?