Scott Mansfield (@sgmansfield) is a Senior Software Engineer at Netflix. He works on EVCache, a high-performance, low-latency persistence system. He is the primary author of Rend, an open-source memcached proxy that is part of EVCache. He spoke at GopherCon about a custom data serialization format he developed at Netflix.
*Note: This post was best-effort live-blogged at the conference. Let me know on Twitter (@renfredxh) if I missed anything!*
Why serialization?
Serialization is everywhere. From high-level applications such as serializing metadata as JSON objects, down to the lowest level of encoding binary instructions into electrical voltages a CPU can understand, serialization plays a huge role in transcoding data everywhere in between. Some interesting examples include:
- HTTP/2 headers (HTTP headers serialized into a binary format)
- Hard drive communication (SATA interfaces)
- Video display (serializing color and timing information into formats encoded and transmitted across VGA)
Frameworks such as gRPC/protobuf already define existing formats and methods for serializing data, so why create something new? Scott mentions a universal truth: we can always look to Hacker News for inspiration:
"The field is too in love with horribly inefficient frameworks. Writing Network code and protocols is now considered too low level for people. "
People are often afraid of peeking under the covers to understand the underlying formats and, if necessary, create their own format suited to a specific need. Scott took this challenge head on and developed a custom data serialization format that best suits the requirements at Netflix, which include the ability to be self-describing, storage efficiency, performance, and flexible querying.
The following is a summarized overview of Netflix's format, which is powered by Go.
The document
JSON is a universally known data format. By using JSON as a starting point, Netflix has created an augmented format that is both familiar and adds the properties that are important to them, such as performance and querying capability. They've also ironed out some ambiguities in the JSON format, such as the byte size of the number type, by supporting 64-bit integers and floats.
So that's the document format, but how do we interact with the data? A common pattern for accessing JSON documents is:
- Get the entire document
- Inflate the serialized data
- Walk the data structure
This requires fetching all of the data in the document and walking it in an often inefficient (or random) order.
By adding querying capabilities over a JSON document, we can leverage a new and improved pattern:
- Ask for only the data you need
- Get only the data you need
- Still need to inflate the data
These JSON-like documents are stored as byte arrays for maximum flexibility and efficiency, in a way that supports these queries.
The queries
The syntax to query fields within these JSON documents is designed to request only the specific data a user needs, and to return only that data.
The value of a single field can be accessed as follows:
```
Query: .foo

{"foo": 3, "bar": 4}
   ↑
 Key foo

Result: 3
```
Multiple fields can be accessed in a similar way:
```
Query: .foo|bar

{"foo": 3, "bar": 4, "baz": 5}
   ↑          ↑
 Key foo    Key bar

Result: {"foo": 3, "bar": 4}
```
More complex querying capabilities include fully recursive nesting and array slicing:
```
Query: .m[].k1[0]

{"foo": {"k1": [3,4]}, "bar": {"k1": [5,6]}}
                ↑
      Index 0 of each array value in k1

Result: {"foo": 3, "bar": 5}
```
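To make the query semantics concrete, here is a toy Go sketch of the single-field and multi-field cases. This is purely my own illustration: the real format evaluates queries directly against the serialized bytes, while this version works on an already-decoded map, and `evalQuery` is a name I made up.

```go
package main

import "fmt"

// evalQuery illustrates the ".foo" and ".foo|bar" query semantics on a
// decoded document: one key returns the bare value, several keys return
// a sub-document containing just those fields.
func evalQuery(doc map[string]interface{}, keys ...string) interface{} {
	if len(keys) == 1 {
		return doc[keys[0]]
	}
	out := make(map[string]interface{})
	for _, k := range keys {
		if v, ok := doc[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	doc := map[string]interface{}{"foo": 3, "bar": 4, "baz": 5}

	fmt.Println(evalQuery(doc, "foo"))        // .foo      -> 3
	fmt.Println(evalQuery(doc, "foo", "bar")) // .foo|bar  -> map[bar:4 foo:3]
}
```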
Performance
Netflix operates at massive scale. For this reason their query syntax was designed not only to be flexible, but to work in a way that leverages the internal format of the document to return data efficiently. Offsets and data lengths are included in header fields for composite types (arrays and maps), which allows constant-time access for array slicing.
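To show why a header of offsets buys constant-time access, here is a minimal sketch of the idea. The layout is hypothetical (a count, then an offset/length pair per element, then the element bytes), not the actual Netflix wire format, and `packArray`/`elem` are names I chose for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packArray serializes elements with a header: a 4-byte count followed by
// a (offset, length) pair of uint32s per element, then the raw bytes.
func packArray(elems [][]byte) []byte {
	var buf []byte
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(elems)))
	offset := uint32(4 + 8*len(elems)) // data starts after the header
	for _, e := range elems {
		buf = binary.BigEndian.AppendUint32(buf, offset)
		buf = binary.BigEndian.AppendUint32(buf, uint32(len(e)))
		offset += uint32(len(e))
	}
	for _, e := range elems {
		buf = append(buf, e...)
	}
	return buf
}

// elem slices out element i in constant time using only its header entry,
// without scanning or inflating any other element.
func elem(buf []byte, i int) []byte {
	off := binary.BigEndian.Uint32(buf[4+8*i:])
	length := binary.BigEndian.Uint32(buf[8+8*i:])
	return buf[off : off+length]
}

func main() {
	buf := packArray([][]byte{[]byte("foo"), []byte("barbar"), []byte("z")})
	fmt.Println(string(elem(buf, 1))) // barbar
}
```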
Below is a diagram that depicts the anatomy of the resulting byte array that a JSON array is serialized into: a type field, the header information mentioned above, followed by the data itself.
For map types, keys are stored as interned strings. This means each string key is assigned an integer, which avoids storing duplicate copies of potentially long string keys. For example, if there's a key named "Orange Is the New Black", it would be assigned an ID such as 1, and each subsequent reference to that key is stored as 1 in the database and translated back into the original string during deserialization.
Additionally, keys and their associated offsets are stored in sorted order. This means that for a given key, binary search can be used to efficiently look up the desired value. Scott's talk included several benchmark results that verify the performance assumptions (I've decided to spare the details of the benchmark tests here, but those who are curious can check out Scott's slides for more).
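The sorted-key lookup can be sketched with Go's standard `sort.Search`. Again, the `entry` layout below is an assumption for illustration, not the real on-disk structure: it pairs an interned key ID with the offset of its value in the serialized buffer.

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs an interned key ID with the offset of its value in the
// serialized buffer. Entries are kept sorted by keyID.
type entry struct {
	keyID  uint32
	offset uint32
}

// findOffset binary-searches the sorted entries for keyID, returning the
// value's offset and whether the key was present.
func findOffset(entries []entry, keyID uint32) (uint32, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].keyID >= keyID
	})
	if i < len(entries) && entries[i].keyID == keyID {
		return entries[i].offset, true
	}
	return 0, false
}

func main() {
	entries := []entry{{1, 64}, {4, 96}, {9, 160}} // sorted by keyID
	off, ok := findOffset(entries, 4)
	fmt.Println(off, ok) // 96 true
}
```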
Key takeaways
Now that I have a better idea of the motivations and design decisions that influenced Netflix's new serialization format, an obvious question arises: was the design of a new, custom format worth it?
It's difficult to answer this without more data surrounding the practical applications of this format in Netflix's infrastructure. Still, Scott's talk served as a satisfactory explanation of the case for rolling your own protocol over choosing a third-party framework.
For me, the key takeaway of Scott's talk is not necessarily a recipe for constructing a well-designed serialization format. Instead, it's that the option to develop custom infrastructure well-suited to your requirements shouldn't be immediately dismissed just because an existing framework exists, no matter how low-level the work. Understanding the underlying details can inform a potentially better solution, one that can be verified by measuring results.