Secrets of a Chief Engineer: How LinkedIn's Big Data Back End Works

Editor's note: Jay Kreps, a chief engineer at LinkedIn, observes that logs have been around almost as long as computers, and that they have many uses beyond distributed computing and abstract models of distributed computation. In this article he explains the principles behind the log and shows how a log, run as a standalone service, supports data integration, real-time data processing, and distributed system design. The content is dense and practical, and well worth studying.

The following is the original text:

I joined LinkedIn at an exciting time, about six years ago. Around then we began moving away from a single, centralized database toward a suite of specialized distributed systems. It has been an exciting experience: we built, deployed, and still run today a distributed graph database, a distributed search backend, a Hadoop installation, and first- and second-generation key-value stores.

The most rewarding thing we learned from all of this is that many of the things we built have a simple idea at their core: the log. Sometimes called a write-ahead log, a commit log, or a transaction log, the log has been around almost as long as computers have, and it sits at the heart of many distributed data systems and real-time application architectures.

Without understanding logs, you cannot fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any software system; yet most software engineers are not familiar with them. I would like to change that. In this post I will walk you through everything you need to know about logs, including what a log is and how to use logs for data integration, real-time processing, and system building.

Part One: What Is a Log?

A log is perhaps the simplest possible storage abstraction. It is an append-only sequence of records, totally ordered by time. Picture it as a row of numbered records, with the oldest on the left and the newest on the right.

We can append records to the end of the log, and we can read the records from left to right. Each record is assigned a unique, sequential log entry number.

The ordering of the records defines a notion of time, because a record on the left is by definition older than one on the right. The log entry number can be thought of as the "timestamp" of the record. Describing this ordering as a notion of time may seem odd at first, but it has a very convenient property: it is decoupled from any particular physical clock. This property becomes essential once we are running multiple distributed systems.
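The description above can be made concrete with a minimal sketch. The class name `Log` and its methods are hypothetical names chosen for illustration, and the log is held in memory rather than on disk:

```python
class Log:
    """Append-only sequence of records, addressed by sequential entry number."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its entry number."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start=0):
        """Yield (entry_number, record) pairs from `start`, oldest first."""
        for entry_number in range(start, len(self._records)):
            yield entry_number, self._records[entry_number]


log = Log()
first = log.append("user 42 signed up")
second = log.append("user 42 updated profile")
entries = list(log.read())
```

Note that there is no update or delete operation: the entry number alone fixes the position of each record in "log time", with no reference to a wall clock.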

For the purposes of this discussion, the content and format of the records are not important. Also, we cannot keep appending records forever, because the log will eventually run out of storage space. I will come back to this later.

A log is not all that different from a file or a table. A file is a sequence of bytes, a table is a sequence of records, and a log is really just a table or file in which the records are stored in time order.

At this point you may wonder why such a simple thing is worth discussing. How is an append-only sequence of records related to data systems at all? The answer is that the log has a specific purpose: it records what happened and when. For many aspects of distributed data systems, this is the very heart of the problem.

But before we go further, let me clear up a point of confusion. Every programmer is familiar with another kind of logging: the unstructured error messages or trace information that an application writes to a local file using syslog or log4j. To tell the two apart, let's call this "application logging." Application logging is a degenerate form of the log I am describing here. The biggest difference is that text logs are meant primarily for people to read, whereas the "log" or "data log" I am describing is built for programmatic access.

(In fact, if you think about it, the idea of people reading through logs on individual machines is something of an anachronism. This approach quickly becomes unmanageable when many services and servers are involved, and the purpose of logs quickly becomes to serve as input to queries and graphs that explain behavior across many machines. For that purpose, English text in files is far less suitable than the kind of structured log described here.)

Logs in Databases

I don't know where the log concept originated; perhaps it is like binary search, something too simple for its inventor to regard as an invention. It appears as early as IBM's System R. In a database, the log is used to keep the various data structures and indexes in sync in the presence of crashes. To make operations atomic and durable, the database writes the information about the records it is about to modify to the log before applying the changes to any of the data structures it maintains. The log is the record of what happened, and each table or index is a projection of that history into some useful data structure. Because the log is persisted immediately, it serves as the authoritative source for restoring all other persistent structures after a crash.
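The write-then-apply discipline described above can be illustrated with a toy sketch. `TinyStore` is a hypothetical name, and a plain Python list stands in for the durable log file; a real database would flush the log to disk before acknowledging the write:

```python
class TinyStore:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, wal=None):
        # `wal` stands in for a durable log file (an assumption of this sketch).
        self.wal = [] if wal is None else wal
        self.table = {}
        # Recovery: rebuild the in-memory table by replaying the log in order.
        for key, value in self.wal:
            self.table[key] = value

    def put(self, key, value):
        self.wal.append((key, value))  # 1. record the intended change first
        self.table[key] = value        # 2. only then modify the data structure


store = TinyStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)

# Simulate a crash: the in-memory table is lost and only the log survives.
recovered = TinyStore(wal=store.wal)
```

Because the log entry is written first, any change that reached the table is guaranteed to be in the log, so replaying the log always reproduces at least the state the table had.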

Over time, the use of the log grew from an implementation detail of ACID into a method for replicating data between databases. It turns out that the sequence of changes that happened on the database is exactly what is needed to keep a remote replica in sync.

Oracle, MySQL, and PostgreSQL all include log shipping protocols for transmitting the log to standby replica databases. Oracle has also productized the log as a general data subscription mechanism, so that non-Oracle subscribers can consume data through XStreams and GoldenGate, and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine-readable log has largely been confined to database internals. The use of logs as a data subscription mechanism seems to have arisen almost by accident, yet this very abstraction is well suited to supporting all kinds of messaging, data flow, and real-time data processing.

Logs in Distributed Systems

The log solves two problems, ordering changes and distributing data, and both matter even more in distributed data systems. Agreeing on an ordering for updates (or agreeing to disagree and coping with the side effects) is one of the core design problems for these systems.

The log-centric approach to distributed systems arises from a simple observation that I will call the State Machine Replication Principle: if two identical, deterministic processes begin in the same state and receive the same inputs in the same order, they will produce the same output and end in the same state.

This may seem a little obscure, so let's dig in and understand what it means.

Deterministic means that the processing does not depend on timing and that no other "out of band" input influences its result. For example, a program whose output is influenced by the particular order in which threads execute, by a call to gettimeofday, or by some other non-repeatable event is generally best considered non-deterministic.

The state of the process is whatever data remains on the machine, either in memory or on disk, at the end of the processing.

The part about receiving the same input in the same order should catch your attention: this is where the log comes in. Here is a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output.
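A small sketch may make the principle easier to see. The command format (`"set"` and `"delete"` tuples) and the function names `step` and `run_replica` are assumptions made for illustration only:

```python
def step(state, command):
    """Deterministic transition: the result depends only on state and command."""
    op, key, value = command
    new_state = dict(state)
    if op == "set":
        new_state[key] = value
    elif op == "delete":
        new_state.pop(key, None)
    return new_state


def run_replica(log):
    """Start from the same empty state and apply every logged command in order."""
    state = {}
    for command in log:
        state = step(state, command)
    return state


log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None), ("set", "y", 5)]
replica_a = run_replica(log)
replica_b = run_replica(log)
```

Both replicas end in the same state because `step` reads nothing but its arguments: no clock, no random numbers, no thread scheduling.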

The application to distributed computing is fairly obvious. You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed, consistent log that feeds these processes their input. The purpose of the log here is to squeeze all the non-determinism out of the input stream, so that every replica processing this input stays in sync.

Once you understand it, there is nothing complicated or deep about this principle: it more or less amounts to saying that "deterministic processing is deterministic." Nonetheless, I think it is one of the more general tools for distributed system design.

One of the beautiful things about this approach is that the timestamps that index the log now act as a clock for the state of the replicas: you can describe each replica with a single number, the timestamp of the latest log entry it has processed. That timestamp, together with the log, uniquely determines the entire state of the replica.

There are many ways of applying this principle in a system, depending on what is put in the log. For example, we can log the incoming requests to a service, or the state changes the service undergoes in response to a request, or the transformation commands it executes. In theory, we could even log a series of machine instructions for each replica to execute, or the method names and arguments to invoke on each replica. As long as two processes handle these inputs in the same way, they will remain consistent across replicas.

Different groups of people describe the uses of logs differently. Database people generally distinguish between physical and logical logging. Physical logging records the contents of each row that is changed. Logical logging records not the changed rows but the SQL statements that cause the rows to change (the INSERT, UPDATE, and DELETE statements).

The distributed systems literature commonly distinguishes two broad approaches to processing and replication. The "state machine model" usually refers to an active-active model, in which we log the incoming requests and every replica processes each request. A slight modification of this, called the "primary-backup model," is to elect one replica as the leader, let it process requests in the order they arrive, and have it log the changes to its state that result from processing them. The other replicas apply the leader's state changes in order, so that they stay in sync and are ready to take over should the leader fail.

To understand the difference between these two approaches, let's look at a toy example. Consider a replicated "arithmetic service" that maintains a single number as its state (initialized to zero) and applies additions and multiplications to this value. The active-active approach logs the transformations to apply, such as "+1", "*2", and so on. Each replica applies these transformations and therefore goes through the same sequence of values. The active-passive approach has a single master execute the transformations and log the results, such as "1", "3", "6", and so on. This example also makes it clear why ordering is the key to consistency between replicas: reordering an addition and a multiplication yields a different result.
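The arithmetic service can be sketched in a few lines. The command sequence below (`+1`, `+2`, `*2`) is an assumption chosen so that the leader's result log comes out as 1, 3, 6; the function names are likewise hypothetical:

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def replay_commands(commands, state=0):
    """Active-active: every replica executes the logged transformations itself."""
    for op, operand in commands:
        state = OPS[op](state, operand)
    return state


def primary_results(commands, state=0):
    """Primary-backup: the leader executes and logs the resulting values."""
    results = []
    for op, operand in commands:
        state = OPS[op](state, operand)
        results.append(state)
    return results


def follow(results, state=0):
    """A backup does no arithmetic; it adopts each logged value in order."""
    for value in results:
        state = value
    return state


commands = [("+", 1), ("+", 2), ("*", 2)]
result_log = primary_results(commands)
reordered = replay_commands([("+", 1), ("*", 2), ("+", 2)])
```

With these commands both styles of replica end at 6, while the reordered command log ends at 4, which is why every replica must see the log in the same order.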

The distributed log can be seen as the data structure that models the problem of consensus. A log, after all, represents a series of decisions on the "next" value to append. You have to look closely to see a log in the Paxos family of algorithms, even though building a log is their most common application. With Paxos, this is usually done with an extension of the protocol called "multi-Paxos," which models the log as a series of consensus problems, one for each slot in the log. The log is far more prominent in Zab, Raft, and other protocols, which directly model the problem of maintaining a distributed, consistent log.

My suspicion is that our view of this is somewhat biased by the path of history, perhaps because the theory of distributed computing ran well ahead of its practical application over the past few decades. In reality, the consensus problem is a little too simple. Computer systems rarely need to decide on a single value; they almost always process a sequence of requests. So a log, rather than a simple single-value register, is the more natural abstraction.

In addition, the focus on algorithms obscures the underlying log abstraction that systems need. I suspect we will end up treating the log as a commoditized building block regardless of how it is implemented, in the same way that we often talk about a hash table without getting into the details of whether it uses linear probing or some other variant. The log will become a commodity interface, with many algorithms and implementations competing to provide the best guarantees and the best performance.

Changelog 101: The Duality of Tables and Events

Let's come back to databases for a moment. There is a fascinating duality between a log of changes and a table. The log is like the list of all the credits and debits a bank processes, and the table is the list of current account balances. If you have a log of changes, you can apply those changes in order to create a table that captures the current state. This table records the latest state for each key (as of a particular point in log time). This is why the log is the more fundamental data structure: it can be used to create the original table and also to create all kinds of derived tables. (A "table" here can also be a keyed data store that is not relational.)
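The bank analogy above translates directly into a short sketch. The ledger contents and the function name `balances` are made up for illustration; `upto` is a hypothetical parameter that selects a point in log time:

```python
from collections import defaultdict


def balances(ledger, upto=None):
    """Fold ledger entries [0, upto) into a table of account -> balance."""
    table = defaultdict(int)
    for account, amount in ledger[:upto]:
        table[account] += amount
    return dict(table)


# Each entry is (account, amount): positive is a credit, negative a debit.
ledger = [("alice", 100), ("bob", 50), ("alice", -30), ("bob", 20)]

current = balances(ledger)           # table as of the end of the log
as_of_2 = balances(ledger, upto=2)   # table after only the first two entries
```

The same ledger could just as well be folded into other derived tables, for example a count of transactions per account, without touching the log itself.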

This process also works in reverse: if you have a table that takes updates, you can record these changes and publish a "changelog" of all the updates to the table's state. This changelog is exactly what you need to support near-real-time replicas. In this sense you can see tables and events as dual: tables hold data at rest, and logs capture change. The beauty of the log is that, if it is a complete record of changes, it holds not only the contents of the final version of the table but also allows every earlier version to be recreated. It is, in effect, a backup of every previous state of the table.
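The reverse direction can be sketched as follows. `LoggedTable` and `Replica` are hypothetical names, and the changelog is an in-memory list rather than a networked feed:

```python
class LoggedTable:
    """A table that publishes every update to a changelog."""

    def __init__(self):
        self.rows = {}
        self.changelog = []

    def put(self, key, value):
        self.rows[key] = value
        self.changelog.append((key, value))


class Replica:
    """Follows a changelog; `position` counts the entries applied so far."""

    def __init__(self):
        self.rows = {}
        self.position = 0

    def catch_up(self, changelog):
        for key, value in changelog[self.position:]:
            self.rows[key] = value
        self.position = len(changelog)


primary = LoggedTable()
replica = Replica()
primary.put("k1", "a")
primary.put("k2", "b")
replica.catch_up(primary.changelog)
primary.put("k1", "c")
stale = dict(replica.rows)            # replica state before the latest change
replica.catch_up(primary.changelog)
```

The replica only ever reads the entries after its own `position`, so a single number is enough to say how far behind the primary it is.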

This may remind you of source code version control. There is a close relationship between source control and databases. Version control solves a problem very similar to the one distributed data systems have to solve: managing distributed, concurrent changes in state. A version control system usually models a sequence of patches, which is in effect a log. You interact directly with a checked-out "snapshot" of the current code, which is analogous to the table. You will notice that in version control systems, as in other distributed stateful systems, replication happens through the log: when you update, you pull down just the patches and apply them to your current snapshot.

Some people have recently encountered these ideas through Datomic, a company that sells a log-centric database, and which has described how it applies them in its own system. These ideas are of course not unique to that system; they have been part of the distributed systems and database literature for well over a decade.

This may all seem a little theoretical. Don't despair! We will get to the practical material soon.

(Responsible editor: The good of the Legacy)
