A Chief Engineer's Secrets: How Does the LinkedIn Big Data Backend Work? (Part 1)

Source: Internet
Author: User
Tags: version control system

This article is packed with substance and well worth studying. It is presented in four parts; patient reading is recommended.

Part I: What is a log?

Part II: Data integration

Part III: Logs and real-time stream processing

Part IV: System building

I joined LinkedIn about six years ago, at an exciting time. We were just beginning to run up against the limits of a single, centralized database and starting the transition to a portfolio of specialized distributed systems. It has been an exciting experience: we built, deployed, and to this day still run a distributed graph database, a distributed search backend, a Hadoop installation, and first- and second-generation key-value stores.

One of the most valuable things we learned along the way is that at the core of many of the things we built lies a simple idea: the log. Sometimes called a write-ahead log, a commit log, or a transaction log, the log has been around almost as long as computers themselves, and it sits at the heart of many distributed data systems and real-time application architectures.

You cannot fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any software system without understanding logs; and yet most software engineers are not familiar with them. I would like to change that. In this post, I will walk you through everything you need to know about logs: what a log is, and how to use logs for data integration, real-time processing, and system building.

Part I: What is a log?

A log is perhaps the simplest possible storage abstraction: an append-only sequence of records, ordered strictly by time. Pictured, a log is just a horizontal row of numbered slots, with new records added at the right-hand end.

We add records to the end of the log, and we read records from left to right. Each record is assigned a unique, sequential log record number that reflects this order.
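To make this concrete, here is a minimal sketch of the abstraction in Python (the class and method names are my own invention; the original article contains no code): an append-only list whose indexes serve as the record numbers.

    class Log:
        """A minimal append-only log: records are stored in arrival order."""

        def __init__(self):
            self._records = []

        def append(self, record):
            """Add a record at the end and return its unique record number."""
            self._records.append(record)
            return len(self._records) - 1  # doubles as the record's "timestamp"

        def read_from(self, number):
            """Read records from left to right, starting at a record number."""
            return self._records[number:]

    log = Log()
    log.append({"user": "alice", "action": "login"})   # record number 0
    log.append({"user": "bob",   "action": "search"})  # record number 1
    print(log.read_from(0))  # replays the full history, in order

Note that the only mutation allowed is appending at the end; nothing is ever updated in place.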

The ordering of records defines a notion of "time": records to the left are, by definition, older than records to the right. The record number can be thought of as the record's "timestamp". Describing this ordering as a notion of time may seem a little odd at first, but it has the convenient property of being decoupled from any particular physical clock, and that property turns out to be essential when we deal with distributed systems.

For the purposes of this discussion, the contents and format of the records are not important. One caveat: we cannot keep appending records to a log forever, since we will eventually run out of storage space. We will come back to this point later.

So a log is not all that different from a file or a table. A file is a sequence of bytes, a table is a collection of records, and a log is really just a kind of table or file in which the records are ordered by time.

At this point you may be wondering why it is worth discussing something so simple. How does an append-only, ordered sequence of records relate in any way to data systems? The answer is that the log has a specific purpose: it records what happened, and when. For many aspects of distributed data systems, this is the very heart of the problem.

But before we go any deeper, let me clear up a confusing point. Every programmer is familiar with another kind of logging: an application writing unstructured error messages or trace information to a local file using syslog or log4j. To distinguish the two, I will call that "application logging". Application logging is a degenerate form of the log I am describing here. The biggest difference is that a text log is meant primarily for humans to read, whereas the "log" or "data log" I am describing is built for programmatic access.

(In fact, if you think about it, the idea of humans reading through logs on individual machines is a bit of an anachronism. That approach quickly becomes unmanageable once many services and servers are involved, and the purpose of logging quickly shifts to serving as input for queries and graphs that make sense of behavior across many machines, a purpose for which English-language text in a file is far less suitable than the kind of structured log described here.)

Database Logs

I do not know where the log concept originated; perhaps, like binary search, it seemed too simple to its inventor to count as an invention. It appears as early as IBM's System R. In a database, the log is used to keep the system's various data structures and indexes in sync in the event of a crash. To make its operations atomic and durable, the database writes out the information it is about to modify to the log before applying the change to any of the data structures it maintains. The log records what happened, and each table or index is a projection of that history into some useful data structure. Since the log is made durable immediately, it can serve as the authoritative source from which all the other persistent structures are restored after a crash.
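As a hedged sketch of that write-ahead pattern (a toy in Python, not how System R or any real database implements it), the rule is simply: make the change durable in the log first, apply it to the data structure second, and rebuild from the log after a crash.

    import json, os

    class WALTable:
        """A toy key-value table protected by a write-ahead log."""

        def __init__(self, wal_path):
            self.wal_path = wal_path
            self.table = {}

        def put(self, key, value):
            # 1. Append the intended change to the log and force it to disk.
            with open(self.wal_path, "a") as wal:
                wal.write(json.dumps({"key": key, "value": value}) + "\n")
                wal.flush()
                os.fsync(wal.fileno())  # durable before the change is applied
            # 2. Only then modify the in-memory structure.
            self.table[key] = value

        def recover(self):
            # After a crash, replay the log to rebuild the table exactly.
            self.table = {}
            with open(self.wal_path) as wal:
                for line in wal:
                    entry = json.loads(line)
                    self.table[entry["key"]] = entry["value"]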

Over time, the use of the log grew from an implementation detail of ACID into a method for replicating data between databases. It turns out that the sequence of changes applied on one database is exactly what is needed to keep a remote replica fully in sync.

Oracle, MySQL, and PostgreSQL all include log-shipping protocols that transmit the log to standby replica databases. Oracle has also productized its log as a general-purpose data subscription mechanism, so that non-Oracle subscribers can consume data via XStreams and GoldenGate; similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine-readable log has largely been confined to database internals. Using logs as a data subscription mechanism seems to have arisen almost by accident. But this very abstraction is ideal for supporting all kinds of message transmission, data flow, and real-time data processing.

Distributed System Logs

The log solves two problems: ordering changes and distributing data, and both are especially important in distributed data systems. Agreeing on a consistent order for changes (or agreeing to let each subsystem keep its own order and living with the side effects of divergent copies of the data) is one of the core problems of distributed system design.

The log-centric approach to distributed systems arises from a simple observation, which I will call the state machine replication principle: if two identical, deterministic processes begin in the same state and receive the same inputs in the same order, they will produce the same outputs and end in the same state.

This may sound a bit abstract, so let's dig into what it really means.

Deterministic means that the processing is not timing-dependent and does not let any "outside" input influence its results. For example, a program whose output is affected by the particular order in which threads execute, by a call to gettimeofday, or by some other non-repeatable event is generally considered nondeterministic.

The state of the process is whatever data the process keeps on the machine, whether in memory or on disk, at the end of its processing.

The part about receiving the same inputs in the same order deserves attention; that is where the log comes in. Here is an important piece of intuition: if you feed the same log of inputs to two pieces of deterministic code, they will produce the same outputs.

The application to distributed computing is fairly obvious. You can reduce the problem of making multiple machines do the same thing to the problem of implementing a distributed, consistent log that feeds these processes their input. The purpose of the log here is to squeeze all non-determinism out of the input stream, so that every replica processing it stays in sync.
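The principle is easy to demonstrate in a few lines. In this sketch (Python, with invented names), two deterministic replicas consume the same log in the same order and necessarily arrive at the same state:

    class Replica:
        """A deterministic process: its state evolves only from logged input."""

        def __init__(self):
            self.state = 0

        def apply(self, entry):
            # No clocks, threads, or randomness: output depends only on input.
            self.state += entry

    shared_log = [5, -2, 7]  # the agreed-upon, ordered input stream

    a, b = Replica(), Replica()
    for entry in shared_log:
        a.apply(entry)
        b.apply(entry)

    assert a.state == b.state == 10  # same log, same order, same final state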

Once you understand it, the state machine replication principle is neither complicated nor esoteric: it more or less amounts to saying that "deterministic processing is deterministic". All the same, I think it is one of the more general tools available for distributed systems design.

One of the beautiful things about this approach is that the timestamps that index the log act as a clock for the state of the replicas: you can describe each replica with a single number, the timestamp of the last log entry it has processed. That timestamp, together with the log, uniquely determines the entire state of the replica.

There are many ways to apply this principle in a system, depending on what is written into the log. For example, we can log the incoming requests to a service, the state changes the service undergoes in response to requests, or the transformation commands it executes. In theory, we could even log the sequence of machine instructions, or the method names and arguments, for each replica to execute. As long as two processes handle these inputs in the same way, the replicas will stay consistent.

A thousand people will find a thousand uses for the log, and they describe its contents differently too. Database engineers typically distinguish between physical and logical logging. Physical logging records the actual contents of each changed row. Logical logging records not the changed rows themselves but the SQL statements that caused the changes (the INSERT, UPDATE, and DELETE statements).
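To illustrate (with invented record formats, not those of any real database), the same change to a row might appear in the two styles of log like this:

    # One change, logged two ways (formats are illustrative only):

    logical_entry = "UPDATE accounts SET balance = 90 WHERE id = 42;"

    physical_entry = {
        "table":  "accounts",
        "row_id": 42,
        "before": {"id": 42, "balance": 100},  # row contents before the change
        "after":  {"id": 42, "balance": 90},   # row contents after the change
    }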

Distributed systems literature broadly distinguishes two approaches to processing and replication. The "state machine model" usually refers to an active-active model: we log the incoming requests, and every replica processes each request itself. A slight variation on this, often called the "primary-backup model", instead elects one replica as the leader, lets the leader process requests as they arrive, and logs the state changes that result. The other replicas apply those state changes in order, staying in sync and ready to take over as leader should the leader fail.

To see the difference between the two approaches, consider a toy example: a replicated arithmetic service that maintains a single number as its state (initialized to zero) and applies additions and multiplications to that value. In the active-active approach, the replicas log the transformations to apply, such as "+1", "+2", "*2", and each replica applies them itself, going through the same sequence of values. In the active-passive approach, a single leader executes the transformations and logs the results, such as "1", "3", "6". This example also makes clear why ordering is the key to consistency among replicas: reordering an addition and a multiplication yields a different answer.
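Here is that toy example as a runnable sketch (Python, names invented for illustration). In active-active mode, the log carries the transformations and every replica executes them; in active-passive mode, the leader executes them and the log carries the resulting values:

    ops = [("+", 1), ("+", 2), ("*", 2)]  # the requests, in arrival order

    def execute(state, op):
        symbol, operand = op
        return state + operand if symbol == "+" else state * operand

    # Active-active: the log holds transformations; each replica applies them.
    replicas = [0, 0]
    for op in ops:
        replicas = [execute(state, op) for state in replicas]

    # Active-passive: the leader executes; the log holds the results.
    leader, result_log = 0, []
    for op in ops:
        leader = execute(leader, op)
        result_log.append(leader)      # logs 1, 3, 6

    assert replicas == [6, 6] and result_log[-1] == 6

Swap the "+2" and the "*2" and the final state becomes 4 instead of 6, which is exactly why the ordering of the log matters.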

A distributed log can be seen as the data structure that models the consensus problem, since a log represents a series of decisions about the next value to append. You have to squint a little to see the log in the Paxos family of algorithms, even though log-building is their most common practical application. With Paxos this is usually done via an extension of the protocol called multi-Paxos, which models the log as a series of consensus problems, one for each slot in the log. In other protocols, such as ZAB and Raft, the log is far more prominent: they directly model the problem of maintaining a distributed, consistent log.

I suspect our view of this has been biased by the path history took, perhaps because in the past few decades the theory of distributed computing outpaced its practical application. In reality, the consensus problem is a bit too simple: computer systems rarely need to decide a single value, they almost always handle a sequence of requests. So a log, rather than a simple single-value register, is the more natural abstraction.

Furthermore, the focus on the algorithms obscures the underlying log abstraction that systems actually need. I suspect we will end up treating the log as a commoditized building block, regardless of how it is implemented, much as we speak of a hash table without fussing over whether it uses linear probing or some other variant. The log will become a commodity interface, with many algorithms and implementations competing to provide the best guarantees and optimal performance.

Changelog 101: Tables and Events Are Dual

Let's return to databases. There is a fascinating duality between a log of changes and a table. The log is like the list of all the credits and debits a bank processes, while the table is the set of current account balances. If you have a log of changes, you can apply them in order to produce the table that captures the current state. The table records the latest state for each key (as of a particular point in log time). This is why the log is such a fundamental data structure: it can be used to create the base table, and also to create all kinds of derived tables. (It is also a meaningful way to store non-relational objects.)

The process works in reverse, too: if you have a table receiving updates, you can record the changes and publish a changelog of all the updates to the table's state. This changelog is exactly what you need to support near-real-time replicas. In this sense, you can see tables and events as dual: tables support data at rest, while logs capture change. The beauty of the log is that, as a complete record of the changes, it holds not only the final version of the table's contents but every version that ever existed. A log is, in effect, a series of backups of the table's historical states.
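A small sketch of this duality (Python, invented record shapes): folding a changelog forward materializes the table, and stopping the replay early yields any historical version of it.

    changelog = [
        {"key": "alice", "value": 100},
        {"key": "bob",   "value": 50},
        {"key": "alice", "value": 75},   # a later change supersedes the first
    ]

    def materialize(log, up_to=None):
        """Fold a changelog into a table; `up_to` picks a point in log time."""
        table = {}
        for entry in log[:up_to]:        # log[:None] replays the whole history
            table[entry["key"]] = entry["value"]
        return table

    print(materialize(changelog))      # {'alice': 75, 'bob': 50}   current state
    print(materialize(changelog, 2))   # {'alice': 100, 'bob': 50}  older version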

This may remind you of source code version control. There is a close relationship between version control and databases. Version control solves a problem very similar to the one distributed data systems have to solve: managing distributed, concurrent changes to state. A version control system is usually built around a sequence of patches, which is, in effect, a log. You interact directly with a checked-out "snapshot" of the current code, which is analogous to the table. And you will notice that, just as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.

Some people have encountered these ideas recently through Datomic, a company selling a log-centric database, which has given them a broad sense of how the ideas can be applied in a real system. The ideas are not unique to that system, of course; they have been part of the distributed systems and database literature for well over a decade.

All of this may seem a little theoretical. Do not despair! We will get to practical matters soon enough.

What's next

In the rest of this article, I will try to explain what logs are good for beyond the internals of distributed computing and abstract distributed computing models. That includes:

Data integration: making all of an organization's data easily available across all of its storage and processing systems.

Real-time data processing: computing over derived data streams.

Distributed system design: how practical systems can simplify their design by using a centralized log.

All of these uses work by treating the log as a standalone service.

In each case, the log's usefulness comes from the simple function it provides: a persistent, replayable record of history. Surprisingly, at the core of all of these is the ability to let many machines replay history at their own pace, in a deterministic way.



Http://www.pmtoo.com/data/2014/0305/5037.html

