An In-Depth Analysis of LinkedIn's Big Data Platform (Part 1)

Source: Internet
Author: User

I joined LinkedIn about six years ago, at a particularly interesting time. We were just starting to run up against the limits of our single, centralized database and needed to begin the transition to a suite of specialized distributed systems. It has been an interesting experience: we built, deployed, and still run today a distributed graph database, a distributed search backend, a Hadoop installation, and a first and second generation key-value store.

One of the most useful things I learned from all of this is that many of the systems we were building have a very simple concept at their core: the log. Sometimes called a write-ahead log, a commit log, or a transaction log, the log has been around almost as long as computers have, and it is at the heart of many distributed data systems and real-time application architectures.

Without understanding logs, you cannot fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any other software system; yet most software engineers are not familiar with them. I would like to change that. In this blog post, I will walk you through everything you need to know about logs, including what a log is and how to use logs for data integration, real-time processing, and system building.

Part I: What Is a Log?

A log is perhaps the simplest possible storage abstraction. It is an append-only sequence of records, totally ordered by time. The log looks as follows:

[Figure: a log, drawn as a row of sequentially numbered records, with new records appended at the right-hand end.]

Records are appended to the end of the log, and reads proceed from left to right. Each record is assigned a unique, sequential log entry number.
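The description above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any production log is implemented; the class name `Log` and its `append` and `read` methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class Log:
    """A hypothetical in-memory append-only log (illustration only)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record at the end and return its sequence number."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (sequence number, record) pairs in order, oldest first."""
        for offset in range(start, len(self._records)):
            yield offset, self._records[offset]


log = Log()
first = log.append("user 42 signed up")        # sequence number 0
second = log.append("user 42 updated profile")  # sequence number 1
entries = list(log.read())
```

There is deliberately no method to update or delete a record: the only write operation is an append, and the sequence number returned by `append` is the record's position in the order.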

The ordering of records defines a notion of "time," because records on the left are older than records on the right. The log entry number can be thought of as the "timestamp" of the record. Describing this ordering as a notion of time may seem odd at first, but it has the convenient property of being decoupled from any particular physical clock. This property turns out to be essential once we get to distributed systems.

For the purposes of this discussion, the content and format of the records are not important. Also, we cannot keep appending records to the log forever, because we will eventually run out of space. I will come back to this later.

A log is not all that different from a file or a table. A file is a sequence of bytes, a table is a sequence of records, and a log is really just a kind of table or file in which the records are sorted by time.

At this point, you may wonder why something so simple is worth discussing. How is an append-only sequence of records related to data systems at all? The answer is that logs have a specific purpose: they record what happened and when. For distributed data systems, this is in many ways the very heart of the problem.

But before we go any further, let me clear up something that can be confusing. Every programmer is familiar with another kind of logging: the unstructured error messages or trace information that an application writes to a local file using syslog or log4j. To distinguish the two, I will call this "application logging." The application log is a degenerate form of the log concept I am describing here. The biggest difference is that text logs are meant primarily for people to read, while the "log" or "data log" I am describing is built for programmatic access.

(In fact, if you think about it, the idea of people reading through logs on individual machines is something of an anachronism. This approach quickly becomes unmanageable when many services and servers are involved, and the purpose of logs quickly becomes to serve as input to queries and graphs for understanding behavior across many machines. For that purpose, English text in files is not nearly as suitable as the kind of structured log described here.)
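As a rough illustration of why structured records suit queries across machines, the sketch below counts errors per service over records gathered from several hosts. The host names, service names, and field names are all made up for this example.

```python
from collections import Counter

# Hypothetical structured records collected from several machines.
records = [
    {"host": "web-1", "service": "search", "level": "ERROR"},
    {"host": "web-2", "service": "search", "level": "INFO"},
    {"host": "web-2", "service": "profile", "level": "ERROR"},
    {"host": "web-3", "service": "search", "level": "ERROR"},
]

# Count errors per service across all hosts; no text parsing is required
# because every record already exposes its fields to the program.
errors_by_service = Counter(
    r["service"] for r in records if r["level"] == "ERROR"
)
```

Answering the same question from free-form text lines would first require parsing each message, which is exactly the step a structured log avoids.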

Logs in Databases

I don't know where the log concept originated; perhaps, like binary search, it seemed too simple for its inventor to realize it was an invention. It is present as early as IBM's System R. In databases, the log is used to keep the various data structures and indexes in sync in the presence of crashes. To make a change atomic and durable, the database writes information about the records it is about to modify to the log before applying the change to any of the data structures it maintains. The log is the record of what happened, and each table or index is a projection of that history into some useful data structure or index. Since the log is persisted immediately, it serves as the authoritative source for restoring all other persistent structures after a crash.
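The write-then-apply discipline described above can be sketched as follows. This is a toy illustration under simplifying assumptions, not how any real database implements its log: `TinyStore`, the JSON-lines file format, and the file name `wal.log` are hypothetical, and the sketch ignores partially written entries, transactions, and checkpointing.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical key-value store that logs every change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.table = {}  # the "projection" of the log into a useful structure
        self._recover()

    def _recover(self):
        """Rebuild the in-memory table by replaying the log from the start."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.table[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Persist the intended change to the log first.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then modify the data structure.
        self.table[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(log_path)
store.put("name", "alice")
store.put("name", "bob")

# Simulate a crash: the in-memory table is lost, but the log file survives.
recovered = TinyStore(log_path)
```

Because every change reaches the log before it reaches the table, replaying the log in order is enough to reconstruct the table after a restart.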

Over time, the use of the log grew from an implementation detail of ACID into a method for replicating data between databases. It turns out that the sequence of changes that happened on a database is exactly what is needed to keep a remote replica database in sync.
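The idea of keeping a replica in sync by replaying the primary's changes in order can be sketched like this. It is a simplified in-process illustration; the `Primary` and `Replica` classes and the `catch_up` method are hypothetical names, and real log shipping happens over a network with failure handling that is omitted here.

```python
class Primary:
    """Hypothetical primary database that records every change in a log."""

    def __init__(self):
        self.log = []   # ordered list of (key, value) changes
        self.data = {}

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value


class Replica:
    """Hypothetical replica that applies the primary's log in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, primary_log):
        # Apply only the entries not yet seen, in their original order.
        for key, value in primary_log[self.applied:]:
            self.data[key] = value
            self.applied += 1


primary = Primary()
replica = Replica()
primary.write("a", 1)
primary.write("b", 2)
replica.catch_up(primary.log)
primary.write("a", 3)
replica.catch_up(primary.log)
```

The replica only needs to remember how far into the log it has read; applying the same changes in the same order leaves it with the same state as the primary.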

Oracle, MySQL, and PostgreSQL all include log shipping protocols for transmitting portions of the log to replica databases. Oracle has also productized the log as a general data subscription mechanism for non-Oracle data subscribers, through XStream and GoldenGate, and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine-readable log has largely been confined to database internals. The use of logs as a mechanism for data subscription seems to have arisen almost by chance, yet this very abstraction is well suited to supporting all kinds of messaging, data flow, and real-time data processing.

(Responsible editor: Mengyishan)
