Learning notes: The Log (one of the best distributed technical articles I've ever read)

Preface

This is a study note.
The learning material is a blog post about logs by Jay Kreps.
The original article is very long, but I persisted in reading it, gained a great deal, and came away deeply admiring Jay's technical ability, his architectural skill, and his profound understanding of distributed systems. At the same time, I was slightly pleased to find that some of my own understanding coincides with Brother Jay's views.

Jay Kreps is a former Principal Staff Engineer at LinkedIn, now co-founder and CEO of Confluent, and one of the main authors of Kafka and Samza.

These so-called notes are what I wrote while reading the article. Because the original's organization is so good, and its scientific and philosophical literacy so high, I privately think little of it can be omitted.

I. Sources of information

The Log: What every software engineer should know about real-time data's unifying abstraction

II. Notes

2.1 The value of the log

1) The log is at the core of systems such as: distributed graph databases, distributed search engines, Hadoop, and first- and second-generation key-value databases.

2) The log may be as old as computing itself, and it is the core of distributed data systems and real-time computing systems.
3) The log goes by many names: commit log, transaction log, write-ahead log.

4) Without understanding logs, you cannot fully understand databases, NoSQL stores, key-value stores, replication, the Paxos algorithm, Hadoop, version control, or, really, any software system.

2.2 What is a log?

2.2.1 Overview

Records are appended to the tail of the log and read from left to right. Each entry has a unique, ordered sequence number.

The order of the records defines a concept: time, because a record further to the left happened earlier.
The idea that an entry's sequence number can serve as a timestamp, and the order of records as time, may seem strange at first, but you will soon find that it conveniently decouples "time" from any particular physical clock.
A log is not that different from an ordinary file or table: a file is a sequence of bytes, a table is a set of records, and a log can be described as a file or table whose records are sorted chronologically.
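To make this concrete, here is a minimal sketch of such an append-only log in Python (my own illustration, not code from the original article): appending returns a monotonically increasing sequence number, and reads walk the records in order from a given offset.

```python
class Log:
    """A minimal append-only log: records plus ordered sequence numbers."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record at the tail; the returned offset acts as a logical timestamp."""
        self._records.append(record)
        return len(self._records) - 1  # sequence number of the new entry

    def read(self, offset=0):
        """Yield (sequence number, record) pairs in order, starting at offset."""
        for seq in range(offset, len(self._records)):
            yield seq, self._records[seq]


log = Log()
log.append({"op": "+1"})
log.append({"op": "*2"})
for seq, rec in log.read():
    print(seq, rec)  # 0 {'op': '+1'}, then 1 {'op': '*2'}
```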

Having said that, you might think a log is so simple that there is nothing to discuss.
In fact, the core meaning of the log is this:

The log records what happened, and when.

And this is usually the single most important thing in a distributed system.
Note that a few concepts need clarifying here. The logs discussed in this article are not the application logs (application logging) that programmers usually deal with, e.g., logs written to local files via log4j or syslog: those are typically unstructured, record error messages and debugging information for tracing a running application, and are meant to be read by people. The logs discussed in this article are accessed programmatically and are not meant for human eyes; they are sometimes called "journals" or "data logs". Application logs are a degenerate case of the logs discussed here.

2.2.2 Logs in databases

The origin of the log is unknown; like binary search, it is the kind of invention so simple that the inventor could hardly have realized it was an invention.
The log appears as early as IBM's System R.
In a database, a variety of data structures and indexes need to be kept in sync even across database crashes.
To ensure atomicity and durability, the database records what it intends to modify before committing a modification to the data structures and indexes.
So the log records what happened, and every table and index is a projection of that history.
Because the log is persisted immediately, it becomes the authoritative source for restoring the other persistent structures when a crash occurs.

From a mechanism for achieving the ACID properties, the log grew into a means of replicating data between databases.

Obviously, the sequence of data changes that occurs on a database is exactly the information needed to keep replicas in sync.
Oracle, MySQL, and PostgreSQL all include log-shipping protocols that send portions of the log to replica (slave) databases.
Oracle's XStreams and GoldenGate use the log as a general data-subscription mechanism, providing database changes to non-Oracle subscribers.
MySQL and PostgreSQL offer similar components, and such components sit at the core of many data-system architectures.
Machine-oriented logs can be used not only in databases, but also in: message systems, data flow, real-time computing.

2.2.3 Logs in distributed systems

The log solves two very important problems in distributed data systems:
1) ordered data changes
2) data distribution

The so-called state machine replication principle:

If two deterministic processes start from the same state and receive the same inputs in the same order, they will produce the same outputs and end in the same state.

The term "deterministic" means that the processing is independent of time and that its result is not affected by any extra input.
This is easiest to understand through non-deterministic counterexamples: multiple threads whose execution order differs between runs and therefore produce different results, a call to gettimeofday(), or any other operation that cannot be exactly reproduced (a small sketch of the resulting divergence follows).
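Here is a minimal sketch of why non-determinism breaks replication (my own illustration, not from the original article): replicas applying the same deterministic operations stay identical, while replicas that consult a local clock diverge.

```python
import time

def apply_deterministic(state, op):
    # Deterministic: the result depends only on the state and the operation.
    return state + op

def apply_nondeterministic(state, _op):
    # Non-deterministic: the result also depends on a local clock, so two
    # replicas executing "the same" input can disagree.
    return state + time.time_ns() % 1000

ops = [1, 2, 3]

a = b = 0
for op in ops:
    a, b = apply_deterministic(a, op), apply_deterministic(b, op)
assert a == b  # deterministic replicas remain in sync

a = b = 0
for op in ops:
    a, b = apply_nondeterministic(a, op), apply_nondeterministic(b, op)
print(a == b)  # almost certainly False: the replicas have diverged
```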

The so-called state can be any data on the machine, whether it ends up in the machine's memory or on disk when processing finishes.
That the same input, in the same order, produces the same result is worth your attention: it is exactly why the log is so important. It gives an intuitive guarantee: if you feed the same log into two deterministic programs, they will produce the same output.
In building distributed systems, awareness of this lets you reduce the problem of "making all the machines do the same thing" to:
building a distributed, consistent log system that provides input to all the processing systems.

The function of the log system is to squeeze all the non-determinism out of the input stream, ensuring that all the replicated nodes processing the same input stay in sync.

The best thing about this approach is that you can treat the log's indexes (timestamps) as the clock of all the replicated nodes:

by using the largest log timestamp a replica has processed as that replica's unique identifier, the timestamp combined with the log uniquely describes the entire state of that node.

There are many ways to apply this method: record the incoming requests to a service in the log, record the state changes the service undergoes in response to requests, or record the series of transformation commands the service executes, and so on.

In theory, we could even record the series of machine instructions executed, or the names and arguments of the methods called; as long as the processing is deterministic, these processes will remain consistent across nodes.
Database people distinguish between logical and physical logs:
Physical log: records every change to row contents.
Logical log: records not the changed contents themselves but the SQL statements that caused the changes (INSERT, UPDATE, DELETE, and so on).

For distributed systems, there are usually two approaches to replication and data processing:
1) the state machine model (active-active)
2) the primary-backup model (active-passive)

To understand the difference between the two approaches (the original post illustrates them with a figure), consider a simple example:
The cluster needs to provide a simple service that performs arithmetic operations such as addition and multiplication on a single number, initially, say, 0. Active-active: the operations themselves, such as "+1" and "*2", are written to the log, so every replica executes these operations to keep the final state consistent. Active-passive: a single master node executes the operations "+1", "*2", etc., and records the resulting values, such as "1", "3", "6", in the log (a sketch of both models follows).
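A minimal sketch of the two models (my own illustration, with operations chosen so the numbers work out): in the active-active model the log carries operations and every replica executes them; in the active-passive model only the master executes operations, and the log carries the resulting values.

```python
OPS = {"+1": lambda x: x + 1, "+2": lambda x: x + 2, "*2": lambda x: x * 2}
commands = ["+1", "+2", "*2"]  # applied to 0: 1, 3, 6

# Active-active (state machine model): the log carries the operations,
# and every replica executes them in log order.
op_log = list(commands)
replicas = [0, 0, 0]
for cmd in op_log:
    replicas = [OPS[cmd](state) for state in replicas]
assert replicas == [6, 6, 6]

# Active-passive (primary-backup model): only the master executes the
# operations; the log carries resulting values, which backups simply copy.
master = 0
result_log = []
for cmd in commands:
    master = OPS[cmd](master)
    result_log.append(master)
backups = [result_log[-1], result_log[-1]]
assert result_log == [1, 3, 6] and backups == [6, 6]
```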

The example above also reveals why order is the key to consistency between replicas: reordering these operations produces different results.
A distributed log can serve as the data structure at the heart of consistency algorithms: Paxos, ZAB, Raft, Viewstamped Replication.

A log represents a series of decisions about what the next value is.

2.2.4 Changelog

From the database's point of view, a changelog that records data changes and a table are dual and interconvertible.
1) By replaying a log of data changes, you can reconstruct a table in some state (it could equally be a record in a non-relational key-value store).
2) Conversely, when a table changes, the changes can be recorded into a log.
A minimal sketch of this duality follows.
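Here is a minimal sketch of the table/changelog duality (my own illustration): replaying the changelog rebuilds the table, and mutating the table through a small wrapper emits the changelog.

```python
def apply_changelog(changelog):
    """Replay a changelog of (key, value-or-None) updates to rebuild the table."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # a None value records a delete
        else:
            table[key] = value
    return table

class LoggedTable:
    """A table whose every mutation is also appended to a changelog."""

    def __init__(self):
        self.table, self.changelog = {}, []

    def put(self, key, value):
        self.table[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.table.pop(key, None)
        self.changelog.append((key, None))

t = LoggedTable()
t.put("user:1", "alice")
t.put("user:2", "bob")
t.delete("user:2")
assert apply_changelog(t.changelog) == t.table  # the two directions agree
```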

This is the secret behind real-time data replication.

This is very similar to what version control does: managing distributed, concurrent modifications to state.

A version control tool maintains patches that reflect modifications, which is effectively a log; you interact with a checked-out branch snapshot, which is the equivalent of a table in a database. You will notice that in version control, as in distributed systems, replication is log-based: when you update, you just pull the patches that reflect the version changes and apply them to your current branch snapshot.

2.3 Data integration

2.3.1 The meaning of data integration

Data integration means making all the data an organization has available in all its services and systems.

In fact, the effective use of data follows something like Maslow's hierarchy of needs.
The base of the pyramid is collecting the data and integrating it into the relevant processing systems (whether a real-time computing engine, text files, or a Python script).
This data, however, needs to be transformed into a uniform, normalized, clean format that is easy to read and process.
Once those requirements are met, you can start to consider various ways of processing the data, such as MapReduce or a real-time query system.
Obviously, without a reliable, complete flow of data, Hadoop is little more than an expensive, hard-to-assemble space heater.
Conversely, once the data flow is reliable, available, and complete, you can consider more advanced play: better data models, and consistent, better-understood semantics.
Then attention can shift to visualization, reporting, algorithms, and prediction (the digging and the depth).

2.3.2 Two complexities of data integration

Events

Event data records how things happened, not just what happened. This kind of log is often treated as application logging, because it is generally written by the application system, but that in fact conflates two different functions of logs.
Google's fortune, in effect, is created by a pipeline correlating (user) clicks and impressions, and clicks and impressions are exactly such events.

The explosion of specialized data systems

Why these systems exist: online analytics (OLAP), search, simple online storage, batch processing, graph analysis, and so on (e.g., Spark).

Obviously, feeding data into each of these systems is extremely difficult for data integration.

2.3.3 Log-structured data flow

Every data source, in the logical sense, can be modeled as a log.

A data source can be an application that records events (clicks and page views), or a database table that accepts changes.

Each subscriber obtains new records as quickly as possible from the log produced by these data sources, applies them to its local storage system, and advances its read offset in the log. A subscriber can be any data system: a cache, Hadoop, another site's database, a search engine, and so on.

The log, in effect, provides a logical clock: when data changes, the state of different subscribers can be measured against it, since each subscriber holds its own independent read offset in the log, which is like a "moment" in the sense of time.

Consider an example: a database, plus some cache servers.
The log provides a way to synchronize updates to all the cache servers, and to reason about the "moment" each of them has reached.

Suppose we write a record at log position X. To read from a cache server without getting stale data, we simply must not read from any cache until that cache has replicated (synced) up to position X. A small sketch follows.
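Here is a minimal sketch of that read rule (my own illustration; the names are hypothetical): each cache tracks the highest log offset it has applied, and a read carrying the writer's position X is served only by caches that have caught up to X.

```python
import random

class Cache:
    def __init__(self):
        self.applied_offset = -1  # highest log position applied so far
        self.data = {}

    def apply(self, offset, key, value):
        self.data[key] = value
        self.applied_offset = offset

def read_your_write(caches, key, min_offset):
    """Serve the read only from caches that have replicated past min_offset."""
    fresh = [c for c in caches if c.applied_offset >= min_offset]
    if not fresh:
        raise RuntimeError("no cache has caught up yet; retry later")
    return random.choice(fresh).data[key]

caches = [Cache(), Cache()]
x = 7                                   # log position of our write
caches[0].apply(x, "profile:42", "v2")  # only cache 0 has synced up to x
assert read_your_write(caches, "profile:42", x) == "v2"
```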

In addition, the log acts as a buffer, allowing producers and consumers to operate asynchronously.

One of the most important reasons to support asynchrony is that a subscribing system may crash, go offline for maintenance, and then come back online; each subscriber therefore consumes data at its own pace.

A batch system such as Hadoop or a data warehouse may consume data on an hourly or daily cadence, while a real-time system typically consumes data within seconds.
Neither the data source nor the log knows which subscribers consume the data, so subscribers can be added to and removed from the pipeline seamlessly.

More importantly, a subscriber only needs to know the log; it needs no knowledge whatsoever about the particulars of the data source it consumes, be it an RDBMS, Hadoop, or some newly fashionable key-value store.

The discussion is about logs rather than messaging systems, because different messaging systems guarantee different properties; the term "messaging system" cannot fully and precisely express certain semantics, since messaging systems are mostly about redirecting messages.

A log, however, can be understood as a messaging system that provides durability guarantees and strong ordering semantics; in communication systems this is called atomic broadcast.

2.4 At LinkedIn

LinkedIn's major systems at the time (note: 2013) included: search, the social graph, Voldemort (key-value storage), Espresso (document storage), the recommendation engine, the OLAP query engine, Hadoop, Teradata, and InGraphs (monitoring graphs and metrics services).

Each of these systems provides specialized, advanced functionality in its own domain.

(This section is long, and Brother Jay can really talk, so I'll just note the key points.)

1) The concept of a data stream was first introduced when building an abstract cache layer on top of Oracle database tables, to give search-engine index building and the social graph room to scale.

2) To better support some of LinkedIn's recommendation algorithms, the team began building a Hadoop cluster, but their experience there was shallow, and they took many detours.

3) At the beginning, the thinking was simple and crude: just pull the data out of the Oracle data warehouse and throw it into Hadoop. The results: first, exporting data quickly from the Oracle data warehouse is a nightmare; second, and worse, some of the data in the warehouse was handled badly, so Hadoop batch jobs did not produce the expected output, and Hadoop batch jobs are often irreversible, especially after reports have gone out.

4) Finally, the team abandoned taking data from the data warehouse and went directly to the databases and logs as the data sources. They then built a wheel of their own: the key-value store Voldemort.

5) Even plain data copying, hardly glamorous work, ate up a great deal of the team's time. Worse still, once any point in a data-processing pipeline had an error, Hadoop's results were immediately worthless: running algorithms on wrong data has only one consequence, producing more wrong data.

6) Even though the team built fairly general abstractions, each new data source still needed its own specific configuration, and this was the root of a great many errors and failures.

7) A flood of programmers wanted to follow up, and every programmer had a pile of ideas: integrate this system, add that feature, integrate that new functionality, or customize a data source.

8) Brother Jay began to realize:
First, although the pipelines they had built were still rough, they were extremely valuable. Merely making data available in a new system such as Hadoop unlocked a large number of possibilities: previously difficult computations became possible, and new products and analyses could be built simply by unlocking data stuck in other systems and combining it.

Second, it was clear that reliable data loading needed more solid support: if all the structure were captured, Hadoop data loading could be fully automated, with no manual work needed to add new data sources or handle schema changes. Data would magically appear in HDFS, and when new data sources were added, Hive tables would be generated automatically and adaptively with the right columns.

Third, data coverage was still far too low, because handling each new data source was so laborious.

9) To solve the data-loading problem each time a new data source joined, the team first tried wiring sources and destinations together directly (shown as a figure in the original post).

Soon they found this would not work: since data flow is usually two-way, in the form of publish and subscribe, produce and consume, it is an O(n^2) problem.
So what they needed was a model like this:

isolate each consumer from the data sources; ideally, consumers interact with only a single data repository, and that repository gives them access to arbitrary data.

10) Messaging system + log = Kafka. Kafka was born.

2.5 The relationship between logs, ETL, and the data warehouse

2.5.1 The data warehouse

1) A clean, structured, integrated data repository for analysis.
2) Although the idea is good, the way the data is obtained is somewhat outdated: periodically extract data from databases and convert it into a more readable format.
3) The problem with the classic data warehouse is that the clean data is highly coupled to the data warehouse.

A data warehouse should be a set of query capabilities serving reporting, search, and ad hoc analysis, including counting, aggregation, and filtering operations, so it is really more of a batch system.

But tightly coupling the clean data to such a batch system means that the data cannot be consumed by real-time systems, such as search-engine index building, real-time computing, and real-time monitoring systems.

2.5.2 ETL

Brother Jay believes that ETL really does only two things:

1) data extraction and cleaning, unlocking the data from the specific system that holds it;
2) restructuring the data so it can be queried through the data warehouse, e.g., coercing data types to fit a relational database, converting the schema into star or snowflake form, or breaking the data into a column-oriented storage format.

Coupling these two things together, however, is a big problem, because the integrated, clean data should also be consumable by other real-time systems, index-building systems, low-latency processing systems, and so on.

The data warehouse team is responsible for collecting and cleaning the data, but the producers of that data often emit data that is hard to extract and clean, because the warehouse team's processing requirements were never made clear to them.
At the same time, because the core business teams are not good at keeping in step with the rest of the company, real data coverage stays very low, the data flow is fragile, and it is hard to respond quickly to change.

So, the better way is:

If you want to do search over a clean dataset, monitor trends in real time, and alert in real time, an existing data warehouse or Hadoop cluster is unsuitable as the infrastructure. Worse, the ETL data-loading system built for the data warehouse is of no use to the other (real-time) systems.

The best model is to have the data cleaned before the publisher publishes it, because only the publisher knows what their data really is. And all the operations done at this stage should be lossless and reversible.

All real-time transformations that enrich semantics or add value should be done as post-processing after the raw log is published, including sessionizing event data or adding fields of general interest. The raw log remains usable on its own, and such real-time processing derives new, parameterized logs from it.

Finally, only the aggregation specific to the target system should happen as part of data loading, such as converting to a star or snowflake schema for analysis and reporting in the data warehouse. Because the data flow arriving at this stage is already clean and canonical (thanks to the log), this stage, which traditional ETL did in full, now becomes very simple.

2.6 Log files and events

The log-centric architecture has an additional benefit: it makes decoupled, event-driven systems easy to implement.

The traditional way to capture user activity and system changes is to write such information to text logs and then extract it into a data warehouse or Hadoop cluster for aggregation and processing. The problem is the same one described above for data warehouses and ETL: heavy coupling of the data to the data warehouse.

At LinkedIn, an event-handling system was built on Kafka. Hundreds of event types are defined for various actions, from page views, ad impressions, and search to service invocations and application exceptions.

To appreciate the benefits of the event-driven system described above, look at a simple example about events:
a job opportunity page presents an opportunity. The page should only be responsible for showing the opportunity, not for too much other logic. However, you will find that on a reasonably large website, this job can easily come to involve more and more logic unrelated to showing the opportunity.

For example, suppose we want to integrate the following system functions:
1) We need to send the data to Hadoop and the data warehouse for offline processing.
2) We need to count page views, and to make sure some of those views are not scrapers crawling the page content.
3) We need to aggregate the views of this page and present them on the opportunity publisher's analytics page.
4) We need to record each user's views of this page, to make sure we show that user valuable and relevant job opportunities rather than the same one over and over (think of those flashing red-green-blue banner ads with hot DJ dance moves and girls in miniskirts that turn out, once you click, to be pure title bait).
5) Our recommendation system needs to record the view history of this page to correctly track the popularity of this job opportunity.

Very quickly, the page logic that only needed to show an opportunity becomes complicated. When we add the opportunity page on mobile, the logic has to be ported over too, which multiplies the complexity. And worse, the tangled consequence is that the engineers working on this page need knowledge of all the other systems, to make sure all those functions are integrated correctly.

This is a very simple example; in practice, the situation only gets more complex.
Event-driven design makes this easier.

The page responsible for presenting an opportunity only needs to present it and record some factors related to the presentation, such as the relevant attributes of the job, who viewed the page, and other useful information related to the rendering. The page does not need any knowledge or understanding of the other systems, such as the recommendation system, the security system, the opportunity publisher's analytics, and the data warehouse; these all simply become subscribers that subscribe to this event and handle it independently. The presenting page does not need to change when a new subscriber or consumer is added.

2.7 Building a scalable log

Separating publishers and subscribers is nothing new, but ensuring that multiple subscribers can process messages in real time while the log system stays scalable is a hard thing.

If the log cannot be built to be fast, cheap, and scalable, then everything nice built on top of this log system is idle talk.

Many people may think that a log system is a slow, heavyweight thing in a distributed system, suitable only for the kind of metadata that ZooKeeper is appropriate for handling.

But LinkedIn at the time (note: 2013) was handling 60 billion distinct message writes per day in Kafka (several hundred billion if you count the writes from mirroring between data centers).

How did they do that, Brother Jay?

1) partitioning the log
2) optimizing throughput through batched reads and writes
3) avoiding unnecessary data copies

Scalability is provided by cutting the log into multiple partitions:

1) Each partition is an ordered log, but there is no global order between partitions.

2) Which partition a message is written to is entirely controlled by the writer, typically segmented by some key such as user_id (see the sketch after this list).

3) Partitioning lets log appends proceed without coordination between shards (partitions), while ensuring that system throughput scales linearly with the Kafka cluster size.

4) Although there is no global order (with thousands of consumers or subscribers, discussing a global order is generally worthless anyway), Kafka guarantees that the order in which the sender writes messages to a partition is the order in which the messages are delivered from that partition.

5) Each partition is replicated the configured number of times; if a leader node dies, one of the other nodes becomes the new leader.

6) A log, like a file system, can be optimized for linear read and write patterns; small log reads and writes can be grouped into larger, high-throughput operations. Kafka does this aggressively: batching happens in every scenario, from clients sending data to the server, to writing data to disk, to replicating data between servers, to transferring data to consumers, and to acknowledging committed data.

7) Finally, Kafka uses a very simple binary format for the log in memory, the log on disk, and the log sent over the network, which enables all sorts of optimization techniques, such as zero-copy data transfer.
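A minimal sketch of key-based partitioning (my own illustration, not Kafka's actual implementation): hashing the key picks the partition, and each partition independently preserves append order, so per-key order holds even though there is no global order.

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each one is an ordered log

def partition_for(key):
    # A stable hash of the key chooses the partition, so all messages for
    # one key land in the same partition and stay in send order.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def send(key, message):
    p = partition_for(key)
    partitions[p].append((key, message))
    return p, len(partitions[p]) - 1  # (partition, offset) of the append

send("user_42", "viewed job 1")
send("user_7", "viewed job 9")
send("user_42", "applied to job 1")

p = partition_for("user_42")
user_42_events = [m for k, m in partitions[p] if k == "user_42"]
assert user_42_events == ["viewed job 1", "applied to job 1"]  # per-key order holds
```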

All these optimizations together let you read and write data at nearly the full capacity the disk or network can provide, even when memory is saturated.

2.8 Logs and real-time processing

Did you think Brother Jay provided such a beautiful way of copying data just so the data could be copied?
You. Are. Wrong.

"Log" is another word for "stream", and logs are the core of stream processing.

2.8.1 What is stream processing

Brother Jay thinks:
1) Stream processing is infrastructure for continuous data processing.
2) The computational model of stream processing can be as general as MapReduce or any other distributed processing framework; it just needs to guarantee low latency.
3) Batch-style data collection leads to batch-style processing.
4) Continuous data collection leads to continuous processing.
5) Brother Jay uses the census as an example to explain batch processing.

At LinkedIn, both activity data and database changes are continuous.
Daily batch processing of data is just continuous computation with the window set to one day.

So, stream processing is processing that:
6) carries a concept of time in the processing, does not need to hold a static snapshot of the data, and can therefore produce output at a user-defined frequency, without waiting for the dataset to reach some "end" state.
7) In this sense, stream processing is a generalization of batch processing, and given the prevalence of real-time data, an extremely important generalization.
8) Many commercial companies fail to build stream processing engines mostly because they cannot build stream data collection engines.
9) Stream processing bridges the gap between real-time request/response services and offline batch infrastructure.
The log system solves many of the key problems in the stream processing model; the biggest of these is how to make data available in real time to multiple subscribers (stream data collection).

2.9 The data flow graph

The most interesting thing about stream processing is that it expands the notion of what a data source (feed) is.
Raw data, whether logs, feeds, events, or row-level data records, comes from the activity of applications.
But stream processing also lets us process data derived from other feeds; to a consumer, such derived feeds look no different, and they can embody any degree of complexity.

A stream processing job should be: reading data from logs and writing output to logs or to other systems.

With logs as inputs and outputs connecting each processing stage to the others, the stages form a graph.

In fact, a log-centric system lets you view all the data capture, transformation, and data flow within a company or organization as a combination of logs and the processing that writes to them.

A stream processor does not have to be grand: it can be a process or a set of processes, though some additional infrastructure and support are handy for managing the processing code.

The logs are introduced for two purposes:

1) They ensure that the datasets can support multiple subscribers, in order.
2) They can serve as buffers. This is important: in asynchronous data processing, if an upstream producer emits data faster than a consumer can keep up with, we must either block the processing, introduce a buffer, or drop data.
Dropping data does not seem like a good choice, and blocking would make the processing of every data flow in the graph grind to a halt. The log, being a big, very big, very very big buffer, lets a process be restarted, and lets a failed process not affect the other processes in the flow graph. This is critical for scaling data flow across a large organization: different teams have different processing tasks, and clearly the whole flow must not get jammed because one task has an error that goes unfixed.

Storm and Samza are exactly this kind of stream processing engine, and both can use Kafka or other similar systems as their log system.

(Note: Brother Jay is formidable: first Kafka, then Samza.)

2.10 Stateful real-time processing

Many stream processing engines are stateless and process one record at a time, but many use cases require complex counts, aggregations, and joins over some time window of the stream.
For example, joining a click stream against user information.

Such use cases, then, require state support: wherever the data is processed, the state of the data must be maintained (a small sketch follows).
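A minimal sketch of a stateful stream join (my own illustration; the event shapes are hypothetical): the processor maintains a local table of user info from one stream and uses it to enrich each click from the other.

```python
# Local state: a user-info table maintained from one input stream.
users = {}

def on_user_update(user_id, info):
    users[user_id] = info  # keep the latest profile per user

def on_click(user_id, page):
    # Join each click against the locally maintained user table.
    info = users.get(user_id, {"country": "unknown"})
    return {"user_id": user_id, "page": page, "country": info["country"]}

on_user_update("u1", {"country": "US"})
on_user_update("u2", {"country": "DE"})

clicks = [("u1", "/jobs/1"), ("u2", "/jobs/9"), ("u3", "/jobs/1")]
enriched = [on_click(u, p) for u, p in clicks]
assert enriched[0]["country"] == "US"
assert enriched[2]["country"] == "unknown"  # no profile seen yet for u3
```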

The problem is how to keep the state correct when the processor can die at any moment.

Keeping the state in memory may be the simplest option, but it cannot survive a crash.

If state is only maintained within one window, then on a crash or failure the processing can simply be replayed from the start of the window; but if the window is an hour long, this may not be workable.

Another easy approach is to keep the state in a remote storage system or database, but this loses data locality and generates a lot of network round-trips.

Recall the duality of tables and logs in databases, mentioned earlier.
A stream processing component can use local storage or an index to maintain its state: BDB, LevelDB, Lucene, FastBit.

By recording a changelog of this local index, the state can be recovered after a crash. This mechanism actually reveals a generalized form of state: state that can be stored as any kind of index, co-partitioned with the input stream.
When the processing process crashes, it can recover its index from the changelog; the changelog plays the role of turning local state into an incremental, time-based backup.

This mechanism also provides a very elegant capability: the state of a process can itself be published as a log, and other processes can of course subscribe to this state (a small sketch follows).
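A minimal sketch of changelog-backed state (my own illustration): the processor keeps a per-page count in a local dict, mirrors every update into a changelog, and after a crash rebuilds the dict by replaying the changelog.

```python
changelog = []  # durable, append-only record of every state update
state = {}      # local index: count of views per page

def process(page):
    # Update the local state and mirror the new value into the changelog.
    state[page] = state.get(page, 0) + 1
    changelog.append((page, state[page]))

def recover(log_entries):
    # Rebuild the local index by replaying the changelog after a crash.
    rebuilt = {}
    for page, count in log_entries:
        rebuilt[page] = count  # later entries overwrite earlier ones
    return rebuilt

for page in ["/jobs/1", "/jobs/2", "/jobs/1"]:
    process(page)

state = None                # simulate the processor crashing
state = recover(changelog)  # restore the state from the changelog
assert state == {"/jobs/1": 2, "/jobs/2": 1}
```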

Combined with the database log techniques mentioned above, this makes data integration scenarios very powerful:

by extracting the log from a database and indexing it in various stream processing systems, it becomes possible to join different streams of events.

2.11 Log compaction

Obviously, it is not feasible to keep the complete history of state changes in the log forever.

Kafka uses log compaction, a form of log garbage collection:

1) For event data, Kafka retains only a window of data (configurable by time, e.g., a few days, or by space).
2) For keyed updates, Kafka uses compaction. A log of this kind can be replayed into another system to reconstruct the state of the source system.

If the full-history log were kept, the data would grow larger and larger as time goes on.
Instead of simply discarding old log entries, Kafka compacts: it discards obsolete records, that is, records whose primary key has a more recent update (a small sketch follows).
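A minimal sketch of keyed-update compaction (my own illustration, not Kafka's actual algorithm): for each key, keep only the most recent record, preserving the records' relative order.

```python
def compact(log_entries):
    """Keep only the latest record per key, preserving relative order."""
    latest = {}
    for offset, (key, value) in enumerate(log_entries):
        latest[key] = (offset, value)  # a later offset overwrites an earlier one
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda item: item[1][0])]

log_entries = [
    ("user:1", "email=a@x"),
    ("user:2", "email=b@x"),
    ("user:1", "email=a2@x"),  # newer update for user:1
]
assert compact(log_entries) == [("user:2", "email=b@x"),
                                ("user:1", "email=a2@x")]
```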

2.12 Building systems

2.12.1 Distributed systems

The log plays a consistent role in distributed databases, data flow systems, and data integration: it abstracts the data flow, maintains data consistency, and provides data recovery.

You can view all the application systems and data flows across your organization as, in aggregate, a single distributed database.
Query-oriented standalone systems, such as Redis, SOLR, Hive tables, and so on, become special indexes over the data.
Stream processing systems such as Storm and Samza become a carefully designed trigger or materialized-view mechanism.

The explosive emergence of data systems of every variety is new, but the complexity has in fact long existed.
Even in the heyday of relational databases, a company or organization would run many varieties of relational databases alone.

Obviously, it is impossible to throw everything into one Hadoop cluster and expect it to solve all problems. So building a good system might look like this:

build a distributed system in which each component is a small cluster; each cluster does not necessarily provide complete security, performance isolation, or good scalability on its own, but each problem gets (professionally) solved.

Brother Jay thinks systems of all kinds exploded in number because building a powerful distributed system is hard. If each use case is limited to something simple, such as a single query scenario, each system has enough capability to solve its own problem, but integrating all these systems is hard.

Brother Jay thinks there are three possible ways systems will be built in the future:

1) Maintain the status quo. In this case, data integration remains the biggest problem, so an external log system (Kafka!) is all the more important.
2) A single powerful system appears (like the relational database in its glory days) that solves all problems. This seems somewhat unlikely to happen.
3) Most of the new generation of systems are open source, and this reveals a third possibility: data infrastructure can be broken up into a set of services plus application-facing system APIs, with each service specialized and individually incomplete, but able to solve its specific problem professionally. You can already discern this in the existing Java stack: ZooKeeper solves the synchronization and coordination problems of distributed systems (possibly aided by higher-level abstractions such as Helix and Curator); Mesos and YARN address virtualization and resource management; embedded components Lucene and LevelDB solve indexing; Netty, Jetty, and the higher-level abstractions Finagle and rest.li solve remote communication; Avro, Protocol Buffers, Thrift, and countless others solve serialization; and Kafka and BookKeeper provide the backing log.

In a sense, building such a distributed system is like a version of Lego bricks. It obviously has little to do with end users, who care more about the API, but it reveals a way to build a powerful system while keeping it simple:
obviously, if the time to build a distributed system drops from a few years to a few weeks, the complexity of building a single, independent, enormous system disappears, and that can only happen because more reliable and more flexible "bricks" have appeared.

2.12.2 The log's position in system construction

If a system has the support of an external log system, then each individual system can share the log to reduce its own complexity. Brother Jay thinks the roles of the log are:

1) Handling data consistency. Both immediate and eventual consistency can be achieved by serializing concurrent operations on the nodes.

2) Providing data replication between nodes.

3) Providing "commit" semantics. For example, a write operation that has been acknowledged will not be lost.

4) Providing a subscribable data source (feed) for external systems.

5) Providing the ability, when a node loses data due to a failure, to recover it or to build a new replica.

6) Handling load balancing between nodes.

The above is probably most of the functionality a complete distributed system should provide (Brother Jay really does love logs); what remains is the client API and things like index construction, e.g., a full-text index needs to pull in all the partitions, whereas a query against a primary key only needs to fetch data from one partition.

(That remainder is no small matter either. Brother Jay is mighty.)

The system can be divided into two logical components (what powerful understanding and skill this takes):

1) the log layer
2) the serving layer

The log layer captures state changes serially and in order, while the serving layer stores the indexes needed to answer external queries, e.g., a key-value store might need B-tree or SSTable indexes, and a search service needs an inverted index.

Writes can go to the log layer directly, or be proxied through the serving layer. Writing to the log produces a logical timestamp (the log's index), for example a numeric ID. If the system is partitioned, then the serving layer and the log layer will have the same number of partitions (even though their respective machine counts may differ).

The serving layer subscribes to the log layer and, as fast as possible and in the order the log stores them, applies the data and state changes to its own local index.

The client gets read-your-write semantics:

on any single node, by carrying the timestamp of its write with the query. The serving node that receives the query compares the timestamp with its own local index and, if necessary, delays executing the request until its index has synced up to that timestamp, to avoid returning stale data (a small sketch follows).
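A minimal sketch of that delay rule at a serving node (my own illustration; the names are hypothetical): the node applies log entries in order and holds a query until its applied position has reached the query's write timestamp.

```python
import threading

class ServingNode:
    def __init__(self):
        self.applied = -1  # highest log index applied locally
        self.index = {}    # the local queryable index
        self._caught_up = threading.Condition()

    def apply(self, log_index, key, value):
        with self._caught_up:
            self.index[key] = value
            self.applied = log_index
            self._caught_up.notify_all()

    def get(self, key, write_timestamp, timeout=5.0):
        """Delay the read until this node has applied the log up to write_timestamp."""
        with self._caught_up:
            if not self._caught_up.wait_for(
                    lambda: self.applied >= write_timestamp, timeout):
                raise TimeoutError("replica has not caught up to the write yet")
            return self.index.get(key)

node = ServingNode()
node.apply(0, "k", "v0")
node.apply(1, "k", "v1")  # suppose our write landed at log index 1
assert node.get("k", write_timestamp=1) == "v1"
```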

The serving-layer nodes may or may not need the concept of a leader. For many simple use cases, the serving layer builds no leader nodes at all, because the log is the source of truth.

There is also the question of how to handle recovery after a node failure. One approach is to keep a fixed-size time window of data in the log while maintaining snapshots of the data. Another is to let the log keep a full copy of the data and use log compaction so that the log garbage-collects itself. This approach moves much of the serving layer's complexity into the log layer, since the serving layer is system-specific while the log layer can be general-purpose.

Such a log-based system also provides a complete, developer-ready subscription API to the contents of the data store, which can serve as an ETL data source for other systems, and which other systems can subscribe to.

The full stack (shown as a figure in the original post):

Clearly, a distributed system with the log at its core immediately becomes able to provide data loading and data flow processing to other systems. Similarly, a stream processing system can consume multiple data streams at once, and serve the data outward via indexes built from those streams or by outputting them into yet other systems.

Building the system as a log layer plus a serving layer decouples query-related concerns from concerns of availability, consistency, and the like.

Many people may think that keeping a separate copy of the data in the log, especially a complete copy, is too wasteful and extravagant, but that is not the case:

1) LinkedIn's (note: 2013) Kafka production clusters maintained 75 TB of data per data center, while the application clusters required more storage space and better storage conditions (SSDs plus more memory) than the Kafka clusters.
2) A full-text search index is best loaded entirely into memory, while logs, being purely linear reads and writes, can take advantage of cheap, high-capacity disks.
3) Because the Kafka cluster actually operates under the multi-subscriber model, with many systems consuming the data, the cost of the log cluster is amortized.
4) For all these reasons, the overhead of an external log system (Kafka or something similar) turns out to be very small.

2.13 Conclusion

At the end, Brother Jay not only leaves many academically and practically valuable papers and reference links, but also, very humbly, this sentence:

If you made it this far you know most of what I know about logs.

End.
