Learning notes: The Log (one of the best distributed technical articles I've ever read)



Preface


This is a study note.
The source material is Jay Kreps' blog post on the log.
The original is very long, but I stuck with it and got a lot out of it; I came away deeply impressed by Jay's technical skill, architectural sense, and grasp of distributed systems, and, where my own understanding happened to coincide with his views, a little pleased with myself.



Jay Kreps is a former Principal Staff Engineer at LinkedIn, co-founder and CEO of Confluent, and one of the main authors of Kafka and Samza.



These are notes in the literal sense: read the article, write down what sticks. Jay organized the piece so well, and his scientific and philosophical literacy is so high, that very little of my own commentary is mixed in.


I. Sources of information


The Log: What every software engineer should know about real-time data's unifying abstraction


II. Notes

2.1 The value of the log


1) The log is at the core of systems such as:


    • Distributed Graph Database
    • Distributed search engine
    • Hadoop
    • First generation and second generation K-V databases


2) Logs may be as old as computers themselves, and they sit at the core of distributed data systems and real-time computing systems.
3) The log goes by many names:


    • Commit Log
    • Transaction Log
    • Write-ahead Log


4) If you don't understand logs, you can't fully understand:


    • Database
    • NoSQL Storage
    • K-V Storage
    • Replication
    • Paxos algorithm
    • Hadoop
    • Version Control
    • Or, any software system
2.2 What is a log?

2.2.1 Overview




    • Records are appended to the end of the log.
    • Records are read from left to right.
    • Each entry is assigned a unique, sequential log entry number.


The ordering of the records defines a notion of time: records further to the left are older.
The sequence number of an entry can serve as its timestamp. Treating the order of records as time may seem strange at first, but you will soon find it convenient that this notion of "time" is decoupled from any particular physical clock.
A log is not so different from an ordinary file or a table.


    • A file is a sequence of bytes
    • A table is a set of records
    • A log can be thought of as a file or table whose records are sorted by time
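To make this concrete, here is a toy sketch of my own (not code from Jay's article) of an append-only log in which an entry's sequence number doubles as its logical timestamp:

    # Toy append-only log: sequence numbers act as logical timestamps.
    class Log:
        def __init__(self):
            self.entries = []                    # records in append order

        def append(self, record):
            self.entries.append(record)
            return len(self.entries) - 1         # sequence number of the new entry

        def read(self, offset):
            # Yield every record at or after `offset`, oldest first.
            for seq in range(offset, len(self.entries)):
                yield seq, self.entries[seq]

    log = Log()
    log.append({"user": "alice", "action": "click"})   # seq 0
    log.append({"user": "bob", "action": "view"})      # seq 1
    for seq, rec in log.read(0):
        print(seq, rec)                                 # oldest (leftmost) records first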


Put that way, you might think a log is too simple to be worth discussing. In fact, the core meaning of the log is this:


Logs record what happened, and when.


This is usually the most central thing in a distributed system.
Note that a few concepts need clarifying here:


    • The logs discussed in this article are not the application logs programmers usually deal with.
    • Application logs are typically unstructured text recording errors, debug information, and traces of what the application did; they are written for humans, for example a local file written via log4j or syslog.
    • The logs discussed here are meant to be accessed programmatically, not read by people; think "journal" or "data log".
    • Application logs are really just a degenerate special case of the logs discussed here.
2.2.2 Logs in databases


The origin of the log is unclear, probably because, like binary search, it is too simple for its inventor to realize it was an invention.
Logs appear as early as IBM's System R.
A database needs to keep its various data structures and indexes consistent even when it crashes.
To guarantee atomicity and durability, the database records what it intends to modify before applying the change to its data structures and indexes.
The log thus records what happened and when, and each table and index is just a projection of that history into some useful structure.
Because the log is persisted immediately, it becomes the reliable source from which all the other persistent structures are restored after a crash.


The log grew from an implementation detail supporting ACID properties into a mechanism for replicating data between databases.


It turns out that the sequence of changes made to a database is exactly the information needed to keep replicas in sync.
Oracle, MySQL, and PostgreSQL all include log shipping protocols that send portions of the log to replica (slave) databases.
Oracle's XStreams and GoldenGate use the log as a general data-subscription mechanism, exposing database changes to non-Oracle subscribers.
MySQL and PostgreSQL provide similar components, and they sit at the heart of many data system architectures.
Machine-oriented logs are useful well beyond databases, for example in:


    • Messaging systems
    • Data feeds (data flow)
    • Real-time computing
2.2.3 Logs in distributed systems


The log solves two very important problems in distributed data systems:
1) Ordering data changes
2) Distributing data



The so-called State Machine Replication Principle:


If two deterministic processes start from the same state and receive the same input in the same order, they will produce the same output and end in the same state.


"Deterministic" means that the processing is time-independent and its result is not affected by any out-of-band input.
Non-determinism is easier to grasp through examples:


    • Thread interleavings that lead to different results across runs
    • Calls to gettimeofday()
    • Anything else that cannot be reproduced


The "state" can be any data on the machine at the end of processing, whether in memory or on disk.
"The same input, in the same order, produces the same result" deserves your attention; it is exactly why the log matters, and it is an intuitive idea: feed the same log to two deterministic programs and they will produce the same output.
When building distributed systems, this lets you reduce the problem of "make all the machines do the same thing" to:
build a distributed, consistent log that supplies the input to all the processing systems.


The purpose of the log here is to squeeze all the non-determinism out of the input stream, ensuring that every replica processing the same input stays in sync.


A nice property of this approach is that the timestamps indexing the log now act as a clock for all the replicas:


The state of each replica can be described by a single number: the timestamp of the largest log entry it has processed. That timestamp, combined with the log, uniquely captures the entire state of the replica.


There are many ways to apply this approach:


    • Log the incoming requests to a service
    • Log the state changes a service makes in response to a request
    • Log the sequence of transformation commands the service executes, and so on


In theory we could even log a sequence of machine instructions, or the names and arguments of the methods invoked; as long as the processing is deterministic, replaying it keeps the nodes consistent.
Database people often distinguish between physical and logical logging:


    • Physical logging: record every change made to the contents of the rows.
    • Logical logging: record not the changed rows themselves but the SQL statements that caused the changes (INSERT, UPDATE, DELETE, and so on).


Distributed systems generally take one of two approaches to replication and data processing:
1) The state machine model (active-active)
2) The primary-backup model (active-passive)



The original post illustrates the two with a figure.



To understand the difference, consider a toy example:
the cluster provides a simple arithmetic service (addition, multiplication, and so on) over a single number, which starts at 0.


    • Active-active: the operations themselves (e.g. "+1", "*2") are written to the log, and every replica executes them, so the final state is consistent everywhere.
    • Active-passive: a single master node executes the operations and logs only the resulting values (e.g. "1", "3", "6"), which the replicas then apply.


This example also shows why ordering is the key to consistency between replicas: reorder those operations and you get a different result. A toy version of the two models is sketched below.
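Here is my own sketch of the two models (the exact operations are made up so the results come out as "1", "3", "6" as above):

    # State machine model (active-active): the log carries the operations.
    ops_log = [("+", 1), ("+", 2), ("*", 2)]             # illustrative operations

    def apply_ops(state, log):
        for op, arg in log:                               # every replica replays the same ops
            state = state + arg if op == "+" else state * arg
        return state

    replica_a = apply_ops(0, ops_log)
    replica_b = apply_ops(0, ops_log)
    assert replica_a == replica_b == 6                    # same log, same order -> same state

    # Primary-backup model (active-passive): only the master executes the ops,
    # and the log carries the resulting values.
    results_log = []
    state = 0
    for op, arg in ops_log:
        state = state + arg if op == "+" else state * arg
        results_log.append(state)                         # log contents: 1, 3, 6

    backup_state = results_log[-1]                        # a backup just takes the latest value
    assert backup_state == 6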
The distributed log can be seen as the data structure underlying consensus algorithms such as:


    • Paxos
    • ZAB
    • Raft
    • Viewstamped Replication


A log models a series of decisions about what the next value to append should be.


2.2.4 Changelog


From a database point of view, a changelog (the set of records describing data changes) and a table are dual and interconvertible:
1) From a log of changes, you can reconstruct a table in some state (or a keyed record in a non-relational store).
2) Conversely, when a table changes, those changes can be recorded into a log.
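A small sketch of this duality (my own illustration): replaying a changelog rebuilds the table, and diffing two versions of a table yields a changelog.

    # Rebuild a keyed table from a changelog of (key, value) updates; value None = delete.
    changelog = [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alice2"), ("user:2", None)]

    def replay(changelog):
        table = {}
        for key, value in changelog:
            if value is None:
                table.pop(key, None)                      # a delete removes the row
            else:
                table[key] = value                        # an upsert overwrites the row
        return table

    print(replay(changelog))                              # {'user:1': 'alice2'}

    # Conversely, diffing two snapshots of a table yields the changes to log.
    def diff(old, new):
        changes = [(k, v) for k, v in new.items() if old.get(k) != v]
        changes += [(k, None) for k in old if k not in new]
        return changes

    print(diff({"user:1": "alice"}, {"user:1": "alice2"}))   # [('user:1', 'alice2')]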



This duality is exactly the trick behind near-real-time replication!



This is very similar to what version control does: managing distributed, concurrent changes to state.



A version control tool maintains patches that reflect changes; that is effectively a log. You interact with a checked-out branch snapshot, which is analogous to a table in a database. You will notice that in version control, as in distributed systems, replication is log-based: when you update, you simply pull down the patches that reflect the changes and apply them to your current snapshot.


2.3 Data integration

2.3.1 What data integration means


Data integration means making all the data an organization has available to every service and system that needs it.



Making effective use of data follows a kind of Maslow's hierarchy of needs.
At the bottom of the pyramid is capturing the data and getting it into the systems that can process it (whether that is a real-time computing engine, text files, or a Python script).
That data then needs to be carried in a uniform, standardized, clean format that is easy to read and process.
Once those needs are met, you can start to look at different ways of processing the data, such as MapReduce or a real-time query system.
Obviously, without a reliable, complete flow of data, a Hadoop cluster is little more than a very expensive and difficult-to-assemble space heater.
Conversely, once the data flow is reliable, available, and complete, you can move on to more advanced things: better data models and consistent, well-understood semantics.
Then attention can shift to visualization, reporting, algorithms, and prediction (the digging, drilling, and deep stuff).


2.3.2 Two complications of data integration


The rise of event data



Event data records things that happen rather than things that are. This kind of log is often lumped in with application logging because it is usually written out by the application, but doing so conflates two different roles of the log.
Google's fortune, in fact, is generated by a relevance pipeline built on (user) clicks and impressions, and those clicks and impressions are exactly this kind of event data.



The explosion of specialized data systems



These systems exist for things like:


    • Online analysis (OLAP)
    • Search
    • Simple online storage
    • Batch processing
    • Graph analysis
    • And so on (e.g. Spark)


Clearly, getting data into all of these systems makes data integration extremely hard.


2.3.3 Log-structured data flow


Every data source, in the logical sense, can be modeled as its own log.



A data source can be an application that logs events (clicks and page views) or a database table that accepts changes.



Each subscriber reads new records from the logs produced by these data sources as quickly as it can, applies them to its own local store, and advances its read offset in the log. A subscriber can be any kind of data system: a cache, Hadoop, another site's database, a search engine, and so on.



The log effectively provides a logical clock for data changes against which every subscriber's state can be measured. Each subscriber holds its own read offset in the log, independently of the others, and that offset acts like a point in time.






Consider a database with a set of cache servers:
the log gives you a way to synchronize all the caches and to reason about the point in time each one has reached.



Suppose we perform a write that lands at log position X. To read from a cache server without seeing stale data, we just need to make sure we do not read from that cache until it has replicated (synchronized) up to position X.
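A sketch of that read rule (my own toy code, with hypothetical names): the client remembers the log position of its write and only reads from a cache that has replicated at least that far.

    # Hypothetical cache node that applies the database changelog and tracks its offset.
    class CacheNode:
        def __init__(self):
            self.data = {}
            self.applied_offset = -1                 # highest log position applied so far

        def apply(self, offset, key, value):         # called as changelog entries arrive
            self.data[key] = value
            self.applied_offset = offset

    def read_no_staler_than(cache, key, min_offset):
        # Refuse to answer until the cache has caught up to the writer's log position.
        if cache.applied_offset < min_offset:
            raise RuntimeError("cache not caught up; retry or try another replica")
        return cache.data.get(key)

    cache = CacheNode()
    x = 41                                           # suppose our write landed at log position 41
    cache.apply(41, "profile:7", {"name": "Jay"})
    print(read_no_staler_than(cache, "profile:7", x))  # safe: cache has reached position x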



The log also acts as a buffer, letting producers and consumers operate asynchronously.



One of the most important reasons for supporting asynchrony is that a subscribing system may crash or go down for maintenance and then come back online; each subscriber must be able to consume data at its own pace.



A batch system such as Hadoop or a data warehouse may consume data only hourly or daily, while a real-time system usually consumes it within seconds.
Neither the data sources nor the log need to know anything about the subscribers, so subscribers can be added to and removed from the pipeline seamlessly.



More importantly, a subscriber only needs to know about the log; it needs no knowledge of the origin of the data it consumes, whether that is an RDBMS, Hadoop, or whatever K-V database is popular this week.



The article talks about "logs" rather than "messaging systems" because different messaging systems guarantee different things, and the phrase "messaging system" does not express the intended semantics precisely; a messaging system mostly implies indirect routing of messages.



A log, however, can be understood as a messaging system with durability guarantees and strong ordering semantics; in the distributed systems literature this is called atomic broadcast.


2.4 At LinkedIn


LinkedIn's current major systems include (note: 2013):


    • Search
    • Social Graph
    • Voldemort (K-V storage)
    • Espresso (document storage)
    • Recommendation engine
    • OLAP Query engine
    • Hadoop
    • Teradata
    • InGraphs (monitoring charts and metrics service)


Each system provides specialized advanced features in its specialized areas.



(This part of the original is long; Jay can really talk, so I'm only noting the key points!)



1) The concept of a data flow was introduced while building an abstraction layer over the Oracle database tables: a cache layer that also made it possible to build and keep up to date the indexes for search and the social graph.



2) To better support some of LinkedIn's recommendation algorithms, the team stood up a Hadoop cluster, but experience in that area was still thin, so there were plenty of detours.



3) The initial approach was simple and crude: just pull the data out of the Oracle data warehouse and throw it into Hadoop. The results: first, rapidly exporting data from the Oracle warehouse turned out to be a nightmare; second, and worse, some of the data in the warehouse had been processed incorrectly, so the Hadoop batch jobs did not produce the expected output, and batch jobs are usually hard to redo, especially once reports have already gone out.



4) Eventually the team abandoned pulling data out of the warehouse and went directly to the databases and source logs as data sources. Then another wheel was invented: a K-V store (Voldemort).



5) Even the unglamorous work of shuttling data around ate up an enormous amount of the team's time. Worse, the moment any point in the pipeline was wrong, Hadoop instantly became so much scrap firewood: running a brilliant algorithm on bad data has exactly one outcome, more bad data.



6) Even with fairly general abstractions in place, every new data source still needed its own specific configuration, and that configuration was the source of many errors and failures.



7) A flood of engineers wanted to get involved, and each of them had a pile of ideas: integrate this system, add that feature, or plug in some custom data source.



8) Jay began to realize:
First, the pipelines they had built, rough as they were, were extremely valuable. Merely making data available in a new system such as Hadoop unlocked a huge number of possibilities; computations that had been impractical became possible, and new products and analyses could be built simply by unlocking data trapped in other systems and putting it together.



Second, reliable data loading clearly needed more solid support. If all the structure (the schemas) could be captured, loading data into Hadoop could be fully automated: no extra work when new data sources were added and no manual fixes when schemas changed. Data would magically appear in HDFS, and Hive tables with the appropriate columns would be generated automatically for new data sources.



Third, data coverage was still far too low, because bringing up each new data source was so much work.



9) To solve the problem of loading data whenever a new data source was added, the team first tried this:



Soon they realized this would not work: data flow is inherently bidirectional (publish and subscribe, produce and consume), so connecting every system to every other system becomes an O(n^2) problem.
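A quick back-of-the-envelope check (mine, not from the article) on why that blows up: with point-to-point pipelines every producer has to be wired to every consumer, while a central log needs only one connection per system.

    producers, consumers = 10, 8
    point_to_point = producers * consumers      # a pipeline per pair: 80
    with_central_log = producers + consumers    # each system talks only to the log: 18
    print(point_to_point, with_central_log)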
What they needed instead was a model like this:



Each consumer should be isolated from the data sources; ideally, a consumer interacts with just one data repository, and that repository gives it access to everything.



10) Messaging system + log = Kafka. And so Kafka was born.


2.5 Logs, ETL, and the data warehouse

2.5.1 The data warehouse


1) A data warehouse is a clean, structured, integrated repository of data for analysis.
2) The idea is great, but the way the data gets there is a bit dated: periodically extract data from the source databases and munge it into a more readable form.
3) The problem with the traditional data warehouse: the clean data is highly coupled to the warehouse itself.



A data warehouse should be a set of query capabilities serving reporting, search, and ad hoc analysis: counting, aggregation, filtering, and so on. In that sense it is naturally a batch system.



But tightly coupling the clean data to such a batch system means the data cannot be consumed by real-time systems: search index building, real-time computation, real-time monitoring, and so on.


2.5.2 ETL


In Jay's opinion, ETL is nothing more than doing two things:



1) Extracting and cleaning data, unlocking it from the system where it was produced.
2) Restructuring the data so it can be queried from the warehouse, for example coercing types to fit a relational database, converting the schema to a star or snowflake schema, or breaking it into a column-oriented storage format.



Coupling these two things together is a big problem, because the integrated, clean data should also be consumable by other real-time systems: index-building systems, low-latency processing systems, and so on.



The data warehouse team is responsible for collecting and cleaning the data, yet the producers of that data often emit data that is hard to extract and clean, precisely because they have no clear view of the warehouse team's processing requirements.
And because the core business teams have little awareness of staying in step with the rest of the company, data coverage stays low, the data flow stays fragile, and it is hard to respond quickly to change.



So, the better way is:






If you want to search over a clean dataset, monitor trends in real time, or do real-time alerting, the traditional data warehouse or a Hadoop cluster is the wrong infrastructure. Worse, the ETL data-loading system built for the warehouse is useless to those other (real-time) systems.



The better model is for the data to be cleaned by its publisher before it is published, because only the publisher knows exactly what their data means. Everything done at this stage should be lossless and reversible.



Any semantically rich, value-adding real-time transformation should happen as post-processing after the raw log feed is published: sessionizing event data, adding fields of broad interest, and so on. The original log remains usable on its own, and this real-time processing derives new, enriched logs from it.



Finally, only aggregation specific to the destination system should be part of the load step, for example converting to a star or snowflake schema for analysis and reporting in the warehouse. Because this stage, which is what traditional ETL used to do, now sits on top of a clean and canonical stream of data (the log), it becomes very simple.


2.6 Log files and events


A log-centric architecture has the added benefit of making it easy to build decoupled, event-driven systems.



The traditional way to capture user activity and system changes is to write them to text logs and then scrape those into a data warehouse or Hadoop cluster for aggregation and processing. The problem is the same one described above for data warehouses and ETL: the data is tightly coupled to the warehouse.



At LinkedIn, an event data processing system was built on Kafka, with hundreds of event types defined for all kinds of actions: page views, ad impressions, searches, service invocations, application exceptions, and so on.



To see what this event-driven style buys you, look at a simple example:
the job postings page. The page should only be responsible for displaying a job posting and should not carry much other logic. Yet on a large enough site, that page easily accumulates more and more logic that has nothing to do with displaying the posting.



For example, we end up needing to integrate all of the following:
1) We need to send the data to Hadoop and the data warehouse for offline processing.
2) We need to count the views, and make sure the viewer isn't scraping the site or something of that sort.
3) We need to aggregate the views of this posting and show them on the job poster's analytics page.
4) We need to record the user's view history so we can keep showing this user valuable, well-chosen postings instead of the same one over and over (think of those trashy game ads that keep shoving the same flashy banner at you no matter how many times you ignore it).
5) Our recommendation system needs the view history so it can track the popularity of this posting correctly.



Pretty soon, the logic of a page that merely displays a job becomes complicated. Add a mobile version of the page and the logic has to be ported over, adding more complexity. Worse still, the engineer who owns the page has to understand all those other systems just to make sure they are wired up correctly.



And this is a very simple example; real cases are only messier.
Event-driven design makes all of this simple.



The page that displays the job only needs to display it and record a few facts about the display: the job's attributes, who viewed it, and other useful details of the impression. It needs no knowledge of the other systems, the recommendation system, the security system, the job poster's analytics, the data warehouse; all of those are just subscribers that subscribe to the event and do their own processing independently. The display page never has to change when a new subscriber or consumer comes along.


2.7 Building a scalable log


Separating publishers from subscribers is nothing new. What is hard is building a log system that lets many subscribers process messages in real time while still scaling.



If the log cannot be built to be fast, cheap, and scalable, then everything built on top of it is off the table.



Many people assume that a distributed log must be slow and heavyweight plumbing, fit only for metadata-style workloads of the kind ZooKeeper handles.



But LinkedIn (as of 2013) writes 60 billion distinct messages a day through Kafka (several hundred billion if you count mirroring between data centers).



So how did they do it?



1) Partitioning the log
2) Optimizing throughput with batched reads and writes
3) Avoiding needless data copies



Scalability comes from splitting the log into multiple partitions:






1) Each partition is an ordered log, but there is no global order between partitions.



2) Which partition a message goes to is entirely up to the writer, typically by partitioning on some kind of key (such as a user_id).



3) Partitioning lets log appends happen with no coordination between shards, and lets system throughput scale linearly with the size of the Kafka cluster.



4) Although there is no global order (and with thousands of consumers, a total global order is rarely worth much anyway), Kafka guarantees that, within a partition, messages are delivered to the consumer in exactly the order the sender wrote them. A toy sketch of key-based partitioning appears at the end of this section.



5) Each partition is replicated across a configurable number of nodes; if the leader dies, one of the replicas takes over as the new leader.



6) A log, like a filesystem, is easy to optimize for linear read and write patterns: many small reads and writes can be grouped into larger, high-throughput operations. Kafka does this aggressively. Batching happens everywhere: when clients send data to the server, when data is written to disk, when data is replicated between servers, when data is handed to consumers, and when commits are acknowledged.



7) Finally, Kafka uses one simple binary format for the log in memory, on disk, and over the network, which enables optimizations such as zero-copy data transfer.



Taken together, these optimizations let you read and write data at roughly the rate the disk or network can sustain, even when the data set no longer fits in memory.
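Here is the toy sketch of key-based partitioning promised above (my own illustration, not Kafka's actual partitioner): the writer picks the partition from the key, so all messages for a given key land in one partition in send order.

    import hashlib

    NUM_PARTITIONS = 4
    partitions = [[] for _ in range(NUM_PARTITIONS)]   # each partition is its own ordered log

    def partition_for(key):
        # Deterministic hash, so every message for a given key goes to the same partition.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    def send(key, message):
        p = partition_for(key)
        partitions[p].append(message)                  # appended in send order
        return p

    send("user_42", "login")
    send("user_42", "view_job")
    send("user_7", "search")

    # user_42's events sit in a single partition, in exactly the order they were sent;
    # there is no ordering guarantee across partitions.
    for i, p in enumerate(partitions):
        print(i, p)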


2.8 Logs and real-time processing


Did you think Jay went to all this trouble just to build a prettier way of copying data around?
You! Are! Wrong!



A log is just another word for a stream, and logs are at the heart of stream processing.


2.8.1 What is stream processing


In Jay's view:
1) Stream processing is infrastructure for continuous data processing.
2) Its computational model can be as general as MapReduce or other distributed processing frameworks, just with low latency.
3) Batch-style data collection leads to batch-style data processing.
4) Continuous data collection leads to continuous processing.
5) Jay uses the U.S. census as his example of batch-style collection.



At LinkedIn, both activity data and database changes are continuous.
Processing them in daily batches is just continuous computation with a window set to one day.



So, stream processing is processing that:
6) works with a built-in notion of time and does not need a static snapshot of the data, so it can emit output at a user-controlled frequency instead of waiting for the dataset to reach some "end";
7) in this sense is a generalization of batch processing, and given the prevalence of real-time data, an extremely important generalization;
8) has eluded many commercial companies, usually not because of the processing engine itself but because they could never get real-time stream data collection in place;
9) bridges the gap between real-time request/response services and offline batch infrastructure.
The log solves the most important problems of streaming, above all how to make data available in real time to multiple subscribers (i.e. stream data collection).


2.9 Data flow graphs


The most interesting thing about stream processing is that it broadens the idea of what a data source (a feed) is.
The primary feeds, logs of events or row-by-row records, come from application activity.
But stream processing also lets us process feeds derived from other feeds; to a consumer, a derived feed looks no different, and such derived feeds can encapsulate arbitrary complexity.






A stream processing job should look like this: read data from logs, write output back to logs or to other systems.



Using logs for input and output connects these jobs to one another and to other processing stages, forming a graph of processing.
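A minimal sketch of such a job (my own illustration): consume one log, transform each record, and append the result to an output log that the next job in the graph can consume.

    # A toy stream job: input log -> transform -> output log.
    def run_job(input_log, output_log, transform, offset=0):
        # Process everything currently in the input log, starting at `offset`,
        # and return the new offset so the job can resume where it left off.
        while offset < len(input_log):
            output_log.append(transform(input_log[offset]))
            offset += 1
        return offset

    clicks = [{"user": "alice", "page": "/jobs"}, {"user": "bob", "page": "/jobs"}]
    enriched = []                                        # output log, input to the next job
    offset = run_job(clicks, enriched, lambda r: {**r, "site": "linkedin"})
    print(enriched, offset)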






A stream processor does not have to be anything fancy: it can be a single process or a set of processes. But some extra infrastructure and support helps in managing the processing code.



Bringing logs into the picture serves two purposes:



1) It makes each dataset multi-subscriber and ordered.
2) It acts as a buffer between stages. This matters because the processing is asynchronous: if an upstream producer runs faster than a downstream consumer can keep up with, you must either block the producer, buffer the data, or drop it.
Dropping data is rarely acceptable, and blocking would stall every job in the data flow graph. The log, being a very, very large buffer, lets a job be restarted and lets one job fail without disturbing the other jobs in the graph. This is critical when scaling data flows across a large organization: different teams own different jobs, and the whole flow cannot be allowed to grind to a halt because one job has a bug.



Storm and Samza are exactly this kind of stream processing engine, and both can use Kafka or a similar system as their log.



(Note: Jay is quite the force. Before this there was Kafka; after it, Samza.)


2.10 Stateful real-time processing


Many stream processing engines are stateless and handle one record at a time, but many use cases need complex counts, aggregations, and joins over windows of the stream.
For example, joining user information onto a click stream.



Such use cases need state: wherever the data is processed, some state about that data has to be maintained.



The question is how to keep that state correct when the processor can fail.



Keeping the state in memory is the simplest option, but it does not survive a crash.



If state is only kept over a time window, a failed processor can simply replay from the start of the window, but if the window is an hour long that may not be practical.



Another easy option is to keep the state in a remote storage system or database, but that destroys data locality and generates a lot of network round-trips.



Recall the table/log duality discussed above.
A stream processing component can keep its state in a local store or index, such as:


    • BDB
    • LevelDB
    • Lucene
    • FastBit


and record a changelog of the updates made to that local index, so the state can be restored after a crash. In fact this generalizes nicely: the state can be kept in any kind of index, co-partitioned with the input stream.



When the processor crashes, it restores its index by replaying the changelog; the log here acts as the mechanism that turns local state into a kind of incremental, change-by-change backup over time.
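A sketch of that recovery path (my own toy code): every update to the local state is also appended to a changelog, and after a crash the state is rebuilt by replaying it.

    # Local state backed by a changelog; replaying the changelog restores the state.
    class StatefulProcessor:
        def __init__(self, changelog):
            self.changelog = changelog
            self.counts = {}                             # local "index": per-user event counts
            for user, count in changelog:                # recovery: replay the changelog
                self.counts[user] = count

        def process(self, event):
            user = event["user"]
            self.counts[user] = self.counts.get(user, 0) + 1
            self.changelog.append((user, self.counts[user]))   # record the state change

    changelog = []
    p1 = StatefulProcessor(changelog)
    p1.process({"user": "alice"})
    p1.process({"user": "alice"})

    p2 = StatefulProcessor(changelog)                    # a restarted processor recovers from the log
    assert p2.counts == {"alice": 2}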



This mechanism also provides an elegant property: the processor's state is itself recorded as a log, and other processors can naturally subscribe to it.



Combined with the change logs coming out of databases, this becomes very powerful for data integration:


by extracting the changelog from a database and indexing it inside various stream processors, it becomes possible to join it against different event streams.

2.11 Log compaction


Obviously, a log cannot keep the complete history of state changes forever.



Kafka handles this with log compaction, i.e. log garbage collection:



1) For event data, Kafka keeps only a window of data (configurable by time, e.g. a few days, or by size).
2) For keyed updates, Kafka uses compaction. A log of this kind can be replayed into another system to reconstruct the state of the source system.



If you kept the complete log forever, the data would grow without bound over time.
Kafka does not simply throw away old log entries; it compacts the log by discarding obsolete records, that is, records whose primary key has a more recent update.
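A sketch of compaction for keyed updates (my own illustration, not Kafka's implementation): only the most recent record for each key survives, and replaying the compacted log yields the same final state as replaying the full one.

    # Compact a keyed log: keep only the latest entry per key, preserving order.
    def compact(log):
        latest = {}
        for offset, (key, value) in enumerate(log):
            latest[key] = offset                         # later offsets win
        keep = set(latest.values())
        return [entry for offset, entry in enumerate(log) if offset in keep]

    log = [("user:1", "a"), ("user:2", "b"), ("user:1", "a2"), ("user:1", "a3")]
    print(compact(log))                                  # [('user:2', 'b'), ('user:1', 'a3')]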





2.12 System building

2.12.1 The distributed system


The log plays the same role in data flow systems and data integration as it does inside a distributed database:


    • Abstracting the data flow
    • Maintaining data consistency
    • Providing data recovery


You can think of all the applications and data flows across your organization as one single distributed database.
The individual query-oriented systems (Redis, Solr, Hive tables, and so on) are just particular indexes over the data.
Stream processing systems such as Storm and Samza are just a very well-developed trigger or materialized-view mechanism.



The explosion of different data systems may look like new complexity, but that complexity has always existed.
Even in the heyday of the relational database, a company or organization would run lots and lots of relational databases.



Clearly you cannot throw everything into one Hadoop cluster and expect it to solve every problem. A better way of building systems might look like this:


Build the overall system out of many components, each of which is a small cluster; an individual cluster may not provide security, performance isolation, or great scalability across the whole organization, but each specific problem gets solved (by a specialist).


Jay believes the proliferation of systems is caused by the difficulty of building powerful distributed systems: when a use case is narrowed down to something simple, such as queries, each specialized system has more than enough capability, but integrating all of those systems is hard.



Jay sees three possible directions for system building in the future:



1) Keep the status quo of many separate systems. Then data integration remains the biggest problem, and an external log system matters enormously (Kafka!).
2) A single all-powerful system re-emerges, as the relational database once was in its heyday, and solves every problem. This seems rather unlikely to happen.
3) Most of the new generation of systems are open source, which points to a third possibility: data infrastructure gets decomposed into a set of services plus application-facing system APIs. Each service does its own job; none is complete on its own, but each solves its specific problem well. You can already see hints of this in the existing Java stack:


    • ZooKeeper: solves distributed coordination and synchronization (perhaps with help from higher-level abstractions such as Helix or Curator).
    • Mesos, YARN: solve virtualization and resource management.
    • Embedded components such as Lucene and LevelDB: solve indexing.
    • Netty, Jetty, and the higher-level abstractions Finagle and rest.li: solve remote communication.
    • Avro, Protocol Buffers, Thrift, and umpteen zillion others: solve serialization.
    • Kafka, BookKeeper: provide the backing log.


Seen this way, building a distributed system starts to look like playing with Lego bricks. This obviously matters little to end users, who care mostly about the API, but it points to a way of building powerful systems while keeping them simple:
if the time to stand up a distributed system drops from years to weeks, the pressure to build one single enormous system disappears, because more reliable and flexible "bricks" are now available.


2.12.2 The log's place in system building


If an external log system is available, each individual system can shed a great deal of its own complexity by leaning on a shared log. Jay sees the log's role as:



1) Handling data consistency (both eventual and immediate) by sequencing concurrent updates to nodes.



2) Providing data replication between nodes.



3) Providing "commit" semantics: a write is acknowledged only once it is guaranteed not to be lost.



4) Providing a subscribable data feed for external systems.



5) Providing the ability to recover a node that has lost data to a failure, or to bootstrap a new replica.



6) Handling rebalancing of data between nodes.



That covers most of what a complete distributed data system has to do (Jay really does love the log!). What remains is the client-facing API and how indexes are built; for example, a full-text index needs to pull from every partition, while a lookup by primary key only needs the data from one partition.



(And that really is all that's left. Mighty Jay!)



The system can be divided into two logical components (what a powerful way of carving it up):



1) The log layer
2) The serving layer



The log captures state changes in a serialized, ordered fashion, while the serving layer stores whatever indexes are needed to answer queries: a K-V store might need B-tree or SSTable indexes, a search service needs an inverted index.



Writes can go straight to the log, or be proxied through the serving layer. Writing to the log yields a logical timestamp (the log index), for example a numeric ID. If the system is partitioned, the serving layer and the log layer use the same partitions (though they may run on different numbers of machines).






The serving layer subscribes to the log and applies the writes to its own local index, as fast as it can, in the order the log stores them.



This gives clients read-your-writes semantics:


the client remembers the (logical) timestamp of its last write and attaches it to its queries. A serving node that receives the query compares that timestamp with the position it has applied locally and, if necessary, delays the request until it has caught up to that timestamp, so that it never hands back stale data.


The serving nodes may or may not need any notion of a leader. In many simple cases they need no leader at all, because the log is the source of truth.



Recovering a failed node is another question. One option is to keep a fixed time window of data in the log plus a snapshot of the state. Another is to let the log keep a complete copy of the data and use log compaction as the log's own garbage collection. This moves a lot of complexity out of the serving layer, which is system-specific, and into the log layer, which can be shared by everything.



On top of the log you can offer a full developer API: the log becomes an ETL data source for other systems, and something those systems can subscribe to.



The full stack:






Clearly, a distributed system built around a log is immediately also a provider of data loads and data streams to other systems. In the same way, a stream processor can consume multiple input streams and serve the results externally through another system that indexes its output.



Building the system out of a log layer and a serving layer decouples the query-related concerns from the system's availability and consistency concerns.



Many people will feel that keeping a separate copy of the data in the log, especially a complete copy, is wasteful and extravagant. In practice it is not:



1) LinkedIn's production Kafka clusters (as of 2013) hold about 75 TB per data center, while the application clusters they feed need far more storage and far pricier hardware (SSDs and more memory) than the Kafka clusters do.
2) A full-text search index really wants to fit in memory, while a log is all linear reads and writes and can live happily on cheap high-capacity disks.
3) Because the Kafka cluster runs in multi-subscriber mode, with many systems consuming the data, the cost of the log cluster is amortized across all of them.
4) For all these reasons, the overhead of an external log system (Kafka or something like it) ends up being very small.



2.13 Conclusion
At the end, Jay leaves a long list of academically and practically valuable papers and reference links, and humbly closes with:



If you made it this far you know most of what I know about logs.



End.



Copyright notice: this is the blogger's original article; please do not reproduce it without permission.




