[Reprint] Elements of Scale: The Composition and Expansion of the Data Platform

Source: Internet
Author: User
Tags: cassandra, riak, hadoop ecosystem

Original: http://mp.weixin.qq.com/s?__biz=MzA5ODM5MDU3MA==&mid=211277835&idx=1&sn= 45631d0f416b7bc9b8bfc2aad76f85cd&scene=1&key= 0acd51d81cb052bc5897df08137e6dc7203f2168bded4fa93b0589d0763c767a5856b761e9f52ce5007a9d2566cebbf2&ascene=0 &uin=mjk1odmyntyymg%3d%3d&devicetype=imac+macbookpro9%2c2+osx+osx+10.10.4+build (14E46) &version= 11020113&pass_ticket=6x6xap05q6gt38m0drlejecvkvzrtmxdwbalwdetuyc524aerunbhf%2f1bf6jmepd

As software engineers we are inevitably shaped by the tools around us: the languages, the frameworks, and even the processes we use all influence the software we build.

Databases are no exception. They are designed around particular approaches, and those approaches inevitably shape how we handle the mutable, shared state of our applications.

Over the past decade or so we have explored many different approaches to this problem. Small open source projects, each built around a different idea, have grown and borrowed heavily from one another. Platforms integrate these tools, and each component typically optimizes for some property of the underlying hardware or system. The result is that no single tool solves every problem; each is either too cumbersome for the job or confined to one specific part of it.

So today's data platforms are diverse, ranging from simple caching tiers and polyglot persistence layers to fully integrated data pipelines, with multiple solutions targeting a variety of specific needs. In some areas they perform very well indeed.

The purpose of this talk, then, is to explain how some of these popular approaches work and why they work that way. We will start by considering the basic elements that make them up, so that later we can reason about them as a whole.

From an abstract point of view, when we process data we are really dealing with locality: locality to the CPU, and locality to the other data we need. Fetching data in an orderly, sequential manner matters because computers are very good at serialized operations that are predictable.

(Translator's note: locality underpins predictive behaviour in a computer; it improves performance through caching, in-memory prefetch instructions, processor pipeline branch prediction, and so on. See Operating Systems: Internals and Design Principles.)

If data is fetched from the hard disk sequentially, it is prefetched into the disk cache, the page cache, and the various levels of CPU cache, which can improve performance enormously. But prefetching means little for random access, whether the data lives in main memory, on disk, or across the network. In fact, prefetching actively hurts random workloads: the caches and the front-side bus fill up with data that is unlikely to be useful.

Hard disks are generally considered slow and main memory fast. That intuition is not always right: random and sequential access to main memory can differ by one or two orders of magnitude, and managed (garbage-collected) languages tend to make things worse still.

Streaming data from disk can actually outperform random access to main memory; the disk is perhaps not the tortoise we imagine, at least not when read sequentially. Solid-state drives (SSDs), especially those on PCIe interfaces, complicate the picture because they exhibit different tradeoffs. But the advantage of sequential access over random access applies to them as well.

(Translator's note: a data stream is a sequence of data, potentially unbounded, that arrives in order and is read once, or at most a small number of times.)

Let's say we are going to build a simple database, starting from the most basic component: a file.

If we keep reads and writes sequential, the file behaves well on the hardware. We can append written data to the end of the file and read it back by scanning the whole file. Any processing we want can happen as the data streams past the CPU: filtering, aggregating, even more complex operations. In short, perfect.
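
To make the idea concrete, here is a minimal Python sketch of such an append-only file, with a streaming filter and aggregation over it; the file name and record fields are illustrative, not from the original text.

```python
# A minimal sketch of the append-only file idea, assuming newline-delimited
# JSON records; the file name and record fields are illustrative only.
import json

def append_record(path, record):
    # Writes always go to the end of the file, keeping IO sequential.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def scan(path, predicate):
    # Reads stream the whole file through the CPU; filtering, aggregation,
    # or any other processing happens as the data flows past.
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

append_record("orders.log", {"customer": "Bob", "amount": 42})
total = sum(r["amount"] for r in scan("orders.log", lambda r: r["customer"] == "Bob"))
```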

But what happens when the data changes, when we need to update a value?

We have a couple of options. We could update the value in place. That requires fixed-width fields, which is fine for this shallow thought experiment. But updating in place means random IO, which, as we have seen, hurts performance.

The alternative is to append the updated value to the end of the file and deal with the superseded, stale values when we read.

So our first design decision: appending to a "journal" or "log" at the end of a file keeps access sequential and therefore fast. By contrast, if we update data in place on a spinning disk we manage only around 300 writes per second, assuming each update must actually be flushed to the underlying media.
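
A small sketch of "update by appending": new values for a key go to the end of the log, and the reader simply treats the last occurrence it sees as current. The key/value file layout is an assumption made for brevity.

```python
# A sketch of 'update by appending': new values for a key are written to the
# end of the log, and the reader treats the last occurrence as current.
def put(path, key, value):
    with open(path, "a") as f:
        f.write(f"{key}\t{value}\n")          # sequential write, never in place

def get(path, key):
    latest = None
    with open(path) as f:                      # full sequential scan
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            if k == key:
                latest = v                     # later entries supersede earlier ones
    return latest

put("kv.log", "bob", "balance=10")
put("kv.log", "bob", "balance=25")             # the 'update' is just another append
assert get("kv.log", "bob") == "balance=25"
```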

Reading the whole file is slow, though: to read back even a gigabyte of data, the best hard disks take a few seconds, and that is exactly the price a database pays for a full table scan.

We usually only want some specific data, say the customer named "Bob", and scanning the entire file for that is clearly inappropriate. We need an index.

There are many kinds of index we could use. The simplest is a fixed-width, ordered array of values, the customer names in this example, together with their offsets into the heap file. An ordered array can be searched with binary search. We could equally use a tree structure, a bitmap index, a hash index, a dictionary index, and so on. Pictured here is the structure of a tree.
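
As a rough illustration of that simplest index, the sketch below keeps a sorted array of (name, offset) pairs in memory and binary-searches it to find a record's position in the heap file; the record format is invented for the example.

```python
import bisect

# Build a tiny heap file and a sorted (key, offset) index over it.
records = {"alice": b"alice,acme\n", "bob": b"bob,initech\n", "carol": b"carol,hooli\n"}
index = []
with open("customers.heap", "wb") as heap:
    for name, row in sorted(records.items()):
        index.append((name, heap.tell()))    # remember where each record starts
        heap.write(row)

keys = [k for k, _ in index]

def lookup(name):
    i = bisect.bisect_left(keys, name)       # binary search over the sorted index
    if i < len(keys) and keys[i] == name:
        with open("customers.heap", "rb") as heap:
            heap.seek(index[i][1])           # one random read into the heap file
            return heap.readline()
    return None

assert lookup("bob") == b"bob,initech\n"
```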

An index adds an overarching structure over the data; its values are ordered so that we can get to the data we want quickly. The problem is that maintaining this structure requires random writes as data arrives. We lose our ideal, write-optimized, append-only file: writes now have to jump around the file system, and that slows everything down.

Anyone who has added lots of indexes to a database table will be familiar with this problem. On a mechanical disk, maintaining the integrity of an index in this way can slow writes down by a factor of roughly 1,000.

Fortunately there are several solutions. We will discuss three here; they are somewhat extreme examples, and the real world is rather less tidy, but the concepts are especially useful when thinking about mass storage.

Translator's note, the three approaches are:

    • First: memory-mapped files

    • Second: a smaller collection of indexes, optimized with a meta-index or a Bloom filter

    • Third: brute-force scanning, also known as column-oriented storage

The first option is to keep the index in main memory: the random-write problem is confined to RAM, while the heap file stays on disk.

This is simple but effective, and it solves our random-write problem. The approach is used in many databases; MongoDB, Cassandra, Riak and others apply this kind of optimization, often via memory-mapped files.

(Translator's note: a memory-mapped file is a segment of virtual memory with a direct byte-for-byte correspondence to a file, or to part of a file-like resource. Each position in the file has a corresponding address in memory, so the file can be read and written directly through pointers rather than read/write calls or IO streams, which can significantly improve IO performance when working with large files.)
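
Below is a small illustrative sketch of reading a file through a memory map, in the spirit of the translator's note; the file name, the fixed record width, and the use of Python's mmap module are assumptions of the example, not details of the databases mentioned above.

```python
# A sketch of reading a file through a memory map. Fixed-width records are an
# assumption made so that offsets can be computed with simple arithmetic.
import mmap

RECORD = 16                                    # 16 bytes per record

with open("table.dat", "wb") as f:
    for i in range(1000):
        f.write(f"{i:015d}\n".encode())        # 15 digits + newline = 16 bytes

with open("table.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)              # map the whole file into memory
    # Record n lives at byte offset n * RECORD; the OS pages it in on demand,
    # so reads look like slicing memory rather than issuing read() calls.
    row = mm[42 * RECORD : 43 * RECORD]
    assert row == b"000000000000042\n"
    mm.close()
```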

This strategy breaks down if the data far exceeds main memory. It is especially conspicuous with very large numbers of small objects: the index grows huge and eventually outstrips the available RAM. For most use cases this is not a problem, but for truly huge data volumes it becomes a burden.

A popular way around it is to throw away the single "overarching" index and instead use a collection of relatively small ones.

The idea is simple: as data comes in, we batch it up in memory. Once we have enough, say a few megabytes, we sort it and write it out to disk as its own small, immutable index file. What we end up with is a chronological collection of small, immutable index files.
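
A toy sketch of that idea follows: writes are buffered in a small in-memory table and flushed, sorted, into their own immutable files; the flush threshold and file naming are purely illustrative.

```python
# Buffer writes in memory; when the buffer fills, sort it and flush it out
# as a new immutable file. Thresholds and file names are illustrative only.
import json

MEMTABLE_LIMIT = 4                      # tiny threshold so the example flushes
memtable, flushed_files = {}, []

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    # Writing the sorted buffer is one sequential, append-style operation.
    name = f"sstable-{len(flushed_files):04d}.json"
    with open(name, "w") as f:
        json.dump(dict(sorted(memtable.items())), f)
    flushed_files.append(name)
    memtable.clear()

def get(key):
    if key in memtable:                  # newest data first
        return memtable[key]
    for name in reversed(flushed_files): # then each small index, newest first
        with open(name) as f:
            table = json.load(f)
        if key in table:
            return table[key]
    return None

for i in range(10):
    put(f"user{i}", {"id": i})
assert get("user3") == {"id": 3}
```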

So what have we gained? These immutable sets of files are streamed out sequentially, so they are quick to write, and, most importantly, we never have to load the whole index into memory. That's great!

Of course there is a downside: reads now have to consult very many small indexes. We have turned the random-write (random IO) problem into a random-read problem. But this is a good trade, because random reads are much easier to optimize than random writes.

Keeping a small meta-index in memory, or using a Bloom filter, gives us a low-memory way of deciding during a read whether an individual index file needs to be consulted at all. So we keep fast, sequential writes, and read performance can approach that of a single overarching index.
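
Here is a minimal Bloom filter sketch showing why it helps: a tiny in-memory structure that can answer "definitely not in this file" and so lets a read skip most of the small index files. The sizes and hashing scheme are illustrative, not tuned.

```python
# A minimal Bloom filter: add() sets a few bits per key; might_contain() can
# return a false positive, but never a false negative.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = 0                              # bit set stored as a big int

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("bob")
assert bf.might_contain("bob")
# Keys never added are almost always rejected, so that index file is skipped.
```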

In practice we also occasionally need to compact away superseded updates, but the point stands: reads and writes remain nicely sequential.

The structure we have built is called a log-structured merge tree (LSM tree). It is used in big data tools such as HBase, Cassandra, and Google BigTable, and it balances write and read performance with comparatively small memory overhead.

So we can store the index in memory, or we can sidestep the "random-write penalty" with a write-optimized index structure such as the log-structured merge tree. There is a third option: pure brute force.

Go back to the file from the start of the example and read it in full, but organize the data within it differently. The brute-force approach stores data by column rather than by row, and is known as column orientation.

Note that there is an unfortunate naming clash between true columnar storage and the "big table" pattern that follows it. Although they share some similarities, in practice they are different things, and it is wise to treat them as such.

Column orientation is a simple idea. Instead of storing data row by row, we split each row by column and append each value to the end of a separate file, one file per column. When we query, we read back only the columns we actually need.

We must make sure the files stay in step, that is, the nth entry of every column file belongs to the same row, sharing the same position or offset. This matters because a single query will usually need to read several columns. It means joining columns back together is very fast: if all the columns preserve the same order, we can do it in a tight loop with excellent cache and CPU utilization. Many implementations make heavy use of vectorisation to further improve the throughput of simple join and filter operations.
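
The following sketch shows the column-per-file layout in miniature: each column is appended to its own file, the nth entry of every file belongs to row n, and a query touches only the columns it needs. File and column names are invented for the example.

```python
# Column-per-file storage: one append-only file per column, rows aligned by
# position across files. Names and fields are illustrative.
COLUMNS = ("customer", "product", "amount")

def append_row(row):
    for col in COLUMNS:
        with open(f"orders.{col}.col", "a") as f:
            f.write(f"{row[col]}\n")              # append-only, per column

def scan_column(col):
    with open(f"orders.{col}.col") as f:
        return [line.rstrip("\n") for line in f]

append_row({"customer": "bob", "product": "book", "amount": 12})
append_row({"customer": "alice", "product": "pen", "amount": 3})

# Summing 'amount' reads one small file; 'customer' and 'product' stay untouched.
total = sum(int(v) for v in scan_column("amount"))

# Because the files share row order, re-assembling rows is a positional zip.
rows = list(zip(scan_column("customer"), scan_column("amount")))
```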

Writes benefit from being append-only at the end of each file. The downside is that there are now many files to update: every column of every record must be written separately. The most common solution is to batch writes using an LSM-style approach. Many columnar databases also impose an overall ordering on the table to improve read performance.

Splitting data by column dramatically reduces the amount of data that has to be read from disk, as long as each query touches only a subset of the columns.

In addition, data in a single column generally compresses well, because the values in a column tend to resemble one another, especially once we know the column's type. This means we can use efficient, low-cost encodings such as run-length encoding, delta encoding, and bit-packing. For some encodings, predicates can be applied directly to the compressed stream.
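
As a concrete example of one such encoding, here is a sketch of run-length encoding for a low-cardinality column, including a predicate applied directly to the compressed runs; the column values are illustrative.

```python
# Run-length encoding of a column, plus a predicate evaluated on the runs
# themselves: one comparison per run instead of one per row.
from itertools import groupby

def rle_encode(values):
    # ["gold", "gold", "gold", "silver"] becomes [("gold", 3), ("silver", 1)]
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_count(encoded, predicate):
    # Applies the predicate to the compressed form directly.
    return sum(n for v, n in encoded if predicate(v))

column = ["gold"] * 3 + ["silver"] * 2 + ["gold"]
encoded = rle_encode(column)            # [("gold", 3), ("silver", 2), ("gold", 1)]
assert rle_count(encoded, lambda v: v == "gold") == 4
```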

The brute-force approach is particularly suited to large scans: aggregate functions such as average, maximum, minimum, and group-by are typical of this kind of workload.

This differs from the "heap file and index" approach described earlier, and it is worth asking yourself: what is the difference between columns like these and a "heap plus index" where every field is indexed?

The key is the ordering of the index files. A B-tree (multiway search tree) is ordered by the field it indexes, so joining data across two indexes means streaming through one side while doing a random lookup into the second index for each entry on the other. Joining two columns that share the same ordering generally beats joining two B-tree indexes: once again we have made the access sequential.

(Translator's note: the conclusion is that joins via B-trees perform worse than joins over two column files that share the same order.)

Each of these techniques is best used as a component of a data platform, improving one core function for one specific kind of workload.

The simple model of an in-memory index over a heap file works well and is used by a bundle of NoSQL databases, Riak, Couchbase, and MongoDB among them, and even by some relational databases.

Tools designed for truly massive datasets tend to prefer the LSM approach, which gives them fast ingestion along with read performance grounded in disk-based structures. HBase, Cassandra, RocksDB, LevelDB, and now even MongoDB support this approach.

Column-per-file engines are common in massively parallel processing (MPP) databases such as Redshift or Vertica, and in Parquet within the Hadoop stack. These engines are built for workloads that require large traversals of data; bulk scanning is their defining trait.

Finally, consider Kafka, which adopts a simple, hardware-sympathetic, and efficient messaging model. Messages are simply appended to the end of a file and read from a chosen offset. You can re-read messages from an earlier offset, or pick up from the offset where you last stopped. It is all nicely sequential IO.
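
A toy, in-memory sketch of that log abstraction, append at the end, read forward from an offset the consumer controls, is shown below; it mimics the shape of the idea only and is not the Kafka API.

```python
# An in-memory toy log: producers append, consumers read from offsets they
# choose and can rewind simply by passing an earlier offset.
class Log:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1           # offset of the new message

    def read(self, offset, max_msgs=10):
        # Sequential read starting at a caller-chosen offset.
        return self.messages[offset:offset + max_msgs]

log = Log()
for i in range(5):
    log.append(f"event-{i}")
assert log.read(0, 2) == ["event-0", "event-1"]   # from the beginning
assert log.read(3) == ["event-3", "event-4"]      # or from any saved position
```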

This differs from most message-oriented middleware. The JMS (Java Message Service) and AMQP (Advanced Message Queuing Protocol) specifications require additional indexes to manage selectors and session state, which means they end up behaving more like a database than a file. Jim Gray made the point famously back in 1995: Queues are Databases.

So all of these approaches involve tradeoffs of this kind, keeping things simple and hardware-sympathetic as a route to scaling out.

We have now looked at some of the core approaches used in storage engines, admittedly only in brief; the real world is considerably more complicated, but the concepts are genuinely useful. A distributed data platform is more than a storage engine, though: it also has to consider parallelism.

To distribute data across many machines we have two core tools: partitioning and replication. Partitioning, sometimes called sharding, works well both for random reads and for brute-force workloads.

With a hash-based partitioning model, a hash function is used to spread the data across a set of machines (translator's note: ideally, evenly). Much as in a hash table, each bucket corresponds to a machine in the cluster.

To read a piece of data, we run its key through the same hash function and go directly to the machine that holds it. This is a classic distributed pattern, and the only one whose throughput scales linearly as client requests increase (translator's note: put simply, the load is spread evenly). Each request is isolated to, and served by, a single machine in the cluster.
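
A minimal sketch of that hash-based routing: the hash of the key names the owning node, so each read or write touches exactly one machine. The node list and key format are assumptions of the example.

```python
# Hash-based partitioning: the key's hash picks the node that owns the data.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]            # the same key always maps to the same node

# Reads and writes for a given key are isolated to a single machine...
assert owner("customer:bob") == owner("customer:bob")
# ...so throughput grows roughly linearly as clients and nodes are added.
```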

Partitioning also gives us parallel batch computation, for aggregate functions or more complex algorithms such as clustering or machine learning. The key difference is that the request is broadcast to all machines at once, and this divide-and-conquer strategy lets large computations finish in a very short time.

Batch systems of this kind handle large problems well, but they offer little concurrency while executing, and they tend to exhaust the cluster's resources.

These are the two extremes, and both are particularly simple: directed access to a single machine at one end, broadcast divide-and-conquer at the other. The interesting part is the middle ground between them, best exemplified by the secondary indexes that NoSQL stores spread across many machines.

A secondary index differs from a primary-key index in that the data is not partitioned by the indexed value, so we cannot route directly via the hash function; the request must be broadcast to all machines. This limits concurrency: every node is involved in every request.
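
The contrast can be sketched very simply: a primary-key read is routed to one node, while a secondary-index read is broadcast to every node and the results merged. The node contents below are invented for illustration.

```python
# Primary-key reads touch one node; secondary-index reads touch them all.
nodes = [
    {"0": {"name": "bob",   "city": "york"}},
    {"1": {"name": "alice", "city": "leeds"}},
    {"2": {"name": "carol", "city": "york"}},
]

def get_by_primary_key(key):
    # The partitioner (here, key mod node count) names the single owning node.
    return nodes[int(key) % len(nodes)].get(key)

def get_by_city(city):
    # The data is not partitioned by city, so the request is broadcast:
    # every node participates, which limits cluster-wide concurrency.
    hits = []
    for node in nodes:
        hits.extend(r for r in node.values() if r["city"] == city)
    return hits

assert get_by_primary_key("1")["name"] == "alice"
assert len(get_by_city("york")) == 2
```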

This is why many key-value stores are reluctant to offer secondary indexes, however widely useful they are; HBase and Voldemort, for example, decline to. Others, such as MongoDB, Cassandra, and Riak, do provide them, and whatever else one says, secondary indexes are handy. But it is important to understand their effect on concurrency across the whole system.

Replication is the way around that concurrency bottleneck. You are probably familiar with it already, whether as asynchronous replicas of a relational server or as replicas in a NoSQL store such as MongoDB or Cassandra.

In practice, replicas may be invisible (used only for recovery), read-only (increasing read concurrency), or read-write (increasing availability under network partitions). Which to choose is a tradeoff against the consistency of the system. This is a straightforward application of the CAP theorem (consistency, availability, partition tolerance), although CAP is not quite as simple as it first appears.

(Translator's note: a network partition occurs when network failures split the cluster, for example when a node becomes unreachable.)

This consistency tradeoff raises an important question: when do we actually need data to be consistent?

Consistency is expensive. In the database world, atomicity is guaranteed by linearisability, which ensures that all operations appear to take place in a single, ordered sequence. That is costly; so costly, in fact, that many databases don't offer it as an isolation level at all, and those that do rarely make it the default.

Put simply, if you want a distributed write system with strong consistency, it is going to be slow.

Note that the term "consistency" has two meanings, one in ACID and one in CAP, and they mean different things. I will use the CAP definition: all nodes see the same data at the same point in time.

The simplest solution to the consistency problem is to avoid it. If it cannot be avoided, isolate it, assigning it as few writes and as few machines as possible.

Avoiding consistency is usually not that hard, especially when the data is an immutable stream of facts; web log collection is a good example. There is no consistency to worry about, because the logs, being facts, never change.

Use cases that genuinely require consistency include transfers of money, redemption of coupon codes, and other operations that are not commutative.

And some things that traditionally require consistency actually don't. For example, if an operation can be recast from mutating state into creating a new set of associated facts, the state change can be avoided altogether. Consider flagging a transaction as potentially fraudulent: rather than updating a field on the record directly, we can simply emit a new stream of facts that refer back to the original transaction.

(Translator's note: a good point.)

So it pays either to remove the consistency requirement from the data platform entirely, or to isolate it. One way to isolate it is the single-writer principle, which Datomic applies in several respects; another is to split mutable from immutable data, so that the consistency requirement is confined to the mutable part.

Ideas such as Bloom/CALM extend this further by making disorder the default and requiring ordering only where it is truly needed. Those are the basic tradeoffs we have to make; so how do we use these properties to build a data platform?

A typical application architecture might look like this: a set of processes writes data to a database and later reads it back. For many simple workloads this is perfectly fine, and many successful applications are built on the pattern. But as throughput grows, the pattern becomes harder and harder to scale; in the application domain the problem can be attacked with message passing, actors, and load balancing.

The other problem is that this treats the database as a black box, and a database is not transparent software. Databases provide enormous feature sets, but very few mechanisms for breaking that atomic whole apart. There are good reasons for this, they are safe by default, but it becomes frustrating when that safety throttles the system and limits our ability to distribute it.

Command Query Responsibility Segregation (CQRS) offers a simple answer to this problem.

The sections that follow walk through five implementations of this idea:

    1. Druid

    2. The Operational/Analytic Bridge

    3. The batch pipeline

    4. The Lambda architecture

    5. The Kappa architecture, also called the stream data platform

The idea is simple: separate the read and write workloads. Writes go to a write-optimized medium, in the simplest case a plain log file; reads come from a read-optimized one. There are many variations, from tools such as GoldenGate for relational databases to products with built-in replication such as MongoDB's replica sets.

Many databases do something like this under the hood. Druid is a good example: an open-source, distributed, time-series, columnar analysis engine. Columnar storage performs impressively, but it works best when data arrives in large blocks, since the data must be spread across many files. To keep writes fast, Druid stores recently arrived data in a write-optimized store and gradually migrates it into the read-optimized store.

When Druid is queried, the request is sent to both the write-optimized and the read-optimized components; the results are combined (with duplicates removed) and returned to the user. Druid uses the timestamp on each record to order all of this.

A composition like this provides the benefits of CQRS behind a single abstraction.

A similar approach is the Operational/Analytic Bridge, which splits read-optimized and write-optimized views with a single event stream between them. The stream is retained, so the asynchronous views can be regenerated and enriched at a later date.

The front end provides synchronous reads and writes, which makes it fast and simple to read back data you have just written and makes complex atomic transactions possible.

The back end exploits the advantages of asynchronous, immutable state to improve performance, for example using replication, denormalization, or even an entirely different storage engine to scale out offline processing. The messaging bridge between front and back also makes it easy for applications to listen to the data flowing through the platform. This pattern suits medium-sized deployments where a mutable view is, at least in part, unavoidable.

Designing for immutable state makes it much easier to support large datasets and more complex analytics. The batch pipeline, the canonical implementation on the Hadoop stack, is the typical example.

The most compelling aspect of the Hadoop stack is the sheer range of its tools. Whether you want fast read-write access, cheap storage, batch processing, high-throughput messaging, or tools for extracting, processing, and analysing data, the Hadoop ecosystem has something for it.

The batch pipeline pulls data from essentially any source, pushes it into HDFS, and then processes it to provide progressively more refined versions of the original data.

The data may be enriched, cleansed, denormalized, aggregated, moved to a read-optimized format such as Parquet, or loaded into a serving layer or data mart; once processed, it can be queried and processed further.

This architecture suits immutable data ingested and processed at very large scale, think hundreds of terabytes (TB). The process is slow, though: measured in hours.

The problem with the batch pipeline is that we often don't want to wait hours for a result. The common answer is to add a streaming layer, in what is sometimes called the Lambda architecture.

The Lambda architecture keeps the batch pipeline but adds a fast lane through a streaming layer, like a bypass road around a busy town; the streaming layer uses stream-processing tools such as Storm or Samza.

The core insight of the Lambda architecture is that we are often happy to have a fast, rough answer now, but we would like an exact answer in the end.

The streaming layer bypasses the batch layer to provide the best answers it can within a streaming window; these are written to a serving layer. Later, the batch pipeline computes the exact data and overwrites the earlier values.
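
A toy sketch of that serving-layer behaviour: the stream path writes quick approximate values, and the batch path later overwrites the same keys with exact ones. The view is just a dictionary and all names are illustrative.

```python
# The serving view holds whichever answer is freshest: the stream layer's
# approximation first, then the batch layer's exact recomputation.
serving_view = {}

def stream_update(key, approx_value):
    # Fast path: a best-effort answer, available within seconds.
    serving_view[key] = {"value": approx_value, "source": "stream"}

def batch_recompute(exact_values):
    # Slow path: hours later the batch pipeline rewrites the same keys exactly.
    for key, value in exact_values.items():
        serving_view[key] = {"value": value, "source": "batch"}

stream_update("clicks:2015-06-01", 10_432)        # rough count from the stream
batch_recompute({"clicks:2015-06-01": 10_406})    # exact count overwrites it
assert serving_view["clicks:2015-06-01"]["source"] == "batch"
```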

Balancing accuracy against responsiveness like this is elegant, but having the same logic coded twice, once in the streaming layer and once in the batch layer, is a real problem for some implementations. One remedy is to abstract that logic into a reusable library, which makes particular sense when it is written in an external language such as Python or R. The other is to use a system like Spark that provides both stream and batch processing (although Spark's streams are really just small batches).

So this pattern suits truly massive data platforms, say 100TB and beyond, that want to combine streams with existing, enriched, batch-based analytics.

Another approach to the slow-data-pipeline problem is the Kappa architecture. At first I thought the name was a misnomer; now I'm not so sure. Whatever we call it, I call it a stream data platform, a name it has also been given elsewhere.

The stream data platform has one advantage over the batch approach: instead of storing data in HDFS and re-reading it for each new batch job, the data is kept in a message system or a log such as Kafka. That log becomes the system of record, and the stream is processed in real time to create a third tier of views, indexes, serving layers, or data marts.

It is effectively the streaming layer of the Lambda architecture with the batch layer removed. Obviously this requires the messaging layer to be able to store and serve very large volumes of data, and it requires a powerful, efficient stream processor to handle the work.

There is no free lunch, and the problem is hard: a stream data platform cannot run much faster than an equivalent batch system. But switching the default approach from "store, then process" to "stream, and process as you go" greatly improves our chances of getting results quickly.

Stream data platforms can also be used to address the "application integration" problem, the thorny issue that has occupied big vendors such as Informatica, TIBCO, and Oracle for many years. Plenty of databases have been useful here, but none has been a transformative solution; application integration remains a topic in search of a practical answer.

The stream data platform offers a potential one: it takes the advantages of the operational/analytic bridge, multiple asynchronous storage formats and the ability to recreate views, but adds a requirement for consistency with the existing sources of record:

The system of record becomes the log, which makes it easy to favour immutable data. Products such as Kafka can retain enough data, at enough throughput, to be used as a historical record. Reacting to change then becomes a matter of replaying and regenerating state, rather than patching it in place.

Similar approaches have been tried before, well before today's data lakes: GoldenGate-style tools feeding data into enterprise data warehouses, for example. They were held back by the limited throughput of the replication layer and by the difficulty of managing complex schema changes. The throughput problem appears to be solved now; on schema management, the jury is still out.

So, returning to locality: sequential versus random access is the tradeoff at the heart of each individual component. We have seen how those components can be scaled out through sharding and replication. And we have looked at consistency as a problem to be isolated when building a platform.

The data platform as a whole, though, has to balance these components globally into the best achievable state: a continuous restructuring of data from write-optimized form to read-optimized form, from the constraints of consistency out into the open territory of streamed, asynchronous, immutable state.

There are a few things to keep in mind: schema is one; the risks of time in distributed, asynchronous systems are another. But these problems are manageable if they are taken seriously. As for the future of big data, new tools and innovations will no doubt keep being folded into platforms like these, solving the problems of the past and the present.


(Translator's note: schema here refers to the database's structure and integrity constraints.)
