Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)

Transferred from: http://confluent.io/blog/stream-data-platform-2


In the first part of this guide to building a stream data platform, Confluent co-founder Jay Kreps described how to build a company-wide, real-time stream data hub; InfoQ covered it earlier. This article summarizes the second part, in which Jay gives specific recommendations for building a stream data platform.

Limit the number of clusters

The fewer Kafka clusters there are, the simpler the system architecture: fewer integration points, lower incremental cost for each new application, and data flow that is easier to reason about. However, a single cluster is not always possible, for the following reasons:

    • Keeping activity local to a data center. Jay recommends that all applications connect to a cluster in their local data center.
    • Security. At the time of writing, Kafka did not have security controls, which typically meant implementing network-level security and physically segregating data types.
    • SLA control. Kafka has some multi-tenancy features, but they are not perfect.

Streamline data flow

Centering data exchange on a single infrastructure platform greatly simplifies data flow. If all systems are connected directly to one another, every pair of systems needs its own pipelines.

With a stream data platform as the hub, each system connects only to the platform.

In the point-to-point arrangement, two data pipelines must be built between every pair of systems, whereas in the hub arrangement each system needs only one input connector and one output connector to the stream data platform. As the number of systems grows, the gap between the two pipeline counts becomes very large.
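
The arithmetic behind that comparison can be sketched in a few lines. The function names below are hypothetical, and the counts assume the worst case described above, where every system exchanges data with every other system in both directions.

```python
def point_to_point_pipelines(n_systems: int) -> int:
    """Pipelines needed when every system feeds every other system directly.

    Each ordered pair (source, destination) needs its own pipeline,
    so there are two pipelines per pair of systems.
    """
    return n_systems * (n_systems - 1)


def hub_connectors(n_systems: int) -> int:
    """Connectors needed when each system has one input and one output to the hub."""
    return 2 * n_systems


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, point_to_point_pipelines(n), hub_connectors(n))
```

With 10 systems this gives 90 point-to-point pipelines against 20 hub connectors; the first number grows quadratically, the second linearly.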

In addition, different systems may have different data models. With point-to-point integration, each system has to handle the data formats of every other system it talks to. With the stream data platform as the central integration point, each system only needs to handle the data format of the platform. This minimizes the number of format conversions.

Specify a data format

Kafka does not require event data to be in any particular format; JSON, XML, and Avro all work. However, settling on a company-wide data format for events is critical. When data follows a common specification, producers and consumers do not have to write different adapters for different formats. This is the most important thing to get right when starting a stream data platform.

Based on experience, Jay recommends Apache Avro as the unified data format. Avro has a JSON-like data model that can be represented either as JSON or in a binary form. It has the following advantages:

    • It maps directly to and from JSON;
    • It has a very compact format;
    • It is very fast;
    • It provides bindings for a variety of programming languages;
    • It has an extensible schema language defined in pure JSON;
    • It has the best notion of compatibility.

Schemas are critical to data quality and ease of use. Avro lets you define a schema for the data, which brings the following benefits:

    • A more robust architecture: In an architecture centered on a stream data platform, applications are loosely coupled, and without schemas the data is likely to become inconsistent between systems.
    • Explicit semantics: The doc attribute of each field in the schema clearly defines what the field means.
    • Compatibility: Schemas handle changes in data formats, so that systems like Hadoop or Cassandra can track upstream data changes and pass only the changed data to their own storage without having to reprocess it.
    • Less manual work for data scientists: Schemas make the data well specified, so data scientists no longer need to do low-level data munging.
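
As an illustration of the doc attribute and of compatibility, here is a minimal sketch of two versions of an Avro record schema written as Python dictionaries. The record name PageViewEvent and all field names are hypothetical. The can_read function is a deliberately simplified stand-in for Avro schema resolution: it only checks that every field the reader expects is either present in the writer's schema or has a default, and it ignores types, aliases, and nested records.

```python
import json

# Hypothetical schema for a page-view event; all names are illustrative.
PAGE_VIEW_V1 = {
    "type": "record",
    "name": "PageViewEvent",
    "doc": "A member viewed a page.",
    "fields": [
        {"name": "member_id", "type": "long", "doc": "ID of the viewing member."},
        {"name": "url", "type": "string", "doc": "URL of the page that was viewed."},
    ],
}

# Version 2 adds an optional field with a default, so it can still read v1 data.
PAGE_VIEW_V2 = {
    **PAGE_VIEW_V1,
    "fields": PAGE_VIEW_V1["fields"] + [
        {
            "name": "referrer",
            "type": ["null", "string"],
            "default": None,
            "doc": "Referring URL, or null if unknown.",
        },
    ],
}


def can_read(reader: dict, writer: dict) -> bool:
    """Simplified compatibility check (an assumption, not the full Avro rules).

    A reader field that the writer never wrote must carry a default value.
    """
    writer_fields = {field["name"] for field in writer["fields"]}
    return all(
        field["name"] in writer_fields or "default" in field
        for field in reader["fields"]
    )


if __name__ == "__main__":
    print(json.dumps(PAGE_VIEW_V2, indent=2))
    print(can_read(PAGE_VIEW_V2, PAGE_VIEW_V1))
```

Adding a field without a default would make can_read return False for old data, which is the kind of change a company-wide schema convention is meant to catch early.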

In addition to the above recommendations, Jay describes some of the practices used at LinkedIn.

Share event schemas

When an activity is common to multiple systems, it should be given a common schema. A typical example is application errors, which can be modeled in a very generic way so that a single ErrorEvent stream captures errors across the enterprise.

Modeling specific data types

The Kafka data model is built to represent data streams. In Kafka, a stream is modeled as a topic, which is the logical name for the data. Each message contains a key, used to partition data across the cluster, and a body containing an Avro data record. Kafka retains the history of a stream based on an SLA (for example, 7 days), on size (for example, 100 GB), or by key.
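
To make the role of the key concrete, here is a minimal sketch of key-based partitioning. The Message class and partition_for function are hypothetical, and CRC32 is only a stand-in hash chosen because it ships with Python; Kafka's Java client uses a different hash (murmur2) by default, so the partition numbers here will not match a real cluster.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    key: bytes    # decides which partition the message lands in
    value: bytes  # in the setup described here, an Avro-encoded record


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index (CRC32 is a stand-in, not Kafka's hash)."""
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    msg = Message(key=b"member-42", value=b"...avro bytes...")
    print(partition_for(msg.key, num_partitions=8))
```

Because the mapping depends only on the key, all messages for the same key go to the same partition and are therefore read back in order.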

    • Pure event streams: Pure event streams describe the activities that occur within an enterprise. For example, at a web company these activities are clicks, page views, and various other user behaviors. Events of each type can be represented as a separate logical stream. For simplicity, it is recommended that the Avro schema and the topic share the same name. Pure event streams are always retained by time or by size. Mixing multiple event types in a single topic leads to unnecessary complexity.
    • Application logs: Structured logs can be treated the same as the other events described above; "logs" here means semi-structured application logs. At LinkedIn, all application logs are published to Kafka through a custom log4j appender.
    • System metrics: Collect statistics such as Unix performance data and application-defined metrics, then publish them as a stream in a common format for use by monitoring platforms across the enterprise.
    • Hadoop data loading: The most important thing is to automate data loading without any custom setup or mapping between Kafka topics and Hadoop datasets. LinkedIn developed a system called Camus for this purpose.
    • Hadoop data publishing: Publish the derived streams produced by Hadoop computations back to the stream data platform.
    • Database changes: Because polling can miss intermediate states, LinkedIn chose to integrate with database logs directly. For pure event data, Kafka usually retains only a short window. For a database change stream, however, a system may need to be restored in full from the Kafka changelog. Kafka's log compaction feature helps meet this requirement.
    • Extract database data as-is, then transform it: It is not a good idea to clean up data before publishing it to consumers. Different consumers may have different requirements, so the cleanup would have to be done many times over, and the cleanup itself may lose information. Instead, publish the raw data stream, then create a derived stream on top of it that performs the cleanup.
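
The idea behind log compaction for database change streams can be sketched as follows. This is an in-memory illustration, not Kafka's implementation: the compact and restore functions and the sample changelog are hypothetical, and unlike Kafka this sketch keeps delete markers forever instead of eventually discarding them.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (primary key, row after the change); a value of None marks a delete.
Change = Tuple[str, Optional[dict]]


def compact(changelog: Iterable[Change]) -> List[Change]:
    """Keep only the most recent change for each key, in original order."""
    records = list(changelog)
    last_offset: Dict[str, int] = {}
    for offset, (key, _value) in enumerate(records):
        last_offset[key] = offset
    return [
        record
        for offset, record in enumerate(records)
        if last_offset[record[0]] == offset
    ]


def restore(changelog: Iterable[Change]) -> Dict[str, dict]:
    """Rebuild the current table contents by replaying a changelog."""
    table: Dict[str, dict] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


changelog = [
    ("user:1", {"name": "Ann", "city": "Oslo"}),
    ("user:2", {"name": "Bob", "city": "Rome"}),
    ("user:1", {"name": "Ann", "city": "Lima"}),
    ("user:2", None),
]

if __name__ == "__main__":
    print(compact(changelog))
    print(restore(compact(changelog)))
```

Replaying the compacted log yields the same table as replaying the full log, which is why a compacted changelog is enough to restore a downstream system in full.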

Stream processing

One goal of the stream data platform is to move data between data systems; another is to process data as it arrives. In a stream data platform, stream processing can be modeled simply as transformations between streams.

Publishing processing results back to Kafka has many benefits. It decouples the stages of stream processing: different processing tasks can be implemented by different teams using different technologies, and a slow downstream processor does not back up into upstream processing, because Kafka acts as a buffer.

The most basic way to implement stream processing is to use the Kafka API to read input streams, process them, and produce output streams. This approach is simple to operate and works in any programming language that has a Kafka client. However, some stream processing systems provide additional functionality that makes complex real-time processing easier to build. Common stream processing frameworks include Storm, Samza, and Spark Streaming. The original post links to further reading on the differences between them.
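
The read, process, and produce loop can be sketched without a running cluster. In the example below, InMemoryLog is a hypothetical stand-in for Kafka (it is not a Kafka client API), and the topic names and the clean function are illustrative. The job reads a raw stream and publishes a cleaned, derived stream, leaving the raw stream untouched.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Optional


class InMemoryLog:
    """Stand-in for a Kafka cluster: topic name -> append-only list of records."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[dict]] = defaultdict(list)

    def produce(self, topic: str, record: dict) -> None:
        self.topics[topic].append(record)

    def consume(self, topic: str, offset: int = 0) -> Iterator[dict]:
        # Iterate over a snapshot so that producing while consuming is safe.
        return iter(self.topics[topic][offset:])


def run_job(
    log: InMemoryLog,
    source: str,
    sink: str,
    transform: Callable[[dict], Optional[dict]],
) -> int:
    """Read every record from source, transform it, and publish results to sink.

    Records for which transform returns None are dropped.
    Returns the number of records published.
    """
    published = 0
    for record in log.consume(source):
        result = transform(record)
        if result is not None:
            log.produce(sink, result)
            published += 1
    return published


def clean(record: dict) -> Optional[dict]:
    """Illustrative cleanup: drop anonymous views and normalize the URL."""
    if record.get("member_id") is None:
        return None
    return {**record, "url": record["url"].strip().lower()}


if __name__ == "__main__":
    log = InMemoryLog()
    log.produce("PageViewEvent", {"member_id": 1, "url": "HTTP://Example.com/A "})
    log.produce("PageViewEvent", {"member_id": None, "url": "http://example.com/b"})
    run_job(log, "PageViewEvent", "PageViewEventCleaned", clean)
    print(log.topics["PageViewEventCleaned"])
```

Because the output goes to its own topic, another team could consume PageViewEventCleaned with a different technology without knowing how the cleanup job is implemented.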
