About Apache Pulsar: a next-generation distributed messaging system

Last Update:2018-08-19 Source: Internet

Author: User



Original link: https://mp.weixin.qq.com/s/uwmLR-1Jo_VNXRFA0yYWlg

Apache Pulsar is an enterprise-class publish-subscribe (pub-sub) messaging system. It was originally developed by Yahoo, open sourced at the end of 2016, and is now an incubating project at the Apache Software Foundation. Pulsar has run in Yahoo's production environment for more than three years, serving Yahoo's main applications such as Yahoo Mail, Yahoo Finance, Yahoo Sports, Flickr, the Gemini advertising platform, and Sherpa, Yahoo's distributed key-value store.

Concepts and terminology

Applications that send data to Pulsar are called producers, and applications that read data from Pulsar are called consumers. Consumers are sometimes also called subscribers. The topic is the core resource in Pulsar. A topic can be thought of as a channel: producers send data into the channel, and consumers pull data out of it.

Figure 1: Producers, consumers, and topics

Pulsar was built from the start to support multi-tenant scenarios. Its multi-tenancy mechanism is based on two resources: the property and the namespace. A property represents a tenant in the system. If a Pulsar cluster serves many applications (as it does at Yahoo), each property in the cluster can represent a team in the organization, a core function, or a product line. A property can contain multiple namespaces, and a namespace can contain any number of topics.

Figure 2: Relationships among the components of Pulsar

The namespace is the basic administrative unit in Pulsar. At the namespace level we can set permissions, adjust replication options, manage data replication across clusters, control message expiration, and perform other critical operations. Topics in a namespace inherit the namespace's configuration, so all the topics in one namespace can be configured at once. There are two types of namespace:

Local: the namespace is visible only within the cluster where it is defined.
Global: the namespace is visible across multiple clusters, which may be in the same data center or in geographically separate data centers. This depends on whether replication between clusters is enabled.

Although local and global namespaces differ in scope, both can be shared across teams or organizations. Once an application has been granted write access to a namespace, it can write to every topic in that namespace. If the topic it writes to does not exist, the topic is created.
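The namespace behaviour described above (policies set once per namespace, write access granted per namespace, topics created on first write) can be illustrated with a small model. This is only a sketch in plain Python, not Pulsar's API: the `Namespace` class, the `message_ttl_seconds` policy, and the application and topic names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy model of a namespace: policies live here and topics inherit them."""
    name: str
    message_ttl_seconds: int                     # hypothetical namespace-level policy
    writers: set = field(default_factory=set)    # applications with write access
    topics: dict = field(default_factory=dict)   # topic name -> list of messages

    def publish(self, app: str, topic: str, message: str) -> None:
        if app not in self.writers:
            raise PermissionError(f"{app} may not write to {self.name}")
        # A topic that does not exist yet is created on first write.
        self.topics.setdefault(topic, []).append(message)

    def topic_ttl(self, topic: str) -> int:
        # Topics carry no policy of their own; they read the namespace's value.
        if topic not in self.topics:
            raise KeyError(topic)
        return self.message_ttl_seconds


ns = Namespace("mail/inbox-events", message_ttl_seconds=3600,
               writers={"mail-frontend"})
ns.publish("mail-frontend", "delivered", "msg-1")
ns.publish("mail-frontend", "bounced", "msg-2")
```

Both topics were created implicitly by the first write, and both report the namespace's expiration setting without any per-topic configuration.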

Each namespace can contain one or more topics, each topic can have multiple subscribers, and each subscriber can receive all the messages published to the topic. To give applications more flexibility, Pulsar provides three subscription types, which can coexist on the same topic:

Exclusive subscription: only one consumer can be attached to the subscription at any time.
Shared subscription: multiple consumers can attach to the subscription, and each consumer receives a subset of the messages.
Failover subscription: multiple consumers can connect to the same topic, but only one of them receives messages. Another consumer starts receiving messages only when the current consumer fails.

Figure 3 shows these three kinds of subscriptions. Pulsar's subscription mechanism decouples the producers and consumers of messages, providing greater flexibility for applications without adding complexity and development effort.

Figure 3: The different types of Pulsar subscription
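The difference between the three subscription types can be made concrete with a toy dispatcher. This is a sketch of the semantics only, not of Pulsar's implementation. It rests on two assumptions that the text does not state: a shared subscription hands messages out in turn, and a failover subscription prefers consumers in the order they are listed. The function name `dispatch` and its parameters are hypothetical.

```python
from collections import defaultdict


def dispatch(messages, consumers, mode, alive=None):
    """Return {consumer: [messages]} for a single subscription.

    mode is "exclusive", "shared" or "failover"; alive is the set of
    consumers that are still connected (defaults to all of them).
    """
    alive = set(consumers) if alive is None else alive
    received = defaultdict(list)
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows one consumer")
        received[consumers[0]].extend(messages)
    elif mode == "shared":
        # Simplification: hand messages out in turn to the connected consumers.
        active = [c for c in consumers if c in alive]
        for i, msg in enumerate(messages):
            received[active[i % len(active)]].append(msg)
    elif mode == "failover":
        # Only the first consumer that is still connected receives messages.
        active = next(c for c in consumers if c in alive)
        received[active].extend(messages)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(received)
```

Calling `dispatch(msgs, ["a", "b"], "failover", alive={"b"})` models the moment after consumer `a` has failed: every message now goes to `b`.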

Data partitioning

The amount of data written to a topic can range from a few megabytes to several terabytes. A topic therefore needs low throughput in some cases and very high throughput in others, depending on the number of consumers. How should the system handle topics whose throughput demands vary this widely? To solve this problem, Pulsar can spread a topic's data across multiple machines by splitting the topic into partitions.

Partitioning is a common way to sustain high throughput when handling large amounts of data. By default, Pulsar topics are not partitioned, but it is easy to create a partitioned topic and specify its number of partitions using the command-line tool or the API.

Once a partitioned topic has been created, Pulsar partitions the data automatically without affecting producers or consumers. In other words, if an application was written against a single topic, its code does not need to change after the topic is partitioned. Partitioning is purely an operational concern, and the application does not need to know how it is carried out.

The partitions of a topic are handled by processes called brokers, and each node in a Pulsar cluster runs its own broker.

Figure 4: A topic partitioned across multiple brokers

Besides leaving applications unaffected, partitioned topics come with several message routing strategies that help distribute data across partitions and across consumers:

Single partition: the producer picks one partition at random and writes all its data to that partition. This strategy provides the same guarantees as a non-partitioned topic, and it is useful when many producers write to the same topic.
Round-robin partitioning: the producer spreads data evenly across the partitions in turn. For example, the first message is written to the first partition, the second message to the second partition, and so on.
Hash partitioning: every message carries a key, and the key determines which partition the message is written to. This strategy preserves the order of messages that share a key.
Custom partitioning: the producer uses a custom function to compute a partition value and writes the message to the corresponding partition.
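The four routing strategies can be sketched as a small partition selector. This is an illustration in plain Python rather than the Pulsar client API. The `Router` class, its strategy names, and the use of CRC32 as the hash are assumptions made for the example; a real client may hash keys differently.

```python
import itertools
import random
import zlib


class Router:
    """Picks a partition index for each message under one routing strategy."""

    def __init__(self, num_partitions, strategy, custom=None, seed=None):
        self.n = num_partitions
        self.strategy = strategy
        self.custom = custom
        self._counter = itertools.count()
        # Single-partition mode: choose one partition at random and stick to it.
        self._pinned = random.Random(seed).randrange(num_partitions)

    def partition_for(self, key=None, payload=None):
        if self.strategy == "single":
            return self._pinned
        if self.strategy == "round_robin":
            return next(self._counter) % self.n
        if self.strategy == "hash":
            # CRC32 is a stand-in for whatever hash a real client uses.
            return zlib.crc32(key.encode("utf-8")) % self.n
        if self.strategy == "custom":
            # The caller-supplied function decides the partition.
            return self.custom(key, payload, self.n)
        raise ValueError(f"unknown strategy: {self.strategy}")
```

With the hash strategy, two messages with the same key always land on the same partition, which is what makes per-key ordering possible. With round robin, consecutive messages cycle through partitions 0, 1, 2, and so on.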

Durability

Once a Pulsar broker has received a message and acknowledged it, it must guarantee that the message will not be lost under any circumstances. Unlike other messaging systems, Pulsar uses Apache BookKeeper to provide durability. BookKeeper offers low-latency persistent storage. When Pulsar receives a message, it sends it to several BookKeeper nodes (how many depends on the replication factor). Each node writes the data to a write-ahead log and also keeps a copy in memory. A node forces the log to persistent storage before it acknowledges the message, so the data survives even a power failure. Because the Pulsar broker sends the data to multiple nodes, it acknowledges the message to the producer only after a quorum of nodes has confirmed that the write succeeded. This is how Pulsar ensures that data is not lost even in the event of a hardware failure, network failure, or other failure. A later article will cover this in detail.
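The write path described above (journal the entry, then acknowledge, and confirm to the producer only once a quorum has acknowledged) can be sketched as follows. This is a toy model in plain Python, not BookKeeper's API. The `Bookie` class, the `publish` function, and the `ack_quorum` parameter are names chosen for the illustration, and the forced sync to disk is represented only by a comment.

```python
class Bookie:
    """Stand-in for one storage node: journal first, then acknowledge."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.journal = []   # models the write-ahead log on durable storage
        self.cache = {}     # models the in-memory copy

    def add_entry(self, entry_id, payload):
        if not self.healthy:
            return False    # no acknowledgement from a failed node
        self.journal.append((entry_id, payload))  # a real node would fsync here
        self.cache[entry_id] = payload
        return True


def publish(bookies, entry_id, payload, ack_quorum):
    """Send the entry to every node; confirm only if ack_quorum nodes acked."""
    acks = sum(1 for b in bookies if b.add_entry(entry_id, payload))
    return acks >= ack_quorum
```

With three nodes of which one is down, `publish(nodes, 1, b"hello", ack_quorum=2)` still confirms the message, because two nodes have journaled it. The same call with `ack_quorum=3` would not confirm it.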

Pulsar in production

Pulsar currently serves Yahoo's main applications, such as Yahoo Mail, Yahoo Finance, Yahoo Sports, the Gemini advertising platform, and Sherpa, Yahoo's distributed key-value store. Many of these use cases require strong durability guarantees, such as zero data loss, together with high performance. Pulsar has been in production since 2015 and now runs at large scale in Yahoo's production environment:

Deployed in more than 10 data centers, with full-mesh replication between them
More than 100 billion messages processed per day
1.4 million topics supported
Average publish latency under 5 milliseconds

Summary

This article briefly introduced some of the concepts behind Apache Pulsar. It explained how Pulsar guarantees durability by committing data before sending an acknowledgement, and how it uses partitioning to improve throughput. Later articles will look more closely at Pulsar's overall architecture and features, and will offer guidance on how to use Pulsar effectively.

English original:
Introduction to the Apache Pulsar pub-sub messaging platform:
https://streaml.io/blog/intro-to-pulsar/


