About Apache Pulsar: a next-generation distributed messaging system

Original link: https://mp.weixin.qq.com/s/uwmLR-1Jo_VNXRFA0yYWlg

Apache Pulsar is an enterprise-class publish-subscribe (pub-sub) messaging system. It was originally developed by Yahoo, open-sourced at the end of 2016, and is now an incubator project of the Apache Software Foundation. Pulsar has run in Yahoo's production environment for more than three years, serving Yahoo's main applications such as Yahoo Mail, Yahoo Finance, Yahoo Sports, Flickr, the Gemini advertising platform, and Sherpa, Yahoo's distributed key-value store.

Concepts and terminology

Applications that send data to Pulsar are called producers, and applications that read data from Pulsar are called consumers. Consumers are sometimes also called subscribers. The topic is Pulsar's core resource: a topic can be regarded as a channel to which producers send data and from which consumers pull data.

Figure 1: Producers, consumers, and topics

Pulsar was built to support multi-tenant scenarios. Its multi-tenancy mechanism contains two resources: the property and the namespace. A property represents a tenant in the system. Suppose one Pulsar cluster supports multiple applications (as at Yahoo): each property in the cluster can represent a team in the organization, a core function, or a product line. A property can contain multiple namespaces, and a namespace can contain any number of topics.

Figure 2: Relationships among the Pulsar components

The namespace is the basic administrative unit in Pulsar. At the namespace level we can set permissions, adjust replication options, manage data replication across clusters, control message expiration, and perform other critical tasks. Topics inherit the configuration of their namespace, so all topics in the same namespace can be configured at once. Namespaces come in two types:

    • Local: a local namespace is visible only within its own cluster.

    • Global: a global namespace is visible to multiple clusters, which may sit in the same data center or span geographically separate data centers. This depends on whether cluster replication is enabled.

Although local and global namespaces differ in scope, both can be shared across teams or organizations. Once an application obtains write access to a namespace, it can write data to every topic in that namespace. If it writes to a topic that does not exist, the topic is created.
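The hierarchy described above (tenant, namespace, topic) can be sketched as a small in-memory model. This is only an illustration of the rules stated in the text (write access is granted per namespace, topics inherit namespace settings, and a topic is created on first write); it does not use the Pulsar client or admin API, and the class names, role names, and the `message_ttl_seconds` setting are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy namespace: shared policies plus the topics created inside it."""
    name: str
    policies: dict = field(default_factory=dict)  # hypothetical settings, e.g. a message TTL
    writers: set = field(default_factory=set)     # roles that hold write access
    topics: dict = field(default_factory=dict)    # topic name -> list of messages

    def publish(self, role: str, topic: str, message: str) -> None:
        if role not in self.writers:
            raise PermissionError(f"{role} may not write to {self.name}")
        # Writing to a topic that does not exist yet creates it.
        self.topics.setdefault(topic, []).append(message)

    def topic_policies(self, topic: str) -> dict:
        # Topics hold no settings of their own here: they inherit the namespace's.
        if topic not in self.topics:
            raise KeyError(topic)
        return dict(self.policies)


@dataclass
class Tenant:
    """Toy tenant: owns any number of namespaces."""
    name: str
    namespaces: dict = field(default_factory=dict)

    def create_namespace(self, name: str, **policies) -> Namespace:
        ns = Namespace(name=name, policies=policies)
        self.namespaces[name] = ns
        return ns


mail = Tenant("mail-team")
ns = mail.create_namespace("notifications", message_ttl_seconds=3600)
ns.writers.add("mail-backend")
ns.publish("mail-backend", "new-mail", "hello")  # "new-mail" is created on first write
print(ns.topic_policies("new-mail"))
```

Because the settings live on the namespace, changing `ns.policies` once changes what every topic in it reports, which is the "configure all topics at once" behavior the text describes.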

Each namespace can contain one or more topics, each topic can have multiple subscribers, and each subscriber can receive every message published to the topic. To give applications more flexibility, Pulsar provides three subscription types, which can coexist on the same topic:

    • Exclusive subscription: only one consumer can be attached at any time.

    • Shared subscription: multiple consumers can subscribe, and each consumer receives a subset of the messages.

    • Failover subscription: multiple consumers can connect to the same topic, but only one of them receives messages. Another consumer starts receiving messages only when the current one fails.

Figure 3 shows the three subscription types. Pulsar's subscription mechanism decouples message producers from consumers, giving applications more flexibility without adding complexity or development effort.

Figure 3: Pulsar subscription types
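The three subscription types can be illustrated with a small dispatch function. This is a simplified model, not Pulsar's implementation or client API: the function name `dispatch` is hypothetical, the shared mode is assumed to hand out messages round robin, and the failover mode is assumed to treat the first connected consumer as the active one.

```python
from collections import defaultdict


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for one subscription of the given mode.

    `consumers` lists the connected consumers in connection order.
    """
    if not consumers:
        raise ValueError("no consumers connected")
    received = defaultdict(list)
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows exactly one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Each message goes to exactly one consumer (round robin in this sketch).
        for i, msg in enumerate(messages):
            received[consumers[i % len(consumers)]].append(msg)
    elif mode == "failover":
        # Only the first consumer is active; the others are standbys.
        received[consumers[0]] = list(messages)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(received)


msgs = ["m1", "m2", "m3", "m4"]
print(dispatch(msgs, ["a", "b"], "shared"))        # {'a': ['m1', 'm3'], 'b': ['m2', 'm4']}
print(dispatch(msgs[:2], ["a", "b"], "failover"))  # {'a': ['m1', 'm2']}
# Consumer "a" fails, so "b" moves to the front and takes over.
print(dispatch(msgs[2:], ["b"], "failover"))       # {'b': ['m3', 'm4']}
```

In the failover case the standby receives nothing until the active consumer disappears from the list, which mirrors the behavior described in the third bullet.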

Data partitioning

The data written to a topic may amount to only a few megabytes or to several terabytes. Throughput therefore varies: some topics see low throughput and others high, depending on the number of consumers. How should topics with very different throughput be handled? To solve this problem, Pulsar can distribute a topic's data across multiple machines; each portion is called a partition.

Partitioning is a common way to handle large volumes of data while keeping throughput high. By default, Pulsar topics are not partitioned, but it is easy to create a partitioned topic and specify its number of partitions with the command-line tool or the API.

Once a partitioned topic is created, Pulsar partitions the data automatically without affecting producers or consumers. In other words, if an application writes to a topic and the topic is then partitioned, the application's code does not need to change. Partitioning is purely an operational task, and the application does not need to care how it is carried out.

Topic partitions are handled by a process called the broker, and each node in a Pulsar cluster runs its own broker.

Figure 4: A topic partitioned across multiple brokers

Besides keeping partitioning transparent to applications, Pulsar provides several message routing strategies that help distribute data across partitions and consumers:

    • Single partition: the producer randomly picks one partition and writes all of its data to that partition. This gives the same guarantees as a non-partitioned topic, and it can be useful when more than one producer writes to the same topic.

    • Round-robin partitioning: the producer distributes data evenly across the partitions in turn. For example, the first message goes to the first partition, the second message to the second partition, and so on.

    • Hash partitioning: each message carries a key, and the key determines which partition the message is written to. This method guarantees ordering for messages that share a key.

    • Custom partitioning: the producer uses a custom function to compute a value for each message and writes the message to the partition that corresponds to that value.
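The four routing strategies can be sketched as functions that map a message to a partition index. This is an illustrative model rather than Pulsar's router interface: the function names are hypothetical, and `zlib.crc32` is used only because it is a stable hash available in the standard library, not because Pulsar is known to use it.

```python
import itertools
import random
import zlib


def single_partition_router(num_partitions, rng=random):
    # Pick one partition at random, then use it for every message.
    chosen = rng.randrange(num_partitions)
    return lambda key, payload: chosen


def round_robin_router(num_partitions):
    # Cycle through the partitions: 0, 1, 2, ..., then wrap around.
    counter = itertools.count()
    return lambda key, payload: next(counter) % num_partitions


def hash_router(num_partitions):
    # The same key always lands on the same partition, which preserves per-key order.
    return lambda key, payload: zlib.crc32(key.encode("utf-8")) % num_partitions


def custom_router(num_partitions, fn):
    # `fn` is any user-supplied function; its value is mapped onto a partition.
    return lambda key, payload: fn(key, payload) % num_partitions


rr = round_robin_router(3)
print([rr(None, b"x") for _ in range(5)])                   # [0, 1, 2, 0, 1]

by_key = hash_router(4)
print(by_key("user-42", b"a") == by_key("user-42", b"b"))   # True

by_size = custom_router(2, lambda key, payload: len(payload))
print(by_size(None, b"abc"))                                # 1
```

Each router has the same call shape, `(key, payload) -> partition`, so a producer could switch strategy without changing how it publishes.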

Durability

Once a Pulsar broker has received and acknowledged a message, it must ensure that the message is not lost under any circumstances. Unlike other messaging systems, Pulsar relies on Apache BookKeeper for durability. BookKeeper provides low-latency persistent storage. When Pulsar receives a message, it sends it to several BookKeeper nodes (how many depends on the replication factor). Each node writes the data to a write-ahead log and also keeps a copy in memory. A node forces the log to persistent storage before it acknowledges the message, so the data survives even a power failure. Because the broker sends the data to multiple nodes, it acknowledges the message to the producer only after a quorum (a majority) of the nodes has confirmed a successful write. This is how Pulsar ensures that data is not lost in the event of a hardware failure, network failure, or other failure. A later article will examine these details in depth.
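The acknowledgment rule described above can be sketched in a few lines. This is a toy model, not the BookKeeper or Pulsar API: the `StorageNode` class and `publish` function are hypothetical, the "log" is just a Python list, and the quorum is taken to be a simple majority, as the text states.

```python
class StorageNode:
    """Stand-in for one storage node: a write-ahead log plus an in-memory copy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.journal = []  # stands in for the log forced to persistent storage
        self.cache = []    # in-memory copy

    def add_entry(self, entry):
        if not self.healthy:
            return False
        # A real node would force the log to disk here, before confirming.
        self.journal.append(entry)
        self.cache.append(entry)
        return True


def publish(entry, nodes):
    """Send the entry to every node; acknowledge only if a majority confirmed."""
    confirmations = sum(1 for node in nodes if node.add_entry(entry))
    quorum = len(nodes) // 2 + 1
    return confirmations >= quorum


nodes = [StorageNode("n1"), StorageNode("n2"), StorageNode("n3", healthy=False)]
print(publish("m1", nodes))  # True: 2 of 3 nodes confirmed

nodes[1].healthy = False
print(publish("m2", nodes))  # False: only 1 of 3 confirmed, so the producer gets no ack
```

The producer treats a message as durable only when `publish` returns `True`, so a message that reached too few nodes is never reported as safely stored.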

Production practice

Pulsar currently serves Yahoo's main applications, such as Yahoo Mail, Yahoo Finance, Yahoo Sports, the Gemini advertising platform, and Sherpa, Yahoo's distributed key-value store. Many of these scenarios require strong durability guarantees, such as zero data loss, together with high performance. Pulsar has been in production since 2015 and now runs at large scale in Yahoo's production environment:

    • Deployed in more than 10 data centers, with full-mesh replication

    • More than 100 billion messages processed per day

    • 1.4 million topics supported

    • Overall message publish latency below 5 milliseconds
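For a rough sense of scale, the daily volume in the list above can be converted to an average rate. The calculation uses the stated lower bound of 100 billion messages per day and assumes traffic is spread evenly over the day, which real workloads rarely are, so the result is only an average.

```python
messages_per_day = 100_000_000_000   # lower bound quoted in the list above
seconds_per_day = 24 * 60 * 60       # 86,400

average_per_second = messages_per_day / seconds_per_day
print(f"{average_per_second:,.0f} messages per second on average")  # 1,157,407
```

That works out to more than 1.1 million messages per second on average across the deployment.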

Summary

This article briefly introduced some of Apache Pulsar's concepts and explained how Pulsar guarantees durability by committing data before it sends an acknowledgment, how it uses partitioning to raise throughput, and so on. Subsequent articles will examine Pulsar's overall architecture and features in detail and offer guidance on using Pulsar well.

English original:
Introduction to the Apache Pulsar pub-sub messaging platform:
https://streaml.io/blog/intro-to-pulsar/
