Apache Pulsar is a next-generation, large-scale distributed messaging system that Yahoo open-sourced in 2016 and that is now part of the Apache Software Foundation. It has been running in Yahoo's production environment for nearly four years, serving Mail, Finance, Sports, Flickr, the Gemini Ads platform, and Sherpa (Yahoo's key-value store), among others. It is deployed across Yahoo's 8 data centers worldwide and supports more than 2 million topics.
Apache Pulsar has several features that set it apart from other messaging systems:
Strong durability and ordering: each message has a globally unique ID, is stored as multiple replicas, and is acknowledged to the user only after it has been flushed to disk in real time.
Unified consumption model: supports both the stream model (as in Kafka) and the queue model (as in RabbitMQ), with three subscription modes: exclusive, failover, and shared.
Flexible scalability: node expansion is linear and takes effect immediately, with no data copying or migration during expansion.
High throughput and low latency: even with real-time flushing to disk, it still delivers high throughput (1.8 million messages/sec) and low latency (5 ms at the 99th percentile).
Pulsar 2.1 adds "stream-native" capabilities that go beyond a messaging system, such as schema support, tiered storage, and stateful functions.
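As a rough illustration of the exclusive, failover, and shared subscription modes named in the feature list, the sketch below models how one subscription might hand messages to its consumers. It is a toy model only: the `dispatch` function, the consumer names, and the round-robin policy are assumptions made for illustration, not Pulsar's actual dispatcher.

```python
from collections import defaultdict
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for a toy model of one subscription.

    exclusive: only a single attached consumer is allowed, and it gets everything.
    failover:  the first consumer is active and gets everything; others stand by.
    shared:    messages are spread round-robin across all consumers.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered = defaultdict(list)
    if mode == "exclusive":
        if len(consumers) > 1:
            raise ValueError("exclusive allows only one consumer")
        delivered[consumers[0]].extend(messages)
    elif mode == "failover":
        # Assumption: consumers[0] is the currently active consumer.
        delivered[consumers[0]].extend(messages)
    elif mode == "shared":
        for msg, consumer in zip(messages, cycle(consumers)):
            delivered[consumer].append(msg)
    else:
        raise ValueError("unknown mode: " + mode)
    return dict(delivered)


if __name__ == "__main__":
    msgs = ["m1", "m2", "m3", "m4"]
    for mode, group in [("exclusive", ["c1"]),
                        ("failover", ["c1", "c2"]),
                        ("shared", ["c1", "c2"])]:
        print(mode, dispatch(msgs, group, mode))
```

Exclusive and failover keep a single reader, which preserves ordering and matches the stream model, while shared spreads work across readers in the style of a queue.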
===========
Today Apache Pulsar officially released version 2.1.0! This release comes less than two months after version 2.0, and the community has contributed at a rapid pace in that short time. The 2.1 release contains a number of new features and improvements that help Pulsar evolve from a distributed messaging system into a complete stream-native, real-time data platform.
This release includes several important features:
Pulsar IO: a serverless connector framework built on Pulsar Functions, along with a set of built-in connector implementations
Tiered Storage: offloading of older data from BookKeeper to cheaper storage systems
Stateful Functions: state management for Pulsar Functions
Clients: a Go language client
Schema: support for Avro and Protobuf
Pulsar IO
In the Pulsar 2.0 release, we introduced Pulsar Functions, a lightweight, serverless computing framework. It gives users the simplest possible way to write stream processing logic. Since its release, Pulsar Functions has received an enthusiastic response from the community, and many users have taken to the feature because the learning cost is close to zero: if you can write a Java or Python function, you can write stream processing logic in Pulsar.
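To give a sense of how small such a function can be, here is a minimal sketch. It assumes the Python "native function" form, in which Pulsar Functions calls a module-level function named `process` once per incoming message and publishes the return value. The exclamation-mark logic is only an illustrative choice.

```python
def process(input):
    """Called once per incoming message; the return value becomes the output message."""
    return "{}!".format(input)


if __name__ == "__main__":
    # Local check without a Pulsar cluster: feed a few values through the function.
    for word in ["hello", "pulsar"]:
        print(process(word))
```

Because the function is plain Python with no framework imports, it can be exercised locally like any other function before it is deployed.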
We kept to the same simplicity-first principle while developing Pulsar 2.1. On top of Pulsar Functions we built Pulsar IO, a serverless connector framework that simplifies importing data into Pulsar and exporting data from it. Using a connector requires no code. You prepare a configuration file for the system you want to connect to, then use the management tools provided by Pulsar to submit the connector. Pulsar takes care of the rest, including fault tolerance, load balancing, and automatic scaling as the load changes.
In addition, version 2.1 ships with six built-in connector implementations.
You can refer to the Pulsar 2.1 tutorial to learn how to use the Cassandra connector to export data from Pulsar to Cassandra.
We plan to include more connector implementations in future releases. If you are interested in Pulsar and want to become a code contributor, you are welcome to develop connectors for it. Developing a connector is as straightforward as writing a Pulsar Function for stream processing.
Tiered storage
The biggest advantage of Apache Pulsar over other messaging and streaming systems is its segment-centric storage architecture, built on Apache BookKeeper. Within Pulsar, a topic partition (that is, a stream) is split into segments that are stored in BookKeeper. This means the capacity of a topic partition is not limited by the capacity of a single machine: as long as the cluster as a whole has enough capacity, you can keep appending data to the same topic partition without limit. If the cluster starts to run out of capacity, you simply add a storage node, and Pulsar begins using it automatically, without rebalancing existing partition data. However, if you keep accumulating historical data in BookKeeper, the cost of the whole cluster becomes high.
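The placement behavior described above can be sketched with a toy model. This is a simplified illustration under stated assumptions, not BookKeeper's real placement policy: the `Cluster` class, the least-loaded rule, the per-node capacity, and the node names are all hypothetical, and replication of each segment to several nodes is left out.

```python
class Cluster:
    """Toy model: a partition is a list of segments, each placed on some node."""

    def __init__(self, nodes, node_capacity):
        self.capacity = node_capacity        # max segments a node can hold
        self.load = {n: 0 for n in nodes}    # node name -> segments stored
        self.segments = []                   # segment index -> node name

    def add_node(self, name):
        # A new node starts empty; existing segments are not moved.
        self.load[name] = 0

    def append_segment(self):
        # Place the new segment on the least-loaded node that still has room.
        candidates = [n for n, used in self.load.items() if used < self.capacity]
        if not candidates:
            raise RuntimeError("cluster is full; add a storage node")
        node = min(candidates, key=lambda n: (self.load[n], n))
        self.load[node] += 1
        self.segments.append(node)
        return node


if __name__ == "__main__":
    cluster = Cluster(["bk1", "bk2"], node_capacity=2)
    for _ in range(4):
        cluster.append_segment()
    cluster.add_node("bk3")      # expand once the two original nodes are full
    cluster.append_segment()     # lands on the new node
    print(cluster.segments)
```

The point of the model is that a partition grows by adding segments wherever there is room, so expanding the cluster changes where new segments go and leaves old ones in place.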
Pulsar addresses this tradeoff between capacity and cost with tiered storage. By offloading older segments from BookKeeper to cheaper storage systems (such as AWS S3, Google Cloud Storage (GCS), and HDFS), tiered storage turns Pulsar into storage for truly infinite streams. The process is transparent to end users, who do not need to know whether data sits in BookKeeper or in a cheaper storage system. It also means users can write a single set of code that consumes both the latest streaming data and historical data.
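The idea of offloading with transparent reads can be sketched as follows. This is a toy model under stated assumptions: the `TieredLog` class, its two dictionaries standing in for BookKeeper and a cheaper store, and the `keep_latest` policy are all hypothetical and do not reflect Pulsar's actual offload configuration.

```python
class TieredLog:
    """Toy model of a stream whose older segments move to cheaper storage."""

    def __init__(self):
        self.hot = {}     # segment id -> messages (stands in for BookKeeper)
        self.cold = {}    # segment id -> messages (stands in for S3 or similar)
        self.next_id = 0

    def append_segment(self, messages):
        self.hot[self.next_id] = list(messages)
        self.next_id += 1

    def offload(self, keep_latest):
        """Move all but the newest `keep_latest` segments to cold storage."""
        count = max(len(self.hot) - keep_latest, 0)
        for seg_id in sorted(self.hot)[:count]:
            self.cold[seg_id] = self.hot.pop(seg_id)

    def read_all(self):
        """Yield every message in order; the caller never sees which tier is used."""
        for seg_id in range(self.next_id):
            tier = self.hot if seg_id in self.hot else self.cold
            yield from tier[seg_id]


if __name__ == "__main__":
    log = TieredLog()
    log.append_segment(["a", "b"])
    log.append_segment(["c"])
    log.append_segment(["d", "e"])
    log.offload(keep_latest=1)
    print(sorted(log.cold), sorted(log.hot))
    print(list(log.read_all()))
```

The reader uses the same `read_all` call before and after offloading, which is the sense in which one set of consumer code covers both recent and historical data.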
Currently, version 2.1 supports only S3. In the upcoming 2.2 release, we plan to support more storage systems, such as Google GCS, Azure Blob Storage, and HDFS. If you are interested in tiered storage and would like to add support for other storage systems, your code contributions are welcome.
Stateful functions
One of the most challenging problems in stream processing is state management, and Pulsar Functions faces it too. We built Pulsar Functions to make it simpler for developers to write stream-native processing logic, and we want to simplify state management in the same way. Version 2.1 introduces a State API that lets developers store the state of a computation in the underlying storage system. The State API is deeply integrated with Apache BookKeeper's table service. It currently supports simple key/value operations as well as an increment operation for counters.
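The two kinds of operation mentioned above, key/value access and counter increments, can be sketched with an in-memory stand-in. The `InMemoryContext` class and its method names (`put_state`, `get_state`, `incr_counter`, `get_counter`) are assumptions made for this illustration and are not claimed to match the State API shipped in 2.1. A real function would keep this state in BookKeeper's table service through the context that Pulsar provides.

```python
class InMemoryContext:
    """Stand-in for a function context backed by a key/value state store."""

    def __init__(self):
        self._values = {}
        self._counters = {}

    def put_state(self, key, value):
        self._values[key] = value

    def get_state(self, key):
        return self._values.get(key)

    def incr_counter(self, key, amount):
        self._counters[key] = self._counters.get(key, 0) + amount

    def get_counter(self, key):
        return self._counters.get(key, 0)


def count_words(sentence, context):
    """Stateful function body: bump one counter per word, remember the last input."""
    for word in sentence.split():
        context.incr_counter(word, 1)
    context.put_state("last_sentence", sentence)


if __name__ == "__main__":
    ctx = InMemoryContext()
    for line in ["to be or not to be", "be quick"]:
        count_words(line, ctx)
    print(ctx.get_counter("be"), ctx.get_state("last_sentence"))
```

Keeping the counts in the context rather than in a local variable is what would let them survive a restart or a move of the function to another worker, assuming the context is backed by durable storage.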
Stateful functions are released as a developer preview feature in version 2.1. We want to collect community feedback and refine the API into one that genuinely simplifies state management in stream processing. If you have ideas or suggestions, please contact us through the Pulsar mailing list, GitHub, or Slack.
Schema
In version 2.0, Pulsar introduced native support for schemas. You can define the schema of messages when creating a Pulsar topic, and Pulsar then guarantees the integrity of messages based on the schema you specified. Version 2.0 supported only three built-in schemas: String, bytes, and JSON. Starting with version 2.1, Pulsar also natively supports Avro and Protobuf.
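What "guaranteeing integrity based on a schema" means can be illustrated with a toy check at publish time. The `TypedTopic` class, the `SchemaError` exception, and the field-to-type mapping are hypothetical names invented for this sketch. Pulsar itself enforces schemas through typed producers and consumers and its own schema handling, which this example does not reproduce.

```python
class SchemaError(ValueError):
    """Raised when a message does not match the topic's declared schema."""


class TypedTopic:
    """Toy topic that accepts only messages matching its declared schema."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = dict(schema)    # field name -> required Python type
        self.messages = []

    def publish(self, message):
        if set(message) != set(self.schema):
            raise SchemaError("fields do not match schema: %s" % sorted(message))
        for field, expected in self.schema.items():
            if not isinstance(message[field], expected):
                raise SchemaError("field %r must be %s" % (field, expected.__name__))
        self.messages.append(dict(message))


if __name__ == "__main__":
    topic = TypedTopic("page-views", {"user": str, "count": int})
    topic.publish({"user": "alice", "count": 3})
    try:
        topic.publish({"user": "bob", "count": "3"})   # wrong type for "count"
    except SchemaError as err:
        print("rejected:", err)
```

Rejecting malformed messages when they are produced means every consumer of the topic can rely on the declared structure without re-validating it.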
Schemas will transform Pulsar from a messaging system for unstructured data into a streaming data platform that also supports structured data. In the upcoming 2.2 release, schemas will serve as the cornerstone for querying streaming data with the long-awaited Pulsar SQL.
Client
In release 2.1, we published the official Go client. It is built on top of the native C++ client, so it is ready for production use. In addition to the official client, Comcast has released the native Go client that they use in production.
Conclusion
Apache Pulsar is a next-generation messaging system open-sourced by Yahoo. In early 2017, Yahoo donated Pulsar to Apache for incubation. Over the past year, Apache Pulsar has published six releases, including the 2.0 milestone released in June. Version 2.1 continues Pulsar's simplicity-first principle and truly transforms Pulsar from a distributed messaging system into a complete stream-native data platform. Version 2.2, expected within the next month, will bring more powerful features. We welcome you to follow and join the Pulsar community.
Pulsar 2.1 download link: https://pulsar.incubator.apache.org/en/download/
Pulsar project site: https://pulsar.incubator.apache.org/
Pulsar GitHub repository: https://github.com/apache/incubator-pulsar
Pulsar Slack channel: https://apache-pulsar.herokuapp.com/
Pulsar mailing lists: https://pulsar.incubator.apache.org/contact/
Original link: HTTPS://MP.WEIXIN.QQ.COM/S/KLF1UJFKVISXKM0VJ12FWG
Next-generation messaging system Apache Pulsar 2.1: a major release