I can finally get around to writing a Kafka article. I have indexed my Mina-related articles and pinned them to the top of my blog, so you can find them easily. This article introduces the distributed messaging system Kafka.
When we work with large numbers of distributed databases and distributed computing clusters, we often run into problems like these:
- I want to analyze user behavior (pageviews) so that I can design better ad placements.
- I want to collect statistics on users' search keywords and analyze current trends. (This is quite interesting: in economics there is a "hemline theory" about long skirts, which says that when long skirts sell well, the economy is sluggish, because girls can no longer afford all kinds of stockings.)
- Some data, I feel, would be a waste to store in a database, yet I worry that efficiency will suffer if I write it straight to disk.
In situations like these, we can use a distributed messaging system. Although the descriptions above lean toward a log system, in practice Kafka is indeed widely used to build log systems.
First, we need to understand what a messaging system is. On its official website, Kafka is defined as a distributed publish-subscribe messaging system: messages are published by one side and subscribed to by the other. The publish-subscribe concept matters, because Kafka's design philosophy flows from it.
We call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the storage array sitting in between the broker. With these three roles we can roughly describe the following scene:
Producers (drawn in blue; blue-collar workers always work the hardest) produce data and store it in brokers; consumers, needing to consume, pull data out of the brokers and then run it through a series of processing steps.
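To make the three roles concrete, here is a minimal sketch of a producer using the modern Kafka Java client (an API that appeared after this article was written); the topic name `page-views`, the record contents, and the broker address `localhost:9092` are placeholders of mine:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // address of a broker; "localhost:9092" is a placeholder
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // the producer publishes; the broker stores
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
        }
    }
}
```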
At first glance this seems too simple. Wasn't this supposed to be distributed? Surely putting the producer, broker, and consumer on three different machines does not count as distributed. Let's look at Kafka's official diagram:
Multiple brokers work together; producers and consumers are embedded in all kinds of business logic and are called frequently; and all three parties use ZooKeeper to manage and coordinate requests and forwarding. That completes a high-performance distributed message publish-subscribe system. Note that the producer-to-broker path is push, i.e. data is pushed to the broker, while the consumer-to-broker path is pull: the consumer actively fetches data, rather than the broker actively sending data to the consumer.
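You can see the pull model directly in the consumer API: the consumer sits in a loop and calls poll(), fetching whatever the broker has accumulated. Here is a minimal sketch, again with the modern Java client and placeholder names (one difference from this article's era: modern clients are handed broker addresses directly via bootstrap.servers instead of discovering them through ZooKeeper):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "page-view-analyzers");     // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // the consumer PULLS: it asks the broker for data;
                // the broker never pushes data to it
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```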
Where does the high performance of such a system show up? Let's look at the description on the official website:
- Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
As for why the disk structures are O(1) and where the high throughput comes from, we will get to that in later articles; today we are mainly concerned with Kafka's design philosophy. Having seen the performance claims, let's look at what Kafka can do. Beyond the scenarios I mentioned at the beginning, here is where Kafka is actually running in production:
- LinkedIn - Apache Kafka is used at LinkedIn for activity stream data and operational metrics. This powers various products like LinkedIn Newsfeed and LinkedIn Today, in addition to our offline analytics systems like Hadoop.
- Tumblr - http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html
- Mate1.com Inc. - Apache Kafka is used at Mate1 as our main event bus that powers our news and activity feeds, automated review systems, and will soon power real-time notifications and log distribution.
- Tagged - Apache Kafka drives our new pub/sub system, which delivers real-time events for users in our latest game, Deckadence. It will soon be used in a host of new use cases, including group chat, back-end stats, and log collection.
- Boundary - Apache Kafka aggregates high-flow message streams into a unified distributed pub/sub service, brokering the data for other internal systems as part of Boundary's real-time network analytics infrastructure.
- DataSift - Apache Kafka is used at DataSift as a collector of monitoring events and to track users' consumption of data streams in real time. http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
- Wooga - We use Kafka to aggregate and process tracking data from all our Facebook games (which are hosted at various providers) in a central location.
- AddThis - Apache Kafka is used at AddThis to collect events generated by our data network and broker that data to our analytics clusters and real-time web analytics platform.
- Urban Airship - At Urban Airship we use Kafka to buffer incoming data points from mobile devices for processing by our analytics infrastructure.
- Metamarkets - We use Kafka to collect real-time event data from clients, as well as our own internal service metrics, that feed our interactive analytics dashboards.
- SocialTwist - We use Kafka internally as part of our reliable email queueing system.
- Countandra - We use a hierarchical distributed counting engine that uses Kafka as a primary speedy interface, as well as for routing events for cascading counting.
- Flyhajj.com - We use Kafka to collect all metrics and events generated by the users of the website.
By this point you should have a feel for what Kafka is, what its basic structure looks like, and what it can do. Now let's come back to the relationships among producer, consumer, broker, and ZooKeeper.
Take the figure above, but reduce the number of brokers to just one. Now assume the deployment is as follows:
- Server-1 is the broker, i.e. the Kafka server itself, since both the producer and the consumer connect to it. The broker is mainly responsible for storage.
- Server-2 is the ZooKeeper server. You can read up on ZooKeeper's exact role on its official website; for now, picture it as maintaining a table that records the IP address and port of every node (more on that later; it also holds Kafka-related metadata).
- Servers 3, 4, and 5 all run a zkClient; more precisely, they must be configured with the ZooKeeper address before they can run. The principle is simple: every connection among them is brokered and dispatched through ZooKeeper (a configuration sketch follows this list).
- As for the relationship between Server-1 and Server-2: they can sit on the same machine or run on separate ones, and ZooKeeper itself can be deployed as a cluster, the point being to avoid a single point of failure.
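To make this concrete, here is a minimal sketch of how Server-1's broker configuration might point at Server-2. The hostnames are placeholders, and the exact property names vary between Kafka versions (newer brokers use `listeners` instead of `port`, and the newest can drop ZooKeeper entirely):

```properties
# config/server.properties on Server-1 (the broker) -- a sketch, hostnames are placeholders
broker.id=0
# where producers and consumers connect
port=9092
# the ZooKeeper instance on Server-2 that coordinates the cluster
zookeeper.connect=server-2:2181
```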
Put simply, the whole system runs in the following order (the corresponding startup commands are sketched after the list):
1. Start the ZooKeeper server.
2. Start the Kafka server (the broker).
3. When a producer has data, it first locates the broker through ZooKeeper and then stores the data in that broker.
4. When a consumer needs data, it likewise finds the corresponding broker through ZooKeeper and then consumes from it.
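For reference, steps 1 and 2 map onto the scripts that ship with the Kafka distribution; this assumes you are in the Kafka installation directory and are using the bundled single-node ZooKeeper:

```sh
# step 1: start the bundled single-node ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# step 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties
```

Steps 3 and 4 then correspond to running producer and consumer code such as the sketches shown earlier.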
This concludes our first look at Kafka. Next I will write about how to set up a Kafka environment. Finally, I would like to thank @rockybean for his guidance and help.
From: http://my.oschina.net/ielts0909/blog/92972