1. What is Kafka
Kafka is a distributed publish/subscribe messaging system developed by LinkedIn. It is written in Scala and is widely used because it scales horizontally and offers high throughput.
2. Background
Kafka is the messaging system that underpins LinkedIn's activity stream and operational data processing pipeline. Activity stream data is the most common kind of data that almost every site uses to report on its usage. It includes page views, information about the content being viewed, and search queries. This data is typically handled by writing each activity to a log file and then periodically running statistical analysis over those files. Operational data is server performance data (CPU and I/O usage, request times, service logs, and so on), and it is analyzed with a wide variety of statistical methods.
3. Basic Architecture Diagram
4. Basic Concept Explanation
1) Broker
A Kafka cluster contains one or more servers, which are called brokers. The broker does not track the consumption state of the data, which improves performance. Messages are stored directly on disk and are read and written sequentially, which is fast: this avoids copying data between JVM memory and system memory, and it reduces the overhead of object creation and garbage collection.
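The idea that the broker keeps no consumption state can be sketched with a toy append-only log. This is only an in-memory illustration, not Kafka's implementation; the class names `PartitionLog` and `PullConsumer` are hypothetical and chosen for this example.

```python
class PartitionLog:
    """Toy append-only log standing in for broker-side storage (hypothetical)."""

    def __init__(self):
        self._messages = []  # messages are only ever appended, never rewritten

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset.

        The log keeps no record of who has read what.
        """
        return self._messages[offset:offset + max_messages]


class PullConsumer:
    """Toy consumer that remembers its own position in the log (hypothetical)."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # consumption state lives with the consumer

    def poll(self, max_messages=10):
        """Pull the next batch from the log and advance the local offset."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for text in ["a", "b", "c"]:
    log.append(text)

fast = PullConsumer(log)
slow = PullConsumer(log)
first_fast = fast.poll(max_messages=3)  # reads everything at once
first_slow = slow.poll(max_messages=1)  # reads at its own pace
```

Because each consumer carries its own offset, two consumers can read the same log at different speeds without the log doing any bookkeeping for them.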
2) Producer
Responsible for publishing messages to the Kafka broker.
3) Consumer
The message consumer is the client that reads messages from the Kafka broker. The consumer pulls data from the broker and processes it.
4) Topic
Every message published to the Kafka cluster has a category, which is called a topic. (Physically, messages of different topics are stored separately. Logically, the messages of a topic are saved on one or more brokers, but the user only needs to specify the topic to produce or consume data, without worrying about where the data is stored.)
5) Partition
A partition is a physical concept; each topic contains one or more partitions.
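The text does not say how a message ends up in a particular partition. One common approach, assumed here, is to hash the message key and take the result modulo the partition count, so that messages with the same key always land in the same partition. The function name `choose_partition` is hypothetical, and `zlib.crc32` is only a stand-in: real Kafka clients use their own hash function.

```python
import zlib
from collections import Counter


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (illustrative only).

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash(), which is salted per process for bytes and str.
    """
    return zlib.crc32(key) % num_partitions


# Messages keyed by user id, spread over a topic with 13 partitions.
keys = [f"user-{i}".encode("utf-8") for i in range(1000)]
spread = Counter(choose_partition(key, 13) for key in keys)
```

Here `spread` counts how many of the sample keys fall into each partition, which gives a rough view of how evenly the keys are distributed.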
6) Consumer Group
Each consumer belongs to a specific consumer group. A group name can be specified for each consumer; a consumer that does not specify one belongs to the default group.
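In Kafka, the partitions of a topic are divided among the consumers of a group, so that each partition is read by only one consumer in that group. A minimal sketch of such a division, assuming a simple round-robin rule, might look like the following. The function `assign_partitions` and the consumer names are hypothetical, and Kafka's actual assignment strategies are configurable and more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across the consumers of one group, round-robin.

    Illustrative only: every partition is given to exactly one consumer.
    """
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for i, partition in enumerate(sorted(partitions)):
        owner = ordered[i % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# A topic with 4 partitions read by a group of 2 consumers.
group = assign_partitions(range(4), ["c1", "c2"])
print(group)  # {'c1': [0, 2], 'c2': [1, 3]}
```

If the group has more consumers than the topic has partitions, the extra consumers in this sketch simply receive an empty list.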
7) Topic & Partition
A topic can logically be thought of as a queue. Every message must specify its topic, which can be understood simply as indicating which queue the message goes into. So that Kafka's throughput can scale linearly, a topic is physically divided into one or more partitions. Each partition corresponds physically to a folder, which stores all of the messages and index files of that partition. If you create two topics, Topic1 and Topic2, with 13 and 19 partitions respectively, a total of 32 folders will be generated across the entire cluster (this article uses a cluster of 8 nodes, and the replication-factor of both Topic1 and Topic2 is 1).
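The folder count in the example above can be checked with a short sketch. It assumes the usual Kafka naming convention of one folder per partition replica, named `<topic>-<partition>`. The helper `partition_folders` is hypothetical and does not model which of the 8 nodes each folder lands on.

```python
def partition_folders(topics, replication_factor=1):
    """List one folder name per partition replica, as '<topic>-<partition>'.

    With replication_factor > 1 each replica lives on a different broker,
    so the same folder name appears once per replica in this list.
    """
    folders = []
    for topic, num_partitions in topics.items():
        for partition in range(num_partitions):
            for _ in range(replication_factor):
                folders.append(f"{topic}-{partition}")
    return folders


folders = partition_folders({"Topic1": 13, "Topic2": 19})
print(len(folders))  # 32
print(folders[0], folders[-1])  # Topic1-0 Topic2-18
```

With a replication-factor of 1 the total is simply 13 + 19 = 32; a higher replication-factor would multiply the number of folders in the cluster accordingly.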
5. Applicable Scenarios
1. Messaging
Kafka is a good choice for conventional messaging workloads: partitions, replication, and fault tolerance give it good scalability and performance. So far, however, we should be aware that Kafka does not provide enterprise-class features such as "transactions", "message delivery guarantees (a message acknowledgement mechanism)", and "message grouping". Kafka can only be used as a "regular" messaging system; to some extent it does not ensure that sending and receiving messages is absolutely reliable (for example, messages may be resent or lost).
2. Website Activity Tracking
Kafka is a good fit for site activity tracking: information such as page views and user actions can be sent to Kafka and then used for real-time monitoring, offline statistical analysis, and so on.
3. Metrics
Kafka is often used for operational monitoring data. This includes aggregating statistics from distributed applications to produce centralized feeds of operational data.
4. Log Aggregation
Kafka's characteristics make it well suited to serve as a "log collection center". Applications can send their operation logs to the Kafka cluster in bulk and asynchronously instead of storing them locally or in a database. Kafka can submit messages in batches and compress them, so the producer side feels almost no performance cost. Consumers can then feed the logs into other storage and analysis systems such as Hadoop.
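The benefit of batching and compressing log messages can be illustrated with the standard library alone. This is a sketch of the general idea, not of Kafka's wire format: the helpers `build_batch` and `read_batch` are hypothetical, the log lines are made-up sample data, and JSON plus gzip are assumptions chosen for simplicity.

```python
import gzip
import json


def build_batch(log_lines):
    """Pack many log lines into one compressed payload (illustrative)."""
    payload = json.dumps(log_lines).encode("utf-8")
    return gzip.compress(payload)


def read_batch(batch):
    """Reverse build_batch: decompress and decode the log lines."""
    return json.loads(gzip.decompress(batch).decode("utf-8"))


# Made-up sample log lines standing in for an application's operation log.
lines = [f"INFO request {i} handled" for i in range(100)]

batch = build_batch(lines)
raw_size = len(json.dumps(lines).encode("utf-8"))
print(raw_size, len(batch))  # uncompressed size vs. compressed batch size
```

Sending one compressed batch instead of 100 separate messages means fewer network round trips and fewer bytes on the wire, which is why the application barely notices the cost of logging this way.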