1. Apache Kafka
Challenge: ① collecting massive amounts of data; ② analyzing that data.
The data to analyze includes: user behavior data, application performance metrics, activity data in the form of logs, event messages, and so on.
Kafka can process information in real time and quickly route it to multiple consumers. It provides seamless integration of information between producers and consumers without blocking the producers, and a producer does not need to know who its consumers are.
It is an open-source, distributed, partitioned, replicated publish-subscribe messaging system built on a replicated commit log.
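To make the decoupling concrete, here is a toy in-memory sketch (not Kafka itself, and no real Kafka API is used): producers append to a topic's log, and each consumer group reads from its own offset, so producers never know or wait for consumers.

```python
# Toy sketch of publish-subscribe over an append-only log with
# per-consumer-group offsets. Names (Topic, produce, consume) are
# illustrative only, not Kafka's actual API.

class Topic:
    def __init__(self):
        self.log = []      # append-only message log
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, message):
        # The producer only knows the topic, never the consumers.
        self.log.append(message)

    def consume(self, group):
        # Each group keeps an independent position in the log,
        # so slow consumers never block producers or other groups.
        pos = self.offsets.get(group, 0)
        batch = self.log[pos:]
        self.offsets[group] = len(self.log)
        return batch

topic = Topic()
topic.produce("click:home")
topic.produce("click:cart")
print(topic.consume("hadoop-loader"))  # ['click:home', 'click:cart']
topic.produce("click:checkout")
print(topic.consume("hadoop-loader"))  # ['click:checkout']
print(topic.consume("alerting"))       # all three: independent offset
```

Because consumption is just a read at an offset, adding a new consumer group (like "alerting" above) replays the whole log without affecting anyone else.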
① Persistent messaging: ensures messages are not lost, with a disk design that provides constant-time O(1) performance even at terabyte-scale storage. Messages are persisted to disk and replicated within the cluster to prevent data loss;
② High throughput: handles hundreds of MB of reads and writes per second;
③ Distributed: cluster-centric design; messages are partitioned across the Kafka servers (ordering semantics are maintained within each partition) and consumption is distributed over the cluster. Clusters can grow elastically and transparently without downtime;
④ Multi-client: supports simple integration with clients from different platforms (Java, .NET, PHP, Ruby, Python);
⑤ Real-time: messages produced by producer threads are immediately visible to consumer threads (this property is critical for event-based systems, e.g. complex event processing (CEP) systems).
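The per-partition ordering guarantee in ③ follows from routing every message with the same key to the same partition. The sketch below illustrates the idea; it is not Kafka's real partitioner (Kafka's default uses murmur2 hashing), and CRC32 is used here only to keep it dependency-free.

```python
# Illustrative key-based partitioning: messages with the same key map
# to the same partition, so their relative order is preserved there.
import zlib

NUM_PARTITIONS = 4  # hypothetical topic with 4 partitions

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash; Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for i, key in enumerate([b"user-1", b"user-2", b"user-1", b"user-1"]):
    partitions[partition_for(key)].append((key, f"event-{i}"))

# All of user-1's events land in one partition, in production order.
p = partition_for(b"user-1")
print([v for k, v in partitions[p] if k == b"user-1"])
# ['event-0', 'event-2', 'event-3']
```

Note that ordering holds only within a partition; messages for different keys on different partitions may interleave arbitrarily.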
Kafka provides a real-time publish-subscribe solution that also supports parallel data loading into Hadoop.
On the production side, there are different types of producers, e.g.:
① front-end web applications generating application logs;
② producer proxies generating web analytics logs;
③ producer adapters generating transformation logs;
④ producer services generating invocation trace logs.
On the consumption side, e.g.:
① offline consumers that consume messages and store them in Hadoop or a traditional data warehouse for offline analysis;
② near-real-time consumers that consume messages and store them in a NoSQL store (e.g. HBase or Cassandra) for near-real-time analysis;
③ in-memory consumers, such as Spark or Storm, that filter messages in memory and trigger alert events for related groups.
2. Why do we need Kafka?
Data typically includes user activity events, logins, page views, clicks, social-networking activity such as likes, shares, and comments, and system metrics. Because of the high throughput involved (millions of messages per second), such data has traditionally been handled by logging and legacy log-aggregation solutions, which work for offline analysis (e.g. with Hadoop)
but are very limited for building real-time processing systems.
Real-time analysis includes:
① search relevance; recommendations based on popularity, co-occurrence, or sentiment analysis; delivering advertisements to audiences; protection against spam or unauthorized data-scraping crawlers; device sensors sending high-temperature alerts; and detecting unusual user behavior or hacking of the application.
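As one concrete case, the high-temperature alert above amounts to filtering a stream of sensor readings against a threshold. This framework-free sketch (the threshold and sensor names are made up for illustration) shows the kind of in-memory filtering a stream processor such as Storm or Spark would perform on consumed messages:

```python
# Toy in-memory stream filter: emit an alert for every sensor reading
# above a threshold. THRESHOLD_C and the readings are hypothetical.

THRESHOLD_C = 80.0  # hypothetical alert threshold in degrees Celsius

def high_temp_alerts(readings, threshold=THRESHOLD_C):
    """Yield (sensor, temp) pairs for readings above the threshold."""
    for sensor, temp in readings:
        if temp > threshold:
            yield sensor, temp

stream = [("s1", 72.5), ("s2", 91.0), ("s1", 85.3), ("s3", 60.0)]
print(list(high_temp_alerts(stream)))
# [('s2', 91.0), ('s1', 85.3)]
```

In a real deployment the `readings` iterable would be messages consumed from a Kafka topic rather than an in-memory list.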
Using these multiple sets of data collected from production systems in real time is a challenge because of the sheer volume of data being collected and processed.
Kafka's goal is to unify offline and online processing by providing a mechanism for
parallel loading into Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines (which is useful for processing streaming data).
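Partitioned consumption over a set of machines works by dividing a topic's partitions among the consumers of one group. The sketch below shows one simple way to do that (round-robin); it is an illustration, not Kafka's actual assignor implementation.

```python
# Illustrative round-robin assignment of topic partitions to the
# consumers of one group, so each partition is consumed by exactly
# one consumer and load spreads across machines.

def assign_round_robin(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions spread over 3 consumers (names are hypothetical):
print(assign_round_robin(range(6), ["c0", "c1", "c2"]))
# {'c0': [0, 3], 'c1': [1, 4], 'c2': [2, 5]}
```

Since each partition has a single consumer within a group, per-partition ordering is preserved while the group as a whole consumes the topic in parallel.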
From an architectural point of view, it is closer to a traditional messaging system such as ActiveMQ or RabbitMQ.
Reference: Learning Apache Kafka Second Edition