Kafka Getting Started Tutorial (Part 1 of 2) - Linux

Introduction

Kafka is a distributed, partitioned, replicated messaging system. It provides the functionality of a typical messaging system, but with a unique design of its own. What does this design look like?

Let's first look at a few basic messaging system terms:

• Kafka maintains messages in categories called topics.
• Processes that publish messages to a Kafka topic are called producers.
• Processes that subscribe to topics and consume the published messages are called consumers.
• Kafka runs as a cluster of one or more servers, each of which is called a broker.
Producers send messages to the Kafka cluster over the network, and the cluster serves them to consumers, as shown in the following illustration:

Clients and servers communicate over the TCP protocol. Kafka provides a Java client, and clients for many other languages are available.

Topics and logs
Let's take a look at the core abstraction Kafka provides: the topic.
A topic is a category or feed name to which a set of messages is published. For each topic, Kafka maintains a partitioned log, as shown in the following illustration:
Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition is assigned a sequential id number called the offset, which uniquely identifies the message within that partition.
 
The Kafka cluster retains all published messages for a configurable period of time, whether or not they have been consumed. For example, if the retention policy is set to two days, a message can be consumed at any time within two days of being published; after that it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining a large amount of data is not a problem.

In fact, the only piece of metadata each consumer needs to maintain is its position in the log, i.e. the offset. The offset is controlled by the consumer: normally it advances as the consumer reads messages in order, but the consumer can actually read messages in any order it likes; for example, it can reset the offset to an older value to re-read earlier messages.

This combination of features makes Kafka consumers very cheap: they can come and go without much impact on the cluster or on other consumers. You can, for example, use the command-line tools to "tail" the contents of a topic without affecting consumers that are already reading it.

The partitions serve two purposes. First, they keep each log from growing beyond what a single server can hold. Second, each partition can be published to and consumed from independently, which makes concurrent operations on a topic possible.
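As a concrete illustration of the offset control described above, the sketch below rewinds a consumer to the beginning of a partition and re-reads old messages. It is a minimal, hypothetical example: it uses the newer org.apache.kafka.clients consumer API (the kafka-clients artifact) rather than the 0.8 API used later in this article, and the broker address, topic name, and group id are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetRewindSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "rewind-demo");                // hypothetical group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign partition 0 of the topic directly instead of subscribing,
            // so we control the offset ourselves.
            TopicPartition tp = new TopicPartition("test", 0);
            consumer.assign(Collections.singletonList(tp));

            // Rewind to the beginning of the partition and re-read old messages.
            consumer.seek(tp, 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("offset=" + record.offset() + " value=" + record.value());
            }
        }
    }
}

Because the consumer only tracks an offset, rewinding like this has no effect on the broker or on other consumers reading the same partition.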
Distributed
The partitions of the log are replicated across a configurable number of servers in the Kafka cluster, so the servers holding replicas can share the data and the request load. Replication is what gives Kafka its fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all reads and writes for the partition, while the followers replicate the leader. If the leader goes down, one of the followers automatically becomes the new leader. Each server in the cluster plays both roles at once: leader for some of the partitions it holds and follower for others, so the leadership load is well balanced across the cluster.
Producers
A producer publishes messages to the topic of its choice and is responsible for deciding which partition each message goes to. Typically the partition is chosen by a load-balancing mechanism (for example at random), but it can also be chosen by a specific partition function, for example based on a key in the message. The latter is the more common choice.
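The sketch below shows the keyed form of a send using the 0.8 producer API that the rest of this article uses: when a key is supplied, the producer's partitioner uses it to pick the partition, so messages with the same key land in the same partition. The topic name, key, and broker address here are illustrative assumptions.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KeyedSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");            // assumed broker address
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        // Messages with the same key ("user-42") are routed to the same partition,
        // so their relative order is preserved for consumers of that partition.
        producer.send(new KeyedMessage<String, String>("test", "user-42", "first event"));
        producer.send(new KeyedMessage<String, String>("test", "user-42", "second event"));

        producer.close();
    }
}

A custom partitioning strategy can also be plugged in through the producer's partitioner.class configuration property.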

Consumers
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each message goes to only one of them; in publish-subscribe, each message is broadcast to all consumers. Kafka generalizes both with the notion of a consumer group: consumers label themselves with a group name, and each message published to a topic is delivered to one consumer instance within each subscribing group. Consumer instances can live in separate processes or on separate machines.

If all the consumer instances belong to the same group, this behaves like a traditional queue and load is balanced across the consumers. If every consumer instance is in its own group, this behaves like publish-subscribe and every message is broadcast to all consumers. More commonly, each topic has a small number of consumer groups, one per logical "subscriber", and each group consists of several consumer instances for scalability and fault tolerance. This is still publish-subscribe semantics, except that the subscriber is a group of consumers rather than a single process.
As in the illustration: a two-server Kafka cluster hosting four partitions (P0-P3), with two consumer groups; consumer group A has two instances and group B has four.
Kafka also provides stronger ordering guarantees than a traditional messaging system.
A traditional queue holds messages in order on the server, and if several consumers pull from it concurrently the server hands the messages out in storage order; but because delivery to the consumers is asynchronous, the messages may arrive at the consumers out of order, so concurrent consumption scrambles the ordering. Such systems often work around this with the notion of an "exclusive consumer" that is the only one allowed to read from a queue, which of course gives up all parallelism.
Kafka does better here. Through partitioning, Kafka provides both ordering guarantees and load balancing over a pool of consumer processes. Each partition is assigned to exactly one consumer within a consumer group, so that consumer is the only reader of the partition in its group and consumes its messages in order. Because there are many partitions, load is still balanced over many consumer instances. Note, however, that there cannot usefully be more consumer instances in a group than there are partitions: the number of partitions sets the limit on consumption parallelism.
Kafka only guarantees ordering of messages within a partition, not between different partitions of a topic, which is enough for most applications. If you need a total order over all messages in a topic, the topic must have exactly one partition, which also means at most one consumer process per consuming group.

Next, let's set up a Kafka runtime environment.

Step 1: Download Kafka. Download the latest release and un-tar it:
> tar -xzf kafka_2.9.2-0.8.1.1.tgz
> cd kafka_2.9.2-0.8.1.1
Step 2: Start the service
Kafka uses ZooKeeper, so start a ZooKeeper server first. The following starts a quick single-node ZooKeeper instance; you can append an & to the command so that it runs in the background and gives the console back to you.
> bin/zookeeper-server-start.sh config/zookeeper.properties &
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...
Now start the Kafka server:
> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...
Step 3: Create a topic
Create a topic named "test" with a single partition and only one replica:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
You can see the topic with the list command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
test
Alternatively, instead of creating topics by hand, you can configure the brokers to auto-create a topic the first time a message is published to it.

Step 4: Send some messages
Kafka ships with a command-line producer that reads input from a file or from standard input and sends each line as a separate message to the cluster.

Run the producer and type a few messages into the console; they will be sent to the server:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
Press Ctrl+C to stop the producer.
Step 5: Start a consumer
Kafka also has a command-line consumer that reads messages and writes them to standard output:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
If you run the consumer in one terminal and the producer in another, you can type messages into the producer terminal and watch them appear in the consumer terminal.
Both commands take optional parameters; run them without any arguments to see the help information.
Step 6: Set up a multi-broker cluster
So far we have run a single broker; now let's expand to a cluster of three nodes, all still on the local machine.
First make a configuration file for each of the new nodes:
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Now edit these new files and set the following properties:
config/server-1.properties:
  broker.id=1
  port=9093
  log.dir=/tmp/kafka-logs-1
config/server-2.properties:
  broker.id=2
  port=9094
  log.dir=/tmp/kafka-logs-2
The broker.id property is the unique name of each node in the cluster. Because we are running all the brokers on the same machine, they must also be given different ports and log directories so that they do not overwrite one another's data.
We already have ZooKeeper and one broker running, so we only need to start the two new nodes:
> bin/kafka-server-start.sh config/server-1.properties &
...
> bin/kafka-server-start.sh config/server-2.properties &
...
Create a topic with a replication factor of three:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Now that we have a cluster, how do we know what each broker is doing? Run the describe topics command:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
An explanation of the output: the first line is a summary of all the partitions, and each following line describes one partition. Since this topic has only one partition, there is only one additional line.

• Leader: the node responsible for all reads and writes for this partition; each node is the leader for a randomly selected share of the partitions.
• Replicas: the list of nodes that replicate the log for this partition, whether or not they are currently alive or in sync.
• Isr: the set of "in-sync" replicas, i.e., the replicas that are currently alive and caught up with the leader.
In our example, node 1 is the leader for the partition.

Let's publish a few messages to the new topic:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
...
my test message 1
my test message 2
^C
Now consume these messages:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C
Now let's test fault tolerance. Broker 1 is acting as the leader, so let's kill it:
> ps | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java ...
> kill -9 7564
Another node has been elected leader, and node 1 no longer appears in the in-sync replica (Isr) list:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
    Topic: my-replicated-topic  Partition: 0  Leader: 2  Replicas: 1,2,0  Isr: 2,0
Even though the original leader is down, the earlier messages can still be consumed:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C
So Kafka's fault-tolerance mechanism holds up well.
In the previous section we set up a Kafka server and used the command-line tools to create a topic and to send and receive messages.
Now let's set up a Kafka development environment.
Add dependencies
Building a development environment requires the Kafka jars on the classpath. One option is to copy the jars from Kafka's lib directory into your project's classpath, which is simple enough. Here we use the more common approach: letting Maven manage the jar dependencies.
After creating a Maven project, add the following dependency to pom.xml:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.0</version>
</dependency>
After adding the dependency you will find that two of its transitive jar dependencies cannot be resolved. That's fine; download the two jars separately, and after unpacking them you have two options: either use the mvn install command to install them into your local repository, or copy the unpacked folders directly into the com folder of your local Maven repository. For example, my local repository is D:\MVN; after doing this, my directory structure looks like this:

Configure programs

The first piece is an interface that acts as a configuration holder, defining the various Kafka connection parameters:

package com.sohu.kafkademon;

public interface KafkaProperties
{
    final static String zkConnect = "10.22.10.139:2181";
    final static String groupId = "group1";
    final static String topic = "topic1";
    final static String kafkaServerURL = "10.22.10.139";
    final static int kafkaServerPort = 9092;
    final static int kafkaProducerBufferSize = 1024;
    final static int connectionTimeOut = 20000;
    final static int reconnectInterval = 10000;
    final static String topic2 = "topic2";
    final static String topic3 = "topic3";
    final static String clientId = "SimpleConsumerDemoClient";
}

Producer

package com.sohu.kafkademon;

import java.util.Properties;

import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

/**
 * @author leicui bourne_cui@163.com
 */
public class KafkaProducer extends Thread {
    private final kafka.javaapi.producer.Producer<Integer, String> producer;
    private final String topic;
    private final Properties props = new Properties();

    public KafkaProducer(String topic) {
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("metadata.broker.list", "10.22.10.139:9092");
        producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
        this.topic = topic;
    }

    @Override
    public void run() {
        int messageNo = 1;
        while (true) {
            String messageStr = new String("Message_" + messageNo);
            System.out.println("Send:" + messageStr);
            producer.send(new KeyedMessage<Integer, String>(topic, messageStr));
            messageNo++;
            try {
                sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

Consumer

package com.sohu.kafkademon;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

/**
 * @author leicui bourne_cui@163.com
 */
public class KafkaConsumer extends Thread {
    private final ConsumerConnector consumer;
    private final String topic;

    public KafkaConsumer(String topic) {
        consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
        this.topic = topic;
    }

    private static ConsumerConfig createConsumerConfig() {
        Properties props = new Properties();
        props.put("zookeeper.connect", KafkaProperties.zkConnect);
        props.put("group.id", KafkaProperties.groupId);
        props.put("zookeeper.session.timeout.ms", "40000");
        props.put("zookeeper.sync.time.ms", "200");
        props.put("auto.commit.interval.ms", "1000");
        return new ConsumerConfig(props);
    }

    @Override
    public void run() {
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put(topic, new Integer(1));
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
                consumer.createMessageStreams(topicCountMap);
        KafkaStream<byte[], byte[]> stream = consumerMap.get(topic).get(0);
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            System.out.println("receive: " + new String(it.next().message()));
            try {
                sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

Run the following program to do a simple send/receive test:

package com.sohu.kafkademon;

/**
 * @author leicui bourne_cui@163.com
 */
public class KafkaConsumerProducerDemo
{
    public static void main(String[] args)
    {
        KafkaProducer producerThread = new KafkaProducer(KafkaProperties.topic);
        producerThread.start();

        KafkaConsumer consumerThread = new KafkaConsumer(KafkaProperties.topic);
        consumerThread.start();
    }
}

High Level Consumer

The KafkaConsumer class shown above is already built on Kafka's high-level consumer API (ConsumerConnector with createMessageStreams); the same class handles heavier send/receive loads as well.

Don't be afraid of the file system!

Kafka relies heavily on the file system to store and cache messages. The conventional wisdom is that disks are slow, which makes many people doubt that a file-based architecture can deliver competitive performance. In reality, disk speed depends entirely on how the disk is used: a well-designed disk access pattern can be as fast as memory.

On a RAID-5 array of six 7200 rpm SATA disks, linear (sequential) write throughput is roughly 600 MB/s, while random write throughput is only about 100 KB/s, a difference of nearly 6000x. Modern operating systems optimize heavily for this with read-ahead and write-behind techniques: they prefetch data in large blocks on reads and group many small logical writes into a few large physical writes. Deeper studies of this go further and find that sequential disk access can in some cases be faster than random memory access.

To compensate, modern operating systems aggressively use main memory as a disk cache: they will happily use all free memory for the page cache, at only a small cost when that memory must be reclaimed. All disk reads and writes go through this cache, and it cannot easily be bypassed unless direct I/O is used. So even if a process keeps its own in-process cache of some data, the same data will likely also live in the OS page cache, which means everything is effectively stored twice.

Furthermore, for anything built on the JVM, two more facts are well known:

• The memory overhead of Java objects is very high, often doubling (or more) the size of the data being stored.

• Java garbage collection becomes increasingly slow and unpredictable as the amount of data in the heap grows.

Based on the above, caching data in the JVM heap is expensive: the data ends up stored twice (once in the process and once in the page cache), and the JVM object overhead roughly doubles the footprint again, while a large heap also invites GC problems; on a 32 GB machine this approach would need something like 28-30 GB of memory. Moreover, an in-process cache is empty after every restart and must be warmed up (loading 10 GB can take on the order of ten minutes), or else the application starts with a cold cache (populating it only as data is used), which makes initial performance very poor. Relying on the file system's page cache instead means the cache stays warm even when the application restarts, and it also greatly simplifies the logic for keeping cached data consistent with what is on disk.

So, unlike the traditional design of caching data in memory and flushing it to disk later, Kafka writes data straight to a persistent, append-only log on the file system.
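To make the sequential-versus-random distinction above concrete, here is a small illustrative micro-benchmark sketch (not Kafka code; file names, record size, and record count are arbitrary assumptions). It appends records sequentially to one file, the way an append-only log does, and scatters the same number of writes randomly across another; how large the measured gap is depends on the device and on how much the OS page cache can reorder and batch the random writes (the write-behind optimization described above).

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class SequentialVsRandomWriteSketch {
    static final int RECORD_SIZE = 4096;
    static final int RECORD_COUNT = 10_000;

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[RECORD_SIZE];

        // Sequential append: every write goes to the current end of the file.
        long t0 = System.nanoTime();
        try (FileChannel log = FileChannel.open(Paths.get("sequential.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < RECORD_COUNT; i++) {
                log.write(ByteBuffer.wrap(record));
            }
            log.force(true); // flush to the device so we measure real disk writes
        }
        long sequentialMs = (System.nanoTime() - t0) / 1_000_000;

        // Random writes: seek to an arbitrary position before each write.
        Random rnd = new Random(42);
        long t1 = System.nanoTime();
        try (RandomAccessFile file = new RandomAccessFile("random.log", "rw")) {
            file.setLength((long) RECORD_SIZE * RECORD_COUNT);
            for (int i = 0; i < RECORD_COUNT; i++) {
                file.seek((long) rnd.nextInt(RECORD_COUNT) * RECORD_SIZE);
                file.write(record);
            }
            file.getFD().sync(); // flush to the device
        }
        long randomMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("sequential append: " + sequentialMs + " ms, random writes: " + randomMs + " ms");
    }
}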

