This lesson is divided into two parts:
The first part explains Kafka concepts, architecture, and use-case scenarios;
The second part covers Kafka installation and hands-on practice.
Due to time constraints, today's lesson only goes as far as following the official website's quickstart to verify that the Kafka installation succeeded. A follow-up lesson will explain how to integrate Spark Streaming with Kafka.
First, Kafka concepts, architecture, and use-case scenarios
http://kafka.apache.org/documentation.html#introduction
1. Kafka Concepts
Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a fast, scalable commit log service that is distributed, partitioned, and replicated by design.
What is a messaging component?
Take a guy chatting with a girl as an example: how do the two of them communicate? What usually comes to mind is QQ, telephone, e-mail, and other communication media. These media are the messaging component: the guy sends chat messages to the messaging component, and the component pushes the messages to the girl. This is the classic producer-consumer model. Moreover, the content can be categorized when a message is sent, which is the so-called topic. Kafka is exactly this kind of communication component: it binds different object components together and transmits data between them in a decoupled manner.
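To make the producer side concrete, here is a minimal sketch using the Java producer API shipped with Kafka 0.9.x (the version installed later in this lesson). The broker addresses and the topic name "chat" are assumptions for illustration only:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker list; replace with your own cluster's addresses.
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        // "chat" is a hypothetical topic; messages of the same category go to the same topic.
        producer.send(new ProducerRecord<String, String>("chat", "hello from the producer"));
        producer.close();
    }
}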
Compared with traditional messaging systems, Apache Kafka has the following distinguishing characteristics:
- It is a distributed system that is easy to scale out;
- It provides high throughput for both publishing and subscribing while keeping latency low;
- It persists messages to disk, so content from a day or even a week ago can still be processed.
2. Kafka Architecture
Since Kafka provides the basic functions of a messaging system, it necessarily contains the components that make up such a system: topic, producer, and consumer. Kafka also has its own special component, the Kafka cluster.
-Topic (topic): represents a category or type of data. Work, entertainment, and daily life each get their own topic. The producer indicates which topic its data belongs to and pushes the data to Kafka, while the consumer obtains data through a pull process. A topic is a stream of message data of the same type. In Kafka these messages are partitioned, and each partition keeps multiple replicas to prevent data loss. Messages within each partition are written sequentially and cannot be rewritten.
-Producer (producer): any object that pushes data to the Kafka system.
-Kafka Cluster (Kafka cluster): a group of servers that store the messages pushed to the Kafka system; each server is also called a broker. Because the Kafka cluster uses ZooKeeper as its underlying coordination framework, all message read and write requests are handled by one elected server, the leader, while the other servers act as followers that synchronously back up the leader's data for disaster recovery.
-Consumer (consumer): any object that subscribes to messages from the Kafka system.
There can be many consumers, and several consumers can form a consumer group. Different consumer groups stand in a broadcast relationship to the messages, so every group can pull the same message data. Within a consumer group, messages are consumed in first-in, first-out order, and each message can be consumed by only one consumer of that group, as the sketch below illustrates.
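Here is a minimal consumer sketch using the new Java consumer API introduced in Kafka 0.9. The group name and topic are assumptions: two copies of this program started with the same group.id would split the messages between them, while copies started under different group.ids would each receive every message:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092"); // assumed brokers
        props.put("group.id", "chat-group"); // hypothetical group; one delivery per message per group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        consumer.subscribe(Arrays.asList("chat")); // hypothetical topic
        while (true) {
            // Pull messages; within a group each message reaches exactly one consumer.
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("offset=" + record.offset() + " value=" + record.value());
            }
        }
    }
}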
Data transfer takes place at the kernel level (the transfer approaches zero-copy speed), with no user space involved. Linux itself is software: the first process started at boot is init, and after startup the system enters user space. For example, in a distributed system, an application on machine A needs to read data from a Java service on machine B. Since the JVM of the Java program runs at user level and the data sits on disk, the read first enters kernel space on machine B, then user space on machine B; after the data is read in user space, it is sent back down through B's kernel space out onto the network. Machine A's network card receives the transmitted data, writes it into A's kernel space, and finally hands it over to A's user space for processing.
In other words, when an external system reads data from a Java program, the data passes through kernel space and relies on the NIC to write it to the network for transmission. In fact, Java itself is just a thin coat over the kernel: with Java socket programming, all data manipulation happens in JVM user space. Kafka, by contrast, operates on data in kernel space, and kernel-space processing is typically far faster than user-space processing, which is how Kafka achieves high-speed reads and writes.
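The zero-copy technique described above is exposed in Java as FileChannel.transferTo, which lets the kernel move file bytes straight to a socket without copying them through user space. Below is a minimal sketch; the file name, host, and port are assumptions for illustration:

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Assumed log file and peer address, for illustration only.
        FileChannel file = new FileInputStream("/tmp/data.log").getChannel();
        SocketChannel socket = SocketChannel.open(new InetSocketAddress("master", 9999));
        // transferTo asks the kernel to copy file bytes directly to the socket,
        // skipping the kernel-space -> user-space -> kernel-space round trip.
        long sent = file.transferTo(0, file.size(), socket);
        System.out.println("sent " + sent + " bytes via zero-copy");
        socket.close();
        file.close();
    }
}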
3. Kafka Use-Case Scenarios
Like the familiar messaging components we use every day, such as mobile phones and mailboxes, Kafka can also:
-Support text/images
-Store content
-Categorize content
From the point of view of content consumption, Kafka treats the mail in a mailbox as a topic.
Second, Kafka installation and hands-on practice
http://kafka.apache.org/documentation.html#quickstart
1. Installing and Configuring Zookeeper
Kafka cluster mode requires ZooKeeper to be installed in advance.
-Tip: Kafka standalone mode does not require a separate ZooKeeper installation; you can use the built-in ZooKeeper.
-Kafka cluster mode requires at least 3 servers. This lesson uses servers with the hostnames master, slave1, and slave2.
-The ZooKeeper version used in this lesson is zookeeper-3.4.6.
1) Download Zookeeper
Go to http://www.apache.org/dyn/closer.cgi/zookeeper/, where you can choose a mirror URL for the download; the mirror recommended by the official website is http://mirror.bit.edu.cn/apache/zookeeper/. Tip: you can also download the ZooKeeper installation files directly from the group.
Download zookeeper-3.4.6.tar.gz
2) Install Zookeeper
Tip: The following steps occur on the master server.
Taking Ubuntu 14.04 as an example, put the downloaded file into the /root directory and unpack it with the following commands:
cd /root
tar -zxvf zookeeper-3.4.6.tar.gz
After unpacking, a new zookeeper-3.4.6 directory appears under /root. Move it to the directory where ZooKeeper is to be installed with the following commands:
cd /root
mv zookeeper-3.4.6 /usr/local/spark
After that, the zookeeper-3.4.6 directory lives under /usr/local/spark. Next we explain how to configure the installed ZooKeeper.
3) Configure Zookeeper
Tip: The following steps occur on the master server.
- Configure .bashrc
-Open the file: vi /root/.bashrc
-Add the following before the PATH configuration line:
export ZOOKEEPER_HOME=/usr/local/spark/zookeeper-3.4.6
-Then modify PATH:
export PATH=${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${HIVE_HOME}/bin:${KAFKA_HOME}/bin:$PATH
-Make the configured environment variables take effect immediately: source /root/.bashrc
- Create the data directory
-cd $ZOOKEEPER_HOME
-mkdir data
- Create and open the zoo.cfg file
-cd $ZOOKEEPER_HOME/conf
-cp zoo_sample.cfg zoo.cfg
-vi zoo.cfg
- Configure zoo.cfg
# Configure the directory that stores ZooKeeper data such as logs and the server ID number.
# Do not use the default /tmp/zookeeper directory, because data under /tmp is easily deleted by accident.
dataDir=../data
# The port on which ZooKeeper listens for client connections
clientPort=2181
# At the end of the file, add 3 lines to configure 2 important ports for each server:
# the leader port and the election port.
# server.A=B:C:D, where A is a number indicating which server this is;
# B is the hostname or IP address of this server;
# C is the port this server uses to exchange information with the leader of the cluster;
# D is the port the servers use to communicate with each other when electing a new
# leader in case the current leader goes down.
# In a pseudo-cluster configuration B is the same for every instance, so the different
# ZooKeeper instances must be assigned different port numbers.
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
- Create and open the myid file
-cd $ZOOKEEPER_HOME/data
-touch myid
-vi myid
- Configure myid
According to the zoo.cfg configuration, the content of myid on master is 1.
4) Synchronize the installation and configuration of Master to Slave1 and Slave2
-Run the following commands on the master server
cd /root
scp ./.bashrc root@slave1:/root
scp ./.bashrc root@slave2:/root
cd /usr/local/spark
scp -r ./zookeeper-3.4.6 root@slave1:/usr/local/spark
scp -r ./zookeeper-3.4.6 root@slave2:/usr/local/spark
-Run the following command on the slave1 server
vi $ZOOKEEPER_HOME/data/myid
According to the zoo.cfg configuration, the content of myid on slave1 is 2.
-Run the following command on the slave2 server
vi $ZOOKEEPER_HOME/data/myid
According to the zoo.cfg configuration, the content of myid on slave2 is 3.
5) Start the Zookeeper service
-Run the following command on the master server
zkServer.sh start
-Run the following commands on the slave1 server
source /root/.bashrc
zkServer.sh start
-Run the following commands on the slave2 server
source /root/.bashrc
zkServer.sh start
6) Verify that ZooKeeper is installed and started successfully
-Run the commands jps and zkServer.sh status on the master server:
root@master:/usr/local/spark/zookeeper-3.4.6/bin# jps
3844 QuorumPeerMain
4790 Jps
root@master:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status
JMX enabled by default
Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower
-Run the commands jps and zkServer.sh status on the slave1 server:
root@slave1:/usr/local/spark/zookeeper-3.4.6/bin# jps
3462 QuorumPeerMain
4313 Jps
root@slave1:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status
JMX enabled by default
Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower
-Run the commands jps and zkServer.sh status on the slave2 server:
root@slave2:/usr/local/spark/zookeeper-3.4.6/bin# jps
4073 Jps
3277 QuorumPeerMain
root@slave2:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status
JMX enabled by default
Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
At this point, ZooKeeper has been installed, configured, and started successfully.
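As an extra programmatic check, a short sketch with the standard ZooKeeper Java client can confirm that the ensemble answers requests. The connection string matches the cluster configured above; everything else is illustrative:

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the three-node ensemble with a 5-second session timeout.
        ZooKeeper zk = new ZooKeeper("master:2181,slave1:2181,slave2:2181", 5000,
                new Watcher() {
                    public void process(WatchedEvent event) { /* ignore events in this sketch */ }
                });
        // A healthy ensemble returns the root znodes, e.g. [zookeeper].
        List<String> children = zk.getChildren("/", false);
        System.out.println("root znodes: " + children);
        zk.close();
    }
}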
2. Installing and Configuring Kafka
The Kafka version used in this lesson is kafka_2.10-0.9.0.1 (Kafka 0.9.0.1 built against Scala 2.10).
1) Download Kafka
Go to http://kafka.apache.org/downloads.html and click kafka_2.10-0.9.0.1.tgz. Tip: you can also download the Kafka installation files directly from the group.
Download kafka_2.10-0.9.0.1.tgz
2) Install Kafka
Tip: The following steps occur on the master server.
Taking Ubuntu 14.04 as an example, put the downloaded file into the /root directory and unpack it with the following commands:
cd /root
tar -zxvf kafka_2.10-0.9.0.1.tgz
After unpacking, a new kafka_2.10-0.9.0.1 directory appears under /root. Move it to the directory where Kafka is to be installed with the following commands:
cd /root
mv kafka_2.10-0.9.0.1 /usr/local
After that, the kafka_2.10-0.9.0.1 directory lives under /usr/local. Next we explain how to configure the installed Kafka.
3) Configure Kafka
Tip: The following steps occur on the master server.
- Configure .bashrc
-Open the file: vi /root/.bashrc
-Add the following before the PATH configuration line:
export KAFKA_HOME=/usr/local/kafka_2.10-0.9.0.1
-Then modify PATH:
export PATH=${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${HIVE_HOME}/bin:${KAFKA_HOME}/bin:$PATH
-Make the configured environment variables take effect immediately: source /root/.bashrc
- Open server.properties
-cd $KAFKA_HOME/config
-vi server.properties
- Configure server.properties
# Unique ID of this broker within the cluster; master uses 0 here.
broker.id=0
# Port on which the broker listens
port=9092
# ZooKeeper connection string for the cluster
zookeeper.connect=master:2181,slave1:2181,slave2:2181
4) Synchronize the installation and configuration of Master to Slave1 and Slave2
-Run the following commands on the master server
cd /root
scp ./.bashrc root@slave1:/root
scp ./.bashrc root@slave2:/root
cd /usr/local
scp -r ./kafka_2.10-0.9.0.1 root@slave1:/usr/local
scp -r ./kafka_2.10-0.9.0.1 root@slave2:/usr/local
-Run the following command on the slave1 server
vi $KAFKA_HOME/config/server.properties
Modify it so that broker.id=1.
-Run the following command on the slave2 server
vi $KAFKA_HOME/config/server.properties
Modify it so that broker.id=2.
5) Start the Kafka service
-Run the following commands on the master server
cd $KAFKA_HOME/bin
kafka-server-start.sh ../config/server.properties &
-Run the following commands on the slave1 server
source /root/.bashrc
cd $KAFKA_HOME/bin
kafka-server-start.sh ../config/server.properties &
-Run the following commands on the slave2 server
source /root/.bashrc
cd $KAFKA_HOME/bin
kafka-server-start.sh ../config/server.properties &
6) Verify that Kafka is installed and started successfully
-Run the following command on any server to create the topic "HelloKafka":
kafka-topics.sh --create --zookeeper master:2181,slave1:2181,slave2:2181 --replication-factor 3 --partitions 1 --topic HelloKafka
-Run the following command on any server to produce some messages for the created topic "HelloKafka":
kafka-console-producer.sh --broker-list master:9092,slave1:9092,slave2:9092 --topic HelloKafka
Enter the following message content:
This is dt_spark!
I'm rocky!
Life is short, you need Spark!
-Run the following command on any server to consume (pull) the messages of the topic "HelloKafka":
kafka-console-consumer.sh --zookeeper master:2181,slave1:2181,slave2:2181 --from-beginning --topic HelloKafka
After a while, you will see the message content printed:
This is dt_spark!
I'm rocky!
Life is short, you need Spark!
-Run the following command on any server to list all topic names:
kafka-topics.sh --list --zookeeper master:2181,slave1:2181,slave2:2181
-Run the following command on any server to view an overview of a specific topic:
kafka-topics.sh --describe --zookeeper master:2181,slave1:2181,slave2:2181 --topic HelloKafka
At this point, Kafka has been installed, configured, and started successfully.
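As a further programmatic check, each broker registers an ephemeral znode under /brokers/ids in ZooKeeper when it starts, so a short sketch with the ZooKeeper Java client (same assumptions as the earlier ZkCheck sketch) can confirm that all three brokers joined the cluster:

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BrokerCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("master:2181,slave1:2181,slave2:2181", 5000,
                new Watcher() {
                    public void process(WatchedEvent event) { /* ignore events in this sketch */ }
                });
        // Each live broker registers its broker.id here; expect [0, 1, 2] for this cluster.
        List<String> ids = zk.getChildren("/brokers/ids", false);
        System.out.println("registered brokers: " + ids);
        zk.close();
    }
}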
Summary:
With Spark Streaming you can handle a variety of data source types, such as databases, HDFS, server logs, and network streams. It is more powerful than you might imagine, yet it is often left unused, and the real reason is that people do not understand Spark and Spark Streaming themselves.
Written by: IMF Spark Streaming Enterprise Development Combat Team (Zhou Fei, Jiang Wei, etc.)
Main editor: Liaoliang
Note:
Material from: DT Big Data DreamWorks (IMF Legendary Action secret course)
For more exclusive content, please follow the WeChat public account: DT_Spark
If you are interested in big data and Spark, you can listen free of charge to teacher Liaoliang's permanently free public Spark class, held every night at 20:00 in YY room number 68917580.
Life is short, you need Spark!
Lesson 89: Spark Streaming on Kafka analysis and installation in action