Lesson 89: Spark Streaming on Kafka, Analysis and Hands-On Installation

This lesson is divided into two parts:

The first part explains Kafka's concepts, architecture, and use-case scenarios.

The second part walks through installing Kafka and putting it into practice.

Due to time constraints, today's lesson only covers how to follow the official website's quick start to verify that the Kafka installation succeeded. A follow-up lesson will explain how to integrate Spark Streaming with Kafka.

Part One: Kafka Concepts, Architecture, and Use-Case Scenarios

http://kafka.apache.org/documentation.html#introduction

1. The Kafka Concept

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable commit-log service that is distributed, partitioned, and replicated by design.

What is a messaging component?

Take a chat between two people as an example: how do they communicate? The usual answers are QQ, telephone, email, and other media, and these media are messaging components. The sender pushes a chat message to the messaging component, and the component pushes the message on to the recipient; this is the often-mentioned producer-consumer model. The content can also be categorized when a message is sent, which is the notion of a topic. Kafka is exactly this kind of communication component: it binds different components together and transmits data between them in a decoupled manner.

Compared with traditional messaging systems, Apache Kafka has the following distinguishing characteristics:

    • It is a distributed system and is easy to scale out;
    • It serves online consumers with low latency while providing high throughput for publish and subscribe;
    • It persists messages to disk, so content from a day or even a week ago can still be processed.

2. The Kafka Architecture

Since Kafka provides the basic functions of a messaging system, it necessarily has the components that make up such a system:

topics, producers, and consumers. Kafka also has its own distinctive component: the Kafka cluster.

Topic:

A topic represents a category or type of data; work, entertainment, and life could each be a different topic. Producers declare which topic their data belongs to when they push it to Kafka, and consumers obtain data through a pull. A topic is a stream of message data of the same type. In Kafka these messages are partitioned, and each partition has multiple replicas to prevent data loss. Messages in each partition are written sequentially and cannot be rewritten.

- Producer: any object that pushes data into the Kafka system.

- Kafka cluster: a group of servers, also known as brokers, that store the messages pushed into the Kafka system. Because the Kafka cluster uses ZooKeeper as its underlying coordination framework, all read and write requests for messages are handled by a server elected as the leader, while the other servers act as followers that synchronously replicate the leader's data for disaster-recovery purposes.

- Consumer: any object that subscribes to messages from the Kafka system.

There can be many consumers, and consumers can form consumer groups. Across consumer groups, messages are broadcast: every group can pull the same message data. Within a consumer group, messages are delivered to its consumers in first-in, first-out order, and each message can be consumed by only one consumer in that group.
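To make the producer, consumer, and consumer-group roles concrete, here is a minimal Java sketch against the Kafka 0.9 client API (the version installed later in this lesson). The broker addresses match this lesson's cluster; the topic name HelloKafka and the group id demo-group are illustrative assumptions, not something the original lesson prescribes:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloKafkaSketch {
    public static void main(String[] args) {
        // Producer: pushes one message into the HelloKafka topic.
        Properties p = new Properties();
        p.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("HelloKafka", "Life is short, you need Spark!"));
        }

        // Consumer: joins the group "demo-group" and pulls messages from the topic.
        // Within this group, any given message is delivered to only one consumer.
        Properties c = new Properties();
        c.put("bootstrap.servers", "master:9092,slave1:9092,slave2:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Arrays.asList("HelloKafka"));
            // A real program would poll in a loop; one poll is enough for a sketch.
            ConsumerRecords<String, String> records = consumer.poll(5000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value());
            }
        }
    }
}

A second copy of this consumer started with the same group.id would split the messages with the first (each message goes to only one of them); started with a different group.id, it would receive every message again, which is the broadcast relationship between groups described above.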

Kafka's data transfer happens at the kernel level (so-called zero copy: the transfer involves almost no copying) without passing through user space. On Linux, the first process started at boot is init, and after it starts the system enters user space. Consider a distributed system in which an application on machine A needs to read data served by a Java program on machine B. Since the JVM runs in user space and the data sits on disk, the read first moves the data from machine B's kernel space up into B's user space; after the user-space program has read it, the data is handed back down to B's kernel space, which sends it out over the network. Machine A's network card receives the transmitted data and writes it into A's kernel space, which finally passes it up to A's user space for processing.

In other words, when an external system reads data from a Java program, the data must cross into kernel space before the NIC can write it to the network. Java itself is a layer wrapped around the kernel: with Java socket programming, all data manipulation happens in JVM user space. Kafka, by contrast, keeps its data transfers in kernel space, avoiding the redundant copies and context switches between kernel and user space; this is how Kafka achieves high-speed reads and writes.
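The kernel-level transfer described above is the sendfile mechanism, which Java exposes as FileChannel.transferTo and which Kafka uses to move log data from the operating system's page cache directly to a network socket. A minimal sketch of the idea follows; the file path and destination address are made-up placeholders:

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Placeholder source file and destination; replace with real values.
        try (FileChannel file = new FileInputStream("/tmp/data.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long size = file.size();
            // transferTo asks the kernel to move bytes from the page cache straight
            // to the socket, without copying them through JVM user space.
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}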

3. Kafka Use-Case Scenarios

Like the familiar messaging tools we know from phones, mailboxes, and so on, Kafka can also:

- Carry text and images

- Store content

- Categorize content

From the point of view of content consumption, Kafka treats the mail in a mailbox as a topic.

Part Two: Kafka Installation and Hands-On Practice

http://kafka.apache.org/documentation.html#quickstart

1. Installing and Configuring ZooKeeper

Kafka cluster mode requires ZooKeeper to be installed in advance.

- Tip: Kafka standalone mode does not require a separate ZooKeeper installation; you can use the built-in ZooKeeper.

- Kafka cluster mode requires at least 3 servers. This lesson uses servers with the hostnames master, slave1, and slave2.

- The ZooKeeper version used in this lesson is zookeeper-3.4.6.

1) Download ZooKeeper

Go to http://www.apache.org/dyn/closer.cgi/zookeeper/, where you can choose a mirror URL to download from; the mirror recommended by the official website is http://mirror.bit.edu.cn/apache/zookeeper/. Tip: you can also download the ZooKeeper installation file directly from the course group.

Download zookeeper-3.4.6.tar.gz.

2) Install ZooKeeper

Tip: The following steps are performed on the master server.

Taking Ubuntu 14.04 as an example, put the downloaded file into the /root directory and unzip it with the following commands:

cd /root

tar -zxvf zookeeper-3.4.6.tar.gz

After decompression there will be a new zookeeper-3.4.6 directory under /root. Move it to the target directory with the following commands; this installs ZooKeeper:

cd /root

mv zookeeper-3.4.6 /usr/local/spark

After that, there is a new zookeeper-3.4.6 directory under /usr/local/spark. Next, let's discuss how to configure the installed ZooKeeper.

3) Configure ZooKeeper

Tip: The following steps are performed on the master server.

    1. Configure .bashrc

- Open the file: vi /root/.bashrc

- Add before the PATH configuration line:

export ZOOKEEPER_HOME=/usr/local/spark/zookeeper-3.4.6

- Then modify PATH:

export PATH=${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${HIVE_HOME}/bin:${KAFKA_HOME}/bin:$PATH

- Make the configured environment variables take effect immediately: source /root/.bashrc

    2. Create the data directory

- cd $ZOOKEEPER_HOME

- mkdir data

    3. Create and open the zoo.cfg file

- cd $ZOOKEEPER_HOME/conf

- cp zoo_sample.cfg zoo.cfg

- vi zoo.cfg

    4. Configure zoo.cfg

# Configure the directory where ZooKeeper stores data such as logs and the server ID number.

# Do not use the default /tmp/zookeeper directory, because data under /tmp is easily deleted by accident.

dataDir=../data

# The port ZooKeeper uses for client connections

clientPort=2181

# At the end of the file, add 3 lines configuring the 2 important ports of each server: the leader port and the election port.

# server.A=B:C:D, where A is a number indicating which server this is;

# B is the hostname or IP address of this server;

# C is the port this server uses to exchange information with the leader server in the cluster;

# D is the port used in case the cluster's leader server goes down and the remaining servers

# need to elect a new leader; it is the port the servers use to talk to each other during the election.

# In a pseudo-cluster configuration, B is the same for every instance, so different ZooKeeper

# instances cannot share communication port numbers; assign each of them a different port number.

server.1=master:2888:3888

server.2=slave1:2888:3888

server.3=slave2:2888:3888

    5. Create and open the myid file

- cd $ZOOKEEPER_HOME/data

- touch myid

- vi myid

    6. Configure myid

According to the zoo.cfg configuration, the content of myid on master is 1.

4) Synchronize the master's installation and configuration to slave1 and slave2

- Run the following commands on the master server:

cd /root

scp ./.bashrc root@slave1:/root

scp ./.bashrc root@slave2:/root

cd /usr/local/spark

scp -r ./zookeeper-3.4.6 root@slave1:/usr/local/spark

scp -r ./zookeeper-3.4.6 root@slave2:/usr/local/spark

- Run the following command on the slave1 server:

vi $ZOOKEEPER_HOME/data/myid

According to the zoo.cfg configuration, the content of myid on slave1 is 2.

- Run the following command on the slave2 server:

vi $ZOOKEEPER_HOME/data/myid

According to the zoo.cfg configuration, the content of myid on slave2 is 3.

5) Start the ZooKeeper service

- Run the following command on the master server:

zkServer.sh start

- Run the following commands on the slave1 server:

source /root/.bashrc

zkServer.sh start

- Run the following commands on the slave2 server:

source /root/.bashrc

zkServer.sh start

6) Verify that ZooKeeper is installed and started successfully

- Run the commands jps and zkServer.sh status on the master server:

root@master:/usr/local/spark/zookeeper-3.4.6/bin# jps

3844 QuorumPeerMain

4790 Jps

root@master:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status

JMX enabled by default

Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg

Mode: follower

- Run the commands jps and zkServer.sh status on the slave1 server:

source /root/.bashrc

root@slave1:/usr/local/spark/zookeeper-3.4.6/bin# jps

3462 QuorumPeerMain

4313 Jps

root@slave1:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status

JMX enabled by default

Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg

Mode: follower

- Run the commands jps and zkServer.sh status on the slave2 server:

root@slave2:/usr/local/spark/zookeeper-3.4.6/bin# jps

4073 Jps

3277 QuorumPeerMain

root@slave2:/usr/local/spark/zookeeper-3.4.6/bin# zkServer.sh status

JMX enabled by default

Using config: /usr/local/spark/zookeeper-3.4.6/bin/../conf/zoo.cfg

Mode: leader

At this point, ZooKeeper has been installed and configured successfully.

2. Installing and Configuring Kafka

The Kafka version used in this lesson is kafka_2.10-0.9.0.1.

1) Download Kafka

Go to http://kafka.apache.org/downloads.html and click kafka_2.10-0.9.0.1.tgz. Tip: you can also download the Kafka installation file directly from the course group.

Download kafka_2.10-0.9.0.1.tgz.

2) Install Kafka

Tip: The following steps are performed on the master server.

Taking Ubuntu 14.04 as an example, put the downloaded file into the /root directory and unzip it with the following commands:

cd /root

tar -zxvf kafka_2.10-0.9.0.1.tgz

After decompression there will be a new kafka_2.10-0.9.0.1 directory under /root. Move it to the target directory with the following commands; this installs Kafka:

cd /root

mv kafka_2.10-0.9.0.1 /usr/local

After that, there is a new kafka_2.10-0.9.0.1 directory under /usr/local. Next, let's discuss how to configure the installed Kafka.

3) Configure Kafka

Tip: The following steps are performed on the master server.

    1. Configure .bashrc

- Open the file: vi /root/.bashrc

- Add before the PATH configuration line:

export KAFKA_HOME=/usr/local/kafka_2.10-0.9.0.1

- Then modify PATH:

export PATH=${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${HIVE_HOME}/bin:${KAFKA_HOME}/bin:$PATH

- Make the configured environment variables take effect immediately: source /root/.bashrc

    2. Open server.properties

- cd $KAFKA_HOME/config

- vi server.properties

    3. Configure server.properties

broker.id=0

port=9092

zookeeper.connect=master:2181,slave1:2181,slave2:2181

4) Synchronize the master's installation and configuration to slave1 and slave2

- Run the following commands on the master server:

cd /root

scp ./.bashrc root@slave1:/root

scp ./.bashrc root@slave2:/root

cd /usr/local

scp -r ./kafka_2.10-0.9.0.1 root@slave1:/usr/local

scp -r ./kafka_2.10-0.9.0.1 root@slave2:/usr/local

- Run the following command on the slave1 server:

vi $KAFKA_HOME/config/server.properties

Modify it so that broker.id=1.

- Run the following command on the slave2 server:

vi $KAFKA_HOME/config/server.properties

Modify it so that broker.id=2.

5) Start the Kafka service

- Run the following commands on the master server:

cd $KAFKA_HOME/bin

kafka-server-start.sh ../config/server.properties &

- Run the following commands on the slave1 server:

source /root/.bashrc

cd $KAFKA_HOME/bin

kafka-server-start.sh ../config/server.properties &

- Run the following commands on the slave2 server:

source /root/.bashrc

cd $KAFKA_HOME/bin

kafka-server-start.sh ../config/server.properties &

6) Verify that Kafka is installed and started successfully

- Run this command on any server to create the topic "HelloKafka":

kafka-topics.sh --create --zookeeper master:2181,slave1:2181,slave2:2181 --replication-factor 3 --partitions 1 --topic HelloKafka

- Run this command on any server to produce some messages for the created topic "HelloKafka":

kafka-console-producer.sh --broker-list master:9092,slave1:9092,slave2:9092 --topic HelloKafka

Enter the following message content:

This is DT_Spark!

I'm Rocky!

Life is short, you need Spark!

- Run this command on any server to consume (pull) messages from the specified topic "HelloKafka":

kafka-console-consumer.sh --zookeeper master:2181,slave1:2181,slave2:2181 --from-beginning --topic HelloKafka

After a moment, you will see the message content printed:

This is DT_Spark!

I'm Rocky!

Life is short, you need Spark!

- Run this command on any server to list all topic names:

kafka-topics.sh --list --zookeeper master:2181,slave1:2181,slave2:2181

- Run this command on any server to view an overview of the specified topic:

kafka-topics.sh --describe --zookeeper master:2181,slave1:2181,slave2:2181 --topic HelloKafka

At this point, Kafka has been installed and configured successfully.

Summary:

With Spark Streaming you can process many types of data source, such as databases, HDFS, server logs, and network streams. Its power is greater than you may imagine, yet it is often not used to its full potential; the real reason for this is a lack of understanding of Spark and of Spark Streaming itself.
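As a small preview of the integration that the follow-up lesson covers, here is a hedged Java sketch that reads this lesson's HelloKafka topic with Spark Streaming's receiver-based Kafka connector (the Spark 1.x spark-streaming-kafka API from the same era as Kafka 0.9; the group id spark-group and the 5-second batch interval are illustrative assumptions):

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SparkStreamingOnKafkaSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOnKafka");
        // 5-second micro-batches (an assumed interval for this sketch).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Receiver-based stream: one receiver thread reading HelloKafka via ZooKeeper.
        Map<String, Integer> topics = Collections.singletonMap("HelloKafka", 1);
        JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
                jssc, "master:2181,slave1:2181,slave2:2181", "spark-group", topics);
        messages.print(); // Print a sample of each batch's (key, value) pairs.
        jssc.start();
        jssc.awaitTermination();
    }
}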

Written by: the IMF Spark Streaming Enterprise Development Combat Team (Zhou Fei, Jiang Wei, et al.)

Chief editor: Liaoliang

Note:

Material from: DT Big Data DreamWorks (the IMF Legendary Action secret course)

For more exclusive content, please follow the public WeChat account: DT_Spark

If you are interested in big data and Spark, you can listen free of charge to teacher Liaoliang's permanently free Spark public class, held every night at 20:00 in YY room number 68917580.

Life is short, you need Spark!
