Now let's dive into the details of this solution; I'll show you how to import data into Hadoop in just a few steps.
1. Extract data from RDBMS
All relational databases keep a log file that records the latest transaction information. The first step in our streaming solution is to obtain these transactions in a format that Hadoop can parse. (The original author does not explain how to parse these transaction logs, possibly because that involves business-specific details.)
2. Start the Kafka producer
A process that sends messages to a Kafka topic is called a producer, and messages of the same type are written to the same topic. The transactional messages from the RDBMS will be converted into messages on a Kafka topic. In our example, the transaction information from the sales team's database is published to a Kafka topic, which we create with the following steps:
$ cd /usr/hdp/2.4.0.0-169/kafka
$ bin/kafka-topics.sh --create --zookeeper www.iteblog.com:2181 --replication-factor 1 --partitions 1 --topic SalesDBTransactions
Created topic "SalesDBTransactions".
$ bin/kafka-topics.sh --list --zookeeper www.iteblog.com:2181
SalesDBTransactions
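Before moving on, you can optionally confirm the topic settings with the describe command. This check is not part of the original walkthrough; it simply reuses the same ZooKeeper address as above:
$ bin/kafka-topics.sh --describe --zookeeper www.iteblog.com:2181 --topic SalesDBTransactions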
3. Set up Hive
We create a table in Hive to receive the transaction information from the sales team's database. In this example we will create a table named customers:
[iteblog@sandbox ~]$ beeline -u jdbc:hive2:// -n hive -p hive
0: jdbc:hive2://> use raj;
CREATE TABLE customers (id string, name string, email string, street_address string, company string)
PARTITIONED BY (time string)
CLUSTERED BY (id) INTO 5 BUCKETS STORED AS ORC
LOCATION '/user/iteblog/salescust'
TBLPROPERTIES ('transactional' = 'true');
In order to enable transactions in Hive, we need the following Hive configuration:
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
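Depending on your Hive version, a few companion settings are usually needed for ACID tables as well. The list below is not from the original article; it is an assumption based on the standard Hive transactions documentation, so verify it against your own hive-site.xml:
hive.support.concurrency = true
hive.enforce.bucketing = true
hive.exec.dynamic.partition.mode = nonstrict
hive.compactor.initiator.on = true
hive.compactor.worker.threads = 1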
4. Start a Flume agent that writes the Kafka data to Hive
Below we create a Flume agent that sends the data from the Kafka topic to the corresponding table in Hive. Before starting the Flume agent, set up the directories and environment variables with the following steps:
$ pwd
/home/iteblog/streamingdemo
$ mkdir -p flume/checkpoint
$ mkdir -p flume/data
$ chmod -R 777 flume
$ export HIVE_HOME=/usr/hdp/current/hive-server2
$ export HCAT_HOME=/usr/hdp/current/hive-webhcat
$ cd flume
$ pwd
/home/iteblog/streamingdemo/flume
$ mkdir logs
Then create a log4j properties file:
[iteblog@sandbox conf]$ vi log4j.properties
flume.root.logger=INFO,LOGFILE
flume.log.dir=/home/iteblog/streamingdemo/flume/logs
flume.log.file=flume.log
Finally, our Flume agent is configured as follows:
$ vi flumetohive.conf
flumeagent1.sources = source_from_kafka
flumeagent1.channels = mem_channel
flumeagent1.sinks = hive_sink
# Define / configure the source
flumeagent1.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
flumeagent1.sources.source_from_kafka.zookeeperConnect = sandbox.hortonworks.com:2181
flumeagent1.sources.source_from_kafka.topic = SalesDBTransactions
flumeagent1.sources.source_from_kafka.groupID = flume
flumeagent1.sources.source_from_kafka.channels = mem_channel
flumeagent1.sources.source_from_kafka.interceptors = i1
flumeagent1.sources.source_from_kafka.interceptors.i1.type = timestamp
flumeagent1.sources.source_from_kafka.consumer.timeout.ms = 1000
# Hive sink
flumeagent1.sinks.hive_sink.type = hive
flumeagent1.sinks.hive_sink.hive.metastore = thrift://sandbox.hortonworks.com:9083
flumeagent1.sinks.hive_sink.hive.database = raj
flumeagent1.sinks.hive_sink.hive.table = customers
flumeagent1.sinks.hive_sink.hive.txnsPerBatchAsk = 2
flumeagent1.sinks.hive_sink.hive.partition = %y-%m-%d-%H-%M
flumeagent1.sinks.hive_sink.batchSize = 10
flumeagent1.sinks.hive_sink.serializer = DELIMITED
flumeagent1.sinks.hive_sink.serializer.delimiter = ,
flumeagent1.sinks.hive_sink.serializer.fieldnames = id,name,email,street_address,company
# Use a channel which buffers events in memory
flumeagent1.channels.mem_channel.type = memory
flumeagent1.channels.mem_channel.capacity = 10000
flumeagent1.channels.mem_channel.transactionCapacity = 100
# Bind the source and sink to the channel
flumeagent1.sources.source_from_kafka.channels = mem_channel
flumeagent1.sinks.hive_sink.channel = mem_channel
5. Start the Flume agent
Use the following command to start the Flume agent:
$ /usr/hdp/apache-flume-1.6.0/bin/flume-ng agent -n flumeagent1 -f ~/streamingdemo/flume/conf/flumetohive.conf
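Once the agent is running, you can watch its log to confirm that it connected to Kafka and the Hive metastore. The path below comes from the log4j settings configured in the previous step:
$ tail -f /home/iteblog/streamingdemo/flume/logs/flume.log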
6. Start the Kafka stream
As an example, we use the console producer to simulate the transaction messages that the database would generate in a real system:
$ cd /usr/hdp/2.4.0.0-169/kafka
$ bin/kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic SalesDBTransactions
1, "Nero Morris", "porttitor.interdum@Sedcongue.edu", "P.O. Box 871, 5313 quis Ave", "Sodales Company"
2, "Cody Bond", "ante.lectus.convallis@antebibendumullamcorper.ca", "232-513 molestie Road", "Aenean eget Incorporated "
3, "Holmes Cannon", "a@metusAliquam.edu", "P.O. Box 726, 7682 bibendum Rd.", Velit CRAs LLP
4, "Alexander Lewis", "risus@urna.edu", "Ap #375 -9675 lacus Av.", "Ut Aliquam Iaculis Inc."
5, "Gavin Ortiz", "sit.amet@aliquameu.net", "Ap #453 -1440 Urna. St. "," Libero Nec Ltd "
6, "Ralph Fleming", "sociis.natoque.penatibus@quismassaMauris.edu", "363-6976 lacus." St. "," Quisque fringilla PC "
7, "Merrill Norton", "at.sem@elementum.net", "P.O Box 452, 6951 egestas." St. "," Nec metus Institute "
8, "Nathaniel Carrillo", "eget@massa.co.uk", "Ap #438 -604 tellus St.", "Blandit Viverra Corporation"
9, "Warren Valenzuela", "tempus.scelerisque.lorem@ornare.co.uk", "Ap #590 -320 Nulla Av.", "Ligula aliquam erat Incorporated "
"Donovan Hill", "facilisi@augue.org", "979-6729 Donec Road", "Turpis in Condimentum Associates"
One, "Kamal Matthews", "augue.ut@necleoMorbi.org", "Ap #530 -8214 convallis, St.", "Tristique senectus Et Foundation"
7. Receive the data in Hive
After completing all of the steps above, when you send data to Kafka you will see it arrive in the Hive table within a few seconds.
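For example, you can query the table from beeline to watch new rows streaming in. The connection string mirrors the one used in step 3, and the query itself is just an illustration:
[iteblog@sandbox ~]$ beeline -u jdbc:hive2:// -n hive -p hive
0: jdbc:hive2://> use raj;
0: jdbc:hive2://> select * from customers limit 10;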