Build an ETL Pipeline with Kafka Connect via JDBC connectors

Source: Internet
Author: User
Tags: postgresql, zookeeper, kafka connect

This article is an in-depth tutorial on using Kafka to move data from PostgreSQL to Hadoop HDFS via JDBC connectors.

Tutorial: discover how to build a pipeline with Kafka, leveraging the DataDirect PostgreSQL JDBC driver, to move data from PostgreSQL to HDFS. Let's go streaming!

Apache Kafka is an open source distributed streaming platform that enables you to build streaming data pipelines between different applications. You can also build real-time streaming applications that interact with streams of data, since Kafka focuses on providing a scalable, high-throughput, low-latency platform for working with data streams.

Earlier this year, the Apache Kafka project announced a new tool called Kafka Connect, which helps users easily move datasets into and out of Kafka using connectors, and it has support for JDBC connectors out of the box. One of the major benefits for DataDirect customers is that you can now easily build an ETL pipeline using Kafka and your DataDirect JDBC drivers: you can connect to your data sources, get the data into Kafka, and export it from there to another data source.

Image from https://kafka.apache.org/

Environment Setup

Before proceeding any further with this tutorial, make sure you have installed the following and that everything is configured properly. This tutorial was written assuming you are working on Ubuntu 16.04 LTS and have PostgreSQL, Apache Hadoop, and Hive installed.
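As a quick sanity check (a minimal sketch; it assumes the psql, hadoop, hive, and java binaries are already on your PATH), you can confirm the prerequisites before installing anything Kafka-related:

  psql --version     # PostgreSQL client
  hadoop version     # Hadoop distribution and version
  hive --version     # Hive CLI
  java -version      # Java runtime needed by Kafka, Hadoop, and the JDBC driver installer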

  1. Installing Apache Kafka and the required tools. To make the installation process easier for people trying this for the first time, we'll be installing the Confluent Platform. This takes care of installing Apache Kafka, Schema Registry, and Kafka Connect, which includes connectors for moving files, JDBC connectors, and the HDFS connector for Hadoop.
    1. To begin with, install Confluent's public key by running the command:  wget -qO - http://packages.confluent.io/deb/2.0/archive.key | sudo apt-key add -
    2. Now add the repository to your sources.list by running the following command:  sudo add-apt-repository "deb http://packages.confluent.io/deb/2.0 stable main"
    3. Update your package lists and then install the Confluent Platform by running the following commands:
       sudo apt-get update
       sudo apt-get install confluent-platform-2.11.7
  2. Installing the DataDirect PostgreSQL JDBC Driver
    1. Download the DataDirect PostgreSQL JDBC driver from the Progress DataDirect website.
    2. Install the PostgreSQL JDBC driver by running the following command:  java -jar PROGRESS_DATADIRECT_JDBC_POSTGRESQL_ALL.jar
    3. Follow the instructions on the screens to install the driver successfully (you can install the driver in evaluation mode, where you can try it for a limited number of days, or in license mode if you have bought the driver).
  3. Configuring data sources for Kafka Connect
    1. Create a new file called postgres.properties, paste the following configuration, and save the file. To learn more about the modes being used in the configuration below, visit this page. (For an example of a source table compatible with this mode, see the sketch after this list.)
       name=test-postgres-jdbc
       connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
       tasks.max=1
       connection.url=jdbc:datadirect:postgresql://<server>:<port>;user=<user>;password=<password>;database=<dbname>
       mode=timestamp+incrementing
       incrementing.column.name=<id>
       timestamp.column.name=<modifiedtimestamp>
       topic.prefix=test_jdbc_
       table.whitelist=actor
    2. Create another file called hdfs.properties, paste the following configuration, and save the file. To learn more about the HDFS connector and the configuration options used, visit this page.
       name=hdfs-sink
       connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
       tasks.max=1
       topics=test_jdbc_actor
       hdfs.url=hdfs://<server>:<port>
       flush.size=2
       hive.metastore.uris=thrift://<server>:<port>
       hive.integration=true
       schema.compatibility=BACKWARD
    3. Note that postgres.properties and hdfs.properties contain the connection configuration details and control the behavior of the JDBC and HDFS connectors, respectively.
    4. Create a symbolic link for the DataDirect PostgreSQL JDBC driver in the Hive lib folder by using the following command:  ln -s /path/to/datadirect/lib/postgresql.jar /path/to/hive/lib/postgresql.jar
    5. Also make the DataDirect PostgreSQL JDBC driver available on the Kafka Connect process's CLASSPATH by running the following command:  export CLASSPATH=/path/to/datadirect/lib/postgresql.jar
    6. Start the Hadoop cluster by running the following commands:
       cd /path/to/hadoop/sbin
       ./start-dfs.sh
       ./start-yarn.sh
  4. Configuring and Running Kafka Services
    1. Download the configuration files for the Kafka, ZooKeeper, and Schema Registry services.
    2. Start the ZooKeeper service by providing the zookeeper.properties file path as a parameter, using the command:  zookeeper-server-start /path/to/zookeeper.properties
    3. Start the Kafka service by providing the server.properties file path as a parameter, using the command:  kafka-server-start /path/to/server.properties
    4. Start the Schema Registry service by providing the schema-registry.properties file path as a parameter, using the command:  schema-registry-start /path/to/schema-registry.properties
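Before starting Kafka Connect in the next section, it can help to sanity-check two assumptions baked into the configuration above. First, the timestamp+incrementing mode expects the whitelisted actor table to have a strictly incrementing key column and a last-modified timestamp column; the table below is only a hypothetical example, so substitute the column names your schema actually uses for incrementing.column.name and timestamp.column.name. Second, ZooKeeper, Kafka, and the Schema Registry normally listen on ports 2181, 9092, and 8081 respectively, unless you changed that in their properties files.

  # Hypothetical source table compatible with mode=timestamp+incrementing
  psql -d <dbname> -c "CREATE TABLE actor (
      actor_id    SERIAL PRIMARY KEY,
      first_name  VARCHAR(45),
      last_name   VARCHAR(45),
      last_update TIMESTAMP NOT NULL DEFAULT now()
  );"

  # Confirm the services started in step 4 are listening on their default ports
  nc -z localhost 2181 && echo "ZooKeeper is up"
  nc -z localhost 9092 && echo "Kafka broker is up"
  nc -z localhost 8081 && echo "Schema Registry is up"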
Ingesting Data into HDFS using Kafka Connect

To start ingesting data from PostgreSQL, the final thing to do is start Kafka Connect. You can start Kafka Connect by running the following command:

connect-standalone /path/to/connect-avro-standalone.properties /path/to/postgres.properties /path/to/hdfs.properties

This imports the data from PostgreSQL into Kafka using the DataDirect PostgreSQL JDBC driver and creates a topic named test_jdbc_actor. The data is then exported from Kafka to HDFS by reading the test_jdbc_actor topic through the HDFS connector. The data stays in Kafka, so you can reuse it to export to any other data source.
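If you want to verify the pipeline end to end, the checks below are one way to do it. This is a sketch that follows the configuration used in this tutorial: the topic name comes from topic.prefix plus the whitelisted table, the HDFS path assumes the connector's default topics.dir, and the Hive table is the one the connector registers because hive.integration=true.

  # Inspect the Avro records the JDBC source connector produced to the topic
  kafka-avro-console-consumer --zookeeper localhost:2181 --topic test_jdbc_actor --from-beginning

  # List the files the HDFS sink connector wrote (default topics.dir)
  hdfs dfs -ls /topics/test_jdbc_actor

  # Query the Hive table registered by the HDFS connector
  hive -e "SELECT * FROM test_jdbc_actor LIMIT 10;"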

Next Steps

We hope this tutorial helped you understand how to build a simple ETL pipeline using Kafka Connect, leveraging DataDirect PostgreSQL JDBC drivers. This tutorial isn't limited to PostgreSQL. In fact, you can create ETL pipelines leveraging any of the DataDirect JDBC drivers we offer for relational databases like Oracle, DB2, and SQL Server, cloud sources like Salesforce and Eloqua, or big data sources like CDH Hive, Spark SQL, and Cassandra by following similar steps. Also, subscribe to our blog via email or RSS feeds for more tutorials.
