Loading Data into HDFS

This guide shows how to use a PDI (Pentaho Data Integration) job to move a file into HDFS.

Prerequisites

In order to follow along with this how-to guide you'll need the following:

    • Hadoop
    • Pentaho Data Integration
Sample Files

The sample data file needed is:

File Name                  Content
weblogs_rebuild.txt.zip    Unparsed, raw weblog data
Step-by-Step Instructions

Setup

Start Hadoop if it is not already running.
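
If your cluster is a standalone Apache Hadoop install, a minimal sketch for bringing HDFS up and confirming the daemons are running might look like the following; the script location varies by distribution and version (older releases keep start-dfs.sh under bin/ rather than sbin/), so treat the paths as assumptions:

      # Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
      $HADOOP_HOME/sbin/start-dfs.sh

      # Confirm the HDFS daemon processes are up
      jps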

Create a Job to Put the Files into Hadoop

In this task you will load a file into HDFS.

Speed Tip
You can download the Kettle job load_hdfs.kjb if you don't want to do every step manually.
  1. Start PDI on your desktop. Once it is running, choose 'File', 'New', 'Job' from the menu system, or click on the 'New File' icon on the toolbar and choose the 'Job' option.

  2. Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas.


  3. Add a Copy Files Job Entry: You will copy files from your local disk to HDFS, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop Copy Files' job entry onto the job canvas. Your canvas should look like this:


  4. Connect the Start and Copy Files Job Entries: Hover the mouse over the 'Start' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Hadoop Copy Files' node. Your canvas should look like this:


  5. Edit the Copy Files Job Entry: Double-click on the 'Hadoop Copy Files' node to edit its properties. Enter this information:
    1. File/Folder source(s): The folder containing the sample files you want to add to HDFS.
    2. File/Folder destination(s): hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/raw
    3. Wildcard (RegExp): Enter ^.*\.txt
    4. Click the 'Add' button to add the above entries to the list of files you wish to copy.
    5. Check the 'Create destination folder' option to ensure that the weblogs folder is created in HDFS the first time this job is executed.
      When done, your window should look like this (your file paths may be different):
      Click 'OK' to close the window.

  6. Save the Job: Choose 'File', 'Save as...' from the menu system. Save the job as 'load_hdfs.kjb' into a folder of your choice.

  7. Run the Job: Choose 'Action', 'Run' from the menu system or click on the green 'Run' button on the job toolbar. An 'Execute a job' window will open; click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the job as it runs. After a few seconds the job should finish successfully. (A command-line alternative using Kitchen is sketched after this list.)

    If any errors occurred, the job entry that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
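
If you would rather run the saved job without the Spoon GUI, PDI ships with Kitchen, its command-line job runner. A minimal sketch, run from the PDI installation directory and assuming a hypothetical save location of /home/user/load_hdfs.kjb:

      # Run the job headlessly with Kitchen (use kitchen.bat on Windows)
      ./kitchen.sh -file=/home/user/load_hdfs.kjb -level=Basic
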
Check Hadoop
    1. Run the following command:
      hadoop fs -ls /user/pdi/weblogs/raw

      This should return:
      -rwxrwxrwx 3 demo demo 77908174 2011-12-28 07:16 /user/pdi/weblogs/raw/weblog_raw.txt
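
      You can also spot-check the file contents from the command line; this sketch previews the first few lines of the file shown in the listing above:

      # Preview the beginning of the copied file
      hadoop fs -cat /user/pdi/weblogs/raw/weblog_raw.txt | head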

Summary

You have learned how to copy local files into HDFS using PDI's graphical design tool. You can use this tool to put files into HDFS from many different sources.

Troubleshooting
    • Make sure you have the correct shim configured and that it matches your Hadoop cluster's distribution and version.
    • Problem: The Hadoop Copy Files step creates an empty file in HDFS and hangs or never writes any data.
      Check: The Hadoop client-side API that Pentaho calls to copy files to HDFS requires that PDI have network connectivity to the nodes in the cluster. The DNS names or IP addresses used within the cluster must resolve the same way relative to the PDI machine as they do within the cluster. When PDI requests to put a file into HDFS, the NameNode returns the DNS names (or IP addresses, depending on the configuration) of the actual nodes that the data will be copied to.
    • Problem: Permission denied: user=xxxx, access=EXECUTE, inode="/user/pdi/weblogs/raw":raw:hadoop:drwxr-x---
      Check: When not using Kerberos security, the Hadoop API used by this step sends the username of the logged-in user when trying to copy the file(s), regardless of what username is used in the Connect field. To work around this, set the environment variable HADOOP_USER_NAME. You can modify spoon.bat or spoon.sh by changing the OPT variable:
      OPT="$OPT .... -DHADOOP_USER_NAME=HadoopNameToSpoof"
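
      As a rough sketch of how you might check both issues from the PDI machine, the commands below use 'datanode1.example.com' and 'pdi' purely as placeholder values; substitute your own host names and user:

      # Verify that a cluster host name resolves the same way it does inside the cluster
      nslookup datanode1.example.com

      # Spoof the HDFS user for this shell session instead of editing spoon.sh, then launch Spoon
      export HADOOP_USER_NAME=pdi
      ./spoon.sh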
