Hadoop usage (6)

Chapter 1 Introduction

1.1 Writing purpose

This chapter introduces Pig, a Hadoop extension that deserves a mention.

1.2 What is Pig

Pig is a Hadoop-based platform for large-scale data analysis. It provides an SQL-like language called Pig Latin, whose compiler converts SQL-like data-analysis requests into a series of optimized MapReduce jobs. Pig thus offers a simple interface for operating on and programming complex parallel computations over massive data sets.
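As a taste of the SQL-like style, here is a minimal Pig Latin sketch; the file name and schema are invented for illustration. It loads a page-view log, groups it by URL, and counts the hits per URL; the compiler turns these four statements into one or more MapReduce jobs:

-- hypothetical input file and schema, for illustration only
views = LOAD 'views.log' AS (user:chararray, url:chararray, time:long);
by_url = GROUP views BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS hits;
DUMP counts;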

1.3 Features of Pig

1. Focuses on analysis of massive data sets (ad-hoc analysis; "ad hoc" here means a solution custom-designed for a specific problem);
2. Runs on a cluster computing architecture. Yahoo Pig provides multiple layers of abstraction that simplify parallel computing for ordinary users: these abstractions automatically translate user queries into efficient parallel evaluation plans, which are then executed on the physical cluster;
3. Provides SQL-like operation syntax (as in the sketch above);
4. Is open source.

1.4 Major Pig users

1. Yahoo

2. Twitter

1.5 About Pig and Hive

For developers, using the Java APIs directly can be tedious and error-prone, and it also limits Java programmers' flexibility when programming on Hadoop. Hadoop therefore provides two solutions that make Hadoop programming easier.

• Pig is a programming language that simplifies the common tasks of working with Hadoop: it can load data, express transformations on the data, and store the final results. Pig's built-in operations make sense of semi-structured data, such as log files. Pig can also be extended with custom functions and data types written in Java (see the sketch after this list) and supports user-defined data transformations.

• Hive plays the role of a data warehouse on Hadoop. Hive superimposes structure on data in HDFS and lets you query that data with an SQL-like syntax. Like Pig, Hive's core functionality is extensible.
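As an illustration of the Java extensibility mentioned above, here is a minimal sketch that registers a jar of user-defined functions and applies one of them. The jar name myudfs.jar and the ToUpper UDF are hypothetical, and an alias log with a chararray field query is assumed to have been loaded already:

-- myudfs.jar and ToUpper are hypothetical; log is assumed to be loaded
REGISTER myudfs.jar;
upper = FOREACH log GENERATE myudfs.ToUpper(query);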

Pig and Hive are often confused. Hive is better suited to data-warehouse tasks: it is mainly used over static structures and for jobs that require frequent analysis. Its closeness to SQL makes Hive an ideal meeting point between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and lets them write concise scripts that transform data streams and can be embedded into larger applications. Pig is comparatively lighter weight than Hive; its main advantage is that, compared with using Hadoop's Java APIs directly, it can greatly reduce the amount of code. Because of this, Pig continues to attract a large number of software developers.

 

Chapter 2 Installing Pig

2.1 Download Pig

Download the latest Pig release from:

http://www.apache.org/dyn/closer.cgi/pig

This guide uses pig-0.10.0.tar.gz.

2.2 Install Pig

Extract the tarball:

tar zxvf pig-0.10.0.tar.gz

Enter the directory:

cd pig-0.10.0

Note that Pig is a client-side Hadoop tool, so you do not need to modify the existing Hadoop configuration.

Add Pig to the environment variables. Go to the user's home directory:

cd ~

Open .bashrc:

vi .bashrc

Add the environment-variable configuration at the bottom, as in the sketch below.
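A minimal sketch of the variables to add, assuming the tarball was extracted to /home/hadoop/pig-0.10.0 (adjust PIG_HOME to your actual path):

# hypothetical install location; point PIG_HOME at your extracted directory
export PIG_HOME=/home/hadoop/pig-0.10.0
export PATH=$PATH:$PIG_HOME/bin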

Save the file and re-execute it:

. .bashrc

Enter pig -help to test. If the setup succeeded, Pig prints its usage information.

If you want the Pig source code, you can check it out with SVN from:

http://svn.apache.org/repos/asf/pig/trunk

2.3 Configure Hadoop

Go to the $PIG_HOME/conf directory and edit the configuration file, adding the following to pig.properties:

fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001

These point Pig at the local pseudo-distributed HDFS and MapReduce services.

To run Pig locally:

pig -x local

If it starts successfully, you are dropped at the grunt> prompt.

To run on Hadoop, enter pig or pig -x mapreduce directly. The following error may occur:

Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).

If so, configure ~/.bashrc or /etc/profile; configuring .bashrc is recommended. Add:

export HADOOP_HOME=/home/hadoop/hadoop-1.0.3
export PIG_CLASSPATH=$HADOOP_HOME/conf

After the configuration is complete, you can access the cluster through Pig.

Chapter 3 The Grunt shell

3.1 Basic commands

quit exits Grunt.

kill jobid terminates the corresponding running Hadoop job.

set debug on turns on debug-level output.

Commands include:

help, quit, kill jobid, set debug [on|off], set job.name 'jobname'

File commands include:

cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf, exec, run
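A short Grunt session using a few of the file commands; the paths here are hypothetical:

grunt> pwd
grunt> mkdir demo
grunt> copyFromLocal /tmp/local.txt demo/local.txt
grunt> cat demo/local.txt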

3.2 Query test

Find the tutorial/data/excite-small.log file under the installation directory. The data has three columns separated by tabs: the first column is a user ID, the second is a Unix timestamp, and the third is the query string.

Enter the following statement:

grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

This loads the data into an alias called log.

Take the first four tuples and display them:

grunt> lmt = LIMIT log 4;
grunt> dump lmt;

Pig prints the four tuples to the console.

Read/write operators in Pig:

LOAD loads data from a file into a relation.

LIMIT restricts the number of tuples in a relation to n.

DUMP displays the contents of a relation, for debugging.

STORE saves the data in a relation to a directory.

Enter and execute:

grunt> log = LOAD '/home/lgstar888/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user:chararray, time:long, query:chararray);
grunt> grpd = GROUP log BY user;
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
grunt> STORE cntd INTO 'output';

This counts the number of queries issued by each user.

Use grunt> describe log; to view the schema:

log: {user: chararray, time: long, query: chararray}

Diagnostic operators in Pig:

describe alias; displays the schema of a relation.

explain displays the execution plan used to compute a relation.

illustrate alias shows, step by step, how the data is transformed.
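For example, applied to the cntd alias built in section 3.2 above, the following prints the execution plan and a sample step-by-step transformation:

grunt> explain cntd;
grunt> illustrate cntd;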

Built-in evaluation functions in Pig:

AVG computes the average of the numeric values in a single-column bag.

CONCAT concatenates two strings.

COUNT computes the number of tuples in a bag.

DIFF compares two fields (bags or tuples) within a tuple.

MAX computes the maximum value in a single-column bag.

MIN computes the minimum value in a single-column bag.

SIZE computes the number of elements in a value.

SUM computes the sum of the numeric values in a single-column bag.

IsEmpty checks whether a bag or map is empty.
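A short sketch combining several of these functions, reusing the log alias loaded in section 3.2 (field names follow its schema); it computes the average query length, in characters, per user:

grunt> lens = FOREACH log GENERATE user, SIZE(query) AS qlen;
grunt> grp = GROUP lens BY user;
grunt> avgq = FOREACH grp GENERATE group, AVG(lens.qlen);
grunt> dump avgq;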

More notes on usage and configuration have been collected at:

http://code.google.com/p/mycloub/

 

 
