Chapter 1 Introduction
1.1 Writing purpose
Introduce Pig, a Hadoop extension that deserves a mention.
1.2 What is Pig
Pig is a large-scale data analysis platform built on Hadoop. It provides a SQL-like language called Pig Latin; the language's compiler translates SQL-like data analysis requests into a series of optimized MapReduce jobs. Pig provides a simple operating and programming interface for complex parallel computation over massive data sets.
1.3 Features of Pig
1. Focuses on ad-hoc analysis of massive data sets (ad-hoc: a solution custom-designed for a specific problem);
2. Runs on a cluster computing architecture. Yahoo!'s Pig provides multiple layers of abstraction to simplify parallel computing for ordinary users; these abstractions automatically translate user queries into efficient parallel evaluation plans, which are then executed on the physical cluster;
3. Provides SQL-like operation syntax;
4. Is open source.
1.4 Major Pig users
1. Yahoo
2. Twitter
1.5 About Pig and Hive
For developers, using the Java APIs directly can be tedious and error-prone, and it also limits the flexibility Java programmers have when programming against Hadoop. Hadoop therefore offers two solutions that make Hadoop programming easier.
• Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig's built-in operations make sense of semi-structured data such as log files, and Pig is extensible with custom data types and transformations written in Java.
• Hive plays the role of a data warehouse in Hadoop. Hive superimposes structure on data in HDFS and allows you to query that data with a SQL-like syntax. Like Pig, Hive's core capabilities are extensible.
Pig and Hive are often confused. Hive is better suited to data warehouse tasks; it is mainly used with static structures and jobs that require frequent analysis. Hive's similarity to SQL makes it an ideal point of intersection between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and lets them write concise scripts for transforming data flows that can be embedded into larger applications. Pig is relatively lightweight compared with Hive; its main advantage is that, compared with using Hadoop's Java APIs directly, it can greatly reduce the amount of code. Because of this, Pig continues to attract a large number of software developers.
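As a taste of that conciseness, here is a minimal Pig Latin sketch of a complete load-transform-store pipeline; the input file, field names, and output path are invented for this example. The equivalent Java MapReduce program would run to dozens of lines.
-- hypothetical input: tab-separated (user, url) records
pages = LOAD 'visits.log' AS (user:chararray, url:chararray);
grpd = GROUP pages BY user;
cnts = FOREACH grpd GENERATE group, COUNT(pages);
STORE cnts INTO 'visit_counts';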
Chapter 2 Install Pig
2.1 Download Pig
Download the latest Pig version from:
http://www.apache.org/dyn/closer.cgi/pig
What I downloaded is pig-0.10.0.tar.gz.
2.2 Install Pig
Extract the archive:
tar zxvf pig-0.10.0.tar.gz
Enter the directory:
cd pig-0.10.0
Note that Pig is a Hadoop tool, so you do not need to modify the original Hadoop configuration.
Add Pig to the environment variables. Enter
cd ~
to go to the user's home directory, then open the configuration file:
vi .bashrc
Add the environment variable configuration at the bottom.
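For example (a minimal sketch; the PIG_HOME path below assumes Pig was extracted to the directory used later in this article, so adjust it to your own location):
# assumption: Pig was extracted to /home/lgstar888/hadoop/pig-0.10.0
export PIG_HOME=/home/lgstar888/hadoop/pig-0.10.0
export PATH=$PATH:$PIG_HOME/bin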
Save and execute:
. .bashrc
Enter pig -help to test. If the setup succeeded, Pig prints its usage and help information.
If you want the Pig source code, you can check it out with SVN from:
http://svn.apache.org/repos/asf/pig/trunk
2.3 Configure Hadoop
Go to the $PIG_HOME/conf directory and modify the configuration file, adding the following to pig.properties:
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001
This points Pig at the local pseudo-distributed HDFS and MapReduce.
Run Pig locally:
pig -x local
If successful, the Grunt shell prompt appears.
To run on Hadoop, directly enter pig or pig -x mapreduce.
The following error may occur:
Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).
This requires configuring ~/.bashrc or /etc/profile. We recommend configuring the .bashrc file and adding:
export HADOOP_HOME=/home/hadoop/hadoop-1.0.3
export PIG_CLASSPATH=$HADOOP_HOME/conf
After the configuration is complete, running pig again connects to the cluster and enters the Grunt shell.
Chapter 3 The Grunt shell
3.1 Basic commands
quit exits Grunt.
kill jobid terminates a running Hadoop job.
set debug on turns on the debug level.
Commands include:
help, quit, kill jobid, set debug [on|off], set job.name 'jobname'
File commands include:
cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf, exec, run
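A quick hypothetical Grunt session using a few of these file commands; the local path /tmp/sample.txt and the demo directory are invented for this example:
grunt> pwd
grunt> mkdir demo
grunt> cd demo
grunt> copyFromLocal /tmp/sample.txt sample.txt
grunt> cat sample.txt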
3.2 Query test
Find the tutorial/data/excite-small.log file under the installation directory. The data is divided into three tab-separated columns: the first column is the user ID, the second is a Unix timestamp, and the third is the query string.
Enter the following statement:
grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
This loads the data into a relation (alias) called log.
Take four tuples from it and display them:
grunt> lmt = LIMIT log 4;
grunt> DUMP lmt;
The expected result is four (user, time, query) tuples printed to the console.
Read/write operators in Pig:
LOAD loads data from a file into a relation.
LIMIT limits the number of tuples in a relation to n.
DUMP displays the contents of a relation, used for debugging.
STORE stores the data in a relation into a directory.
Input and execute:
grunt> log = LOAD '/home/lgstar888/hadoop/pig-0.10.0/tutorial/data/excite-small.log'
AS (user:chararray, time:long, query:chararray);
grunt> grpd = GROUP log BY user;
grunt> cntd = FOREACH grpd GENERATE group, COUNT(log);
grunt> STORE cntd INTO 'output';
This counts the number of queries issued by each user.
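To check the stored result without leaving Grunt, you can cat the output directory ('output' is simply the name chosen in the STORE statement above):
grunt> cat output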
Use
grunt> DESCRIBE log;
to view the schema:
log: {user: chararray, time: long, query: chararray}
Diagnostic operators in Pig:
DESCRIBE alias; displays the schema of a relation.
EXPLAIN displays the execution plan used to compute a relation.
ILLUSTRATE alias shows step by step how the data is transformed.
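For example, the diagnostic operators can be applied to the cntd relation built above (the exact output depends on your data and Pig version):
grunt> DESCRIBE cntd;
grunt> EXPLAIN cntd;
grunt> ILLUSTRATE cntd;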
Built-in functions in Pig:
AVG computes the average of the values in a single-column bag.
CONCAT concatenates two strings.
COUNT computes the number of tuples in a bag.
DIFF compares two fields within a tuple.
MAX computes the maximum value in a single-column bag.
MIN computes the minimum value in a single-column bag.
SIZE computes the number of elements.
SUM computes the sum of the values in a single-column bag.
IsEmpty checks whether a bag is empty.
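As a sketch of these functions in use, the following hypothetical statements compute, for each user in the log relation loaded earlier, the number of queries and the latest timestamp:
grunt> grpd = GROUP log BY user;
grunt> stats = FOREACH grpd GENERATE group, COUNT(log), MAX(log.time);
grunt> DUMP stats;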
More related usage and configuration notes have been collected at:
http://code.google.com/p/mycloub/