Hadoop learning: storing large datasets as single files in HDFS; resolving an Eclipse error after installation on Linux; a plug-in for viewing .class files


sudo apt-get install eclipse

Open Eclipse after installation; it reports an error:

An error has occurred. See the log file
/home/pengeorge/.eclipse/org.eclipse.platform_3.7.0_155965261/configuration/1342406790169.log.

Review the error log, then resolve the problem.

Open the log file and you will see the following error:

!SESSION 2012-07-16 10:46:29.992 -----------------------------------------------
eclipse.buildId=I20110613-1736
java.version=1.7.0_05
java.vendor=Oracle Corporation
BootLoader constants: OS=linux, ARCH=x86, WS=gtk, NL=zh_CN
Command-line arguments: -os linux -ws gtk -arch x86


!ENTRY org.eclipse.osgi 4 0 2012-07-16 10:46:31.885
!MESSAGE Application error
!STACK 1
java.lang.UnsatisfiedLinkError: Could not load SWT library. Reasons:
    no swt-gtk-3740 in java.library.path
    no swt-gtk in java.library.path
    Can't load library: /home/pengeorge/.swt/lib/linux/x86_64/libswt-gtk-3740.so
    Can't load library: /home/pengeorge/.swt/lib/linux/x86/libswt-gtk.so

Solutions

Copy the relevant SWT libraries into ~/.swt/lib/linux/x86_64 and restart Eclipse:

cp /usr/lib/jni/libswt-*3740.so ~/.swt/lib/linux/x86_64

Eclipse itself is installed under /usr/lib/eclipse.

To troubleshoot viewing .class files, see http://www.blogjava.net/hongjunli/archive/2007/08/15/137054.html.

A typical Hadoop workflow generates data files (such as log files) elsewhere and then copies them into HDFS, where MapReduce processes them. You usually do not read an HDFS file directly; the MapReduce framework reads it and parses it into individual records (key/value pairs). Unless you need to import or export data, you will almost never write a program that reads or writes HDFS files directly.

Hadoop's file commands can interact with the HDFS file system as well as with the local file system and the Amazon S3 file system.

hadoop fs -mkdir /user/chuck    creates a folder
hadoop fs -ls /                 lists a directory
hadoop fs -lsr /                lists a directory recursively, including subdirectories
hadoop fs -put example.txt .    adds a file; the trailing "." is equivalent to /user/chuck

Note that example.txt is taken from the user's home directory; for a user named student, that would be /home/student/example.txt. The command above copies a local file into HDFS.

Once the data is in HDFS and a Hadoop job has processed it, the job writes a new set of HDFS files. To view one: hadoop fs -cat /user/chuck/pg20417.txt. To copy it back to the local machine: hadoop fs -get /user/chuck/pg20417.txt.

Hadoop output can be piped through UNIX commands, for example hadoop fs -cat /user/chuck/pg20417.txt | head. To view the last kilobyte (about 1000 bytes) of a file: hadoop fs -tail /user/chuck/pg20417.txt.

To delete a file: hadoop fs -rm /user/chuck/pg20417.txt

To view help for a Hadoop command, for example ls: hadoop fs -help ls

The Hadoop command line also offers getmerge, which merges a set of HDFS files into a single file on the local machine. For programmatic file operations, the main Hadoop classes live in the org.apache.hadoop.fs package.
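
For programmatic access, here is a minimal sketch of the same operations through the org.apache.hadoop.fs API. The directory and file names simply mirror the shell examples above; this is an illustration, not code from the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);       // handle to the default file system (HDFS)

        Path dir = new Path("/user/chuck");
        fs.mkdirs(dir);                             // like: hadoop fs -mkdir /user/chuck

        // like: hadoop fs -put example.txt /user/chuck
        fs.copyFromLocalFile(new Path("example.txt"), dir);

        // like: hadoop fs -ls /user/chuck
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }

        // like: hadoop fs -rm /user/chuck/example.txt (false = non-recursive)
        fs.delete(new Path(dir, "example.txt"), false);

        fs.close();
    }
}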


After the input data has been distributed to different nodes, the only time nodes exchange data is during the "shuffle" stage. This constraint on communication is very helpful for scalability.

MapReduce serializes key/value pairs, so only serializable classes can act as keys or values in the framework. A class that implements the Writable interface can be used as a value; a class that implements the WritableComparable<T> interface can be used as either a key or a value, because keys must also be comparable for sorting. A number of predefined classes (for example Text and IntWritable) implement WritableComparable.

Implementing these interfaces means defining how the data is read, how the data is written, and how two values are compared for sorting.
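
As a concrete illustration, here is a minimal sketch of a custom WritableComparable. The Edge class (departure and arrival city) is a hypothetical key type chosen to match the route example discussed later; it is not a Hadoop built-in.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class Edge implements WritableComparable<Edge> {
    private String departure;   // departure city
    private String arrival;     // arrival city

    public Edge() {}            // Hadoop creates Writables by reflection, so keep a no-arg constructor

    public Edge(String departure, String arrival) {
        this.departure = departure;
        this.arrival = arrival;
    }

    public String getDeparture() { return departure; }

    @Override
    public void readFields(DataInput in) throws IOException {   // how the data is read
        departure = in.readUTF();
        arrival = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {      // how the data is written
        out.writeUTF(departure);
        out.writeUTF(arrival);
    }

    @Override
    public int compareTo(Edge other) {                           // how keys are sorted
        int cmp = departure.compareTo(other.departure);
        return cmp != 0 ? cmp : arrival.compareTo(other.arrival);
    }

    @Override
    public int hashCode() {     // HashPartitioner uses hashCode() to pick a reducer
        return departure.hashCode() * 163 + arrival.hashCode();
    }
}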

You can now write the first phase, the mapper. To act as a mapper, a class needs to extend the MapReduceBase base class and implement the Mapper interface.

The configure(JobConf job) method plays the role of a constructor: it extracts parameters from the XML configuration files or from the application's main class, and it is called before any data is processed.

The close() method plays the role of a destructor: it runs at the end of the mapper's work and completes any cleanup, such as closing database connections or open files.

The Mapper interface itself declares only one method, map(), which processes a single key/value pair.
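
Putting the three methods together, a minimal mapper sketch in the old org.apache.hadoop.mapred API described here might look like the following. The word-splitting logic is only an example, not code from the original article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void configure(JobConf job) {
        // called once before any map() call; read job parameters here
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // with the default TextInputFormat: key = byte offset of the line, value = the line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);   // emit (word, 1)
            }
        }
    }

    @Override
    public void close() throws IOException {
        // called once after the last map() call; release resources here
    }
}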

The reduce() function iterates over the values associated with a given key and generates a (possibly empty) list of output pairs.
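
A matching reducer sketch in the same old API, again for illustration only, sums the values it iterates over for each key:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {          // iterate over all values for this key
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // emit the summed count for this key
    }
}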

There is an extremely important step between map and reduce: routing the mapper's output to the different reducers. This is the Partitioner's job.

Multiple reducers provide parallel computation. The default approach is to hash the key to decide which reducer receives it; Hadoop enforces this policy with HashPartitioner. Sometimes, however, that default routes records the wrong way for your problem.

Take the routes (Shanghai, Beijing) and (Shanghai, Guangzhou). If the whole route is the key, the default HashPartitioner may send these two records to different reducers even though both depart from Shanghai. The data for the departure city Shanghai is then processed twice, once in each reducer, and that duplicated work is redundant.

In that case we should tailor the partitioner so that it looks only at the departure city, sending all routes with the same departure to the same reducer.

A partitioner needs to implement the configure() function (which applies the Hadoop job configuration to the partitioner) and the getPartition() function (which returns an integer between 0 and the number of reduce tasks, indicating which reducer the key/value pair should be sent to).

The Partitioner decides where each key is placed, that is, which reducer receives it.
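
Here is a minimal sketch of such a custom partitioner, using the hypothetical Edge key sketched earlier and partitioning by departure city only:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class EdgePartitioner implements Partitioner<Edge, Writable> {

    @Override
    public void configure(JobConf job) {
        // apply job configuration to the partitioner if needed
    }

    @Override
    public int getPartition(Edge key, Writable value, int numReduceTasks) {
        // return a value between 0 and numReduceTasks - 1, based on the departure city only,
        // so all routes with the same departure go to the same reducer
        return (key.getDeparture().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}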
HDFS supports combining many small files into one large file for processing, which is more efficient and better suited to MapReduce. One of the principles of MapReduce is to divide the input data into chunks that can be processed in parallel on multiple machines; in Hadoop terminology these chunks are called input splits. Splits should be small enough to give fine-grained parallelism, but not so small that the overhead of managing them outweighs the benefit.

FSDataInputStream extends java.io.DataInputStream to support random reads. MapReduce needs this feature because a machine may be assigned a split that starts in the middle of an input file; without random access, it would have to read from the beginning of the file all the way to the split's location.
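
A minimal sketch of that random access, assuming a hypothetical split offset: fs.open() returns an FSDataInputStream, and seek() jumps straight to the offset instead of reading from the start of the file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekToSplit {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long splitStart = 1024;                // hypothetical start offset of a split

        try (FSDataInputStream in = fs.open(new Path("/user/chuck/pg20417.txt"))) {
            in.seek(splitStart);               // random access: jump directly to the offset
            byte[] buf = new byte[4096];
            int read = in.read(buf);           // read from the middle of the file
            System.out.println("read " + read + " bytes starting at offset " + splitStart);
        }
    }
}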

HDFS is designed to store the data that MapReduce splits up and processes. HDFS stores files in blocks distributed across multiple machines, and each block roughly corresponds to one split. When each split/block is processed by the machine on which it resides, parallelism comes automatically. Multiple nodes hold replicas of each block for reliability, and MapReduce can choose any node that holds a copy of the split/block.

Input splits are a logical division of the input data, while HDFS blocks are a physical division. When the two coincide, processing is very efficient, but in practice they never line up perfectly: records may cross block boundaries, so the compute node processing a particular split may have to fetch a fragment of a record from another block.

