Hadoop learning: large datasets are stored as single files in HDFS; resolving an Eclipse startup error after installation on Linux; a plug-in for viewing .class files


sudo apt-get install eclipse

After installation, opening Eclipse produces an error:

An error has occurred. See the log file
/home/pengeorge/.eclipse/org.eclipse.platform_3.7.0_155965261/configuration/1342406790169.log.

Check the error log to find the cause.

Opening the log file shows the following error:

!SESSION 2012-07-16 10:46:29.992 -----------------------------------------------
eclipse.buildId=I20110613-1736
java.version=1.7.0_05
java.vendor=Oracle Corporation
BootLoader constants: OS=linux, ARCH=x86, WS=gtk, NL=zh_CN
Command-line arguments: -os linux -ws gtk -arch x86

!ENTRY org.eclipse.osgi 4 0 2012-07-16 10:46:31.885
!MESSAGE Application error
!STACK 1
java.lang.UnsatisfiedLinkError: Could not load SWT library. Reasons:
    no swt-gtk-3740 in java.library.path
    no swt-gtk in java.library.path
    Can't load library: /home/pengeorge/.swt/lib/linux/x86_64/libswt-gtk-3740.so
    Can't load library: /home/pengeorge/.swt/lib/linux/x86/libswt-gtk.so

How to solve it

Copy the relevant SWT libraries into the directory Eclipse is searching (here ~/.swt/lib/linux/x86_64) and restart Eclipse:

cp /usr/lib/jni/libswt-*3740.so ~/.swt/lib/linux/x86_64

Eclipse itself is installed under /usr/lib/eclipse.

http://www.blogjava.net/hongjunli/archive/2007/08/15/137054.html explains how to view .class files.

A typical Hadoop workflow generates data files (such as log files) elsewhere, copies them into HDFS, and then processes them with MapReduce. HDFS files are usually not read directly; the MapReduce framework reads them and parses them into individual records (key/value pairs). Unless you need to import or export data yourself, you rarely write code that reads or writes HDFS files directly.

The hadoop fs file commands can interact with HDFS, with the local file system, and with the Amazon S3 file system.

hadoop fs -mkdir /user/chuck    creates a directory
hadoop fs -ls /                 lists the root directory
hadoop fs -lsr /                lists it recursively, including subdirectories

hadoop fs -put example.txt . adds a file to the current HDFS working directory; the trailing dot is equivalent to /user/chuck.

If you put the file into a directory that does not exist, the system treats that path as the file name, in effect renaming the file, rather than creating a new directory.

Note that example.txt here is taken from the user's home directory on the local machine. A user named student, for example, could put the local file /home/student/example.txt into HDFS.

After you put data into HDFS and run Hadoop processing on it, the job outputs a new set of HDFS files. To view one: hadoop fs -cat /user/chuck/pg20417.txt

hadoop fs -get /user/chuck/pg20417.txt . copies the file into the current local Linux directory; the dot denotes the current directory.

Hadoop output can be fed into Unix pipelines: hadoop fs -cat /user/chuck/pg20417.txt | head. To view the last kilobyte of a file: hadoop fs -tail /user/chuck/pg20417.txt

View a file as text: hadoop fs -text /user/chuck/pg20417.txt

Delete a file: hadoop fs -rm /user/chuck/pg20417.txt

View help for a Hadoop file command, for example ls: hadoop fs -help ls

The Hadoop command line also has getmerge, which merges multiple HDFS files into a single file on the local machine. For programmatic file operations, the main classes live in the org.apache.hadoop.fs package.
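
As a rough sketch of what the org.apache.hadoop.fs API looks like in code (the paths used here are illustrative, not from the original text), a small program can copy a local file into HDFS and list a directory, roughly what the -put and -ls commands above do:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutAndList {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml from the classpath to locate the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the user's HDFS directory (illustrative paths).
        fs.copyFromLocalFile(new Path("/home/student/example.txt"),
                             new Path("/user/chuck/example.txt"));

        // List the directory, similar to "hadoop fs -ls /user/chuck".
        for (FileStatus status : fs.listStatus(new Path("/user/chuck"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        fs.close();
    }
}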


After the input data has been distributed to different nodes, the only time nodes exchange data is the shuffle stage. Restricting inter-node communication to this single stage is a big help for scalability.

MapReduce serializes its key/value pairs, so only classes that can be serialized may act as keys or values in the framework. A class that implements the Writable interface can be a value; a class that implements the WritableComparable<T> interface can be a key or a value, because keys must also be comparable. A number of predefined classes, such as Text and IntWritable, implement WritableComparable.

Implementing such a class means defining how the data is read, how it is written, and how two instances are compared.
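
A minimal sketch of a custom key type, assuming an Edge-like record with departure and arrival fields (the class and field names are illustrative, not from the original text). It shows the three pieces just mentioned: readFields() for reading, write() for writing, and compareTo() for comparison:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative key type: a route with a departure and an arrival city.
public class Edge implements WritableComparable<Edge> {
    private String departure;
    private String arrival;

    // How to read the data: deserialize the fields from the input stream.
    public void readFields(DataInput in) throws IOException {
        departure = in.readUTF();
        arrival = in.readUTF();
    }

    // How to write the data: serialize the fields to the output stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(departure);
        out.writeUTF(arrival);
    }

    // How to compare the data: order by departure, then by arrival.
    public int compareTo(Edge other) {
        int cmp = departure.compareTo(other.departure);
        return cmp != 0 ? cmp : arrival.compareTo(other.arrival);
    }

    public String getDeparture() { return departure; }
}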

The first phase is the mapper. To act as a mapper, a class needs to extend the MapReduceBase base class and implement the Mapper interface.

The constructor-like method void configure(JobConf job) extracts parameters from the XML configuration files or from the application's main class; it is called before any data processing.

The destructor-like method void close() is the last call before the mapper ends; it finishes remaining work such as closing database connections or open files.

The Mapper interface itself declares a single method, map(), which handles one key/value pair at a time.
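
A minimal mapper sketch in the old org.apache.hadoop.mapred API described here, assuming a simple word-count style job for illustration; the class extends MapReduceBase, implements Mapper, and processes one key/value pair per map() call:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once before processing; pull parameters out of the job config.
    public void configure(JobConf job) {
        // e.g. read a custom setting with job.get("...") if the job defines one
    }

    // Handles a single key/value pair: the key is the byte offset of the
    // line and the value is the line of text.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }

    // Called once at the end; release anything opened in configure().
    public void close() throws IOException {
    }
}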

The reduce() function iterates over the values associated with a given key and generates a (possibly empty) output list.
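
A matching reducer sketch in the same old API (again a word-count style example purely for illustration); reduce() iterates over all values for one key and emits the result:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Iterates over all values associated with one key and produces a
    // (possibly empty) list of output pairs -- here a single sum.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}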

There is an extremely important step between map and reduce: routing the mapper output to the different reducers. This is the partitioner's job.

Multiple reducers provide parallel computation. The default policy is to hash the key to decide which reducer gets it; Hadoop enforces this with HashPartitioner. Sometimes, however, the default does the wrong thing.

Take the routes (Shanghai, Beijing) and (Shanghai, Guangzhou). If the whole pair is the key, the default partitioner can send these two records to different reducers even though both depart from Shanghai; the work for the Shanghai departure is then done twice, once for the Beijing pair and once for the Guangzhou pair, which is redundant.

In that case we should tailor the partitioner: hash only the departure city, so routes with the same departure go to the same reducer.

A partitioner needs to implement the configure() function (which applies the Hadoop job configuration to the partitioner) and the getPartition() function (which returns an integer between 0 and the number of reduce tasks, identifying the reducer the key/value pair is sent to).

In other words, the partitioner determines which reducer each key is placed on; a sketch follows below.
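
A sketch of such a custom partitioner in the old API, reusing the illustrative Edge key from the earlier sketch and hashing only the departure city so that routes with the same departure land on the same reducer:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative: routes with the same departure city go to the same reducer.
public class EdgePartitioner implements Partitioner<Edge, Writable> {

    // Apply the job configuration to the partitioner (nothing needed here).
    public void configure(JobConf job) {
    }

    // Returns an integer between 0 and numPartitions - 1, identifying the
    // reducer this key/value pair is sent to.
    public int getPartition(Edge key, Writable value, int numPartitions) {
        return (key.getDeparture().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
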
HDFS works best when many small files are combined into one large file for processing, which is more efficient. One principle of MapReduce processing is to split the input data into chunks that can be processed in parallel on multiple machines; in Hadoop terminology these are input splits. Splits should be small enough to give fine-grained parallelism, but they cannot be too small.

FSDataInputStream extends java.io.DataInputStream to support random-access reads. MapReduce needs this because a machine may be assigned a split that starts in the middle of an input file; without random access it would have to read from the beginning of the file all the way to the split's position.
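
A minimal sketch of what that random access looks like (the path and offset are purely illustrative): FSDataInputStream lets a reader seek() straight to a split's starting offset instead of scanning from the beginning of the file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekIntoSplit {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Open the file and jump directly to the split's starting offset
        // (64 MB here, chosen only for illustration) -- the bytes before
        // it never need to be read.
        FSDataInputStream in = fs.open(new Path("/user/chuck/pg20417.txt"));
        long splitStart = 64L * 1024 * 1024;
        in.seek(splitStart);

        byte[] buffer = new byte[4096];
        int read = in.read(buffer);
        System.out.println("Read " + read + " bytes starting at " + splitStart);

        in.close();
        fs.close();
    }
}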

HDFS is designed to store data that will be split up and processed by MapReduce. HDFS stores files as blocks distributed across multiple machines, and each block can serve as a split. If each split/block is processed by the machine it resides on, parallelism comes automatically; and because multiple nodes hold replicas of each block, reliability is achieved and MapReduce is free to choose any node that holds a copy of the split/block.

The input split is a logical division of the input data, while the HDFS block is a physical one. When the two coincide, processing is very efficient. In practice they never align perfectly: records may cross block boundaries, so a node processing a particular split may have to fetch the fragment of a record that lives in another block.

