SpatialHadoop: Effective analysis of your spatial data

First, Introduction

SpatialHadoop is an open-source MapReduce extension designed to process spatial data on Apache Hadoop clusters. SpatialHadoop has a built-in high-level spatial language, spatial data types, spatial indexes, and efficient spatial operations.

Second, Installation and Configuration of SpatialHadoop

1. Configuring SpatialHadoop

SpatialHadoop is designed to run generically on any configured Hadoop cluster. SpatialHadoop has been tested on Hadoop 1.2.1, but it also supports other Hadoop releases.

First of all, you need a configured Hadoop cluster. You can then add the SpatialHadoop classes and configuration files to that cluster so that the new SpatialHadoop commands can be used. The following steps show how to install SpatialHadoop on a configured Hadoop cluster.

A. Download SpatialHadoop

Download SpatialHadoop via the following link:

http://spatialhadoop.cs.umn.edu/spatialhadoop-2.1-hadoop-1.2.1-bin.tar.gz

B. Configure it on the cluster

Unzip the downloaded SpatialHadoop archive into a local directory and add the local JAVA_HOME installation path to conf/hadoop-env.sh.
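For example, the JAVA_HOME line added to conf/hadoop-env.sh usually looks like the following; the actual JDK path depends on your machine, and the one shown here is only an assumption:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64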

2. SpatialHadoop Virtual Machine

For convenience, we can instead download the SpatialHadoop virtual machine and import it into VirtualBox; the virtual machine already has SpatialHadoop configured, so we can use it directly. The steps to install the virtual machine are as follows:

A. Click the link below to download the latest VirtualBox:

https://www.virtualbox.org/wiki/Downloads

B. Click the link below to download the latest SpatialHadoop virtual machine:

http://spatialhadoop.cs.umn.edu/SpatialHadoop-vm-2.1.ova

C. Install VirtualBox, click "Management" in the upper-left corner, select the option to import a virtual appliance, and import the downloaded SpatialHadoop virtual machine.

D. Start the SpatialHadoop virtual machine. The username and password for the virtual machine are both: shadoop

E. SpatialHadoop is located in the '~/spatialhadoop-*' directory of the virtual machine. Now we can run SpatialHadoop in pseudo-distributed mode.

Third, Start SpatialHadoop and Run an Example

1. Change the default directory name for SpatialHadoop

In Ubuntu, open a terminal, enter the SpatialHadoop directory, and start the SpatialHadoop cluster from there.

For the sake of convenience, we rename the default SpatialHadoop directory with the following command:
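A minimal sketch of such a rename, using the '~/spatialhadoop-*' path mentioned above (assuming only one matching directory exists), would be:

mv ~/spatialhadoop-* ~/shadoop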

Note: After renaming the SpatialHadoop directory to shadoop, we need to configure the hdfs-site.xml and mapred-site.xml files under the shadoop/conf folder.

In the ~/shadoop/conf directory, we edit hdfs-site.xml with the following command:

sudo gedit hdfs-site.xml

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/shadoop/shadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/shadoop/shadoop/hdfs/name</value>
  </property>
</configuration>

Still in the ~/shadoop/conf directory, we edit mapred-site.xml with the following command:

sudo gedit mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/shadoop/shadoop/hdfs/mapred</value>
  </property>
</configuration>


2. Start the SpatialHadoop cluster

Go to the ~/shadoop directory and enter the command bin/start-all.sh to start the cluster. Then enter the jps command to check whether the five daemons started successfully.
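For reference, on a Hadoop 1.x pseudo-distributed cluster the five daemons you should see in the jps output (besides the Jps process itself) are: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.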


3. Simple example

Once SpatialHadoop is configured, we can run some examples to get familiar with its features. Below we will generate a random file, index it with a grid index, and then run some spatial queries on the indexed file. The classes required for this example are contained in the spatialhadoop-*.jar, and you can use the "bin/shadoop" command to invoke them.
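If bin/shadoop behaves like the standard Hadoop example drivers (an assumption, not something stated in this article), running it without arguments should print the list of available operations:

bin/shadoop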

A. Generate a random file containing randomly generated rectangles by entering the following command:

bin/shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect

This command generates a 1 GB file named "test", in which all the rectangles are contained in a rectangle with a corner at (0,0) and a width and height of 1,000,000 each.
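To confirm that the generated file exists, you can list it in HDFS (assuming, as is the default, that it was written under the current user's HDFS home directory):

bin/hadoop fs -ls test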

B. Index the file with a grid index by entering the following command:

bin/shadoop index test test.grid mbr:0,0,1000000,1000000 sindex:grid shape:rect

C. See how the grid index partitions the file:

bin/shadoop readfile test.grid

Executing this command shows how many partitions the file was divided into, along with the boundaries of each partition.

D. Perform a range query on the file:

bin/shadoop rangequery test.grid rq_results rect:500,500,1000,1000

The command above performs a range query whose query rectangle has a corner at (500,500) and a width and height of 1000 each; the results are saved in the rq_results folder on HDFS.
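To inspect the query results, you can list the output folder and print its contents; the part-* file naming below is the standard MapReduce output convention, an assumption rather than something stated in the article:

bin/hadoop fs -ls rq_results

bin/hadoop fs -cat rq_results/part-*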

E. Perform a kNN query on the file:

bin/shadoop knn test.grid knn_results point:1000,1000 k:1000

The query point for this kNN query is (1000,1000) with k=1000, and the results are stored in the knn_results folder on HDFS.

F. Perform a spatial join operation on the files:

First, generate another file that is also indexed with a grid index:

bin/shadoop generate test2.grid mbr:0,0,1000000,1000000 size:100.mb sindex:grid

Now, join the two files using the distributed join algorithm with the following command:

bin/shadoop dj test.grid test2.grid sj_results

Fourth, Running the Visualization Interface Example

First enter the SpatialHadoop directory and start the cluster with "bin/start-all.sh", then use "jps" to check whether the five daemons started successfully. As shown in the figure:


To operate on the Hadoop cluster, you need to turn off Hadoop's safe mode with the "bin/hadoop dfsadmin -safemode leave" command. As shown in the figure:


We put the dataset to be processed in the Downloads directory under the home folder. As shown in the figure:


We create a folder in the root of HDFS to store the dataset we need to process: use the "bin/hadoop fs -mkdir /parks" command to create the 'parks' folder, and use "bin/hadoop fs -ls /" to check whether the folder was created. As shown in the figure:

We then upload the 'parks.tsv.bz2' dataset from the Downloads directory to the 'parks' folder in the HDFS root directory with "bin/hadoop fs -put ~/Downloads/parks.tsv.bz2 /parks", and then use "bin/hadoop fs -ls /parks" to check whether the dataset was uploaded successfully. As shown in the figure:



We can also open "localhost:50070" in a browser to view the basic status of HDFS, as shown in the following figure:


We can also click on "Browse the filesystem" to view the files stored in HDFS.

We then enter 'localhost:50070/visualizer.jsp' in the browser to open the data-processing visualization interface, as shown in the figure:


Click on the 'parks' folder on the left and click the 'Preprocess' button to start processing the file. Processing the dataset takes two steps: a FileMBR operation on the dataset and a Plot operation on the dataset. We can go to the 'localhost:50030' page to track the two jobs. When the following interface appears, both jobs have completed.


Then we return to the 'localhost:50070/visualizer.jsp' interface and click on the 'parks' folder to see the visualized dataset, as shown in the figure:


To learn more, you can visit:

http://spatialhadoop.cs.umn.edu/index.html
