Hadoop NCDC Data Download Method

Source: Internet
Author: User
Tags install homebrew

I was reading "Hadoop: The Definitive Guide", which uses NCDC weather data as its sample dataset. The download link the book provides only covers two years, 1901 and 1902, which is far too little to count as "big data". So here is a way to get the weather data samples from 1901 through 2014. The NCDC website hosts these files (its address appears in the script below), but now there is the opposite problem: each file is only a few dozen KB, yet there are far too many of them to download by hand. Fortunately, a few lines of shell script solve this easily.

Preparation: On Ubuntu or Debian, wget is usually already installed. If it is not, install it with the following command:

    sudo apt-get install wget

Mac OS X does not ship with wget. There are two ways to install it. One is to download the source package and build it yourself, but it has quite a few dependencies that also need to be compiled, which is tedious. The second method is therefore recommended. First install Homebrew with the following command:

    ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

After the installation is complete, install wget with the following command:

    brew install wget
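Whichever route was used, a quick check along these lines can confirm that wget is on the PATH. This is only a sketch; `have_cmd` is a hypothetical helper name, not something from the original post.

```shell
#!/bin/bash
# have_cmd is a hypothetical helper (not from the original post).
# It succeeds if the named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd wget; then
    echo "wget is available"
else
    echo "wget not found; install it first"
fi
```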

Once wget is installed, enter the following statements in the terminal. You can also save them as a shell script, make it executable, and run it, but typing them in directly is recommended:

    #!/bin/bash
    for i in {1901..2014}
    do
        cd /users/guo/documents/ncdc
        # Mirror one year's directory; -R index.html skips the index page
        wget --execute robots=off -r -np -nH --cut-dirs=3 -R index.html http://ftp3.ncdc.noaa.gov/pub/data/noaa/isd-lite/$i/
        cd isd-lite/$i
        # Copy the archives into a per-year folder, then remove the download tree
        mkdir -p /users/guo/documents/ncdc/files/$i
        cp *.gz /users/guo/documents/ncdc/files/$i
        cd /users/guo/documents/ncdc
        rm -r isd-lite/
    done

A brief explanation: the variable i runs from 1901 to 2014, and the loop body is executed once for each value. "/users/guo/documents/ncdc" and "/users/guo/documents/ncdc/files" are folders I created myself to hold the data, so create your own folders and substitute their paths. "$i" refers to the current value of the variable i.
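To see how the loop variable ends up in each download address without fetching anything, a dry run like the following can help. It is a sketch only: the base URL is taken from the script above, while `BASE_URL` and `year_url` are hypothetical names introduced for illustration, and the range is shortened to three years.

```shell
#!/bin/bash
# Dry run: print the URL that would be fetched for each year, without downloading.
# BASE_URL comes from the article's script; year_url is a hypothetical helper.
BASE_URL="http://ftp3.ncdc.noaa.gov/pub/data/noaa/isd-lite"

year_url() {
    echo "$BASE_URL/$1/"
}

# {1901..1903} is bash brace expansion: it expands to 1901 1902 1903.
for i in {1901..1903}
do
    year_url "$i"
done
```

Widening the range to {1901..2014} yields one URL per year, 114 in total.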

I have also improved one unfriendly aspect: each year's files now go into a folder of their own. Previously I put everything into a single folder, and after downloading a few decades of data, opening that folder made Finder hang for several minutes before it recovered, and that was on a solid-state drive! With one folder per year, the data is much easier to manage.
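With one folder per year, a per-year file count is easy to produce. The sketch below assumes the `files/<year>/` layout described above; `count_gz` and the `DATA_DIR` variable are hypothetical names, and the default path is simply the author's example folder.

```shell
#!/bin/bash
# count_gz is a hypothetical helper: print the number of .gz files
# located directly inside the directory given as $1.
count_gz() {
    find "$1" -maxdepth 1 -type f -name '*.gz' | wc -l | tr -d ' '
}

# DATA_DIR is an assumption; it defaults to the author's example folder.
DATA_DIR="${DATA_DIR:-/users/guo/documents/ncdc/files}"

for year_dir in "$DATA_DIR"/*/
do
    # Skip the unexpanded pattern when DATA_DIR is empty or missing.
    [ -d "$year_dir" ] || continue
    echo "$(basename "$year_dir"): $(count_gz "$year_dir") files"
done
```

Because `find` streams the names to `wc`, this keeps working even for years with a very large number of files.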

After several days of repeated downloading and deleting, the download still had not completed. I found that for years with a very large number of files, for example more than ten thousand, bash reports an "Argument list too long" error: the year folder holds so many files that expanding *.gz exceeds the system's limit on command-line argument length. As a result, the downloaded files cannot be copied into that year's folder under files, and the downloaded copies cannot be deleted either. There are ways to work around this, but they seemed like too much trouble. Then I thought: why copy at all? Simply keep the folders that wget creates, since each year is already stored separately, and have wget skip the index pages so that only the *.gz files are kept. That gives the latest and simplest version:

    #!/bin/bash
    for i in {1901..2014}
    do
        cd ~/noaa/
        # --cut-dirs=4 leaves one folder per year; -R rejects the index pages
        wget --execute robots=off -r -np -nH --cut-dirs=4 -R "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/isd-lite/$i/
    done
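For readers who would rather keep the copy step from the first script, the "Argument list too long" error can be sidestepped by letting `find` hand the files to `cp` instead of expanding `*.gz` on one command line. This is a sketch of one such workaround, not the approach the author settled on; `copy_gz` and its two directory arguments are hypothetical.

```shell
#!/bin/bash
# copy_gz is a hypothetical helper: copy every .gz file found directly
# inside directory $1 into directory $2 (created if missing).
# find runs cp once per file, so the shell never builds a huge argument list.
copy_gz() {
    copy_src="$1"
    copy_dest="$2"
    mkdir -p "$copy_dest"
    find "$copy_src" -maxdepth 1 -type f -name '*.gz' -exec cp {} "$copy_dest"/ \;
}

# Small demonstration with temporary directories (no network needed).
tmp=$(mktemp -d)
mkdir -p "$tmp/isd-lite/1901"
touch "$tmp/isd-lite/1901/a.gz" "$tmp/isd-lite/1901/b.gz" "$tmp/isd-lite/1901/index.html"
copy_gz "$tmp/isd-lite/1901" "$tmp/files/1901"
ls "$tmp/files/1901"
rm -rf "$tmp"
```

Running one `cp` per file is slower than a single bulk copy, but the cost is small next to the download time, and it works no matter how many files a year contains.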

Reposted from http://blog.csdn.net/lzslywl/article/details/26678731
