"Book serial" MapReduce is a kind of programming model that can be used in data processing. The model is simple, but it is not easy to write useful programs. Hadoop can run MapReduce programs written in various languages. In this chapter, we'll see the same program written in Java, Ruby, Python, and C + + languages. Most importantly, the MapReduce program is essentially running in parallel, so large data analysis tasks can be delegated to any operator with enough machines. The advantage of MapReduce is to deal with large datasets, so let's look at a dataset first.
A meteorological dataset
For our example, we will write a program that mines weather data. Weather sensors at many locations across the globe collect data every hour and accumulate a large volume of log data. Because this data is semi-structured and record-oriented, it is a good candidate for analysis with MapReduce.
The format of the data
The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/). The data is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and are of fixed width.
Example 2-1 shows a sample line with some of the salient fields annotated. The line has been split into multiple lines to show each field; in the real file, the fields are packed into one line with no delimiters.
Example 2-1. Format of a National Climatic Data Center record
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
         # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
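Because the fields are packed with no delimiters, a record has to be read by position. The following Python sketch shows one way to slice the leading fields of the sample record into named values. It rests on assumptions: the field widths are inferred from the sample values in Example 2-1 rather than taken from the NCDC format documentation, the names beginning with an underscore are placeholders for fields the example leaves unlabeled, and parse_leading_fields is a hypothetical helper, not part of Hadoop.

```python
# (name, width) pairs for the leading fields, in record order.
# ASSUMPTION: widths are inferred from the sample values in Example 2-1.
# Names starting with "_" are placeholders for unlabeled fields.
LEADING_FIELDS = [
    ("_header", 4),
    ("usaf_id", 6),
    ("wban_id", 5),
    ("date", 8),
    ("time", 4),
    ("_unlabeled_1", 1),
    ("latitude", 6),
    ("longitude", 7),
    ("_unlabeled_2", 5),
    ("elevation", 5),
]


def parse_leading_fields(line):
    """Slice the start of a fixed-width record into named string fields."""
    fields = {}
    offset = 0
    for name, width in LEADING_FIELDS:
        fields[name] = line[offset:offset + width]
        offset += width
    return fields


# The leading fields of Example 2-1, joined back into one undelimited string.
sample = (
    "0057" "332130" "99999" "19500101" "0300"
    "4" "+51317" "+028783" "FM-12" "+0171"
)

record = parse_leading_fields(sample)
latitude = int(record["latitude"]) / 1000    # stored as degrees x 1000
longitude = int(record["longitude"]) / 1000  # stored as degrees x 1000
elevation = int(record["elevation"])         # stored in meters
print(record["usaf_id"], record["date"], latitude, longitude, elevation)
```

Driving the slicing from a table of widths keeps the layout in one place, so extending the sketch to later fields such as air temperature only means appending more (name, width) pairs once their widths are confirmed.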
Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Because there are tens of thousands of weather stations, the whole dataset is made up of a large number of relatively small files. It is generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file. See Appendix C for the details of how this was done.
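To make the idea of stitching concrete, here is a minimal local sketch in Python. It is not the procedure from Appendix C; it only illustrates the concatenation step on a single machine. The function name stitch_year and the output location are hypothetical, and the sketch assumes the per-station files for one year sit in one directory, as in the raw/1990 listing above.

```python
import gzip
import shutil
from pathlib import Path


def stitch_year(year_dir, out_path):
    """Concatenate every station .gz file in year_dir into one gzip file.

    Returns the number of station files that were combined.
    """
    year_dir = Path(year_dir)
    out_path = Path(out_path)
    # Collect the inputs before the output file is created.
    station_files = sorted(year_dir.glob("*.gz"))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(out_path, "wb") as out:
        for station_file in station_files:
            # Decompress each station file and append its records.
            with gzip.open(station_file, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(station_files)
```

A call such as stitch_year("raw/1990", "all/1990.gz") would produce one compressed file for the year, where the all directory is only an illustrative choice. Writing the output outside the input directory keeps a rerun from picking up the combined file as if it were a station file.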