Learn Spark (8): A comprehensive Spark RDD exercise with teacher Tian Qi

Source: Internet
Author: User
Comprehensive exercise: Calculate home addresses and work addresses from base station information

Requirements: Calculate the location of a cell phone from its signal.
When a phone is switched on, it establishes a connection with a nearby base station, and each connection and disconnection is recorded in a log on the server.
So even if the phone has neither mobile data nor GPS turned on, it can still be located. Each base station covers a certain radiation range,
and signals are divided into different classes, such as 2G, 3G, and 4G.

Although we do not know where a mobile phone user is, we do know where each base station is. Once a user enters a base station's
radiation range, the phone connects to that base station, so we can calculate the user's approximate position. We can then use
this location information for recommendations and ads, for example nearby merchants, or products and services the user might like.

Suppose we now have some location data: a cell phone number, a connection flag (such as 1), a disconnection flag (such as 0),
the timestamp at which the connection was established, the timestamp at which it was dropped, and so on. The disconnection time minus the connection time is how long the user stayed
at the base station. This calculation is not very robust, though, because in practice a user may stay for several days, or a connection may be recorded
with no matching disconnection. That is why there is actually a concept of a session.
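
The subtraction described above can be sketched in plain Scala. This is only an illustration: the object name `DwellTime` is hypothetical, and it assumes timestamps are written as `yyyyMMddHHmmss` strings and that every connection has a matching disconnection.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of the original program.
object DwellTime {
  // Assumed timestamp layout, e.g. 20160327082400
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Seconds between a connect timestamp and a disconnect timestamp
  def seconds(connect: String, disconnect: String): Long = {
    val start = LocalDateTime.parse(connect, fmt)
    val end = LocalDateTime.parse(disconnect, fmt)
    Duration.between(start, end).getSeconds
  }
}
```

For example, `DwellTime.seconds("20160327082400", "20160327170000")` covers 08:24 to 17:00, that is 8 hours 36 minutes. Parsing the timestamp first matters: subtracting the raw numbers directly gives 87600, which is not a number of seconds.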

In fact, a phone does not stay connected to a base station indefinitely; the connection may be dropped automatically at regular intervals, for example once a day.
Each base station has a base station ID, which is a UUID. There may therefore also be a base station table that holds base station details, such as the base station ID and its longitude and latitude.

We join the two tables to find out how long a user stayed under each base station.

For now we set the session ID concept aside. We simply rank, for each user, the time spent at each base station from high to low within a daytime period and a night-time period.
For example, between 8 a.m. and 6 p.m., we can treat the place where the user stays longest as the workplace. Conversely, from 6 p.m. to 8 a.m. the next day,
we treat the place where the user stays longest as the residence.
Knowing a user's workplace and residence, we can make some recommendations.
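
The day and night split could be sketched as follows. The object `TimeWindow` and the labels "work" and "home" are hypothetical names chosen for illustration, and the sketch assumes the daytime window is 08:00 (inclusive) to 18:00 (exclusive).

```scala
import java.time.LocalTime

// Hypothetical helper: labels a time of day as working hours or home hours.
object TimeWindow {
  private val workStart = LocalTime.of(8, 0)
  private val workEnd = LocalTime.of(18, 0)

  // "work" for times in [08:00, 18:00), "home" for everything else
  def classify(t: LocalTime): String =
    if (!t.isBefore(workStart) && t.isBefore(workEnd)) "work" else "home"
}
```

A record could then be tagged with `TimeWindow.classify(...)` before the dwell times are summed, so that the longest stay is ranked separately for each window.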

Another problem is that a user may pass through dozens or even hundreds of base stations in a single day. How do we know which base station the user stays under longest?
A further problem is that a user may pass the same base station more than once. Suppose there is a base station between a user's company and home: he passes it in the morning on the way to work,
passes it again when he goes home at noon, and once more after work in the evening. He thus enters the same base station's range several times, and many records for him are written to that base station's server log.

To calculate which base station a user stays under longest, we only need to split the data, sum the times, and then join the result with the base station table.

To make this easier to follow, we simulated some simple log data with 4 fields: mobile number, timestamp, base station ID, and connection type (1 means a connection is established, 0 means it is dropped):

Base Station A:
    18688888888,20160327082400,16030401eafb68f1e3cdf819735e1c66,1
    18611132889,20160327082500,16030401eafb68f1e3cdf819735e1c66,1
    18688888888,20160327170000,16030401eafb68f1e3cdf819735e1c66,0
    18611132889,20160327180000,16030401eafb68f1e3cdf819735e1c66,0

Base Station B:
    18611132889,20160327075000,9f36407ead0629fc166f14dde7970f68,1
    18688888888,20160327075100,9f36407ead0629fc166f14dde7970f68,1
    18611132889,20160327081000,9f36407ead0629fc166f14dde7970f68,0
    18688888888,20160327081300,9f36407ead0629fc166f14dde7970f68,0
    18688888888,20160327175000,9f36407ead0629fc166f14dde7970f68,1
    18611132889,20160327182000,9f36407ead0629fc166f14dde7970f68,1
    18688888888,20160327220000,9f36407ead0629fc166f14dde7970f68,0
    18611132889,20160327230000,9f36407ead0629fc166f14dde7970f68,0

Base Station C:
    18611132889,20160327081100,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327081200,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327081900,cc0710cc94ecc657a8561de549d940e0,0
    18611132889,20160327082000,cc0710cc94ecc657a8561de549d940e0,0
    18688888888,20160327171000,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327171600,cc0710cc94ecc657a8561de549d940e0,0
    18611132889,20160327180500,cc0710cc94ecc657a8561de549d940e0,1
    18611132889,20160327181500,cc0710cc94ecc657a8561de549d940e0,0

The following is the base station table data, with 4 fields: base station ID, longitude, latitude, and signal type (such as 2G, 3G, or 4G):
    9f36407ead0629fc166f14dde7970f68,116.304864,40.050645,6
    cc0710cc94ecc657a8561de549d940e0,116.303955,40.041935,6
    16030401eafb68f1e3cdf819735e1c66,116.296302,40.032296,6
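
As a rough sketch of how this table can be joined with per-station dwell times, the plain Scala below parses the rows into a lookup map and performs an inner join. The object `StationJoin` and its method names are hypothetical; the sketch assumes the second and third fields are longitude and latitude and trims stray whitespace around each field.

```scala
// Hypothetical helper, plain Scala collections rather than Spark RDDs.
object StationJoin {
  // Parse "id,longitude,latitude,type" rows into id -> (longitude, latitude)
  def parseStations(rows: Seq[String]): Map[String, (Double, Double)] =
    rows.map { row =>
      val f = row.split(",").map(_.trim)
      (f(0), (f(1).toDouble, f(2).toDouble))
    }.toMap

  // Inner join: records whose station ID is not in the table are dropped.
  // Input records are (mobile, stationId, seconds).
  def join(
      dwell: Seq[(String, String, Long)],
      stations: Map[String, (Double, Double)]
  ): Seq[(String, Long, Double, Double)] =
    dwell.flatMap { case (mobile, id, secs) =>
      stations.get(id).map { case (lon, lat) => (mobile, secs, lon, lat) }
    }
}
```

Because the join matches on the exact ID string, a stray space or a differently cased character in an ID would silently drop the record, which is why the fields are trimmed while parsing.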

Based on the log data of the 3 base stations above, the task is to calculate, for each user, the two locations where the user stayed longest within a day.
A mobile phone number may pass through many base stations in a day: the user may stay at home for 10 hours and at the company for 8 hours, and may merely drive past some base stations on the way.
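
Picking the two longest stays per user might look like the sketch below, again in plain Scala. `TopLocations` and `topN` are hypothetical names, and the input is assumed to be already aggregated into (mobile, base station ID, total seconds) records.

```scala
// Hypothetical helper: keeps the n longest stays for each mobile number.
object TopLocations {
  // records: (mobile, stationId, totalSeconds)
  def topN(
      records: Seq[(String, String, Long)],
      n: Int
  ): Map[String, Seq[(String, Long)]] =
    records
      .groupBy(_._1)
      .map { case (mobile, rs) =>
        // Sort this user's stations by total time, longest first
        (mobile, rs.map(r => (r._2, r._3)).sortBy(r => -r._2).take(n))
      }
}
```

Working the sample logs out by hand for 18688888888 gives 30960 seconds at base station A, 16320 at B, and 780 at C, so `topN(records, 2)` on those three records keeps A and B.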

Ideas:
Find the base station under which each mobile phone number stays longest. In the calculation, use "mobile phone number + base station" as the key to pin down the time spent under each base station,
because each base station holds log data for many users.
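
The keying idea can be illustrated without Spark. In the sketch below, each connect record contributes its time with a negative sign and each disconnect record with a positive sign, so summing per (mobile number, base station) key yields the total stay. The names `DwellAggregation` and `totals` are hypothetical, and the sketch assumes `yyyyMMddHHmmss` timestamps and that every connection has a matching disconnection.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, plain Scala collections rather than Spark RDDs.
object DwellAggregation {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Convert a timestamp to epoch seconds (UTC assumed; only differences matter)
  private def toEpoch(ts: String): Long =
    LocalDateTime.parse(ts, fmt).toEpochSecond(ZoneOffset.UTC)

  // Each line: mobile,timestamp,stationId,type (1 = connect, 0 = disconnect)
  def totals(lines: Seq[String]): Map[(String, String), Long] =
    lines
      .map { line =>
        val f = line.split(",").map(_.trim)
        val t = toEpoch(f(1))
        // connect => -t, disconnect => +t, so each pair sums to its duration
        val signed = if (f(3) == "1") -t else t
        ((f(0), f(2)), signed)
      }
      .groupBy(_._1)
      .map { case (key, vs) => (key, vs.map(_._2).sum) }
}
```

The same shape maps onto an RDD as a `map` to key-value pairs followed by `reduceByKey(_ + _)`. An unmatched connect or disconnect would leave a huge positive or negative total, so real logs would need the session handling mentioned earlier.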


The country has a great many base stations, and each telecom branch is only responsible for processing its own data. The data is stored on servers in the machine room beneath the base station.
Tools are commonly used to collect this data over the network. The volume collected can be very large,
so it is typically stored in a distributed file system such as HDFS.

We may run the calculation over a week or a month of data; the longer the time span, the more accurate the computed result.

The relevant information is in the "Spark profile".

Important: once a Spark program is written, if you do not want to submit it to the Spark cluster every time, you can
specify local mode in the program itself, as follows:
    new SparkConf().setAppName("xxxx").setMaster("local")
This runs the program locally in a simulated environment without submitting it to the cluster.
This works without problems on Linux and Mac, but throws exceptions on Windows.
Because our Spark program reads data from HDFS, it uses Hadoop's InputFormat to
read the data, so some extra setup is needed before you can debug locally on Windows.
Hadoop compresses and decompresses data, and that requires libraries written in C or C++.
Libraries written in C or C++ are not cross-platform, so to debug on Windows you must first install them.
We recommend debugging on Linux: if you do not have a Mac, you can install the IDEA development tool on a Linux virtual machine
and debug through the Linux graphical interface.


The following is the complete code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object MobileLocation {

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MobileLocation").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // args(0): path to the base station log files
        val lines: RDD[String] = sc.textFile(args(0))
        // Split each line
        // lines.map(_.split(",")).map(arr => (arr(0), arr(1).toLong, arr(2), arr(3)))
        val splited = lines.map(line => {
          val fields = line.split(",")
          val mobile = fields(0)
          val lac = fields(2)
          val tp = fields(3)
          // Connect records count as negative, disconnect records as positive.
          // The raw yyyyMMddHHmmss numbers are used directly, so the sum is a
          // relative score for ranking, not an exact duration in seconds.
          val time = if (tp == "1") -fields(1).toLong else fields(1).toLong
          // Assemble the record: ((mobile, base station), time)
          ((mobile, lac), time)
        })
        // Group and aggregate by (mobile, base station)
        val reduced: RDD[((String, String), Long)] = splited.reduceByKey(_ + _)
        val lmt = reduced.map(x => {
          // (base station, (mobile phone number, time))
          (x._1._2, (x._1._1, x._2))
        })
        // args(1): path to the base station table
        val lacInfo: RDD[String] = sc.textFile(args(1))
        // Parse the base station data into (id, (x, y))
        val splitedLacInfo = lacInfo.map(line => {
          val fields = line.split(",")
          val id = fields(0)
          val x = fields(1)
          val y = fields(2)
          (id, (x, y))
        })
        // Join on base station ID
        val joined: RDD[(String, ((String, Long), (String, String)))] = lmt.join(splitedLacInfo)
        println(joined.collect().toBuffer)
        sc.stop()
      }
    }

