Learn Spark with Teacher Tian Qi (8) -- Spark RDD Comprehensive Exercise

Last Update:2018-07-25 Source: Internet

Author: User

Tags commit session id split spark rdd

Comprehensive exercise: Calculate home addresses and work addresses from base station information

Requirement: determine the location of a mobile phone from its signal.
When a phone is switched on, it connects to a nearby base station, and every connection and disconnection is recorded in a log on the server.
So even if the phone has neither mobile data nor GPS turned on, its location can still be determined. Each base station covers a certain radiation range,
and the signal is classified into levels such as 2G, 3G, and 4G.

We do not know where a mobile phone user is, but we do know where each base station is. Once a user enters a base station's
radiation range, the phone connects to that base station, so we can calculate the user's approximate position. With this
location information we can make recommendations and serve ads, for example for a nearby merchant, or for a product or service the user might like.

Suppose we have some location data: a phone number, a connection flag (such as 1), a disconnection flag (such as 0),
the timestamp of the connection, the timestamp of the disconnection, and so on. The disconnection time minus the connection time is how long the user stayed
at that base station. This calculation is not entirely reliable, however, because in practice a user may stay for several days, or a connection may be recorded
with no matching disconnection. That is why there is in fact a concept of a session.
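
The subtraction described above can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes timestamps are strings in `yyyyMMddHHmmss` form, such as `20160327082400` (the format used in the sample logs later in this article), and the names `StayDuration` and `staySeconds` are hypothetical.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object StayDuration {
  // Assumed timestamp layout: yyyyMMddHHmmss, e.g. 20160327082400.
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  /** Seconds between a connect timestamp and the matching disconnect timestamp. */
  def staySeconds(connect: String, disconnect: String): Long = {
    val start = LocalDateTime.parse(connect.trim, fmt)
    val end = LocalDateTime.parse(disconnect.trim, fmt)
    ChronoUnit.SECONDS.between(start, end)
  }
}
```

For a connection at 08:24:00 and a disconnection at 17:00:00 on the same day, `staySeconds` returns 30960 (8 hours 36 minutes). Subtracting the two raw numbers instead gives 87600, which is not a duration in any unit, so parsing the timestamp first is the safer choice when exact times matter.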

In fact, a connection to a base station is not kept open indefinitely; it may be dropped automatically at regular intervals, for example once a day.
Each base station has a base station ID, which is a UUID. There may therefore also be a base station table related to the logs, holding for example the base station ID and its longitude and latitude.

We need to join the two tables to find out how long a user stays under each base station.

Here we leave session IDs aside for now. We only rank, for each user, the base stations from longest to shortest stay within a daytime or nighttime window.
For example, between 8:00 and 18:00, we take the place where the user stays longest to be the workplace. Conversely, between 18:00 and 8:00 the next morning,
we take the place where the user stays longest to be the residence.
Knowing a user's workplace and residence, we can make some recommendations.
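
A small sketch of the day/night split described above. It assumes the `yyyyMMddHHmmss` timestamp format of the sample logs and treats 8:00 up to (but not including) 18:00 as the work window; the boundary handling and the names `TimeWindow`, `Work`, `Home`, and `periodOf` are illustrative assumptions.

```scala
object TimeWindow {
  sealed trait Period
  case object Work extends Period // 08:00 <= time < 18:00
  case object Home extends Period // 18:00 to 08:00 the next morning

  /** Classifies a yyyyMMddHHmmss timestamp by its hour of day. */
  def periodOf(timestamp: String): Period = {
    // Characters 8 and 9 hold the two-digit hour (HH).
    val hour = timestamp.trim.substring(8, 10).toInt
    if (hour >= 8 && hour < 18) Work else Home
  }
}
```

A stay that spans both windows would need to be split at the boundary before summing; this sketch only labels a single point in time.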

Another problem is that a user may pass through dozens or even hundreds of base stations in a single day. How do we know under which base station the user stays longest?
In addition, a user may pass the same base station more than once. Suppose there is a base station between a user's company and home: he passes it in the morning on the way to work,
again when going home at noon, and once more in the evening after work. He therefore appears under the same base station several times, and a lot of data is written to that base station's server log.

To find the base station under which a user stays longest, we in fact only need to split the data, sum it, and then join.

To make this easier to follow, we simulate some simple log data with 4 fields: phone number, timestamp, base station ID, and connection type (1 means a connection is established, 0 means it is disconnected):

Base Station A:

    18688888888,20160327082400,16030401eafb68f1e3cdf819735e1c66,1
    18611132889,20160327082500,16030401eafb68f1e3cdf819735e1c66,1
    18688888888,20160327170000,16030401eafb68f1e3cdf819735e1c66,0
    18611132889,20160327180000,16030401eafb68f1e3cdf819735e1c66,0

Base Station B:

    18611132889,20160327075000,9f36407ead0629fc166f14dde7970f68,1
    18688888888,20160327075100,9f36407ead0629fc166f14dde7970f68,1
    18611132889,20160327081000,9f36407ead0629fc166f14dde7970f68,0
    18688888888,20160327081300,9f36407ead0629fc166f14dde7970f68,0
    18688888888,20160327175000,9f36407ead0629fc166f14dde7970f68,1
    18611132889,20160327182000,9f36407ead0629fc166f14dde7970f68,1
    18688888888,20160327220000,9f36407ead0629fc166f14dde7970f68,0
    18611132889,20160327230000,9f36407ead0629fc166f14dde7970f68,0

Base Station C:

    18611132889,20160327081100,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327081200,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327081900,cc0710cc94ecc657a8561de549d940e0,0
    18611132889,20160327082000,cc0710cc94ecc657a8561de549d940e0,0
    18688888888,20160327171000,cc0710cc94ecc657a8561de549d940e0,1
    18688888888,20160327171600,cc0710cc94ecc657a8561de549d940e0,0
    18611132889,20160327180500,cc0710cc94ecc657a8561de549d940e0,1
    18611132889,20160327181500,cc0710cc94ecc657a8561de549d940e0,0

The following is the base station table data. It has 4 fields: base station ID, longitude, latitude, and signal type (such as 2G, 3G, or 4G):

    9f36407ead0629fc166f14dde7970f68,116.304864,40.050645,6
    cc0710cc94ecc657a8561de549d940e0,116.303955,40.041935,6
    16030401eafb68f1e3cdf819735e1c66,116.296302,40.032296,6
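
The two record layouts above can be turned into typed values with a short parser. This is a sketch under stated assumptions: the case classes `LogEvent` and `Station` and the functions `parseLog` and `parseStation` are hypothetical names, each field is trimmed in case a line contains stray whitespace, and the station ID is normalized to lower case so that IDs compare equal regardless of case.

```scala
object Records {
  // One log line: phone number, timestamp, base station ID, connection type.
  final case class LogEvent(mobile: String, timestamp: String, stationId: String, connected: Boolean)

  // One base station line: ID, longitude, latitude, signal type.
  final case class Station(id: String, longitude: Double, latitude: Double, signalType: String)

  def parseLog(line: String): LogEvent = {
    val f = line.split(",").map(_.trim)
    // Connection type "1" means connect, anything else is treated as disconnect.
    LogEvent(f(0), f(1), f(2).toLowerCase, f(3) == "1")
  }

  def parseStation(line: String): Station = {
    val f = line.split(",").map(_.trim)
    Station(f(0).toLowerCase, f(1).toDouble, f(2).toDouble, f(3))
  }
}
```

Normalizing both sides the same way matters because the later join matches log records to base stations by ID; a single extra space or capital letter would silently drop the match.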

Based on the log data of the 3 base stations above, the task is to find, for each user, the two locations where they stayed longest within a day.
A phone number may pass through many base stations in one day: the user may stay at home for 10 hours, stay at the company for 8 hours, and pass some base stations while riding in a car.
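
Picking the two longest stays per phone number can be sketched with plain Scala collections. The input is assumed to be already aggregated as `((mobile, stationId), totalSeconds)`; the names `TopLocations` and `topStations` are hypothetical, and this is an illustration of the ranking step rather than the article's own code.

```scala
object TopLocations {
  /** For each phone number, returns the n base stations with the longest total stay. */
  def topStations(
      totals: Seq[((String, String), Long)],
      n: Int = 2
  ): Map[String, Seq[(String, Long)]] =
    totals
      .groupBy { case ((mobile, _), _) => mobile }
      .map { case (mobile, rows) =>
        val ranked = rows
          .map { case ((_, station), seconds) => (station, seconds) }
          .sortBy { case (_, seconds) => -seconds } // longest stay first
          .take(n)
        mobile -> ranked
      }
}
```

On a Spark RDD of the same shape, the same idea would typically be expressed by re-keying on the phone number and grouping, then sorting each group in the same way.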

Approach:
Find the base station under which each phone number stays longest. In the calculation, use "phone number + base station" as the key to identify the time spent under each base station,
because each base station holds log data for many users.
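
One way to sum the stay time per "phone number + base station" key is to count each connect time as negative and each disconnect time as positive, so that the sum per key equals the total of (disconnect minus connect). The sketch below uses plain Scala collections instead of an RDD so it runs without a cluster. It assumes every connect has a matching disconnect and converts the `yyyyMMddHHmmss` timestamps to epoch seconds first; the names `SignedSum` and `totalStay` are hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object SignedSum {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Epoch seconds; UTC is an arbitrary choice because only differences are used.
  private def epochSeconds(ts: String): Long =
    LocalDateTime.parse(ts.trim, fmt).toEpochSecond(ZoneOffset.UTC)

  /** Total stay in seconds per (mobile, stationId), computed from raw log lines. */
  def totalStay(lines: Seq[String]): Map[(String, String), Long] =
    lines
      .map { line =>
        val f = line.split(",").map(_.trim)
        // Connect ("1") is negative, disconnect ("0") is positive.
        val signed = if (f(3) == "1") -epochSeconds(f(1)) else epochSeconds(f(1))
        ((f(0), f(2)), signed)
      }
      .groupBy { case (key, _) => key }
      .map { case (key, rows) => key -> rows.map(_._2).sum }
}
```

Because the signs cancel pairwise, repeated visits to the same base station add up without any explicit pairing of connect and disconnect records. An unmatched connect would leave a large negative residue in the sum, which is one reason real data needs the session handling mentioned earlier.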

A country has a great many base stations, and each telecom branch is only responsible for computing its own data. The data is stored on a server in the equipment room beneath the base station.
Tools are typically used to collect this data over the network. The collected data can be very large,
so it is usually stored in a distributed file system such as HDFS.

We may compute over a week or a month of data; the longer the time span, the more accurate the result.

The relevant information is in the "Spark profile".

Important: once a Spark program is written, if you do not want to submit it to the Spark cluster every time, you can
specify local mode in the program, as follows:

    new SparkConf().setAppName("xxxx").setMaster("local")

This runs the program locally, simulating a cluster, without submitting it to the cluster.
This works without problems on Linux and Mac, but exceptions can occur on Windows.
Our Spark program reads data from HDFS and therefore uses Hadoop's InputFormat to
read it, so extra setup is needed to debug locally on Windows.
Hadoop compresses and decompresses data, and this relies on libraries written in C or C++.
Libraries written in C or C++ are not cross-platform, so to debug on Windows you must first install them.
We recommend debugging on Linux. If you do not have a Mac, you can install the IDEA development tool
in a Linux virtual machine and debug through the Linux graphical interface.

The following is the complete code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object MobileLocation {

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MobileLocation").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // args(0): path to the base station log files
        val lines: RDD[String] = sc.textFile(args(0))
        // Split each line into its fields
        // lines.map(_.split(",")).map(arr => (arr(0), arr(1).toLong, arr(2), arr(3)))
        val splited = lines.map(line => {
          val fields = line.split(",")
          val mobile = fields(0)
          val lac = fields(2)
          val tp = fields(3)
          // Connect (1) counts as negative, disconnect (0) as positive, so the sum
          // per key is "disconnect minus connect". The raw yyyyMMddHHmmss number is
          // used as-is, so the result is a rough measure, not exact seconds.
          val time = if (tp == "1") -fields(1).toLong else fields(1).toLong
          // Assemble the record: ((phone number, base station), time)
          ((mobile, lac), time)
        })
        // Group by key and aggregate
        val reduced: RDD[((String, String), Long)] = splited.reduceByKey(_ + _)
        val lmt = reduced.map(x => {
          // (base station, (phone number, time))
          (x._1._2, (x._1._1, x._2))
        })
        // args(1): path to the base station table
        val lacInfo: RDD[String] = sc.textFile(args(1))
        // Prepare the base station data: (base station ID, (longitude, latitude))
        val splitedLacInfo = lacInfo.map(line => {
          val fields = line.split(",")
          val id = fields(0)
          val x = fields(1)
          val y = fields(2)
          (id, (x, y))
        })
        // Join on the base station ID
        val joined: RDD[(String, ((String, Long), (String, String)))] = lmt.join(splitedLacInfo)
        println(joined.collect().toBuffer)
        sc.stop()
      }
    }

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
