The mobile Internet has become inseparable from daily life: people produce and access data anytime, anywhere. On today's mobile Internet, obtaining data no longer seems to be a problem. The challenge we face is how to find the information that is truly useful to us in a sea of data.
Big data generated on the phone needs to be reorganized before it can reveal useful information.
Of course, big data is not a problem in itself. Massive amounts of data can be reached through any channel at any time, and each of us personally produces a large amount of data every day. How, then, do we get the value these piles of data should have? I think the key issue is how the data is processed; otherwise it is just a heap of data with no value.
Finding a needle in a haystack
Data processing and analysis amount to picking a needle out of a large pile of straw.
This statement has two meanings:
First, whatever channel the data comes from and whatever kind of data it is, much of it is meaningless. Only a small part is valid and yields regular, valuable information, so the raw data needs cleaning and tidying up. (This is especially evident on the mobile Internet. Generally speaking, roughly 80%-90% of mobile application data is worthless, and at most 10%-20% contains the information required for follow-up analysis.)
Second, through subsequent data mining we need to find, in a disorganized pile of straw, the needle hidden beneath the surface rather than just another straw. This cannot be derived from simple statistics. It takes algorithms more complex than statistical analysis, moving from simple to complex levels of logical association.
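The cleaning step described in the first point can be illustrated with a minimal sketch. The record fields (`user_id`, `book_id`, `duration_s`) and the validity rules are assumptions made for illustration only; the source does not say how its data was actually cleaned.

```python
# Hypothetical raw event records; the field names are illustrative only.
raw_events = [
    {"user_id": "u1", "book_id": "b1", "duration_s": 120},
    {"user_id": "u1", "book_id": "b1", "duration_s": 120},  # exact duplicate
    {"user_id": None, "book_id": "b2", "duration_s": 30},   # missing user
    {"user_id": "u2", "book_id": "b3", "duration_s": 0},    # no real reading
    {"user_id": "u3", "book_id": "b2", "duration_s": 45},
]


def clean(events, min_duration_s=5):
    """Keep records that have a user, a book, and a minimum reading time,
    dropping exact duplicates. The threshold is an assumed example value."""
    seen = set()
    kept = []
    for event in events:
        if not event.get("user_id") or not event.get("book_id"):
            continue  # incomplete record
        if event.get("duration_s", 0) < min_duration_s:
            continue  # too short to count as reading
        key = (event["user_id"], event["book_id"], event["duration_s"])
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        kept.append(event)
    return kept


cleaned = clean(raw_events)
print(f"kept {len(cleaned)} of {len(raw_events)} records")
```

In practice the rules for what counts as valid depend on the application; the point is only that filtering happens before any mining.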
The previous chapter showed that big data on mobile phones comes from diverse sources. Different mobile Internet entry points and different applications produce different data; these data support different services, and the services may overlap across applications. Based on these characteristics, we briefly introduce how mobile big data can be organized. One possible structure is shown in the following figure.
Figure: Mobile big data organization
As the structure above shows, mobile big data organization is divided into three parts: the raw data storage layer, the computing layer, and the business data storage layer.
Raw data storage layer
This layer mainly stores the data generated by the different entry points; the different data domains of each entry point also coexist here. At this stage each data set is independent, with no links between data sets. The benefit is that the integrity and correctness of the raw data are preserved as far as possible.
Computing layer
The main task of this layer is to mine the data held in the raw data storage layer and classify the mining results by service. Mining is roughly divided into two steps: ① analysis and mining within each data domain; ② association analysis and mining across data domains. Result classification sorts the mining results by application, such as music, restaurants, and advertising.
Business data storage layer
The business data storage layer stores the classified output of the computing layer separately, giving upper-layer applications a more clearly defined data interface.
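The three layers can be pictured with a small in-memory sketch. Everything named here (`raw_store`, `SERVICE_OF`, `compute_layer`, the entry points and records) is hypothetical; it only shows how independent raw data sets feed a computing step whose output is stored per service.

```python
from collections import defaultdict

# Raw data storage layer: one independent data set per entry point.
raw_store = {
    "reader_app": [
        {"user": "u1", "item": "novel_a"},
        {"user": "u2", "item": "novel_a"},
    ],
    "music_app": [
        {"user": "u1", "item": "song_x"},
    ],
}

# Assumed mapping from entry point to the service its results belong to.
SERVICE_OF = {"reader_app": "reading", "music_app": "music"}


def compute_layer(raw):
    """Mine each data domain, then associate across domains,
    and classify the results by service."""
    business = defaultdict(dict)

    # Step 1: analysis within each data domain (item counts here).
    for source, records in raw.items():
        counts = defaultdict(int)
        for record in records:
            counts[record["item"]] += 1
        business[SERVICE_OF[source]]["item_counts"] = dict(counts)

    # Step 2: association across data domains
    # (users who appear in more than one entry point).
    sources_by_user = defaultdict(set)
    for source, records in raw.items():
        for record in records:
            sources_by_user[record["user"]].add(source)
    business["cross"]["multi_source_users"] = sorted(
        user for user, sources in sources_by_user.items() if len(sources) > 1
    )
    return dict(business)


# Business data storage layer: results kept separately per service.
business_store = compute_layer(raw_store)
print(business_store["reading"])
print(business_store["cross"])
```

The raw store is never modified, which matches the goal of keeping the original data intact.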
Once data is stored and managed effectively, we can put big data to meaningful work, which is of great guiding significance, especially for a mobile Internet business in its early days. Here we take mobile phone reading as an example to describe how big data can be used on mobile phones.
Reading on the phone
Mobile reading, as one of the few applications that still show some commonality with Internet applications and some "precipitation" (accumulation of user behavior), was the subject of our early mobile application research. "Commonality" and "precipitation" here are relative to the "fragmentation" of mobile applications. Although much mobile reading happens in fragments of time, two valuable contiguous "in bed" periods have gradually formed, one in the morning and one at night. Within these periods user behavior is relatively complete, and it reflects relatively complete information about user preferences on the mobile Internet.
Although many reports show that users behave very differently on the Internet and on the mobile Internet, I still believe people's feelings do not change when they face a paragraph of text or a picture, because human nature does not change easily. User behavior differs between the Internet and the mobile Internet because the applications are used in different scenarios and solve different problems. A person does not have one character on the Internet and a different one on the mobile Internet.
This is our original motivation for studying big data on mobile phones.
Below, we take a mobile reading project we carried out as an example and walk through it in three parts: problem discovery, solution, and result verification.
Problem discovery
By mining and analyzing mobile phone reading data, we found the following problems:
(1) Reading behavior is similar across cities
We computed reading data for users in Hangzhou, Guangzhou, and Shenzhen over a period of time and obtained the top 20 books in each of these three cities over three months.
The table below shows the number of popular books that overlap between Hangzhou and Guangzhou and Shenzhen in the same period.
From the table, we can infer that reading in Hangzhou does show a certain degree of similarity to reading in Shenzhen and Guangzhou.
The formula is: Sim(A, B) = 2 * Same(A, B) / [Count(A) + Count(B)]
Sim(A, B) is the reading similarity of cities A and B, Same(A, B) is the number of books read in both city A and city B in a given period, and Count(A) is the number of books read in city A in that period.
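The formula can be written directly as a small function. This is a sketch that assumes each city's reading is represented as a list of book identifiers; the book names below are placeholders, not data from the study. The formula has the same form as the Sørensen-Dice coefficient on two sets.

```python
def reading_similarity(books_a, books_b):
    """Sim(A, B) = 2 * Same(A, B) / (Count(A) + Count(B)),
    with each city's books treated as a set of identifiers."""
    a, b = set(books_a), set(books_b)
    if not a and not b:
        return 0.0  # no books in either city; defined as 0 here by assumption
    same = len(a & b)  # books read in both cities
    return 2 * same / (len(a) + len(b))


# Placeholder top lists (hypothetical, for illustration only).
hangzhou_top = ["book1", "book2", "book3", "book4"]
guangzhou_top = ["book2", "book3", "book4", "book5"]

print(reading_similarity(hangzhou_top, guangzhou_top))  # 0.75
```

When both cities contribute a top 20 list, Count(A) + Count(B) is 40, so the similarity reduces to the number of overlapping books divided by 20.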
From the table we can see that Hangzhou's reading similarity with both Shenzhen and Guangzhou is high. In other words, Hangzhou-Shenzhen and Hangzhou-Guangzhou reading are quite similar, so we can treat these cities as reference ("weathervane") cities for one another's reading trends.
So, for benchmark analysis, we would tend to choose Guangzhou as the reading benchmark for Hangzhou and Hangzhou as the benchmark for Shenzhen; Hangzhou and Guangzhou can refer to each other.
(2) The "long tail" phenomenon in mobile reading is serious
Our statistics show that the mobile reading market is operated mainly through hot lists and subjective recommendations. This produces the familiar "long tail" effect, and as the number of books grows, the effect becomes even worse!
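One simple way to quantify how concentrated reading is on a few hot books is the share of total PV captured by the top N titles. The sketch below uses made-up PV figures; `head_share` is a hypothetical helper, not a metric taken from the source.

```python
def head_share(pv_by_book, top_n):
    """Fraction of total PV that goes to the top_n most-read books."""
    total = sum(pv_by_book.values())
    if total == 0:
        return 0.0
    top = sorted(pv_by_book.values(), reverse=True)[:top_n]
    return sum(top) / total


# Hypothetical PV per book: two hot titles and a tail of rarely read ones.
pv = {"a": 500, "b": 300, "c": 100, "d": 50, "e": 30, "f": 20}

print(head_share(pv, 2))  # 0.8: the top two books take 80% of all PV
```

A share that stays high while the catalogue grows means the added books are landing in the tail and are rarely read.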
(3) Users with different behavior patterns differ considerably
We used the k-means clustering algorithm, a common data mining algorithm, to show whether the differences in characteristics between different types of users are obvious.
Figure: Differences between users with different behavior patterns
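For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard (Lloyd's) procedure. The two user features (daily PV and sessions per day), the sample values, and the choice of seeding centroids with the first k points are all assumptions for illustration; the source does not state which features or settings were used.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm. The first k points seed the centroids,
    a simplification that keeps this example deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical users described by (daily PV, sessions per day).
users = [(2, 1), (50, 8), (3, 1), (55, 9), (1, 2), (48, 7)]

labels, centroids = kmeans(users, k=2)
print(labels)     # light readers and heavy readers end up in separate clusters
print(centroids)
```

If the cluster centroids are far apart relative to the spread inside each cluster, the user types are clearly distinct, which is the kind of difference the heading refers to.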
Solution
We analyzed the historical data and believe the following solutions can address these problems.
(1) Cluster-based recommendation push
After clustering users and pushing books to them, we found the following results:
Figure: Cluster-based recommendation push
The first category: close to 60% of users. These users had been reading the recommended books all along (about e^1 PV on average). Their PV did not increase on June 14 itself, but it improved noticeably over the next few days; after June 20 it appears to return to the earlier level.
The second category: a small group of only 57 people. Before the recommendation they had read essentially none of the recommended books; after it, their PV on the recommended books was large (mean about e^2) and was well sustained afterwards. The push can be considered to have stimulated their interest in the recommended books.
The third category: close to 40% of users. Their PV on the pushed books rose very noticeably between June 13 and June 17, but they were otherwise fairly silent: early on (June 13) they read few recommended books, later (June 17) they read somewhat more, and then they returned to a fairly silent state.
(2) Personalized push
Books on mobile phones are currently pushed mostly from popularity charts, so the more popular a book is, the more it gets pushed, and the more it is pushed, the more popular it becomes. This repeated cycle produces the system's long tail phenomenon, poor data coverage, and other issues. It is therefore best to set different recommendations for different users, which can greatly improve these conditions.
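The coverage problem can be made concrete with a toy comparison. In this sketch, a chart-based push sends every user the same top books, while a personalized push picks the top books within each user's preferred category. The catalogue, the preferences, and the category-based rule are all hypothetical stand-ins; the source does not describe its actual recommendation algorithm.

```python
# Hypothetical catalogue: book -> (category, total PV).
books = {
    "hot1": ("fantasy", 900),
    "hot2": ("fantasy", 800),
    "rom1": ("romance", 120),
    "his1": ("history", 60),
    "his2": ("history", 40),
}

# Hypothetical preferred category per user (for example, from clustering).
user_pref = {"u1": "fantasy", "u2": "romance", "u3": "history"}


def popularity_push(n):
    """Every user gets the same n books from the overall chart."""
    ranked = sorted(books, key=lambda b: books[b][1], reverse=True)
    return {user: ranked[:n] for user in user_pref}


def personalized_push(n):
    """Each user gets the top n books within their preferred category."""
    pushes = {}
    for user, category in user_pref.items():
        in_category = [b for b in books if books[b][0] == category]
        in_category.sort(key=lambda b: books[b][1], reverse=True)
        pushes[user] = in_category[:n]
    return pushes


def coverage(pushes):
    """Share of the catalogue that is pushed to at least one user."""
    pushed = {book for picks in pushes.values() for book in picks}
    return len(pushed) / len(books)


print(coverage(popularity_push(2)))    # 0.4: only the two hot books are pushed
print(coverage(personalized_push(2)))  # 1.0: every book reaches some user
```

The chart-based push keeps reinforcing the same titles, while the per-user push reaches books in the tail.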
Result verification
After our data mining and its practical application to users, we obtained the following feedback:
(1) Increase in PV
Figure: First-day PV comparison
From the experimental analysis, we conclude that the total feedback rate for the week in the experimental group was 1.30 times that of the control group.
Among users for whom the recommendation succeeded, the first-day total PV on recommended books in the experimental group was 8.17 times that of the control group, and the per capita PV on recommended books was 15.19 times that of the control group.
Among the same users, the first-day total PV on all books was 9.62 times that of the control group, and the per capita PV on all books was 17.88 times that of the control group.
Over one week, the total PV on recommended books in the experimental group was about 5.45 times that of the control group, and the per capita PV on recommended books was 4.88 times that of the control group.
Over one week, the total PV on all books in the experimental group was 7.01 times that of the control group, and the per capita PV on all books was 6.28 times that of the control group.
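The two kinds of ratio reported above, total and per capita, can be computed as follows. The numbers in this sketch are invented to show the arithmetic and are not the study's data; `pv_ratios` is a hypothetical helper.

```python
def pv_ratios(exp_pv, exp_users, ctl_pv, ctl_users):
    """Return (total PV ratio, per capita PV ratio) of the
    experimental group relative to the control group."""
    total_ratio = exp_pv / ctl_pv
    per_capita_ratio = (exp_pv / exp_users) / (ctl_pv / ctl_users)
    return total_ratio, per_capita_ratio


# Invented example: 40 experimental users with 1200 PV in total,
# 80 control users with 200 PV in total.
total, per_capita = pv_ratios(exp_pv=1200, exp_users=40, ctl_pv=200, ctl_users=80)

print(total)       # 6.0
print(per_capita)  # 12.0
```

The two ratios differ whenever the groups contain different numbers of users, which is why a total ratio and a per capita ratio for the same period need not match.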
To sum up, after data mining, analysis, recommendation, and other algorithms were applied, the indicators of mobile phone reading improved greatly: not by tens of percent, but several times over. That is the charm of big data applications!