Kong: Big Data Analysis, Processing, and User Portrait Practice
The content of the live session is as follows:
Today I'd like to talk about the field of data analysis I've worked in. As a serial entrepreneur, I focus more on problem solving and business scenarios. If I were to divide up my data analysis experience, it maps onto the two startups I've been through: the first phase was "third-party data analysis" and the second was "first-party data analysis". So today I'll talk about data analysis from these two angles.
Third-party data analysis
Let me start with "third-party data analysis", which mainly stems from the Weibo data mining I did for Kai-Fu Lee.
The origin: building "Weibo recommendations" for Kai-Fu Lee
When Weibo first took off, we noticed that the top account had for quite a while been Kai-Fu Lee's. Many people wondered: does Kai-Fu spend every day scrolling Weibo? Or does he have a huge operations team behind him?
I can give you the answer: he basically handled it himself. But he was very busy every day and had no time to read that many posts, so we played a "trick": we automated Weibo recommendation with a program that "pulled out" the posts likely to go viral or to be meaningful.
The algorithm we built for Kai-Fu took the people he followed, plus the people they in turn followed, as seeds. First we built a "history archive" of each person's retweet history; in theory, for each person you can fit a curve relating time to retweet volume. We then monitored these people's posts, and if at some point a post's retweets exceeded its author's historical archive by a certain margin, we treated it as a recommended post. So what Kai-Fu saw every day was an already filtered feed.
During this process we happened to catch Weibo's explosive growth, so to keep the historical-archive computation efficient we used a correlation algorithm based on sequential sampling of a continuous signal to reduce the computational complexity, tuned the parameters several times, and also manually added some valuable new seed users to the database.
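To make the "exceeds the historical archive by a certain margin" rule concrete, here is a minimal Python sketch of the idea; the data shapes, field names, and threshold are illustrative assumptions rather than the original system.

```python
# Minimal sketch (not the original system): flag a post as "recommendable" when its
# retweet count at a given age clearly exceeds the author's historical baseline.
from statistics import mean, pstdev

def build_history_profile(past_posts, age_minutes):
    """past_posts: list of dicts like {"retweets_by_age": {30: 12, 60: 40, ...}}.
    Returns (mean, stddev) of retweet counts at the given post age."""
    counts = [p["retweets_by_age"].get(age_minutes, 0) for p in past_posts]
    return mean(counts), pstdev(counts)

def is_breakout(current_retweets, history_mean, history_std, k=3.0):
    """Simple threshold rule: the current count exceeds mean + k * stddev."""
    return current_retweets > history_mean + k * history_std

# Example with made-up numbers
history = [{"retweets_by_age": {60: 40}}, {"retweets_by_age": {60: 55}},
           {"retweets_by_age": {60: 35}}]
m, s = build_history_profile(history, age_minutes=60)
print(is_breakout(current_retweets=300, history_mean=m, history_std=s))  # True
```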
The transition: from Weibo data mining to third-party data mining
I've just told a few stories; in fact, a lot more Weibo data mining followed, and it later evolved into a product called "Pulse Network", which included visual monitoring of Weibo retweet chains, finding KOLs (key opinion leaders), analyzing why a post went viral, and so on. Although "Pulse Network" looks like a Weibo tool on the surface, at its core it is a group of elite crawlers. Which brings us to today's topic: when we talk about third-party data, we have to talk about crawlers.
In the third-party data analysis I did, all the user data came from crawling open data on the web, such as Weibo, Douban, Renren, Zhihu, and so on, and all the label data came from crawling vertical sites, such as Autohome for cars, travel sites for tourism, and so on.
So-called third-party data analysis is defined relative to the data consumer's own data (first-party data). The sources used by data providers roughly fall into a few types:
- Crawlers, like "Pulse Network"
- Ad monitoring, such as AdMaster, which tracks users' page-visit records via cookies, etc.
- Data analysis platforms such as TalkingData, with users' app-list data, etc.
So companies including our startup (37degree), AdMaster, and TalkingData have all built DMPs (data management platforms). Although the sources differ, they all clean the various kinds of data and then compute labels from them; for example, websites fall into different categories and apps have different classifications, though of course the actual algorithms are more complex than that.
Let me share some of my experience working with third-party data.
First, data capture, that is, crawlers.
A crawler is not as simple as sending a request and parsing the DOM. In practice we mainly dealt with the following aspects:
- Finding the right interface, including mobile packet capture, the PC website, the WAP site, and Ajax asynchronous requests
- Breaking through restrictions and verification, including impersonating requests, bypassing CAPTCHAs, login verification, and network proxies
- Efficiency
Let's start with the first point: a crawler has to crawl skillfully.
Many people blindly crawl every web interface they can reach, which is the wrong approach. Finding the right interface is the first step in building a crawler, and it can save a great deal of time. For example, if you want to crawl Weibo users' profiles, there are several ways in:
- Web page
- Clients: iOS, Android, tablet, etc.
- WAP website
Every terminal offers the same business functions. All we have to do is find the interface with the simplest logic and the fewest restrictions to crawl.
For PC websites, a lot of content is loaded asynchronously by JavaScript, i.e., via interfaces that are only triggered by click interactions, so many people like to handle this with browser-simulation libraries. That works, but at scale you run into performance and other problems. In general, use Chrome's debugging tools and watch the Network panel; if necessary, enable "Preserve log" so the log isn't cleared by redirects.
For mobile, you can point the device's proxy at Charles or Fiddler2 and capture the network requests, which exposes a lot of request data; then find what you need. I use this approach the most, because mobile APIs return almost entirely structured data, unlike web pages, which still need parsing.
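As a small illustration of why the mobile interfaces are so convenient, here is a minimal sketch of calling a JSON API discovered through packet capture; the endpoint, parameters, and headers are hypothetical placeholders, not a real Weibo interface.

```python
# Minimal sketch of hitting a mobile-style JSON API found via packet capture.
# The endpoint, parameters, and User-Agent below are hypothetical placeholders.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "ExampleApp/1.0 (Android)",  # mimic the mobile client seen in the capture
})

resp = session.get(
    "https://api.example.com/v2/users/show",   # placeholder endpoint
    params={"uid": "1234567890"},
    timeout=10,
)
resp.raise_for_status()
profile = resp.json()   # structured data, no HTML parsing needed
print(profile.get("screen_name"))
```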
Now for the second issue: breaking through restrictions.
Simulating requests is definitely the first step. You have to know the HTTP protocol: simple things such as the UA and Referer, or where cookies are set in the request headers, plus some common protocols such as XAuth and OAuth 2.0. At the time we required colleagues working on crawlers to digest these at the level of principles.
Bypassing CAPTCHAs: use some simple OCR libraries; the early 12306 CAPTCHAs, for example, were very simple. For complex ones you can only look for loopholes.
Login verification generally presents two main problems:
One is that a sequence of requests is needed, with cookies or parameters added along the way, so you have to trace and simulate the whole flow;
The second is working out the encryption mechanism. Early Sina Weibo, for example, used double SHA1 with a salt, and later switched to RSA. For PC pages, figuring out the encryption is relatively straightforward: learn to read the JavaScript code. You also need enough accounts to forge identities. Sometimes simulated login isn't feasible and you fall back on OAuth authorization; the reasoning is similar, i.e., simulate the flow to obtain an access_token. For example, after reading the OAuth 2.0 RFC I found a hidden authorization vulnerability.
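Since the salted, iterated SHA1 scheme is mentioned above, here is a hedged sketch of what such a transform generally looks like; the field names (servertime, nonce) and the concatenation order are assumptions for illustration, not Weibo's actual protocol.

```python
# Sketch of a salted, iterated SHA1 password transform of the kind described above.
# Field names (servertime, nonce) and concatenation order are assumptions.
import hashlib

def sha1_hex(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

def encode_password(password: str, servertime: str, nonce: str) -> str:
    # hash the password twice, then salt with server-provided values and hash again
    first = sha1_hex(sha1_hex(password))
    return sha1_hex(first + servertime + nonce)

print(encode_password("secret", servertime="1700000000", nonce="ABC123"))
```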
Network proxies depend on the situation. Some requests are HTTP, some are HTTPS. We crawled most of the public proxy sites on the web and then validated those proxies against the different domains we were crawling, so that we always had a large pool of usable proxies, both HTTP and HTTPS.
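A minimal sketch of that validation step might look like the following; the test URL, candidate list, and timeout are placeholders, and a production pool would also track latency and expire dead proxies.

```python
# Minimal sketch of validating scraped public proxies against a test URL.
import requests

def usable(proxy: str, test_url: str = "https://httpbin.org/ip") -> bool:
    """Return True if the proxy can complete a simple request in time."""
    proxies = {"http": proxy, "https": proxy}
    try:
        return requests.get(test_url, proxies=proxies, timeout=5).status_code == 200
    except requests.RequestException:
        return False

candidates = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]  # scraped from proxy sites
pool = [p for p in candidates if usable(p)]
print(pool)
```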
The last problem is efficiency; a crawler is a big engineering project. A simple example: suppose I want to crawl Weibo users' profiles, followings, and posts, and both the followings and the posts need pagination. With 300 million Weibo accounts at an average of 5 requests each, that is 1.5 billion requests, while a day has only 86,400 seconds, so you can imagine how much pressure crawling at that scale involves.
Our frameworks fell into three main types, all written by ourselves:
- Hadoop-based crawlers
- Single-NIC crawlers based on Celery
- Multi-NIC distributed crawlers based on Celery
A very important aspect of distribution is message communication; the core of a crawler framework is the constant scheduling of URLs and of parsing. If parsing is also distributed, the crawled content has to be shipped around and eats a lot of bandwidth. In a multi-NIC environment, the intranet is typically gigabit and the external link around 100 Mbps; communication goes over the intranet while crawling goes out over the external link, so the impact on bandwidth is small. A single-NIC environment is different, so on a single NIC we basically cut back on parse scheduling: the main traffic through the queue is URL scheduling, and parsing is done asynchronously so as to make the best use of network resources.
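A minimal Celery sketch of that idea, where only URLs travel through the message broker and fetching plus parsing stay inside the worker, might look like this; the broker address, rate limit, and link extraction are assumptions.

```python
# Sketch: only small URL messages go through the broker; fetched content is parsed
# locally in the worker so it never crosses the queue. Broker address is a placeholder.
import requests
from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")

def extract_links(html):
    return []  # placeholder: pull out the next URLs to crawl

@app.task(rate_limit="50/s")             # throttle outbound requests per worker
def crawl(url):
    html = requests.get(url, timeout=10).text
    for new_url in extract_links(html):  # parse locally, re-enqueue only URLs
        crawl.delay(new_url)

# Seed the queue, e.g.:
# crawl.delay("https://example.com/user/123")
```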
If you want to learn more about crawlers, take a look at the two introductory articles I wrote earlier:
"Crawler Primer": https://zhuanlan.zhihu.com/p/20334680?refer=zhugeio
"Crawler Primer" (part two): https://zhuanlan.zhihu.com/p/20336750?refer=zhugeio
Algorithm analysis
Next, algorithmic analysis. Fetching the data is only part of the work; the other part is the analysis algorithms, such as user portraits, label computation, and the application of clustering and classification from machine learning.
Influence calculation
To compute Weibo users' influence we used PageRank-related algorithms, because the follow relationships between users resemble the links between web pages. So we crawled all users' follow relationships and used the PageRank algorithm to rank their influence.
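As a toy illustration of the idea (the real computation ran at a far larger scale, eventually on GraphLab), here is a small power-iteration PageRank over a follow graph; the node names and parameters are made up.

```python
# Toy PageRank over a follow graph (edges point follower -> followed).
def pagerank(edges, damping=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out_links = {n: [b for a, b in edges if a == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for a in nodes:
            targets = out_links[a] or list(nodes)   # dangling node: spread evenly
            share = damping * rank[a] / len(targets)
            for b in targets:
                new_rank[b] += share
        rank = new_rank
    return rank

follows = [("u1", "u3"), ("u2", "u3"), ("u3", "u1")]
print(sorted(pagerank(follows).items(), key=lambda kv: -kv[1]))
```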
Consumption capacity calculation
Weibo users have the devices they post from, the type of verified (V) status they hold, and their follow relationships, so we combined these dimensions to estimate consumption capacity.
Label calculation
We pre-built a tag library, then ran word-frequency statistics over a user's posts and looked up the corresponding tags to accumulate weights. The tag library was mainly derived from training on corpora crawled from vertical sites.
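A minimal sketch of that tag-weighting step might look like the following; the tag library, keyword weights, and tokens are toy examples.

```python
# Count words in a user's posts and accumulate weights for tags whose keyword lists
# (in practice built from vertical-site corpora) contain those words.
from collections import Counter

TAG_LIBRARY = {
    "cars":   {"engine": 1.0, "SUV": 0.8, "horsepower": 0.6},
    "travel": {"flight": 1.0, "hotel": 0.8, "visa": 0.6},
}

def tag_weights(tokens):
    freq = Counter(tokens)
    return {tag: sum(freq[w] * weight for w, weight in keywords.items())
            for tag, keywords in TAG_LIBRARY.items()}

post_tokens = ["engine", "SUV", "engine", "hotel"]   # output of a tokenizer
print(tag_weights(post_tokens))   # {'cars': 2.8, 'travel': 0.8}
```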
In fact, computing influence and consumption capacity was a real challenge. Although these things are implemented with algorithms, efficiency remains the hard part: at 100 users per second, a day covers only a little over 8 million users, and covering all users would take about a month. So we did a lot of algorithmic and performance optimization, even trading some accuracy for efficiency. At first we ran PageRank directly, then tried it on Hadoop, which wasn't ideal because the amount of computation was too large; finally we optimized the algorithm and switched to the GraphLab compute engine.
We actually did a lot of data processing before running the computations and analyses described above on the Weibo data. As we all know, data can be broadly divided into two kinds: unstructured data and structured data.
For example, of the Weibo data we captured, the demographic attributes arrive as JSON and are structured, while the 140-character post content users write is unstructured. In our resume-matching project, resume content and job requirements were likewise mostly unstructured text.
Matching and analyzing unstructured text data requires clustering and classification algorithms. The resume-matching scenario is mainly about finding the resumes most relevant to a company's own positions, or helping a candidate quickly find the most relevant jobs, and the traditional directory indexes built on structured data can hardly meet that need. For example, even for Java engineers, different companies may name the position differently (Java engineer, backend engineer, and so on), so you have to look at the detailed job requirements and achieve the matching by clustering the unstructured "job description" text.
Machine learning falls into two main types: unsupervised learning and supervised learning.
For the resume project we used the unsupervised LDA algorithm, which is widely used in advertising. The core idea: think of the clusters we want as directions; each piece of text can be represented as a vector, and a vector has both length and direction, so in the end we compare these vectors to find their similarity.
For example, treat each resume as one processing unit. We prepared a job-related vocabulary, and those words can be regarded as the directions. We processed each resume with TF-IDF to remove irrelevant words, leaving essentially a combination of words and word frequencies, which can be viewed as the vectors and their lengths, and then clustered. Because it is unsupervised, we had to estimate how many clusters there were, tune the parameters, and finally obtain the clusters.
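As a rough sketch of that pipeline with scikit-learn (an assumed library choice; the talk does not specify the implementation): vectorize the resumes, filter uninformative terms, and fit an unsupervised topic model with a guessed number of groups.

```python
# Sketch only: TF-IDF to down-weight uninformative terms, then an unsupervised LDA
# topic model. Note that LDA is usually fit on raw counts; TF-IDF is used here to
# mirror the description above. Documents and parameters are toy values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

resumes = [
    "java spring backend microservices",
    "python data analysis pandas sql",
    "java jvm backend distributed systems",
]

vectorizer = TfidfVectorizer(max_df=0.9, min_df=1)    # drop overly common terms
X = vectorizer.fit_transform(resumes)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # guessed cluster count
doc_topics = lda.fit_transform(X)                     # per-resume topic mixture
print(doc_topics.argmax(axis=1))                      # hard cluster assignment
```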
User portraits work much the same way. We design many demographic dimensions and also look for dimensions that can be inferred from our data sources; together these dimensions make up a person's portrait, such as influence, consumption capacity, interests, brand labels, and so on. Depending on the application domain, the labels often need to be extracted per vertical, which is why I mentioned crawling corpora from vertical sites, extracting and training on them, and finally tagging users or clustering and classifying them.
We built a large user database that has been serving the advertising and marketing industries. But in the process we came to feel strongly that enterprises' own analysis capability is insufficient: too many analysis needs are focused on "traffic acquisition", always wanting to pull in outside data to match against user labels, while the processing and analysis of their own business data is very weak; on top of that, they pay little attention to user engagement. That is why I started thinking about building a first-party data analysis tool, to help more enterprises begin with processing and optimizing their internal data.
First-party data analysis
Next, let's talk about first-party data analysis.
First-party data, simply put, is a company's own data. For most companies this means the business data users generate in their databases; companies with a stronger sense of data analysis may also try to collect users' behavioral data through logs. Behavioral data here covers usage behavior such as visiting the product and browsing, as well as business behavior such as favoriting, following, purchasing, and searching.
Most early-stage small companies have no data analysis platform, so most first-party data analysis relies on engineers writing scripts to query databases on demand. A great deal of time is wasted on communicating, confirming requirements, writing scripts, and waiting for results.
Many more capable internet companies have started building their own data analysis platforms and collecting user behavioral data for analysis, but most uses of that behavioral data are limited to two kinds:
The first is traditional BI: statistical analysis based on relational databases such as Oracle, or on Hadoop. Generally this means designing a good data warehouse model that includes the dimensions to be analyzed, and then aggregating statistics along those dimensions. In the product domain that means counting the occurrences of key actions, most commonly metrics such as page views, unique users, and retention rate. In short, the data is used to produce statistical results.
The second is using behavioral data for personalized recommendations. This makes more sense; from Amazon's early recommendations to Facebook's, this is what I respect most: not merely computing results, but using data to optimize the business.
Personalized recommendation today commonly uses collaborative filtering, which basically comes in two kinds: popularity-based and interest-based; the former is user-based and the latter item-based. If you want to build an interest-driven community, be sure to avoid recommendations based purely on popularity statistics, and the key to the recommendations is presetting some tags.
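As a toy illustration of the item-based variant (not any particular product's implementation), compute cosine similarity between the item columns of a user-item matrix and score unseen items by a user's history; all data below is made up.

```python
# Item-based collaborative filtering sketch: item-item cosine similarity, then
# score unseen items by the user's interaction history.
import numpy as np

R = np.array([            # rows = users, columns = items; 1 = interacted
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

norms = np.linalg.norm(R, axis=0)
norms[norms == 0] = 1.0
item_sim = (R.T @ R) / np.outer(norms, norms)   # item-item cosine similarity

def recommend(user_idx, top_k=2):
    scores = R[user_idx] @ item_sim             # weight similar items by history
    scores[R[user_idx] > 0] = -np.inf           # drop items already seen
    return np.argsort(scores)[::-1][:top_k]

print(recommend(0))   # item indices recommended for user 0
```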
Looking at the practice of those two kinds of companies, users' product-interaction behavior data is in effect treated as a black box. Recommendation algorithms make better use of this data, but only as a deep analysis of individual users; horizontal analysis across users still ends at high-level summary reports, and there is no way to explore and verify how users' behavior inside the product affects the company's business metrics. A typical symptom is that many product-iteration decisions rely on guesswork or intuition.
Based on these considerations, we wanted to build a more meaningful first-party data analysis platform, not just one that produces multi-dimensional cross-aggregated results.
That is why I started Zhuge IO. How does Zhuge IO differ from past uses of first-party data? First, from the business perspective, we are user-centric: the "traffic era" cared about user-count results, BI cares about dimension-aggregated results, and we care about the users themselves.
Zhuge IO is now available in the QingCloud Application Center; you are welcome to try it.
With the dashboards we used to look at, it can be hard to find the reason behind a rise or fall, because the analysis is built on dimension-rolled-up data.
But if all the data is tracked at the level of individual users, then whenever we see a new-user count or an aggregated distribution, we can find the people behind each specific number, restore each user's rich business tags and dimensions, and do far more comparison and analysis.
You could say that "behavior is a label". A user's behavioral data, every past interaction, and so on can form rich business tags. Take the Zhihu community as an example: "followed topic XX", "followed user XX", "upvoted content XX", and "shared article XX" are all very rich and useful tags, and put together they form a tagging system. A tagging system based on users' behavior inside the product is more meaningful than the third-party data mentioned earlier, because behavior-based tags can in turn act back on the users: we can set more precise operational strategies for the user groups under different behavior tags, such as content recommendation, event promotion, and targeted push.
Finally, on the technical side, we mainly use a Lambda architecture.
We use Kafka as the message hub, with multiple tiers of producers and consumers, so the same data can feed real-time computation, labeling, loading into the data warehouse, backup, and so on.
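A minimal sketch of that "one stream, many consumer groups" pattern with the kafka-python client (an assumed library choice); the broker address, topic name, and group id are placeholders.

```python
# One event stream, many consumer groups: each downstream use (real-time computation,
# labeling, warehouse load, backup) reads the same topic under its own group id.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send("user_events", {"user_id": 42, "event": "click", "ts": 1700000000})
producer.flush()

labeler = KafkaConsumer(
    "user_events",
    bootstrap_servers="localhost:9092",
    group_id="labeling",                 # other groups keep their own offsets
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in labeler:
    print(message.value)                 # e.g., update the user's behavior tags here
    break
```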
For storage we use both SQL and NoSQL, but on the SQL side it is a column-store database; on the NoSQL side, for example, Elasticsearch indexes handle the fast-query scenarios, while the relational side is mainly for complex computation. Because we model not just the two subjects of users and events but also sessions, the complexity is quite high; we also used some small tricks along the way, which I may get a chance to share another time.
That's all for my sharing this time. Thank you.
Q&A
- How do you effectively filter out useless data and reduce the noise created by large numbers of spam-registered users, so that data mining becomes more valuable?
The first layer filters with simple IP or other anti-spam rules; the second layer filters on user behavior, for example requiring certain actions to be completed or a minimum number of visits or active days; the third layer uses user clustering to pick out the users who differ.
- How do you improve crawler efficiency and weed out worthless information?
This is much like data analysis itself: be clear about the goal, then filter out irrelevant data sources. For example, if you are computing labels, decide which vertical sites and what corpus scope you need, and then do directed crawling. Some people just take a breadth-first open-source crawler framework and crawl by URL; it is better to add some relevance rules on top.
- How do you get around anti-crawler mechanisms?
I covered some of this just now. You have to think flexibly: hidden loopholes, alternative access paths, simulating the client as closely as possible; there is always a way.
- How do you effectively prevent crawlers from scraping a site's data and stop hotlinking and content theft?
Anti-crawling strategies are also layered. The simplest is checking HTTP-level settings such as UA, Referer, or cookies, which blocks a large number of novice crawlers; the next level is request permission control; and finally, at some cost to user experience, CAPTCHAs and the like. In addition, HTTPS is very important for preventing large-scale leaks of user tokens at the gateway.
- What algorithms are suitable for building user portraits?
That is hard to answer in general. The influence and tagging algorithms mentioned in this talk are examples; in practice it depends very flexibly on the business scenario and the data sources.
- For numerical data analysis, for example weather conditions such as cloudy, rain, and snow, how do you set reasonable weights?
Weights need a target result to optimize for; then do plenty of testing and correlation analysis, and tune the parameters.
- How do you build an evaluation system that highlights the real value of big data while saving money?
This is actually what Zhuge IO is doing. Most "big data" today is mere aggregation; real big data means understanding user behavior and then providing personalized service and informing product and market strategy, which improves ROI and also lowers the cost for users to discover value, so the enterprise improves efficiency while reducing cost.
In the customer-acquisition era we relied on third-party data for matching and acquiring users; in the customer-engagement era, what we have to do is understand user behavior, improve conversion rates, and improve enterprise efficiency.
- When enterprises collect user data online, how should big data analysis balance enterprise utility against users' personal privacy, and respect and protect that privacy?
The EU has many regulations, at least far more than we have domestically, where many enterprises embed SDKs into developers' apps and use that coverage to capture customer data in support of their own business interests. Google and Apple do this better: their services frequently pop up a prompt asking whether you allow data to be collected.
Data privacy leaks are common now; for example, people's text messages and DNS-hijacked traffic are sold off by some data vendors, forming a black-market industry chain.
In the future, data analysis and exchange are more likely to be based on voluntary transactions between enterprises, along with encryption of user identities.