PHP development for capturing and analyzing data of millions of users

This article describes how to use PHP to capture and analyze the data of more than a million users.


This time, the data of 1.1 million users was captured.

Preparations before development

Install Ubuntu in a VMware virtual machine;

Install PHP 5.6 or later;

Install the curl and pcntl extensions.

Use PHP curl extension to capture page data

PHP's curl extension is a PHP-supported library that lets you connect to and communicate with different kinds of servers over a variety of protocols.

This program captures user data. To access a user's personal page, you must be logged in. When we click a user's profile picture link in the browser and reach that user's personal center page, we can see the user's information because, when the link is clicked, the browser carries your local cookies to the new page. So before accessing a personal page with curl, we must first obtain the user's cookie information and then attach that cookie to every curl request. As for the cookie itself, I used my own, which I could see on the page in the browser.

Copy the cookies one by one and join them into a cookie string of the form "__utma=?; __utmb=?;". This cookie string can then be used to send requests.

Initial example:

$url = 'http://www.zhihu.com/people/mora-hu/about'; // here, mora-hu is the user ID
$ch = curl_init($url); // initialize the session
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']); // set the request cookie
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of outputting it directly
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result; // the captured result

Running the code above retrieves the personal center page of user mora-hu. With this result, you can process the page with regular expressions to obtain the information that needs to be crawled, such as the name and gender.
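As a minimal sketch of this step (the tag and class names in the pattern are assumptions for illustration; inspect the actual page markup and adjust the expression accordingly), extracting a field from the captured HTML could look like this:

// Hypothetical sketch: pull the user name out of the captured page HTML.
// The markup matched here is an assumption, not Zhihu's real structure.
function parseName($html) {
    if (preg_match('/<span class="name">(.*?)<\/span>/s', $html, $matches)) {
        return trim(strip_tags($matches[1]));
    }
    return '';
}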

Image hotlink protection (anti-leech)

When the result processed by the regular expressions was output on a page, the users' profile pictures could not be displayed. After some reading, I learned that Zhihu applies hotlink protection to its images. The solution is to forge a Referer in the request header when requesting the images.

After obtaining the image link with a regular expression, send another request, this time carrying a Referer indicating that the request comes from the Zhihu website. An example is as follows:

function getImg($url, $u_id) {
    if (file_exists('./images/' . $u_id . '.jpg')) {
        return 'images/' . $u_id . '.jpg';
    }
    if (empty($url)) {
        return '';
    }
    $context_options = array(
        'http' => array(
            'header' => "Referer: http://www.zhihu.com" // forged Referer pointing at the Zhihu site
        )
    );
    $context = stream_context_create($context_options);
    $img = file_get_contents('http:' . $url, false, $context);
    file_put_contents('./images/' . $u_id . '.jpg', $img);
    return 'images/' . $u_id . '.jpg';
}
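For example, once the avatar link (a protocol-relative //... URL, hence the 'http:' concatenation above) and the user ID have been matched, the function can be called as $path = getImg($avatar_url, $u_id); and the returned relative path used directly when outputting the page ($avatar_url and $u_id here stand for the regular-expression matches).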

After capturing your own personal information, you need to access the user's follower list and followee list to obtain more user information, and then traverse outward layer by layer. As you can see, there are two links on the personal center page.

One link is "following" (the users this user follows), and the other is "followers". Take the "following" link as an example: use regular expression matching to find the corresponding link, and after obtaining the URL, use curl to send another request carrying the cookie, capturing the page that lists the users this user follows.

Analyze the HTML structure of that page. Since we only need the user information, we just have to extract the block containing the <p> content and the user name. We can see that the URLs of the followed users' pages follow the same pattern as the profile URL shown earlier.

The URLs of different users are almost identical; the difference is the user name. Get the list of user names with regular expression matching, splice the URLs one by one, and then send the requests one by one (of course, one by one is slow; there is a solution below, discussed later). After entering a new user's page, repeat the above steps, and keep looping until you have as much data as you want.
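A minimal sketch of the splicing step (the href pattern is an assumption based on the /people/ profile URLs shown above; adapt it to the real list-page markup):

// Hypothetical sketch: collect user IDs from a captured list page and
// splice them into profile URLs, one per followed user.
function buildProfileUrls($html) {
    $urls = array();
    if (preg_match_all('/href="\/people\/([^"\/]+)"/', $html, $matches)) {
        foreach (array_unique($matches[1]) as $uid) {
            $urls[] = 'http://www.zhihu.com/people/' . $uid . '/about';
        }
    }
    return $urls;
}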

Counting files with a Linux command

After the script had been running for a while, I needed to check how many images had been captured. With a large data volume, opening the folder to eyeball the image count is a little slow. Since the script runs in a Linux environment, a Linux command can count the number of files:

The Code is as follows:


ls -l | grep "^-" | wc -l


Here, ls -l outputs a long listing of the files in the directory (the entries can be regular files, directories, links, device files, and so on); grep "^-" filters the long-listing output to keep only regular files (to keep only directories, use "^d" instead); and wc -l counts the number of lines in the output.
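The same count can also be obtained from inside PHP, for instance for the avatar images downloaded by getImg() above (a minimal sketch assuming the ./images directory used earlier):

// Count the downloaded avatar files without leaving PHP.
$files = glob('./images/*.jpg');
echo ($files === false ? 0 : count($files)) . " images captured\n";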

Handling duplicate data when inserting into MySQL

After the program had run for a period of time, I found that many users' data was duplicated, so duplicates need to be handled when inserting user data. The solutions are as follows:

1) Check whether the data already exists in the database before inserting;

2) Add a unique index and use INSERT INTO ... ON DUPLICATE KEY UPDATE ... (see the sketch after this list);

3) Add a unique index and use INSERT IGNORE INTO ...;

4) Add a unique index and use REPLACE INTO ....
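As a minimal sketch of option 2 (the table and column names zhihu_user, uid, name and sex are assumptions for illustration, as are the connection credentials):

// Sketch of option 2: a unique index on uid plus ON DUPLICATE KEY UPDATE,
// so re-inserting an already-captured user updates the row instead of failing.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=spider;charset=utf8', 'user', 'pass');
$sql = 'INSERT INTO zhihu_user (uid, name, sex) VALUES (:uid, :name, :sex)
        ON DUPLICATE KEY UPDATE name = VALUES(name), sex = VALUES(sex)';
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':uid' => 'mora-hu', ':name' => 'Example', ':sex' => 'male')); // sample values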

Use curl_multi to capture multiple pages concurrently

Capturing with a single process and a single curl handle is very slow: crawling overnight on the host captured only about 20,000 records. So I thought about requesting multiple users at once when sending curl requests for new users' pages, and later found curl_multi, which turned out to be just the thing. The curl_multi functions can request multiple URLs at the same time instead of one by one, similar to multiple threads within a single process in Linux.
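A minimal sketch of the curl_multi pattern (assuming $urls holds the spliced profile URLs and $cookie the cookie string captured earlier; error handling is omitted):

// Fetch several user pages concurrently with one curl_multi handle.
function multiFetch(array $urls, $cookie) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }
    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($active && $status == CURLM_OK);
    $results = array();
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch); // requires CURLOPT_RETURNTRANSFER
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}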
