I used a crawler to "steal" 1 million Zhihu users in one day, just to prove that PHP is the best language in the world
I keep seeing friends in my circle recommend Python crawler articles, and they all strike me as child's play. Processing content is PHP's home turf to begin with. Python's only advantage is probably that it ships with Linux, just like Perl, which feels a bit unfair of Linux. The Mac is more generous: it ships with Python, Perl, PHP, and Ruby. Of course, I also hate arguing about which language is good or bad; every language exists for a reason. Anyway, PHP is the most useful language in the world, as we all know ^_^
A few days ago a post went viral about someone who wrote a multi-threaded crawler in C# and crawled 30 million QQ Zone users. For 3 million of them he got details such as QQ number, nickname, and space name, and it ran for two weeks. That's nothing. To prove that PHP is the best language in the world (although everyone already knows it ^_^), I wrote a multi-process crawler in PHP. In just one day it grabbed 1 million Zhihu users, and it is currently on the 8th round (depth=8) of users related to each other through following and being followed.
Crawler Programming:
Zhihu requires a login before you can see a user's followers page, so I logged in with Chrome and then copied the cookie into the curl program to impersonate a logged-in session.
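A minimal sketch of the cookie trick, assuming PHP's curl extension. The function names, the user agent string, and the timeout are my own placeholders, not the author's code; the cookie string is whatever you copy from the browser.

```php
<?php
// Hypothetical helper: builds the cURL options for a request that reuses
// a browser session. Nothing here is taken from the original crawler.
function build_curl_options(string $url, string $cookie): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_COOKIE         => $cookie,  // Cookie header copied from the browser
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (sketch)', // placeholder value
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,       // seconds, arbitrary choice
    ];
}

// Hypothetical helper: fetches one page, returning null on failure.
function fetch_page(string $url, string $cookie): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $cookie));
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

Keeping the options in one array makes it easy to reuse the same session settings for every request the workers send.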
The crawler uses two large, independent groups of looping processes: a user index process group and a user details process group. I wrapped PHP's pcntl extension in a very handy class; using it feels almost like using goroutines in Go.
Below is the user details code; the user index code is similar.
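The author's class itself is not reproduced here. As a rough stand-in, this is one way a process group could be built on pcntl_fork(); the function name run_process_group() and its parameters are hypothetical, and it needs the pcntl extension (CLI only, not available on Windows).

```php
<?php
// Sketch of a process group built on pcntl; not the author's class.
// Each child runs $task($workerIndex) and exits with the returned code.
function run_process_group(int $workers, callable $task): array
{
    $pids = [];
    for ($i = 0; $i < $workers; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child process: do the work, then leave without returning.
            exit((int) $task($i));
        }
        $pids[$pid] = $i; // Parent: remember which worker owns this PID.
    }

    // Parent: wait for every child and collect exit codes by worker index.
    $codes = [];
    while ($pids) {
        $pid = pcntl_wait($status);
        if ($pid === -1) {
            break; // no children left to wait for
        }
        if (isset($pids[$pid])) {
            $codes[$pids[$pid]] = pcntl_wexitstatus($status);
            unset($pids[$pid]);
        }
    }
    ksort($codes);
    return $codes;
}
```

In a crawler like the one described, each task would presumably loop forever (fetch a user, store the result, repeat) instead of returning, and the parent would simply keep the group alive.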
A digression: in my tests, my 8-core MacBook ran fastest with 16 processes, while a 16-core Linux server, surprisingly, ran fastest with 8. That puzzles me a little, but since testing tells you the best process count, just set whatever works best.
1. The user index process group starts from one user, crawls the people that user follows and the people who follow that user, and merges them into the database. Because several processes run at once, two of them may try to store the same user at the same time and create duplicates, so the user name column in the database must have a unique index. You could also use a third-party cache such as Redis to guarantee atomicity; that is a matter of preference.
After step one, we get the following list of users:
2. The user details process group works in time order: it takes the user who was stored earliest, crawls that user's details, and sets the update time to the current time. This turns the whole thing into an endless loop, so the program can run forever, cycling through the users and refreshing their information.
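The two steps can be sketched in SQL. This sketch uses an in-memory SQLite database through PDO purely for illustration; the article does not say which database was used, and the table name, column names, and function names are all made up. An in-memory database is private to one process, so a real multi-process run would point at a shared database server instead.

```php
<?php
// Hypothetical schema: the UNIQUE constraint is what rejects duplicates
// when two processes try to store the same user.
function open_store(): PDO
{
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE user (
        username   TEXT NOT NULL UNIQUE,
        updated_at INTEGER NOT NULL DEFAULT 0
    )');
    return $db;
}

// Step 1: store a discovered user; inserting the same name again is a no-op.
function add_user(PDO $db, string $username): void
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO user (username) VALUES (?)');
    $stmt->execute([$username]);
}

// Step 2: pick the user refreshed longest ago and stamp it with $now,
// which moves it to the back of the queue.
function next_user(PDO $db, int $now): ?string
{
    $name = $db->query(
        'SELECT username FROM user ORDER BY updated_at ASC, rowid ASC LIMIT 1'
    )->fetchColumn();
    if ($name === false) {
        return null; // table is empty
    }
    $db->prepare('UPDATE user SET updated_at = ? WHERE username = ?')
       ->execute([$now, $name]);
    return $name;
}
```

Note that the select-then-update in next_user() is not atomic across processes; with several detail workers you would want a transaction or row lock so two workers do not pick the same user.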
It ran stably until the next day, when new data suddenly stopped coming in. I checked and found that Zhihu had changed its rules. I don't know whether that was to block me or just a coincidence, but the data it returned now looked like this:
My first thought was that they were feeding me random data so I couldn't collect anything. I changed IPs and spoofed some request data, and none of it helped. Then it suddenly looked familiar: could it be gzip? Skeptical, I tried gzip. The first step, of course, was to tell the server not to send me gzip-compressed data:
I removed "Accept-Encoding: gzip,deflate\r\n" from the request headers, and it made no difference at all!
It seems Zhihu insists on sending gzip-compressed data. Fine, then I'll decompress it. I looked up how to decompress gzip in PHP and found the function gzinflate, so after fetching the content I added:
$content = substr($content, 10);  // skip the 10-byte gzip header
$content = gzinflate($content);   // inflate the raw deflate data that follows
Of course, you can also use curl's built-in support:
curl_setopt(self::$ch, CURLOPT_ENCODING, 'gzip');
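One caveat about the fixed 10-byte offset: it assumes a gzip header with no optional fields. As a hedged alternative for the manual route, gzdecode() (available since PHP 5.4, zlib extension) parses the header and trailer itself. The helper name maybe_gunzip() below is my own invention.

```php
<?php
// Sketch: detect gzip by its magic bytes and decode only when needed.
// maybe_gunzip() is a hypothetical helper, not part of the original crawler.
function maybe_gunzip(string $body): string
{
    // Every gzip stream starts with the two bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, return it as received
    }
    $decoded = gzdecode($body); // handles the header and trailer itself
    return $decoded === false ? $body : $decoded;
}
```

This way the same code path works whether or not the server decides to compress a given response.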
Here I really have to say that PHP truly is the best language in the world: one or two functions solved the problem completely, and the program happily started running again.
When matching content, Zhihu's attention to detail also helped me countless times. For example, I wanted to tell a user's gender:
Haha, just kidding. Actually the markup contains the classes icon-profile-female and icon-profile-male ^_^
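A small sketch of how those two class names could be matched. The function name and the sample markup in the usage note are assumptions; only the two class names come from the page as described above.

```php
<?php
// Hypothetical helper: returns 'female', 'male', or null when neither
// class name appears in the fetched profile HTML.
function detect_gender(string $html): ?string
{
    if (preg_match('/\bicon-profile-(female|male)\b/', $html, $m)) {
        return $m[1];
    }
    return null;
}
```

For example, detect_gender('<i class="icon icon-profile-male"></i>') yields 'male', and a profile with neither class yields null.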
So I went to all this trouble to grab so many users. What is it actually good for?
Nothing, really. I was just bored ^_^
With this information, though, you can do the kind of "big data analysis" that other people love to brag about.
The most common analyses, of course, are:
1. Gender distribution
2. Geographical distribution
3. Occupation distribution, and which companies people come from
4. Ratio of men to women in each occupation
5. When people usually use Zhihu, which questions they post and follow, and which topics deserve attention
Of course, from the number of followers, page views, questions, answers, and so on, you can see what people care about: livelihood, society, geography, politics. It is a panorama of the whole internet.
Perhaps you could also analyze the avatars, run them through an open-source porn-detection program to screen out the pornographic ones, and then go save Dongguan? ^_^
Then you could look at what the graduates of each university ended up doing.
With this data, you can really let your imagination run wild ^_^
Below are some interesting charts made from this data. The live chart data is at http://www.epooll.com/zhihu/