I used a crawler to "steal" 1 million Zhihu users in one day, just to prove that PHP is the best language in the world

Source: Internet
Author: User

2015-08-06 Ape Circle

                    


I see many friends in my circle recommending Python crawler articles, and it all strikes me as child's play. Processing content is PHP's home turf. Python's only advantage is probably that it ships with Linux, as Perl does. That feels a bit stingy of Linux; the Mac is kinder and ships with Python, Perl, PHP, and Ruby. Of course, I dislike arguing about whether a language is good or bad. Every language exists for a reason. Anyway, PHP is the most useful language in the world, as we all know!

A few days ago a post got a lot of attention: someone wrote a multithreaded crawler in C# and crawled 30 million QQ Zone users, 3 million of them with details such as QQ number, nickname, and space name. In other words, 3 million detailed records, and it ran for two weeks. That's nothing. To prove that PHP is the best language in the world (although everyone already knows it ^_^), I wrote a multi-process crawler in PHP that grabbed 1 million Zhihu users in just one day. It is currently on its 8th lap (depth=8) of users linked to one another through follow relationships (followees and followers).

                 

Crawler programming

Because Zhihu requires a login to view a user's followers page, I logged in with Chrome, copied the cookie, and passed it to the curl-based program to impersonate the logged-in session.
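The cookie trick can be sketched roughly as follows. This is a minimal sketch, not the author's code: the function names build_headers and fetch_page, the User-Agent string, and the placeholder cookie and URL are all assumptions for illustration. It needs PHP's curl extension when fetch_page is actually called.

```php
<?php
// Build the request headers for a "logged-in" request.
// $cookie is the raw Cookie header value copied from the browser's dev tools.
function build_headers(string $cookie): array
{
    return [
        'Cookie: ' . $cookie,
        // Hypothetical User-Agent; any browser-like string would do here.
        'User-Agent: Mozilla/5.0 (compatible; example-crawler)',
    ];
}

// Fetch one page with the copied cookie. Returns null on transport failure.
function fetch_page(string $url, string $cookie): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => build_headers($cookie),
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (placeholder values, not real credentials or a real endpoint):
// $html = fetch_page('https://example.com/people/some-user/followers', 'sid=PLACEHOLDER');
```

Keeping the header construction separate from the transfer makes it easy to swap in a fresh cookie when the session expires.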

I use two large, independent, looping process groups (a user-index group and a user-detail group). With PHP's pcntl extension I wrapped the process handling in a very handy class; using it feels almost like using goroutines in Go.

Below is the user-detail code; the user-index code is similar.

A digression: in my tests, my 8-core MacBook ran fastest with 16 processes, while a 16-core Linux server, surprisingly, ran fastest with 8. This puzzles me a little, but since testing showed the best process count for each machine, I simply set it to the measured value.
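The author's wrapper class is not shown, so the following is only a sketch of the general idea: fork a configurable number of workers with pcntl_fork and wait for them all. The name run_process_group is hypothetical, and this version runs one batch and returns, whereas the crawler described above loops endlessly. pcntl is only available in CLI builds on Unix-like systems, so the demo is guarded.

```php
<?php
// Run $worker in $count child processes and wait for all of them.
// Returns a map of child pid => exit code.
function run_process_group(int $count, callable $worker): array
{
    $pids = [];
    for ($i = 0; $i < $count; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child process: do the work, then exit so it never
            // falls back into the parent's loop and forks again.
            $worker($i);
            exit(0);
        }
        $pids[] = $pid; // Parent process: remember the child.
    }

    $exitCodes = [];
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status); // block until this child finishes
        $exitCodes[$pid] = pcntl_wexitstatus($status);
    }
    return $exitCodes;
}

// Demo: the worker count is a parameter, so it can be tuned per machine.
if (function_exists('pcntl_fork')) {
    $codes = run_process_group(4, function (int $slot): void {
        usleep(1000); // stand-in for "crawl one batch of users"
    });
    echo count($codes), " workers finished\n";
}
```

Because the count is just an argument, finding the fastest setting for a given machine is a matter of timing a few runs with different values.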

1. The user-index process group starts from one user, crawls that user's followees and followers, and merges them into storage. Because there are multiple processes, two processes may store the same user at the same time and create duplicates, so the username column in the database must have a unique index. You could also use a third-party cache such as Redis to guarantee atomicity; that is a matter of preference.

After step one, we get the following list of users:

2. The user-detail process group works in time order: it takes the user stored earliest, crawls that user's details, and sets the user's update time to the current time. This forms an endless loop, so the program can run indefinitely, cycling through the users and refreshing their information.
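The two steps above can be sketched against a small database. This is an illustration under assumptions: the table and column names (users, username, updated_at) and the function names are invented here, and an in-memory SQLite database stands in for whatever database the author used. SQLite spells the duplicate-tolerant insert INSERT OR IGNORE; MySQL's equivalent is INSERT IGNORE.

```php
<?php
// Assumed schema: names are illustrative only.
function create_schema(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS users (
        username   TEXT NOT NULL,
        updated_at INTEGER NOT NULL DEFAULT 0
    )');
    // The unique index is what rejects a second insert of the same user.
    $db->exec('CREATE UNIQUE INDEX IF NOT EXISTS idx_users_username ON users (username)');
}

// Step 1: store a discovered user. Returns true only if the row is new.
function store_user(PDO $db, string $username): bool
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO users (username) VALUES (?)');
    $stmt->execute([$username]);
    return $stmt->rowCount() === 1;
}

// Step 2: pick the least recently updated user and stamp it with $now,
// so the next call moves on to the next-oldest user.
function next_user(PDO $db, int $now): ?string
{
    $row = $db->query(
        'SELECT username FROM users ORDER BY updated_at ASC, rowid ASC LIMIT 1'
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return null; // nothing stored yet
    }
    $stmt = $db->prepare('UPDATE users SET updated_at = ? WHERE username = ?');
    $stmt->execute([$now, $row['username']]);
    return $row['username'];
}

// Demo against an in-memory database (needs the pdo_sqlite extension).
if (extension_loaded('pdo_sqlite')) {
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    create_schema($db);
    foreach (['alice', 'bob', 'alice'] as $name) {
        echo $name, store_user($db, $name) ? " stored\n" : " skipped (duplicate)\n";
    }
    echo next_user($db, time()), "\n";
}
```

Note that the select-then-update in next_user is not atomic. With several detail workers running at once, two of them could pick the same user, so a real multi-process run would need some way to claim a row first.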

The crawler ran stably until the next day, when new data suddenly stopped arriving. I checked and found that Zhihu had changed its rules. I don't know whether that was to block me or just a coincidence, but the data it returned now looked like this:

               

My first thought was that the server was feeding me random output so that I couldn't collect anything. I changed IPs and spoofed some request data, but nothing helped. Then the output suddenly looked familiar: could it be gzip? Skeptical, I tested the gzip theory. The first step, of course, was to tell the server not to send me gzip-compressed data:

I changed "Accept-Encoding:gzip,deflate\r\n" to "Accept-Encoding:deflate\r\n", dropping gzip, and it made no difference at all!

It seems the server insists on sending gzip-compressed data. Fine, then I'll decompress it. I looked up how to decompress gzip in PHP and found a single function, gzinflate, so after fetching the content I added:

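// Skip the 10-byte gzip header so that only the raw deflate stream remains.
$content = substr($content, 10);

// Decompress the deflate stream.
$content = gzinflate($content);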
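As a hedged alternative to stripping the header by hand, PHP 5.4 and later also provide gzdecode, which parses the gzip header and trailer itself. The sketch below is not the author's code: is_gzip and decode_body are hypothetical helper names, and the sample markup is made up. It checks for the gzip magic bytes (0x1f 0x8b) before decoding, and compares the result with the manual approach. The manual variant here also drops the 8-byte trailer (CRC32 and length), which assumes a plain 10-byte header with no optional fields, as gzencode produces.

```php
<?php
// True if $data starts with the gzip magic bytes 0x1f 0x8b.
function is_gzip(string $data): bool
{
    return strncmp($data, "\x1f\x8b", 2) === 0;
}

// Return the decoded body, leaving non-gzip bodies untouched.
function decode_body(string $body): string
{
    if (!is_gzip($body)) {
        return $body;
    }
    $decoded = gzdecode($body);
    if ($decoded === false) {
        throw new RuntimeException('invalid gzip data');
    }
    return $decoded;
}

// Round trip on made-up sample markup.
$html       = '<div class="profile">hello</div>';
$compressed = gzencode($html);

// Library route: gzdecode handles header and trailer.
var_dump(decode_body($compressed) === $html);

// Manual route: strip the 10-byte header and 8-byte trailer, then inflate.
var_dump(gzinflate(substr($compressed, 10, -8)) === $html);
```

When the transfer goes through curl, setting CURLOPT_ENCODING to an empty string asks curl to advertise every encoding it supports and decode the response automatically, which avoids manual decompression altogether.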

Here I really have to say that PHP is the best language in the world: two functions solved the problem completely, and the program happily started running again.

When it came to matching content, Zhihu's attention to detail also helped me enormously. For example, I wanted to determine each user's gender:

                      

                      

Haha, just kidding. Actually, the markup contains the CSS classes icon-profile-female and icon-profile-male ^_^
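Reading the gender off those class names could look something like this. The function name detect_gender and the sample markup are assumptions; only the two class names come from the text above.

```php
<?php
// Return 'female', 'male', or null when neither class name is present.
function detect_gender(string $html): ?string
{
    if (preg_match('/icon-profile-(female|male)\b/', $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Made-up markup for illustration.
var_dump(detect_gender('<i class="icon icon-profile-female"></i>'));
var_dump(detect_gender('<i class="icon icon-profile-male"></i>'));
var_dump(detect_gender('<i class="icon"></i>'));
```

A regular expression is enough here because the class name itself carries the answer; no HTML parser is needed for this one field.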

So what is the point of going to all the trouble of grabbing so many users?

None, really. I was just bored ^_^

With this information you can actually do the kind of "big data analysis" that other people love to brag about. The most common, of course, are gender distribution, geographic distribution, occupational distribution, and the male-to-female ratio in each occupation.
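As a rough sketch of how such distributions could be tallied once the users are in an array: the field names and the sample rows below are made up for illustration and are not crawled data.

```php
<?php
// Count how often each value of $field occurs, most common first.
function distribution(array $rows, string $field): array
{
    $counts = array_count_values(array_column($rows, $field));
    arsort($counts);
    return $counts;
}

// Count genders within each occupation: [occupation => [gender => count]].
function gender_by_occupation(array $rows): array
{
    $out = [];
    foreach ($rows as $row) {
        $occupation = $row['occupation'];
        $gender     = $row['gender'];
        $out[$occupation][$gender] = ($out[$occupation][$gender] ?? 0) + 1;
    }
    return $out;
}

// Made-up sample rows, not real data.
$users = [
    ['gender' => 'male',   'location' => 'Beijing',  'occupation' => 'engineer'],
    ['gender' => 'female', 'location' => 'Shanghai', 'occupation' => 'designer'],
    ['gender' => 'male',   'location' => 'Beijing',  'occupation' => 'engineer'],
    ['gender' => 'female', 'location' => 'Beijing',  'occupation' => 'engineer'],
    ['gender' => 'male',   'location' => 'Shenzhen', 'occupation' => 'designer'],
];

print_r(distribution($users, 'gender'));
print_r(distribution($users, 'location'));
print_r(gender_by_occupation($users));
```

For a million rows the same counts would more sensibly come from GROUP BY queries in the database, but the array version shows the shape of the result.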

Of course, you can also use follower counts, view counts, questions, answers, and so on to see what people care about: livelihood, society, geography, politics. It is a panorama of the whole Internet.

Perhaps you could also analyze the avatars with an open-source porn-detection program to screen out pornography, and then go save Dongguan? ^_^ You could also look at which universities people graduated from and what they ended up doing.

With this data, you can really let your imagination run wild ^_^

Below are some interesting charts built from this data. The live charts can be viewed at http://www.epooll.com/zhihu/

Source: Yang Zetao
