I used crawlers to steal 1 million users a day, just to prove that PHP is the best language in the world.

Source: Internet
Author: User

After reading the Python crawler articles so many friends were sharing, I found them too elementary. Processing text content is precisely PHP's strength; Python's only advantage is that it ships with Linux, like Perl, and Linux isn't very interesting in that respect. The Mac is better: it ships with Python, Perl, PHP, and Ruby. Of course, I also hate debating which language is better; PHP is the best language in the world, as everyone knows.

A few days ago, someone wrote a multi-threaded crawler in C# that captured 30 million QQ users from QQ Zone, 3 million of them with full details (QQ number, nickname, space name, and so on), after running for two weeks. That's nothing. To prove that PHP is the best language in the world (though everyone already understands that ^_^), I wrote a multi-process crawler in PHP, and it captured 1 million users in just one day. So far it has reached the 8th circle (depth = 8) of users associated with each other (followees and followers).

Crawler program design:

Because crawling requires being logged in, I copied the cookie from a logged-in Chrome session and passed it to the curl-based program to simulate a login.
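A minimal sketch of that simulated login, assuming a cookie string copied from Chrome's DevTools; the cookie value and URL below are placeholders, not the author's actual code:

```php
<?php
// Hypothetical sketch: replay a Chrome session cookie with curl.
// The cookie value and URL are placeholders, not real credentials.
$cookie = 'z_c0=PASTE_COOKIE_FROM_CHROME_HERE';

$ch = curl_init('https://www.zhihu.com/people/some-user/followees');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return body as a string
curl_setopt($ch, CURLOPT_COOKIE, $cookie);          // send the copied cookie
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');         // let curl handle gzip
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // look like a browser

// $html = curl_exec($ch);  // uncomment to actually fetch the page
var_dump($ch !== false);    // handle created successfully
```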

The crawler uses two independent groups of looping processes: a user-index process group and a user-detail process group. PHP's pcntl extension is wrapped in a very handy class, similar in spirit to Go's goroutines.
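A minimal sketch of such a process-group wrapper using pcntl_fork; the function name and worker logic are illustrative, not the author's actual class (pcntl requires the PHP CLI on a POSIX system):

```php
<?php
// Illustrative worker pool built on PHP's pcntl extension (CLI, POSIX only):
// fork $n children, run $task in each, then wait for all of them to finish.
function run_process_group(int $n, callable $task): void {
    $pids = [];
    for ($i = 0; $i < $n; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        } elseif ($pid === 0) {
            $task($i);   // child: do the work, then exit
            exit(0);
        }
        $pids[] = $pid;  // parent: remember the child
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);  // reap each child
    }
}

// Example: four workers, each printing its id.
run_process_group(4, function (int $id) {
    echo "worker $id done\n";
});
```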

The following is the user-detail code; the user-index code is similar.

An aside: after testing, my 8-core MacBook ran fastest with 16 processes, while the 16-core Linux server ran fastest with 8 processes. This puzzles me a bit, but since the best process count was determined by testing, I simply stick with the best settings.

1. The user-index process group starts from a single user, captures that user's followees and followers, and inserts them into the database. Because this is multi-process, two processes may handle the same user and create duplicate records, so a unique index must be created on the database's username field; alternatively, a third-party cache such as Redis can be used to guarantee atomicity. That would be the wise approach.
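The dedup-by-unique-index idea can be sketched with PDO. The article's setup is MySQL (where the equivalent is CREATE UNIQUE INDEX plus INSERT IGNORE); an in-memory SQLite database is used here only so the sketch is self-contained:

```php
<?php
// Illustration with in-memory SQLite (the article uses MySQL, where the
// equivalent statements are CREATE UNIQUE INDEX + INSERT IGNORE).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE user (username TEXT)');
$db->exec('CREATE UNIQUE INDEX uniq_username ON user (username)');

// Two processes insert the same user; the unique index drops the duplicate.
$stmt = $db->prepare('INSERT OR IGNORE INTO user (username) VALUES (?)');
foreach (['zhang-san', 'li-si', 'zhang-san'] as $name) {
    $stmt->execute([$name]);
}
echo $db->query('SELECT COUNT(*) FROM user')->fetchColumn(); // 2
```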

After step 1, we will get the following user list:

2. The user-detail process group fetches users in ascending order of update time, starting with the earliest record in the database, captures their details, and then sets the update time to the current time. This forms an endless loop: the program can run forever, continuously refreshing user information.
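One iteration of that endless loop boils down to two statements: fetch the least-recently-updated user, then touch its timestamp. A sketch, again using in-memory SQLite for illustration rather than the author's MySQL setup:

```php
<?php
// Illustration: pick the least-recently-updated user, refresh its details,
// then touch its timestamp. Repeating this forever cycles through every row.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE user (username TEXT, updated_at INTEGER)');
$db->exec("INSERT INTO user VALUES ('a', 100), ('b', 50), ('c', 200)");

// One iteration of the detail-process loop:
$row = $db->query('SELECT rowid, username FROM user
                   ORDER BY updated_at ASC LIMIT 1')->fetch(PDO::FETCH_ASSOC);
// ... fetch and store details for $row['username'] here ...
$touch = $db->prepare('UPDATE user SET updated_at = ? WHERE rowid = ?');
$touch->execute([time(), $row['rowid']]);

echo $row['username']; // b, the stalest user, was picked first
```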

The program ran stably until the next day, when suddenly no new data came in. On checking, I found that Zhihu had changed its rules; whether that was aimed at me or just a coincidence, I don't know. The data returned to me looked like garbage.

My first thought was that the output was being deliberately scrambled, so I changed IP addresses and disguised various request data, but it was useless. Then it suddenly felt familiar: could it be gzip? With a skeptical attitude, I tried the gzip theory. First, I told Zhihu not to send me gzip-compressed data:

Remove "Accept-Encoding: gzip, deflate\r\n" from the request headers.

It seemed gzip compression was being forced on me regardless. In that case, I would decompress it myself. I looked up how PHP decompresses gzip, found the gzinflate function, and applied it to the fetched content as follows:

$content = substr($content, 10);
$content = gzinflate($content);

Of course, you can also use curl's built-in option:

curl_setopt(self::$ch, CURLOPT_ENCODING, 'gzip');

Here I really must say: PHP is truly the best language in the world. One or two functions solved the whole problem, and the program ran happily again.
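To see why stripping 10 bytes works: gzip output begins with a fixed 10-byte header followed by a raw deflate stream (plus an 8-byte trailer that gzinflate ignores once the stream ends), and gzinflate decodes exactly that raw deflate stream. A self-contained round-trip sketch:

```php
<?php
// gzencode() emits: 10-byte gzip header + raw deflate stream + 8-byte trailer.
// gzinflate() decodes a raw deflate stream and stops at end-of-stream, so
// stripping the first 10 bytes is enough even with the trailer still attached.
$original = 'PHP is the best language in the world.';
$gzipped  = gzencode($original);             // what the server sends
$restored = gzinflate(substr($gzipped, 10)); // the trick used above
var_dump($restored === $original);           // bool(true)
```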

When matching page content, Zhihu's attention to detail also helped me enormously. For example, to distinguish user gender:

the profile markup contains the CSS classes icon-profile-female and icon-profile-male ^_^.
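Gender detection from those CSS classes can be sketched with a simple preg_match; the helper function name is mine, not the author's:

```php
<?php
// Hypothetical helper: infer gender from Zhihu's profile-icon CSS class.
function detect_gender(string $html): string {
    if (preg_match('/icon-profile-female/', $html)) return 'female';
    if (preg_match('/icon-profile-male/', $html))   return 'male';
    return 'unknown';
}

echo detect_gender('<i class="icon icon-profile-female"></i>'); // female
```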

So what is the purpose of capturing so many users?

Actually, there is none. I was just bored.

With this information, we can actually do some big-data analysis and show off a little.

Of course, the most common analyses are:

1. Gender distribution

2. Regional distribution

3. Occupation distribution, and which companies people work for

4. Proportion of men and women in each occupation

5. When users registered on Zhihu, and which questions and topics attract the most attention

Of course, we can also sort by follower count, visitor count, question count, and answer count to see what people are paying attention to: people's livelihood, society, geography, politics, the entire internet...

Perhaps the profile pictures could be analyzed too: run them through an open-source porn-detection program to screen out explicit images, and save another Dongguan? ^_^

Then you could also look at what graduates of each university ended up doing.

With this data, can't you let your imagination run wild? ^_^

The following are some interesting charts made from the data; real-time chart data can be viewed at http://www.epooll.com/zhihu/

 
