Writing a Crawler for Fun: The Jiayuan.com Edition


For a while now, I don't know what possessed me, but I had been itching to write a web crawler in Python. Perhaps I had read some expert's write-up and become convinced of how satisfying it would be to write a program that fishes interesting data out of the vast ocean of Internet data, or perhaps I just wanted to experience the value of a code monkey's life. When I decided to spend about a week realizing this grand ambition, I immediately hit a bottleneck, a question that had to be answered quickly: what data should I crawl from the Internet?

After an in-depth investigation, I found that dating sites are an ideal target: I had never seen related posts on Cnblogs (in fact, most Cnblogs posts feel rather shallow), and dating-site data has real research value (hey, my motives are simple, purely academic), since it can be used to analyze the current state of the marriage market, which seems quite meaningful. China currently has three relatively large dating sites: Baihe.com, Zhenai.com, and Jiayuan.com. With limited energy, which one should I pick? After rigorous research (in fact, a casual Baidu search), Jiayuan turned out to have the most users of the three, so without further ado I registered a real-name Jiayuan account and carefully analyzed the structure of the site's pages and a strategy for downloading them.

The URL of a user's home page is very simple: each user has a unique ID, and appending that ID to the domain name in the URL bar gives you the user's personal page. The data on the personal page is not generated dynamically, so all the information can be obtained by parsing the page directly. The only difficulty is that certain fields, such as salary and whether the user owns a house or a car, are not visible without logging in. To get complete data (salary is such important data, how could I ignore it? data that is easy to download is worthless), I spent a lot of effort, roughly 60% of my time, on solving this problem (for the experts this is of course trivial, but I am only a novice).
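To make the URL scheme concrete, here is a minimal sketch of fetching one profile with requests. The domain, path pattern, and user ID below are my own illustration of the scheme described above, not values taken from the original code:

    # Sketch: a user's home page is just the domain plus the user's ID.
    import requests

    user_id = "12345678"                           # made-up ID for illustration
    url = "http://www.jiayuan.com/" + user_id      # assumed URL pattern
    resp = requests.get(url)
    html = resp.content   # static HTML; fields like salary need a logged-in session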

The next question is: how do I get user IDs? I have to say, Jiayuan provides a very good search function. According to the packet capture, after search criteria are chosen, the browser POSTs the selected parameters to the server, and the server returns the short profiles of 25 users in JSON format, exactly one page's worth of data. To crawl continuously you don't need to modify any parameter except the page number; incrementing it page by page lets you keep collecting IDs and then download each user's home page (see the sketch below). To grab as much data as possible, I chose fairly loose criteria: women aged 20 to 28 with photos. The sheer number of matching users surprised me, but while crawling I also found that many of the user IDs returned by search had been blacklisted; "China's number-one dating site" is a bit of a misnomer!
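A minimal sketch of that paging loop follows. The endpoint URL, parameter names, and JSON field names are placeholders I invented for illustration; the real values would come from the packet capture:

    # Sketch: paging through search results, 25 users per page.
    import requests

    SEARCH_URL = "http://search.jiayuan.com/search"   # placeholder endpoint
    form = {"sex": "f", "age_min": 20, "age_max": 28, "has_photo": 1}

    user_ids = []
    for page in range(1, 11):       # first 10 pages, as an example
        form["p"] = page            # the page number is the only field that changes
        resp = requests.post(SEARCH_URL, data=form)
        users = resp.json().get("userinfo", [])       # assumed JSON key
        user_ids.extend(u["uid"] for u in users)      # assumed field name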

With the crawling strategy analyzed, the first thing to implement was automatic login. The most commonly used tool is Python's built-in urllib2 module. Capturing packets to compare the requests sent by Python against those sent by the browser, I found that the Connection field in the header of Python's requests never matched the browser's, and no amount of configuration would fix it; after a bit of Baidu searching I learned that urllib2 simply does not support this. The Long March had barely taken its first step and was already stumbling; how was the rest of the journey supposed to go? Thanks to the forum gurus, I learned that Python has a third-party module, requests, which makes this very convenient and greatly reduces the workload. So how did I implement automatic login?

    # Raw SSL socket approach: replay the browser's captured HTTPS login request.
    import socket
    import ssl

    sock = ssl.wrap_socket(socket.socket())  # wrap a plain TCP socket with SSL
    sock.connect((host, 443))                # host: the login server's hostname
    sock.sendall(https_data)                 # https_data: captured request, header + body
    recv_data = sock.recv(8912)              # read the server's (compressed) response
    sock.close()

Because the username and password are authenticated over SSL encryption, there may be several rounds of data exchange between the client and the server. I found a number of posts online about automatic login, but most did not apply; the only practical one was this piece of code. You only need to capture the host name and the data the browser sends (https_data, including header and body), and the few lines above will get you through the server's verification. The only slightly cumbersome part is that the received data is compressed; after decompressing it, extract the redirect URL and save the returned cookie for subsequent requests, and automatic login is done!
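A minimal sketch of that post-processing, continuing from the recv_data captured above: it assumes the body is gzip-compressed and that the cookie and redirect target sit in ordinary Set-Cookie and Location headers. The regular expressions are my own illustration, not the original code:

    # Sketch: decompress the response, then pull out the cookie and redirect URL.
    import re
    import gzip
    import StringIO   # Python 2.7, matching the version the post targets

    header, _, body = recv_data.partition("\r\n\r\n")   # split HTTP header from body
    page = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()  # gunzip the body

    cookies = re.findall(r"Set-Cookie: ([^;]+);", header)  # illustrative pattern
    location = re.search(r"Location: (\S+)", header)       # redirect to follow next
    jump_url = location.group(1) if location else None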

What remained was relatively simple but tedious work. I parsed the pages directly with regular expressions, which is very efficient and doesn't require understanding the structure of the entire HTML. After parsing, the data is inserted into the database via sqlite3, and that's it; I'll skim over this part because there is really nothing to say. To speed things up, I implemented it with multithreading, but the results were not as good as I had imagined; something to ponder carefully later!
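As an illustration of the parse-and-store step, here is a minimal sketch. The regular expressions, markup, and table schema are all invented for demonstration; the real patterns would be written against Jiayuan's actual page HTML:

    # Sketch: pull fields out with regexes and insert them via sqlite3.
    import re
    import sqlite3

    html = '<span class="nickname">Alice</span> 25 years old'  # fake sample page

    def parse_profile(page):
        # Hypothetical markup; real patterns depend on the site's HTML.
        nickname = re.search(r'class="nickname">([^<]+)<', page)
        age = re.search(r'(\d+) years old', page)
        return (nickname.group(1) if nickname else None,
                int(age.group(1)) if age else None)

    conn = sqlite3.connect("jiayuan.db")
    conn.execute("CREATE TABLE IF NOT EXISTS users (nickname TEXT, age INTEGER)")
    conn.execute("INSERT INTO users VALUES (?, ?)", parse_profile(html))
    conn.commit()
    conn.close()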

Finally, I'm sharing the code with everyone! Running it requires Python 2.7 and the latest requests library (the version I used). Please forgive the rough spots.

Download link: http://files.cnblogs.com/files/lrysjtu/jiayuan.rar
