Writing a Crawler for Fun: The Jiayuan.com Edition


For a while now, I don't know what possessed me, but I had been itching to write a web crawler in Python. Perhaps I had read some expert's write-up and become convinced of how satisfying it would be to write a program that fishes interesting data out of the vast ocean of Internet data, or perhaps I just wanted to experience the value of a code monkey's life. When I decided to spend about a week realizing this grand ambition, I immediately hit a bottleneck, a question that had to be answered quickly: what data should I crawl from the Internet?

After an in-depth investigation, I found that dating sites are an ideal target: I had never seen related posts on Cnblogs (in fact, most Cnblogs posts feel rather shallow), and dating-site data has real research value (hey, my motives are simple, purely academic), since it can be used to analyze the current state of the marriage market, which seems quite meaningful. China currently has three relatively large dating sites: Baihe.com, Zhenai.com, and Jiayuan.com. With limited energy, which one should I pick? After rigorous research (in fact, a casual Baidu search), Jiayuan turned out to have the most users of the three, so without further ado I registered a real-name Jiayuan account and carefully analyzed the structure of the site's pages and a strategy for downloading them.

The URL of a user's home page is very simple: each user has a unique ID, and appending that ID to the domain name in the URL bar gives you the user's personal page. The data on the personal page is not generated dynamically, so all the information can be obtained by parsing the page directly. The only difficulty is that certain fields, such as salary and whether the user owns a house or a car, are not visible without logging in. To get complete data (salary is such important data, how could I ignore it? data that is easy to download is worthless), I spent a lot of effort, roughly 60% of my time, on solving this problem (for the experts this is of course trivial, but I am only a novice).
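To make the URL scheme concrete, here is a minimal sketch of fetching one profile with requests. The domain, path pattern, and user ID below are my own illustration of the scheme described above, not values taken from the original code:

    # Sketch: a user's home page is just the domain plus the user's ID.
    import requests

    user_id = "12345678"                           # made-up ID for illustration
    url = "http://www.jiayuan.com/" + user_id      # assumed URL pattern
    resp = requests.get(url)
    html = resp.content   # static HTML; fields like salary need a logged-in session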

The next question is: how do I get user IDs? I have to say, Jiayuan provides a very good search function. According to the packet capture, after search criteria are chosen, the browser POSTs the selected parameters to the server, and the server returns the short profiles of 25 users in JSON format, exactly one page's worth of data. To crawl continuously you don't need to modify any parameter except the page number; incrementing it page by page lets you keep collecting IDs and then download each user's home page (see the sketch below). To grab as much data as possible, I chose fairly loose criteria: women aged 20 to 28 with photos. The sheer number of matching users surprised me, but while crawling I also found that many of the user IDs returned by search had been blacklisted; "China's number-one dating site" is a bit of a misnomer!
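A minimal sketch of that paging loop follows. The endpoint URL, parameter names, and JSON field names are placeholders I invented for illustration; the real values would come from the packet capture:

    # Sketch: paging through search results, 25 users per page.
    import requests

    SEARCH_URL = "http://search.jiayuan.com/search"   # placeholder endpoint
    form = {"sex": "f", "age_min": 20, "age_max": 28, "has_photo": 1}

    user_ids = []
    for page in range(1, 11):       # first 10 pages, as an example
        form["p"] = page            # the page number is the only field that changes
        resp = requests.post(SEARCH_URL, data=form)
        users = resp.json().get("userinfo", [])       # assumed JSON key
        user_ids.extend(u["uid"] for u in users)      # assumed field name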

With the crawling strategy analyzed, the first thing to implement was automatic login. The most commonly used tool is Python's built-in urllib2 module. Capturing packets to compare the requests sent by Python against those sent by the browser, I found that the Connection field in the header of Python's requests never matched the browser's, and no amount of configuration would fix it; after a bit of Baidu searching I learned that urllib2 simply does not support this. The Long March had barely taken its first step and was already stumbling; how was the rest of the journey supposed to go? Thanks to the forum gurus, I learned that Python has a third-party module, requests, which makes this very convenient and greatly reduces the workload. So how did I implement automatic login?

    # Raw SSL socket approach: replay the browser's captured HTTPS login request.
    import socket
    import ssl

    sock = ssl.wrap_socket(socket.socket())  # wrap a plain TCP socket with SSL
    sock.connect((host, 443))                # host: the login server's hostname
    sock.sendall(https_data)                 # https_data: captured request, header + body
    recv_data = sock.recv(8912)              # read the server's (compressed) response
    sock.close()

Because the username and password are authenticated over SSL encryption, there may be several rounds of data exchange between the client and the server. I found a number of posts online about automatic login, but most did not apply; the only practical one was this piece of code. You only need to capture the host name and the data the browser sends (https_data, including header and body), and the few lines above will get you through the server's verification. The only slightly cumbersome part is that the received data is compressed; after decompressing it, extract the redirect URL and save the returned cookie for subsequent requests, and automatic login is done!
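A minimal sketch of that post-processing, continuing from the recv_data captured above: it assumes the body is gzip-compressed and that the cookie and redirect target sit in ordinary Set-Cookie and Location headers. The regular expressions are my own illustration, not the original code:

    # Sketch: decompress the response, then pull out the cookie and redirect URL.
    import re
    import gzip
    import StringIO   # Python 2.7, matching the version the post targets

    header, _, body = recv_data.partition("\r\n\r\n")   # split HTTP header from body
    page = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()  # gunzip the body

    cookies = re.findall(r"Set-Cookie: ([^;]+);", header)  # illustrative pattern
    location = re.search(r"Location: (\S+)", header)       # redirect to follow next
    jump_url = location.group(1) if location else None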

What remained was relatively simple but tedious work. I parsed the pages directly with regular expressions, which is very efficient and doesn't require understanding the structure of the entire HTML. After parsing, the data is inserted into the database via sqlite3, and that's it; I'll skim over this part because there is really nothing to say. To speed things up, I implemented it with multithreading, but the results were not as good as I had imagined; something to ponder carefully later!
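As an illustration of the parse-and-store step, here is a minimal sketch. The regular expressions, markup, and table schema are all invented for demonstration; the real patterns would be written against Jiayuan's actual page HTML:

    # Sketch: pull fields out with regexes and insert them via sqlite3.
    import re
    import sqlite3

    html = '<span class="nickname">Alice</span> 25 years old'  # fake sample page

    def parse_profile(page):
        # Hypothetical markup; real patterns depend on the site's HTML.
        nickname = re.search(r'class="nickname">([^<]+)<', page)
        age = re.search(r'(\d+) years old', page)
        return (nickname.group(1) if nickname else None,
                int(age.group(1)) if age else None)

    conn = sqlite3.connect("jiayuan.db")
    conn.execute("CREATE TABLE IF NOT EXISTS users (nickname TEXT, age INTEGER)")
    conn.execute("INSERT INTO users VALUES (?, ?)", parse_profile(html))
    conn.commit()
    conn.close()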

Finally, I'm sharing the code with everyone! Running it requires Python 2.7 and the latest requests library (the version I used). Please forgive the rough spots.

Download link: http://files.cnblogs.com/files/lrysjtu/jiayuan.rar
