Python + RabbitMQ: Crawling User Data from a Dating Website

Last Update:2017-07-14 Source: Internet

Author: User

Tags: message queue, rabbitmq

"Always taking from you, but never saying thank you." I have absorbed a lot of knowledge on Cnblogs and Zhihu and have grown here too. It is a great place, so thank you, Cnblogs and Zhihu. Today I am sharing something I worked on during a project, in the hope that it helps others.

Enough chatter, let's go!

Requirements:

The project is a dating website. The main technologies are nginx, a server cluster, Redis caching, MySQL master-slave replication, and Amoeba for read/write splitting. My part was to use RabbitMQ + Python for the data crawling (writing to the database, plus downloading and saving images). As for speed, I crawled from a company computer (i5, 16 GB RAM; the site requires a verified login), and with a faster connection it should be quicker (I will not say much about my company's network ^_^). In my own test crawling used-car data from "Car Home", I got 150,000 to 200,000 records per hour (no login, no anti-crawling measures).

1. What you need to know before crawling:

1) Structure of the Web page:

A. Crawling this site can be painful. There are a great many fields, and the men's and women's profile pages are structured differently: the same data sits in different places on the page, and some tag attributes differ. So I use an if statement to crawl men and women separately. There is also some data conversion and splitting along the way, plus a few pitfalls, which I point out in the code.

B. The site has anti-crawling measures. I previously wrote a crawler with requesting, parsing, and storing all in one file; when requests came too fast, the server forcibly dropped the connection. The fix was to use proxy IPs and rotate the User-Agent, but there was a problem: the proxy IPs I had scraped had very low availability, which made the crawl very slow, so I dropped that approach. It turns out to work fine now without it, and I never fully figured out the site's anti-crawling mechanism.

C. Simulated login: strictly speaking, the login cannot be simulated. The workaround is to log in manually in a browser, copy the cookie from the browser, and put it into the request headers. With the cookie attached, requests are treated as logged in and can fetch the data.

D. Analysis of the pages shows that the URL of a user's detail page changes only in the ID at the end, so the crawl logic is a for loop that appends the ID to the request URL. Deduplication is naturally needed as well. Sometimes an unknown error interrupts the run midway; you can catch the exception around the loop, wrap the whole code in a def, and call it recursively so that it keeps looping.
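Points C and D can be sketched roughly as follows. This is a minimal illustration, not the author's code: the URL pattern `DETAIL_URL`, the cookie value, and the function names `build_request`, `fetch`, and `crawl` are all hypothetical, and it uses only the standard library. It also swaps the recursive restart described above for a bounded retry loop, since unbounded recursion in Python eventually hits the interpreter's recursion limit.

```python
import urllib.request

# Hypothetical values: the article gives neither the real URL pattern nor a cookie.
DETAIL_URL = "https://dating-site.example/user/{user_id}"
COOKIE = "sessionid=PASTE_VALUE_COPIED_FROM_BROWSER"


def build_request(user_id, cookie=COOKIE):
    """Build a request for one profile page that carries the copied login cookie."""
    headers = {
        "Cookie": cookie,  # copied by hand from a logged-in browser session
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    }
    url = DETAIL_URL.format(user_id=user_id)
    return urllib.request.Request(url, headers=headers)


def fetch(user_id):
    """Download one profile page and return its HTML as text."""
    with urllib.request.urlopen(build_request(user_id), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_id, end_id, handle, fetch_page=fetch, seen=None, max_retries=3):
    """Visit IDs in [start_id, end_id), skipping seen IDs and surviving errors.

    handle(user_id, html) is called for every page that downloads successfully.
    Returns the set of processed IDs and the list of IDs that kept failing.
    """
    seen = set() if seen is None else seen
    failed = []
    for user_id in range(start_id, end_id):
        if user_id in seen:  # deduplication: never process an ID twice
            continue
        for _attempt in range(max_retries):
            try:
                handle(user_id, fetch_page(user_id))
            except Exception:  # broad on purpose: the errors are described as unknown
                continue  # try this ID again
            seen.add(user_id)
            break
        else:
            failed.append(user_id)  # every attempt failed; move on to the next ID
    return seen, failed
```

A call such as `crawl(1, 1000, handle=lambda uid, html: print(uid, len(html)))` would walk the IDs in order. Persisting `seen` between runs (for example in Redis or a file) is one way to resume after a crash without re-fetching pages.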

2) For configuring RabbitMQ on Linux, see another of my articles: http://www.cnblogs.com/devinCat/articles/7172927.html

3) Development tools: PyCharm + Python 3.6

2. The role of MQ in this task: the task is split vertically into stages, with a message queue between the stages as the medium that connects them and passes along the required information. This lets the code for the different parts of the task logic run in parallel without interfering with each other, which improves the program's efficiency. The message queue has six working modes; this project uses the work mode (work queues). I suggest searching Baidu to understand the principle; I will not elaborate here. I drew a diagram to help explain the logic and flow of the code, and I hope it helps. (Enlarge it to see it more clearly.)
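As a rough illustration of the work mode, here is a minimal sketch in which one stage publishes downloaded pages to a queue and any number of workers consume them. It assumes the third-party `pika` client (1.x API) and a broker on `localhost`; the article does not say which client library the author used, and the queue name `profile_pages`, the message layout, and all function names here are hypothetical.

```python
import json

QUEUE = "profile_pages"  # hypothetical queue name


def encode_task(user_id, html):
    """Serialize one downloaded page into a message body (bytes)."""
    return json.dumps({"user_id": user_id, "html": html}).encode("utf-8")


def decode_task(body):
    """Inverse of encode_task: return (user_id, html) from a message body."""
    task = json.loads(body.decode("utf-8"))
    return task["user_id"], task["html"]


def open_channel(host="localhost"):
    """Connect to RabbitMQ and declare the durable work queue."""
    import pika  # third-party client, imported lazily so the helpers above work without it

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    return connection, channel


def publish(channel, user_id, html):
    """Producer side: the download stage hands a page to the queue."""
    import pika

    channel.basic_publish(
        exchange="",  # default exchange routes by queue name
        routing_key=QUEUE,
        body=encode_task(user_id, html),
        properties=pika.BasicProperties(delivery_mode=2),  # mark message persistent
    )


def consume(channel, save):
    """Worker side: parse/store stage. Run several workers to share the load."""

    def on_message(ch, method, properties, body):
        user_id, html = decode_task(body)
        save(user_id, html)  # e.g. parse fields, write to MySQL, download images
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the work is done

    channel.basic_qos(prefetch_count=1)  # give each worker one message at a time
    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()  # blocks until the consumer is stopped
```

With `prefetch_count=1` and manual acknowledgements, a slow worker does not hoard messages, and a message whose worker dies before acknowledging is redelivered to another worker. That is what lets the download stage and the storage stage run at different speeds.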

The code is on GitHub. I am a beginner, so I would welcome guidance from more experienced developers: https://github.com/DevinCat/Cat

I will add multithreading later and keep optimizing and improving the code. It was hard work, but getting it working brings a feeling that is hard to put into words ^_^. Advice is very welcome, so feel free to leave a comment!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
