Python + RabbitMQ: Crawling User Data from a Dating Website

Source: Internet
Author: User
Tags: message queue, rabbitmq

"Always asking of you but never saying thank you ~ ~ ~". I have absorbed a lot of knowledge from Cnblogs (Blog Park) and Zhihu, and I have grown here too. It is a great place, so thank you to both. Today I am sharing something from one of my own projects in return, and I hope it helps someone ....

Enough chatter, let's go~~~~!

Requirements:

The project is a dating site. The main technologies are nginx, a server cluster, Redis caching, MySQL master-slave replication, and Amoeba for read/write splitting. My part was the data crawl (writing to the database, and downloading and saving images), which I built with RabbitMQ + Python. As for speed: I ran the crawl on a company computer (i5, 16 GB RAM, and the site requires a verified login), and with a better network it should be faster (I won't say much about my company's network ^_^). In my own test, crawling used-car data from "Car Home" reached 150,000-200,000 records per hour (no login required, no anti-crawling measures).

1. What you need to know before crawling:

1) The structure of the web pages:

A. Crawling this site is painful. There are a great many fields, and the page structure for male users differs from the one for female users: the same data sits in different positions, and some tag attributes differ as well. I therefore use an if statement to handle men and women separately. The process also involves some data conversion and string splitting, and there are a few pitfalls, which I point out in the code.

B. The site has anti-crawling measures. I had previously written a crawler in which requesting, parsing, and storing all lived in one file, and when requests came too fast the server forcibly dropped the connection. The usual fix is to use proxy IPs and rotate the User-Agent, but the proxy IPs I scraped had very low availability, which made the crawl very slow, so I dropped that approach. I have found that the current version works fine without it, although I never fully worked out how the site's anti-crawling mechanism behaves.

C. Simulated login: strictly speaking, I do not simulate the login at all. Instead, I log in manually in the browser, copy the cookie out of the browser, and put it into the request headers. Every request then carries the cookie, so the crawler can fetch data that is only visible to logged-in users.

D. Looking at the pages, the URL of a user's detail page changes only in the ID at the end. The crawl logic is therefore a for loop that appends each ID to the request URL, which also avoids duplicates naturally. Occasionally an unknown error interrupts the crawl midway. To handle this, catch exceptions around the loop, put the whole thing in a function (def), and call it recursively so that the crawl keeps going.
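Points C and D can be sketched roughly as follows. This is only a minimal illustration, not the code from the project: the URL pattern, the cookie string, and the function names (build_request, fetch, crawl) are all hypothetical placeholders. It also swaps the recursive restart described above for a plain try/except inside the loop, because deep recursion in Python eventually hits the recursion limit.

```python
import urllib.request

# Hypothetical placeholders: replace with the real detail-page URL pattern
# and the cookie string copied from a logged-in browser session.
BASE_URL = "https://example.com/user/{user_id}"
COOKIE = "sessionid=PASTE_COOKIE_FROM_BROWSER"


def build_request(user_id):
    """Build a request for one user's detail page, carrying the login cookie."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": COOKIE,
    }
    return urllib.request.Request(BASE_URL.format(user_id=user_id), headers=headers)


def fetch(user_id, timeout=10):
    """Download the detail page for user_id and return it as text."""
    with urllib.request.urlopen(build_request(user_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_id, end_id, seen, fetch_page=fetch):
    """Loop over the IDs in [start_id, end_id), skipping IDs already in `seen`.

    A failure on one ID is logged and skipped so that a single unknown
    error does not stop the whole crawl.
    """
    pages = {}
    for user_id in range(start_id, end_id):
        if user_id in seen:
            continue  # already crawled, e.g. before a restart
        try:
            pages[user_id] = fetch_page(user_id)
        except Exception as exc:
            print("user %d failed: %r" % (user_id, exc))
            continue
        seen.add(user_id)
    return pages
```

Keeping the finished IDs in a set (or persisting them somewhere) means that a restarted crawl can skip the pages it has already fetched instead of starting over.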

2) Configuring RabbitMQ on Linux is covered in another article of mine: http://www.cnblogs.com/devinCat/articles/7172927.html

3) Development tools: PyCharm + Python 3.6

2. The role of MQ in this task: the task is split vertically into stages, and message queues act as the medium that connects the stages and passes along the information each one needs. Code that implements different parts of the task can then run in parallel without interfering with each other, which improves the program's efficiency. RabbitMQ has six working modes; the one used here is Work mode (work queues). I suggest searching online to understand the principle, so I won't elaborate here. I drew a diagram to help you understand the logic and flow of the code, and I hope it helps!
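For readers who have not used Work mode before, here is a rough sketch of what a producer and a worker can look like with the pika client library. It is an assumption-laden illustration, not the project's actual code: the queue name, the JSON message layout, and the helper names (encode_task, decode_task, publish_tasks, make_callback, run_worker) are all made up for this example, and pika is a third-party package that must be installed separately.

```python
import json

QUEUE_NAME = "user_ids"  # hypothetical queue name


def encode_task(user_id):
    """Serialize one crawl task as a JSON message body."""
    return json.dumps({"user_id": user_id}).encode("utf-8")


def decode_task(body):
    """Read the user ID back out of a message body."""
    return json.loads(body.decode("utf-8"))["user_id"]


def publish_tasks(channel, user_ids, properties=None):
    """Producer side: put one message per user ID on the work queue."""
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    for user_id in user_ids:
        channel.basic_publish(
            exchange="",              # default exchange routes by queue name
            routing_key=QUEUE_NAME,
            body=encode_task(user_id),
            properties=properties,
        )


def make_callback(handle_user):
    """Wrap handle_user(user_id) in a consumer callback that acks on success."""
    def on_message(ch, method, properties, body):
        handle_user(decode_task(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)
    return on_message


def run_worker(handle_user, host="localhost"):
    """Worker side: consume tasks one at a time until interrupted."""
    import pika  # third-party; imported here so the helpers above work without it

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    channel.basic_qos(prefetch_count=1)  # hand each worker one unacked task at a time
    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=make_callback(handle_user))
    channel.start_consuming()
```

Starting run_worker in several processes gives several workers sharing one queue, and RabbitMQ distributes the messages among them. To make messages survive a broker restart, the producer can pass properties=pika.BasicProperties(delivery_mode=2) to publish_tasks.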

The code is on GitHub. I am still a beginner, so I would welcome guidance from the experts: https://github.com/DevinCat/Cat

I plan to add multithreading later and to keep optimizing and improving the code. It was hard work, but getting it to run still brings a certain sense of satisfaction ^_^. Advice is very welcome, so feel free to leave a comment!
