My Road to Writing a Python Crawler

Source: Internet
Author: User

In order to build a Baidu-style web search engine, I started studying crawlers and got hooked. Here is a summary of my crawler experience so far; by the way, the engine lives at http://www.quzhuanpan.com

First of all, you need to understand basic Python syntax. I recommend the book *Basic Python Tutorial*, which is very suitable for getting started.

Second, analyze your crawler's requirements. What is the program's overall flow? Sketch the general framework, and think about where the difficulties are likely to be.

Then find out which libraries a typical crawler needs; they can solve a lot of problems for you. I recommend requests ("HTTP for Humans"); other libraries, such as urllib2 and BeautifulSoup, are also worth learning.

Start writing. When you hit a problem, Google it; when Google doesn't have the answer, do what I did and send a private message to an expert on Zhihu. In the process of writing you will also pick up a lot of related knowledge, such as the HTTP protocol and multithreading.

Here are some special cases to handle:

1. Handling login

Login is a POST request: send the form data to the server, then store the cookie the server returns locally for later requests.
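The login step above can be sketched with the standard library (urllib plus http.cookiejar). The URL and form field names below are hypothetical; with the requests library, the same idea is `requests.Session().post(...)`, since a Session stores returned cookies automatically.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# A CookieJar holds the cookies the server sends back via Set-Cookie,
# so subsequent requests through this opener are sent as a logged-in user.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

def login(username, password):
    # POST the login form (hypothetical endpoint and field names);
    # the server's Set-Cookie headers land in cookie_jar.
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    return opener.open("https://example.com/login", data=form)
```

After `login()` succeeds, fetch protected pages through the same `opener` so the stored cookie is sent along.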

2. Logging in with cookies

If you send a saved cookie, the server will treat you as a logged-in user and return logged-in content. This is also how to handle sites that require a CAPTCHA: log in once manually (solving the CAPTCHA), then reuse the resulting cookie.
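A minimal sketch of cookie reuse: copy the session cookie from your browser after a manual login and attach it to each request. The cookie name `sessionid` and the URL are hypothetical; use whatever your target site actually sets.

```python
import urllib.request

def logged_in_request(url, cookie_value):
    # Attach a cookie captured from a manual (CAPTCHA) login;
    # the server sees it and treats us as already logged in.
    req = urllib.request.Request(url)
    # "sessionid" is a hypothetical cookie name - copy the real
    # name/value pair from your browser's developer tools.
    req.add_header("Cookie", "sessionid=" + cookie_value)
    return req

req = logged_in_request("https://example.com/profile", "abc123")
```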

3. Handling IP restrictions. This also resolves the "frequent clicks" case, where the site demands a CAPTCHA because you are requesting too often. The best approach is to maintain a pool of proxy IPs: the Internet has many free proxies of mixed quality, and usable ones can be found by screening them. For "frequent clicks", we can also avoid being banned by limiting how often the crawler hits the site.
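A proxy pool can be sketched as follows: keep a list of screened proxies and route each request through a randomly chosen one, so no single IP hits the site too often. The proxy addresses below are placeholders, not real servers.

```python
import random
import urllib.request

# Hypothetical pool of free proxies, pre-screened for liveness;
# replace with addresses you have actually verified.
PROXY_POOL = [
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
]

def fetch_via_random_proxy(url):
    # Pick a proxy at random for each request to spread load
    # across IPs and dodge per-IP rate limits.
    proxy = random.choice(PROXY_POOL)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=10)
```

In practice you would also re-screen the pool periodically and drop proxies that time out.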

4. Handling request-frequency limits.

Whether you use requests or urllib2, you can throttle your crawler with the sleep() function from the time library:
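A minimal throttling sketch: sleep for a fixed delay between requests. The `fetch` parameter is a placeholder so the loop is self-contained; in a real crawler you would pass in a function that calls urllib or requests.

```python
import time

def polite_fetch(urls, delay=1.0, fetch=lambda url: url):
    # Fetch each URL in turn, sleeping between requests so the
    # crawler's frequency stays below the site's limit.
    # `fetch` is a stand-in: substitute a real urllib/requests call.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results
```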

5. Some websites check whether the visitor is a real browser or an automated program. In that case, add a User-Agent header to present yourself as a browser. Some sites also check whether you send a Referer header and whether it is legitimate, so it is generally worth adding a Referer as well.
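Setting both headers can be sketched like this; the User-Agent string is an abbreviated example of a desktop browser's, and the URLs are hypothetical. The Referer should normally be a page on the same site, as if you had clicked through to the target URL.

```python
import urllib.request

def browser_like_request(url, referer):
    # Claim to be a browser via User-Agent, and supply a Referer
    # that looks like we arrived from inside the site.
    return urllib.request.Request(
        url,
        headers={
            # Shortened example UA string; copy a full one from a real browser.
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Referer": referer,
        },
    )

req = browser_like_request("https://example.com/page", "https://example.com/")
```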

Thanks for reading, everyone.
