Crawlers really are interesting. Before this post I had already built a simple crawler with urllib2 and BeautifulSoup, and implemented one with Scrapy as well. Since I want to get better at crawling, I'll keep notes as much as possible; this post records today's learning process.
1. Regular expressions
Regular expressions are a very powerful tool with many syntax rules. The ones I use most often in crawlers are:
| Pattern | Meaning |
| --- | --- |
| `.` | Matches any character except a newline |
| `*` | Matches the preceding character zero or more times |
| `?` | Matches the preceding character zero or one time |
| `.*` | Greedy match |
| `.*?` | Non-greedy match |
| `(.*?)` | Captures the non-greedy match as a group and outputs it |
| `\d` | Matches a digit |
| `re.S` | A flag that makes `.` match newlines as well |
Commonly used `re` functions include `re.findall()`, `re.search()`, and `re.sub()`.
To practice the patterns and functions above, see the code: https://github.com/Ben0825/Crawler/blob/master/re_test.py
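As a quick sketch of how these patterns behave (greedy vs. non-greedy matching and the `re.S` flag), here is a minimal example; the HTML strings are invented for illustration:

```python
import re

html = '<div class="author">Alice</div><div class="author">Bob</div>'

# Greedy: .* runs to the LAST closing tag, swallowing both divs in one match.
greedy = re.findall(r'<div class="author">(.*)</div>', html)
# Non-greedy: .*? stops at the FIRST closing tag, one name per match.
lazy = re.findall(r'<div class="author">(.*?)</div>', html)
print(greedy)  # ['Alice</div><div class="author">Bob']
print(lazy)    # ['Alice', 'Bob']

# re.S lets . match newlines, so a pattern can span several lines.
multiline = '<span>\n42\n</span>'
answer = re.search(r'<span>(.*?)</span>', multiline, re.S).group(1).strip()
print(answer)  # 42

# re.sub replaces every match of the pattern.
print(re.sub(r'\d', '#', 'tel: 123-456'))  # tel: ###-###
```

The greedy/non-greedy difference is exactly why `(.*?)` shows up so often in crawler code: it extracts one field per HTML tag instead of gobbling the rest of the page.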
2. urllib and urllib2
The urllib and urllib2 libraries are the most basic tools for learning Python crawlers. With them we can fetch the content of a web page and then, combined with regular expressions, extract and analyze that content to get exactly the results we want.
Here, urllib and urllib2 are combined with a regular expression to crawl the author, content, and upvote count of posts on Qiushibaike (the "Embarrassing Things Encyclopedia", a Chinese joke site).
Code: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
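The gist of such a script can be sketched as below. The page snippet and class names are a simplified stand-in for Qiushibaike's real markup, not copied from it, and the import falls back to `urllib.request` so the sketch also runs on Python 3. The parsing is kept separate from the download so it can be tried offline:

```python
import re

try:                      # Python 2, as used in the original post
    from urllib2 import Request, urlopen
except ImportError:       # Python 3 fallback
    from urllib.request import Request, urlopen

# Simplified stand-in for one Qiushibaike post; the real markup differs.
SAMPLE = '''
<div class="author">somebody</div>
<div class="content">a short joke...</div>
<i class="number">1024</i>
'''

def parse_posts(html):
    """Pull (author, content, upvotes) triples out of a page with one regex."""
    pattern = re.compile(
        r'<div class="author">(.*?)</div>.*?'
        r'<div class="content">(.*?)</div>.*?'
        r'<i class="number">(\d+)</i>',
        re.S)  # re.S so .*? can cross line breaks between the fields
    return pattern.findall(html)

def fetch(url):
    """Download a page; a User-Agent header avoids naive bot blocking."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read().decode('utf-8')

print(parse_posts(SAMPLE))
# [('somebody', 'a short joke...', '1024')]
```

In a real run you would call `parse_posts(fetch(page_url))` for each page of the site.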
3. BeautifulSoup
BeautifulSoup is a Python library whose main purpose is to extract data from web pages. The official introduction goes like this:
Beautiful Soup provides a few simple, Python-style functions for navigating, searching, and modifying a parse tree. It is a toolkit that extracts the data you need by parsing the document; because it is simple, a complete application doesn't require much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.
Beautiful Soup works alongside parsers such as lxml and html5lib, giving users the flexibility to choose different parsing strategies or trade them for speed.
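A minimal sketch of the basic BeautifulSoup workflow (build the tree, navigate to one node, search for many). The HTML snippet and attribute names here are invented for illustration, and the built-in `html.parser` is used so no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<html><body>
  <h1 id="title">Python</h1>
  <a href="/item/history" class="inner">History</a>
  <a href="/item/syntax" class="inner">Syntax</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')  # build the parse tree

title = soup.find('h1', id='title').get_text()          # navigate to one node
links = [(a['href'], a.get_text())                      # search for many nodes
         for a in soup.find_all('a', class_='inner')]

print(title)  # Python
print(links)  # [('/item/history', 'History'), ('/item/syntax', 'Syntax')]
```

Compared with raw regular expressions, the same extraction reads as tag and attribute lookups, which survives small changes in page layout much better.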
First example: crawl 100 pages related to the Baidu Baike "Python" entry; the number of pages to crawl is set by yourself.
Code: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
Run result:
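The crawl loop behind a spider like this can be sketched roughly as follows. This is not the repository's code, just a common pattern: keep a queue of new URLs and a set of already-seen ones, and stop after a page budget. The fetcher is passed in as a function, so the loop can be exercised offline against a tiny in-memory "site":

```python
import re

def extract_links(html):
    """Collect href values; a real spider would also filter and absolutize them."""
    return re.findall(r'href="(.*?)"', html)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed, visiting at most max_pages URLs."""
    queue, seen, pages = [seed], {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html):
            if link not in seen:     # skip URLs we already queued
                seen.add(link)
                queue.append(link)
    return pages

# Tiny in-memory "site" standing in for Baidu Baike, so the loop runs offline.
site = {
    '/item/Python': '<a href="/item/Guido"></a><a href="/item/CPython"></a>',
    '/item/Guido': '<a href="/item/Python"></a>',
    '/item/CPython': '',
}
pages = crawl('/item/Python', fetch=site.get, max_pages=100)
print(sorted(pages))  # ['/item/CPython', '/item/Guido', '/item/Python']
```

With a real downloader in place of `site.get` and BeautifulSoup in place of the one-line regex, this is the skeleton of a fixed-budget entry crawler.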
Second example: continuing from the above, fetch a list of books from Douban by book tag, again using BeautifulSoup.
Code: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
Run result:
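A hedged sketch of extracting a book list from such a tag page. The snippet below imitates Douban's list markup, but the class names (`subject-item`, `rating_nums`) and layout are assumptions for illustration, not guaranteed to match the live site:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Simplified stand-in for one entry of a Douban tag page; the real
# class names and layout are assumptions, not copied from the site.
SAMPLE = '''
<li class="subject-item">
  <h2><a href="https://book.douban.com/subject/1/" title="Some Book">Some Book</a></h2>
  <span class="rating_nums">8.9</span>
</li>
'''

def parse_books(html):
    """Return (title, url, rating) triples for every book on a tag page."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.find_all('li', class_='subject-item'):
        a = item.h2.a                                   # title link inside <h2>
        rating = item.find('span', class_='rating_nums')
        books.append((a['title'], a['href'],
                      rating.get_text() if rating else None))
    return books

print(parse_books(SAMPLE))
# [('Some Book', 'https://book.douban.com/subject/1/', '8.9')]
```

Looping this parser over each page of a tag yields the full book list for that tag.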
That's what I studied today. Crawlers really are interesting! Tomorrow I'll continue with Scrapy.