A summary of writing simple crawlers in Python.
Crawlers are really interesting. I have previously written simple crawlers with urllib2 and BeautifulSoup, and I have also tried Scrapy. Lately I want to learn crawlers more thoroughly, so I am recording as much as I can. This post is simply a record of what I studied today.
1. Regular expressions
Regular expressions are a powerful tool. The following syntax rules are commonly used in crawlers:
. | Matches any character (except a newline)
* | Matches the preceding character 0 or more times
? | Matches the preceding character 0 or 1 time
.* | Greedy match
.*? | Non-greedy match
(.*?) | Captures and outputs the matched text inside the parentheses
\d | Matches a digit
re.S | Makes . match newlines as well
Common re methods include findall(), search(), and sub().
To practice the syntax above, see the code: https://github.com/Ben0825/Crawler/blob/master/re_test.py
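A minimal sketch of these patterns in action (Python 2, to match the libraries used in this post; the HTML string is made up for illustration):

```python
import re

html = '<div class="post">Hello</div>\n<div class="post">World</div>'

# Non-greedy capture: (.*?) grabs only the text inside each div,
# and re.S lets . cross the newline between the two divs
posts = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
print posts  # ['Hello', 'World']

# search() returns the first match; sub() replaces every match
m = re.search(r'\d+', 'page 42 of 100')
print m.group()                               # '42'
print re.sub(r'\d+', 'N', 'page 42 of 100')   # 'page N of N'
```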
2. urllib and urllib2
The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers (in Python 3 they were merged into urllib.request). With them we can fetch the content of a web page, then extract and analyze it with regular expressions to get the results we expect.
Here, urllib2 is combined with regular expressions to crawl posts from Qiushibaike.
For the code, see: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
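The general shape of that script is roughly as follows (a sketch only: the URL, headers, and regex here are illustrative, not the exact ones in the linked code, and the site's markup may have changed):

```python
# -*- coding: utf-8 -*-
import re
import urllib2

url = 'http://www.qiushibaike.com/hot/page/1'   # illustrative page URL
headers = {'User-Agent': 'Mozilla/5.0'}          # some sites reject urllib2's default UA

request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()

# Illustrative pattern: non-greedily capture the text of each post
pattern = re.compile(r'<div class="content">(.*?)</div>', re.S)
for item in pattern.findall(html):
    print item.strip()
```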
3. BeautifulSoup
BeautifulSoup is a Python library whose main purpose is to extract data from web pages. The official introduction is as follows:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses documents and hands users the data they want to extract. Because it is simple, a complete application can be written without much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case Beautiful Soup cannot detect the encoding automatically and you just need to state the original encoding.
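For example (a minimal illustration using BS4's from_encoding parameter; the byte string is GBK-encoded sample text standing in for a page with no declared encoding):

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# GBK bytes for a short page that declares no encoding
raw = '<p>\xc4\xe3\xba\xc3</p>'

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
print soup.original_encoding              # 'gbk'
print soup.p.get_text().encode('utf-8')   # the decoded text, re-encoded as UTF-8
```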
Working with parsers such as lxml and html5lib, Beautiful Soup flexibly offers users different parsing strategies or strong speed.
First, crawl 100 related pages under the Python entry of Baidu Baike (the Baidu encyclopedia); the number of pages is a configurable value.
For the code, see: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
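In outline, the spider does something like this (a rough sketch, not the linked repo's exact structure; the entry URL and the /view/ link pattern are assumptions about Baidu Baike's old layout):

```python
# -*- coding: utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup

# Illustrative root URL for the Python entry; the URL scheme may have changed
root_url = 'http://baike.baidu.com/view/21087.htm'
to_crawl, seen, count = [root_url], set(), 0

while to_crawl and count < 100:          # stop after 100 pages
    url = to_crawl.pop()
    if url in seen:
        continue
    seen.add(url)
    count += 1

    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Print the entry title, if the page has one
    title = soup.find('h1')
    if title is not None:
        print count, title.get_text().encode('utf-8')

    # Queue links that look like other encyclopedia entries
    for a in soup.find_all('a', href=re.compile(r'/view/\d+\.htm')):
        to_crawl.append('http://baike.baidu.com' + a['href'])
```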
To consolidate the above, use BeautifulSoup to crawl books from Douban by book tag.
For the code, see: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
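The idea looks roughly like this (a sketch under assumptions: the tag URL and the h2/a selector are guesses about Douban's page structure, not taken from the linked script):

```python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

tag = 'python'   # illustrative tag
url = 'https://book.douban.com/tag/' + tag   # Douban's tag page; layout may have changed

request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(request).read(), 'html.parser')

# Assumed selector: book titles sit in <a> tags inside <h2> elements
for h2 in soup.find_all('h2'):
    a = h2.find('a')
    if a is not None:
        print a.get_text(strip=True).encode('utf-8')
```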
That is what I learned today. Crawlers are really interesting; I will continue with Scrapy tomorrow!