A Summary of Simple Crawlers in Python


Crawlers are really interesting. I have written about them before, implementing simple crawlers with urllib2 and BeautifulSoup, and I have also tried Scrapy. Since I want to learn crawlers more thoroughly, I will record as much as I can along the way; this post is simply a log of what I studied today.

1. Regular expressions

Regular expressions are a powerful tool, and the following syntax rules come up constantly in crawlers:

.      matches any single character (except a newline)
*      matches the preceding character 0 or more times
?      matches the preceding character 0 or 1 time
.*     greedy match: as much text as possible
.*?    non-greedy match: as little text as possible
(.*?)  captures the matched text inside the parentheses as a group
\d     matches a digit
re.S   makes . match newlines as well


Commonly used re methods include findall(), search(), and sub().

For practice code covering the syntax and methods above, see: https://github.com/Ben0825/Crawler/blob/master/re_test.py
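
As a quick illustration of these patterns, independent of the linked repo, here is a minimal sketch:

# Minimal regex sketch: findall(), search(), and sub() on a small HTML snippet.
import re

html = '<div class="post">Python 3.9</div><div class="post">Regex 101</div>'

# (.*?) captures the text between the tags; re.S lets . match newlines too
titles = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
print(titles)                             # ['Python 3.9', 'Regex 101']

# search() returns the first match object, or None if nothing matches
match = re.search(r'\d+\.\d+', html)
if match:
    print(match.group())                  # 3.9

# sub() replaces every match
print(re.sub(r'\d', '#', 'Python 3.9'))   # Python #.#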

2. urllib and urllib2

The urllib and urllib2 libraries are the most basic tools for learning Python crawlers. With them we can fetch a page's content, then extract and analyze it with regular expressions to get the results we want.

Here, we combine urllib2 with regular expressions to crawl posts from Qiushibaike ("qiubai").

For the code, see: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
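
A compressed sketch of the same idea follows. It assumes Python 2 (urllib2 became urllib.request in Python 3), and the URL and regex are illustrative rather than the exact ones used in qiubai_test.py:

# Python 2 sketch: fetch a page with urllib2, then pull content out with a regex.
import re
import urllib2

url = 'http://www.qiushibaike.com/hot/page/1'
# Many sites reject requests that lack a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}
request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read().decode('utf-8')

# Non-greedy (.*?) grabs each post body; re.S lets the match span line breaks
posts = re.findall(r'<div class="content">(.*?)</div>', html, re.S)
for post in posts:
    print(post.strip())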

3. BeautifulSoup

BeautifulSoup is a Python library whose main job is to extract data from web pages. The official introduction goes roughly as follows:
Beautiful Soup provides a handful of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need; because it is so simple, a complete application can be written without much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case Beautiful Soup cannot detect it automatically and you just have to state the original encoding.
Like lxml and html5lib, Beautiful Soup has become an excellent Python parser, giving users the flexibility to choose different parsing strategies or to trade them for raw speed.
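
Before the exercises, a minimal sketch of the basic BeautifulSoup workflow; the snippet and class names here are made up for illustration:

# Minimal BeautifulSoup sketch: parse a snippet, then navigate and search the tree.
from bs4 import BeautifulSoup

html = '''
<html><body>
<a class="entry" href="/item/Python">Python</a>
<a class="entry" href="/item/Scrapy">Scrapy</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, find_all() returns all of them
first = soup.find('a')
print(first['href'])        # /item/Python
print(first.get_text())     # Python

for link in soup.find_all('a', class_='entry'):
    print(link['href'])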

The first exercise crawls 100 related pages starting from the Python entry on Baidu Baike; the number of pages can be set.

For the code, see: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
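
The repo holds the full spider; below is only a single-file sketch of the crawl loop it describes, in the same Python 2 style as the rest of this post. The selectors (the /item/ link prefix, the h1 title) are assumptions and may not match Baidu Baike's current markup:

# Sketch of the crawl loop: start at the Python entry on Baidu Baike and
# follow in-site links until `count` pages have been visited.
import re
import urllib2
from bs4 import BeautifulSoup

root = 'http://baike.baidu.com/item/Python'
to_visit = [root]
visited = set()
count = 100   # the page value mentioned above

while to_visit and len(visited) < count:
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        html = urllib2.urlopen(url).read()
    except Exception:
        continue   # skip pages that fail to download
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1')
    if title:
        print(title.get_text())
    # Queue the other encyclopedia entries linked from this page
    for a in soup.find_all('a', href=re.compile(r'^/item/')):
        to_visit.append('http://baike.baidu.com' + a['href'])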

Code running: (output screenshot omitted.)

To consolidate what we have covered, the final exercise uses BeautifulSoup to fetch the list of books under a given tag on Douban.

For the code, see: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
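
A hedged sketch of what that crawl looks like: the tag URL follows Douban's real structure, but the subject-item class is an assumption, so treat doubanTag.py in the repo as the authoritative version:

# Sketch: list the book titles under one Douban tag (Python 2 style).
import urllib2
from bs4 import BeautifulSoup

url = 'https://book.douban.com/tag/Python'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}
html = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
soup = BeautifulSoup(html, 'html.parser')

# Each book on a tag page is assumed to sit in an <li class="subject-item">
for item in soup.find_all('li', class_='subject-item'):
    link = item.find('a', title=True)
    if link:
        print(link['title'])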

Running result: (output screenshot omitted.)

That is some of what I learned today. Crawlers are really interesting; I will continue with Scrapy tomorrow!
