A Simple Python Crawler


Crawlers are really interesting. Before this, I implemented a simple crawler with urllib2 and BeautifulSoup, and I have also built one with Scrapy. Lately I want to learn crawling more thoroughly, so I'm recording as much as I can. This post is about what I learned today.

1. Regular Expressions

Regular expressions are a very powerful tool with many syntax rules. The ones I use most often in crawlers are:

.     Matches any character (except a newline)
*     Matches the preceding character 0 or more times
?     Matches the preceding character 0 or 1 times
.*    Greedy match (as long as possible)
.*?   Non-greedy match (as short as possible)
(.*?) Captures the text matched inside the parentheses
\d    Matches a digit
re.S  Makes . also match newlines

Commonly used methods include findall(), search(), and sub().

To practice the syntax and methods above, see the code: https://github.com/Ben0825/Crawler/blob/master/re_test.py
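The syntax above can be tried out in a few lines with Python's standard `re` module. This is a minimal sketch of my own, not the linked repo's code:

```python
# A few small examples of the regex syntax listed above (stdlib re module).
import re

text = "python 123, Python 456"

# .* is greedy (grabs as much as possible), .*? is non-greedy (as little
# as possible); (.*?) captures what it matched.
greedy = re.search(r"python(.*)4", text, re.I).group(1)  # up to the LAST "4"
lazy = re.search(r"python(.*?)\d", text, re.I).group(1)  # stops at the FIRST digit

# \d matches a digit; findall() returns every match.
numbers = re.findall(r"\d+", text)   # ['123', '456']

# sub() replaces every match.
masked = re.sub(r"\d+", "#", text)   # 'python #, Python #'

# re.S (DOTALL) lets . match newlines as well.
multiline = "a\nb"
assert re.search(r"a.b", multiline) is None
assert re.search(r"a.b", multiline, re.S) is not None
```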

2. urllib and urllib2

The urllib and urllib2 libraries are the most basic ones for learning Python crawling. With them we can fetch the content of a web page, and then, combined with regular expressions, extract and analyze that content to get exactly what we want.

Here, urllib and urllib2 are combined to crawl the authors and most-upvoted posts from Qiushibaike (a Chinese joke-sharing site).

See the code: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
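The fetch-then-extract pattern looks roughly like this. It is a hedged Python 3 sketch (the original post uses Python 2's urllib2, which became urllib.request in Python 3); the regex and the `author`/`content` class names are illustrative assumptions, not Qiushibaike's real markup:

```python
# Hedged Python 3 sketch of "fetch a page, extract fields with a regex".
# The HTML structure assumed below is made up for illustration.
import re
import urllib.request

def fetch(url):
    # Many sites reject Python's default User-Agent, so set a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_posts(html):
    # (.*?) non-greedily captures author and content; re.S lets . span newlines.
    pattern = re.compile(
        r'<div class="author">(.*?)</div>.*?<div class="content">(.*?)</div>',
        re.S)
    return pattern.findall(html)

# Usage (live call, left commented so the sketch stays offline):
# html = fetch("https://www.qiushibaike.com/")
# for author, content in extract_posts(html):
#     print(author.strip(), content.strip())
```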

3. BeautifulSoup

BeautifulSoup is a Python library whose main purpose is extracting data from web pages. The official introduction goes like this:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data users need by parsing documents; because it is so simple, a complete application doesn't take much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify one and Beautiful Soup can't detect it automatically; in that case you only need to state the original encoding.
Beautiful Soup supports parsers such as lxml and html5lib alongside Python's built-in one, giving users the flexibility to choose different parsing strategies or trade them for speed.

First: crawl 100 pages related to the Baidu Baike entry for Python; the number of pages to crawl can be set by the user.

See the code: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
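The core of a "crawl N related pages" spider is a loop over a queue of unvisited URLs, a visited set, and a page budget. Here is a hedged sketch of that loop with the downloader stubbed out by a canned link graph so it runs offline; it is my own illustration, not the repo's actual code:

```python
# Hedged sketch of a bounded breadth-first crawl loop.
# FAKE_SITE stands in for the real downloader/parser, which would fetch a
# Baidu Baike page and extract its outgoing /item/ links.
from collections import deque

FAKE_SITE = {
    "/item/Python": ["/item/Guido", "/item/CPython"],
    "/item/Guido": ["/item/Python"],
    "/item/CPython": ["/item/PyPy"],
    "/item/PyPy": [],
}

def download_and_parse(url):
    # Real version: fetch the page, extract its title, summary, and links.
    return FAKE_SITE.get(url, [])

def crawl(root, max_pages):
    queue, visited = deque([root]), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already crawled via another path
        visited.add(url)
        for link in download_and_parse(url):
            if link not in visited:
                queue.append(link)
    return visited

pages = crawl("/item/Python", max_pages=3)
# pages == {'/item/Python', '/item/Guido', '/item/CPython'}
```

The visited set is what keeps the crawler from looping forever on pages that link back to each other, and max_pages is the knob the post mentions for choosing how many pages to crawl.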

Second: to consolidate, fetch a list of books under a given tag on Douban, again using BeautifulSoup.

See the code: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
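The extraction step with BeautifulSoup looks roughly like this. A hedged sketch (requires `pip install beautifulsoup4`): the HTML below is a made-up stand-in for a Douban tag page, whose real class names differ and which also needs a proper User-Agent to fetch:

```python
# Hedged BeautifulSoup sketch: pull book titles and links out of a page.
# The markup here is invented for illustration.
from bs4 import BeautifulSoup

html = """
<ul class="book-list">
  <li><a class="title" href="/book/1">Fluent Python</a></li>
  <li><a class="title" href="/book/2">Effective Python</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
books = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="title")]
# books == [('Fluent Python', '/book/1'), ('Effective Python', '/book/2')]
```

Compared with raw regexes, find_all() keeps working even when attributes move around or whitespace changes, which is why BeautifulSoup is the more robust choice for pages like this.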


That's all for today's study. Crawlers really are interesting; tomorrow I'll continue with Scrapy!

