Crawlers really are interesting. Before this post I had already built a simple crawler with urllib2 and BeautifulSoup, and implemented one with Scrapy as well. Since I want to get better at crawling, I'll keep notes as much as possible; this post records today's learning process.
1. Regular expressions
Regular expressions are a very powerful tool with many syntax rules. The ones I use most often in crawlers are:
| Pattern | Meaning |
| --- | --- |
| `.` | Matches any character except a newline |
| `*` | Matches the preceding character zero or more times |
| `?` | Matches the preceding character zero or one time |
| `.*` | Greedy match |
| `.*?` | Non-greedy match |
| `(.*?)` | Captures the non-greedy match as a group and outputs it |
| `\d` | Matches a digit |
| `re.S` | A flag that makes `.` match newlines as well |
Commonly used `re` functions include `re.findall()`, `re.search()`, and `re.sub()`.
To practice the patterns and functions above, see the code: https://github.com/Ben0825/Crawler/blob/master/re_test.py
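As a quick sketch of how these patterns behave (greedy vs. non-greedy matching and the `re.S` flag), here is a minimal example; the HTML strings are invented for illustration:

```python
import re

html = '<div class="author">Alice</div><div class="author">Bob</div>'

# Greedy: .* runs to the LAST closing tag, swallowing both divs in one match.
greedy = re.findall(r'<div class="author">(.*)</div>', html)
# Non-greedy: .*? stops at the FIRST closing tag, one name per match.
lazy = re.findall(r'<div class="author">(.*?)</div>', html)
print(greedy)  # ['Alice</div><div class="author">Bob']
print(lazy)    # ['Alice', 'Bob']

# re.S lets . match newlines, so a pattern can span several lines.
multiline = '<span>\n42\n</span>'
answer = re.search(r'<span>(.*?)</span>', multiline, re.S).group(1).strip()
print(answer)  # 42

# re.sub replaces every match of the pattern.
print(re.sub(r'\d', '#', 'tel: 123-456'))  # tel: ###-###
```

The greedy/non-greedy difference is exactly why `(.*?)` shows up so often in crawler code: it extracts one field per HTML tag instead of gobbling the rest of the page.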
2. urllib and urllib2
The urllib and urllib2 libraries are the most basic tools for learning Python crawlers. With them we can fetch the content of a web page and then, combined with regular expressions, extract and analyze that content to get exactly the results we want.
Here, urllib and urllib2 are combined with a regular expression to crawl the author, content, and upvote count of posts on Qiushibaike (the "Embarrassing Things Encyclopedia", a Chinese joke site).
Code: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
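The gist of such a script can be sketched as below. The page snippet and class names are a simplified stand-in for Qiushibaike's real markup, not copied from it, and the import falls back to `urllib.request` so the sketch also runs on Python 3. The parsing is kept separate from the download so it can be tried offline:

```python
import re

try:                      # Python 2, as used in the original post
    from urllib2 import Request, urlopen
except ImportError:       # Python 3 fallback
    from urllib.request import Request, urlopen

# Simplified stand-in for one Qiushibaike post; the real markup differs.
SAMPLE = '''
<div class="author">somebody</div>
<div class="content">a short joke...</div>
<i class="number">1024</i>
'''

def parse_posts(html):
    """Pull (author, content, upvotes) triples out of a page with one regex."""
    pattern = re.compile(
        r'<div class="author">(.*?)</div>.*?'
        r'<div class="content">(.*?)</div>.*?'
        r'<i class="number">(\d+)</i>',
        re.S)  # re.S so .*? can cross line breaks between the fields
    return pattern.findall(html)

def fetch(url):
    """Download a page; a User-Agent header avoids naive bot blocking."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read().decode('utf-8')

print(parse_posts(SAMPLE))
# [('somebody', 'a short joke...', '1024')]
```

In a real run you would call `parse_posts(fetch(page_url))` for each page of the site.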
3. BeautifulSoup
BeautifulSoup is a Python library whose main purpose is to extract data from web pages. The official introduction goes like this:
Beautiful Soup provides a few simple, Python-style functions for navigating, searching, and modifying a parse tree. It is a toolkit that extracts the data you need by parsing the document; because it is simple, a complete application doesn't require much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.
Beautiful Soup works alongside parsers such as lxml and html5lib, giving users the flexibility to choose different parsing strategies or trade them for speed.
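A minimal sketch of the basic BeautifulSoup workflow (build the tree, navigate to one node, search for many). The HTML snippet and attribute names here are invented for illustration, and the built-in `html.parser` is used so no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<html><body>
  <h1 id="title">Python</h1>
  <a href="/item/history" class="inner">History</a>
  <a href="/item/syntax" class="inner">Syntax</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')  # build the parse tree

title = soup.find('h1', id='title').get_text()          # navigate to one node
links = [(a['href'], a.get_text())                      # search for many nodes
         for a in soup.find_all('a', class_='inner')]

print(title)  # Python
print(links)  # [('/item/history', 'History'), ('/item/syntax', 'Syntax')]
```

Compared with raw regular expressions, the same extraction reads as tag and attribute lookups, which survives small changes in page layout much better.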
First example: crawl 100 pages related to the Baidu Baike "Python" entry; the number of pages to crawl is set by yourself.
Code: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
Run result:
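The crawl loop behind a spider like this can be sketched roughly as follows. This is not the repository's code, just a common pattern: keep a queue of new URLs and a set of already-seen ones, and stop after a page budget. The fetcher is passed in as a function, so the loop can be exercised offline against a tiny in-memory "site":

```python
import re

def extract_links(html):
    """Collect href values; a real spider would also filter and absolutize them."""
    return re.findall(r'href="(.*?)"', html)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed, visiting at most max_pages URLs."""
    queue, seen, pages = [seed], {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html):
            if link not in seen:     # skip URLs we already queued
                seen.add(link)
                queue.append(link)
    return pages

# Tiny in-memory "site" standing in for Baidu Baike, so the loop runs offline.
site = {
    '/item/Python': '<a href="/item/Guido"></a><a href="/item/CPython"></a>',
    '/item/Guido': '<a href="/item/Python"></a>',
    '/item/CPython': '',
}
pages = crawl('/item/Python', fetch=site.get, max_pages=100)
print(sorted(pages))  # ['/item/CPython', '/item/Guido', '/item/Python']
```

With a real downloader in place of `site.get` and BeautifulSoup in place of the one-line regex, this is the skeleton of a fixed-budget entry crawler.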
Second example: continuing from the above, fetch a list of books from Douban by book tag, again using BeautifulSoup.
Code: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
Run result:
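A hedged sketch of extracting a book list from such a tag page. The snippet below imitates Douban's list markup, but the class names (`subject-item`, `rating_nums`) and layout are assumptions for illustration, not guaranteed to match the live site:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Simplified stand-in for one entry of a Douban tag page; the real
# class names and layout are assumptions, not copied from the site.
SAMPLE = '''
<li class="subject-item">
  <h2><a href="https://book.douban.com/subject/1/" title="Some Book">Some Book</a></h2>
  <span class="rating_nums">8.9</span>
</li>
'''

def parse_books(html):
    """Return (title, url, rating) triples for every book on a tag page."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.find_all('li', class_='subject-item'):
        a = item.h2.a                                   # title link inside <h2>
        rating = item.find('span', class_='rating_nums')
        books.append((a['title'], a['href'],
                      rating.get_text() if rating else None))
    return books

print(parse_books(SAMPLE))
# [('Some Book', 'https://book.douban.com/subject/1/', '8.9')]
```

Looping this parser over each page of a tag yields the full book list for that tag.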
That's what I studied today. Crawlers really are interesting! Tomorrow I'll continue with Scrapy.