A summary of writing simple crawlers in Python.
Crawlers are really interesting. I have previously written simple crawlers with urllib2 and BeautifulSoup, and I have also tried Scrapy. Lately I want to learn crawlers more thoroughly, so I am recording as much as I can. This post is simply a record of what I studied today.
1. Regular expressions
Regular expressions are a powerful tool. The following syntax rules are commonly used in crawlers:
. | Matches any character (except a newline)
* | Matches the preceding character 0 or more times
? | Matches the preceding character 0 or 1 time
.* | Greedy match
.*? | Non-greedy match
(.*?) | Captures and outputs the matched text inside the parentheses
\d | Matches a digit
re.S | Makes . match newlines as well
Common re methods include findall(), search(), and sub().
To practice the syntax above, see the code: https://github.com/Ben0825/Crawler/blob/master/re_test.py
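A minimal sketch of these patterns in action (Python 2, to match the libraries used in this post; the HTML string is made up for illustration):

```python
import re

html = '<div class="post">Hello</div>\n<div class="post">World</div>'

# Non-greedy capture: (.*?) grabs only the text inside each div,
# and re.S lets . cross the newline between the two divs
posts = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
print posts  # ['Hello', 'World']

# search() returns the first match; sub() replaces every match
m = re.search(r'\d+', 'page 42 of 100')
print m.group()                               # '42'
print re.sub(r'\d+', 'N', 'page 42 of 100')   # 'page N of N'
```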
2. urllib and urllib2
The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers (in Python 3 they were merged into urllib.request). With them we can fetch the content of a web page, then extract and analyze it with regular expressions to get the results we expect.
Here, urllib2 is combined with regular expressions to crawl posts from Qiushibaike.
For the code, see: https://github.com/Ben0825/Crawler/blob/master/qiubai_test.py
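The general shape of that script is roughly as follows (a sketch only: the URL, headers, and regex here are illustrative, not the exact ones in the linked code, and the site's markup may have changed):

```python
# -*- coding: utf-8 -*-
import re
import urllib2

url = 'http://www.qiushibaike.com/hot/page/1'   # illustrative page URL
headers = {'User-Agent': 'Mozilla/5.0'}          # some sites reject urllib2's default UA

request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()

# Illustrative pattern: non-greedily capture the text of each post
pattern = re.compile(r'<div class="content">(.*?)</div>', re.S)
for item in pattern.findall(html):
    print item.strip()
```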
3. BeautifulSoup
BeautifulSoup is a Python library whose main purpose is to extract data from web pages. The official introduction is as follows:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses documents and hands users the data they want to extract. Because it is simple, a complete application can be written without much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case Beautiful Soup cannot detect the encoding automatically and you just need to state the original encoding.
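For example (a minimal illustration using BS4's from_encoding parameter; the byte string is GBK-encoded sample text standing in for a page with no declared encoding):

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# GBK bytes for a short page that declares no encoding
raw = '<p>\xc4\xe3\xba\xc3</p>'

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
print soup.original_encoding              # 'gbk'
print soup.p.get_text().encode('utf-8')   # the decoded text, re-encoded as UTF-8
```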
Working with parsers such as lxml and html5lib, Beautiful Soup flexibly offers users different parsing strategies or strong speed.
First, crawl 100 related pages under the Python entry of Baidu Baike (the Baidu encyclopedia); the number of pages is a configurable value.
For the code, see: https://github.com/Ben0825/Crawler/tree/master/python_baike_Spider
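In outline, the spider does something like this (a rough sketch, not the linked repo's exact structure; the entry URL and the /view/ link pattern are assumptions about Baidu Baike's old layout):

```python
# -*- coding: utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup

# Illustrative root URL for the Python entry; the URL scheme may have changed
root_url = 'http://baike.baidu.com/view/21087.htm'
to_crawl, seen, count = [root_url], set(), 0

while to_crawl and count < 100:          # stop after 100 pages
    url = to_crawl.pop()
    if url in seen:
        continue
    seen.add(url)
    count += 1

    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Print the entry title, if the page has one
    title = soup.find('h1')
    if title is not None:
        print count, title.get_text().encode('utf-8')

    # Queue links that look like other encyclopedia entries
    for a in soup.find_all('a', href=re.compile(r'/view/\d+\.htm')):
        to_crawl.append('http://baike.baidu.com' + a['href'])
```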
To consolidate the above, use BeautifulSoup to crawl books from Douban by book tag.
For the code, see: https://github.com/Ben0825/Crawler/blob/master/doubanTag.py
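The idea looks roughly like this (a sketch under assumptions: the tag URL and the h2/a selector are guesses about Douban's page structure, not taken from the linked script):

```python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

tag = 'python'   # illustrative tag
url = 'https://book.douban.com/tag/' + tag   # Douban's tag page; layout may have changed

request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(request).read(), 'html.parser')

# Assumed selector: book titles sit in <a> tags inside <h2> elements
for h2 in soup.find_all('h2'):
    a = h2.find('a')
    if a is not None:
        print a.get_text(strip=True).encode('utf-8')
```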
That is what I learned today. Crawlers are really interesting; I will continue with Scrapy tomorrow!