Douban Reading crawler (requests + re)

I have been sorting out some crawler material, and today I wrote up a small example, so please bear with its modest scope. It covers crawling and organizing the basic information of the books on Douban Reading, so that we can quickly get the gist of each book.

I. Crawling Information

Before crawling a web page, first check whether the site places any restrictions on crawlers. You can do this by viewing the site's robots protocol: append "/robots.txt" to the site's root URL. For this site, the result is:

User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Sitemap: http://www.douban.com/sitemap_index.xml
Sitemap: http://www.douban.com/sitemap_updated_index.xml

User-agent: Wandoujia Spider
Disallow: /
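You can also let Python answer the same question programmatically. Here is a minimal sketch using the standard library's urllib.robotparser; the URL and the expected outputs are my assumptions based on the robots.txt shown above:

import urllib.robotparser

# Fetch and parse the site's robots.txt, then ask whether a generic crawler may visit a path
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://book.douban.com/"))        # expected True: the front page is not disallowed
print(rp.can_fetch("*", "https://book.douban.com/search"))  # expected False: /search is disallowed for all agents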

According to the protocol above, ordinary crawlers are not banned from the pages we want, and we are only fetching a little data for our own use. So we can implement this crawler with the structure described in the previous article: import the libraries, apply the framework, pass in the address, and return the page content. That part was already covered in the earlier blog post, so it is not explained in detail here. With this, fetching the page is done; the rest is to pull the content we want out of it.

import requests

url = "https://book.douban.com/"

def getHtmlText(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()                     # raise on HTTP error status codes
        response.encoding = response.apparent_encoding  # guess the real encoding from the content
        return response.text
    except requests.RequestException:                   # a bare except would also swallow unrelated errors
        print("Fail")
        return None

html = getHtmlText(url)
II. Information Processing

The page source fetched above contains many things we do not need, such as the markup for the page's various layout frames. The useful information is extracted with regular expressions. If you run a pattern directly over the entire page, there will inevitably be coincidental matches, so the content you actually want is polluted by other content that happens to fit the pattern. Therefore, first cut out the key block, and then extract the specific fields from it.

import re

# re.S lets "." also match line breaks (this is covered in the earlier regular-expression post)
re_books = re.compile('<ul class="list-col list-col5 list-express slide-item">(.*?)</ul>', re.S)
content = re_books.search(html)
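One detail worth adding (not from the original code): re.search returns None when nothing matches, and the later code indexes the match as content[0], so a guard avoids a confusing TypeError if Douban changes its layout. A standalone sketch, with a stand-in page string so it runs on its own:

import re

# Stand-in for the real fetched page, just to make this snippet self-contained
html = '<ul class="list-col list-col5 list-express slide-item"><li class="">...</li></ul>'

re_books = re.compile('<ul class="list-col list-col5 list-express slide-item">(.*?)</ul>', re.S)
content = re_books.search(html)
if content is None:
    raise SystemExit("book-list block not found; the page layout may have changed")
print(content[0])  # match[0] is the whole match, equivalent to content.group(0) (Python 3.6+)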

Inspect the page source, work out matching rules that retrieve the main block, and grab everything inside it. The rest is to extract each field of every book with regular expressions: observe the patterns in the markup and write rules to match them. Each piece of information sits in its own tag, and pandas is used to organize the data. pandas's DataFrame type conveniently stores a two-dimensional structure, and pandas can also write the data out in Excel format (DataFrame.to_excel()).

import pandas as pd  # the usual alias: "pandas" is long and used constantly in data processing

# Create an empty DataFrame first; each book's information is then wrapped in
# DataFrame format and spliced onto the end as we traverse the list.
data = pd.DataFrame(columns=['title', 'author', 'abstract', 'href', 'publisher'])

re_book = re.compile('<li class="">(.*?)</li>', re.S)
booklist = re_book.findall(content[0])  # findall locates every book block and returns them as a list
for book in booklist:
    # .*? matches in non-greedy mode; () is a group, which makes the captured text easy to retrieve
    href = re.search('href="(.*?)"', book).group(1)
    title = re.search('<h4 class="title">(.*?)</h4>', book, re.S)
    title = title.group(1).split()[0]
    author = re.search('<span class="author">(.*?)</span>', book, re.S)
    author = ''.join(author.group(1).split())
    publisher = re.search('<span class="publisher">(.*?)</span>', book, re.S)
    publisher = ''.join(publisher.group(1).split())
    abstract = re.search('<p class="abstract">(.*?)</p>', book, re.S)
    abstract = ''.join(abstract.group(1).split())
    # The first book's abstract starts with the words "[Content Overview]", which is
    # not very good-looking, so strip that prefix with re.sub (the brackets must be escaped)
    abstract = re.sub(r'^\[Content Overview\]', '', abstract)
    new = pd.DataFrame({"title": title, "author": author, "abstract": abstract,
                        "href": href, "publisher": publisher}, index=["0"])
    data = data.append(new, ignore_index=True)  # works on older pandas; see the note below

data.to_excel('bookinfo.xls', encoding='utf-8')
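A caveat for anyone running this today: DataFrame.append was removed in pandas 2.0, and to_excel no longer accepts an encoding argument. The usual replacement is to collect one dict per book in a list and build the DataFrame once. A minimal standalone sketch, assuming pandas >= 2.0 with openpyxl installed (the row values here are placeholders for illustration):

import pandas as pd

rows = []  # inside the crawl loop you would append one dict per book
rows.append({"title": "t", "author": "a", "abstract": "s",
             "href": "h", "publisher": "p"})  # placeholder values
data = pd.DataFrame(rows, columns=['title', 'author', 'abstract', 'href', 'publisher'])
data.to_excel('bookinfo.xlsx', index=False)  # .xlsx via openpyxl; the encoding argument is gone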

Let's take a look at the results. Pandas's direct output results are also quite regular. They are stored in Excel and saved in CSV files at first, but they are garbled, later I changed to excel without thinking too much about it. Later I went to see what was going on, or some readers could clearly teach the bloggers.

Some of the fields are not shown in the figure, but you get the idea; try it yourself. Of course, this crawler is very simple, and the front page only yields this much data. You can try to dig for something deeper; the principle is much the same. When learning crawlers, keep noticing things that could be mined and work through them bit by bit.

 
