Crawling novels from Zhulang with Python

Source: Internet
Author: User
This article describes how to crawl novels from the Zhulang site with Python, and shares the approach for your reference. The details are as follows:

I like reading novels online and have long used a novel-download reader that automatically fetches novels from the Internet so I can read them locally, which is convenient. While studying Python crawlers recently, I was inspired to write a novel-crawling script of my own for fun. By analyzing the page source on Zhulang and working out its structural features, I put together a script that crawls novel content from that site.

The script works as follows: after you enter the URL of a novel's catalog page, it automatically parses the catalog, extracting each chapter's name and link address. The chapter content is then fetched from each chapter link. At this stage it only starts from the first chapter, extracts one chapter at a time, and waits for you to press Enter before extracting the next chapter. Other sites have different page structures and would require some changes; the script was tested on Zhulang and works correctly there.
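The catalog-filtering step described above can be sketched in a small, self-contained way: keep only the links whose text looks like a chapter title (第...章). This is a Python 3 illustration using only the standard library's `html.parser`; the sample HTML snippet below is made up for demonstration, while the actual script uses BeautifulSoup on the real catalog page.

```python
# -*- coding: utf-8 -*-
# Sketch of the chapter-link filtering idea (Python 3, stdlib only).
# The sample HTML is invented for illustration purposes.
import re
from html.parser import HTMLParser

class ChapterLinkParser(HTMLParser):
    """Collect [chapter title, href] pairs from <a target="_blank"> tags."""
    CHAPTER_RE = re.compile(u'\u7b2c.+\u7ae0')  # matches 第...章

    def __init__(self):
        super().__init__()
        self._href = None
        self.menu = []  # list of [chapter name, link address]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('target') == '_blank':
            self._href = attrs.get('href')

    def handle_data(self, data):
        # Only keep anchor text that looks like a chapter title
        if self._href and self.CHAPTER_RE.search(data):
            self.menu.append([data.strip(), self._href])

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None

sample = u'''
<a target="_blank" href="/c/1.html">第一章 起点</a>
<a target="_blank" href="/about.html">关于我们</a>
<a target="_blank" href="/c/2.html">第二章 出发</a>
'''
parser = ChapterLinkParser()
parser.feed(sample)
print(parser.menu)  # only the two chapter links survive the filter
```

The regex filter is what lets the script ignore navigation links ("关于我们" above) that share the same `target="_blank"` attribute as the chapter links.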

I am sharing this code partly to keep a record for my own later review, and partly as a starting point, in the hope that more experienced readers will generously offer their advice.

```python
#!/usr/bin/python
# -*- coding: utf8 -*-
# python:      2.7.8
# platform:    windows
# program:     get novels from Internet
# author:      wucl
# description: get novels
# version:     1.0
# history:     2015.5.27 completed catalog and URL extraction
#              2015.5.28 completed chapter-link extraction and download,
#              starting from the first chapter; tested OK on Zhulang

from bs4 import BeautifulSoup
import urllib2, re

def get_menu(url):
    """Get chapter names and their URLs."""
    user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) "
                  "Gecko/20100101 Firefox/39.0")
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    novel = soup.find_all('title')[0].text.split('_')[0]  # extract the novel name
    menu = []
    all_text = soup.find_all('a', target="_blank")  # elements holding chapter names and links
    regex = re.compile(ur'\u7b2c.+\u7ae0')  # match Chinese 第...章 to drop unrelated links
    for title in all_text:
        if re.findall(regex, title.text):
            name = title.text
            x = [name, title['href']]
            menu.append(x)  # append [chapter name, link address] to the list
    return menu, novel

def get_chapter(name, url):
    """Get the content of one chapter from its link."""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    content = soup.find_all('p')  # extract the body of the chapter
    return content[0].text

if __name__ == "__main__":
    url = raw_input("""Input the main page's URL of the novel in Zhulang
Then press Enter to continue
""")
    if url:
        menu, title = get_menu(url)
        print title, str(len(menu)) + '\nPress Enter to continue\n'  # novel name and chapter count
        for i in menu:
            chapter = get_chapter(i[0], i[1])
            raw_input()
            print '\n' + i[0] + '\n'
            print chapter
            print '\n'
```
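The script above targets Python 2.7, as its header notes. For readers on Python 3, here is a hedged porting sketch of just the request-building step: `urllib2` became `urllib.request`, `raw_input` became `input`, and `print` became a function. The URL below is only an example placeholder; building the `Request` object does not touch the network, so the header handling can be checked offline.

```python
# Python 3 equivalents of the Python 2 APIs used in the script above
# (a porting sketch, not the original author's code):
#   urllib2.Request / urllib2.urlopen  ->  urllib.request.Request / urlopen
#   raw_input                          ->  input
#   print statement                    ->  print() function
from urllib.request import Request, urlopen  # urlopen(req) would fetch the page

user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) "
              "Gecko/20100101 Firefox/39.0")
# Example placeholder URL; substitute the novel's real catalog page.
req = Request("http://www.zhulang.com/", headers={"User-Agent": user_agent})

# urllib.request stores header keys capitalized ("User-agent"):
print(req.get_header("User-agent"))
```

Note that `urllib.request` normalizes header keys with `str.capitalize()`, so the stored key is `User-agent` rather than `User-Agent`.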

Hopefully this article will help you with Python programming.
