Crawling Biquge Novels with Python

Source: Internet
Author: User
Tags: for, in, range

Recently I have been learning Python, and I find web crawlers a lot of fun. Today I am going to crawl a novel I have read at least three times, "Sword Snow Stride" (雪中悍刀行) by Feng Huo Xi Zhu Hou. He is a very talented writer with a lot of fans, but many of his novels have been left unfinished, which has earned him a bit of a reputation among readers.

The site I am going to crawl is Xinbiquge, a pirate site that legitimate readers rightly look down on; but someone like me, who does not want to pay to read the novel, is in no position to criticize it. The site updates serialized novels fairly quickly, and its content is identical to the original. Okay, no more nonsense, let's start with the code:

I first tried to crawl the novel's content with the requests library, but I only got the first few words at the beginning of a chapter. Then I thought of Selenium, the automated testing tool I had just learned. Selenium can drive a browser to perform specific actions such as clicking and scrolling, and it can also return the page source that the browser is currently rendering, which means you can crawl everything that actually appears in front of you.
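For reference, here is a minimal sketch of that kind of requests-based attempt (the chapter URL and the #content selector are taken from the Selenium code later in this article, so treat them as assumptions); it only returns whatever is present in the raw HTML, not anything the browser fills in afterwards:

import requests
from bs4 import BeautifulSoup

# plain-HTTP request: returns the raw HTML only
resp = requests.get('https://www.xxbiquge.com/0_807/4055527.html')
# let requests guess the page encoding instead of assuming UTF-8
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, 'lxml')
# the chapter body is assumed to sit in the element with id="content"
content = soup.select_one('#content')
print(content.get_text() if content is not None else 'no content found')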

First we open the homepage of the novel we want to crawl, url = 'https://www.xxbiquge.com/0_807/'. This page lists every chapter of the novel, so the first thing we want is the total number of chapters. The code is as follows:

def page_num():
    """Analyze the novel's homepage and get the total number of chapters.
    :return: number of chapters
    """
    # URL of the target novel's homepage; to crawl another novel, just change this URL
    url = 'https://www.xxbiquge.com/0_807/'
    browser = webdriver.Chrome()
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    # every chapter link on the homepage sits inside a <dd> tag
    dd = soup.find_all(name='dd')
    page = len(dd)
    browser.close()
    return page

Next we analyze a single chapter of the novel, for example the first chapter, whose URL is 'https://www.xxbiquge.com/0_807/4055527.html'. Getting the source of the first chapter is easy; the key is how to get the content of the next chapter. Selenium can simulate the user clicking the "next chapter" link and jump to the next chapter's page. The code is as follows:

def index_page(i):
    """Load the content of one chapter of the novel.
    :param i: the chapter number
    """
    if i == 1:
        # URL of the first chapter; replace it with the first chapter of the novel you want to crawl
        url = 'https://www.xxbiquge.com/0_807/4055527.html'
        browser.get(url)
    # wait until the content node has loaded
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#content')))
    # call get_info() to parse the current page
    get_info()
    # find the "next chapter" link (the third link in the bottom navigation bar)
    next_p = browser.find_elements(By.XPATH, '//div[@class="bottem2"]/a')[2]
    # click it to jump to the next chapter
    next_p.click()

The third step is to extract the content of each chapter. The code is as follows:

def get_info():
    """Extract the chapter title and body text of each chapter.
    :return:
    """
    # find the chapter title
    name = browser.find_element_by_css_selector('#wrapper > div.content_read > div > div.bookname > h1').text
    print(name)
    # find the body text of the chapter
    content = browser.find_element_by_id('content').text
    print(content)
    # append the chapter title and the corresponding body text to a txt file
    with open('Sword Snow Stride.txt', 'a', encoding='utf-8') as f:
        f.write('\n'.join([name, content]))
        f.write('\n')

The fourth step is to traverse every chapter:

def main():
    """Traverse all chapters of the novel.
    :return:
    """
    page = page_num()
    print(page)
    for i in range(1, page + 1):
        index_page(i)

The final step is to run the program:

  

if __name__ == '__main__':
    main()
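Note that index_page() and get_info() use a browser and a wait object that are never created inside them, so the script also needs some module-level setup. A minimal sketch of what those functions assume (the 10-second timeout is my own choice, not from the original article):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# shared browser instance used by index_page() and get_info()
browser = webdriver.Chrome()
# explicit wait used to make sure the chapter text has loaded (assumed timeout)
wait = WebDriverWait(browser, 10)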

Then you can make a cup of tea and wait for the novel to finish downloading. The complete code is at: https://github.com/luoyunqa/Test/tree/master/biquge_novel
