Writing Python crawlers from scratch --- 1.7 crawler practice: bulk-downloading novels from a ranking list

Source: Internet
Author: Ehco

Originally I only planned to write a crawler that scrapes novel names from the ranking list, but then I thought: why not crawl the contents of the novels as well? So I wrote this crawler, which downloads every chapter of every novel in the rankings and saves it locally. Come to think of it, isn't this roughly how all those pirated novel readers work?

Target analysis:

First, take a look at the address of our ranking list:

http://www.qu.la/paihangbang/


Our goal is clear:

For each category, find the name of every novel and the link to its page. Let's look at the structure of the page:

It is easy to see that each category is wrapped inside:

<div class="index_toplist mright mbottom">

A site organized this clearly makes our crawler much easier to write.

The novel title and link:

Let's look inside the div we just found:

<div class= "index_toplist mright mbottom" > <div class= "toptab" id= "Top_all_1" > <span> Fantasy Fantasy Ranking </ span><div> <div class= "topbooks" id= "con_o1g_1" style= "Display:block"; > <ul> <li><span class= "hits" >05-06</span><span class= "num" >1.</span><a
                        
                        href= "/book/168/" title= "Choice Day" target= "_blank" > choose the day to remember </a></li> <li><span class= "hits" >05-06</span><span class= "num" >2.</span><a href= "/book/176"
                       
                        /"title=" target= "_blank" > Big dominate </a></li> <!--omitted a lot of--> <li><span class= "hits" >05-06</span><span class= "num" &
                        
                        Gt;3.</span><a href= "/book/4140/" title= "Swire King" target= "_blank" > Swire King </a></li> <li><span class= "hits" >05-06</span><spAn class= "num" >4.</span><a href= "/book/5094/" title= "Snow Eagle Lord" target= "_blank" > Snow Eagle Lord </a></li > <li><span class= "hits" >05-01</span><span class= "num" >15.</span> <a href= "/book/365/" title= "Martial Universe" target= "_blank" > Martial universe </a></li> </ul><        
 /div>

We can see that all the novels sit in a plain list with a clear, uniform structure:

Title: title = div.a['title']
Link: link = 'http://www.qu.la' + div.a['href']
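
As a quick illustration, here is a minimal sketch of how these two expressions (applied to each list item) can be combined with requests and BeautifulSoup to print every novel on the ranking page. The variable names and the loop structure are my own; they are not the final code from this post:

import requests
import bs4

html = requests.get('http://www.qu.la/paihangbang/', timeout=30).text
soup = bs4.BeautifulSoup(html, 'lxml')

# Each category block is a div with class "index_toplist mright mbottom"
for category in soup.find_all('div', class_='index_toplist mright mbottom'):
    # Every <li> inside the block holds one novel's <a> tag
    for li in category.find_all('li'):
        title = li.a['title']
        link = 'http://www.qu.la' + li.a['href']
        print(title, link)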

So all we need to do is find every novel link on the current page and save it in a list.

Deduplicating the list:

Careful readers will notice that the same novel can appear in the rankings of several different categories.
Crawling it more than once wastes resources, especially when there are many pages to crawl.
So how do we deduplicate the list of URLs we have collected?
Someone who has just started learning Python might write a loop to filter out the duplicates,
but the beauty of Python is that many problems can be solved elegantly; here a single line of code is enough:

url_list = list(set(url_list))


Here we pass the list through the set constructor and back through list(), which guarantees that the result contains no duplicate elements. Isn't that simple?

Links to all chapters of a single novel:
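
One caveat worth knowing (not mentioned in the original post): converting through a set does not preserve the original order of the URLs. If the ranking order matters, an order-preserving one-liner can be used instead:

# Order-preserving deduplication: dict keys keep insertion order (Python 3.7+)
url_list = list(dict.fromkeys(url_list))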

First, let's pick one of the novel URLs we obtained earlier and experiment on it.
Take my favorite, Ze Tian Ji:

http://www.qu.la/book/168/


Again a very clear page structure, which deserves a thumbs-up:

<div class= "Box_con" >
<div id= "list" >
<dl><dt> "Optional Day" body </dt>
                    <dd> <a style= "" "href="/book/168/1748915.html "> Preface downhill </a></dd>
                    <dd> <a style=" "href="/book/ 168/1748916.html "> First chapter I changed my mind </a></dd>
                    <dd> <a style=" "href="/book/168/1748917.html " > The second chapter why </a></dd>
                    <!--the middle section omitted-->
                    <dd> <a style= "href="/book/168/1748924. HTML > Chapter Nineth did I do something wrong??</a></dd>
   </div>                 

We can easily pull out the link to each chapter.
(The line below is an excerpt and cannot be run on its own; the complete code comes later.)

link = 'http://www.qu.la' + url.a['href']


OK, so now we can crawl the links to all the chapters of a single novel.
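
Because the code listing at the end of this post is cut off before it reaches this step, here is a minimal sketch of how the chapter links could be collected, assuming the <div id="list"> structure shown above. The function name and the print example are my own additions, not the original author's code:

import requests
import bs4

def get_chapter_urls(book_url):
    # Fetch the novel's table-of-contents page and collect every chapter link
    html = requests.get(book_url, timeout=30).text
    soup = bs4.BeautifulSoup(html, 'lxml')
    chapter_urls = []
    # All chapters sit in <dd><a href="..."> tags inside <div id="list">
    for dd in soup.find('div', id='list').find_all('dd'):
        chapter_urls.append('http://www.qu.la' + dd.a['href'])
    return chapter_urls

# Example: all chapter links for Ze Tian Ji
print(get_chapter_urls('http://www.qu.la/book/168/'))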
The last step: crawling the chapter content:

First, open a chapter and look at its source code:

We find that the entire body text of the chapter is stored in:

<div id="content">


The chapter name is even simpler to find.

With the bs4 library we can locate each of these tags very easily.
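
The original code for this step is also missing from this copy of the post, so here is a hedged sketch of how a single chapter could be fetched and saved, assuming the body text lives in <div id="content"> as shown above. The function name, the output file, and the assumption that the chapter name sits in the page's first <h1> tag are my own:

import requests
import bs4

def save_chapter(chapter_url, filename='novel.txt'):
    # Fetch one chapter page and append its title and body text to a file
    html = requests.get(chapter_url, timeout=30).text
    soup = bs4.BeautifulSoup(html, 'lxml')
    # Assumption: the chapter name is the page's first <h1> tag
    title = soup.find('h1').get_text()
    # The body text is stored in <div id="content">
    content = soup.find('div', id='content').get_text()
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(title + '\n' + content + '\n\n')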
OK, let's look at the implementation of the code:

Modular, function-based programming is a very good habit. We insist on wrapping each independent piece of functionality in its own function; this keeps the code simple and reusable.

Fetching a web page:

import requests

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # I checked the site's encoding manually and set it explicitly,
        # which helps improve efficiency
        r.encoding = 'utf-8'
        return r.text
    except:
        return 'Something went wrong!'
Getting the list of novels and their links:
import bs4

def get_content(url):
    """
    Crawl the ranking list of each novel category and write it to a file in order.
    The file content is: novel name + novel link.
    The links are also saved to a list, and that list of URLs is returned.
    """
    url_list = []
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, 'lxml')

    # Because of the page layout, the "history" and "finished" rankings
    # are not in the same kind of div as the other categories
    category_list = soup.find_all('div', class_='index_toplist mright mbottom')
    history_finished_list = soup.find_all('div', class_='index_toplist mbottom')

    for cate in category_list:
        name = cate.find('div', class_='toptab').span.string
        with open('novel_list.csv', 'a+') as f:
            f.write('\nNovel category: {}\n'.format(name))
        # We locate the full book list directly through its style attribute
        general_list = cate.find(style='display:block;')
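        # --- The original post is cut off above this line. ---
        # The following continuation is an assumed reconstruction, not the
        # original author's code: walk the <li> entries of general_list,
        # write each novel's name and link to the CSV, and collect the URLs.
        book_list = general_list.find_all('li')
        for book in book_list:
            link = 'http://www.qu.la' + book.a['href']
            title = book.a['title']
            url_list.append(link)
            with open('novel_list.csv', 'a+') as f:
                f.write('{}:{}\n'.format(title, link))
    return url_list

# Example driver (also an assumption): crawl the ranking page, then deduplicate
url_list = list(set(get_content('http://www.qu.la/paihangbang/')))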
