Data Mining: Multi-threaded Crawling


In this article, we introduce how to fetch data with multiple threads.

Python threads run concurrently, but it is important to note that, because of the GIL, Python's multithreaded code executes on only one core at a time, even on a multi-core machine. Crawling, however, is I/O-bound: threads spend most of their time waiting for network responses, and that waiting can overlap, so multi-threaded fetching can still greatly improve crawl efficiency.
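A toy illustration of why this works for I/O-bound tasks (a minimal sketch of my own, using time.sleep to stand in for waiting on the network; it is not part of the article's crawl code):

# coding=utf-8
import threading
import time

def fake_request():
    # Stand-in for a network request: the GIL is released while sleeping,
    # just as it is while a thread waits on socket I/O.
    time.sleep(1)

start = time.time()
threads = [threading.Thread(target=fake_request) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly 1 second instead of 3, because the three waits overlap.
print("3 simulated requests took", round(time.time() - start, 2), "seconds")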

Here we use the requests library to demonstrate a multi-threaded crawl, and then compare it with a single-threaded version to see the efficiency gain.

This time I will not run the test against my own website, because it does not yet have enough content to show the advantage of multithreading.

Instead, we use Dangdang for the multi-threaded example, crawling its search results to demonstrate the implementation. The search URL looks like this:

http://search.dangdang.com/?key=Python&act=input&page_index=1

You can see that key is the search keyword, act indicates how the search was made, and page_index is the page number of the results.
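As a quick illustration of these parameters, the same URL can also be built by handing them to requests directly (a minimal sketch; the crawl code later in this article concatenates the URL strings by hand instead):

# coding=utf-8
import requests

# requests assembles "?key=Python&act=input&page_index=1" from this dict
params = {"key": "Python", "act": "input", "page_index": 1}
resp = requests.get("http://search.dangdang.com/", params=params)
print(resp.url)          # the full search URL that was actually requested
print(resp.status_code)  # 200 if the page was fetched successfully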

After crawling these pages, we extract the information in them and save it to a text file; the file holds the title of each book along with its link.

First we define the functions needed for the crawl:

# coding=utf-8
__author__ = "susmote"

import requests
from bs4 import BeautifulSoup


def format_str(s):
    # Strip newlines, spaces and tabs from the extracted text
    return s.replace("\n", "").replace(" ", "").replace("\t", "")


def get_urls_in_pages(from_page_num, to_page_num):
    urls = []
    search_word = "Python"
    url_part_1 = "http://search.dangdang.com/?key="
    url_part_2 = "&act=input"
    url_part_3 = "&page_index="
    # Build one search URL per page in the requested range
    for i in range(from_page_num, to_page_num + 1):
        urls.append(url_part_1 + search_word + url_part_2 + url_part_3 + str(i))

    all_href_list = []
    for url in urls:
        print(url)
        resp = requests.get(url)
        bs = BeautifulSoup(resp.text, "lxml")
        a_list = bs.find_all("a")
        needed_list = []
        for a in a_list:
            if 'name' in a.attrs:
                name_val = a['name']
                href_val = a['href']
                title = a.text
                # Book links are marked with name="itemlist-title"
                if 'itemlist-title' in name_val and title != "":
                    if [format_str(title), format_str(href_val)] not in needed_list:
                        needed_list.append([format_str(title), format_str(href_val)])
        all_href_list += needed_list

    # Save the results as "title<TAB>link" lines, e.g. 1_10_all_hrefs.txt
    all_href_file = open(str(from_page_num) + '_' + str(to_page_num) + '_all_hrefs.txt', 'w', encoding='utf-8')
    for href in all_href_list:
        all_href_file.write('\t'.join(href) + '\n')
    all_href_file.close()
    print(from_page_num, to_page_num, len(all_href_list))

Let's explain the code here.

First, format_str is used to strip extraneous whitespace from the extracted text.
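For example (a tiny sketch repeating the definition from the code above):

def format_str(s):
    return s.replace("\n", "").replace(" ", "").replace("\t", "")

print(format_str("  Python \n Crawler\t"))   # -> "PythonCrawler"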

The get_urls_in_pages function is the main body of the crawl. Its two parameters give the range of page numbers to fetch. Inside the function, urls stores all the page URLs to be crawled, built from those two parameters. I split the URL into three parts simply to make assembling the links more convenient, and the first for loop does that splicing. I won't explain it further here; if anything is unclear, please leave a comment.

Next we define the list all_href_list, which stores the book information collected from every page. It is in fact a nested list whose elements have the form [title, link], like this:

all_href_list = [
    ['Title 1', 'Link 1'],
    ['Title 2', 'Link 2'],
    ['Title 3', 'Link 3'],
    ...
]

The next piece of code fetches and parses each page; it sits inside the for url in urls loop. We first print the link, then call requests.get to download the page, and then pass the returned HTML text to BeautifulSoup, which parses it into a structure we can query, named bs.

The needed_list defined next stores the titles and links found on one page. bs.find_all("a") extracts every link element from the page, and for a in a_list iterates over them; we had already inspected their structure in the browser beforehand.

Each book link carries a name attribute whose value contains "itemlist-title", which makes it easy to filter out the book elements. We then put the book title and the href of the link into the list together. Before appending, we also check that this [title, link] pair is not already in the list and that the title is not empty.
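Here is a minimal, self-contained sketch of that filtering step, run on a made-up HTML snippet rather than a real Dangdang page:

# coding=utf-8
from bs4 import BeautifulSoup

# Fake markup that mimics the structure described above
html = '''
<a name="itemlist-title" href="http://product.dangdang.com/123.html">Some Python Book</a>
<a name="something-else" href="http://example.com/other">Not a book</a>
'''

bs = BeautifulSoup(html, "lxml")
for a in bs.find_all("a"):
    # Keep only links whose name attribute marks them as book titles
    if 'name' in a.attrs and 'itemlist-title' in a['name'] and a.text != "":
        print([a.text, a['href']])   # -> ['Some Python Book', 'http://product.dangdang.com/123.html']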

After each page is processed, its results are added to all_href_list with the following line of code:

all_href_list += needed_list

Note that I am using the += operator here.
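For lists, += extends the left-hand list in place with the elements on the right (the same effect as list.extend), so the results of each page end up in one flat list rather than a list of lists; a quick sketch:

all_href_list = [['Title 1', 'Link 1']]
needed_list = [['Title 2', 'Link 2'], ['Title 3', 'Link 3']]

all_href_list += needed_list    # same as all_href_list.extend(needed_list)
print(len(all_href_list))       # -> 3: the new pairs were merged in, not nested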

Once all the links in the requested page range have been collected, they are written to a file; I won't explain that part in detail.

Our next step is to define the multi-threaded version. The keyword we are searching for returns 32 pages of results in total.

So here we use 3 threads to share the work, each thread handling 10 pages (pages 1 to 30 of the results); in the single-threaded case the same 30 pages are fetched one after the other.
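In the code below the three page ranges are simply hard-coded. If you would rather split an arbitrary range evenly across threads, a helper like the following works (a sketch of my own, not part of the original code; split_pages is a hypothetical name):

def split_pages(first_page, last_page, num_threads):
    # Split [first_page, last_page] into num_threads contiguous (start, end) ranges
    total = last_page - first_page + 1
    size, extra = divmod(total, num_threads)
    ranges, start = [], first_page
    for i in range(num_threads):
        end = start + size - 1 + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges

print(split_pages(1, 30, 3))   # -> [(1, 10), (11, 20), (21, 30)]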

Below is the code for the multi-threaded crawl:

# coding=utf-8
__author__ = "susmote"

import time
import threading
from mining_func import get_urls_in_pages


def multiple_threads_test():
    start_time = time.time()
    # The three page ranges, one per thread
    page_range_list = [
        (1, 10),
        (11, 20),
        (21, 30),
    ]

    # Create one thread per page range
    th_list = []
    for page_range in page_range_list:
        th = threading.Thread(target=get_urls_in_pages, args=(page_range[0], page_range[1]))
        th_list.append(th)

    for th in th_list:
        th.start()

    # Wait for every thread to finish before stopping the clock
    for th in th_list:
        th.join()

    end_time = time.time()
    print("Total use time 1:", end_time - start_time)
    return end_time - start_time

  

Briefly: to measure the running time we record a start time start_time and an end time end_time; the running time is simply the end time minus the start time.

Then we define the list page_range_list, which splits the page numbers into the three segments mentioned earlier.

Next we define th_list, a list holding all the thread objects. The following loop creates 3 Thread objects, each bound to a different page range, and appends them to that list.

In the loops that follow we call th.start() to launch each thread, and then call each thread's join() method so the function waits for all of these concurrently running threads to finish before it returns (otherwise the timing would be taken before the work was done).
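The same start/join pattern can also be written with the standard library's concurrent.futures.ThreadPoolExecutor (an alternative sketch, not what the code above uses):

# coding=utf-8
from concurrent.futures import ThreadPoolExecutor

from mining_func import get_urls_in_pages  # the crawl function defined earlier

page_range_list = [(1, 10), (11, 20), (21, 30)]

# Submit one task per page range; leaving the with-block waits for all of
# them to finish, which plays the same role as join() above.
with ThreadPoolExecutor(max_workers=3) as executor:
    for first, last in page_range_list:
        executor.submit(get_urls_in_pages, first, last)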

Now for the exciting part: testing the code.

We write the following test script:

# coding=utf-8
__author__ = "susmote"

from mining_threading import multiple_threads_test

if __name__ == "__main__":
    mt = multiple_threads_test()
    print('mt', mt)

  

To make the results more reliable, we ran the test three times and took the average time.

First run:  6.651 s
Second run: 6.876 s
Third run:  6.960 s

Average time: 6.829 s

Below is the single-threaded code:

# coding=utf-8
__author__ = "susmote"

import time
from mining_func import get_urls_in_pages


def single_test():
    start_time = time.time()
    # Fetch the same 30 pages, one after another, in a single thread
    get_urls_in_pages(1, 30)
    end_time = time.time()
    print("Total use:", end_time - start_time)
    return end_time - start_time

  

The calling script is as follows:

# coding=utf-8
__author__ = "susmote"

from single_mining import single_test

if __name__ == "__main__":
    st = single_test()

  

Run it from the command line. The three runs took:

First run:  10.138 s
Second run: 10.290 s
Third run:  10.087 s

Average time: 10.171 s

So multithreading really does improve crawl efficiency: 6.829 s on average versus 10.171 s, roughly a 1.5x speedup. Note that this is with a fairly small amount of data; with a larger volume the advantage of multithreading becomes even more obvious.

You can change the search keyword and the page range, or pick a different site to crawl (crawl speed also depends heavily on the site being fetched).

(Screenshots of the captured data were attached to the original post.)
