In this article, we introduce multi-threaded data fetching.
Multithreading runs concurrently, but note that because of the Global Interpreter Lock, Python threads execute on only one core at a time, even on a multi-core machine. However, since a crawler spends most of its time waiting on network I/O rather than computing, multi-threaded fetching can still greatly improve crawl efficiency.
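To see why overlapping waits helps even under the GIL, here is a minimal sketch (not from the article) in which time.sleep stands in for a blocking download; three one-second "requests" on three threads finish in about one second total, not three:

```python
import threading
import time

def fake_download(page):
    # time.sleep stands in for a blocking network request;
    # real socket I/O releases the GIL while waiting, just as sleep does.
    time.sleep(1)

start = time.time()
threads = [threading.Thread(target=fake_download, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("elapsed:", round(elapsed, 1))  # roughly 1 second, not 3: the waits overlap
```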
Here we take requests as an example to introduce multi-threaded crawling, and then, by comparing against a single-threaded version, show the efficiency gain from multithreading.
This time I will not use my own website for the test, because it does not yet have enough content to show the advantage of multithreading.
Instead, we use Dangdang to test our multi-threaded example, demonstrating the crawl on a set of search results. The search URL has the following form:
http://search.dangdang.com/?key=Python&act=input&page_index=1
You can see that key is the search keyword, act indicates how the search was made, and page_index is the page number of the results.
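The article builds this URL by string concatenation; as an aside, the standard library's urllib.parse.urlencode can assemble the same query string (the parameter names below are taken from the URL above):

```python
from urllib.parse import urlencode

base = "http://search.dangdang.com/"
params = {"key": "Python", "act": "input", "page_index": 3}
url = base + "?" + urlencode(params)
print(url)  # http://search.dangdang.com/?key=Python&act=input&page_index=3
```

Dictionaries preserve insertion order in Python 3.7+, so the parameters appear in the same order as in the article's URL.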
After crawling the pages above, we extract the information from them and save it to a text file; the file holds the title of each book together with its link.
Here we define the functions required for the crawl experiment:
```python
# coding=utf-8
__author__ = "susmote"

import requests
from bs4 import BeautifulSoup


def format_str(s):
    return s.replace("\n", "").replace(" ", "").replace("\t", "")


def get_urls_in_pages(from_page_num, to_page_num):
    urls = []
    search_word = "Python"
    url_part_1 = "http://search.dangdang.com/?key="
    url_part_2 = "&act=input"
    url_part_3 = "&page_index="
    for i in range(from_page_num, to_page_num + 1):
        urls.append(url_part_1 + search_word + url_part_2 + url_part_3 + str(i))
    all_href_list = []
    for url in urls:
        print(url)
        resp = requests.get(url)
        bs = BeautifulSoup(resp.text, "lxml")
        a_list = bs.find_all("a")
        needed_list = []
        for a in a_list:
            if 'name' in a.attrs:
                name_val = a['name']
                href_val = a['href']
                title = a.text
                if 'itemlist-title' in name_val and title != "":
                    if [title, href_val] not in needed_list:
                        needed_list.append([format_str(title), format_str(href_val)])
        all_href_list += needed_list
    all_href_file = open(str(from_page_num) + '_' + str(to_page_num) + '_' + 'all_hrefs.txt', 'w')
    for href in all_href_list:
        all_href_file.write('\t'.join(href) + '\n')
    all_href_file.close()
    print(from_page_num, to_page_num, len(all_href_list))
```
Let's explain the code here.
First, format_str is used to strip extraneous whitespace from the extracted information.
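A quick illustration of what format_str does (a standalone copy of the function, with a made-up input string):

```python
def format_str(s):
    # Remove newlines, spaces, and tabs from the extracted text.
    return s.replace("\n", "").replace(" ", "").replace("\t", "")

print(format_str("  Python \t book\n"))  # Pythonbook
```

Note that because it removes every space, words inside an English title run together; if that matters, something like " ".join(s.split()) would collapse whitespace while preserving inner spaces.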
The get_urls_in_pages function is the main body of the crawl. Its two parameters give the range of page numbers to fetch. In the function body, the urls list stores all the pages to be crawled, built from those two parameters. I split the URL into three parts for convenient assembly of the links, and the first for loop does the splicing. I won't explain this further here; if anything is unclear, please leave a comment.
Next, we define a list all_href_list, used to store the book information from every page. It is in fact a nested list whose elements are [title, link], with the following form:
```
all_href_list = [
    ['title 1', 'link 1'],
    ['title 2', 'link 2'],
    ['title 3', 'link 3'],
    ...
]
```
The next code fetches and extracts information from each page; this part lives inside the for url in urls loop. We first print the link, then call requests' get method to fetch the page, and then parse the returned HTML text with BeautifulSoup into a structure it can work with, named bs.
The needed_list defined next stores the titles and links. bs.find_all('a') extracts all link elements from the page, and for a in a_list iterates over them; beforehand, we inspected the page's structure in the browser.
Each book element has a name attribute whose value contains 'itemlist-title', so we can easily filter out the book elements by it, and then put the book title and its href link together into the list. Before storing, we also check whether the pair already exists in the list and whether the title is empty.
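The filtering logic can be tried on a small, hypothetical HTML snippet that mimics the structure described above (the attribute values and links below are illustrative, not copied from Dangdang's real markup):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the structure described in the article.
html = """
<a name="itemlist-title" href="http://product.dangdang.com/123.html">Python Book</a>
<a name="other-link" href="http://example.com/">ignored: wrong name value</a>
<a href="http://example.com/">ignored: no name attribute</a>
"""

bs = BeautifulSoup(html, "html.parser")
needed = []
for a in bs.find_all("a"):
    if "name" in a.attrs:                          # skip anchors without a name
        if "itemlist-title" in a["name"] and a.text != "":
            if [a.text, a["href"]] not in needed:  # avoid duplicates
                needed.append([a.text, a["href"]])
print(needed)  # [['Python Book', 'http://product.dangdang.com/123.html']]
```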
After each page is processed, its results are added to all_href_list, with the following line of code:
```python
all_href_list += needed_list
```
Note that I am using the += operator here.
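A quick illustration of why += (rather than append) is the right choice here: on lists, += extends the list element by element, while append would nest one list inside the other:

```python
a = [1, 2]
a += [3, 4]          # += extends in place, like a.extend([3, 4])
print(a)             # [1, 2, 3, 4]

b = [1, 2]
b.append([3, 4])     # append inserts the whole list as a single element
print(b)             # [1, 2, [3, 4]]
```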
After all link elements in the range have been collected, they can be written to the file; I won't explain this part in detail.
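As a small standalone illustration of that file-writing step (with hypothetical titles and links, using the same from_to_all_hrefs.txt naming pattern as the article):

```python
all_href_list = [
    ["Example Book One", "http://product.dangdang.com/111.html"],
    ["Example Book Two", "http://product.dangdang.com/222.html"],
]

# Filename pattern from the article: <from>_<to>_all_hrefs.txt
filename = str(1) + "_" + str(2) + "_" + "all_hrefs.txt"
with open(filename, "w") as f:
    for href in all_href_list:
        f.write("\t".join(href) + "\n")  # one title<TAB>link pair per line

with open(filename) as f:
    first_line = f.readline()
print(first_line)
```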
Our next step is to define the multithreaded version. The search for this keyword returns about 32 pages of results, so here we use 3 threads to do the work, each thread handling 10 pages; in the single-threaded case, the same 30 pages are handled by one thread alone.
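The article assigns each thread a fixed page range by hand; a small helper (not part of the original code) can compute such contiguous (start, end) ranges for any page count and thread count, with the last range absorbing any remainder:

```python
def split_pages(total_pages, num_threads):
    # Split pages 1..total_pages into num_threads contiguous (start, end)
    # ranges; the last range absorbs any remainder.
    per = total_pages // num_threads
    ranges = []
    for i in range(num_threads):
        start = i * per + 1
        end = (i + 1) * per if i < num_threads - 1 else total_pages
        ranges.append((start, end))
    return ranges

print(split_pages(30, 3))  # [(1, 10), (11, 20), (21, 30)]
print(split_pages(32, 3))  # [(1, 10), (11, 20), (21, 32)]
```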
Below is the code of the multithreaded crawl scheme:
```python
# coding=utf-8
__author__ = "susmote"

import time
import threading
from mining_func import get_urls_in_pages


def multiple_threads_test():
    start_time = time.time()
    page_range_list = [
        (1, 10),
        (11, 20),
        (21, 30),
    ]
    th_list = []
    for page_range in page_range_list:
        th = threading.Thread(target=get_urls_in_pages, args=(page_range[0], page_range[1]))
        th_list.append(th)
    for th in th_list:
        th.start()
    for th in th_list:
        th.join()
    end_time = time.time()
    print("Total use time:", end_time - start_time)
    return end_time - start_time
```
To explain briefly: in order to measure the run time, we record a start time start_time and an end time end_time; the run time is the end time minus the start time.
Then we define a list page_range_list, which splits the page numbers into the three segments mentioned earlier.
We then define a list th_list to hold all the thread objects. The following loop creates 3 thread objects, each bound to a different page range, and stores them in the list.
In the subsequent loop we call th.start() on each thread to launch it. Because the threads run concurrently while our function continues, we then call each thread's join method, which blocks until that thread has finished, so the end time is not taken before the work is done.
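As an aside, the standard library's concurrent.futures offers a higher-level way to express the same start/join pattern; in this sketch a trivial placeholder function stands in for get_urls_in_pages:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(start, end):
    # Placeholder standing in for get_urls_in_pages(start, end);
    # it just reports how many pages it would have fetched.
    return end - start + 1

page_range_list = [(1, 10), (11, 20), (21, 30)]
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(crawl, s, e) for s, e in page_range_list]
    results = [f.result() for f in futures]  # result() waits, much like join()
print(results)  # [10, 10, 10]
```

Unlike bare threads, this also gives each task a return value, which is handy for collecting per-range results.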
Now for the most exciting part: testing the code.
Here, we write the following test code:
```python
# coding=utf-8
__author__ = "susmote"

from mining_threading import multiple_threads_test

if __name__ == "__main__":
    mt = multiple_threads_test()
    print('mt', mt)
```
To make the test results more accurate, we run the experiment three times and take the average time.
First experiment
Use time 6.651
Second experiment
Use time 6.876
The third experiment
Use time 6.960
The average time is as follows
6.829
The following is the single-threaded code:
```python
# coding=utf-8
__author__ = "susmote"

import time
from mining_func import get_urls_in_pages


def single_test():
    start_time = time.time()
    get_urls_in_pages(1, 30)
    end_time = time.time()
    print("Total use:", end_time - start_time)
```
The calling code is as follows:
```python
# coding=utf-8
__author__ = "susmote"

from single_mining import single_test

if __name__ == "__main__":
    st = single_test()
```
Execute it at the command line.
First time
10.138
Second time
10.290
Third time
10.087
Average Time spent
10.171
So multithreading really can improve crawl efficiency. Note that this is with a small amount of data; with a larger data volume, the advantage of multithreading would be even more obvious.
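From the averages above, the speedup works out to roughly 1.5x:

```python
single_threaded = 10.171  # average seconds, single-threaded
multi_threaded = 6.829    # average seconds, three threads
speedup = single_threaded / multi_threaded
print(round(speedup, 2))  # 1.49
```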
You can change the search keyword and page numbers, or try a different site altogether (crawl speed also depends heavily on the site being fetched).
[The original post closes with a few screenshots of the captured data.]