[Python Data Analysis] Python3 Multi-threaded Concurrent Web Crawler: Douban Books Top 250 as an Example



Based on the work of the previous two articles:

[Python Data Analysis] Python3 Excel operation-Take Douban library Top250 as an Example

[Python Data Analysis] solve and optimize some problems in Python3 Excel (2)

I have correctly crawled the Douban Books Top 250 and saved the results to Excel. Unfortunately, because of the serial crawling approach, each full run of the 250 books takes 7 to 8 minutes, which is clearly unacceptable, so the efficiency has to be improved.

If you think about it, the 10 pages (25 books per page) do not depend on each other, and each page's results are written independently, so serial crawling is pure overhead. Concurrency can obviously speed this up, and since there is no shared state that needs mutual exclusion, no lock is required.

Since concurrency is the goal, there are two options: multiple processes and multiple threads. Their trade-offs, briefly:

To put it simply, multi-process is more stable, because the failure of one process does not affect the others, but the overhead is high: too many processes consume a large amount of system resources, and switching between them is slow because it goes through the operating system's process scheduler.

A thread, as a lightweight unit of execution, is the basic unit of operating-system scheduling. Switching is fast and consumes very few resources. The disadvantage is that one crashed thread brings down the whole process, including all the other threads, so stability is worse.

Here the number of processes/threads is tiny (only 10), so even multiple processes would not add much overhead. Still, the goal is to crawl faster, and when crawling a large, stable site like Douban the stability risk is small, so multithreading is the more cost-effective choice.
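As a side note, here is a minimal sketch (not part of the crawler; fetch_page is a made-up placeholder) showing that the standard-library interfaces of the two options are nearly identical, so the decision really comes down to the trade-offs above:

# Minimal sketch: the same worker run as a thread and as a process.
import threading
import multiprocessing

def fetch_page(start):
    print("would fetch books starting at", start)

if __name__ == "__main__":
    # Thread-based: cheap to create, shares memory with the main program.
    t = threading.Thread(target=fetch_page, args=(0,))
    t.start()
    t.join()

    # Process-based: isolated (a crash cannot take down the parent),
    # but each worker carries the overhead of a separate interpreter.
    p = multiprocessing.Process(target=fetch_page, args=(25,))
    p.start()
    p.join()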

For multithreading there are two modules: the low-level thread module (called _thread in Python 3) and the threading module. The former is rarely used directly; the latter is more convenient and practical, so I use threading.

There are two ways to use threads in a program. One is to write your own class that subclasses threading.Thread and overrides __init__() and run(); you then create an instance of that class, and calling start() runs run() automatically. The other is to pass the function to be run and its arguments to the threading.Thread constructor. I use the latter.
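For illustration, a minimal sketch of the two styles (CountThread and count_to are made-up examples, not part of the crawler):

# Style 1: subclass threading.Thread and override __init__() and run().
import threading

class CountThread(threading.Thread):
    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def run(self):
        for i in range(self.limit):
            pass  # the real work would go here

# Style 2: pass the target function and its arguments to the constructor.
def count_to(limit):
    for i in range(limit):
        pass  # the real work would go here

if __name__ == "__main__":
    t1 = CountThread(1000)
    t2 = threading.Thread(target=count_to, args=(1000,))
    t1.start()   # start() schedules run() on a new thread
    t2.start()   # start() schedules the target function on a new thread
    t1.join()
    t2.join()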

The main multi-threaded code is as follows:

thread = []
for i in range(0, 250, 25):
    geturl = url + "/start=" + str(i)
    # print("Now to get " + geturl)
    t = threading.Thread(target=crawler,
                         args=(s, i, url, header, image_dir, worksheet, txtfile))
    thread.append(t)
for i in range(0, 10):
    thread[i].start()
for i in range(0, 10):
    thread[i].join()

The crawling and storage logic from the earlier articles is moved into the crawler() function, which takes seven parameters. The 10 page jobs are put into the thread list and started one by one. After starting them, join() is called to wait for every thread to finish; without it, the main thread would reach the file-closing code below while some worker threads were still running, and nothing more could be written.
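To make that concrete, here is a minimal sketch (worker and results.txt are hypothetical names, not the crawler itself) of the pattern: start all threads, join them all, and only then close the shared file:

# Why join() matters: the file must stay open until every worker has finished.
import threading

def worker(idx, outfile):
    outfile.write("result from thread %d\n" % idx)

def main():
    outfile = open("results.txt", "w")
    threads = [threading.Thread(target=worker, args=(i, outfile))
               for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()          # wait here for all workers; otherwise...
    outfile.close()       # ...close() could run while workers are still writing

if __name__ == "__main__":
    main()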

The modified and simplified code is as follows:

# -*- coding: utf-8 -*-
import requests
import re
import codecs
import threading
import xlsxwriter                    # needed for the Excel workbook below
from bs4 import BeautifulSoup
from datetime import datetime


# download a cover image
def download_img(imageurl, image_dir, imageName="xxx"):
    rsp = requests.get(imageurl, stream=True)
    image = rsp.content
    path = image_dir + imageName + '.jpg'
    with open(path, 'wb') as file:
        file.write(image)


def crawler(s, i, url, header, image_dir, worksheet, txtfile):
    postData = {"start": i}                                    # POST data for this page
    res = s.post(url, data=postData, headers=header)           # request the page
    soup = BeautifulSoup(res.content.decode(), "html.parser")  # parse with BeautifulSoup
    table = soup.findAll('table', {"width": "100%"})           # one table per book
    sz = len(table)                                            # sz = 25, 25 books per page
    for j in range(1, sz + 1):                                 # j = 1 ~ 25
        sp = BeautifulSoup(str(table[j - 1]), "html.parser")   # parse each book entry
        imageurl = sp.img['src']                               # cover image link
        bookurl = sp.a['href']                                 # book link
        bookName = sp.div.a['title']
        nickname = sp.div.span                                 # alias
        if nickname:                                           # store the alias if it exists, otherwise ""
            nickname = nickname.string.strip()
        else:
            nickname = ""
        # publication info; note that .string is not a real str, hence str()
        notion = str(sp.find('p', {"class": "pl"}).string)
        rating = str(sp.find('span', {"class": "rating_nums"}).string)   # rating
        nums = sp.find('span', {"class": "pl"}).string                   # number of ratings
        nums = nums.replace('(', '').replace(')', '').replace('\n', '').strip()
        nums = re.findall(r'(\d+)', nums)[0]
        download_img(imageurl, image_dir, bookName)            # download the cover image
        book = requests.get(bookurl)                           # open the book's own page
        sp3 = BeautifulSoup(book.content, "html.parser")       # parse it
        taglist = sp3.find_all('a', {"class": "tag"})          # find the tags
        lis = []
        for tagurl in taglist:
            sp4 = BeautifulSoup(str(tagurl), "html.parser")    # parse each tag
            lis.append(str(sp4.a.string))
        tag = ','.join(lis)                                    # join the tags with commas
        the_img = "I:\\douban\\image\\" + bookName + ".jpg"
        writelist = [i + j, bookName, nickname, rating, nums, the_img, bookurl, notion, tag]
        for k in range(0, 9):
            if k == 5:
                worksheet.insert_image(i + j, k, the_img)
            else:
                worksheet.write(i + j, k, writelist[k])
            txtfile.write(str(writelist[k]))
            txtfile.write('\t')
        txtfile.write(u'\r\n')


def main():
    now = datetime.now()                                       # start timing
    print(now)
    txtfile = codecs.open("top2501.txt", 'w', 'utf-8')
    url = "http://book.douban.com/top250?"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36",
              "Referer": "http://book.douban.com/"}
    image_dir = "I:\\douban\\image\\"
    # create the Excel workbook
    workbookx = xlsxwriter.Workbook('I:\\douban\\booktop250.xlsx')
    worksheet = workbookx.add_worksheet()
    format = workbookx.add_format()
    # format.set_align('justify')
    format.set_align('center')
    # format.set_align('vjustify')
    format.set_align('vcenter')
    format.set_text_wrap()
    for i in range(1, 251):
        worksheet.set_row(i, 70, format)
    worksheet.set_column('A:A', 3, format)
    worksheet.set_column('B:C', 17, format)
    worksheet.set_column('D:D', 4, format)
    worksheet.set_column('E:E', 7, format)
    worksheet.set_column('F:F', 10, format)
    worksheet.set_column('G:G', 19, format)
    worksheet.set_column('H:I', 40, format)
    item = ['Title', 'Alias', 'Rating', 'Number of ratings', 'Cover', 'Book link', 'Publication info', 'Tags']
    for i in range(1, 9):
        worksheet.write(0, i, item[i - 1])
    s = requests.Session()                                     # create a session
    s.get(url, headers=header)
    thread = []
    for i in range(0, 250, 25):
        geturl = url + "/start=" + str(i)
        # print("Now to get " + geturl)
        t = threading.Thread(target=crawler,
                             args=(s, i, url, header, image_dir, worksheet, txtfile))
        thread.append(t)
    for i in range(0, 10):
        thread[i].start()
    for i in range(0, 10):
        thread[i].join()
    end = datetime.now()                                       # end timing
    print(end)
    print("program time consumption: " + str(end - now))
    txtfile.close()
    workbookx.close()


if __name__ == '__main__':
    main()

The code is still a bit messy, but let's run it:

2016-03-29 08:48:37.006681
Now to get http://book.douban.com/top250?/start=0
Now to get http://book.douban.com/top250?/start=25
Now to get http://book.douban.com/top250?/start=50
Now to get http://book.douban.com/top250?/start=75
Now to get http://book.douban.com/top250?/start=100
Now to get http://book.douban.com/top250?/start=125
Now to get http://book.douban.com/top250?/start=150
Now to get http://book.douban.com/top250?/start=175
Now to get http://book.douban.com/top250?/start=200
Now to get http://book.douban.com/top250?/start=225
2016-03-29 08:49:44.003378
program time consumption: 0:01:06.996697

It only took 1 minute 6 seconds, compared with the previous 7 minutes 24 seconds: a speedup of about 6.7x. That is the advantage of multithreading. In theory the speedup should approach 10x, but thread creation and switching have their own overhead, so reaching 7x to 8x is about as good as it gets. I ran it several more times afterwards and stability was fine. P.S.: this blog template's default line wrapping is annoying; it automatically wraps lines inside the code.

(End)
