A simple multi-threaded Python 3 crawler for the Pixiv daily ranking
I started learning Python about half a year ago. Back then I copied and modified someone else's crawler of this kind. Being single-threaded, it would stop on small errors, but the headers and some other pieces were still usable, so I kept the old version around. Since much of it was copied, I did not really absorb it; rewriting it now amounts to learning it all over again. If you have questions or spot mistakes, please correct me.
First, we need to build the login request for Pixiv. It consists of the following parts:
request = urllib.request.Request(    # build the request
    url=login_url,                   # login URL
    data=login_data,                 # form data
    headers=login_header             # request headers
)
The login URL can be found by capturing and inspecting the login request:
data = {                     # build the form data
    "pixiv_id": self.id,     # account
    "pass": self.passwd,     # password
    "mode": "login",
    "skip": 1
}
login_header = {             # build the request headers
    "accept-language": "zh-cn,zh;q=0.8",
    "referer": "https://www.pixiv.net/login.php?return_to=0",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
}
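One detail the snippets above leave implicit is that urllib expects the POST body as bytes rather than a dict. A minimal sketch of how the pieces could fit together, assuming the data and login_header dicts above (the url-encoding step and the placeholder login_url are my additions, not shown in the original code):

import urllib.parse
import urllib.request

login_url = "https://www.pixiv.net/login.php"               # assumed placeholder; use the URL you actually captured
login_data = urllib.parse.urlencode(data).encode("utf-8")    # the form dict must be url-encoded into bytes
request = urllib.request.Request(url=login_url,
                                 data=login_data,
                                 headers=login_header)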
Because the crawler has to fetch many pages, I log in with cookies. The cookie may change between runs, however, so it has to be refreshed by logging in again every time:
cookie = http.cookiejar.MozillaCookieJar(".cookie")    # the cookie file is overwritten on every run
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
print("Log in successfully!")
cookie.save(ignore_discard=True, ignore_expires=True)
response.close()
print("Update cookies successfully!")
After that, logging in from the cookie is straightforward: just load the local cookie file:
def cookie_opener(self):      # log in with the saved cookie and build an opener
    cookie = http.cookiejar.MozillaCookieJar()
    cookie.load(".cookie", ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    return opener
My original plan was to parse the image links from all work pages first and only then download them. That turns out to be a waste of time, because parsing and downloading take very different amounts of time: parsing the whole list can take 3 or 4 minutes, while a single download takes less than 10 seconds. When the machine allows it, it is far more efficient to let one thread parse while other threads download.
After some reading, I switched to a producer-consumer model:
1. One crawler thread parses the image links from the work pages and puts them into a queue.
2. n downloader threads take links out of the queue and download the images the crawler has found.
This is the producer-consumer model used in this article: parsing and downloading happen at the same time. Parsing is generally faster than downloading, so the downloaders may keep up at the very start, but the parser soon pulls ahead, and one parser feeding several downloaders gives a considerable gain in efficiency. A minimal skeleton of the idea is sketched below.
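To make the pattern concrete (this is only an illustrative sketch, not the project's actual classes; parse_image_urls, download, and work_urls are placeholders), a minimal producer-consumer setup with queue.Queue could look like this:

import queue
import threading

task_queue = queue.Queue()          # holds (referer, image_url) pairs
done_parsing = threading.Event()    # set by the producer when parsing is finished

def producer(work_urls):
    for url in work_urls:
        for img_url in parse_image_urls(url):   # placeholder: extract image links from one work page
            task_queue.put((url, img_url))
    done_parsing.set()

def consumer():
    while not (done_parsing.is_set() and task_queue.empty()):
        try:
            referer, img_url = task_queue.get(timeout=1)
        except queue.Empty:
            continue
        download(img_url, referer)              # placeholder: fetch and save one image
        task_queue.task_done()

threading.Thread(target=producer, args=(work_urls,)).start()
for _ in range(4):                              # e.g. four downloader threads
    threading.Thread(target=consumer).start()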
Concretely, the downloader is written as a subclass of threading.Thread:
class downloader(threading.Thread):      # each downloader runs as a separate thread
    def __init__(self, q, path, opener):
        threading.Thread.__init__(self)
        self.opener = opener             # opener carrying the login cookie
        self.q = q                       # queue of (referer, image_url) pairs
        self.sch = 0                     # progress, in the range [0, 50]
        self.is_working = False          # whether the thread is currently downloading
        self.filename = ""               # name of the file currently being downloaded
        self.path = path                 # directory to save files in
        self.exitflag = False            # exit signal

    def run(self):
        def report(blocks, blocksize, total):                # callback that updates the download progress
            self.sch = int(blocks * blocksize / total * 50)  # current percentage, scaled to 50 ticks
            self.sch = min(self.sch, 50)                     # ignore overflow

        def download(url, referer, path):                  # download one image with urlretrieve
            self.opener.addheaders = [                     # attach headers to the opener
                ('Accept-Language', 'zh-CN,zh;q=0.8'),
                ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0'),
                ('Referer', referer)                       # required by Pixiv's anti-leech check
            ]
            pattern = re.compile(r'([a-zA-Z.0-9_-]*?)$')   # regex that captures the file name
            filename = re.search(pattern, url).group(0)    # derive the local file name from the image link
            if filename.find("master") != -1:              # strip the _masterXXXX suffix added to multi-page works
                master = re.search(re.compile(r'_master[0-9]*'), filename)
                filename = filename.replace(master.group(0), '')
            self.filename = filename
            urllib.request.install_opener(self.opener)     # install the updated opener
            try:
                urllib.request.urlretrieve(url, path + filename, report)  # download the file locally
            except:
                os.remove(path + filename)                 # on failure, delete the broken file
                self.q.put((referer, url))                 # and put (referer, url) back into the queue

        while not self.exitflag:
            if not self.q.empty():        # when the queue is not empty, take one item and download it
                links = self.q.get()
                self.is_working = True
                download(links[1], links[0], self.path)
                self.sch = 0              # reset the progress
                self.is_working = False
Wrapping the downloader in this class and calling start() works just like a plain threading.Thread, and the extra attributes (file name, progress, working state) let the main loop display each download later in the run.
To collect the addresses, the crawler first scans the ranking page and gathers the URLs of all works. BeautifulSoup is used below:
response = opener.open(self.url)
html = response.read().decode("gbk", "ignore")   # decode, ignoring errors (they do not affect the links)
soup = BeautifulSoup(html, "html5lib")           # build a BeautifulSoup object with the html5lib parser
tag_a = soup.find_all("a")
for link in tag_a:                               # walk all links found in <a> tags
    top_link = str(link.get("href"))
    if top_link.find("member_illust") != -1:
        pattern = re.compile(r'id=[0-9]*')       # keep only links that carry a work id
        result = re.search(pattern, top_link)
        if result != None:
            result_id = result.group(0)
            url_work = "http://www.pixiv.net/member_illust.php?mode=medium&illust_" + result_id
            if url_work not in self.rankurl_list:
                self.rankurl_list.append(url_work)
Since only one thread does the parsing, I start it in the ordinary way:
def _crawl():
    while len(self.rankurl_list) > 0:
        url = self.rankurl_list[0]
        response = opener.open(url)
        html = response.read().decode("gbk", "ignore")   # decode, ignoring errors (they do not affect the links)
        soup = BeautifulSoup(html, "html5lib")
        imgs = soup.find_all("img", "original-image")
        if len(imgs) > 0:                                 # single-image work
            self.picurl_queue.put((url, str(imgs[0]["data-src"])))
        else:                                             # multi-page work
            multiple = soup.find_all("a", "_work multiple")
            if len(multiple) > 0:
                manga_url = "http://www.pixiv.net/" + multiple[0]["href"]
                response = opener.open(manga_url)
                html = response.read().decode("gbk", "ignore")
                soup = BeautifulSoup(html, "html5lib")
                imgs = soup.find_all("img", "image ui-scroll-view")
                for i in range(0, len(imgs)):
                    self.picurl_queue.put((manga_url + "&page=" + str(i), str(imgs[i]["data-src"])))
        self.rankurl_list = self.rankurl_list[1:]

self.crawler = threading.Thread(target=_crawl)   # the parsing thread: the producer in the producer-consumer model
self.crawler.start()
At the same time, spawn as many downloader threads as the maximum thread count configured earlier:
for i in range(0, self.max_dlthread):    # start the download threads (the consumers) up to the configured maximum
    thread = downloader(self.picurl_queue, self.os_path, opener)
    thread.start()
    self.downlist.append(thread)         # keep every started thread in a list
The next step is to display each thread's download progress and wait until everything is finished:
flag = False
while not self.picurl_queue.empty() or len(self.rankurl_list) > 0 or not flag:
    # show the progress and wait for all threads; the loop ends when (conditions negated above):
    # 1. the download queue is empty
    # 2. the list of pages to parse is empty
    # 3. every downloader has finished its current task
    os.system("cls")
    flag = True
    if len(self.rankurl_list) > 0:
        print(str(len(self.rankurl_list)) + " urls to parse...")
    if not self.picurl_queue.empty():
        print(str(self.picurl_queue.qsize()) + " pics ready to download...")
    for t in self.downlist:
        if t.is_working:
            flag = False
            print("Downloading " + '"' + t.filename + '":\t[' + ">" * t.sch + " " * (50 - t.sch) + "]" + str(t.sch * 2) + "%")
        else:
            print("This downloader is not working now.")
    time.sleep(0.1)
The screenshot below shows the crawler in action: the number of links still to parse and the number of images waiting to be downloaded are listed at the top, and each thread's progress bar advances at its own pace.
When all tasks are complete, an exit signal is sent to every downloader thread:
for t in self.downlist:    # send the exit signal
    t.exitflag = True
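Setting exitflag only asks the threads to stop. If you also want the main thread to wait until each downloader has actually returned from run() (this step is my addition and is not part of the original code), you could join them:

for t in self.downlist:    # optional: block until every downloader thread has really exited
    t.join()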
Once everything is done, the program reports how long the whole job took:
def start(self):
    st = time.time()
    self.login()
    opener = self.cookie_opener()
    self.crawl(opener)
    ed = time.time()
    tot = ed - st
    intvl = getTime(int(tot))            # format the elapsed seconds as a readable string
    os.system("cls")
    print("Finished.")
    print("Total using " + intvl + ".")  # total time once all jobs are done
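getTime is not shown in the excerpts above; a hypothetical helper with the same role (turning a number of seconds into a readable string) might look like this:

def getTime(seconds):
    # hypothetical helper: format elapsed seconds as "Xh Ym Zs"
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return "{}h {}m {}s".format(h, m, s)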
For reference, this is how long it took to crawl the entire Pixiv daily ranking on December 14, 2016:
The code is available at the Coding.net repository below:
https://coding.net/u/MZI/p/PixivSpider/git