Python Crawler Learning Notes: Multi-threaded Crawlers

Source: Internet
Author: User

Installation and Use of XPath

1. Introduction to XPath

I had only just learned regular expressions and was happily using them, and now I am already setting them aside and switching to XPath. Some people will say this is a waste: if I knew XPath was this easy to learn, why bother with regular expressions first? Actually, I think learning regular expressions first was very helpful; the reason I am switching is simply that XPath is more accurate and easier to use. Some people may not know the difference between XPath and regular expressions, so here is an analogy. Extracting content with a regular expression is like finding Tiananmen from a description of its surroundings: "the building with a round building on its left and a square building on its right." You can find it, but only by matching what is around it. With XPath, the description becomes the exact street address of Tiananmen Square. So, which method sounds more efficient and more accurate?
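To make the contrast concrete, here is a minimal sketch (it uses the demo page myHtml.html that is written a little further down; both snippets pull out the same list items):

# coding=utf-8
import re
from lxml import etree

html = open('myHtml.html', 'r').read()

# regular expression: "whatever text sits between <li> and </li>"
# (describing the surroundings of the content)
print re.findall(r'<li>(.*?)</li>', html, re.S)

# XPath: "the li children of the element whose id is 'like'"
# (naming the exact location of the content)
selector = etree.HTML(html)
print selector.xpath('//*[@id="like"]/li/text()')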

2. Install XPath

XPath is included in the lxml library, so where do we download that? Click here to enter the download page, press Ctrl+F to search for lxml, and download it. After the download completes, change the file extension to .zip, unzip the package, copy the folder named lxml, and paste it into the Lib directory of your Python installation. The installation is then complete. (On most systems today, simply running pip install lxml also works.)
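If you want to check that the installation succeeded, a quick sanity test is enough (this assumes nothing beyond lxml being importable):

from lxml import etree
print etree.LXML_VERSION   # prints the installed lxml version, e.g. (3, 4, 4, 0)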

3. Use of XPath

For ease of demonstration, I wrote a simple web page in HTML. The code is below (to save time and make it easier for you to test directly, feel free to copy and paste it):

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>test</title>
</head>
<body>
    <ul id="like">
        <li>like one</li>
        <li>like two</li>
        <li>like three</li>
    </ul>
    <div id="url">
        <a href="http://www.baidu.com">baidu</a>
        <a href="http://www.hao123.com">hao123</a>
    </div>
</body>
</html>

Open the page in Chrome, right-click anywhere on it, and select Inspect. Chrome's developer tools will open and show the HTML of the page.

Now right-click any line of the HTML code and you will see a Copy item; hover over it and you will see Copy XPath. Copy it first. How do we use it?

# coding=utf-8
from lxml import etree

f = open('myHtml.html', 'r')
html = f.read()
f.close()

selector = etree.HTML(html)
content = selector.xpath('//*[@id="like"]/li/text()')
for each in content:
    print each

The printed result:

like one
like two
like three

Obviously, we printed the content we wanted. Note the text() function used inside xpath(); it extracts the text content of a node. But what if we want an attribute instead? For example, if we want the two link addresses in the HTML, that is, the href attributes, we can do this:

content = selector.xpath('//*[@id="url"]/a/@href')
for each in content:
    print each

The printed result at this time is

http://www.baidu.com
http://www.hao123.com

By now you probably have some feel for the symbols inside xpath(). For example, // at the start searches from the document root and matches nodes at any depth, / steps from a parent node down to its child, and attributes such as id are used to pin the path down step by step. Since the whole thing is a tree structure, it is no wonder the module is called etree.
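As a small illustration of the difference between / and // (a sketch, again using the myHtml.html demo page above), both of the following arrive at the same li elements:

from lxml import etree

selector = etree.HTML(open('myHtml.html', 'r').read())
print selector.xpath('/html/body/ul/li/text()')   # absolute path, one level at a time from the root
print selector.xpath('//li/text()')               # // matches li nodes at any depth in the document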

4. Special Use of XPath

<!DOCTYPE html>
<html>
<body>
    <div id="like1">like one</div>
    <div id="like2">like two</div>
    <div id="like3">like three</div>
</body>
</html>

Faced with the page above, how should we get all three lines of content? Well, easy: couldn't I just write three XPath statements? Done. But if we did that, our efficiency would be far too low. Look closely at the id attributes of the three divs: the first four letters of each are like. That makes things simple: we can use starts-with to extract all three lines at once, as shown below:

content = selector.xpath('//div[starts-with(@id,"like")]/text()')
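Run as a complete snippet, it looks like this (a sketch; I save the three-div page above as myHtml2.html, a file name of my own choosing):

from lxml import etree

selector = etree.HTML(open('myHtml2.html', 'r').read())
content = selector.xpath('//div[starts-with(@id,"like")]/text()')
for each in content:
    print each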

In this case, though, we have to write the XPath expression by hand; of course, we can still copy a path and then modify it, so this is just a little extra work in exchange for efficiency. Next, let's look at extracting nested tags.

<!DOCTYPE html>
<html>
<body>
    <div id="text">
        <p>hello
            <b>world
                <font color="red">Python</font>
            </b>
        </p>
    </div>
</body>
</html>

For a page like the one above, how do we get the sentence hello world Python? Clearly, the tags are nested here. Let's first extract it the normal way and see what we get:

content = selector.xpath('//*[@id="text"]/p/text()')
for each in content:
    print each

After running it, unfortunately only the text hello is printed and the rest of the words are lost. What now? In this situation, we can use string(.), like this:

content = selector.xpath('//*[@id="text"]/p')[0]
info = content.xpath('string(.)')
data = info.replace('\n', '').replace(' ', '')
print data

This way, the correct content is printed. As for why the third line is there: string(.) also picks up the newlines and indentation from the HTML source, and the replace() calls strip them out. Remove that line, look at the result, and you will understand immediately.
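For the curious, here is roughly what the intermediate value looks like (assuming the nested-tag page as written above; the exact whitespace depends on how the HTML is indented):

print repr(info)   # something like u'hello\n    world\n        Python\n' -- whitespace straight from the HTML source
print data         # helloworldPython, once newlines and spaces are stripped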

A Brief Introduction to Python Parallelization

Some people say that parallelization in Python is not true parallelism: because of the Global Interpreter Lock, only one thread executes Python bytecode at a time. But a crawler spends most of its time waiting on the network, so multithreading can still significantly improve the efficiency of our code and save us a lot of time. Next, let's compare the running time of a single thread and of multiple threads.

# coding=utf-8
import requests
from multiprocessing.dummy import Pool as ThreadPool
import time

def getsource(url):
    html = requests.get(url)

if __name__ == '__main__':
    urls = []
    for i in range(50, 500, 50):
        newpage = 'http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=' + str(i)
        urls.append(newpage)

    # single-thread timing
    time1 = time.time()
    for i in urls:
        print i
        getsource(i)
    time2 = time.time()
    print 'time consumed by a single thread: ' + str(time2 - time1) + 's'

    # multi-thread timing
    pool = ThreadPool(4)
    time3 = time.time()
    results = pool.map(getsource, urls)
    pool.close()
    pool.join()
    time4 = time.time()
    print 'time consumed by multiple threads: ' + str(time4 - time3) + 's'

The printed result is

http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=150
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=200
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=250
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=300
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=350
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=400
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=450
time consumed by a single thread: 7.26399993896s
time consumed by multiple threads: 2.49799990654s

As for why the links above use multiples of 50: I noticed that every time you turn a page on Baidu Tieba, the pn value increases by 50. From the results above, we can see that multithreading is much more efficient than a single thread. I won't explain the multithreading usage in this code in detail; anyone who has dealt with Java will find multithreading familiar, although the two actually differ quite a bit. Never touched Java? Sorry, then please digest the code above on your own.

Practice: Crawling Dangdang Books

I have been buying books on Dangdang for a long time, so now that I have learned how to crawl information with Python, of course the first thing to crawl is Dangdang's books. The practice goes as follows:

I searched for Java on Dangdang and found 89 pages of results; I chose to crawl the first 80 pages. To compare the efficiency of multithreading and a single thread, I timed both here: the single-threaded crawl took 67s, while the multithreaded one took only 15s.

How to analyze a webpage was already covered in the XPath section above: it is nothing more than opening the page, right-clicking and selecting Inspect, reading the page's HTML, finding the pattern, and extracting the information, so I won't repeat it here. Because the code is short, the full source follows directly.

# coding=utf-8
import requests
import re
import time
import sys
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool

reload(sys)
sys.setdefaultencoding('utf-8')

def changepage(url, total):
    # build the list of page URLs by rewriting the page_index parameter
    urls = []
    nowpage = int(re.search('page_index=(\d+)', url, re.S).group(1))
    for i in range(nowpage, total + 1):
        link = re.sub('page_index=(\d+)', 'page_index=%s' % i, url)
        urls.append(link)
    return urls

def spider(url):
    html = requests.get(url)
    content = html.text
    selector = etree.HTML(content)
    title = selector.xpath('//*[@id="component_0__01_6612"]/li/a/@title')
    detail = selector.xpath('//*[@id="component_0__01_6612"]/li/p[3]/span[1]/text()')
    saveinfo(title, detail)

def saveinfo(title, detail):
    # write each title and its price line to the shared output file
    for i in range(len(title)):
        f.writelines(title[i] + '\n')
        f.writelines(detail[i] + '\n\n')

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('info.txt', 'a')
    url = 'http://search.dangdang.com/?key=Java&act=input&page_index=1'
    urls = changepage(url, 80)
    time1 = time.time()
    pool.map(spider, urls)
    pool.close()
    pool.join()
    f.close()
    print 'crawling succeeded!'
    time2 = time.time()
    print 'time consumed by multithreading: ' + str(time2 - time1) + 's'

    # single-thread version for comparison:
    # time1 = time.time()
    # for each in urls:
    #     spider(each)
    # time2 = time.time()
    # f.close()
    # print 'time consumed by a single thread: ' + str(time2 - time1) + 's'

As you can see, everything used in the code above was covered in detail in the XPath and parallelization sections, so it should read quite easily.

With this, the series of articles on Python crawlers comes to an end. Thank you for reading.
