The installation and use of XPath
1. Introduction to XPath
Having just learned regular expressions, we now set them aside and use XPath instead. Some people will say this is a pain: had they known, they would have learned XPath from the start, since it is more convenient. Personally, I think learning regular expressions is still very useful; the reason for moving to XPath is that it locates content more precisely and is easier to use. For anyone unclear about the difference between the two: extracting content with a regular expression is like telling someone who wants to get to Tiananmen, "there is a round building on the left and a square building on the right, go find it," whereas with XPath the description becomes the exact address of Tiananmen Square. Which way sounds more efficient and more accurate?
2. Installation of XPath
XPath support comes with the lxml library, so where do we download that? The original article linked to a download page: open it, press Ctrl+F to search for lxml, and download the package; then change the downloaded file's extension to .zip, unzip it, and copy the folder named lxml into your Python Lib directory. That completes the installation. (These days, simply running pip install lxml achieves the same thing.)
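Whichever route you take, here is a quick way to confirm the install worked (a trivial check of my own, not from the original article):

# coding=utf-8
# sanity check that lxml is importable (my addition, not from the article)
from lxml import etree
print etree.LXML_VERSION   # prints a version tuple such as (3, 4, 0, 0)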
3. Use of XPath
To make the demo easy to follow, I wrote a simple web page in HTML; to save time and let you test along, you can copy and paste the code directly.
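The original HTML listing did not survive in this copy of the article. The following is a minimal reconstruction: the element ids, tag structure, and text are inferred from the XPath expressions and printed output used later in this section, so treat it as an approximation. Save it as myhtml.html:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>test</title>
</head>
<body>
    <ul id="like">
        <li>like one</li>
        <li>like two</li>
        <li>like three</li>
    </ul>
    <div id="url">
        <a href="http://www.baidu.com">baidu</a>
        <a href="http://www.hao123.com">hao123</a>
    </div>
    <div id="like1">content one</div>
    <div id="like2">content two</div>
    <div id="like3">content three</div>
    <div id="text">
        <p>hello
            <b>world
                <i>Python</i>
            </b>
        </p>
    </div>
</body>
</html>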
Open this page in Chrome, then right-click and choose Inspect; the developer tools panel appears.
Now right-click any line of HTML in the Elements panel and you will see a Copy entry; hover over it and you will see Copy XPath. Copy it down first; how do we use it?
# coding=utf-8
from lxml import etree

f = open('myhtml.html', 'r')
html = f.read()
f.close()

selector = etree.HTML(html)
content = selector.xpath('//*[@id="like"]/li/text()')
for each in content:
    print each
Look at the printed results:

like one
like two
like three
Clearly we printed out what we wanted. Notice that we used the text() function inside the XPath expression; it extracts an element's text content. But what if we want an attribute instead? Say we want the two link addresses in the HTML, that is, the href attributes; we can do this:
content = selector.xpath('//*[@id="url"]/a/@href')
for each in content:
    print each
This time the printed result is:

http://www.baidu.com
http://www.hao123.com
By now you probably have some feel for the symbols in an XPath expression: a leading // means search from the document root at any depth, a single / steps from a parent node down to a child node, and a predicate like [@id="..."] filters elements by attribute as we walk down level by level. Because the document is a tree structure, it is no wonder the module is named etree.
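To make those symbols concrete, here is a small self-contained sketch using an inline HTML string of my own (not the tutorial's myhtml.html):

# coding=utf-8
from lxml import etree

# a tiny inline page, just to illustrate the path symbols
page = etree.HTML('<div><ul id="demo"><li>a</li><li>b</li></ul></div>')

# '//' searches from the root of the whole tree, at any depth
print page.xpath('//li/text()')                     # ['a', 'b']
# '/' takes one parent-to-child step, so the full path must be spelled out
print page.xpath('/html/body/div/ul/li/text()')     # ['a', 'b']
# '[@id="demo"]' filters elements on an attribute value
print page.xpath('//ul[@id="demo"]/li/text()')      # ['a', 'b']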
4. Special uses of XPath
Now suppose the page contains three divs whose content we all want. Simple, you say: just write three XPath statements. So easy. But if we did that, our efficiency would be rather low. Look closely at the id attributes of the three divs: they all begin with the same four letters, like. That makes things easy; we can use starts-with to extract all three rows at once, as shown below.
content = selector.xpath('//div[starts-with(@id,"like")]/text()')
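To see the matches, you can loop and print them just as before (a trivial completion of my own, not shown in the original):

for each in content:
    print each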
This does cost a little extra effort: we have to write the XPath by hand (or copy one from the browser and modify it), trading a bit of added complexity for efficiency. Next, let's look at extracting text from nested tags.
Suppose the page holds a sentence like hello world Python spread across nested tags, and we want all of it. This is a tag-nested-within-tag situation. If we extract it the normal way, let's see what happens:
content = selector.xpath('//*[@id="text"]/p/text()')
for each in content:
    print each
After running this, unfortunately only the word hello is printed; the other characters are lost. What now? This situation can be handled with the help of string(.), as shown below:
content = selector.xpath('//*[@id="text"]/p')[0]
info = content.xpath('string(.)')
data = info.replace('\n', '').replace(' ', '')
print data
Now the full content prints correctly. As for why the third line is there: remove it, rerun, and look at the output; the stray newlines and indentation whitespace will reappear, and you will understand naturally.
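To make the difference concrete, here is a self-contained sketch (the nested structure is my assumption): text() returns only the element's own text nodes, while string(.) concatenates the text of the element and all of its descendants.

# coding=utf-8
from lxml import etree

page = etree.HTML('<div id="text"><p>hello <b>world <i>Python</i></b></p></div>')
# text() only sees the p element's direct text node, so the nested parts are lost
print page.xpath('//*[@id="text"]/p/text()')                  # ['hello ']
# string(.) gathers the text of the element and everything nested inside it
print page.xpath('//*[@id="text"]/p')[0].xpath('string(.)')   # hello world Python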
A simple introduction to Python parallelism
Some people say that parallelism in Python is not true parallelism (because of the global interpreter lock, threads do not execute Python bytecode simultaneously), but for I/O-bound work like downloading pages, multithreading can still significantly improve efficiency and save us a large amount of time. Let's compare single-threaded and multithreaded timings.
# coding=utf-8
import requests
from multiprocessing.dummy import Pool as ThreadPool
import time

def getsource(url):
    html = requests.get(url)

if __name__ == '__main__':
    urls = []
    for i in range(50, 500, 50):
        newpage = 'http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=' + str(i)
        urls.append(newpage)

    # single-threaded timing
    time1 = time.time()
    for i in urls:
        print i
        getsource(i)
    time2 = time.time()
    print 'Single-threaded time consuming: ' + str(time2 - time1) + ' s'

    # multi-threaded timing
    pool = ThreadPool(4)
    time3 = time.time()
    results = pool.map(getsource, urls)
    pool.close()
    pool.join()
    time4 = time.time()
    print 'Multithreading time consuming: ' + str(time4 - time3) + ' s'
The printed results:

http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=150
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=200
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=250
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=300
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=350
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=400
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=450
Single-threaded time consuming: 7.26399993896 s
Multithreading time consuming: 2.49799990654 s
As for why the step in the links above is 50: I noticed that each time you turn a page in Baidu Tieba, the pn value increases by 50. The results show that multithreading improves efficiency enormously over a single thread. I won't explain the multithreaded code in much detail; it works much like thread pools in Java, so if you have used those it will look familiar. Never touched Java? Then sorry, please digest the code above on your own.
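One small refinement worth noting (my addition, not part of the original timing test): getsource above throws the response away, so results is just a list of None values. Returning the body makes pool.map collect something usable:

def getsource(url):
    # return the page body so pool.map collects the actual responses
    return requests.get(url).text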
Hands-on: crawling book information from Dangdang
I have been buying books on Dangdang for a long time; now that we have learned how to crawl information with Python, let's start by crawling Dangdang's book information. When this exercise is done, we will have saved each book's title and detail (price) line to a text file.
Searching for Java on Dangdang yields 89 pages of results; I chose to crawl the first 80. To compare the efficiency of multithreading against a single thread, I timed both here: the single-threaded crawl took 67 s, while the multithreaded one took only 15 s.
How to dissect a web page we already covered in the XPath section: open the page, right-click and choose Inspect, study the page's HTML, look for the pattern, and extract the information. I won't go over that again; the code is fairly short, so here is the source directly.
# coding=utf-8
import requests
import re
import time
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

def changepage(url, total):
    # rewrite the page_index parameter to generate the url of every page
    urls = []
    nowpage = int(re.search(r'page_index=(\d+)', url, re.S).group(1))
    for i in range(nowpage, total + 1):
        link = re.sub(r'page_index=(\d+)', 'page_index=%s' % i, url)
        urls.append(link)
    return urls

def spider(url):
    html = requests.get(url)
    content = html.text
    selector = etree.HTML(content)
    title = selector.xpath('//*[@id="component_0__0__6612"]/li/a/@title')
    detail = selector.xpath('//*[@id="component_0__0__6612"]/li/p[3]/span[1]/text()')
    saveinfo(title, detail)

def saveinfo(title, detail):
    # write title/detail pairs to the file opened in the main block
    length1 = len(title)
    for i in range(0, length1):
        f.write(title[i] + '\n')
        f.write(detail[i] + '\n')

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('info.txt', 'a')
    url = 'http://search.dangdang.com/?key=java&act=input&page_index=1'
    urls = changepage(url, 80)
    time1 = time.time()
    pool.map(spider, urls)
    pool.close()
    pool.join()
    f.close()
    print 'Crawl success!'
    time2 = time.time()
    print 'Multithreading time consuming: ' + str(time2 - time1) + ' s'

    # single-threaded version for comparison:
    # time1 = time.time()
    # for each in urls:
    #     spider(each)
    # time2 = time.time()
    # f.close()
    # print 'Single-threaded time consuming: ' + str(time2 - time1) + ' s'
Clearly, everything used in the code above was introduced in detail in the XPath and parallelization sections, so it should be easy to read. One caveat: an auto-generated id such as component_0__0__6612 is tied to the page layout at the time of writing, so if you run this today you will likely need to re-copy the XPath from the current page.
OK, that brings this Python crawler series to a close. Thank you for reading.