The installation and use of XPath
1. Introduction to XPath
Having just learned regular expressions, we now set them aside and use XPath instead. Some people will say this is a pain: had they known, they would have learned XPath from the start, since it is more convenient. Personally, I think learning regular expressions is still very useful; the reason for moving to XPath is that it locates content more precisely and is easier to use. For anyone unclear about the difference between the two: extracting content with a regular expression is like telling someone who wants to get to Tiananmen, "there is a round building on the left and a square building on the right, go find it," whereas with XPath the description becomes the exact address of Tiananmen Square. Which way sounds more efficient and more accurate?
2. Installation of XPath
XPath support comes with the lxml library, so where do we download that? The original article linked to a download page: open it, press Ctrl+F to search for lxml, and download the package; then change the downloaded file's extension to .zip, unzip it, and copy the folder named lxml into your Python Lib directory. That completes the installation. (These days, simply running pip install lxml achieves the same thing.)
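Whichever route you take, here is a quick way to confirm the install worked (a trivial check of my own, not from the original article):

# coding=utf-8
# sanity check that lxml is importable (my addition, not from the article)
from lxml import etree
print etree.LXML_VERSION   # prints a version tuple such as (3, 4, 0, 0)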
3. Use of XPath
To make the demo easy to follow, I wrote a simple web page in HTML; to save time and let you test along, you can copy and paste the code directly.
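The original HTML listing did not survive in this copy of the article. The following is a minimal reconstruction: the element ids, tag structure, and text are inferred from the XPath expressions and printed output used later in this section, so treat it as an approximation. Save it as myhtml.html:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>test</title>
</head>
<body>
    <ul id="like">
        <li>like one</li>
        <li>like two</li>
        <li>like three</li>
    </ul>
    <div id="url">
        <a href="http://www.baidu.com">baidu</a>
        <a href="http://www.hao123.com">hao123</a>
    </div>
    <div id="like1">content one</div>
    <div id="like2">content two</div>
    <div id="like3">content three</div>
    <div id="text">
        <p>hello
            <b>world
                <i>Python</i>
            </b>
        </p>
    </div>
</body>
</html>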
Open this page in Chrome, then right-click and choose Inspect; the developer tools panel appears.
Now right-click any line of HTML in the Elements panel and you will see a Copy entry; hover over it and you will see Copy XPath. Copy it down first; how do we use it?
# coding=utf-8
from lxml import etree

f = open('myhtml.html', 'r')
html = f.read()
f.close()

selector = etree.HTML(html)
content = selector.xpath('//*[@id="like"]/li/text()')
for each in content:
    print each
Look at the printed results:

like one
like two
like three
Clearly we printed out what we wanted. Notice that we used the text() function inside the XPath expression; it extracts an element's text content. But what if we want an attribute instead? Say we want the two link addresses in the HTML, that is, the href attributes; we can do this:
content = selector.xpath('//*[@id="url"]/a/@href')
for each in content:
    print each
This time the printed result is:

http://www.baidu.com
http://www.hao123.com
By now you probably have some feel for the symbols in an XPath expression: a leading // means search from the document root at any depth, a single / steps from a parent node down to a child node, and a predicate like [@id="..."] filters elements by attribute as we walk down level by level. Because the document is a tree structure, it is no wonder the module is named etree.
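To make those symbols concrete, here is a small self-contained sketch using an inline HTML string of my own (not the tutorial's myhtml.html):

# coding=utf-8
from lxml import etree

# a tiny inline page, just to illustrate the path symbols
page = etree.HTML('<div><ul id="demo"><li>a</li><li>b</li></ul></div>')

# '//' searches from the root of the whole tree, at any depth
print page.xpath('//li/text()')                     # ['a', 'b']
# '/' takes one parent-to-child step, so the full path must be spelled out
print page.xpath('/html/body/div/ul/li/text()')     # ['a', 'b']
# '[@id="demo"]' filters elements on an attribute value
print page.xpath('//ul[@id="demo"]/li/text()')      # ['a', 'b']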
4. Special uses of XPath
Now suppose the page contains three divs whose content we all want. Simple, you say: just write three XPath statements. So easy. But if we did that, our efficiency would be rather low. Look closely at the id attributes of the three divs: they all begin with the same four letters, like. That makes things easy; we can use starts-with to extract all three rows at once, as shown below.
content = selector.xpath('//div[starts-with(@id,"like")]/text()')
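To see the matches, you can loop and print them just as before (a trivial completion of my own, not shown in the original):

for each in content:
    print each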
This does cost a little extra effort: we have to write the XPath by hand (or copy one from the browser and modify it), trading a bit of added complexity for efficiency. Next, let's look at extracting text from nested tags.
Suppose the page holds a sentence like hello world Python spread across nested tags, and we want all of it. This is a tag-nested-within-tag situation. If we extract it the normal way, let's see what happens:
content = selector.xpath('//*[@id="text"]/p/text()')
for each in content:
    print each
After running this, unfortunately only the word hello is printed; the other characters are lost. What now? This situation can be handled with the help of string(.), as shown below:
content = selector.xpath('//*[@id="text"]/p')[0]
info = content.xpath('string(.)')
data = info.replace('\n', '').replace(' ', '')
print data
Now the full content prints correctly. As for why the third line is there: remove it, rerun, and look at the output; the stray newlines and indentation whitespace will reappear, and you will understand naturally.
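To make the difference concrete, here is a self-contained sketch (the nested structure is my assumption): text() returns only the element's own text nodes, while string(.) concatenates the text of the element and all of its descendants.

# coding=utf-8
from lxml import etree

page = etree.HTML('<div id="text"><p>hello <b>world <i>Python</i></b></p></div>')
# text() only sees the p element's direct text node, so the nested parts are lost
print page.xpath('//*[@id="text"]/p/text()')                  # ['hello ']
# string(.) gathers the text of the element and everything nested inside it
print page.xpath('//*[@id="text"]/p')[0].xpath('string(.)')   # hello world Python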
A simple introduction to Python parallelism
Some people say that parallelism in Python is not true parallelism (because of the global interpreter lock, threads do not execute Python bytecode simultaneously), but for I/O-bound work like downloading pages, multithreading can still significantly improve efficiency and save us a large amount of time. Let's compare single-threaded and multithreaded timings.
# coding=utf-8
import requests
from multiprocessing.dummy import Pool as ThreadPool
import time

def getsource(url):
    html = requests.get(url)

if __name__ == '__main__':
    urls = []
    for i in range(50, 500, 50):
        newpage = 'http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=' + str(i)
        urls.append(newpage)

    # single-threaded timing
    time1 = time.time()
    for i in urls:
        print i
        getsource(i)
    time2 = time.time()
    print 'Single-threaded time consuming: ' + str(time2 - time1) + ' s'

    # multi-threaded timing
    pool = ThreadPool(4)
    time3 = time.time()
    results = pool.map(getsource, urls)
    pool.close()
    pool.join()
    time4 = time.time()
    print 'Multithreading time consuming: ' + str(time4 - time3) + ' s'
The printed results:

http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=150
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=200
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=250
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=300
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=350
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=400
http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=450
Single-threaded time consuming: 7.26399993896 s
Multithreading time consuming: 2.49799990654 s
As for why the step in the links above is 50: I noticed that each time you turn a page in Baidu Tieba, the pn value increases by 50. The results show that multithreading improves efficiency enormously over a single thread. I won't explain the multithreaded code in much detail; it works much like thread pools in Java, so if you have used those it will look familiar. Never touched Java? Then sorry, please digest the code above on your own.
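One small refinement worth noting (my addition, not part of the original timing test): getsource above throws the response away, so results is just a list of None values. Returning the body makes pool.map collect something usable:

def getsource(url):
    # return the page body so pool.map collects the actual responses
    return requests.get(url).text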
Hands-on: crawling book information from Dangdang
I have been buying books on Dangdang for a long time; now that we have learned how to crawl information with Python, let's start by crawling Dangdang's book information. When this exercise is done, we will have saved each book's title and detail (price) line to a text file.
Searching for Java on Dangdang yields 89 pages of results; I chose to crawl the first 80. To compare the efficiency of multithreading against a single thread, I timed both here: the single-threaded crawl took 67 s, while the multithreaded one took only 15 s.
How to dissect a web page we already covered in the XPath section: open the page, right-click and choose Inspect, study the page's HTML, look for the pattern, and extract the information. I won't go over that again; the code is fairly short, so here is the source directly.
# coding=utf-8
import requests
import re
import time
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

def changepage(url, total):
    # rewrite the page_index parameter to generate the url of every page
    urls = []
    nowpage = int(re.search(r'page_index=(\d+)', url, re.S).group(1))
    for i in range(nowpage, total + 1):
        link = re.sub(r'page_index=(\d+)', 'page_index=%s' % i, url)
        urls.append(link)
    return urls

def spider(url):
    html = requests.get(url)
    content = html.text
    selector = etree.HTML(content)
    title = selector.xpath('//*[@id="component_0__0__6612"]/li/a/@title')
    detail = selector.xpath('//*[@id="component_0__0__6612"]/li/p[3]/span[1]/text()')
    saveinfo(title, detail)

def saveinfo(title, detail):
    # write title/detail pairs to the file opened in the main block
    length1 = len(title)
    for i in range(0, length1):
        f.write(title[i] + '\n')
        f.write(detail[i] + '\n')

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('info.txt', 'a')
    url = 'http://search.dangdang.com/?key=java&act=input&page_index=1'
    urls = changepage(url, 80)
    time1 = time.time()
    pool.map(spider, urls)
    pool.close()
    pool.join()
    f.close()
    print 'Crawl success!'
    time2 = time.time()
    print 'Multithreading time consuming: ' + str(time2 - time1) + ' s'

    # single-threaded version for comparison:
    # time1 = time.time()
    # for each in urls:
    #     spider(each)
    # time2 = time.time()
    # f.close()
    # print 'Single-threaded time consuming: ' + str(time2 - time1) + ' s'
Clearly, everything used in the code above was introduced in detail in the XPath and parallelization sections, so it should be easy to read. One caveat: an auto-generated id such as component_0__0__6612 is tied to the page layout at the time of writing, so if you run this today you will likely need to re-copy the XPath from the current page.
OK, that brings this Python crawler series to a close. Thank you for reading.