Python crawler: learning to use XPath to crawl Douban Music

Source: Internet
Author: User
Tags: xpath

A crawler can extract data from a page in several ways: regular expressions, lxml (XPath), and BeautifulSoup. I looked up information online to compare the performance and difficulty of the three.

Comparison of the three extraction methods

Method               Performance   Difficulty
Regular expressions  Fast          Difficult
lxml                 Fast          Simple
BeautifulSoup        Slow          Simple

Based on this comparison I chose lxml (XPath), since it is both fast and simple to use. Readers who are interested can also study the other two methods.

Okay, now, let's talk about XPath.

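XPath is a query language for selecting nodes in XML and HTML documents. In Python the lxml library implements it, so you first need to install lxml. As usual, in the IDE go to File --> Settings --> Project Interpreter and add the lxml library; running pip install lxml also works.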

Basic XPath usage

    from lxml import etree

    s = etree.HTML(source_code)   # parse the page source into a tree that XPath can query

    s.xpath(xpath_expression)     # returns a list of matches

Basic syntax:
  1. // (double slash) selects matching nodes anywhere in the document, regardless of depth, and returns them as a list.
  2. / (single slash) steps down one level, selecting child nodes of the current path.
  3. /text() gets the text content under the current path.
  4. /@xxxx extracts the value of attribute xxxx from the tag at the current path.
  5. | (union) combines several paths; for example, //p | //div selects all p tags and all div tags in the document.
  6. . (one dot) selects the current node.
  7. .. (two dots) selects the parent node of the current node.
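
The seven rules above can be tried out on a small inline document before touching a real page. The following is a minimal sketch: the HTML snippet and all variable names are made up for illustration and are not taken from the site crawled later.

```python
from lxml import etree

# A made-up snippet standing in for a real page (hypothetical markup).
SAMPLE_HTML = """
<html><body>
  <div id="content">
    <p class="intro">Hello</p>
    <div><a href="/album/1">First album</a></div>
    <div><a href="/album/2">Second album</a></div>
  </div>
</body></html>
"""

s = etree.HTML(SAMPLE_HTML)

# 1. // matches nodes anywhere in the document and returns a list.
links = s.xpath('//a')

# 2. / steps down exactly one level from the current path.
direct_p = s.xpath('//div[@id="content"]/p')

# 3. /text() returns the text directly under the matched nodes.
link_texts = s.xpath('//a/text()')

# 4. /@href returns the value of the href attribute.
hrefs = s.xpath('//a/@href')

# 5. | combines the results of two paths.
p_and_a = s.xpath('//p | //a')

# 6. . is the current node, so ./@href is relative to the first link.
first_href = links[0].xpath('./@href')

# 7. .. is the parent node of the current node.
parent_tag = links[0].xpath('..')[0].tag

print(len(links), len(direct_p), link_texts, hrefs)
print([el.tag for el in p_and_a], first_href, parent_tag)
```

Every call to xpath() here returns a list, even when only one node matches, which is why the later examples index the result with [0].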

XPath syntax and functions can be learned quickly.

This time we will crawl the Douban Music Top 250.

Open Douban Music: https://music.douban.com/top250

Get a single piece of data

1. Get the music title

In the browser's developer tools, right-click the element and choose Copy ==> Copy XPath.

Here we want the music title. Its XPath is: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a')
    print(title)

Run the code:
The result is empty.
Note that an XPath copied from the browser is only a reference: browsers often insert an extra tbody tag into the DOM that is not in the page source, so we need to remove it by hand.
After deleting /tbody from the middle of the path, the line becomes:
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a')
Run the code again.
Result:
<Element a at 0x53d26c8>

This shows that the title element was found.
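
The tbody pitfall can be reproduced offline. The sketch below uses hypothetical markup for a table written without tbody, as a server might send it; lxml parses the source as given and does not add the tag, so the browser-style path finds nothing.

```python
from lxml import etree

# Hypothetical markup: a table written without <tbody> in the page source.
RAW_HTML = """
<div id="content">
  <table>
    <tr><td>cover</td><td><div><a href="/subject/1/">Album A</a></div></td></tr>
  </table>
</div>
"""

s = etree.HTML(RAW_HTML)

# Path as a browser would report it, with the <tbody> it shows in its DOM view.
with_tbody = s.xpath('//*[@id="content"]/table/tbody/tr/td[2]/div/a')

# Same path with /tbody removed, matching the source that lxml parsed.
without_tbody = s.xpath('//*[@id="content"]/table/tr/td[2]/div/a')

print(with_tbody)           # the tree has no tbody, so nothing matches
print(len(without_tbody))   # the <a> element is found
```

If a page really does contain tbody in its source, the path with /tbody is the correct one, so it is worth checking the raw source rather than the browser's element view.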
Because we want the title text, append /text() to the XPath expression:
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')  # /text() returns the text under the current path

s.xpath returns a list, and here the list has only one element, so append [0]:
New expression:
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]  # /text() returns the text; [0] takes the first match

Run it again to get:
We Sing. We Dance. We Steal Things.
This is exactly the title we want.

2. Get the music rating and number of reviews

As before, right-click and copy the XPath of the rating: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/div/span[2]
Copy the XPath of the review count: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/div/span[3]/text()

Again, remove tbody, append /text() where it is missing, and rerun the code:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    # /text() returns the text under the current path; [0] takes the first match
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
    score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
    numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
    print(title, score, numbers)

Get:

        We Sing. We Dance. We Steal Things.    9.1                 (                        100395人评价                )
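
The output above keeps the page's indentation and line breaks, because /text() returns text nodes exactly as they appear in the source. A minimal sketch of tidying such values follows; clean_text and extract_count are hypothetical helper names, not part of lxml, and the raw strings are copied from the output above.

```python
import re

# Raw values as printed above, including the surrounding whitespace.
raw_title = '        We Sing. We Dance. We Steal Things.    '
raw_number = '(                        100395人评价                )'


def clean_text(value):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return ' '.join(value.split())


def extract_count(value):
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r'\d+', value)
    return int(match.group()) if match else None


print(clean_text(raw_title))      # We Sing. We Dance. We Steal Things.
print(clean_text(raw_number))     # ( 100395人评价 )
print(extract_count(raw_number))  # 100395
```

Keeping the review count as an integer makes it easier to sort or filter records later; whether that is needed depends on what the data is used for.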
3. Get the music link

Copy the XPath of the title: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a

The link is the href attribute of this a tag, and /@xxx extracts an attribute value from the tag at the current path, so append /@href (and, as before, drop /tbody in the code):
//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a/@href

Code:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    # /@href returns the link address of the a tag
    href = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/@href')[0]
    # /text() returns the text under the current path
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
    score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
    numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
    print(href, title, score, numbers)

Run the code to get:

https://music.douban.com/subject/2995812/             We Sing. We Dance. We Steal Things.        9.1                     (                            100395人评价                    )
4. Get the image address

Locate the image and copy its XPath: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[1]/a/img

The image address is the src attribute of the img tag, so append /@src. Run the code:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    # /@href returns the link address of the a tag
    href = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/@href')[0]
    # /text() returns the text under the current path
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
    score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
    numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
    # /@src returns the image address of the img tag
    imgpath = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[1]/a/img/@src')[0]
    print(href, title, score, numbers, imgpath)

As before, the result:

https://music.douban.com/subject/2995812/             We Sing. We Dance. We Steal Things.        9.1                     (                            100395人评价                    )                https://img3.doubanio.com/spic/s2967252.jpg

But this only gets one record. What if we want more than one?

Get more than one piece of data

Look at the second, third, and fourth records and get their XPaths:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    # Only the table index changes from one record to the next
    title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
    title2 = s.xpath('//*[@id="content"]/div/div[1]/div/table[2]/tr/td[2]/div/a/text()')[0]
    title3 = s.xpath('//*[@id="content"]/div/div[1]/div/table[3]/tr/td[2]/div/a/text()')[0]
    title4 = s.xpath('//*[@id="content"]/div/div[1]/div/table[4]/tr/td[2]/div/a/text()')[0]
    print(title, title2, title3, title4)

Get:

        We Sing. We Dance. We Steal Things.        Viva La Vida        华丽的冒险        范特西

Comparing their XPaths, only the table index differs, so removing the index gives a generic XPath that matches every record:
Run the code:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    # table without an index matches every table, so this returns all titles on the page
    titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')
    for title in titles:
        print(title.strip())

Get:

We Sing. We Dance. We Steal Things.
Viva La Vida
华丽的冒险
范特西
後。青春期的詩
是时候
Lenka
Start from Here
旅行的意义
太阳
Once (Soundtrack)
Not Going Anywhere
American Idiot
OK
無與倫比的美麗
亲爱的...我还不知道
城市
O
Wake Me Up When September Ends
叶惠美
七里香
21
My Life Will...
寓言
你在烦恼什么

Other fields such as the link, the rating, and the review count can be obtained the same way. Each page holds 25 records, so the loop below runs 25 times.
The complete code is as follows:

    # coding: utf-8
    from lxml import etree
    import requests

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    hrefs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/@href')
    titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')
    scores = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[2]/text()')
    numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[3]/text()')
    imgs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[1]/a/img/@src')
    for i in range(25):
        print(hrefs[i], titles[i], scores[i], numbers[i], imgs[i])

Get:
The output is long, so I will not show it here. You can copy the code and run it directly; note that the lxml and requests libraries must be installed.

There is still a problem: every XPath is exceptionally long. Can they be shortened?

Refine the XPath path

    hrefs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/@href')
    titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')
    scores = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[2]/text()')
    numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[3]/text()')
    imgs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[1]/a/img/@src')

The XPaths for all of these fields share the prefix //*[@id="content"]/div/div[1]/div/table/tr. We can select that prefix once and then append only the differing part for each row. Written this way, the code also no longer needs to know how many records a page holds; it simply iterates over the rows it finds. The streamlined code:

    url = 'https://music.douban.com/top250'
    html = requests.get(url).text
    s = etree.HTML(html)
    trs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr')  # select the tr nodes first
    for tr in trs:  # iterate over the rows
        href = tr.xpath('./td[2]/div/a/@href')[0]  # note: these paths are relative to tr
        title = tr.xpath('./td[2]/div/a/text()')[0]
        score = tr.xpath('./td[2]/div/div/span[2]/text()')[0]
        number = tr.xpath('./td[2]/div/div/span[3]/text()')[0]
        img = tr.xpath('./td[1]/a/img/@src')[0]
        print(href, title, score, number, img)

The results were the same as before.
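
The leading ./ in the per-row paths matters. The sketch below, which uses hypothetical two-row markup rather than the real page, contrasts a relative path with one that starts with //, which is evaluated from the document root even when called on a row element.

```python
from lxml import etree

# Hypothetical two-record listing, loosely shaped like the tables above.
SAMPLE_HTML = """
<div id="content">
  <table><tr><td><a><img src="a.jpg"/></a></td><td><div><a href="/subject/1/">Album A</a></div></td></tr></table>
  <table><tr><td><a><img src="b.jpg"/></a></td><td><div><a href="/subject/2/">Album B</a></div></td></tr></table>
</div>
"""

s = etree.HTML(SAMPLE_HTML)
trs = s.xpath('//*[@id="content"]/table/tr')

results = []
for tr in trs:
    # ./ starts at this tr, so each row yields only its own title.
    relative = tr.xpath('./td[2]/div/a/text()')
    # // starts at the document root even when called on tr,
    # so every row sees the titles of all rows.
    absolute = tr.xpath('//td[2]/div/a/text()')
    results.append((relative, absolute))
    print(relative, absolute)
```

With the // form, taking [0] would return the first record's title for every row, which is an easy mistake to miss because the code still runs without errors.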

However, this is only one page of data. How do we crawl multiple pages?

Get multiple pages of data

Take a look at the paging URLs:
https://music.douban.com/top250?start=0
https://music.douban.com/top250?start=25
https://music.douban.com/top250?start=50

Notice that only the start parameter changes between pages, increasing by 25 each time, and 250 records make exactly 10 pages.
So we can iterate over the pages.
Code:

    for i in range(10):
        url = 'https://music.douban.com/top250?start={}'.format(i * 25)
        print(url)

Get:

https://music.douban.com/top250?start=0
https://music.douban.com/top250?start=25
https://music.douban.com/top250?start=50
https://music.douban.com/top250?start=75
https://music.douban.com/top250?start=100
https://music.douban.com/top250?start=125
https://music.douban.com/top250?start=150
https://music.douban.com/top250?start=175
https://music.douban.com/top250?start=200
https://music.douban.com/top250?start=225

This is the result I wanted.

Finally, we assemble the code, noting the purpose of each function.

Full code

    # coding: utf-8
    from lxml import etree
    import requests


    # Build the URL of each page and crawl it
    def getUrl():
        for i in range(10):
            url = 'https://music.douban.com/top250?start={}'.format(i * 25)
            scrapyPage(url)


    # Crawl the data on one page
    def scrapyPage(url):
        html = requests.get(url).text
        s = etree.HTML(html)
        trs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
        for tr in trs:
            href = tr.xpath('./td[2]/div/a/@href')[0]
            title = tr.xpath('./td[2]/div/a/text()')[0]
            score = tr.xpath('./td[2]/div/div/span[2]/text()')[0]
            number = tr.xpath('./td[2]/div/div/span[3]/text()')[0]
            img = tr.xpath('./td[1]/a/img/@src')[0]
            print(href, title, score, number, img)


    if __name__ == '__main__':
        getUrl()
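
One fragile spot in the full code is the bare [0]: if any record lacks a field, the empty list raises IndexError and the crawl stops. The following is a sketch of one possible way around that, not the original author's method. The helper names first and parse_page are hypothetical, and the sample markup is invented to match the nesting of the XPath used above, with the second record deliberately missing its rating.

```python
from lxml import etree


def first(nodes, default=''):
    """Return the first XPath result stripped of whitespace, or a default."""
    return nodes[0].strip() if nodes else default


def parse_page(html):
    """Parse one listing page (given as a string) into a list of dicts."""
    s = etree.HTML(html)
    rows = []
    for tr in s.xpath('//*[@id="content"]/div/div[1]/div/table/tr'):
        rows.append({
            'href': first(tr.xpath('./td[2]/div/a/@href')),
            'title': first(tr.xpath('./td[2]/div/a/text()')),
            'score': first(tr.xpath('./td[2]/div/div/span[2]/text()')),
            'img': first(tr.xpath('./td[1]/a/img/@src')),
        })
    return rows


# Hypothetical markup with the same nesting as the XPath above; the second
# record has no rating block, which would raise IndexError with a bare [0].
SAMPLE_HTML = """
<div id="content"><div><div><div>
  <table><tr>
    <td><a><img src="a.jpg"/></a></td>
    <td><div><a href="/subject/1/"> Album A </a><div><span></span><span>9.1</span></div></div></td>
  </tr></table>
  <table><tr>
    <td><a><img src="b.jpg"/></a></td>
    <td><div><a href="/subject/2/"> Album B </a></div></td>
  </tr></table>
</div></div></div></div>
"""

for row in parse_page(SAMPLE_HTML):
    print(row)
```

Because parse_page takes a string, it can be checked against saved HTML without any network access; in the crawler it would be called as parse_page(requests.get(url).text).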
