There are several ways to extract data from a page in a crawler:
regular expressions, lxml (XPath), and BeautifulSoup. I looked up information online to understand the difficulty of use and the performance of the three.
Comparison of the three approaches:
Crawl method | Performance | Difficulty of use
--- | --- | ---
Regular expressions | Fast | Difficult
lxml | Fast | Simple
BeautifulSoup | Slow | Simple
Given this comparison, I chose lxml (XPath): with three options available, it makes sense to pick the one that is both fast and simple. Interested readers can also go and learn the other two approaches.
Okay, now, let's talk about XPath.
Because XPath support is provided by the lxml module, you first need to install the lxml library. The usual way is through the IDE: File --> Settings --> Project Interpreter, then add the lxml library.
XPath simple usage
from lxml import etree
s = etree.HTML(source)  # convert the page source into a form that XPath can match against
s.xpath(xpath_expression)  # returns a list
Basic syntax:
- // Double slash: starts from the root node, scans the whole document, selects all matching content, and returns it as a list.
- / Single slash: selects the next level of tags under the current path, or operates on the content of the tag at the current path.
- /text() Gets the text content under the current path.
- /@xxxx Extracts the value of attribute xxxx from the tag at the current path.
- | Union operator: use | to select several paths at once. For example, //p | //div selects all matching p tags and div tags.
- . A single dot selects the current node.
- .. Two dots select the parent node of the current node.
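The syntax above can be tried out on a small inline document before touching a real page. The following is a minimal sketch; the `SAMPLE` markup and all variable names are made up for illustration and are not taken from the Douban page.

```python
from lxml import etree

# Hypothetical sample markup, used only to exercise the syntax listed above.
SAMPLE = """
<html><body>
  <div id="content">
    <p class="intro">Hello</p>
    <div><a href="/album/1">First</a></div>
    <div><a href="/album/2">Second</a></div>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE)

# // scans the whole document; /text() returns the text nodes as a list.
texts = tree.xpath('//a/text()')

# /@href extracts an attribute value instead of the text.
links = tree.xpath('//a/@href')

# | is a union: the p tag plus the div tags directly under #content.
both = tree.xpath('//p | //div[@id="content"]/div')

# . and .. are evaluated relative to an element we already hold.
first_a = tree.xpath('//a')[0]
parent_tag = first_a.xpath('..')[0].tag
self_text = first_a.xpath('./text()')

print(texts, links, [el.tag for el in both], parent_tag, self_text)
```

Every call returns a list, even when only one node matches, which is why indexing with [0] shows up so often in the rest of this walkthrough.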
We can pick up XPath syntax quickly by working through an example.
This time we will crawl the Douban Music Top 250.
Open Douban Music: https://music.douban.com/top250
Get a single piece of data
1. Get the music title
In the browser, right-click the element and choose Copy ==> Copy XPath from the popup menu.
Here we want the music title. The XPath of the music title is: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a')
print(title)
Run the code:
The result is empty.
Note that an XPath copied from the browser can only be used as a reference: browsers often insert extra tbody tags of their own that are not in the page source, so we need to remove this tag manually.
After deleting the /tbody in the middle, the expression becomes:
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a')
Then we run the code again.
Output:
<Element a at 0x53d26c8>
This shows that the title element was matched.
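The tbody behaviour can be reproduced offline. The sketch below assumes a made-up fragment (`RAW`) whose source, like many real pages, contains no tbody tag; lxml parses the source as written, so the browser-style path finds nothing.

```python
from lxml import etree

# Hypothetical page source: the raw HTML has no <tbody>, although a browser
# would show one in its element inspector.
RAW = '<div id="content"><table><tr><td><a href="/x">Album</a></td></tr></table></div>'

tree = etree.HTML(RAW)

# Path as copied from a browser: matches nothing in the parsed source.
with_tbody = tree.xpath('//*[@id="content"]/table/tbody/tr/td/a')

# Same path with /tbody removed: matches the link.
without_tbody = tree.xpath('//*[@id="content"]/table/tr/td/a')

# Using // between table and tr works whether or not a tbody is present.
either_way = tree.xpath('//*[@id="content"]/table//tr/td/a')

print(with_tbody, without_tbody, either_way)
```

The third form is a small design alternative: it trades a slightly broader match for not having to care whether the source includes tbody.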
Because we want the title text, append /text() to the XPath expression:
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')  # we want the title, i.e. the text under the current path, so use /text()
Because s.xpath returns a list, and here the list holds only one element, append [0].
New expression:
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]  # use /text() to get the text, then [0] to take the first element
Rerun to get the result:
We Sing. We Dance. We Steal Things.
This is exactly the title we want.
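Indexing with [0] raises an IndexError whenever the expression matches nothing, for example when the page layout changes. One way to guard against that is a small helper; `first_text` below is a hypothetical name introduced for this sketch, not part of lxml, and it is meant for expressions that return strings (text() or @attribute).

```python
from lxml import etree


def first_text(node, expr, default=None):
    """Return the first string result of expr, stripped, or default if nothing matches."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


# Hypothetical fragment used only to show both outcomes.
tree = etree.HTML('<div><a href="/x">  Some Title  </a></div>')

found = first_text(tree, '//a/text()')
missing = first_text(tree, '//span/text()', default='n/a')
print(found, missing)
```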
2. Get the music rating and number of reviews
As before, first right-click and copy the XPath of the rating: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/div/span[2]
Copy the XPath of the number of reviews and append /text(): //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/div/span[3]/text()
Again, we remove the tbody and rerun the code:
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
# each field is the text under its path, so every expression ends with /text()
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
print(title, score, numbers)
Output:
We Sing. We Dance. We Steal Things. 9.1 ( 100395人评价 )
3. Get the music link
Copy the XPath of the title: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a
To get the music link, we need the href attribute of this tag; /@xxx fetches an attribute value from the tag at the current path:
//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div/a/@href
Code (with the tbody removed, as before):
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
href = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/@href')[0]  # the link is an attribute value, so use /@href
# the remaining fields are text under their paths, so use /text()
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
print(href, title, score, numbers)
Run the code to get:
https://music.douban.com/subject/2995812/ We Sing. We Dance. We Steal Things. 9.1 ( 100395人评价 )
4. Get the image address
Locate the image and copy its XPath: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[1]/a/img
The image address is the src attribute of this tag, so the code appends /@src (again with the tbody removed):
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
href = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/@href')[0]  # attribute value, so use /@href
# text under the current path, so use /text()
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
score = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[2]/text()')[0]
numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/div/span[3]/text()')[0]
imgpath = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[1]/a/img/@src')[0]  # attribute value, so use /@src
print(href, title, score, numbers, imgpath)
As usual, run it to get the result:
https://music.douban.com/subject/2995812/ We Sing. We Dance. We Steal Things. 9.1 ( 100395人评价 ) https://img3.doubanio.com/spic/s2967252.jpg
But this gets only a single record. What if we want more than one?
Get multiple pieces of data
Let's look at the second, third, and fourth records.
Get their XPaths:
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
# only the table index differs between the four expressions
title = s.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div/a/text()')[0]
title2 = s.xpath('//*[@id="content"]/div/div[1]/div/table[2]/tr/td[2]/div/a/text()')[0]
title3 = s.xpath('//*[@id="content"]/div/div[1]/div/table[3]/tr/td[2]/div/a/text()')[0]
title4 = s.xpath('//*[@id="content"]/div/div[1]/div/table[4]/tr/td[2]/div/a/text()')[0]
print(title, title2, title3, title4)
Output:
We Sing. We Dance. We Steal Things. Viva La Vida 华丽的冒险 范特西
Comparing their XPaths, we find that only the table index differs, so removing the index gives a general XPath: //*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a
Run the code:
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')  # no table index, so every title on the page matches
for title in titles:
    print(title.strip())
Output:
We Sing. We Dance. We Steal Things.
Viva La Vida
华丽的冒险
范特西
後。青春期的詩
是时候
Lenka
Start from Here
旅行的意义
太阳
Once (Soundtrack)
Not Going Anywhere
American Idiot
OK
無與倫比的美麗
亲爱的...我还不知道
城市
O
Wake Me Up When September Ends
叶惠美
七里香
21
My Life Will...
寓言
你在烦恼什么
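The effect of dropping the index can be checked on a toy document. This is a minimal sketch with invented markup (`PAGE`) that only imitates the one-table-per-album layout described above.

```python
from lxml import etree

# Hypothetical markup: one <table> per album, as on the page being discussed.
PAGE = """
<div id="content">
  <table><tr><td><a>Album A</a></td></tr></table>
  <table><tr><td><a>Album B</a></td></tr></table>
  <table><tr><td><a>Album C</a></td></tr></table>
</div>
"""

tree = etree.HTML(PAGE)

# table[2] restricts the match to the second table only.
only_second = tree.xpath('//*[@id="content"]/table[2]/tr/td/a/text()')

# Without an index, the step matches every table at that level.
all_titles = tree.xpath('//*[@id="content"]/table/tr/td/a/text()')

print(only_second)
print(all_titles)
```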
Other information, such as the link address, the rating, and the number of reviews, can be obtained in the same way. Now we can also fetch multiple records; since each page holds 25 records, we loop 25 times.
The complete code is as follows:
# coding:utf-8
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
hrefs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/@href')
titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')
scores = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[2]/text()')
numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[3]/text()')
imgs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[1]/a/img/@src')
for i in range(25):  # 25 records per page
    print(hrefs[i], titles[i], scores[i], numbers[i], imgs[i])
Output:
There is a lot of data, so I will not show it here. If you are interested, you can copy the code and run it directly; note that you need the lxml and requests libraries installed.
We also notice a problem: every XPath is exceptionally long. Can it be shortened?
5. Refine the XPath path
hrefs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/@href')
titles = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/a/text()')
scores = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[2]/text()')
numbers = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div/div/span[3]/text()')
imgs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[1]/a/img/@src')
Observe that the XPaths for all of these fields share the same prefix: //*[@id="content"]/div/div[1]/div/table/tr
I can factor this prefix out and let each field append its own distinct suffix. Written this way, there is also no need to hard-code how many records each page holds; we simply loop over whatever rows are found. So the code is streamlined a little:
from lxml import etree
import requests

url = 'https://music.douban.com/top250'
html = requests.get(url).text
s = etree.HTML(html)
trs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr')  # first extract the set of tr nodes
for tr in trs:  # iterate over the tr nodes
    href = tr.xpath('./td[2]/div/a/@href')[0]  # note: these paths are relative to tr
    title = tr.xpath('./td[2]/div/a/text()')[0]
    score = tr.xpath('./td[2]/div/div/span[2]/text()')[0]
    number = tr.xpath('./td[2]/div/div/span[3]/text()')[0]
    img = tr.xpath('./td[1]/a/img/@src')[0]
    print(href, title, score, number, img)
The results were the same as before.
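Iterating row by row has a second benefit beyond shorter paths: fields stay attached to their own row. The sketch below uses invented markup (`PAGE`) in which one album has no rating, a situation assumed here purely for illustration, to show how parallel lists drift out of step while per-row queries do not.

```python
from lxml import etree

# Hypothetical rows: the second album deliberately has no rating span.
PAGE = """
<table><tr><td><a>Album A</a><span class="rating">9.1</span></td></tr></table>
<table><tr><td><a>Album B</a></td></tr></table>
<table><tr><td><a>Album C</a><span class="rating">8.7</span></td></tr></table>
"""

tree = etree.HTML(PAGE)

# Parallel lists: three titles but only two scores, so titles[1] and
# scores[1] no longer describe the same album.
titles = tree.xpath('//table/tr/td/a/text()')
scores = tree.xpath('//table/tr/td/span/text()')

# Per-row queries: each relative path is evaluated inside one tr,
# so a missing field only affects that row.
rows = []
for tr in tree.xpath('//table/tr'):
    title = tr.xpath('./td/a/text()')
    score = tr.xpath('./td/span/text()')
    rows.append((str(title[0]), str(score[0]) if score else None))

print(len(titles), len(scores))
print(rows)
```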
However, this is just one page of data. What if I want to crawl multiple pages?
Get multiple pages of data
Take a look at the pagination URLs:
https://music.douban.com/top250?start=0
https://music.douban.com/top250?start=25
https://music.douban.com/top250?start=50
Notice that from page to page only the start parameter changes, growing by 25 each time, and 250 records is exactly 10 pages.
So I can iterate over the pages.
Code:
for i in range(10):
    url = 'https://music.douban.com/top250?start={}'.format(i * 25)
    print(url)
Output:
https://music.douban.com/top250?start=0
https://music.douban.com/top250?start=25
https://music.douban.com/top250?start=50
https://music.douban.com/top250?start=75
https://music.douban.com/top250?start=100
https://music.douban.com/top250?start=125
https://music.douban.com/top250?start=150
https://music.douban.com/top250?start=175
https://music.douban.com/top250?start=200
https://music.douban.com/top250?start=225
This is exactly the result I want.
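The same arithmetic can be written so that the page count is derived rather than typed in. This is a small sketch; `page_urls` and its parameter names are hypothetical, and the defaults simply mirror the numbers stated above (250 records, 25 per page).

```python
def page_urls(total=250, per_page=25, base='https://music.douban.com/top250'):
    """Yield one listing URL per page, stepping the start parameter by per_page."""
    for start in range(0, total, per_page):
        yield '{}?start={}'.format(base, start)


urls = list(page_urls())
for url in urls:
    print(url)
```

Stepping `range` by the page size means the number of pages follows from the total, so a change to either value needs only one edit.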
Finally, let's assemble the code; note the purpose of each function.
Full code
# coding:utf-8
from lxml import etree
import requests


# build the URL of each page and crawl it
def getUrl():
    for i in range(10):
        url = 'https://music.douban.com/top250?start={}'.format(i * 25)
        scrapyPage(url)


# crawl the data on one page
def scrapyPage(url):
    html = requests.get(url).text
    s = etree.HTML(html)
    trs = s.xpath('//*[@id="content"]/div/div[1]/div/table/tr')
    for tr in trs:
        href = tr.xpath('./td[2]/div/a/@href')[0]
        title = tr.xpath('./td[2]/div/a/text()')[0]
        score = tr.xpath('./td[2]/div/div/span[2]/text()')[0]
        number = tr.xpath('./td[2]/div/div/span[3]/text()')[0]
        img = tr.xpath('./td[1]/a/img/@src')[0]
        print(href, title, score, number, img)


if __name__ == '__main__':
    getUrl()
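One possible variation, sketched under stated assumptions, separates downloading from parsing so that the XPath logic can be checked against a saved or hand-written page without any network access. The names `parse_page` and `fetch_page`, the User-Agent header, and the timeout value are all assumptions introduced here, not part of the original code; whether the site requires a particular header is not something the text above states.

```python
from lxml import etree
import requests

# Shared row prefix, taken from the XPath used throughout this article.
ROW_XPATH = '//*[@id="content"]/div/div[1]/div/table/tr'


def parse_page(html):
    """Parse one listing page into a list of dicts, one per album row."""
    def first(node, expr):
        # Return the first stripped result, or None if the path matches nothing.
        found = node.xpath(expr)
        return found[0].strip() if found else None

    albums = []
    for tr in etree.HTML(html).xpath(ROW_XPATH):
        albums.append({
            'href': first(tr, './td[2]/div/a/@href'),
            'title': first(tr, './td[2]/div/a/text()'),
            'score': first(tr, './td[2]/div/div/span[2]/text()'),
            'number': first(tr, './td[2]/div/div/span[3]/text()'),
            'img': first(tr, './td[1]/a/img/@src'),
        })
    return albums


def fetch_page(url):
    """Download one page. The header and timeout are assumed values, not requirements."""
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return response.text
```

A caller would combine the two, for example `parse_page(fetch_page(url))` for each page URL, and decide separately whether to print the records or store them.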
Python crawler: learning to use XPath to crawl Douban Music