Web Scraping in R: CSS Selectors vs. XPath (with Code)


CSS selectors and XPath expressions both locate nodes in the DOM tree. They differ only in the syntax used to express the location:
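  • Extracting nodes with CSS selectors
library("rvest")
library("stringr")  # provides str_replace_all()

single_table_page <- read_html("single-table.html")

# Extract every table on the page (returns a list of data frames)
html_table(single_table_page)
# Extract only the first table
html_table(html_node(single_table_page, "table"))

products_page <- read_html("./case/products.html")

# Text of every .name node inside the product list
products_page %>% html_nodes(".product-list li .name") %>% html_text()

# Build a data frame of names and prices from the list items
product_items <- products_page %>% html_nodes(".product-list li")
data.frame(
  name = product_items %>% html_nodes(".name") %>% html_text(),
  # Strip the dollar sign, then convert the price to a number
  price = product_items %>% html_nodes(".price") %>% html_text() %>%
    str_replace_all(pattern = "\\$", replacement = "") %>%
    as.numeric(),
  stringsAsFactors = FALSE)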
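Since products.html is not included here, the data-frame step can be tried against an inline HTML string. The following is only a sketch: the markup, product names, and prices are hypothetical stand-ins for the real file. It also swaps html_nodes() for html_node() inside each list item, which returns exactly one result per item (or a missing value), so the name and price columns stay aligned even when an item lacks one of the fields.

```r
library(rvest)

# Hypothetical markup standing in for ./case/products.html
html <- '
<div class="product-list">
  <ul>
    <li><span class="name">Product-A</span><span class="price">$199.95</span></li>
    <li><span class="name">Product-B</span><span class="price">$129.95</span></li>
  </ul>
</div>'

products_page <- read_html(html)
product_items <- products_page %>% html_nodes(".product-list li")

# html_node() (singular) yields one match per list item
price_text <- product_items %>% html_node(".price") %>% html_text()

products <- data.frame(
  name = product_items %>% html_node(".name") %>% html_text(),
  # fixed = TRUE treats "$" as a literal character, not a regex anchor
  price = as.numeric(sub("$", "", price_text, fixed = TRUE)),
  stringsAsFactors = FALSE
)

print(products)
```

Using base R's sub() with fixed = TRUE avoids the regex escaping needed in the original and removes the dependency on stringr for this one step.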
  • Extracting nodes with XPath
page <- read_html("./case/new-products.html")

# Find all p nodes
# XPath
page %>% html_nodes(xpath = "//p")
# CSS
page %>% html_nodes("p")

# Find all li elements that have a class attribute
# XPath
page %>% html_nodes(xpath = "//li[@class]")
# CSS
page %>% html_nodes("li[class]")

# Find all li elements under the div with id = 'list'
# XPath
page %>% html_nodes(xpath = "//div[@id='list']/ul/li")
# CSS
page %>% html_nodes("div#list > ul > li")

# Find all div nodes that contain a p child node
page %>% html_nodes(xpath = "//div[p]")

# Find all span nodes whose class is "info-value" and whose text is "Good"
page %>% html_nodes(xpath = "//span[@class='info-value' and text()='Good']")
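To check that the paired XPath and CSS queries above really select the same nodes, they can be run against a small inline page. This is a sketch under stated assumptions: the markup below is hypothetical and only mimics the structure the queries expect, since new-products.html is not shown.

```r
library(rvest)

# Hypothetical markup standing in for ./case/new-products.html
html <- '
<div id="list">
  <ul>
    <li class="selected">Item 1</li>
    <li>Item 2</li>
  </ul>
</div>
<div><p>Status: <span class="info-value">Good</span></p></div>
<div><span class="info-value">Bad</span></div>'

page <- read_html(html)

# The same li nodes, selected both ways
li_xpath <- page %>% html_nodes(xpath = "//div[@id='list']/ul/li") %>% html_text()
li_css   <- page %>% html_nodes("div#list > ul > li") %>% html_text()
same_li  <- identical(li_xpath, li_css)

# li elements carrying a class attribute, selected both ways
n_class_xpath <- length(html_nodes(page, xpath = "//li[@class]"))
n_class_css   <- length(html_nodes(page, "li[class]"))

# Conditions on child nodes and on text content, written as XPath predicates
divs_with_p <- page %>% html_nodes(xpath = "//div[p]")
good_spans  <- page %>%
  html_nodes(xpath = "//span[@class='info-value' and text()='Good']") %>%
  html_text()

print(same_li)
print(good_spans)
```

The last two queries are shown only in XPath form, as in the original: predicates such as [p] and text()='Good' filter a node by its children or its text, which is where XPath tends to be the more convenient notation.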
