Both CSS selectors and XPath expressions locate nodes in the DOM tree, but the two differ in how they express a node's position:
- Extracting nodes with CSS selectors
    library("rvest")
    library("stringr")  # provides str_replace_all()

    single_table_page <- read_html("single-table.html")
    # Extract all tables on the page
    html_table(single_table_page)
    # Extract only the first table
    html_table(html_node(single_table_page, "table"))

    products_page <- read_html("./case/products.html")
    # Text of every .name node inside the product list
    products_page %>% html_nodes(".product-list li .name") %>% html_text()

    # Select each list item once, then pull the name and price out of it
    product_items <- products_page %>% html_nodes(".product-list li")
    data.frame(
      name = product_items %>% html_nodes(".name") %>% html_text(),
      price = product_items %>% html_nodes(".price") %>% html_text() %>%
        str_replace_all(pattern = "\\$", replacement = "") %>% as.numeric(),
      stringsAsFactors = FALSE
    )
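The data frame built above assumes every list item has both a `.name` and a `.price` node; otherwise the two columns end up with different lengths. A minimal sketch of one way around that is shown here, assuming a made-up inline page in place of `./case/products.html` (the markup and product names are hypothetical). It uses `html_node()` (singular) on the item set, which returns one result per item and yields `NA` where the node is missing.

```r
library(rvest)

# Hypothetical stand-in for ./case/products.html; the markup is an assumption.
html <- '
<div class="product-list">
  <ul>
    <li><span class="name">Product-A</span> <span class="price">$199.95</span></li>
    <li><span class="name">Product-B</span> <span class="price">$129.95</span></li>
    <li><span class="name">Product-C</span></li>
  </ul>
</div>'

products_page <- read_html(html)
product_items <- products_page %>% html_nodes(".product-list li")

# html_node() returns exactly one result per list item, so the third item,
# which has no price, contributes NA instead of shortening the column.
price_text <- product_items %>% html_node(".price") %>% html_text()

products <- data.frame(
  name = product_items %>% html_node(".name") %>% html_text(),
  # Strip the literal dollar sign before converting to numeric
  price = as.numeric(sub("$", "", price_text, fixed = TRUE)),
  stringsAsFactors = FALSE
)
print(products)
```

Whether this is preferable depends on the page: if every item is known to be complete, the `html_nodes()` version above gives the same result.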
- Extracting nodes with XPath
    page <- read_html("./case/new-products.html")

    # Find all p nodes
    # XPath's way
    page %>% html_nodes(xpath = "//p")
    # CSS's way
    page %>% html_nodes("p")

    # Find all li tags that have a class attribute
    # XPath's way
    page %>% html_nodes(xpath = "//li[@class]")
    # CSS's way
    page %>% html_nodes("li[class]")

    # Find all li tags under the div tag with id='list'
    # XPath's way
    page %>% html_nodes(xpath = "//div[@id='list']/ul/li")
    # CSS's way
    page %>% html_nodes("div#list > ul > li")

    # Find all div nodes that contain a p child node
    page %>% html_nodes(xpath = "//div[p]")

    # Find all span nodes whose class is "info-value" and whose text is "Good"
    page %>% html_nodes(xpath = "//span[@class='info-value' and text()='Good']")
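To check that the paired CSS and XPath queries above really select the same nodes, the comparison can be run against a small inline page. This is only a sketch: the HTML below is a hypothetical stand-in for `./case/new-products.html`, whose actual markup is not shown in this article.

```r
library(rvest)

# Hypothetical stand-in for ./case/new-products.html; the markup is an assumption.
html <- '
<div id="list">
  <ul>
    <li class="selected">A</li>
    <li>B</li>
    <li class="other">C</li>
  </ul>
</div>
<div><p>Quality: <span class="info-value">Good</span></p></div>'

page <- read_html(html)

# li tags under the div with id "list": XPath form and CSS form
xpath_li <- page %>% html_nodes(xpath = "//div[@id='list']/ul/li") %>% html_text()
css_li   <- page %>% html_nodes("div#list > ul > li") %>% html_text()
print(identical(xpath_li, css_li))

# li tags that carry a class attribute: XPath form and CSS form
xpath_cls <- page %>% html_nodes(xpath = "//li[@class]") %>% html_text()
css_cls   <- page %>% html_nodes("li[class]") %>% html_text()
print(identical(xpath_cls, css_cls))

# Queries for which the article gives only an XPath form:
# a div selected by its child, and a span selected by its text content
divs_with_p <- page %>% html_nodes(xpath = "//div[p]")
good_spans  <- page %>%
  html_nodes(xpath = "//span[@class='info-value' and text()='Good']") %>%
  html_text()
print(length(divs_with_p))
print(good_spans)
```

The last two queries filter on a node's children or text, which is where XPath predicates tend to be the more direct choice.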
R language Crawler: CSS methods vs. XPath methods (Code implementation)