CSS selectors and XPath are both ways of locating tags in the DOM tree; they differ only in the notation used to express the location:
library("rvest")
library("stringr")  # provides str_replace_all(), used below

single_table_page <- read_html("single-table.html")
# Extract every table on the page
html_table(single_table_page)
# Extract only the first table node
html_table(html_node(single_table_page, "table"))

products_page <- read_html("./case/products.html")
# Product names, located with a CSS selector
products_page %>% html_nodes(".product-list li .name") %>% html_text()

# Collect the name and price of each product into a data frame
product_items <- products_page %>% html_nodes(".product-list li")
data.frame(
  name = product_items %>% html_nodes(".name") %>% html_text(),
  price = product_items %>% html_nodes(".price") %>% html_text() %>%
    str_replace_all(pattern = "\\$", replacement = "") %>%
    as.numeric(),
  stringsAsFactors = FALSE
)
page <- read_html("./case/new-products.html")

# Find all p nodes
# XPath's way
page %>% html_nodes(xpath = "//p")
# CSS's way
page %>% html_nodes("p")

# Find all li tags that have a class attribute
# XPath's way
page %>% html_nodes(xpath = "//li[@class]")
# CSS's way
page %>% html_nodes("li[class]")

# Find all li tags under the div tag with id='list'
# XPath's way
page %>% html_nodes(xpath = "//div[@id='list']/ul/li")
# CSS's way
page %>% html_nodes("div#list > ul > li")

# Find all div nodes that have a p node as a child
page %>% html_nodes(xpath = "//div[p]")

# Find all span nodes whose class is "info-value" and whose text is "Good"
page %>% html_nodes(xpath = "//span[@class='info-value' and text()='Good']")
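The queries above depend on local files under ./case/, so here is a minimal sketch that runs the same pairs of selectors on an inline HTML string and checks that both notations return the same nodes. The markup, the item names, and the variable names are made up for illustration; they are assumptions, not the contents of new-products.html.

```r
library(rvest)

# Hypothetical inline page, used only so the example runs without external files
html <- '
<html><body>
  <div id="list">
    <ul>
      <li class="a">Apple</li>
      <li>Banana</li>
      <li class="c">Cherry</li>
    </ul>
  </div>
  <div><p>Note</p><span class="info-value">Good</span></div>
  <div><span class="info-value">Bad</span></div>
</body></html>'
page <- read_html(html)

# Same target, two notations: li tags that have a class attribute
css_result   <- page %>% html_nodes("li[class]") %>% html_text()
xpath_result <- page %>% html_nodes(xpath = "//li[@class]") %>% html_text()
identical(css_result, xpath_result)

# Same target, two notations: li tags directly under the ul of div#list
css_items   <- page %>% html_nodes("div#list > ul > li") %>% html_text()
xpath_items <- page %>% html_nodes(xpath = "//div[@id='list']/ul/li") %>% html_text()
identical(css_items, xpath_items)

# Conditions on child nodes or on text content, written in XPath as in the post
divs_with_p <- page %>% html_nodes(xpath = "//div[p]")
good_spans  <- page %>%
  html_nodes(xpath = "//span[@class='info-value' and text()='Good']") %>%
  html_text()
```

Comparing the two results with identical() is a quick way to confirm that a CSS selector and an XPath expression really are equivalent on a given page before swapping one for the other.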
R Web Scraping: CSS Selectors vs. XPath (Code Implementation)