R Web Data Scraping 1

For collecting and analyzing data.

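[Note] Everything shared here is what I have learned from several specialist books. It may include a few tricks, insights, and small lessons from my own use, but it falls far short of the detail and quality of the books themselves. While studying, I found that publicly searchable resources on web scraping with R are far scarcer than for other topics, so I am sharing these notes for those who come after me. For the same reason I make no claim of originality for this material. Exchange, criticism, and corrections are welcome so that we can improve together. EMAIL:[email protected]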

1.WHY R?

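Even non-specialists have probably heard that R currently performs well below other tools for web scraping. R is not a dedicated scraping tool, and simple applications such as 八爪魚 (Octoparse) fully meet the needs of us "occasional users". So why scrape with R? I believe everyone who searches for R scraping techniques has their own answer.

A few advantages worth noting: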

#1.R is a software environment with a primarily statistical focus.

#2.It can produce impressive visualizations.

#3.It may offer a complete set of operational procedures, from collecting data to analyzing it.

2.About basics.

We need to prepare ourselves with some basic knowledge of HTML, XML, and the logic of regular expressions and XPath, but all of the operations are executed from WITHIN R!
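
As a minimal sketch of how these building blocks fit together inside R, the example below parses a small HTML string, queries it with XPath, and cleans the result with a regular expression. The HTML fragment and the variable names are made up for illustration; the functions used (`htmlParse()`, `xpathSApply()`, `xmlValue()` from XML and `str_extract()` from stringr) are the real ones.

```r
library(XML)      # HTML/XML parsing and XPath queries
library(stringr)  # regular-expression helpers

# A made-up HTML fragment standing in for a downloaded page
page <- '<html><body>
  <h1>Box office</h1>
  <p class="gross">Film A: 1200 tickets</p>
  <p class="gross">Film B: 850 tickets</p>
</body></html>'

# Parse the text into a document tree
# (asText = TRUE: the input is a string, not a file name or URL)
doc <- htmlParse(page, asText = TRUE)

# XPath: take the text of every <p> whose class attribute is "gross"
lines <- xpathSApply(doc, "//p[@class='gross']", xmlValue)

# Regular expression: pull the first run of digits out of each line
tickets <- as.numeric(str_extract(lines, "[0-9]+"))

print(lines)
print(tickets)
```

XPath does the structural work (which nodes), and the regular expression does the textual work (which characters inside a node), which is roughly why both are worth knowing.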

3.RECOMMENDATION

http://www.r-datacollection.com

4.A little case study.

#Scrape movie box-office information
library(stringr)
library(XML)
library(maps)
#htmlParse() interprets (parses) the HTML
#create an object
movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004",
                          encoding = "UTF-8")
#the next step: extract tables/data
#readHTMLTable() identifies and reads out the tables
tables <- readHTMLTable(movie_parsed, stringsAsFactors = FALSE)
is.matrix(tables)
is.character(tables)
is.data.frame(tables)
is.list(tables)
#so we get a "list"
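
Since `readHTMLTable()` returns a list when it is given a whole document, the next step is usually to look inside that list and pick out the table of interest. The sketch below does this on a made-up two-table page (not the real box-office site), so it runs without a network connection; the table contents and ids are invented for illustration.

```r
library(XML)

# A made-up page with two tables, standing in for the box-office page
page <- '<html><body>
  <table id="cinemas">
    <tr><th>Cinema</th><th>City</th></tr>
    <tr><td>Cinema X</td><td>City Y</td></tr>
  </table>
  <table id="box">
    <tr><th>Film</th><th>Tickets</th></tr>
    <tr><td>Film A</td><td>1200</td></tr>
    <tr><td>Film B</td><td>850</td></tr>
  </table>
</body></html>'

doc    <- htmlParse(page, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# One list element per <table> found in the document
length(tables)
sapply(tables, class)

# Pick one element of the list to get an ordinary data frame
box <- tables[[2]]
str(box)

# Alternatively, ask for a single table directly with the `which` argument
box_direct <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)
```

Cells arrive as character strings, so numeric columns still need an explicit `as.numeric()` before any analysis.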

 

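Because R's support for Chinese is not very good, running into some garbled Chinese text is normal, so we need more advanced text manipulation tools. (In this example, some columns are lost entirely because the site serves the data in those columns as .png images.)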
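
One common cause of garbled Chinese is text that arrives in one encoding and is read as another. The sketch below simulates that with base R's `iconv()`. GBK as the source encoding is an assumption made for illustration (the original does not say which encoding causes trouble), and the example also assumes the platform's iconv supports GBK.

```r
# "票房" (box office), written with Unicode escapes so the script itself
# does not depend on the encoding of the source file
original <- "\u7968\u623f"

# Simulate a page served as GBK: the same text as raw GBK bytes
gbk_bytes <- iconv(original, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

# Turning the bytes into a string without declaring their encoding
# gives text that displays as garbage in a UTF-8 session
raw_text <- rawToChar(gbk_bytes)

# Fix: convert from the true source encoding to UTF-8
fixed <- iconv(raw_text, from = "GBK", to = "UTF-8")

identical(fixed, original)
```

When the page's encoding is known in advance, passing it to `htmlParse()` through the `encoding` argument, as the case study does with "UTF-8", avoids the problem at the parsing stage.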

5.ABC's of...

For browsing the Web, there is a hidden standard behind the scenes that structures how information is displayed.

#HTML, the hypertext markup language

HTML is not a dedicated data storage format, but it usually contains the information we are interested in. In general, HTML is used to shape the display of information.

#XML, the extensible markup language

The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. These user-defined tags make XML much more flexible than HTML for storing data. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
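
To make the contrast concrete, here is a small sketch that stores the same two records once as XML and once as HTML, then reads both with the XML package. The documents and their tag names (`films`, `film`, `title`, `tickets`) are hypothetical and exist only for this illustration.

```r
library(XML)

# Hypothetical XML: the tag names are user-defined and describe the data
xml_text <- '<films>
  <film id="1"><title>Film A</title><tickets>1200</tickets></film>
  <film id="2"><title>Film B</title><tickets>850</tickets></film>
</films>'

# Hypothetical HTML carrying the same facts, marked up for display
html_text <- '<html><body><ul>
  <li><b>Film A</b>: 1200</li>
  <li><b>Film B</b>: 850</li>
</ul></body></html>'

xml_doc  <- xmlParse(xml_text, asText = TRUE)
html_doc <- htmlParse(html_text, asText = TRUE)

# XML: the tags name the fields, so each one can be addressed directly
films <- data.frame(
  title   = xpathSApply(xml_doc, "//film/title", xmlValue),
  tickets = as.numeric(xpathSApply(xml_doc, "//film/tickets", xmlValue)),
  stringsAsFactors = FALSE
)

# HTML: the tags describe layout, so the data comes back as display text
items <- xpathSApply(html_doc, "//li", xmlValue)

# Both documents are trees with a single root element
xmlName(xmlRoot(xml_doc))
xmlName(xmlRoot(html_doc))

print(films)
print(items)
```

The XML version yields clean fields straight away, while the HTML version still needs text processing to separate titles from numbers.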

(unfinished......)

 
