This blog post is, in essence, an R crawler: it scrapes the photos that answerers uploaded to the Zhihu collection "1001 Girls", more than 800 pictures in total.
Without further ado, here are the code and the final crawl results. Actually, wait: pictures first, then the code.
Which excellent Zhihu questions do these images come from? (100 questions in total)
The girls' pictures (I only downloaded 100 of them; there are more than 800 in total). If any of these invade someone's privacy, contact me and I will delete them immediately.
Code:
#--------
# 2015-10-04  1001 Girls
# lee
# http://www.zhihu.com/collection/26348030?page=1
#--------
library(magrittr)
library(proto)
library(gsubfn)
library(bitops)
library(rvest)
library(stringr)
library(DBI)
library(RSQLite)
library(RCurl)
library(curl)
library(sqldf)

get.jpgurl <- function(pgnum) {
  url <- paste0('http://www.zhihu.com/collection/26348030?page=', pgnum)
  # url <- 'http://www.zhihu.com/collection/26348030?page=5'
  text <- url %>% html_session() %>% html_nodes('textarea.content') %>% html_text()
  # The answer to the question below is excellent: it extracts URLs straight from text.
  # Not every URL ends in .jpg, so match URLs with a regex and filter to .jpg afterwards.
  # http://stackoverflow.com/questions/26496538/extract-urls-with-regex-into-a-new-data-frame-column
  url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
  # An answer's author may upload more than one photo, so extract every URL
  # ending in .jpg from each answer's text.
  combineurl <- data.frame()
  for (i in 1:length(text)) {
    urlpick <- sapply(text[i], function(txt) str_extract_all(txt, url_pattern, simplify = T))
    if (!is.matrix(urlpick)) next;
    # The line below kept throwing an error as if a data frame had no column
    # name and I was forcing one onto it:
    #   "'names' attribute [1] must be the same length as the vector [0]"
    # Googling turned up no good fix, so I skipped those cases at first.
    # Eventually I understood: some answers contain no pictures, so their
    # text has no URL and the extraction yields an empty matrix; renaming an
    # empty matrix is what fails. Hence the is.matrix() check above.
    colnames(urlpick) <- c('V1')
    urlpick <- as.data.frame(urlpick, stringsAsFactors = F)
    # Filter again: keep only URLs ending in jpg.
    urlpick <- sqldf("select * from urlpick where V1 like '%jpg'")
    combineurl <- rbind(combineurl, urlpick)
  }
  return(combineurl)
}

# Grab every URL ending in .jpg. Only 5 pages in total, so no multithreading.
url.total <- data.frame()
for (i in 1:5) {
  url.total <- rbind(url.total, get.jpgurl(i))
}

# Download all jpg files, numbered from 1. After downloading I found every
# picture appeared twice. Investigating:
# https://pic4.zhimg.com/d69a132cc27914af57c1b9eeb1a938c7_b.jpg
# https://pic4.zhimg.com/d69a132cc27914af57c1b9eeb1a938c7_r.jpg
# These two URLs point to the same picture; they differ only in the suffix
# and sit next to each other. Counting how many end in b.jpg and how many in
# r.jpg gives 811 and 750, so keep the b.jpg ones.
# 811 girls -- fewer than 1001.
tt <- url.total
urltt <- sqldf("select * from tt where V1 like '%b.jpg'")
for (i in 1:100) {
  # To crawl it yourself, change the path to your own; I replaced mine with ***.
  tmp <- paste0('/Users/***/R/1001 girls/', i, '.jpg')
  download.file(urltt[i, 1], destfile = tmp)
  print(paste0('Picture ', i, ' downloaded'))
}
# Clicked into a few answers and compared with the downloaded pictures:
# the order is right and everything was captured.

# Which excellent Zhihu questions are these pictures from?
get.ques <- function(pgnum) {
  url <- paste0('http://www.zhihu.com/collection/26348030?page=', pgnum)
  # url <- 'http://www.zhihu.com/collection/26348030?page=1'
  title <- url %>% html_session() %>%
    html_nodes('div.zm-item h2.zm-item-title a') %>%
    html_text() %>% as.data.frame(stringsAsFactors = F)
  return(title)
}
title <- data.frame()
for (i in 1:5) {
  title <- rbind(title, get.ques(i))
}
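To make the two filtering steps easier to follow in isolation, here is a minimal, self-contained sketch. The sample text and image URLs are made up for illustration, and base R's grepl stands in for the sqldf filter used above:

# Sketch: extract every URL from a blob of answer text, then keep only the
# full-size "_b.jpg" variant so each picture is downloaded exactly once.
library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Made-up sample text in the shape of a Zhihu answer.
sample_text <- paste(
  "some answer text https://pic4.zhimg.com/abc123_b.jpg and its twin",
  "https://pic4.zhimg.com/abc123_r.jpg plus a non-image link",
  "http://www.zhihu.com/question/12345"
)

urls <- str_extract_all(sample_text, url_pattern)[[1]]
jpg_urls <- urls[grepl("_b\\.jpg$", urls)]  # base-R stand-in for the sqldf filter
print(jpg_urls)  # "https://pic4.zhimg.com/abc123_b.jpg"

Filtering on the "_b.jpg" suffix rather than deduplicating by file hash keeps the logic simple, at the cost of trusting Zhihu's naming convention.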
Crawl process:
To summarize: the "1001 Girls" collection, crawled with a short R script.