This blog post is, in essence, an R crawler: it scrapes the photos that answerers uploaded to the Zhihu collection "1001 Girls", more than 800 pictures in total.
Without further ado, here are the code and the final crawl results. Actually, wait: pictures first, then the code.
Which excellent Zhihu questions do these images come from? (100 questions in total)
The girls' pictures (I only downloaded 100 of them; there are more than 800 in total). If any of these invade someone's privacy, contact me and I will delete them immediately.
Code:
#--------
# 2015-10-04  1001 Girls
# lee
# http://www.zhihu.com/collection/26348030?page=1
#--------
library(magrittr)
library(proto)
library(gsubfn)
library(bitops)
library(rvest)
library(stringr)
library(DBI)
library(RSQLite)
library(RCurl)
library(curl)
library(sqldf)

get.jpgurl <- function(pgnum) {
  url <- paste0('http://www.zhihu.com/collection/26348030?page=', pgnum)
  # url <- 'http://www.zhihu.com/collection/26348030?page=5'
  text <- url %>% html_session() %>% html_nodes('textarea.content') %>% html_text()
  # The answer to the question below is excellent: it extracts URLs straight from text.
  # Not every URL ends in .jpg, so match URLs with a regex and filter to .jpg afterwards.
  # http://stackoverflow.com/questions/26496538/extract-urls-with-regex-into-a-new-data-frame-column
  url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
  # An answer's author may upload more than one photo, so extract every URL
  # ending in .jpg from each answer's text.
  combineurl <- data.frame()
  for (i in 1:length(text)) {
    urlpick <- sapply(text[i], function(txt) str_extract_all(txt, url_pattern, simplify = T))
    if (!is.matrix(urlpick)) next;
    # The line below kept throwing an error as if a data frame had no column
    # name and I was forcing one onto it:
    #   "'names' attribute [1] must be the same length as the vector [0]"
    # Googling turned up no good fix, so I skipped those cases at first.
    # Eventually I understood: some answers contain no pictures, so their
    # text has no URL and the extraction yields an empty matrix; renaming an
    # empty matrix is what fails. Hence the is.matrix() check above.
    colnames(urlpick) <- c('V1')
    urlpick <- as.data.frame(urlpick, stringsAsFactors = F)
    # Filter again: keep only URLs ending in jpg.
    urlpick <- sqldf("select * from urlpick where V1 like '%jpg'")
    combineurl <- rbind(combineurl, urlpick)
  }
  return(combineurl)
}

# Grab every URL ending in .jpg. Only 5 pages in total, so no multithreading.
url.total <- data.frame()
for (i in 1:5) {
  url.total <- rbind(url.total, get.jpgurl(i))
}

# Download all jpg files, numbered from 1. After downloading I found every
# picture appeared twice. Investigating:
# https://pic4.zhimg.com/d69a132cc27914af57c1b9eeb1a938c7_b.jpg
# https://pic4.zhimg.com/d69a132cc27914af57c1b9eeb1a938c7_r.jpg
# These two URLs point to the same picture; they differ only in the suffix
# and sit next to each other. Counting how many end in b.jpg and how many in
# r.jpg gives 811 and 750, so keep the b.jpg ones.
# 811 girls -- fewer than 1001.
tt <- url.total
urltt <- sqldf("select * from tt where V1 like '%b.jpg'")
for (i in 1:100) {
  # To crawl it yourself, change the path to your own; I replaced mine with ***.
  tmp <- paste0('/Users/***/R/1001 girls/', i, '.jpg')
  download.file(urltt[i, 1], destfile = tmp)
  print(paste0('Picture ', i, ' downloaded'))
}
# Clicked into a few answers and compared with the downloaded pictures:
# the order is right and everything was captured.

# Which excellent Zhihu questions are these pictures from?
get.ques <- function(pgnum) {
  url <- paste0('http://www.zhihu.com/collection/26348030?page=', pgnum)
  # url <- 'http://www.zhihu.com/collection/26348030?page=1'
  title <- url %>% html_session() %>%
    html_nodes('div.zm-item h2.zm-item-title a') %>%
    html_text() %>% as.data.frame(stringsAsFactors = F)
  return(title)
}
title <- data.frame()
for (i in 1:5) {
  title <- rbind(title, get.ques(i))
}
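To make the two filtering steps easier to follow in isolation, here is a minimal, self-contained sketch. The sample text and image URLs are made up for illustration, and base R's grepl stands in for the sqldf filter used above:

# Sketch: extract every URL from a blob of answer text, then keep only the
# full-size "_b.jpg" variant so each picture is downloaded exactly once.
library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Made-up sample text in the shape of a Zhihu answer.
sample_text <- paste(
  "some answer text https://pic4.zhimg.com/abc123_b.jpg and its twin",
  "https://pic4.zhimg.com/abc123_r.jpg plus a non-image link",
  "http://www.zhihu.com/question/12345"
)

urls <- str_extract_all(sample_text, url_pattern)[[1]]
jpg_urls <- urls[grepl("_b\\.jpg$", urls)]  # base-R stand-in for the sqldf filter
print(jpg_urls)  # "https://pic4.zhimg.com/abc123_b.jpg"

Filtering on the "_b.jpg" suffix rather than deduplicating by file hash keeps the logic simple, at the cost of trusting Zhihu's naming convention.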
Crawl process:
To summarize: the "1001 Girls" collection, crawled with a short R script.