0x01 Preface
A couple of days ago I came across an article on Baijiahao titled "Counterattack Crawlers: How Wild Can a Front-End Engineer's Imagination Be?". Drawing on real cases, it surveyed the anti-crawler mechanisms of large sites such as Cat's Eye Movies (Maoyan), Meituan, and Qunar. As that article says, for a web page we usually hope for well-structured markup and clear content so that search engines can recognize it accurately; conversely, there are scenarios where we do not want the content to be harvested easily, such as an e-commerce site's transaction figures or the question banks on a university website. Such content is often a product's lifeline and must be protected effectively. This is the origin of the contest between crawlers and anti-crawlers. This article takes the comparatively well-defended Cat's Eye Movies site as its example, cracks its anti-crawler mechanism, and crawls the data we want!
0x02 Common Anti-Crawler Mechanisms
Functionally, a crawler is generally divided into three parts: data collection, processing, and storage. As crawler writers we only care about data collection; processing is left to the data analysts.
In general, most websites defend against crawlers from three angles: the headers of user requests, user behavior, and the site's directory structure and data-loading method. The first two are the ones you encounter most often, and most sites fight crawlers from those angles; the third is more special and is used by sites that load data via AJAX, which undoubtedly raises the difficulty of crawling.
However, all three anti-crawler strategies already have well-known countermeasures. If you run into the headers check on user requests, you can simply add headers to the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. Headers-based detection is bypassed very well by modifying or adding headers in the crawler. Anti-crawling based on user behavior mostly means restricting how often the same IP may visit the same page within a short window; the countermeasure is equally blunt: IP proxies. You can write a dedicated crawler that scrapes publicly listed proxy IPs, verifies them, and saves them all; with a large pool of proxy IPs you can switch to a new IP every few requests and walk straight past this mechanism. For the last case, dynamically loaded pages, the Selenium + PhantomJS combination lets you simulate the dynamic requests and render the page in a browser with no interface; after all, Selenium is something of an artifact of automated testing and penetration work. A hedged sketch of the first two countermeasures follows.
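A minimal sketch, assuming the requests library; the target URL and the proxy address are placeholders of mine, not from the original article:

import requests

# Sketch only: headers copied from a real browser session; URL and proxy are placeholders.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Referer': 'http://maoyan.com/',  # point the Referer at the target site's own domain
}
proxies = {'http': 'http://1.2.3.4:8080'}  # one entry from a scraped-and-verified proxy pool

resp = requests.get('http://maoyan.com/films', headers=headers, proxies=proxies, timeout=10)
print(resp.status_code)

Rotating a different entry from the proxy pool into proxies every few requests is what defeats the per-IP rate limit.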
0x03 Cat's Eye's Anti-Crawler Mechanism
With the common anti-crawler mechanisms covered, let's turn to today's protagonist: what does Cat's Eye Movies use as its anti-crawler?
For important data such as the daily cinema fares, the page source does not contain the plain numbers. Instead, a character set is defined in the page with font-face, and the digits are displayed through a Unicode mapping. A brief introduction to this newer web-font anti-crawler mechanism: web-font lets a site load fonts over the network, so the site can build its own font and define a custom character-mapping table inside it. For example, 0xefab is mapped to the character 1, 0xeba2 to the character 2, and so on. When the character 1 needs to be displayed, the page source contains only 0xefab, and that is all a scraper will ever collect, never the 1.
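To make the idea concrete, a small Python illustration (the two mappings are the ones quoted above; everything else is invented for the example):

# Illustration only: the code points live in Unicode's Private Use Area,
# so they mean nothing without the font that draws them.
mapping = {0xefab: '1', 0xeba2: '2'}   # known only to whoever built the font file

source_text = '\uefab\ueba2'           # what the HTML source (and a scraper) actually sees

# Only a client that loads the font, or rebuilds this mapping, can recover "12".
print(''.join(mapping[ord(ch)] for ch in source_text))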
As a result, a scraper does not collect the correct fare data: it can only obtain entities like &#xebc4;, with no idea what character &#xebc4; maps to, and that is what defeats the collection. Normal visitors are unaffected, because the browser loads the CSS web font and renders the digits in the page in real time. In other words, short of resorting to image recognition, you must crawl the character set at the same time in order to recognize the numbers.
Viewing the Cat's Eye page source confirms exactly this:
All fare information is "encrypted" with this dynamically served web font. Now that we know the principle, we keep digging: by analyzing the HTML structure of the site, we find that the font used to render each fare can be found inside a script tag of the page:
The font is embedded in the page Base64-encoded, so, in Python:
# Decode the Base64-encoded font file and save it locally
import re
import base64

font = re.findall(r"src: url\(data:application/font-woff;charset=utf-8;base64,(.*?)\) format", response_all)[0]
fontdata = base64.b64decode(font)
with open('/home/jason/workspace/1.ttf', 'wb') as f:
    f.write(fontdata)
So while crawling, we decode the font file and save it locally as a TTF file for later use.
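As an aside, response_all in the snippet above is just the raw page source; the original post does not show how it is obtained. A minimal sketch, assuming the requests library and an illustrative URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # minimal UA so the request is not rejected outright
# URL is illustrative; any Cat's Eye page that embeds the web font would do.
response_all = requests.get('http://maoyan.com/cinema/15887', headers=headers).text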
As mentioned earlier, this web font defines a character set that is mapped through Unicode, so we want to build a dictionary of the Unicode code points mapped in the TTF font file:
Python code:
import fontforge

def tff2Unicode():
    # Dump the font's glyphs as a list of Unicode code points
    filename = '/home/jason/workspace/1.ttf'
    fnt = fontforge.open(filename)
    for i in fnt.glyphs():
        print(i.unicode)
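fontforge's Python bindings can be awkward to install; the same code-point dump can be produced with fontTools, which is my substitution here and not what the original author used:

from fontTools.ttLib import TTFont

font = TTFont('/home/jason/workspace/1.ttf')
cmap = font['cmap'].getBestCmap()      # {code point: glyph name}
for codepoint, glyph_name in cmap.items():
    print(hex(codepoint), glyph_name)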
We guess the mapping relationship is as follows:
Remember the data we crawled in the third screenshot: "Brotherhood of Blades II: The Infernal Battlefield 341189 2017-07-20 Hall 6 2D Mandarin 11:10 &#xebc4;&#xe1e7;". After replacing "&#" with "0" and looking the results up in the mapping table, the fare comes out as exactly "29"!
Python code:
import fontforge

tmp_dic = {}
ttf_list = []

def creatTmpDic():
    # Build the mapping dictionary: glyph code point -> digit
    num_list = [-1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # the first two glyphs are not digits
    filename = '/home/jason/workspace/1.ttf'
    fnt = fontforge.open(filename)
    ttf_list = []
    for i in fnt.glyphs():
        ttf_list.append(i.unicode)
    tmp_dic = dict(zip(ttf_list, num_list))  # zip the code points with the digits
    return tmp_dic, ttf_list

def tff2price(para, tmp_dic, ttf_list):
    # Map the scraped HTML entities to digits via the dictionary
    tmp_return = ""
    for j in para.split(";"):
        if j != "":
            ss = j.replace("&#", "0")  # "&#xebc4" -> "0xebc4", matching hex(code point)
            for g in ttf_list:
                if hex(g) == ss:
                    tmp_return += str(tmp_dic[g])
    return tmp_return
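Putting the two functions together on the entity string quoted earlier, a usage sketch (the exact output depends on the font served with that page):

tmp_dic, ttf_list = creatTmpDic()
# "&#xebc4;&#xe1e7;" is the raw fare string scraped from the page above
print(tff2price('&#xebc4;&#xe1e7;', tmp_dic, ttf_list))   # expected "29" for that page's font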
Well, to this, we can already say has completed the "encryption" of the price of the data cracked ~ or a little sense of accomplishment! However, there is still a very hole in the place: the developers have thought that the collector can be analyzed to know the meaning of each mapping, so that after the acquisition of conversion processing, so we each access is random to get a font, and developers also regularly update a batch of font files and mapping table to increase the difficulty of acquisition, So we have to collect a page in the process of the update of the local page of the Web-font font, will undoubtedly greatly increase the crawl cost and crawl efficiency, so from a certain point of the actual implementation of anti-crawler.
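A minimal per-page sketch under stated assumptions: requests is used for fetching, the URL is illustrative, and the fare-extraction regex is hypothetical, since the article does not show the exact markup around the fare spans:

import re
import base64
import requests

def price_for_page(url):
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

    # 1. Extract this page's randomly chosen font and save it locally
    font_b64 = re.findall(r"base64,(.*?)\) format", html)[0]
    with open('/home/jason/workspace/1.ttf', 'wb') as f:
        f.write(base64.b64decode(font_b64))

    # 2. Rebuild the mapping for this font, then 3. decode every fare string
    tmp_dic, ttf_list = creatTmpDic()                       # defined above
    fares = re.findall(r'>(&#x[0-9a-f]{4};(?:&#x[0-9a-f]{4};)*)<', html)  # hypothetical pattern
    return [tff2price(p, tmp_dic, ttf_list) for p in fares]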
References:
http://blog.csdn.net/fdipzone/article/details/68166388
https://baijiahao.baidu.com/s?id=1572788572555517&wfr=spider&for=pc
https://zhuanlan.zhihu.com/p/20520370?columnSlug=python-hacker