Detecting 404 Pages with Python

Source: Internet
Author: User
Tags: response code

Some sites serve a custom, user-friendly error page instead of a bare 404 response; the 404 hint page on CSDN is one example.

This improves the user experience, but a POC that detects missing pages using only the returned HTTP header information is likely to produce false positives. To detect 404 pages accurately, both the status code and the page content need to be checked.
The status code is the easier part: send the HTTP request with the requests library and read the response code.
For the page content, the idea is to request a page that obviously does not exist on the site and save the content that comes back. Then request the target page and compare the two: if their similarity exceeds a threshold, the target is a 404 page; otherwise it is a normal page.
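The two checks described above can be combined in one small function. The following is only a sketch: it uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure (not the simhash library the article uses), and the names SUSPECT_CODES, similarity and looks_like_404, as well as the list of suspect codes, are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Status codes that may still carry an error page (assumed for this sketch).
SUSPECT_CODES = (200, 301, 302)


def similarity(page_a, page_b):
    """Return a 0..1 similarity ratio between two page bodies."""
    return SequenceMatcher(None, page_a, page_b).ratio()


def looks_like_404(status_code, body, known_404_bodies, threshold=0.85):
    """Combine the status-code check with the content check."""
    if status_code == 404:
        return True  # the server says so directly
    if status_code not in SUSPECT_CODES:
        return False  # some other response; not treated as a 404 here
    # Suspect code: compare the body with every saved error page.
    return any(similarity(body, ref) > threshold for ref in known_404_bodies)
```

A response with status 200 whose body matches a saved error page is reported as a 404, while the same body with a status outside SUSPECT_CODES is not.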
To measure the similarity of two pages, the code uses a Python simhash library. I do not understand the algorithm this library implements, but that is the advantage of Python: you do not need to know, you can just use it. Here is a simple way to use it:

# -*- encoding: utf-8 -*-
# 404 page recognition
from hashes.simhash import simhash
import requests


class page_404:
    def __init__(self, domain):  # domain: the site to check
        self._404_page = []  # contents of known 404 pages
        self._404_url = []  # URLs already identified as 404
        # Paths that should not exist, used to collect sample 404 pages
        self._404_path = ["test_404.html", "404_test.html", "helloworld.html", "test.asp?action=modify&newsid=122%20and%201=2%20union%20select%201,2,admin%2bpassword,4,5,6,7%20from%20shopxp_admin"]
        # Status codes with which a response may still be a 404 page
        self._404_code = [200, 301, 302]
        # Request the constructed URLs to collect 404 page samples
        for path in self._404_path:
            if domain[-1] == "/":
                url = domain + path
            else:
                url = domain + "/" + path
            response = requests.get(url)
            if response.status_code in self._404_code:
                self.kb_appent(response.content, url)

    def kb_appent(self, _404_page, _404_url):
        # Save the page content and the URL, skipping duplicates
        if _404_page not in self._404_page:
            self._404_page.append(_404_page)
        if _404_url not in self._404_url:
            self._404_url.append(_404_url)

    def is_similar_page(self, page1, page2):
        hash1 = simhash(page1)
        hash2 = simhash(page2)
        similar = hash1.similarity(hash2)
        return similar > 0.85  # the threshold is currently 0.85

    def is_404(self, url):
        if url in self._404_url:
            return True
        response = requests.get(url)
        if response.status_code == 404:
            return True
        if response.status_code in self._404_code:
            # Compare against every saved 404 page, not only the first one
            for page in self._404_page:
                if self.is_similar_page(response.content, page):
                    # A 404 page: save the current page content and URL
                    self.kb_appent(response.content, url)
                    return True
        return False
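The class treats simhash as a black box. For readers curious about the general idea, here is a simplified illustration using only the standard library. It is not the implementation of the library used in the article; the names BITS, token_hash, fingerprint and similarity, the word-level tokenisation, and the use of MD5 for token hashing are all assumptions of this sketch. Each word votes on every bit of a 64-bit fingerprint, and two texts are compared by counting the fingerprint bits on which they agree.

```python
import hashlib
import re

BITS = 64  # fingerprint width chosen for this sketch


def token_hash(token):
    """Map a token to a stable 64-bit integer (MD5 is used only for mixing)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text):
    """Build a simhash-style fingerprint from the words in text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = token_hash(token)
        for bit in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << bit
    return value


def similarity(text_a, text_b):
    """Return the fraction of fingerprint bits on which the two texts agree."""
    differing = bin(fingerprint(text_a) ^ fingerprint(text_b)).count("1")
    return 1 - differing / BITS
```

Because every word contributes a little to every bit, changing a few words in a long page tends to flip only a few bits, which is why a threshold such as 0.85 can separate near-duplicate error pages from unrelated content.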

The page_404 detection class keeps four pieces of information:
_404_page: the contents of known 404 pages, compared against each requested page to recognise 404 pages. It is a list because a site may have several different 404 pages; the longer the code runs, the more samples it collects and the more accurate it becomes.
_404_url: URLs already identified as 404 pages. A URL found here is not checked again, which improves efficiency.
_404_path: paths used to build URLs of nonexistent pages. The last one carries a SQL injection payload, so that the error page shown when a firewall blocks a request is also recognised.
_404_code: response codes with which a 404 page may still be returned. If the response code is one of these, the page content has to be checked.
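Building the probe URLs from _404_path amounts to joining each path onto the site root with exactly one slash. A minimal sketch, assuming a hypothetical helper named build_probe_urls and the placeholder host example.com:

```python
def build_probe_urls(domain, paths):
    """Join each probe path onto the site root with exactly one slash."""
    root = domain.rstrip("/")
    return [root + "/" + path.lstrip("/") for path in paths]


urls = build_probe_urls("http://example.com/", ["test_404.html", "404_test.html"])
print(urls)
# ['http://example.com/test_404.html', 'http://example.com/404_test.html']
```

Stripping the slash on both sides gives the same result whether or not the domain was passed with a trailing slash.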
The class takes a domain name at initialization. It joins this domain with several paths that either do not exist or will be blocked by the firewall, submits the requests, and saves the responses as reference material for later checks.
The is_404 method first consults the saved 404 URLs; if the current URL is already known to be a 404, it returns immediately, which improves efficiency. Otherwise it submits a normal HTTP request and reads the response.
It returns True if the response code is 404. If the status code is in the _404_code list instead, the response is compared with the saved 404 pages.
The test code is as follows:

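# The module name is missing in the source; this assumes the class is saved in page_404.py
from page_404 import page_404

if __name__ == '__main__':
    domain = "http://xzylrd.gov.cn"
    check_404 = page_404(domain)
    dest_url = "http://xzylrd.gov.cn/TEXTBOX2.ASP?action=modify&newsid=122%20and%201=2%20union%20select%201,2,admin%2bpassword,4,5,6,7%20from%20shopxp_admin"
    print(check_404.is_404(dest_url))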
