# coding: utf-8
# Author: blood_zero

"""
1. Fetch web page information
2. Handle encoding problems with a charset library (not installed by default)
"""
import re
import urllib
import urllib2

url = "http://192.168.1.135/myself/"
html = urllib.urlopen(url)
content = html.read()
print content
# If the page uses a different encoding, the output will be garbled
# print content.decode('gbk').encode('utf-8')

"""
Simple ways to get information about a page
"""
# Get the current URL
print "Current URL: " + str(html.geturl())
# Page status code
print "Current status code: " + str(html.code)
# print "Current status code: " + str(html.getcode())
# Response header information
print "Current header information:\n" + str(html.headers)
# print "Current header information:\n" + str(html.info())
# Get the encoding declared by the site
print "Current website encoding: " + str(html.info().getparam("charset"))
# Download the page source
urllib.urlretrieve(url, "E:\\python_code\\pytools\\url.txt")

"""
Simulate a browser when requesting URLs
"""
# Method one: add headers to the Request one at a time
req = urllib2.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0")
# "GET" is sent as an ordinary header field; it does not set the request method
req.add_header("GET", url)
req.add_header("Host", "192.168.1.135")
new_html = urllib2.urlopen(req)
print new_html.read()
print req.headers.items()

# Method two: pass all headers as a dict
myheader = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0",
    "Host": "192.168.1.135",
    "GET": url,
}
req1 = urllib2.Request(url, headers=myheader)
new_html_1 = urllib2.urlopen(req1)
print new_html_1.read()
print req1.headers.items()

"""
Find specific files referenced in a web page
"""
def get_content(url):
    # Fetch the page and return its raw source
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_file(content):
    # Match links to PHP files
    regex = r'a href=(.+?\.php)'
    pat = re.compile(regex)
    file_code = re.findall(pat, content)
    print str(file_code) + "\n"

info = get_content("http://192.168.1.135/myself/SQL_Injection/")
get_file(info)
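
The last two ideas in the script above, decoding a page with its declared charset and matching PHP links with a regular expression, can be tried without a live server. What follows is a minimal sketch written for Python 3 (the script itself targets Python 2). The sample HTML, the helper names `find_php_links` and `decode_page`, and the stricter pattern that captures only the href value are assumptions made for illustration; they are not part of the original script.

```python
import re

# Hypothetical sample page; the original script fetches real HTML instead.
SAMPLE_HTML = (
    '<a href="login.php">Login</a>\n'
    '<a href="style.css">Style</a>\n'
    "<a href='search.php'>Search</a>\n"
)

# Capture the href value only when it ends in .php.
# Quotes, whitespace and '>' are excluded so a match cannot run past one attribute.
PHP_LINK = re.compile(r'''<a\s+href=["']?([^"'\s>]+\.php)''', re.IGNORECASE)


def find_php_links(html):
    """Return the .php targets of <a href=...> tags, in document order."""
    return PHP_LINK.findall(html)


def decode_page(raw, charset=None):
    """Decode raw bytes with the declared charset, falling back to UTF-8.

    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode(charset or "utf-8", errors="replace")


if __name__ == "__main__":
    print(find_php_links(SAMPLE_HTML))
    print(decode_page("\u4e2d\u6587".encode("gbk"), "gbk"))
```

Excluding quote characters from the captured group is a deliberate choice here: the original pattern `a href=(.+?\.php)` also captures the opening quote and can span unrelated text on the same line before the first `.php`.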
Python Crawler Learning