1. Installing requests
Windows: pip install requests
Linux: sudo pip install requests
Installation can be slow from inside China. In that case, go to:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
Search for requests and download the wheel (.whl) file.
Change the file's suffix from .whl to .zip, unzip it, and copy the requests folder into Python's Lib directory.
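Because a wheel is an ordinary zip archive, the rename-and-unzip step can also be done from Python. The following is only a minimal sketch of that manual step, not an official installation method: the function name extract_package is hypothetical, and running pip install on the downloaded .whl file is usually the simpler route.

```python
import zipfile
from pathlib import Path


def extract_package(wheel_path, package, dest_dir):
    """Extract one top-level package folder from a wheel into dest_dir.

    A .whl file is a zip archive, so zipfile can read it without renaming.
    Returns the path of the extracted package folder.
    """
    prefix = package + "/"
    with zipfile.ZipFile(wheel_path) as wheel:
        # Keep only the files that live inside the package folder.
        members = [name for name in wheel.namelist() if name.startswith(prefix)]
        if not members:
            raise ValueError("%s not found in %s" % (package, wheel_path))
        wheel.extractall(dest_dir, members=members)
    return Path(dest_dir) / package
```

A call such as extract_package('requests.whl', 'requests', 'C:/Python/Lib') would then place the requests folder under the given directory; both paths here are placeholders.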
2. Getting site content
import requests

# Send a browser-like User-Agent so the site serves the normal page.
useragent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
html = requests.get("http://tieba.baidu.com/f?ie=utf-8&kw=python", headers=useragent)
print(html.text)
3. Submitting data to a Web page
GET retrieves data from the server.
POST sends data to the server.
GET passes its parameters by building them into the URL.
POST puts the data in the request body rather than in the URL.
Data that a page loads via AJAX does not appear in the page source; to obtain it, you have to send the same request the page sends, which is often a POST.
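To make the difference concrete, here is a small sketch that builds (but never sends) one request of each kind using only the standard library. The host example.com and the payload are placeholders, not a real endpoint. requests performs the equivalent encoding when you pass params= to requests.get or a dictionary as data= to requests.post.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder form fields, used only for illustration.
payload = {'type': '1', 'sort': '1', 'currentpage': '3'}
encoded = urlencode(payload)

# GET: the parameters become the query string of the URL; there is no body.
get_req = Request('http://example.com/search?' + encoded)

# POST: the URL stays clean and the same encoded string travels in the body.
post_req = Request('http://example.com/search', data=encoded.encode('utf-8'))

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```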
import requests

# Form fields expected by the server, sent in the body of the POST request.
data = {'type': '1', 'sort': '1', 'currentpage': '3'}
html_text = requests.post("http://xxxxxx/student/courses/searchCourses", data=data)
print(html_text.text)
---------------------------------------------------------------------------------------------
As a small example, here are my notes from a Jikexueyuan (Geek Academy) video.
# -*- coding: utf-8 -*-
import re

import requests


class Spider(object):
    def changepage(self, url, total_page):
        # Build the page URLs from the page named in url through total_page.
        now_page = int(re.search(r'pageNum=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # re.sub's fourth positional argument is count, so re.S is not passed here.
            link = re.sub(r'pageNum=\d+', 'pageNum=%s' % i, url)
            page_group.append(link)
        return page_group

    def getsource(self, url):
        # Download a page and return its HTML as text.
        html = requests.get(url)
        return html.text

    def geteveryclass(self, source):
        # Each course sits in its own <li id=...> element.
        everyclass = re.findall(r'(<li id=.*?</li>)', source, re.S)
        return everyclass

    def getinfo(self, eachclass):
        # Pull the fields of one course out of its <li> block.
        info = {}
        info['title'] = re.search(r'alt="(.*?)"', eachclass, re.S).group(1)
        info['content'] = re.search(r'display: none;">(.*?)</p>', eachclass, re.S).group(1)
        timeandlevel = re.findall(r'<em>(.*?)</em>', eachclass, re.S)
        info['classtime'] = timeandlevel[0]
        info['classlevel'] = timeandlevel[1]
        info['learnnum'] = re.search(r'"learn-number">(.*?)</em>', eachclass, re.S).group(1)
        return info

    def saveinfo(self, classinfo):
        # open(path + file name, mode). Common modes: r read-only, r+ read/write,
        # w new file (overwrites the original), a append, b binary.
        f = open('info.txt', 'a', encoding='utf-8')
        for each in classinfo:
            f.write('title:' + each['title'] + '\n')
            # f.write('content:' + each['content'] + '\n')
            # f.write('classtime:' + each['classtime'] + '\n')
            # f.write('classlevel:' + each['classlevel'] + '\n')
            # f.write('learnnum:' + each['learnnum'] + '\n')
        f.close()


if __name__ == '__main__':
    classinfo = []  # list that will hold one dictionary per course
    url = 'http://www.jikexueyuan.com/course/?pageNum=1'
    jikespider = Spider()  # instantiate the spider
    all_links = jikespider.changepage(url, 2)  # URLs from the current page through page 2
    for link in all_links:
        print('Reading: ' + link)
        html = jikespider.getsource(link)  # source of the current page
        everyclass = jikespider.geteveryclass(html)  # every <li> block on the page
        for each in everyclass:
            info = jikespider.getinfo(each)  # extract the fields of one course
            classinfo.append(info)  # add it to the list
    jikespider.saveinfo(classinfo)  # write the results to info.txt
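The regular-expression steps of the spider above can be tried offline, without requesting the site. The sketch below repeats the same patterns as standalone functions; the names page_links, split_courses and parse_course, and the SAMPLE markup, are made up for illustration and are not the real page structure.

```python
import re


def page_links(url, total_page):
    """Return the URLs from the page named in url through total_page."""
    now_page = int(re.search(r'pageNum=(\d+)', url).group(1))
    # The fourth positional argument of re.sub is count, not flags,
    # so no flag is passed positionally here.
    return [re.sub(r'pageNum=\d+', 'pageNum=%d' % page, url)
            for page in range(now_page, total_page + 1)]


def split_courses(source):
    """Return every <li id=...>...</li> block; re.S lets '.' cross newlines."""
    return re.findall(r'(<li id=.*?</li>)', source, re.S)


def parse_course(block):
    """Extract the fields of one course block into a dictionary."""
    times = re.findall(r'<em>(.*?)</em>', block, re.S)
    return {
        'title': re.search(r'alt="(.*?)"', block, re.S).group(1),
        'content': re.search(r'display: none;">(.*?)</p>', block, re.S).group(1),
        'classtime': times[0],
        'classlevel': times[1],
        'learnnum': re.search(r'"learn-number">(.*?)</em>', block, re.S).group(1),
    }


# Hypothetical markup standing in for one course entry (not the real site).
SAMPLE = (
    '<li id="1"><img alt="Intro to Python">\n'
    '<p style="display: none;">Basics of the language</p>\n'
    '<em>2 hours</em><em>Beginner</em>\n'
    '<em class="learn-number">1234</em></li>\n'
)

if __name__ == '__main__':
    print(page_links('http://www.jikexueyuan.com/course/?pageNum=1', 3))
    for block in split_courses(SAMPLE * 2):
        print(parse_course(block))
```

Testing the patterns against a saved fragment like this makes it easier to see which one fails when the site's markup changes, since re.search returns None on a miss and the .group(1) call then raises AttributeError.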