2017-07-29 23:20:24
Main technical approach: requests + bs4 + formatted output
import requests
from bs4 import BeautifulSoup

url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2017.html'


def gethtml(url):
    # Opening a web page can fail, so wrap the request in try-except.
    kv = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}
    try:
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()  # raises HTTPError if the request failed
        # encoding is guessed from the response header;
        # apparent_encoding is detected from the content itself.
        r.encoding = r.apparent_encoding
        return r
    except requests.RequestException:
        print("Open Failed")
        return None


def gettext(r):
    soup = BeautifulSoup(r.text, 'html.parser')
    # print(soup.prettify())
    tr = soup('tr')  # shorthand for soup.find_all('tr')
    ls = []
    # Header row: the first four <th> cells.
    th = tr[0]('th')
    lst = []
    for i in range(4):
        lst.append(th[i].string)
    ls.append(lst)
    # Data rows: use the row index as the rank, then take cells 1 to 3.
    for i in range(1, len(tr)):
        td = tr[i]('td')
        lst = [i]
        for k in range(1, 4):
            lst.append(td[k].string)
        ls.append(lst)
    return ls


def printtext(ls):
    # chr(12288) is the full-width space; using it as the fill character
    # keeps the Chinese names in the middle column aligned.
    for i in ls:
        print('{0:^10}\t{1:{3}^10}\t{2:^10}'.format(i[0], i[1], i[2], chr(12288)))


if __name__ == '__main__':
    r = gethtml(url)
    if r is not None:  # only parse when the page was fetched successfully
        ls = gettext(r)
        printtext(ls)
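The format string in the print step is the least obvious part, so here is a small offline sketch of just that piece. The rows are made-up placeholders, not scraped data, and `format_row` and `FULLWIDTH_SPACE` are names introduced only for this illustration. The idea is that `{1:{3}^10}` centers the second field in a width of 10 and pads it with whatever character is passed as the fourth argument.

```python
# Hypothetical placeholder rows (not real ranking data), used only to show alignment.
rows = [
    ["排名", "学校名称", "省市"],
    [1, "甲大学", "北京"],
    [2, "乙丙丁大学", "上海"],
]

FULLWIDTH_SPACE = chr(12288)  # U+3000, the full-width (ideographic) space


def format_row(row, fill=FULLWIDTH_SPACE):
    # Fields 0 and 2 are centered in width 10 with ordinary spaces.
    # Field 1 is centered in width 10 using `fill` as the padding character.
    return '{0:^10}\t{1:{3}^10}\t{2:^10}'.format(row[0], row[1], row[2], fill)


for row in rows:
    print(format_row(row))
```

Passing `fill=' '` instead lets you compare the two: with an ordinary half-width space, names of different lengths tend to push the following column out of line in a terminal that renders Chinese characters at double width.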
Python crawler: getting university rankings