Crawling Autohome news; the code is as follows:
import requests
from bs4 import BeautifulSoup

res = requests.get(url='https://www.autohome.com.cn/news/')  # send a GET request directly to Autohome to fetch the page
res.encoding = res.apparent_encoding  # adopt the page's detected encoding to avoid garbled text

soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find(name='div', id='auto-channel-lazyload-article')  # get the div tag with id 'auto-channel-lazyload-article'
li_list = div.find_all(name='li')  # get all li tags as a list, then iterate over each one
for li in li_list:
    h3 = li.find(name='h3')
    if h3:  # some li tags have no h3; skip them, otherwise the code below would raise an error
        print(h3.text)  # the headline text in the h3 tag
        p = li.find(name='p')
        print(p.text)  # the summary text in the p tag
        # get the a tag inside the li to read its href, stripping the leading //
        a = li.find(name='a')
        href = a.get('href')
        href_url = href.split('//')[1]
        print(href_url)
        print(' ' * 20)
Automatically logging in to Chouti (dig.chouti.com) and upvoting posts
# The login and upvote endpoints must carry the cookie obtained before logging in,
# so first send a GET request to fetch that cookie
import requests
from bs4 import BeautifulSoup

res = requests.get(
    url='https://dig.chouti.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
)
res_cookie = res.cookies.get_dict()

# To upvote every news item on the page we need each item's id.
# The id is stored in the a tags with class 'discus-a', so collect those tags first
soup = BeautifulSoup(res.text, 'html.parser')
a_list = soup.find_all(name='a', attrs={'class': 'discus-a'})

# log in to Chouti
login = requests.request(
    url='https://dig.chouti.com/login',
    method='POST',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
    data={
        'phone': '8618857172792',
        'password': '[email protected]',
        'oneMonth': '1',
    },
    cookies=res_cookie,
)

# iterate over the ids and upvote each item
for a in a_list:
    link_id = a.get('lang')
    res = requests.request(
        method='POST',
        url='https://dig.chouti.com/link/vote?linksId=' + link_id,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
        cookies=res_cookie,
    )
    print(res.text)
    print('*' * 20)
The essence of a crawler: write a program that simulates a browser sending requests to fetch a site's data.
Common parameters of requests.request:
method: the HTTP method, e.g. GET/POST.
url: the requested URL (domain or IP address plus path).
headers: the request headers. Example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}
# User-Agent identifies the client making the request
cookies: the cookies to send with the request.
params: parameters passed in the URL's query string.
For example: params={'user': 'tom', 'pwd': '123'}  # equivalent to http://www.xxx.com?user=tom&pwd=123
data: values passed in the request body (form-encoded).
json: sends the request body as JSON instead:
With data={'user': 'tom', 'pwd': '123'} the request body is user=tom&pwd=123; with json= it becomes '{"user": "tom", "pwd": "123"}'.
data=json.dumps({'user': 'tom', 'pwd': '123'}) produces the same body as json={'user': 'tom', 'pwd': '123'} (json= additionally sets the Content-Type header to application/json).
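The difference between the two body encodings can be checked without sending anything, using only the standard library; a minimal sketch (the payload values are just the placeholders from above):

```python
import json
from urllib.parse import urlencode

payload = {'user': 'tom', 'pwd': '123'}
form_body = urlencode(payload)   # what data=payload puts in the request body
json_body = json.dumps(payload)  # what json=payload (or data=json.dumps(payload)) puts there
print(form_body)  # user=tom&pwd=123
print(json_body)  # {"user": "tom", "pwd": "123"}
```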
files: file parameters for uploads. Example:
file_dict = {
    'f1': ('new_filename', open('filename', 'rb'))  # the second tuple element can be a file handle or the file content
}
files = file_dict
auth: basic authentication (seldom used; typically seen with browser pop-up login prompts). Example:
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
ret = requests.get(
    url='',
    auth=HTTPBasicAuth('tom', '123456'),
)
print(ret.text)
timeout: time limits. Example:
ret = requests.get(url='http://www.***.com', timeout=(10, 1))  # the first value is the connect timeout (at most 10 s), the second the read timeout (at most 1 s); the request is aborted once either is exceeded
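To see the read half of the timeout tuple in action without touching an external site, a sketch that spins up a deliberately slow local server (the sleep duration and the timeout values are arbitrary choices for the demo):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # respond slower than the read timeout below
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

timed_out = False
try:
    # timeout=(connect, read): connecting may take up to 5 s,
    # but the server must start answering within 0.5 s
    requests.get(f'http://127.0.0.1:{server.server_port}/', timeout=(5, 0.5))
except requests.exceptions.ReadTimeout:
    timed_out = True
    print('read timed out')
```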
allow_redirects: whether to follow redirects.
proxies: proxy IPs. Example:
proxies = {'http': '**.**.**.**', 'https': '**.**.**.**'}  # http traffic uses the first proxy, https traffic the second
proxies = {'http://**.**.**.**': 'http://**.**.**.**:**'}  # requests to that specific host go through the given proxy
Note: if the proxy requires a username and password, import HTTPProxyAuth:
from requests.auth import HTTPProxyAuth
proxies_dict = {'http': '**.**.**.**', 'https': '**.**.**.**'}
auth = HTTPProxyAuth('user', 'passwd')
res = requests.get(url='', proxies=proxies_dict, auth=auth)
print(res.text)
stream: used when downloading large files; the response content is read as an iterator, optionally under context management.
1. res = requests.get(url='https://www.autohome.com.cn/news/', stream=True)
   for i in res.iter_content():
       print(i)
2. from contextlib import closing
   with closing(requests.get('https://www.autohome.com.cn/news/', stream=True)) as r:
       for i in r.iter_content():
           print(i)
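In recent versions of requests the Response object is itself a context manager, so closing() is optional. A self-contained sketch against a throwaway local server (the 10 000-byte body and the 1 KiB chunk size are arbitrary demo values):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'x' * 10000
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

total = 0
with requests.get(f'http://127.0.0.1:{server.server_port}/', stream=True) as r:
    for chunk in r.iter_content(chunk_size=1024):  # read the body in 1 KiB pieces
        total += len(chunk)
print(total)  # 10000
```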
cert: the client certificate (at bottom, encryption of the data in transit, which is the difference between HTTPS and HTTP).
verify: whether to verify the server's certificate.
Example:
import requests

requests.get(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
)
requests.post(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
    data={},  # a GET request has no request body, so GET takes no data parameter
)
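These parameters can all be inspected without network access by preparing a request instead of sending it; a sketch (www.xxx.com is the placeholder host from the examples above, and the header and data values are made up for illustration):

```python
import requests

# build the request without sending it, to see exactly what would be transmitted
prep = requests.Request(
    method='POST',
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},
    headers={'X-Example': 'demo'},
    cookies={'session': 'abc'},
    data={'k': 'v'},
).prepare()

print(prep.url)                # params merged into the query string
print(prep.body)               # k=v
print(prep.headers['Cookie'])  # session=abc
```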