Getting started with Python: crawl Autohome news and automatically log in to Chouti to click Like

Source: Internet
Author: User

Crawl Autohome news; the code is as follows:

import requests
from bs4 import BeautifulSoup

# send a GET request to Autohome to fetch the page
res = requests.get(url='https://www.autohome.com.cn/news/')
# set the response encoding from the detected encoding to avoid garbled text
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text, 'html.parser')
# get the div tag with id 'auto-channel-lazyload-article'
div = soup.find(name='div', id='auto-channel-lazyload-article')
# get all the li tags as a list, then iterate over each one
li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    # some li tags have no h3; skip them, otherwise the code below would raise an error
    if not h3:
        continue
    print(h3.text)  # the text of the h3 tag
    p = li.find(name='p')
    print(p.text)  # the text of the p tag
    # get the a tag inside the li, read its href, and strip the leading //
    a = li.find(name='a')
    href = a.get('href')
    href_url = href.split('//')[1]
    print(href_url)
    print('  ' * 20)
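The extraction above can be exercised offline as well. The sketch below uses only the standard library's HTMLParser against a tiny hypothetical snippet shaped like the Autohome list, just to show the h3-collection step without hitting the network:

```python
from html.parser import HTMLParser

# hypothetical snippet mimicking the structure of the Autohome article list
SNIPPET = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><h3>First headline</h3><p>summary one</p></li>
    <li><h3>Second headline</h3><p>summary two</p></li>
  </ul>
</div>
"""

class HeadlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        # only collect text that sits inside an h3 tag
        if self.in_h3:
            self.headlines.append(data)

parser = HeadlineParser()
parser.feed(SNIPPET)
print(parser.headlines)  # ['First headline', 'Second headline']
```

BeautifulSoup does the same job with far less code, as the crawl above shows; the stdlib version is only here so the parsing step can be tried without installing anything.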

Automatically log in to Chouti and click Like

# both the login and the like operation must carry the pre-login cookie,
# so first send a GET request to obtain it
import requests
from bs4 import BeautifulSoup

# request the home page to obtain the cookie
res = requests.get(
    url='https://dig.chouti.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
)
res_cookie = res.cookies.get_dict()

# to like every news item on the page we need each item's id;
# the id sits on the a tags with class 'discus-a', so collect those first
# and read the ids while iterating later
soup = BeautifulSoup(res.text, 'html.parser')
a_list = soup.find_all(name='a', attrs={'class': 'discus-a'})

# log in to Chouti
login = requests.request(
    url='https://dig.chouti.com/login',
    method='POST',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
    data={
        'phone': '8618857172792',
        'password': '[email protected]',
        'oneMonth': '1',
    },
    cookies=res_cookie,
)

# iterate over the a tags, read each id, and like the item
for a in a_list:
    id = a.get('lang')
    res = requests.request(
        method='POST',
        url='https://dig.chouti.com/link/vote?linksId=' + id,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
        cookies=res_cookie,
    )
    print(res.text)
    print('*' * 20)
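The vote loop can be sketched offline too. The ids below are hypothetical, standing in for the values read from each a tag's lang attribute; the sketch only shows how the vote URLs are assembled:

```python
# hypothetical ids, in place of the values scraped from the lang attribute
ids = ['21910661', '21910662']

# the vote endpoint takes the item id in the linksId query parameter
vote_urls = ['https://dig.chouti.com/link/vote?linksId=' + i for i in ids]
for u in vote_urls:
    print(u)
```

In the real script each of these URLs is then POSTed with the pre-login cookie attached, as shown above.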
The essence of a crawler: a program that simulates a browser sending requests to collect site data.

Common parameters of a requests request:

method: the HTTP method, such as GET/POST.
url: the URL (domain/IP address) to request.
headers: the request headers. Example:
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
# User-Agent identifies the requesting client
cookies: the cookies to send.
params: parameters passed in the URL query string.
For example: params={'user': 'tom', 'pwd': '123'}  # equivalent to http://www.xxx.com?user=tom&pwd=123
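The query-string encoding that requests performs for params= can be reproduced with the standard library, which makes the equivalence easy to check (www.xxx.com is the placeholder host from the example above):

```python
from urllib.parse import urlencode

# what requests does internally when given params=
query = urlencode({'user': 'tom', 'pwd': '123'})
url = 'http://www.xxx.com?' + query
print(url)  # http://www.xxx.com?user=tom&pwd=123
```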

data: values passed in the request body.
json: serializes the request body as JSON:
If data={'user': 'tom', 'pwd': '123'}, the request body is user=tom&pwd=123; with json= the body becomes '{"user": "tom", "pwd": "123"}'.
data=json.dumps({'user': 'tom', 'pwd': '123'}) has the same effect as json={'user': 'tom', 'pwd': '123'}.
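The two body encodings can be compared side by side with the standard library, mirroring what data= and json= produce:

```python
import json
from urllib.parse import urlencode

payload = {'user': 'tom', 'pwd': '123'}

form_body = urlencode(payload)   # the form encoding used by data=
json_body = json.dumps(payload)  # the JSON encoding used by json=

print(form_body)  # user=tom&pwd=123
print(json_body)  # {"user": "tom", "pwd": "123"}
```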
files: file parameters. Example:
file_dict={
    'f1': ('new_filename', open('filename', 'rb'))  # the second element can be a file handle or the file contents
}
files=file_dict
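A small sketch of the two accepted shapes for a file entry, using an in-memory buffer instead of a real file so nothing needs to exist on disk (the field names and filenames are made up for illustration):

```python
import io

file_dict = {
    'f1': ('report.txt', io.BytesIO(b'hello')),  # (upload filename, file handle)
    'f2': ('notes.txt', b'raw bytes'),           # (upload filename, file contents)
}

# resolve each entry to its bytes, the way an upload would consume them
resolved = {}
for field, (name, content) in file_dict.items():
    resolved[field] = (name, content.read() if hasattr(content, 'read') else content)

print(resolved)
```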
auth: basic authentication (seldom used; often seen with window-style login prompts). Example:
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
ret = requests.get(
    url='',
    auth=HTTPBasicAuth('tom', '123456'),
)
print(ret.text)
timeout: the time-out setting. Example:
ret = requests.get(url='www.***.com', timeout=(10, 1))  # the first value is the connect timeout, the second the read timeout; the request stops once either is exceeded
allow_redirects: whether to follow redirects.
proxies: proxy IPs. Examples:
proxies={'http': '**.**.**.**', 'https': '**.**.**.**'}  # http requests go through the first proxy, https requests through the second
proxies={'http://**.**.**.**': 'http://**.**.**.**:**'}  # requests to that specific host go through that proxy
Note: if the proxy requires a username and password, import HTTPProxyAuth:
from requests.auth import HTTPProxyAuth
proxies_dict={'http': '**.**.**.**', 'https': '**.**.**.**'}
auth=HTTPProxyAuth('user', 'passwd')
res = requests.get(url='', proxies=proxies_dict, auth=auth)
print(res.text)
stream: used when downloading large files; iterate over the content in chunks, optionally inside a context manager.
1. res = requests.get(url='https://www.autohome.com.cn/news/', stream=True)
   for i in res.iter_content():
       print(i)

2. from contextlib import closing
   with closing(requests.get('https://www.autohome.com.cn/news/', stream=True)) as r:
       for i in r.iter_content():
           print(i)
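The point of chunked iteration is that the body never has to fit in memory at once. A stdlib sketch of that behavior, using an in-memory buffer in place of a network response (the helper name iter_content is borrowed from requests only to make the parallel obvious):

```python
import io

def iter_content(fobj, chunk_size=4):
    # mimic response.iter_content: yield fixed-size chunks until exhausted
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

# stand-in for a streamed response body
data = io.BytesIO(b'abcdefghij')
chunks = list(iter_content(data))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```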
cert: the client-side SSL certificate (in essence, encryption of the data, as with HTTPS versus HTTP).
verify: whether to verify the server's certificate during SSL validation.


Examples:
import requests
requests.get(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
)
requests.post(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
    data={},  # a GET request has no request body, so data applies only to POST
)

