Crawling Autohome news; the code is as follows:
import requests
from bs4 import BeautifulSoup

res = requests.get(url='https://www.autohome.com.cn/news/')  # send a GET request directly to Autohome to fetch the page
res.encoding = res.apparent_encoding  # adopt the page's detected encoding to avoid garbled text

soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find(name='div', id='auto-channel-lazyload-article')  # get the div tag with id 'auto-channel-lazyload-article'
li_list = div.find_all(name='li')  # get all li tags as a list, then iterate over each one
for li in li_list:
    h3 = li.find(name='h3')
    if h3:  # some li tags have no h3; skip them, otherwise the code below would raise an error
        print(h3.text)  # the headline text in the h3 tag
        p = li.find(name='p')
        print(p.text)  # the summary text in the p tag
        # get the a tag inside the li to read its href, stripping the leading //
        a = li.find(name='a')
        href = a.get('href')
        href_url = href.split('//')[1]
        print(href_url)
        print(' ' * 20)
Automatically logging in to Chouti (dig.chouti.com) and upvoting posts
# The login and upvote endpoints must carry the cookie obtained before logging in,
# so first send a GET request to fetch that cookie
import requests
from bs4 import BeautifulSoup

res = requests.get(
    url='https://dig.chouti.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
)
res_cookie = res.cookies.get_dict()

# To upvote every news item on the page we need each item's id.
# The id is stored in the a tags with class 'discus-a', so collect those tags first
soup = BeautifulSoup(res.text, 'html.parser')
a_list = soup.find_all(name='a', attrs={'class': 'discus-a'})

# log in to Chouti
login = requests.request(
    url='https://dig.chouti.com/login',
    method='POST',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
    data={
        'phone': '8618857172792',
        'password': '[email protected]',
        'oneMonth': '1',
    },
    cookies=res_cookie,
)

# iterate over the ids and upvote each item
for a in a_list:
    link_id = a.get('lang')
    res = requests.request(
        method='POST',
        url='https://dig.chouti.com/link/vote?linksId=' + link_id,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
        cookies=res_cookie,
    )
    print(res.text)
    print('*' * 20)
The essence of a crawler: write a program that simulates a browser sending requests to fetch a site's data.
Common parameters of requests.request:
method: the HTTP method, e.g. GET/POST.
url: the requested URL (domain or IP address plus path).
headers: the request headers. Example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}
# User-Agent identifies the client making the request
cookies: the cookies to send with the request.
params: parameters passed in the URL's query string.
For example: params={'user': 'tom', 'pwd': '123'}  # equivalent to http://www.xxx.com?user=tom&pwd=123
data: values passed in the request body (form-encoded).
json: sends the request body as JSON instead:
With data={'user': 'tom', 'pwd': '123'} the request body is user=tom&pwd=123; with json= it becomes '{"user": "tom", "pwd": "123"}'.
data=json.dumps({'user': 'tom', 'pwd': '123'}) produces the same body as json={'user': 'tom', 'pwd': '123'} (json= additionally sets the Content-Type header to application/json).
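The difference between the two body encodings can be checked without sending anything, using only the standard library; a minimal sketch (the payload values are just the placeholders from above):

```python
import json
from urllib.parse import urlencode

payload = {'user': 'tom', 'pwd': '123'}
form_body = urlencode(payload)   # what data=payload puts in the request body
json_body = json.dumps(payload)  # what json=payload (or data=json.dumps(payload)) puts there
print(form_body)  # user=tom&pwd=123
print(json_body)  # {"user": "tom", "pwd": "123"}
```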
files: file parameters for uploads. Example:
file_dict = {
    'f1': ('new_filename', open('filename', 'rb'))  # the second tuple element can be a file handle or the file content
}
files = file_dict
auth: basic authentication (seldom used; typically seen with browser pop-up login prompts). Example:
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
ret = requests.get(
    url='',
    auth=HTTPBasicAuth('tom', '123456'),
)
print(ret.text)
timeout: time limits. Example:
ret = requests.get(url='http://www.***.com', timeout=(10, 1))  # the first value is the connect timeout (at most 10 s), the second the read timeout (at most 1 s); the request is aborted once either is exceeded
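To see the read half of the timeout tuple in action without touching an external site, a sketch that spins up a deliberately slow local server (the sleep duration and the timeout values are arbitrary choices for the demo):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # respond slower than the read timeout below
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

timed_out = False
try:
    # timeout=(connect, read): connecting may take up to 5 s,
    # but the server must start answering within 0.5 s
    requests.get(f'http://127.0.0.1:{server.server_port}/', timeout=(5, 0.5))
except requests.exceptions.ReadTimeout:
    timed_out = True
    print('read timed out')
```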
allow_redirects: whether to follow redirects.
proxies: proxy IPs. Example:
proxies = {'http': '**.**.**.**', 'https': '**.**.**.**'}  # http traffic uses the first proxy, https traffic the second
proxies = {'http://**.**.**.**': 'http://**.**.**.**:**'}  # requests to that specific host go through the given proxy
Note: if the proxy requires a username and password, import HTTPProxyAuth:
from requests.auth import HTTPProxyAuth
proxies_dict = {'http': '**.**.**.**', 'https': '**.**.**.**'}
auth = HTTPProxyAuth('user', 'passwd')
res = requests.get(url='', proxies=proxies_dict, auth=auth)
print(res.text)
stream: used when downloading large files; the response content is read as an iterator, optionally under context management.
1. res = requests.get(url='https://www.autohome.com.cn/news/', stream=True)
   for i in res.iter_content():
       print(i)
2. from contextlib import closing
   with closing(requests.get('https://www.autohome.com.cn/news/', stream=True)) as r:
       for i in r.iter_content():
           print(i)
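In recent versions of requests the Response object is itself a context manager, so closing() is optional. A self-contained sketch against a throwaway local server (the 10 000-byte body and the 1 KiB chunk size are arbitrary demo values):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'x' * 10000
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

total = 0
with requests.get(f'http://127.0.0.1:{server.server_port}/', stream=True) as r:
    for chunk in r.iter_content(chunk_size=1024):  # read the body in 1 KiB pieces
        total += len(chunk)
print(total)  # 10000
```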
cert: the client certificate (at bottom, encryption of the data in transit, which is the difference between HTTPS and HTTP).
verify: whether to verify the server's certificate.
Example:
import requests

requests.get(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
)
requests.post(
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
    headers={},
    cookies={},
    data={},  # a GET request has no request body, so GET takes no data parameter
)
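These parameters can all be inspected without network access by preparing a request instead of sending it; a sketch (www.xxx.com is the placeholder host from the examples above, and the header and data values are made up for illustration):

```python
import requests

# build the request without sending it, to see exactly what would be transmitted
prep = requests.Request(
    method='POST',
    url='http://www.xxx.com',
    params={'user': 'tom', 'pwd': '123'},
    headers={'X-Example': 'demo'},
    cookies={'session': 'abc'},
    data={'k': 'v'},
).prepare()

print(prep.url)                # params merged into the query string
print(prep.body)               # k=v
print(prep.headers['Cookie'])  # session=abc
```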