Capturing data with Python and BeautifulSoup


This article describes how to use Python to log in to a Zhihu account and scrape the user name, the user avatar, questions, question sources, vote counts, and the answerers. The data is parsed with Beautiful Soup.
The first problem to solve is logging in to Zhihu. The program cannot simply log in with a user name and password alone, so here we use a crude workaround: attach the cookies directly to the request. The cookie values can be captured with Firebug in Firefox while logged in to the site, and are stored in a config.ini file laid out as follows:

    [info]
    email = youremail
    password = yourpassword

    [cookies]
    q_c1 =
    cap_id =
    _za =
    __utmt =
    __utma =
    __utmb =
    __utmc =
    __utmz =
    __utmv =
    z_c0 =
    unlock_ticket =
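These values only need to be pasted in once; at run time they are read back into a Python dict and attached to the requests we send. Below is a minimal sketch of that idea, assuming the config.ini layout above and Python 2's ConfigParser module (the create_session function later in the article does the same thing as part of the full login flow):

    # Minimal sketch: load the browser cookies from config.ini and attach
    # them to a requests session, so each request is sent as the logged-in user.
    import ConfigParser
    import requests

    cf = ConfigParser.ConfigParser()
    cf.read('config.ini')
    # The whole [cookies] section becomes a dict, e.g. {'z_c0': '...', 'q_c1': '...'}
    cookies = dict(cf.items('cookies'))

    session = requests.session()
    r = session.get('http://www.zhihu.com/', cookies=cookies)
    print r.status_code   # 200 if the cookies are accepted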

Then we can use Python's requests and ConfigParser modules to implement the login code below (urllib is used later on to download the avatar):

    # Zhihu login
    import ConfigParser

    import requests


    def create_session():
        cf = ConfigParser.ConfigParser()
        cf.read('config.ini')
        # Read the cookie values from the config file and convert them to a dict
        cookies = dict(cf.items('cookies'))
        from pprint import pprint
        pprint(cookies)
        # Read the login e-mail and password
        email = cf.get('info', 'email')
        password = cf.get('info', 'password')

        session = requests.session()
        login_data = {'email': email, 'password': password}
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36',
            'Host': 'www.zhihu.com',
            'Referer': 'http://www.zhihu.com/'
        }
        # POST the login form
        r = session.post('http://www.zhihu.com/login/email',
                         data=login_data, headers=header)
        if r.json()['r'] == 1:
            print 'Login failed, reason is:',
            for m in r.json()['data']:
                print r.json()['data'][m]
            print 'So we use the cookies to login in...'
            has_cookies = False
            for key in cookies:
                if key != '__name__' and cookies[key] != '':
                    has_cookies = True
                    break
            if has_cookies is False:
                raise ValueError('Please fill in the cookies in the config.ini file.')
            else:
                # Log in with the cookies captured from the browser
                r = session.get('http://www.zhihu.com/login/email', cookies=cookies)
        return session, cookies
Whether the login succeeded can be checked by looking at the saved response page and seeing whether it contains the logged-in user name. After a successful login, we pass the Zhihu page returned by the request to BeautifulSoup's constructor to obtain a document object. During this conversion the document is first converted to Unicode (HTML entities are converted to Unicode as well), and Beautiful Soup then picks the most suitable parser to parse it; if a parser is specified by hand, Beautiful Soup uses that one instead. Beautiful Soup turns a complex HTML document into a tree of Python objects, one per node, and by walking these nodes alongside the raw HTML we can extract the content we need, such as the logged-in user name and the parts of the document that hold the Zhihu questions. A quick login check along these lines is sketched below.
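As a quick sanity check, here is a minimal sketch, assuming the page layout used by the full script below (the user name lives in a div with class top-nav-profile) and that the page has already been saved to url.html as that script does:

    # Minimal sketch: parse the saved page with an explicit parser and check
    # that it contains the logged-in user name (i.e. that the login worked).
    from bs4 import BeautifulSoup

    content = open('url.html').read()              # page saved by the crawl script
    soup = BeautifulSoup(content, 'html.parser')   # or 'lxml' if it is installed

    profile = soup.find('div', class_='top-nav-profile')
    if profile and profile.a and profile.a.span:
        print 'login ok, user name: %s' % profile.a.span.string
    else:
        print 'login may have failed: user name not found in the page'

With the login confirmed, our complete crawling and data-extraction code is as follows: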


    # -*- coding: utf-8 -*-
    '''
    Web crawler with user name / password (and cookie) login: crawl the Zhihu site
    '''
    import ConfigParser
    import urllib

    import requests
    from bs4 import BeautifulSoup


    def create_session():
        cf = ConfigParser.ConfigParser()
        cf.read('config.ini')
        # Read the cookie values from the config file and convert them to a dict
        cookies = dict(cf.items('cookies'))
        from pprint import pprint
        pprint(cookies)
        # Read the login e-mail and password
        email = cf.get('info', 'email')
        password = cf.get('info', 'password')

        session = requests.session()
        login_data = {'email': email, 'password': password}
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36',
            'Host': 'www.zhihu.com',
            'Referer': 'http://www.zhihu.com/'
        }
        # POST the login form
        r = session.post('http://www.zhihu.com/login/email',
                         data=login_data, headers=header)
        if r.json()['r'] == 1:
            print 'Login failed, reason is:',
            for m in r.json()['data']:
                print r.json()['data'][m]
            print 'So we use the cookies to login in...'
            has_cookies = False
            for key in cookies:
                if key != '__name__' and cookies[key] != '':
                    has_cookies = True
                    break
            if has_cookies is False:
                raise ValueError('Please fill in the cookies in the config.ini file.')
            else:
                # Log in with the cookies captured from the browser
                r = session.get('http://www.zhihu.com/login/email', cookies=cookies)
        return session, cookies


    if __name__ == '__main__':
        requests_session, requests_cookies = create_session()
        url = 'http://www.zhihu.com'
        # Already logged in at this point
        reqs = requests_session.get(url, cookies=requests_cookies)
        content = reqs.content
        # Save the whole page as an HTML file
        with open('url.html', 'w') as fp:
            fp.write(content)

        soup = BeautifulSoup(content, 'html.parser')
        # Get the logged-in user name
        user_name = soup.find('div', class_='top-nav-profile').a.span.string
        print 'user_name: %s' % user_name
        # Get the user avatar URL and download the picture
        pic_url = soup.find('div', class_='top-nav-profile').a.img
        urllib.urlretrieve(pic_url['src'], '/home/zeus/pic1/' + '1.jpg')
        print 'photos: %s' % pic_url['src']

        # Walk through the first 10 feed items
        for topic in soup.find_all('div', class_='feed-main', limit=10):
            print '-------------------------------------------------------'
            # Source of the Zhihu question
            topic_source = topic.find('div', class_='feed-source').a.get_text()
            print 'topic source: %s' % topic_source
            # The question itself
            question = topic.find('div', class_='content').a.get_text()
            print 'question: %s' % question
            # Number of up-votes on the answer
            votecount = topic.find('div', class_='zm-item-vote').a.get_text()
            print 'votecount: %s' % votecount
            # User name of the person who answered
            answer = topic.find('div', class_='zm-item-rich-text js-collapse-body')
            if answer:
                print 'answer_name: %s' % answer['data-author-name']
            print '-------------------------------------------------------'

Running the code prints the logged-in user name, the avatar URL, and the first ten feed items (question, source, vote count, and answerer) to the terminal, and this output can be compared against the same page opened in the browser.

The captured user avatar is saved on the local machine as 1.jpg in the pic1 folder under /home/zeus.

This shows that the program has successfully crawled the logged-in user's Zhihu data.

