Crawling Sina Weibo requires a logged-in session. Rather than simulating the login flow itself, this post reuses the cookie from an existing browser login.
To obtain the cookie:
[Screenshots: locating and copying the cookie in the browser]
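The script below passes the whole raw Cookie header as a single entry. Another option is to split it into individual name/value pairs, which is the shape that the `cookies=` parameter of `requests` actually expects. A minimal sketch of that splitting step (the helper name and the sample values are mine, not from the original post):

```python
def cookie_header_to_dict(raw):
    """Turn a raw 'k1=v1; k2=v2' Cookie header into a dict for requests."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # partition on the first '=' so values containing '=' stay intact
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

# Sample (fake) values in the same shape as a real Weibo cookie:
raw = "_t_wm=abc123; SUB=_2A25xyz; SSOLoginState=1502098607"
print(cookie_header_to_dict(raw))
# → {'_t_wm': 'abc123', 'SUB': '_2A25xyz', 'SSOLoginState': '1502098607'}
```

The resulting dict can then be passed as `requests.get(url, cookies=cookie_header_to_dict(raw))`.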
Code:
# -*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import os
import sys
import random

reload(sys)
sys.setdefaultencoding('utf-8')

user_id = 573550093  # the target user's id
cookie = {"Cookie": "_t_wm=f3a2assae4335dfdf38fdc7a25a88; scf=Apmi3mluv9yh6ykz4i7-hmlhojzptqulc5g0xlrri-neo3xn1frwi5w1helzwg1bmkx4mv_ohkdtnv2ihxjqgls.; SUB=_2A250jET_DeRhGeNN7FsX9CrIzzqIHXVXj2y3rDV6PUJbkdBeLUrnkW1AtfoOlrd_kyd1Izu7Q1uKaFvRDQ..; SUHB=0k1ysjsrjvbdgd; SSOLoginState=1502098607"}

for page in range(0, 100):  # the upper bound was elided in the original; the loop breaks at the last page anyway
    url = 'https://weibo.cn/%s?page=%d' % (user_id, page)
    response = requests.get(url, cookies=cookie)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    username = soup.title.string

    cttlist = []
    for ctt in soup.find_all('span', class_='ctt'):  # post text
        cttlist.append(ctt.get_text())
    ctlist = []
    for ct in soup.find_all('span', class_='ct'):    # post time/source
        ctlist.append(ct.get_text())

    if page == 0:
        print "Weibo user profile: " + cttlist[0]
        print "Weibo user signature: " + cttlist[1]
        print "User's Weibo posts:\n"

    # collect image URLs, skipping avatar/icon hosts
    imgurllist = []
    for a in soup.find_all('a'):
        img = a.find('img')
        if img is not None:
            if 'http://tva3.' not in img['src'] and 'https://h5' not in img['src']:
                imgurllist.append(img['src'])

    # one folder per user, named after the page title
    if not os.path.exists(str(soup.title.string)):
        os.mkdir(str(soup.title.string))

    for imgurl in imgurllist:
        # unique filename: title + page number + timestamp + random suffix
        imgname = './' + str(soup.title.string) + '/' + soup.title.string + '_' + \
                  str(page) + str(time.time()) + str(random.randrange(0, 1000, 3)) + '.jpg'
        response = requests.get(imgurl)
        open(imgname, 'wb').write(response.content)
        time.sleep(1.5)

    try:
        for i in range(len(ctlist)):
            print cttlist[2 + i]  # offset past the two profile entries on page 0
            print ctlist[i]
            print "\n"
    except IndexError:
        for i in range(len(ctlist)):
            print cttlist[i]
            print ctlist[i]
            print "\n"

    # stop when the pager no longer offers a next page
    # (label shown translated; the live page uses the Chinese label)
    if "Next page" not in soup.select('div[id="pagelist"]')[0].get_text():
        break
    time.sleep(random.randint(1, 3))
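Two small pieces of the image-download step above, the host filter that skips avatars/icons and the collision-resistant filename built from the page number, a timestamp, and a random suffix, can be isolated and tested on their own. The helper names below are mine; the filtering rule and the naming scheme follow the script:

```python
import os
import random
import time

def keep_image(src):
    """Mirror the script's filter: drop avatar/icon hosts."""
    return 'http://tva3.' not in src and 'https://h5' not in src

def unique_image_name(folder, title, page):
    """Build './<folder>/<title>_<page><timestamp><rand>.jpg' as the script does."""
    suffix = str(time.time()) + str(random.randrange(0, 1000, 3))
    return os.path.join('.', folder, '%s_%d%s.jpg' % (title, page, suffix))

print(keep_image('https://wx1.sinaimg.cn/large/abc.jpg'))    # → True
print(keep_image('http://tva3.sinaimg.cn/crop/avatar.jpg'))  # → False
```

Keeping these as separate functions also makes it easy to tighten the filter later; matching on full hostnames instead of URL prefixes would be more robust than the substring checks used here.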
Sample output:
[Screenshots: console output and the downloaded results]
This article is from the "Shangwei Super" blog; please be sure to keep this source: http://9399369.blog.51cto.com/9389369/1954433
Crawling Weibo information with Python