Environment
Python 2.7, PyCharm
Topic: Crawling videos with Python (desktop version): a crawler plus a desktop application
Advantages of Python: simple syntax, quick to pick up, less code, high development efficiency, many third-party libraries
1. Graphical user interface (GUI)
2. Crawler: crawl the videos and download them
3. Combine the two and show the results in the GUI
Regular expressions: a pattern that describes the form of the text you want to match
Matching: re.findall(regular expression, source code)
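As a small illustration of re.findall, the sketch below pulls a video name and its URL out of a page fragment. The sample_html string and the extract helper are made up for this example; they only imitate the markup the crawler looks for, and the real page may differ.

```python
import re

# Hypothetical fragment imitating two list items: one video, one picture.
sample_html = '''
<div class="j-r-list-c">
  <div class="j-r-list-c-desc">
    <a href="/detail-12345678.html">Funny cat</a>
  </div>
  <div class="j-video" data-mp4="http://example.com/cat.mp4">
  </div>
</div>
<div class="j-r-list-c">
  <div class="j-r-list-c-desc">
    <a href="/detail-87654321.html">Just a picture</a>
  </div>
  <div class="j-r-list-c-img">
  </div>
</div>
'''

# re.S lets "." match newlines, so one pattern can span several lines.
block_re = re.compile(r'<div class="j-r-list-c">.*?</div>.*?</div>', re.S)
name_re = re.compile(r'<a href="/detail-.{8}?\.html">(.*?)</a>', re.S)
url_re = re.compile(r'data-mp4="(.*?)">')

def extract(html):
    """Return (name, url) pairs for every block that contains a video."""
    pairs = []
    for block in block_re.findall(html):
        urls = url_re.findall(block)
        if urls:  # blocks without data-mp4 (pictures) are skipped
            names = name_re.findall(block)
            pairs.extend(zip(names, urls))
    return pairs

print(extract(sample_html))
```

Compiling the patterns once and reusing them avoids recompiling inside the loop; the lazy quantifier .*? keeps each match as short as possible.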
Knowledge points:
1. How to create a window
2. How to add a text box with a scroll bar and a clickable button
3. Some websites block crawlers: add header information (a browser User-Agent) to disguise the request as a browser visit
4. Open the web page and get its source with requests
5. Get the video name
6. Download and display
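Point 3 can be sketched without touching the network. The example below builds an unsent request with the standard library (urllib.request on Python 3, urllib2 on Python 2.7) rather than requests, only so that the attached header can be inspected; build_request and BROWSER_HEADERS are hypothetical names for this illustration.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2.7

# A browser-like User-Agent string, so the server sees a normal browser visit.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

def build_request(page):
    """Return an unsent request for the given page number (hypothetical helper)."""
    url = 'http://www.budejie.com/video/' + str(page)
    return Request(url, headers=BROWSER_HEADERS)

req = build_request(1)
print(req.get_full_url())
# urllib normalizes header names to this capitalization internally.
print(req.get_header('User-agent'))
```

With requests, the same dictionary is passed as requests.get(url, headers=...), which is the form used in the code of this post.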
Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: benjaminYang
from Tkinter import *
from ScrolledText import ScrolledText  # text box with a scroll bar
import urllib, requests
import re
import threading  # multithreaded processing and control
# import time
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

url_name = []  # [name, url] pairs
a = 1  # page number

def get():
    global a  # modify the global variable
    hd = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    url = 'http://www.budejie.com/video/' + str(a)
    varl.set('Fetched the videos on page %s' % (a))
    html = requests.get(url, headers=hd).text  # send a GET request to fetch the page source
    # print html  # full source of the page
    # Compile once for efficiency; re.S lets "." match line breaks
    url_content = re.compile(r'(<div class="j-r-list-c">.*?</div>.*?</div>)', re.S)
    url_contents = re.findall(url_content, html)
    # print url_contents  # video name + video URL
    for i in url_contents:
        url_reg = r'data-mp4="(.*?)">'
        url_items = re.findall(url_reg, i)
        # print url_items
        if url_items:  # if there is a video, match its name; if it is a picture, skip it
            name_reg = re.compile(r'<a href="/detail-.{8}?.html">(.*?)</a>', re.S)
            name_items = re.findall(name_reg, i)
            # print name_items  # the Chinese strings in the list are all Unicode
            for i, k in zip(name_items, url_items):  # zip pairs the two iterables element by element
                url_name.append([i, k])
                print i, k
    a += 1  # assumed fix: advance to the next page, otherwise page 1 is fetched again
    return url_name

id = 1  # number of videos

def write():
    global id
    while id < 10:
        url_name = get()  # call get() for the video names + URLs
        for i in list(url_name):  # iterate over a copy, since items are removed below
            # Windows only recognizes GBK: decode to Unicode first, then encode as GBK
            urllib.urlretrieve(i[1], 'Video\\%s.mp4' % (i[0]).decode('utf-8').encode('gbk'))  # download with urlretrieve
            text.insert(END, str(id) + '.' + i[1] + '\n' + i[0] + '\n')
            url_name.pop(0)  # delete the element that was just downloaded
            id += 1
    varl.set('Hi: video links and videos have all been crawled, done!')

def start():
    # Pass the function itself; target=write() would run it in the GUI thread
    th = threading.Thread(target=write)  # create a thread
    th.start()

root = Tk()  # instantiate the main window
root.title('Budejie Video Downloader')
text = ScrolledText(root, font=('Microsoft YaHei', 10))
text.grid()  # one way to lay out widgets
button = Button(root, text='Start Crawl', font=('Microsoft YaHei', 10), command=start)  # bind the button to start()
button.grid()
varl = StringVar()  # a variable bound through Tk
label = Label(root, font=('Microsoft YaHei', 10), fg='red', textvariable=varl)
label.grid()
varl.set('The panda is ready ...')
root.mainloop()  # start the window's event loop
Demo:
Crawler Exercise II: GUI + downloading videos from the Budejie website (www.budejie.com)