Subject: Maizi Academy (maiziedu.com)
1. Most of the course information is listed at http://www.maiziedu.com/course/all/. Each course has its own ID, so the first address to query is 'http://www.maiziedu.com/course/' + the ID.
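As a rough sketch of that ID scheme (nothing below is from the original code: the helper name is made up, and it assumes course links on the list page follow the /course/<id>/ pattern described above):

import re
import urllib.request

def list_course_ids():
    # scan the course list page for links that follow the assumed /course/<id>/ pattern
    page = urllib.request.urlopen('http://www.maiziedu.com/course/all/')
    context = page.read().decode('utf8')
    return sorted(set(re.findall(r'/course/(\d+)/', context)), key=int)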
Parse the course page to get its title, which is then used as the directory name when creating the folder:
import re
import urllib.request

def get_course_info(num):    # function name assumed; the original only shows the body
    url_dict1 = {}
    url = 'http://www.maiziedu.com/course/{}'.format(num)
    page = urllib.request.urlopen(url)
    context = page.read().decode('utf8')
    # pull the page title out of the raw HTML
    title = re.search('<title>(.*)</title>', context)
    title = title.group(1)
    if ' + ' in title:
        return {}
    else:
        # further trim the title so it works as a folder name
        if len(title.split(":")) != 1:
            title = title.split(":")[1]
        title = title.split('-')[0]
        url_dict1['url'] = url
        url_dict1['title'] = title
        # urls.append(url_dict1)
        return url_dict1
This returns a dictionary containing the course URL and title.
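Assuming the get_course_info(num) wrapper used in the snippet above (the name is my own, not from the original), a quick usage sketch with a made-up course ID:

import os

info = get_course_info(709)    # 709 is a made-up course ID, purely for illustration
if info:
    print(info['title'], info['url'])
    os.makedirs(info['title'], exist_ok=True)    # the trimmed title becomes the folder name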
The next part parses the page to get the addresses of all chapters. BS4 is used to analyze the page and extract the links to the play pages:
import urllib.request
from bs4 import BeautifulSoup

def get_chapter_ids(url):    # function name assumed; the original only shows the body
    urls = []
    page = urllib.request.urlopen(url)
    context = page.read().decode('utf8')
    soup = BeautifulSoup(context, "html.parser")
    # each <li> in the lesson list holds a link to one chapter's play page
    for tag in soup.find('ul', class_='lesson-lists').find_all('li'):
        urls.append(tag.find('a').get('href').split('/')[-2])
    return urls
This returns the play-page IDs of all chapters.
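A hedged sketch of how the two pieces can be chained, assuming the lesson list sits on the course page itself and reusing the assumed names get_course_info and get_chapter_ids from the snippets above:

info = get_course_info(709)    # made-up course ID again, for illustration only
if info:
    chapter_ids = get_chapter_ids(info['url'])
    print('%s has %d chapters' % (info['title'], len(chapter_ids)))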
2. Play page analysis: the playback path is passed to the JS in plain text, so the page title and the video URL used in the JS call can be pulled straight out of the page:
import re
import urllib.request
from bs4 import BeautifulSoup

def get_video_info(url):    # function name assumed; the original only shows the body
    page = urllib.request.urlopen(url)
    context = page.read().decode('utf8')
    soup = BeautifulSoup(context, "html.parser")
    title = soup.find('div', class_='bottom-module').find_all('span')
    # strip the HTML tags (exact pattern assumed) and use the remaining text as the file name
    title = re.compile('<.*?>', re.S).sub('', str(title)) + '.mp4'
    # the playback address sits in plain text inside the third <script> block
    ok = soup.find_all('script')[2]
    return ok.string.split('"')[-2], title
The title is obtained here so that the file can be saved under the same name the content carries on the web page.
3. Download the file:
import os
import sys
import urllib.request

def report(count, blocksize, totalsize):
    j = '#'
    percent = int(count * blocksize * 100 / totalsize)
    # redraw a simple text progress bar on the same line
    sys.stdout.write(str(percent) + '% [' + j * int(percent / 2) + ']\r')
    sys.stdout.flush()

def download(url, filename):
    base_dir = os.path.split(os.path.realpath(__file__))[0]
    save_file = os.path.join(base_dir, filename)
    if not os.path.exists(save_file):
        urllib.request.urlretrieve(url, save_file, reporthook=report)
        sys.stdout.write("\n\rdownload complete, saved as %s" % save_file + '\n\r')
        sys.stdout.flush()
    else:
        print('File already exists! Skipping, moving on to the next download')
report() displays the download progress on the command line.
urlretrieve handles the download here; other approaches would work as well, as sketched below.
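As a minimal sketch of one such alternative, streaming the response to disk with urlopen instead of urlretrieve (the helper name and chunk size are arbitrary choices, not from the original code):

import shutil
import urllib.request

def download_stream(url, save_file):
    # copy the HTTP response to disk in 64 KB chunks
    with urllib.request.urlopen(url) as resp, open(save_file, 'wb') as out:
        shutil.copyfileobj(resp, out, length=64 * 1024)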
Summary:
Using urllib + BS4 directly gives a rough way to download the Maizi Academy video files. There are still many unsatisfactory parts that need improvement, and for copyright reasons the full source code is not posted.
Crawling videos with BS4 and Urllib