Cat's Eye Movie Crawl (II): requests + BeautifulSoup, storing the data in a MySQL database

In the last article we crawled the Maoyan (cat's-eye) movie list with requests; this time we crawl it again with requests + BeautifulSoup (in fact, this site is better suited to crawling with the BeautifulSoup library).
1. Analyze the web page source code first
You can see that each movie's information is contained in a pair of <dd>...</dd> tags, so the first step is to use the BeautifulSoup library to parse out all the <dd> tag pairs, and then from each <dd> extract the <i> tag that holds the ranking, the <p> tag that holds the movie name, the <p> tag that holds the release time, and the <p> tag that holds the score.
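To make that concrete before the full script, here is a minimal, runnable sketch that parses one simplified <dd> block. The sample markup is hypothetical, modeled on the structure just described (the class names match the ones used in the code below; the real maoyan.com markup carries more attributes):

# A minimal sketch: one simplified, hypothetical <dd> block modeled on the
# structure described above; the real page wraps each movie the same way.
from bs4 import BeautifulSoup

sample = ('<dd>'
          '<i class="board-index">1</i>'
          '<p class="name"><a href="/films/1203">Some Movie</a></p>'
          '<p class="releasetime">Release time: 1993-01-01</p>'
          '<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>'
          '</dd>')

soup = BeautifulSoup(sample, 'html.parser')
dd = soup.find('dd')
print(dd.i.string)                                # 1, from the first <i> tag
print(dd.find('p', class_='name').string)         # Some Movie
print(dd.find('p', class_='releasetime').string)  # Release time: 1993-01-01
parts = dd.find('p', class_='score').contents     # the two <i> children, as a list
print(parts[0].string + parts[1].string)          # 9.6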
2. Information extraction code
# coding: utf-8
# author: HMK

import requests
from bs4 import BeautifulSoup

url = 'http://maoyan.com/board/4'
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "maoyan.com",
    "Referer": "http://maoyan.com/board",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36"
}

r = requests.get(url, headers=header)
r.encoding = r.apparent_encoding
html = r.text
soup = BeautifulSoup(html, 'html.parser')
# print(soup.find_all('dd'))

movie_list = []  # define the list outside the loop, otherwise it is emptied on
                 # every pass and only the last movie's data would be left

for dd in soup.find_all('dd'):
    index = dd.i.string                                       # movie ranking
    movie = dd.find('p', class_='name').string                # movie name
    release_time = dd.find('p', class_='releasetime').string  # release time
    s = dd.find('p', class_='score').contents                 # score
    score = s[0].string + s[1].string  # stitch the integer and fractional parts of the score
    # add each movie's rank, name, release time and score to the big list
    movie_list.append([index, movie, release_time, score])

print(movie_list)
The focus of the above code is how the information is extracted and then combined inside the for loop. The ideas are:

(1) First extract all the <dd> tag pairs from the page, then use a for loop to assign each <dd> in turn to a variable dd; each dd is a bs4.element.Tag object.

(2) Once you have the dd Tag object, you can use its find method to extract dd's sub-tags. (I initially fell into a misunderstanding here: because the printed dd content looks like HTML, I wondered whether dd could be passed back into BeautifulSoup to generate a new BeautifulSoup object. Practice proved it cannot, because the type of dd is already <class 'bs4.element.Tag'>, while the html = r.text passed in earlier is of type <class 'str'>; obviously you cannot do that! So whenever you can't figure something out, just print the object's type.)

(3) Extract the ranking with dd.i.string: dd.i picks out the first <i> tag under the <dd> tag, and .string extracts its text.

(4) Extract the movie name with dd.find('p', class_='name').string, which picks out the <p> tag under <dd> whose class attribute is name, because the movie name lives in that <p> tag.

(5) Extract the release time with dd.find('p', class_='releasetime'), then take its .string in the same way.

(6) Extract the score. The score is split into two parts, an integer part and a fractional part, each in its own <i> tag under a single <p> tag, so use the tag's .contents attribute (a tag's .contents outputs its child nodes as a list). The two parts are then stitched together into the complete score:

dd.find('p', class_='score').contents[0].string + dd.find('p', class_='score').contents[1].string

A quick sketch of points (2) and (6) follows.
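Since the misunderstanding in (2) came from confusing object types, this small sketch (on a tiny standalone snippet, not the real page) prints the types involved and shows the .contents stitching from (6):

# A quick type-checking sketch on a tiny standalone snippet (not the real
# page), illustrating points (2) and (6) above.
from bs4 import BeautifulSoup

html = "<dd><i>1</i><p class='score'><i>9.</i><i>6</i></p></dd>"
soup = BeautifulSoup(html, 'html.parser')
dd = soup.find('dd')

print(type(html))  # <class 'str'>               -> this is what BeautifulSoup() accepts
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(dd))    # <class 'bs4.element.Tag'>   -> already parsed; call .find() on it directly

parts = dd.find('p', class_='score').contents  # direct children, as a list
print(parts)                                   # [<i>9.</i>, <i>6</i>]
print(parts[0].string + parts[1].string)       # 9.6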
3. Complete code
# coding: utf-8
# author: HMK

import requests
from bs4 import BeautifulSoup
import pymysql


def get_html(url, header):
    """Request a page and return its text, or None on any failure."""
    try:
        r = requests.get(url=url, headers=header, timeout=20)
        r.encoding = r.apparent_encoding
        if r.status_code == 200:
            return r.text
        else:
            return None
    except requests.RequestException:
        return None


def get_data(html, list_data):
    """Parse one page and append each movie's fields to list_data."""
    soup = BeautifulSoup(html, 'html.parser')
    dd = soup.find_all('dd')
    for t in dd:
        ranking = t.i.string                                     # ranking
        movie = t.find('p', class_='name').string                # movie name
        release_time = t.find('p', class_='releasetime').string  # release time
        score = t.find('p', class_='score').contents[0].string + \
                t.find('p', class_='score').contents[1].string   # score
        list_data.append([ranking, movie, release_time, score])


def write_sql(data):
    """Write the scraped rows to MySQL.

    The data parameter is the big list that contains all the movie
    information; each of its elements is itself a small list for one
    movie. Iterate over the big list, extract each group of movie
    information, and write it to the database.
    """
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='123456',
                           db='test',
                           charset='utf8')
    cur = conn.cursor()
    for i in data:
        movie = i  # one movie's data, the row to be inserted into the database
        sql = "INSERT INTO maoyan_movie (ranking, movie, release_time, score) VALUES (%s, %s, %s, %s)"  # SQL INSERT statement
        try:
            cur.execute(sql, movie)  # execute the statement with this row's data
            conn.commit()            # do not forget to commit after the insert
            print('Import succeeded')
        except pymysql.MySQLError:
            print('Import failed')
    cur.close()   # close the cursor
    conn.close()  # close the connection


def main():
    start_url = 'http://maoyan.com/board/4'
    depth = 10  # crawl depth (number of pages to flip through)
    header = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "maoyan.com",
        "Referer": "http://maoyan.com/board",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36"
    }
    for i in range(depth):
        url = start_url + '?offset=' + str(10 * i)
        html = get_html(url, header)
        list_data = []
        get_data(html, list_data)
        write_sql(list_data)
        print(list_data)


if __name__ == "__main__":
    main()
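Note that write_sql assumes the maoyan_movie table already exists in the test database. The article does not show its schema, so the following is only a one-off setup sketch: the column names match the INSERT above, but the column types are assumptions (all VARCHAR, since every scraped field arrives as a string):

# Hypothetical setup sketch: the maoyan_movie schema is not shown in the
# article, so the column types below are assumptions (all VARCHAR, since
# every scraped field arrives as a string).
import pymysql

conn = pymysql.connect(host='localhost', user='root',
                       password='123456', db='test', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS maoyan_movie (
        ranking      VARCHAR(10),   -- e.g. '1'
        movie        VARCHAR(100),  -- movie name
        release_time VARCHAR(50),   -- the full 'Release time: ...' text
        score        VARCHAR(10)    -- e.g. '9.6'
    ) DEFAULT CHARSET = utf8
""")
conn.commit()
cur.close()
conn.close()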