Python Crawlers (2) Crawler Basics: Selenium and PhantomJS
I. Preface
A while ago I crawled NetEase Cloud Music songs; this time I plan to crawl QQ Music's song information. NetEase Cloud Music displays its songs inside an iframe, so Selenium can be used to get the page elements inside the iframe.
QQ Music uses asynchronous loading instead, which is a different routine. This is the mainstream way of loading pages nowadays; it makes crawling a bit harder, but it is also a good challenge.
II. Crawling QQ Music songs with Python
A MOOC video I watched earlier gives a good explanation of the general steps for writing a crawler, and we follow the same steps here.
Crawler steps
1. Determine the target
First, we need to be clear about the goal. This time we crawl the songs of the QQ Music singer Andy Lau (Liu Dehua).
Determine the target (Baidu Encyclopedia) -> analyze the target (strategy: URL format (range), data format, page encoding) -> write the code -> run the crawler
2. Analyze the target
Target link: https://y.qq.com/n/yqq/singer/003aQYLo2x8izP.html#tab=song&
The song list is paginated: 30 songs are displayed on each page, 30 pages in total. Clicking a page number or the ">" control on the far right jumps to the next page, and the browser sends an ajax asynchronous request to the server. The begin and num parameters in the request link indicate the index of the first song on the page (page 2 starts at 30) and the number of songs returned per page (30). The server returns the song information as JSON wrapped in a callback (MusicJsonCallbacksinger_track({"code": 0, "data": {"list": [{"Flisten_count1": ...}]}})), so if you only want the song information you can splice the request link yourself and parse the returned JSON data. That approach of parsing the data format directly is not used here; instead I use Python with Selenium: each time the song information on one page has been fetched and parsed, click ">" to jump to the next page and keep parsing until all song-list pages have been parsed and recorded. Finally, request the link of each song to get its detailed information.
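As a side note, if you did want to take the splice-the-link route mentioned above, a minimal sketch could look like the following. The endpoint URL here is only a placeholder (copy the real request URL from the browser's developer tools); the begin/num parameters and the MusicJsonCallbacksinger_track wrapper are the ones observed in the analysis above.

import json
import urllib.parse
import urllib.request

# Placeholder: replace with the actual ajax request URL seen in the browser's dev tools.
AJAX_URL = 'https://example.com/fcg_singer_track'

def fetch_song_page(begin=0, num=30):
    params = urllib.parse.urlencode({'begin': begin, 'num': num})
    raw = urllib.request.urlopen(AJAX_URL + '?' + params).read().decode('utf-8')
    # Strip the JSONP wrapper: MusicJsonCallbacksinger_track({...})
    payload = raw[raw.index('(') + 1:raw.rindex(')')]
    data = json.loads(payload)
    return data['data']['list']  # one dict per song, as described above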
In the page source, all the song information sits in a div floating layer with the class mod_songlist. Inside it, the unordered list ul with the class songlist__list contains one song per li child element; the a tag under the div with class songlist__album carries the song link, and the song name and duration are in the li as well.
3. Write the code
1) Download the page content. Here we use the Python urllib standard library and wrap it in a download method:
import urllib.request
import urllib.error

def download(url, user_agent='wswp', num_retries=2):
    if url is None:
        return None
    print('downloading:', url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # set the user agent; wswp = Web Scraping with Python
    try:
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('downloading error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry when the return code is a 5xx HTTP error
                return download(url, user_agent, num_retries - 1)  # the request failed; retry twice by default
    return html
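A quick check of the helper, using the singer page from the analysis step (note that the song list itself is filled in by ajax, so plain urllib only returns the static page shell):

html = download('https://y.qq.com/n/yqq/singer/003aQYLo2x8izP.html')
print('download ok' if html else 'download failed')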
2) Parse the page content. Here we use the third-party library BeautifulSoup. For details, refer to the BeautifulSoup API documentation.
from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException

def music_scrapter(html, page_num=0):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        mod_songlist_div = soup.find_all('div', class_='mod_songlist')
        songlist_ul = mod_songlist_div[1].find('ul', class_='songlist__list')
        '''parse each li for its song information'''
        lis = songlist_ul.find_all('li')
        for li in lis:
            a = li.find('div', class_='songlist__album').find('a')
            music_url = a['href']  # song link
            urls.add_new_url(music_url)  # save the song link
            # print('music_url: {0}'.format(music_url))
        print('total music link num: %s' % len(urls.new_urls))
        next_page(page_num + 1)
    except TimeoutException as err:
        print('webpage parsing error:', err.args)
        return next_page(page_num + 1)
    return None
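The next_page() helper called above is not shown in this excerpt. A minimal sketch of what it might look like, assuming a global Selenium driver (PhantomJS, matching the title, or Chrome) that clicks the ">" control described in the analysis step; the CSS selector for that button is an assumption and should be taken from the actual page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.PhantomJS()  # assumption: Selenium 2/3 era API; webdriver.Chrome() also works

def next_page(page_num):
    if page_num > 30:       # 30 list pages in total, per the analysis above
        get_music()         # all list pages parsed, now fetch the song details
        return
    # hypothetical selector -- inspect the real ">" element on the page and adjust
    next_btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.next')))
    next_btn.click()
    music_scrapter(driver.page_source, page_num)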
def get_music():
    try:
        while urls.has_new_url():
            # print('urls count: %s' % len(urls.new_urls))
            '''jump to the song link and get the song details'''
            new_music_url = urls.get_new_url()
            print('url leave count: %s' % str(len(urls.new_urls) - 1))
            html_data_info = download(new_music_url)
            # if the download failed, go straight to the next iteration to avoid interrupting the program
            if html_data_info is None:
                continue
            soup_data_info = BeautifulSoup(html_data_info, 'html.parser')
            if soup_data_info.find('div', class_='none_txt') is not None:
                print(new_music_url, 'sorry, this album cannot be viewed for copyright reasons!')
                continue
            mod_songlist_div = soup_data_info.find('div', class_='mod_songlist')
            songlist_ul = mod_songlist_div.find('ul', class_='songlist__list')
            lis = songlist_ul.find_all('li')
            del lis[0]  # delete the first li
            # print('len(lis): %s' % len(lis))
            for li in lis:
                a_songname_txt = li.find('div', class_='songlist__songname').find('span', class_='songlist__songname_txt').find('a')
                song_url = a_songname_txt['href']
                if 'https' not in song_url:  # if the song link has no protocol scheme, add one
                    song_url = 'https:' + song_url
                song_name = a_songname_txt['title']
                singer_name = li.find('div', class_='songlist__artist').find('a').get_text()
                song_time = li.find('div', class_='songlist__time').get_text()
                music_info = {}
                music_info['song_name'] = song_name
                music_info['song_url'] = song_url
                music_info['singer_name'] = singer_name
                music_info['song_time'] = song_time
                collect_data(music_info)
    except Exception as err:  # skip on parsing exceptions
        print('downloading or parse music information error continue:', err.args)
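The collect_data() call and the global urls object used in get_music() are also not defined in this excerpt. A minimal sketch of how they might be wired together (the buffer list and its name are assumptions):

datas = []           # assumption: buffer for the song dicts produced by get_music()
urls = UrlManager()  # the URL manager class shown in the summary below

def collect_data(music_info):
    datas.append(music_info)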
4. Run the crawler
The crawler ran, crawling the song links page by page and saving them into the set; finally, the get_music() method fetched each song's name, link, artist, and duration and saved them to an Excel file.
III. Summary of crawling QQ Music songs with Python
1. The song list is paginated. Switching to the next page triggers an asynchronous ajax request that fetches JSON data from the server and renders it into the page, while the URL in the browser's address bar stays unchanged, so the following pages cannot be fetched simply by splicing the address-bar link. I first thought about simulating the ajax request with the Python urllib library, then decided to use Selenium instead. Selenium can simulate real browser operations, and locating page elements is very convenient. It simulates clicking the next page, switching song-list pages one after another, and the page source is then parsed with BeautifulSoup to extract the song information.
2. The URL manager uses the set data structure to store the song links. Why a set? Because several songs may come from the same album (and thus share the same album URL), so deduplicating reduces the number of requests.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # use the set data structure to filter duplicate elements
        self.old_urls = set()  # use the set data structure to filter duplicate elements

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
3. The third-party Python library openpyxl makes reading and writing Excel files very convenient, so the song information can be saved nicely in an Excel file.
def write_to_excel(self, content):
    try:
        for row in content:
            self.workSheet.append([row['song_name'], row['song_url'], row['singer_name'], row['song_time']])
        self.workBook.save(self.excelName)  # save the song information to the Excel file
    except Exception as arr:
        print('write to excel error', arr.args)
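The workBook, workSheet, and excelName attributes used in write_to_excel() are not initialized in this excerpt. A minimal sketch of the surrounding class with openpyxl (the class name and file name are assumptions; write_to_excel() above would be a method of it):

from openpyxl import Workbook

class ExcelUtil(object):  # hypothetical name for the class that owns write_to_excel
    def __init__(self, excel_name='qq_music.xlsx'):
        self.excelName = excel_name
        self.workBook = Workbook()
        self.workSheet = self.workBook.active
        # header row matching the fields collected in get_music()
        self.workSheet.append(['song_name', 'song_url', 'singer_name', 'song_time'])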
IV. Postscript
Finally, a small celebration: after all, we successfully crawled the song information from QQ Music. Selenium did the job this time, although only a few of its simple features were used. We will learn more about Selenium later, not only for crawling but also for UI automation.
To be optimized in the future:
1. There are many download links, and downloading them one by one is slow. The plan is to use multiple threads for concurrent downloading later.
2. Downloading too fast risks the server blocking the IP address, so a waiting mechanism is needed for requests that hit the same domain too frequently, i.e. a waiting interval between consecutive requests (see the sketch after this list).
3. Parsing web pages is an important step. Regular expressions, BeautifulSoup, and lxml can all be used. BeautifulSoup is used for now; since it is less efficient than lxml, lxml will be used later.
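For item 2, one common way to add that waiting interval is a small throttle keyed by domain. This is not from the original code, just a minimal sketch of the idea; it could be called before each download(url):

import time
import urllib.parse

class Throttle(object):
    """Wait a fixed delay between consecutive requests to the same domain."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_accessed = {}  # domain -> timestamp of the last request

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            sleep_secs = self.delay - (time.time() - last)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.last_accessed[domain] = time.time()

# usage: throttle = Throttle(delay=2); call throttle.wait(url) before download(url)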