A Python crawler that collects every individual seller's listing and item details from 58.com, for later classification and analysis of the data.
The overall workflow:
Step one: automatically collect all channel URLs from the front page
from bs4 import BeautifulSoup
import requests

# 1. Find links to all channels on the left sidebar
start_url = 'http://hz.58.com/sale.shtml'
url_host = 'http://hz.58.com'

def get_channel_urls(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('ul.ym-mainmnu > li > span > a')
    for link in links:
        page_url = url_host + link.get('href')
        print(page_url)

get_channel_urls(start_url)

# Channel URLs collected from the output above
channel_list = '''
    http://hz.58.com/shouji/
    http://hz.58.com/tongxunyw/
    http://hz.58.com/danche/
    http://hz.58.com/diandongche/
    http://hz.58.com/diannao/
    http://hz.58.com/shuma/
    http://hz.58.com/jiadian/
    http://hz.58.com/ershoujiaju/
    http://hz.58.com/yingyou/
    http://hz.58.com/fushi/
    http://hz.58.com/meirong/
    http://hz.58.com/yishu/
    http://hz.58.com/tushu/
    http://hz.58.com/wenti/
    http://hz.58.com/bangong/
    http://hz.58.com/shebei.shtml
    http://hz.58.com/chengren/
'''
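One caveat with the concatenation above: the hrefs scraped from a page may be site-relative or already absolute, and naive `url_host + href` mishandles the latter. A minimal sketch of a safer join using the standard library's `urljoin` (the example paths are illustrative, not scraped):

```python
from urllib.parse import urljoin

url_host = 'http://hz.58.com'

# urljoin resolves relative hrefs against the host and leaves
# absolute URLs untouched, unlike naive string concatenation
print(urljoin(url_host, '/shouji/'))                  # http://hz.58.com/shouji/
print(urljoin(url_host, 'http://hz.58.com/danche/'))  # http://hz.58.com/danche/
```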
Step two: for each channel obtained in step one, crawl every listing page, store the item URLs in the url_list collection, and fetch each item's detail information
from bs4 import BeautifulSoup
import requests
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']
url_list = ceshi['url_list']
item_info = ceshi['item_info']

def get_links_from(channel, pages, who_sells=0):
    # list pages look like http://hz.58.com/shouji/0/pn7/
    list_view = '{}{}/pn{}/'.format(channel, str(who_sells), str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('td.t > a[onclick]')
    if soup.find('td', 't'):
        for link in links:
            item_link = link.get('href').split('?')[0]
            url_list.insert_one({'url': item_link})
            print(item_link)
    else:
        pass  # empty list page: nothing to collect

def get_item_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    no_longer_exist = 'goods have been off the shelf' in soup.text
    if no_longer_exist:
        pass  # item has been taken down, skip it
    else:
        title = soup.title.text
        price = soup.select('span.price_now > i')[0].text
        area = soup.select('div.palce_li > span > i')[0].text
        # item_info.insert_one({'title': title, 'price': price, 'area': area})
        print({'title': title, 'price': price, 'area': area})

# get_links_from('http://hz.58.com/pbdn/', 7)
# get_item_info('http://zhuanzhuan.58.com/detail/840577950118920199z.shtml')
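The list-page URL pattern used by get_links_from can be checked offline, without hitting the site. A small sketch of the same '{}{}/pn{}/' formatting (the helper name list_view_url is mine, not part of the original code):

```python
def list_view_url(channel, page, who_sells=0):
    # mirrors the '{}{}/pn{}/' pattern in get_links_from:
    # channel URL + seller type (0 = individual seller) + page number
    return '{}{}/pn{}/'.format(channel, who_sells, page)

print(list_view_url('http://hz.58.com/shouji/', 7))
# http://hz.58.com/shouji/0/pn7/
```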
Step three: the main entry point, which runs the crawl with multiple processes
from multiprocessing import Pool
from channel_extract import channel_list
from page_parsing import get_links_from

def get_all_links_from(channel):
    for num in range(1, 31):
        get_links_from(channel, num)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_list.split())
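Since channel_list is a triple-quoted string rather than a list, str.split() is what turns it into the iterable of channel URLs that Pool.map consumes. A tiny sketch with a shortened two-channel string:

```python
channel_list = '''
    http://hz.58.com/shouji/
    http://hz.58.com/tongxunyw/
'''

# split() with no argument splits on any whitespace and drops empty
# pieces, yielding one clean URL per channel
channels = channel_list.split()
print(channels)  # ['http://hz.58.com/shouji/', 'http://hz.58.com/tongxunyw/']
```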
Step four: monitor the amount of collected data in real time
from time import sleep
from page_parsing import url_list

while True:
    # cursor.count() is deprecated in newer pymongo;
    # url_list.count_documents({}) is the modern equivalent
    print(url_list.find().count())
    sleep(5)
Output from an actual run: