1. Open Google Chrome, go to www.toutiao.com, and search for street shots (街拍).
2. Open Developer Tools and filter the Network panel by XHR. The data is loaded asynchronously via Ajax; you can inspect it under the Preview tab.
3. Scroll down to trigger more loads and watch the offset parameter: each request loads 20 items. The response is JSON, and each item's article_url field is the URL of a gallery's detail page.
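The offset-based pagination observed in the Network panel can be sketched as a small helper. This is only an illustration: `page_url`, `BASE`, and `per_page` are names introduced here, while the endpoint and parameter names are the ones visible in DevTools.

```python
from urllib.parse import urlencode

BASE = 'http://www.toutiao.com/search_content/?'

def page_url(page, keyword='街拍', per_page=20):
    # each pull-down load advances the offset by 20
    params = {
        'offset': page * per_page,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': per_page,
        'cur_tab': 1,
    }
    return BASE + urlencode(params)

print(page_url(0))
print(page_url(2))  # contains offset=40
```

Requesting `page_url(0)`, `page_url(1)`, … reproduces the sequence of XHR requests the page itself issues as you scroll.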
4. First, crawl the contents of the index page.
The parameters of the index-page request are passed in the query string.
# extract data from an index page
def get_page_index(offset, keyword):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    # encode the parameters into the route to build the full URL
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting index page')
        return None
5. Next, parse the index page data and extract the detail page URLs. The index response is JSON; each item's article_url field is the gallery detail page URL.
# parse the JSON data (Preview) of the index page and extract the detail page URLs
def parse_page_index(html):
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                # lazily yield the URL of each detail page (generator)
                yield item.get('article_url')
    except JSONDecodeError:
        pass
6. With the detail page URL in hand, the next step is to fetch the detail page's HTML.
# get the HTML of the detail page
def get_page_detail(url):
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting detail page', url)
        return None
7. Then parse the detail page and extract the title and image URLs; the page's data is visible in the Doc view. Note that what we extract is a gallery (a group of images), and non-photo pages are filtered out. The url_list field gives three addresses per picture, all pointing to the same image; we only need one original URL.
For example, open a detail page such as http://www.toutiao.com/a6473742013958193678/ in the browser, then check Developer Tools → Network → Doc.
# parse the detail page and extract the title and the image URLs
def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    images_pattern = re.compile(r'gallery: (.*?),\n')
    result = re.search(images_pattern, html)
    if result:
        # convert the JSON data to a Python dictionary
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            # return the title, the URL of this detail page, and the list of image URLs
            return {
                'title': title,
                'url': url,
                'images': images
            }
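The regex extraction can be exercised against a tiny stand-in for the detail page's inline script. The HTML string below is fabricated for illustration; only the `gallery: …,` pattern mirrors what the real page contained.

```python
import json
import re

# fabricated stand-in for the detail page's inline script (illustration only)
html = ('var data = gallery: {"sub_images": [{"url": "http://p3.pstatp.com/a.jpg"},'
        ' {"url": "http://p3.pstatp.com/b.jpg"}]},\n')

images_pattern = re.compile(r'gallery: (.*?),\n')
result = re.search(images_pattern, html)
data = json.loads(result.group(1))  # the captured group is valid JSON
images = [item.get('url') for item in data['sub_images']]
print(images)  # ['http://p3.pstatp.com/a.jpg', 'http://p3.pstatp.com/b.jpg']
```

The non-greedy `(.*?)` stops at the first `,\n`, which is why the capture ends cleanly at the closing brace of the gallery object.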
8. Store the extracted data in MongoDB as a dictionary.
First, write a MongoDB configuration file, config.py:
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

GROUP_START = 0
GROUP_END = 7
KEYWORD = '街拍'  # "street shots"
Then connect to the local MongoDB instance to store the data:
client = pymongo.MongoClient(MONGO_URL, connect=False)  # connect=False prevents multiple background connections under multithreading
db = client[MONGO_DB]

# save the detail page's URL, title, and list of image addresses to MongoDB
def save_to_mongo(result):
    if db[MONGO_TABLE].insert(result):
        print('stored to MongoDB successfully', result)
        return True
    return False
9. Download the image
# download the image data and save the picture
def download_image(url):
    print('downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # response.content is the binary data
            save_image(response.content)
        return None
    except RequestException:
        print('failed to request image', url)
        return None


# save the picture
def save_image(content):
    # use the MD5 of the content as the file name; this also prevents duplicate downloads of identical images
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
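The MD5 naming scheme can be checked on its own: identical bytes always hash to the same name, so re-downloading the same image maps to an already-existing path and is skipped. (`image_path` is a helper introduced here for the demonstration.)

```python
import os
import tempfile
from hashlib import md5

def image_path(content, directory):
    # identical bytes -> identical digest -> identical path, so re-downloads are no-ops
    return os.path.join(directory, md5(content).hexdigest() + '.jpg')

with tempfile.TemporaryDirectory() as d:
    a = image_path(b'fake-image-bytes', d)
    b = image_path(b'fake-image-bytes', d)
    c = image_path(b'other-bytes', d)
    print(a == b, a == c)  # True False
```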
10. Crawler Main function
def main(offset):
    html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if result:
                save_to_mongo(result)
11. Enable multiprocessing
if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END)]
    pool = Pool()
    pool.map(main, groups)
12. Required libraries
import requests
import re
import pymongo
from urllib.parse import urlencode
from requests.exceptions import RequestException
from json.decoder import JSONDecodeError
from bs4 import BeautifulSoup
import json
import os
from hashlib import md5
from config import *
from multiprocessing import Pool
Complete code: https://github.com/huazhicai/Spider/tree/master/jiepai
Python crawler captures today's headline street photos