Python crawler captures Toutiao street photos


1. Open Google Chrome, go to www.toutiao.com, and search for 街拍 ('street shot').

2. Open the developer tools and filter the Network panel to XHR. The data is loaded asynchronously via Ajax, and you can inspect each response in the Preview tab.

3. Scroll down to trigger more loads and watch the offset parameter grow: each request loads 20 items. The response is JSON, and each item's article_url field is the URL of a gallery detail page (a quick way to verify this is sketched below).
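As a quick sanity check, one of those Ajax requests can be reproduced directly. This is a minimal sketch: the endpoint and parameter names are the same ones used by get_page_index in step 4, and the data/article_url response shape is what the Preview tab shows.

import requests
from urllib.parse import urlencode

# parameters copied from one XHR request seen in the Network panel
params = {
    'offset': 0,
    'format': 'json',
    'keyword': '街拍',  # 'street shot'
    'autoload': 'true',
    'count': 20,
    'cur_tab': 1,
}
response = requests.get('http://www.toutiao.com/search_content/?' + urlencode(params))
# each item in 'data' carries an 'article_url' pointing at a gallery detail page
for item in response.json().get('data', []):
    print(item.get('article_url'))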

4. First, crawl the contents of the index page.

The request parameters for the index page are passed in the query string:

# extract data from the index page
def get_page_index(offset, keyword):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    # encode the parameters into the query string to build the full URL
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting index page')
        return None

5. Next, parse the index page data and extract the detail page URLs we want. The index page data is JSON; each item's article_url field is the URL of a gallery detail page.

# parse the JSON data of the index page (what the Preview tab shows) and extract the detail page URLs
def parse_page_index(html):
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                # lazily yield the URL of each detail page, making this a generator
                yield item.get('article_url')
    except JSONDecodeError:
        pass
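A quick usage sketch (offset and keyword are hypothetical): because parse_page_index is a generator, it produces URLs lazily, one per loop iteration.

html = get_page_index(0, '街拍')
if html:
    for url in parse_page_index(html):
        print(url)  # e.g. http://www.toutiao.com/a6473742013958193678/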

6. With the detail page URLs in hand, the next step is to fetch each detail page's HTML source.

# get the HTML source of the detail page
def get_page_detail(url):
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('error requesting detail page', url)
        return None

7. Then parse the detail page and extract the title and the image URLs; the detail page's data can be inspected under the Doc filter. Note that what we extract is a gallery (non-photo results are filtered out), and each picture's url_list holds three addresses for the same image, of which we only need the one original url.
For example, open a detail page such as http://www.toutiao.com/a6473742013958193678/ in the browser and look at Developer tools > Network > Doc. The embedded gallery data looks roughly like this:
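(A sketch for orientation only: the sub_images, url and url_list keys come from the page, while the host names and counts are made up.)

# rough shape of the gallery JSON embedded in the detail page HTML,
# written out as a Python literal
gallery = {
    'count': 3,
    'sub_images': [
        {
            'url': 'http://p3.pstatp.com/origin/...',  # the original image; this is the one we keep
            'url_list': [                              # three mirrors of the same picture
                {'url': 'http://p3.pstatp.com/...'},
                {'url': 'http://pb9.pstatp.com/...'},
                {'url': 'http://pb1.pstatp.com/...'},
            ],
        },
    ],
}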

# parse the detail page and extract the title and the image URLs
def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    images_pattern = re.compile(r'gallery: (.*?),\n')
    result = re.search(images_pattern, html)
    if result:
        # convert the JSON data to a Python dictionary
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            # return the title, the URL of this detail page, and the list of picture URLs
            return {
                'title': title,
                'url': url,
                'images': images
            }

8. Store the parsed data in MongoDB, as dictionaries.

First, write a Mongo configuration file, config.py:

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

GROUP_START = 0
GROUP_END = 7

KEYWORD = '街拍'  # 'street shot'

Then connect to the local Mongo instance to store the data:

# connect=False keeps PyMongo from connecting in the background, which avoids
# each process opening extra connections when we go multi-process later
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]


# save the detail page's title, URL and list of image addresses to Mongo
def save_to_mongo(result):
    if db[MONGO_TABLE].insert(result):
        print('stored to MongoDB successfully', result)
        return True
    return False
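A caveat worth flagging: collection.insert() is deprecated in PyMongo 3 and removed in PyMongo 4, so on a current PyMongo the equivalent call is insert_one():

# same behaviour on modern PyMongo: insert_one() returns an InsertOneResult,
# which is truthy on success
def save_to_mongo(result):
    if db[MONGO_TABLE].insert_one(result):
        print('stored to MongoDB successfully', result)
        return True
    return False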

9. Download the images

# download the binary image data and save the picture
def download_image(url):
    print('downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # response.content is the binary (bytes) body of the response
            save_image(response.content)
        return None
    except RequestException:
        print('request for picture failed', url)
        return None


# save a picture to disk
def save_image(content):
    # name the file after the MD5 of its content, which also prevents
    # downloading identical images twice
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)

10. The crawler's main function

def main(offset):
    html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if result:
                save_to_mongo(result)

11. Turn on multiprocessing

if __name__ == '__main__':
    # offsets step by 20, matching the 20 items loaded per request
    groups = [x * 20 for x in range(GROUP_START, GROUP_END)]
    pool = Pool()
    pool.map(main, groups)
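With the config.py values above (GROUP_START = 0, GROUP_END = 7), the offsets handed out to the worker processes are:

>>> [x * 20 for x in range(0, 7)]
[0, 20, 40, 60, 80, 100, 120]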

12. Required libraries

import requests
import re
import pymongo
from urllib.parse import urlencode
from requests.exceptions import RequestException
from json.decoder import JSONDecodeError
from bs4 import BeautifulSoup
import json
import os
from hashlib import md5
from config import *
from multiprocessing import Pool

Complete code: https://github.com/huazhicai/Spider/tree/master/jiepai
