Crawling the "worth buying" data from the rebate network Fanli with Python (v1: single thread, no Scrapy framework)



For now I crawl the rebate network's data with my earlier approach; I am not yet comfortable with the Scrapy framework, so I will tackle Scrapy tomorrow.

The BeautifulSoup module is used to locate the target data.

1. Observe the web page and find the pattern

Open the "worth buying" section of the site.

1> Analyze the data source

The data on the page falls into two kinds: data that is present when the page first opens (visible in the page source), and data that is loaded dynamically as you scroll down (not present in the page source).

2> Find the pattern

After scrolling to the bottom of the page, 50 items are shown, but the page source contains only 5 of them; all the remaining items are loaded dynamically (a quick check is sketched below).
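As a rough sanity check (not part of the original post's code), you can fetch the raw HTML once and count the item blocks; the class name below is the one the full script later in this article relies on:

# encoding=utf-8
# Quick check: how many items are present in the static HTML of the first listing page.
import urllib2
from bs4 import BeautifulSoup

url = 'http://zhide.fanli.com/p1'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
html = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('div', class_='zdm-list-item J-item-wrap item-no-expired')
print len(items)   # about 5 here, while the fully loaded page shows 50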

To analyze the dynamically loaded data, press F12, open the Network panel, and refresh the page. Before scrolling, the requests we need do not appear; as you scroll down, two requests that might carry the data show up, and only the one named ajaxGetItem... contains what we need. Use the filter box to isolate those requests.

After filtering, the following pattern emerges:

page=1-2 returns items 6-10, page=1-3 returns items 11-15, and so on.

The same pattern holds on the other listing pages. On the second page the site itself requests page=0-2 (its numbering starts from 0), but changing it to page=2-2 made no difference to the result.

So the rule is simply to substitute the page number into the page parameter.
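A small sketch of generating the request URLs from this rule (the same pattern the full script below uses); the helper name urls_for_page is just for illustration:

# Build the URLs for one listing page: the static page plus its nine ajax requests.
def urls_for_page(i):
    urls = ['http://zhide.fanli.com/p%d' % i]          # items 1-5, in the page source
    for j in range(2, 11):                             # page=i-2 ... i-10 cover items 6-50
        urls.append('http://zhide.fanli.com/index/ajaxGetItem?'
                    'cat_id=0&tag=&page=%d-%d&area=0&tag_id=0&shop_id=0' % (i, j))
    return urls

print urls_for_page(1)[1]   # ...ajaxGetItem?...page=1-2... -> items 6-10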

2. Code

Once the pattern is found, writing the code is straightforward. Because it runs on a single thread, crawling all the data would take forever; sorry about that, it will be improved later (a possible speed-up is sketched after the code listing below).

I still know very little about Scrapy. After getting a first taste of it, I realized that what I know about crawlers is just the tip of the iceberg: crawl depth and breadth, distributed crawling and so on are all still unfamiliar to me, so I clearly have a long way to go. Scrapy itself seems very powerful, even though I have not used it much yet.

# encoding=utf-8
import urllib2
from bs4 import BeautifulSoup
import time
# The source of the "worth buying" page on the rebate network contains only five items;
# the other data is loaded dynamically. Each page holds 50 entries in total.

class FanLi():
    def __init__(self):
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}

    def get_url(self):
        list_url = []
        for i in range(1, 760):
            # url1: the static listing page (items 1-5)
            url1 = 'http://zhide.fanli.com/p' + str(i)
            list_url.append(url1)
            # url2: the ajax requests for items 6-50 of the same page
            for j in range(2, 11):
                url2 = 'http://zhide.fanli.com/index/ajaxGetItem?cat_id=0&tag=&page=' + str(i) + '-' + str(j) + '&area=0&tag_id=0&shop_id=0'
                list_url.append(url2)
        return list_url

    def getHtml(self, url):
        # url = 'http://zhide.fanli.com/p' + str(pageIndex)
        try:
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            html = response.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u"connection failed", e.reason
            return None

    def parse(self):
        urls = self.get_url()
        i = 0
        # with open('zhide.txt', 'a') as f:
        #     f.write()
        for url in urls:
            i = i + 1
            html = self.getHtml(url)
            soup = BeautifulSoup(html, 'html.parser')
            divs = soup.find_all('div', class_='zdm-list-item J-item-wrap item-no-expired')

            # for item in divs[0]:
            #     print 'item' + str(item)

            for div in divs:
                con_list = []
                # item name
                title = div.find('h4').get_text()
                # category
                item_type = div.find('div', class_='item-type').a.string
                # recommender
                item_user = div.find('div', class_='item-user').string
                # content
                item_cont = div.find('div', class_='item-content').get_text(strip=True)
                # "worth it" votes
                type_yes = div.find('a', attrs={'data-type': 'yes'}).string
                # "not worth it" votes
                type_no = div.find('a', attrs={'data-type': 'no'}).string
                con_list.append(title)
                con_list.append(item_type)
                con_list.append(item_user)
                con_list.append(item_cont)
                con_list.append(type_yes)
                con_list.append(type_no)

                f = open('zhide.txt', 'a')
                for item in con_list:
                    f.write(item.encode('utf-8') + '|')
                f.write('\n')
                f.close()
            print 'sleeping loading %d' % i
            time.sleep(3)


zhide = FanLi()
zhide.parse()
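The single thread plus the time.sleep(3) after every request is what makes the crawl so slow. One possible improvement, sketched here only as an idea and not part of the original script, is to download the pages through a small thread pool and keep the parsing logic unchanged:

# Sketch: fetch pages concurrently with a thread pool (Python 2),
# reusing the FanLi class above for URL generation and downloading.
from multiprocessing.dummy import Pool as ThreadPool

zhide = FanLi()
urls = zhide.get_url()

pool = ThreadPool(8)                    # 8 worker threads; keep this modest to stay polite to the site
pages = pool.map(zhide.getHtml, urls)   # download all pages concurrently
pool.close()
pool.join()
# each entry of `pages` can then be fed to the same BeautifulSoup parsing code as above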

There are still plenty of shortcomings; corrections are welcome.
