A magical use of decorators in Python

Source: Internet
Author: User
Tags: md5 in python

Well, I know it's midnight ... but I still think it's worth spending half an hour to share this latest idea.

Imagine a scenario: you need to crawl a page, and that page contains many sub-URLs that each need to be crawled in turn; inside those sub-pages there is yet more data to fetch. To keep things simple, we'll look at three levels, and our code is as follows:


def func_top(url):
    data_dict = {}

    # extract the sub-page URLs from this page
    sub_urls = xxxx

    data_list = []
    for it in sub_urls:
        data_list.append(func_sub(it))

    data_dict['data'] = data_list

    return data_dict

def func_sub(url):
    data_dict = {}

    # extract the bottom-page URLs from this page
    bottom_urls = xxxx

    data_list = []
    for it in bottom_urls:
        data_list.append(func_bottom(it))

    data_dict['data'] = data_list

    return data_dict

def func_bottom(url):
    # fetch the data
    data = xxxx
    return data

func_top is the handler for the top-level page, func_sub is the handler for a child page, and func_bottom is the handler for the deepest pages. func_top extracts the child-page URLs and calls func_sub on each; func_sub does the same with func_bottom.

Under normal circumstances this satisfies the requirement, but the site you want to crawl may be extremely unstable and frequently go down, so the data cannot be fetched.

So at this point you have two choices:

1. Stop on error, then restart from where the crawl broke off.
2. Keep going when errors occur, but on a later run, don't go back to the site for data you already have; pull only the data that is still missing.

The first option is basically infeasible, because if the order of the URLs on someone else's site changes, the position you recorded becomes meaningless. That leaves only the second option; in plain terms, cache the data once it is fetched, and when it is needed again, take it straight from the cache.
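The "cache it by URL" idea hinges on mapping each URL to a stable, filesystem-safe filename; the approach used later in this article is an md5 hash of the URL. A minimal sketch of that cache-key idea (the function name `cache_filename` is illustrative, not from the article):

```python
import hashlib

def cache_filename(url):
    # hash the URL so any string maps to a fixed-length, filesystem-safe name
    return hashlib.md5(url.encode('utf-8')).hexdigest()

name = cache_filename('http://example.com/page/1')
# the same URL always yields the same 32-character hex name
```

Because the hash is deterministic, a re-run of the crawler computes the same filename and finds the previously dumped data.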

OK, the goal is clear; how do we achieve it?

In C++ this would be quite troublesome, and the resulting code would surely be ugly, but fortunately we are using Python, and Python has decorators for functions.
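For readers new to decorators: a decorator is simply a function that takes a function and returns a wrapped version of it, and the `@` syntax applies it at definition time. A minimal sketch (the names `log_calls` and `fetch` are illustrative, not from the article):

```python
def log_calls(func):
    # the wrapper runs extra logic around every call to func
    def wrapper(url):
        print('calling %s with %s' % (func.__name__, url))
        return func(url)
    return wrapper

@log_calls  # equivalent to: fetch = log_calls(fetch)
def fetch(url):
    return 'data from ' + url
```

The caching trick below uses exactly this shape: the wrapper checks the cache before deciding whether to call the real function.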

So the implementation plan is:

Define a decorator: if the data has been fetched before, return the cached data directly; if not, pull it from the site and store it in the cache.

The code is as follows:


import hashlib
import os

def get_dump_data(dir_name, url):
    m = hashlib.md5(url.encode('utf-8'))
    filename = m.hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)

    # cache hit: re-evaluate the repr() that was dumped earlier
    if os.path.isfile(full_file_name):
        with open(full_file_name, 'r') as f:
            return eval(f.read())
    return None


def set_dump_data(dir_name, url, data):
    if not os.path.isdir('dumps/' + dir_name):
        os.makedirs('dumps/' + dir_name)

    m = hashlib.md5(url.encode('utf-8'))
    filename = m.hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)

    with open(full_file_name, 'w') as f:
        f.write(repr(data))


def deco_dump_data(func):
    def func_wrapper(url):
        data = get_dump_data(func.__name__, url)
        if data is not None:
            return data

        data = func(url)
        if data is not None:
            set_dump_data(func.__name__, url, data)
        return data

    return func_wrapper

Then we just need to add the deco_dump_data decorator to each of func_top, func_sub, and func_bottom ~~
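To see the effect of applying the decorator, here is a self-contained sketch that swaps the `dumps/` directory for an in-memory dict (so it runs without touching the filesystem) and uses a hypothetical `func_bottom` that records every real fetch:

```python
_cache = {}  # stands in for the dumps/ directory in this sketch

def deco_dump_data(func):
    def func_wrapper(url):
        key = (func.__name__, url)
        if key in _cache:          # cache hit: skip the real fetch
            return _cache[key]
        data = func(url)
        if data is not None:       # only cache successful fetches
            _cache[key] = data
        return data
    return func_wrapper

calls = []

@deco_dump_data
def func_bottom(url):
    calls.append(url)              # record each real fetch
    return 'data for ' + url

func_bottom('http://example.com/a')  # first call: real fetch
func_bottom('http://example.com/a')  # second call: served from cache
```

After both calls, `calls` contains the URL only once: the second call never reached the real function, which is exactly the behavior the article's disk-backed version gives you across separate runs of the crawler.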

Done! The biggest advantage is that, because every layer (top, sub, bottom) dumps its own data, once a sub-layer's data is dumped, none of its corresponding bottom-layer pages needs to be visited again, which saves a lot of overhead!

OK, that's it ~ life is short, I use Python!
