Open Source Project Recommendation: Databot, a High-Performance Data-Driven Development Framework for Python (Crawler Case)

Source: Internet
Author: User

The project suddenly gained 300 stars on GitHub today.

I have worked on data-related projects for many years and have a deep understanding of the problems that come up in data development. Data processing work mainly includes crawlers, ETL, and machine learning. Development is essentially the process of building a data-processing pipeline: splicing the various modules together. The steps can be summarized as: get data, transform, merge, store, send. Data development differs from business-system development in many ways: data projects are pipeline processes whose modules depend on each other through data, whereas business systems are built around business processes. In many cases, the data-processing code written by crawler engineers and algorithm engineers is very messy, because you cannot make an accurate design, let alone set performance requirements, before you see real data. I recently spent a lot of time studying the asyncio library in depth, and decided to develop a data-driven framework that addresses data-processing problems in terms of modularity, flexibility, and performance. This is why I created the Databot open source framework.

After about half a month of work, the framework is basically complete. It can handle data-processing tasks such as crawling, ETL, and quantitative trading, and it performs very well. You are welcome to use it and offer suggestions.

Project address: https://github.com/kkyon/databot

Installation: pip3 install -U databot

Code examples: https://github.com/kkyon/databot/tree/master/examples

Multithreading vs. asynchronous coroutines:

In general, for high-concurrency data IO, asynchronous coroutines have the advantage. With threads, the recommended count is roughly CPU cores * 2, because threads consume a lot of resources and thread switching is expensive; Python is also limited by the GIL, so it is difficult to improve performance through multithreading.

Asyncio, by contrast, can achieve very good throughput, with almost no limit on the number of concurrent tasks.

For details, refer to this article:

https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

With asyncio, an ordinary laptop can complete 1 million web requests in about 9 minutes.
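As a rough illustration of the coroutine approach (plain asyncio plus aiohttp, not Databot code), a minimal sketch of concurrent fetching might look like the following; the URL list and the concurrency limit of 100 are made-up values for the example.

import asyncio
import aiohttp

# Illustrative values only: a repeated example URL and an arbitrary concurrency cap.
URLS = ['https://www.example.com/'] * 1000

async def fetch(session, sem, url):
    async with sem:                          # cap the number of in-flight requests
        async with session.get(url) as resp:
            await resp.read()                # drain the response body
            return resp.status

async def main():
    sem = asyncio.Semaphore(100)             # at most 100 concurrent requests
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
        print('done, %d responses' % len(statuses))

asyncio.run(main())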

Databot Performance Test Results:

The Baidu crawler case is used as the benchmark:

Given a set of keywords, search each one on Baidu and record the article titles from the first 10 result pages. This kind of task is common in SEO, public-opinion monitoring, and similar scenarios. The test uses 100 keywords (about 1,000 pages crawled) and takes roughly three minutes to complete. The test environment and results are as follows:

# --- Run result ---
# A single HTTP request returns in about 1 second
# (Postman test result for one page request: 1100 ms)

# Ping time is about 42 ms
# PING www.a.shifen.com (180.97.33.108): data bytes
# bytes from 180.97.33.108: icmp_seq=0 ttl=55 time=41.159 ms

Databot test result: it processes up to about 50 entries per second, roughly 6 pages per second.

# got len item 9274 speed:52.994286 per second,total cost:175s
# got len item 9543 speed:53.016667 per second,total cost:180s
# got len item 9614 speed:51.967568 per second,total cost:185s
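A quick sanity check on these numbers: 9,614 entries in 185 seconds is about 52 entries per second, and since roughly 1,000 pages yield roughly 9,600 entries (about 10 entries per page), that works out to 5-6 pages per second, consistent with the quoted figure of about three minutes for the whole crawl.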




Problems with Python asyncio:

Asyncio itself involves complex concepts, such as the differences between Future and Task, and between ensure_future and create_task.
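For readers who have not used these APIs, here is a tiny self-contained illustration (plain asyncio, not Databot code) of the two ways to schedule a coroutine and of a bare Future:

import asyncio

async def work(n):
    await asyncio.sleep(0.1)
    return n * 2

async def main():
    # create_task wraps a coroutine in a Task and schedules it immediately
    t1 = asyncio.create_task(work(1))
    # ensure_future does the same for a coroutine, but also accepts Futures unchanged
    t2 = asyncio.ensure_future(work(2))
    # a bare Future is just a result placeholder; nothing runs until it is set
    fut = asyncio.get_running_loop().create_future()
    fut.set_result(42)
    print(await t1, await t2, await fut)     # -> 2 4 42

asyncio.run(main())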

Coroutines place high demands on engineers, especially in data projects.

Asyncio has limited third-party library support, so it often has to be combined with multithreading and multiprocessing.

Databot Philosophy:

Data engineers should focus only on core logic and write modular functions, without having to think about asyncio details. Databot handles external IO, concurrency, and scheduling.

Databot Basic Concepts:

Databot's design is very concise, with only three concepts in total: Pipe, Route, and Node.

Pipe is the main flow. A program can have more than one pipe, either interconnected or independent. Routes and nodes are contained inside a pipe.

Route is the router. It mainly routes data and aggregates or merges results. The available routes are Branch, Return, Fork, Join, and BlockedJoin. Branch and Fork do not change the main-flow data; Return and Join put the processed data back into the main flow. Complex data networks can be composed by nesting routes.

Node is a data-driven node that carries the data-processing logic. Built-in nodes such as HTTP, MySQL, and AioFile nodes, user-defined functions, and Timer and Loop are all nodes.
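Putting the three concepts together, a minimal pipe adapted from the Baidu example later in this article might look roughly like this; parse_page, the example URL, and the output file name are placeholders for illustration:

from databot.flow import Pipe, Branch, Loop
from databot.botframe import BotFrame
from databot.http.http import HttpLoader
from databot.db.aiofile import AioFile

# A custom node is just a plain function that takes one item and returns results.
def parse_page(response):                    # placeholder parsing logic
    return [response.text[:80]]

Pipe(
    Loop(['http://www.example.com/']),       # node: feeds each URL into the pipe
    HttpLoader(),                            # node: downloads the page, emits a response
    Branch(parse_page, AioFile('out.txt')),  # route: copy into a branch, parse, store
)
BotFrame.run()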

How to install Databot:

pip3 install -U databot

GitHub address: https://github.com/kkyon/databot

Crawler code analysis:

For more examples, refer to: https://github.com/kkyon/databot/tree/master/examples

For the Baidu Crawler example, the main process code is as follows:

get_all_items is a user-written function that parses the result entries on a page.
get_all_page_url is a user-written function that extracts the pagination links on a page.
    1. Loop sends each URL in the list into the pipe.
    2. HttpLoader reads each URL, downloads the HTML, and emits an HttpResponse object into the pipe.
    3. The first Branch copies the data (the HttpResponse) into a branch, where get_all_items parses it into the final results, which are written to the output file. The main-flow data is not affected; a copy of the HttpResponse remains in the pipe.
    4. The second Branch copies the HttpResponse into another branch, where get_all_page_url extracts the pagination links; HttpLoader then downloads the corresponding pages, which are parsed and saved in the same way.

Each of these steps is scheduled by the Databot framework and runs concurrently.

The BotFrame.render('baiduspider') call can be used to produce a structural diagram of the pipe. It requires Graphviz, which can be installed from www.graphviz.org/download/.
Main function code:

def main():
    words = ['Trade War', 'World Cup']
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = AioFile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )

    # generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()

The following is the generated flowchart

Full code:

from databot.flow import Pipe, Branch, Loop
from databot.botframe import BotFrame
from bs4 import BeautifulSoup
from databot.http.http import HttpLoader
from databot.db.aiofile import AioFile
import logging
import uuid

logging.basicConfig(level=logging.DEBUG)


# define the result structure
class ResultItem:

    def __init__(self):
        self.id: str = ''
        self.name: str = ''
        self.url: str = ''
        self.page_rank: int = 0
        self.page_no: int = 0

    def __repr__(self):
        return '%s,%s,%d,%d' % (str(self.id), self.name, self.page_no, self.page_rank)


# parse the result entries on a page
def get_all_items(response):
    soup = BeautifulSoup(response.text, "lxml")
    items = soup.select('div.result.c-container')
    result = []
    for rank, item in enumerate(items):
        r = ResultItem()
        r.id = uuid.uuid4()
        r.page_rank = rank
        r.name = item.h3.get_text()
        result.append(r)
    return result


# extract the pagination links
def get_all_page_url(response):
    itemlist = []
    soup = BeautifulSoup(response.text, "lxml")
    page = soup.select('div#page')
    for item in page[0].find_all('a'):
        href = item.get('href')
        no = item.get_text()
        if 'Next Page' in no:      # stop at the "next page" link
            break
        itemlist.append('https://www.baidu.com' + href)
    return itemlist


def main():
    words = ['Trade War', 'World Cup']
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = AioFile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )

    # generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()
