The project suddenly picked up 300 stars on GitHub today.
I have worked on data-related projects for many years and have a deep understanding of the problems that come up in data development. Data processing work mainly consists of crawlers, ETL, and machine learning, and development is essentially the process of building a data-processing pipeline: splicing the individual modules together. The steps can be summarized as: get data, transform, merge, store, send. Data development differs from business-system development in many ways: a data project is more of a pipeline process, with modules connected through data dependencies, whereas a business system is built around business processes. In many cases the data-processing code written by crawler engineers and algorithm engineers is very messy, because you cannot make an accurate design, let alone set performance requirements, before you have seen real data. Having recently spent a lot of time studying the asyncio library in depth, I decided to develop a data-driven framework that addresses data processing in terms of modularity, flexibility, and performance. That is why I created the Databot open-source framework.
It took about half a month to reach a basically complete state. It can handle data-processing work such as crawling, ETL, and quantitative trading, and it performs very well. You are welcome to use it and to offer advice.
Project address: github.com/kkyon/databot
Installation: pip3 install -U databot
Code examples: github.com/kkyon/databot/tree/master/examples
Multi-threading vs. asynchronous coroutines:
In general, asynchronous coroutines have the advantage for high-concurrency data IO. With threads, the recommended count is about CPU cores * 2, because threads consume a lot of resources and thread switching is expensive; Python is also limited by the GIL, which makes it hard to improve performance through multi-threading.
Asyncio, by contrast, can achieve very good throughput, and there is almost no limit on the number of concurrent requests.
For details, refer to this article:
pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
With Python asyncio, an ordinary laptop can complete 1 million web requests in about 9 minutes.
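To give a sense of how that kind of throughput is reached, here is a minimal sketch (my own illustration, not Databot code) of bounded-concurrency fetching with asyncio and aiohttp; the placeholder URLs and the concurrency limit of 100 are arbitrary assumptions.

# Minimal asyncio + aiohttp fetcher with a concurrency cap (illustration only).
import asyncio
import aiohttp

CONCURRENCY = 100  # how many requests may be in flight at once (arbitrary)

async def fetch(session, sem, url):
    async with sem:                          # limit the number of concurrent requests
        async with session.get(url) as resp:
            return await resp.text()

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

if __name__ == '__main__':
    urls = ['https://example.com/?q=%d' % i for i in range(100)]  # placeholder URLs
    pages = asyncio.get_event_loop().run_until_complete(crawl(urls))
    print(len(pages), 'pages downloaded')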
Databot Performance Test Results:
The test uses a Baidu crawler case:
Given a list of keywords, search each one on Baidu and record the article titles from the first 10 result pages. SEO, public-opinion monitoring, and similar scenarios often need to do exactly this. The test uses 100 keywords (about 1,000 pages crawled) and takes roughly three minutes to complete. The results in the test environment are as follows:
# --- Run result ---
A single HTTP request returns in about 1 second
# Postman test result for one page request: ~1100 ms
Ping time is about 42 ms
# PING www.a.shifen.com (180.97.33.108): 56 data bytes
# 64 bytes from 180.97.33.108: icmp_seq=0 ttl=55 time=41.159 ms
Databot test results: it processes up to about 50 result entries per second, i.e. roughly 6 pages per second.
# got len item 9274 speed:52.994286 per second, total cost:175s
# got len item 9543 speed:53.016667 per second, total cost:180s
# got len item 9614 speed:51.967568 per second, total cost:185s
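As a quick sanity check on these numbers: the logs report roughly 9,500 items in about 180 s, i.e. 9543 / 180 ≈ 53 items per second, and with about 1,000 pages crawled in total that is roughly 1000 / 180 ≈ 5.6 pages per second, which is where the "50 entries and 6 pages per second" figures come from.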
Problems with Python asyncio:
Asyncio itself has some complex concepts, such as the differences between future and task, and between ensure_future and create_task.
Coroutines place high demands on engineers, especially in data projects.
Asyncio has limited third-party library support, so it often needs to be combined with multi-threading and multi-processing.
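As a small illustration of that boilerplate (my own sketch, not Databot code): the snippet below schedules coroutines as tasks with ensure_future and hands a blocking call to a thread pool via run_in_executor; parse and blocking_save are made-up stand-ins for real work.

# What plain asyncio asks of the engineer: task wrapping plus an executor for blocking libraries.
import asyncio
import time

async def parse(page_no):
    await asyncio.sleep(0.1)                 # stand-in for real asynchronous work
    return 'parsed page %d' % page_no

def blocking_save(record):
    time.sleep(0.1)                          # stand-in for a blocking, non-asyncio library call
    return 'saved ' + record

async def demo():
    loop = asyncio.get_event_loop()
    # ensure_future() (or loop.create_task()) wraps a coroutine into a scheduled Task
    tasks = [asyncio.ensure_future(parse(i)) for i in range(3)]
    records = await asyncio.gather(*tasks)
    # blocking libraries have to be pushed onto a thread/process pool
    saved = await asyncio.gather(*(loop.run_in_executor(None, blocking_save, r) for r in records))
    print(saved)

asyncio.get_event_loop().run_until_complete(demo())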
Databot Philosophy:
Data engineers should focus only on the core logic and write modular functions, without worrying about asyncio details. Databot handles external IO, concurrency, and scheduling.
Databot Basic Concepts:
The Databot design is very concise; there are only three concepts in total: Pipe, Route, and Node.
A Pipe is the main data flow. A program can have more than one pipe, interconnected or independent. Routes and nodes live inside a pipe.
A Route is a router. Its main job is to route data and to aggregate and merge it. The routes are Branch, Return, Fork, Join, and BlockedJoin. Branch and Fork do not change the data in the main flow; Return and Join put the processed data back into the main flow. Complex data networks can be composed by nesting routes.
A Node is a data-driven node. Nodes carry the data-processing logic: HTTP, MySQL, aiofile, user-defined functions, as well as Timer and Loop, are all nodes.
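To make the three concepts concrete, here is a minimal sketch that reuses the imports from the full example further below; the my_parse function and the out.txt file name are placeholders of mine, not part of the Databot examples.

# Minimal Pipe / Route / Node wiring (my_parse and 'out.txt' are placeholders).
from databot.flow import Pipe, Branch, Loop
from databot.botframe import BotFrame
from databot.http.http import HttpLoader
from databot.db.aiofile import AioFile

def my_parse(response):
    # an ordinary synchronous function becomes a custom node;
    # Databot schedules it concurrently, no async/await required
    return [response.text[:100]]

Pipe(                                       # the pipe: the main data flow
    Loop(['https://www.baidu.com']),        # node: feeds URLs into the pipe
    HttpLoader(),                           # node: downloads each URL, emits an HTTP response
    Branch(my_parse, AioFile('out.txt')),   # route: copies data into a branch, parses and stores it
)
BotFrame.run()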
How to install Databot:
pip3 install -U databot
GitHub address: github.com/kkyon/databot
Crawler code walkthrough:
For more examples, refer to: github.com/kkyon/databot/tree/master/examples
For the Baidu crawler example, the main flow code is as follows:
get_all_items is a user-written function that parses the result entries on a web page.
get_all_page_url is a user-written function that extracts the page-turning links on a web page.
- Loop sends the links in the list into the pipe one by one.
- HttpLoader reads each URL, downloads the HTML, and puts an HTTP response object into the pipe.
- Branch copies the data (the HTTP response) into a branch, where get_all_items parses it into the final results, which are stored in a file. The main-flow data is not affected; the pipe still holds a copy of the HTTP response.
- The second Branch copies the HTTP response from the pipe into a branch and then extracts all the paging links with get_all_page_url; the corresponding pages are downloaded by another HttpLoader, parsed with get_all_items, and saved to the file.
Each of these steps is invoked by the Databot framework and runs concurrently.
The BotFrame.render('baiduspider') call produces a structural diagram of the pipe. Graphviz needs to be installed: www.graphviz.org/download/
Main function code:
def main():
    words = ['Trade War', 'World Cup']
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = AioFile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )

    # generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()
The following is the generated flowchart
All code:
from databot.flow import Pipe, Branch, Loop
from databot.botframe import BotFrame
from bs4 import BeautifulSoup
from databot.http.http import HttpLoader
from databot.db.aiofile import AioFile
import logging

logging.basicConfig(level=logging.DEBUG)


# define the parsed result structure
class ResultItem:

    def __init__(self):
        self.id: str = ''
        self.name: str = ''
        self.url: str = ''
        self.page_rank: int = 0
        self.page_no: int = 0

    def __repr__(self):
        return '%s,%s,%d,%d' % (str(self.id), self.name, self.page_no, self.page_rank)


# parse the individual result entries
def get_all_items(response):
    soup = BeautifulSoup(response.text, "lxml")
    items = soup.select('div.result.c-container')
    result = []
    for rank, item in enumerate(items):
        import uuid
        id = uuid.uuid4()
        r = ResultItem()
        r.id = id
        r.page_rank = rank
        r.name = item.h3.get_text()
        result.append(r)
    return result


# parse the paging links
def get_all_page_url(response):
    itemList = []
    soup = BeautifulSoup(response.text, "lxml")
    page = soup.select('div#page')
    for item in page[0].find_all('a'):
        href = item.get('href')
        no = item.get_text()
        if '下一页' in no:  # stop at the "next page" link
            break
        itemList.append('https://www.baidu.com' + href)

    return itemList


def main():
    words = ['Trade War', 'World Cup']
    baidu_url = 'https://www.baidu.com/s?wd=%s'
    urls = [baidu_url % (word) for word in words]

    outputfile = AioFile('baidu.txt')
    Pipe(
        Loop(urls),
        HttpLoader(),
        Branch(get_all_items, outputfile),
        Branch(get_all_page_url, HttpLoader(), get_all_items, outputfile),
    )
    # generate the flowchart
    BotFrame.render('baiduspider')
    BotFrame.run()


main()