developed with C#/WPF with a simple ETL function.
Skyscraper: a web crawler that supports asynchronous networking and is highly extensible.
JavaScript
Scraperjs: a full-featured web crawler based on JS.
Scrape-it: a web crawler based on Node.js.
Simplecrawler: an event-driven web crawler.
Node-crawler: provides a simple API for building custom web crawlers on top of it.
Js-crawler: a web crawler that supports H
The project suddenly gained 300 stars on GitHub today.
I have worked on data-related projects for many years and have a deep understanding of the problems in data development. Data processing work mainly includes crawlers, ETL, and machine learning. Development is the process of building a data processing pipeline in which the various modules are spliced together. In summary, the steps are: get data, convert, merge, store, send. There are many differences in dat
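The five steps named above (get data, convert, merge, store, send) can be sketched as a chain of small functions. This is only an illustrative sketch: the record fields, the in-memory source and sink, and every function name are assumptions made for the example, not part of the project described.

```python
def get_data():
    # Hypothetical source; in practice this would be a crawler or a file reader.
    return [{"id": 1, "name": " Alice "}, {"id": 2, "name": "BOB"}]

def convert(records):
    # Normalize the name field of every record.
    return [{**r, "name": r["name"].strip().lower()} for r in records]

def merge(records, extra):
    # Join extra attributes onto the records by id; unmatched records pass through.
    by_id = {e["id"]: e for e in extra}
    return [{**r, **by_id.get(r["id"], {})} for r in records]

def store(records, sink):
    # Append to an in-memory list that stands in for a database.
    sink.extend(records)
    return records

def send(records):
    # Build the messages a downstream consumer would receive.
    return [f"{r['id']}:{r['name']}" for r in records]

def run_pipeline():
    storage = []
    extra = [{"id": 1, "city": "Hangzhou"}]  # hypothetical second data source
    records = merge(convert(get_data()), extra)
    store(records, storage)
    return send(records), storage
```

Because each stage takes and returns plain lists of records, a stage can be replaced or reordered without touching the others, which is the "spliced together" property the text describes.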
An efficient open-source MySQL data warehouse solution: Infobright in detail
Infobright is a column-based database built on its patented knowledge grid technology. Infobright is an open-source MySQL data warehouse solution that introduces column storage, high-strength data compression, and o
Before I knew it, the ".NET Platform Open Source Project Quick Glance" series had reached 15 articles, each of them very popular. They may not be technically advanced, but they are enough to get started. Although work is very busy, I still take the time to share the good open source projects I come across. Let's introd
Architecture and components of Kubernetes, the open-source container cluster management system
This article is based on an InfoQ article (see the reference section) and has been modified in the difficult areas based on my own understanding. For more information about deploying Kubernetes on Ubuntu, see.
Together we will ensure that Kubernetes is a strong and open
Quartz is an open-source job scheduling framework that provides a simple but powerful mechanism for job scheduling in Java applications. The Quartz framework includes scheduler listeners, job listeners, and trigger listeners. You can register a job listener or trigger listener either globally or for a specific job or trigger. Quartz allows developers to schedule jobs
processing for pipeline use. Its API is similar to Map. Note that it has a skip field; if skip is set to true, the result should not be processed by the pipeline.
The engine that controls the crawler's operation: Spider
Spider is at the heart of WebMagic's internal processing. Downloader, PageProcessor, Scheduler, and Pipeline are all properties of the Spider; they can be set freely, and different behavior can be implemented by setting them. Spider is also the ent
Pentaho
Pentaho is the world's most popular open-source business intelligence software. It is a workflow-oriented BI suite that focuses on solutions rather than tool components. It integrates multiple open-source projects, with the goal of competing with commercial BI. It is a business intelligence (BI) suite based on the
Reprinted from Http://www.cnblogs.com/gaochundong/p/opensource_ip_video_surveillance_system_part_1_introduction.html
Open source dedication series links
Open source dedication: building an IP intelligent network video surveillance system based on .NET (I) Open source cod
(FeatureDataObjects) Provider implements unified access and representation for multiple sources and different spatial data structures, without converting other spatial data into a private spatial data model.
3. Hierarchical comparison of systems
1) Data access channel
Comparison objects: FDO, FME, ArcSDE, and MapInfo SpatialWare
Supported types of data formats: FME >= FDO > ArcSDE = SpatialWare;
As a common spatial data model tool, FDO is equivalent to FME. Currently, FDO supports the following data
relatively large frameworks, integrated with a considerable number of open-source projects; JFreeReport, Mondrian, Kettle, and Weka are all used. It is particularly suitable for the development of large-scale and complex projects.
Pentaho
In China, it has many users and plenty of documentation. In particular, it is worth mentioning that its Chinese-language support on the Internet is quite good, and many vol
Optimization Module Suitable for general application scenarios.
Hadoop is not just a distributed file system for storage, but a framework designed to execute distributed applications on a large cluster composed of general computing devices.
Hive is a Hadoop-based data warehouse platform. With Hive, we can easily perform ETL work.
Hive defines a SQL-like query language, HQL, and converts user-written queries into corresponding MapReduce programs
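To make the query-to-MapReduce idea concrete, here is a rough sketch of how a grouped count, conceptually `SELECT dept, COUNT(*) FROM employees GROUP BY dept`, breaks down into map, shuffle, and reduce phases. The table name, the `dept` column, and all function names are hypothetical, and this is not the plan Hive itself generates; it only illustrates the shape of the translation.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit one (key, 1) pair per input row, keyed by the GROUP BY column.
    for row in rows:
        yield row["dept"], 1

def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the grouped values per key to produce COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

def count_by_dept(rows):
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_dept` on a list of dictionaries that each carry a `dept` key returns a dictionary mapping each department to its row count.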
Dynamically adding, modifying, and deleting scheduled tasks with the open-source task scheduling framework Quartz
Quartz is an open-source job scheduling framework that provides a simple but powerful mechanism for Job Scheduling in Java applications. The Quartz framework includes the scheduler
Background:
The previous post introduced the Leader/Follower thread pool model used in dcm4chee, the main purpose of which is to reduce context switching and improve operational efficiency. This post belongs to the "DICOM Open Source Library Multithreaded Analysis" series and focuses on the ThreadPoolQueue thread pool used in fo-dicom.
ThreadPoolQueue in fo-dicom:
Let's take a look at the custom data structure in the Thr
This is a summary of some of the industry's current open-source real-time stream processing systems, as a reference for future technical research.
S4
S4 (Simple Scalable Streaming System) is an open-source computing platform recently released by Yahoo. It is a general, distributed, extensible, with partition fault toleran
This article introduces Infobright, an efficient open-source MySQL data warehouse solution, in detail. It describes the features of Infobright, its value, its applicable scenarios, and a comparison with MySQL.
Infobright is a column-based database built on its patented knowledge grid technology. Infobright is an
automatically generates HANGFIREDB, or you can build the database manually.
2. Process Designer supports cron expression editing
Cron expression editor open source project address: Https://github.com/LGX9/cron-expression-editor
3. Task Scheduler Module (slickflow.schedule)
3.1 Automatic completion of overdue processes
1) Database fields
The Process instance table wfprocessins
);
// First find the ISet collection of JobKey objects by group name.
var matcher = GroupMatcher<JobKey>.GroupEquals(groupName);
// Note: this is not the usual ISet
Quartz.Collection.ISet<JobKey> keys = scheduler.GetJobKeys(matcher);
// Loop through the keys with an enumerator
var en = keys.GetEnumerator();
while (en.MoveNext())
{
    string rowID = en.Current.Name.Replace("Reporttime", "");
    if (dt.Select("id='" + rowID + "'").Length == 0)
    {
        LogHelper.AddLog("Timing Module", "detects that the schedule configuration informat
The functionality of Scrapy.
III. Data processing flow
Scrapy's entire data processing flow is controlled by the Scrapy engine, which mainly operates as follows:
The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URL to crawl.
The engine gets the first URL to crawl from the spider and schedules it as a request in the scheduler.
The engine asks the scheduler for the next page to crawl.
The scheduler returns the next
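The engine, spider, and scheduler interaction described above can be sketched as a small loop. This is a toy model, not Scrapy's actual API: the `Scheduler`, `Spider`, `download`, and `run_engine` names and the in-memory `FAKE_WEB` are all assumptions, and the download and parse steps go beyond what the excerpt covers before it is cut off.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the links found on it.
FAKE_WEB = {"/a": ["/b", "/c"], "/b": ["/c"], "/c": []}

class Scheduler:
    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        # Schedule each URL at most once.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_request(self):
        # Return the next URL to crawl, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None

class Spider:
    start_urls = ["/a"]

    def parse(self, url, links):
        # Return one scraped item plus the URLs to follow next.
        return {"url": url}, links

def download(url):
    # Stand-in for the downloader: look the page up in the fake web.
    return FAKE_WEB.get(url, [])

def run_engine(spider):
    scheduler = Scheduler()
    items = []
    # The engine asks the spider for its first URLs and schedules them.
    for url in spider.start_urls:
        scheduler.enqueue(url)
    # The engine repeatedly asks the scheduler for the next page to crawl.
    while (url := scheduler.next_request()) is not None:
        item, follow = spider.parse(url, download(url))
        items.append(item)
        for next_url in follow:
            scheduler.enqueue(next_url)
    return items
```

The loop ends when the scheduler has nothing left to hand out, so the engine never needs to know how many pages exist in advance.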
client's request is cached, add an HTTP header parameter that explicitly tells the user that the requested resource is loaded from the cache
if (obj.hits > 0) {
    set resp.http.X-Cache = "hits from " + server.hostname;
} else {
    set resp.http.X-Cache = "MISS from " + server.hostname;
}
}
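The same hit-or-miss decision can be modelled outside VCL. The sketch below is a hypothetical in-memory cache, not Varnish itself: the `TinyCache` class, its `fetch` method, and the "HIT"/"MISS" labels are assumptions chosen for illustration. It counts hits per object and sets the `X-Cache` header from that count, in the same spirit as the `obj.hits > 0` check.

```python
class TinyCache:
    def __init__(self, hostname):
        self.hostname = hostname
        self._store = {}  # url -> {"body": ..., "hits": int}

    def fetch(self, url, backend):
        entry = self._store.get(url)
        if entry is None:
            # First request: fetch from the backend and cache with zero hits.
            entry = {"body": backend(url), "hits": 0}
            self._store[url] = entry
        else:
            # Served from the cache: count the hit.
            entry["hits"] += 1
        label = "HIT" if entry["hits"] > 0 else "MISS"
        headers = {"X-Cache": f"{label} from {self.hostname}"}
        return entry["body"], headers
```

The first `fetch` for a URL reports a miss and every later one reports a hit, which makes it easy to verify from the client side whether a response came from the cache.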
(v) Introduction to the Varnish tools (for a cache server, do not restart after modifying the configuration; a restart clears everything cached in memory)
varnishadm
# Get help
varnishadm -h
# Log in to the varnishadm command-line interface
Var