Microservice architecture is in full swing today, but enterprise architectures must handle not only a large volume of OLTP transactions but also an enormous amount of batch work. In financial institutions such as banks, there can be as many as 340,000 batch jobs to process every day. For OLTP, the industry offers a wealth of open-source frameworks and excellent architectural designs, but frameworks for batch processing are rare. It's time to take a look at the outstanding frameworks and designs in the world of batch processing; today I will use Spring Batch as an example to explore that world with you.
This article covers:
A first look at typical batch-processing scenarios
Exploring the domain model and key architecture
Achieving job robustness and extensibility
Shortcomings of the batch-processing framework and how to enhance it
Typical batch-processing business scenarios
Reconciliation is a typical batch business scenario: transactions at all kinds of financial institutions, and any business that crosses host systems, involve reconciliation processes — large- and small-value payments, UnionPay transactions, transactions with the People's Bank of China, cash management, POS business, ATM business, securities firms' fund accounts, and clearing between securities firms and the securities clearing house.
The following outlines the requirements of a typical bank day-end batch run.
The key requirements include:
Error handling and fallback for every unit in the batch;
Units that run on different platforms;
Branching (conditional path selection);
Monitoring of each unit, with retrieval of each unit's processing log;
A variety of trigger rules: by date, by calendar, by cycle.
Beyond this typical scenario, batch processing appears in business scenarios such as:
Periodically submitted batch tasks (day-end processing)
Parallel batch processing: tasks processed in parallel
Enterprise message-driven processing
Large-scale parallel processing
Manual or scheduled restarts
Sequential processing of dependent tasks (extensible to workflow-driven batch processing)
Partial processing: skipping records (for example, on rollback)
Whole-batch transactions
Unlike OLTP transactions, a batch job has two defining characteristics: batch execution and automatic (unattended) execution. The former handles the import, export, and business-logic computation of large volumes of data; the latter lets bulk tasks run automatically without human intervention.
Beyond these basic functions, a batch framework needs to address a few more concerns:
Robustness: invalid or corrupt data must not crash the program;
Reliability: reliable execution of batch jobs through tracking, monitoring, logging, and handling strategies (retry, skip, restart);
Extensibility: vertical and horizontal scaling through concurrency or parallelism, to meet the performance demands of massive data volumes;
Given the industry's long-standing lack of a good batch framework, Spring Batch stands out as one of the few excellent batch frameworks (developed in Java), born from the combined expertise of SpringSource and Accenture.
Accenture brought rich industry-level experience in batch architecture, contributing its previously proprietary batch frameworks (developed and used for decades, they supplied Spring Batch with a great deal of reference experience).
SpringSource brought deep technical insight and the Spring programming model, while the design also drew on features of JCL (Job Control Language) and COBOL. In 2013, JSR-352 brought batch processing into the specification system, and it was included in JEE7. This means every JEE7 application server has batch-processing capability; the first application server to implement the specification was GlassFish 4. It can, of course, also be used in Java SE.
Most critically, the JSR-352 specification borrows heavily from the design of the Spring Batch framework: the core models and concepts of the two are exactly the same. The complete JSR-352 specification can be downloaded from https://jcp.org/aboutJava/communityprocess/final/jsr352/index.html.
The Spring Batch framework lets you build lightweight, robust parallel-processing applications with support for transactions, concurrency, process orchestration, monitoring, and vertical and horizontal scaling, along with unified interface and task management.
The framework takes care of core concerns such as the following, so that developers can focus on the business logic. It provides:
A clear separation between the batch execution environment and the application
Common core services provided as interfaces
Simple, "out-of-the-box" default implementations of the core execution interfaces
Configuration, customization, and extension through the Spring framework
Core-service default implementations that are easy to extend or replace without touching the infrastructure layer
A simple deployment model, built with Maven
Key domain model and architecture of batch processing
Let's start with a Hello World example, a typical batch job.
A typical job is divided into three parts — reading, processing, and writing — the classic three-step architecture; essentially the entire batch framework revolves around read, process, and write. On top of this, the framework provides a job scheduler and a job repository (which stores job metadata and supports both in-memory and database modes).
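As a concrete illustration, here is a minimal sketch of such a job using Spring Batch's XML namespace. The bean ids helloReader, helloProcessor, and helloWriter are hypothetical placeholders for ItemReader, ItemProcessor, and ItemWriter implementations you would supply:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/batch
           http://www.springframework.org/schema/batch/spring-batch.xsd">

    <!-- A "Hello World" job: a single step that reads, processes, and writes -->
    <batch:job id="helloWorldJob">
        <batch:step id="helloStep">
            <batch:tasklet>
                <!-- read/process/write, committing after every 10 items -->
                <batch:chunk reader="helloReader"
                             processor="helloProcessor"
                             writer="helloWriter"
                             commit-interval="10"/>
            </batch:tasklet>
        </batch:step>
    </batch:job>
</beans>
```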
The complete domain model is made up of the following concepts.
JobLauncher is the Spring Batch infrastructure-layer component that runs jobs: given a job name and JobParameters, the JobLauncher executes the job.
The JobLauncher can be invoked from a Java program, from the command line, or from another framework, such as the scheduling framework Quartz.
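A sketch of the wiring, assuming a jobRepository bean is already defined (see below):

```xml
<!-- The standard launcher implementation; it records each run in the JobRepository -->
<bean id="jobLauncher"
      class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository"/>
</bean>
```

From the command line, the framework's CommandLineJobRunner class serves the same purpose, taking the configuration file, the job name, and any job parameters as arguments.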
The JobRepository stores the metadata of job executions (here metadata means JobInstance, JobExecution, JobParameters, StepExecution, ExecutionContext, and so on) and provides two default implementations.
One stores the metadata in memory; the other persists it to a database. Persisting metadata to the database lets you monitor the execution state of a batch job at any time — whether it succeeded or failed — and makes it possible to restart the job after a failure. A Step represents one complete stage of a job; a job can consist of one or more Steps.
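A sketch of the database-backed configuration (the dataSource and transactionManager beans are assumed to exist; the in-memory alternative is typically built with MapJobRepositoryFactoryBean):

```xml
<!-- Persists job metadata to the BATCH_* tables described later -->
<batch:job-repository id="jobRepository"
                      data-source="dataSource"
                      transaction-manager="transactionManager"
                      table-prefix="BATCH_"/>
```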
The runtime model of the batch framework is just as simple:
JobInstance is a runtime concept: each run of a Job involves a JobInstance.
A JobInstance has two possible sources: either it is retrieved from the JobRepository based on the supplied JobParameters, or, if no matching instance exists in the JobRepository, a new JobInstance is created.
A JobExecution is a handle to one execution of a job, and an execution may succeed or fail. Only when a JobExecution succeeds is its JobInstance considered complete; consequently, when executions fail, one JobInstance can correspond to multiple JobExecutions.
This sums up the typical conceptual model of batch processing: a design of just ten concepts, very concise, yet fully sufficient to support the entire framework.
The core capabilities of a Job include abstraction and inheritance of jobs (similar to the object-oriented concepts) and the ability to restart a job after an abnormal execution.
At the job level, the framework also provides job orchestration: sequential, conditional, and parallel orchestration of steps.
Multiple Steps can be configured in one Job. Steps can execute in sequence or be selected conditionally (conditions usually use a Step's exit status), with the jump rules defined through the next element or the decision element;
To improve execution efficiency across multiple Steps, the framework can execute Steps in parallel (declared with split; in this case the Steps should normally have no dependencies on one another, otherwise business errors are easy to introduce). A Step contains all the information a real batch task needs; it can be a very simple business implementation or a very complex business process, and a Step's complexity is usually determined by the business.
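A minimal sketch of conditional orchestration with the next and decision elements (taskletA and friends, myDecider, and the EVEN_DAY status are hypothetical; myDecider would be a JobExecutionDecider bean, and a split example appears in the Parallel Step section below):

```xml
<batch:job id="conditionalJob">
    <batch:step id="stepA">
        <batch:tasklet ref="taskletA"/>
        <!-- jump rules based on the exit status of stepA -->
        <batch:next on="COMPLETED" to="decideRoute"/>
        <batch:next on="*" to="errorStep"/>
    </batch:step>

    <!-- a JobExecutionDecider bean picks the branch programmatically -->
    <batch:decision id="decideRoute" decider="myDecider">
        <batch:next on="EVEN_DAY" to="stepB"/>
        <batch:next on="*" to="stepC"/>
    </batch:decision>

    <batch:step id="stepB"><batch:tasklet ref="taskletB"/></batch:step>
    <batch:step id="stepC"><batch:tasklet ref="taskletC"/></batch:step>
    <batch:step id="errorStep"><batch:tasklet ref="errorTasklet"/></batch:step>
</batch:job>
```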
Each Step is composed of an ItemReader, an ItemProcessor, and an ItemWriter; depending on business needs, the ItemProcessor can be trimmed away. The framework ships with a large number of ItemReader and ItemWriter implementations, supporting data sources such as flat files, XML, JSON, databases, and messages.
The framework also equips Steps with restart, transactions, restart counts, and concurrency, along with commit intervals, exception skipping, retry, completion policies, and more. Flexible configuration at the Step level can satisfy most common business requirements. The three steps — read, process, write — are the classic abstraction of batch processing.
Being batch-oriented, the Step layer provides the ability to read, process, and then commit in batches.
In a chunk operation, the commit-interval property sets how many records are read per commit. Raising the commit-interval lowers the commit frequency and reduces resource usage. Each Step commit is a complete transaction, and by default Spring's declarative transaction management provides convenient transaction orchestration. Here is an example of declaring a transaction:
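A minimal sketch, assuming itemReader and itemWriter beans exist:

```xml
<batch:job id="transactionalJob">
    <batch:step id="transactionalStep">
        <batch:tasklet transaction-manager="transactionManager">
            <!-- each chunk of 100 records is committed as one transaction -->
            <batch:chunk reader="itemReader" writer="itemWriter" commit-interval="100"/>
            <!-- fine-grained transaction attributes for this tasklet -->
            <batch:transaction-attributes isolation="READ_COMMITTED"
                                          propagation="REQUIRED"
                                          timeout="300"/>
        </batch:tasklet>
    </batch:step>
</batch:job>
```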
The framework's transaction support includes:
Chunks support transaction management, with the number of records per commit set via commit-interval;
Fine-grained transaction configuration per Tasklet: isolation level, propagation behavior, timeout;
Rollback and no-rollback support, via skippable-exception-classes and no-rollback-exception-classes;
Transaction-level configuration for JMS queues;
In addition, Spring Batch keeps the framework's higher-level model abstractions remarkably concise.
Just six tables store all of the metadata (job and step instances, contexts, and executions), which is what makes subsequent monitoring, restart, retry, state recovery, and so on possible.
BATCH_JOB_INSTANCE: the job instance table, holding instance information for each job
BATCH_JOB_EXECUTION_PARAMS: the job parameters table, holding the parameters of each job execution; in practice these parameters identify the job instance
BATCH_JOB_EXECUTION: the job execution table, holding execution information for each run of a job, such as creation time, start time, end time, the job instance executed, and execution status
BATCH_JOB_EXECUTION_CONTEXT: the job execution context table, holding the context information of each job execution
BATCH_STEP_EXECUTION: the step execution table, holding information about each step execution, such as start time, execution duration, status, read/write counts, and skip counts
BATCH_STEP_EXECUTION_CONTEXT: the step execution context table, holding the context information of each step execution
Achieving job robustness and extensibility
Batch processing demands strong robustness. Jobs typically process data in bulk and run unattended, so during execution a job must be able to cope with all kinds of exceptions and errors, and executions must be tracked effectively.
A robust job typically has the following characteristics:
1. Fault tolerance
For non-fatal exceptions during job execution, the framework should apply effective fault-tolerant handling rather than letting the entire job fail; typically, only fatal exceptions that cause business errors should terminate the job.
2. Traceability
Any error occurring during execution needs to be recorded effectively so that the error points can be dealt with later. For example, any record skipped during execution must be logged effectively so that the application's maintainers can follow up on the skipped records.
3. Restartability
If a job fails because of an exception during execution, it should be possible to restart it at the point of failure instead of re-executing it from the beginning.
The framework supports all of the above, providing Skip (skip a record), Retry (retry a given operation), and Restart (restart a failed job from the point of error):
Skip: while the data is being processed, if a few records are formatted in a way that does not meet the requirements, skipping them lets the processor successfully handle the remaining records.
Retry: the given operation is attempted several more times. In some cases an operation fails because of a transient exception, such as a momentary network problem or a concurrency conflict; on the next attempt the network is back to normal and the conflict is gone. Retrying effectively shields the job from such transient anomalies and avoids failing on a single attempt.
Restart: after a job fails, execution can be completed by restarting it. On restart, the batch framework lets the job resume at the point where the previous execution failed rather than starting from scratch, which can greatly improve execution efficiency.
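A sketch of how these three capabilities are declared (the exception classes shown are just plausible candidates, and restartable="true" is in fact the default):

```xml
<batch:job id="robustJob" restartable="true">
    <batch:step id="robustStep">
        <batch:tasklet>
            <batch:chunk reader="itemReader" processor="itemProcessor"
                         writer="itemWriter" commit-interval="100"
                         skip-limit="10" retry-limit="3">
                <!-- skip malformed records instead of failing the whole job -->
                <batch:skippable-exception-classes>
                    <batch:include class="org.springframework.batch.item.file.FlatFileParseException"/>
                </batch:skippable-exception-classes>
                <!-- retry operations that fail with transient exceptions -->
                <batch:retryable-exception-classes>
                    <batch:include class="org.springframework.dao.DeadlockLoserDataAccessException"/>
                </batch:retryable-exception-classes>
            </batch:chunk>
        </batch:tasklet>
    </batch:step>
</batch:job>
```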
For extensibility, the framework provides four modes:
Multithreaded Step: a single Step executed by multiple threads;
Parallel Step: multiple Steps executed in parallel via multithreading;
Remote Chunking: chunk operations executed on remote nodes;
Partitioning Step: data is partitioned and the partitions processed separately;
Let's look at the first mode, Multithreaded Step:
By default the batch framework uses a single thread to execute a job. The framework also provides thread-pool support (Multithreaded Step mode) so that a Step can process records concurrently — concurrency here meaning that the same Step is executed in parallel on a thread pool. Setting the Tasklet's task-executor attribute is all it takes to turn an ordinary Step into a multithreaded Step.
Example of a Multithreaded Step implementation:
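A minimal sketch, assuming pagingReader, itemProcessor, and itemWriter beans exist; SimpleAsyncTaskExecutor keeps the sketch short, though a pooled executor is more typical in production:

```xml
<batch:job id="multiThreadedJob">
    <batch:step id="multiThreadedStep">
        <!-- task-executor turns this ordinary step into a multithreaded step;
             throttle-limit caps the number of concurrent chunk executions -->
        <batch:tasklet task-executor="stepTaskExecutor" throttle-limit="4">
            <batch:chunk reader="pagingReader" processor="itemProcessor"
                         writer="itemWriter" commit-interval="50"/>
        </batch:tasklet>
    </batch:step>
</batch:job>

<bean id="stepTaskExecutor"
      class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
```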
It is important to note that most of the ItemReader, ItemWriter, and other components shipped with the Spring Batch framework are not thread-safe.
Thread-safe Steps can, however, be implemented by extension.
Here is one such extension to illustrate:
Requirement: batch-process a database table with a thread-safe Step that supports restart — that is, the batch's state is recorded so that processing can resume from the point of failure.
For the database reading component JdbcCursorItemReader in this example, the table design adds a flag column indicating whether the current record has been read and processed successfully; when processing succeeds, the flag is set to true, and on the next re-read, records that were already read and processed successfully are simply skipped.
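A sketch of the reader side of this design (the table name batch_data and the columns id, payload, and flag are hypothetical; the writer would set flag = 'Y' in the same transaction):

```xml
<bean id="unprocessedItemReader"
      class="org.springframework.batch.item.database.JdbcCursorItemReader">
    <property name="dataSource" ref="dataSource"/>
    <!-- only rows not yet flagged are read, so a restart naturally
         resumes from the point of failure -->
    <property name="sql"
              value="SELECT id, payload FROM batch_data WHERE flag = 'N' ORDER BY id"/>
    <property name="rowMapper" ref="batchDataRowMapper"/>
</bean>
```

Because JdbcCursorItemReader itself is not thread-safe, the multithreaded variant would additionally need to synchronize the read call, for example by wrapping the reader in a synchronized delegate.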
Multithreaded Step gives multiple threads the ability to execute one Step, but this scenario is not used very often in real business.
More common is the case where different Steps in a job have no strict ordering among them and can therefore be executed concurrently.
Parallel Step: provides the ability to scale out on a single node.
Usage scenario: Steps A and B are executed by different threads, and only after both complete is Step C executed.
The framework provides the Parallel Step capability: a parallel job flow is defined with the split element, which can also designate the thread pool to use.
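A sketch of the scenario above (taskletA, taskletB, and taskletC are hypothetical beans):

```xml
<batch:job id="parallelJob">
    <!-- stepA and stepB run concurrently; stepC starts only after both finish -->
    <batch:split id="parallelAB" task-executor="splitTaskExecutor" next="stepC">
        <batch:flow>
            <batch:step id="stepA"><batch:tasklet ref="taskletA"/></batch:step>
        </batch:flow>
        <batch:flow>
            <batch:step id="stepB"><batch:tasklet ref="taskletB"/></batch:step>
        </batch:flow>
    </batch:split>
    <batch:step id="stepC"><batch:tasklet ref="taskletC"/></batch:step>
</batch:job>

<bean id="splitTaskExecutor"
      class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
```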
Executing in Parallel Step mode has the following effect:
each job step processes different records in parallel; in the example there are three job steps, processing different data in the same table.
Parallel Step provides horizontal processing on a single node, but as the workload grows, a single node may no longer be able to satisfy the job's demands. At that point Remote Chunking can combine multiple machine nodes to complete one job.
Remote Chunking: remote chunking essentially splits the processing of item reads and writes; usually the read logic runs on one node while the write operations are distributed to other nodes for execution.
Remote chunking is a technique for dividing up the work of a Step, and it requires no detailed understanding of the structure of the data being processed.
Any input source can be read by a single process and, after dynamic splitting, sent to remote worker processes as "chunks".
The remote processes implement a listener pattern: they receive the requests, process the data, and asynchronously return the results. Delivery of requests and replies is guaranteed between the sender and each individual consumer.
On the master node, the job step reads the data and sends it via remoting to the designated remote nodes for processing; when processing completes, the master collects the results of the remote executions.
In the Spring Batch framework, remote chunking is accomplished through two core interfaces: ChunkProvider and ChunkProcessor.
ChunkProvider: produces chunks of batch items from a given ItemReader;
ChunkProcessor: takes the chunks produced by the ChunkProvider and executes the concrete write logic;
Spring Batch offers no default implementation of remote chunking, but remote communication can be achieved using Spring Integration (SI) or AMQP implementations.
Example of remote chunking based on SI:
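Spring Batch Integration provides building blocks for this pattern. The following sketch shows the rough shape of both sides (the channel names are hypothetical, and in a real deployment the requests/replies channels would be bridged to a message broker such as JMS or AMQP):

```xml
<!-- Spring Integration channels; the reply channel must be pollable -->
<int:channel id="requests"/>
<int:channel id="replies">
    <int:queue/>
</int:channel>

<!-- master side: the step's writer sends each chunk to the request channel -->
<bean id="chunkWriter"
      class="org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter"
      scope="step">
    <property name="messagingOperations" ref="messagingTemplate"/>
    <property name="replyChannel" ref="replies"/>
</bean>

<bean id="messagingTemplate"
      class="org.springframework.integration.core.MessagingTemplate">
    <property name="defaultChannel" ref="requests"/>
    <property name="receiveTimeout" value="1000"/>
</bean>

<!-- worker side: a listener picks requests off the queue and hands them
     to the chunk handler, which runs the actual write logic -->
<int:service-activator input-channel="requests" output-channel="replies"
                       ref="chunkHandler"/>

<bean id="chunkHandler"
      class="org.springframework.batch.integration.chunk.ChunkProcessorChunkHandler">
    <property name="chunkProcessor">
        <bean class="org.springframework.batch.core.step.item.SimpleChunkProcessor">
            <property name="itemProcessor">
                <bean class="org.springframework.batch.item.support.PassThroughItemProcessor"/>
            </property>
            <property name="itemWriter" ref="itemWriter"/>
        </bean>
    </property>
</bean>
```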
The local (master) step reads the data and sends requests to the remote step through a MessagingGateway; the remote step provides a queue listener that, when a message arrives on the request queue, fetches the request and hands it to a ChunkHandler for processing.
Next, let's look at the last mode. Partitioning Step: the partitioning mode requires some knowledge of the structure of the input data, such as the range of primary keys or the names of the files to be processed.
The advantage of this pattern is that each partition's processor runs like the single step of an ordinary Spring Batch job, without having to implement any special or new pattern, which makes partitions easy to configure and test.
Partitioning brings the following benefits:
Partitioning enables finer-grained scaling;
High-performance data splitting can be built on top of partitioning;
Partitioning usually scales better than remote chunking;
After partitioning, the processing logic supports both local and remote modes;
A partitioning operation typically breaks down into two phases: data partitioning and partition processing.
Data partitioning: the data is split into reasonable slices according to specific rules (for example, by file name, by a unique data identifier, or by a hashing algorithm), and for each slice an execution context (ExecutionContext) and a step executor (StepExecution) are generated. Custom partitioning logic is produced through the Partitioner interface. By default the Spring Batch framework provides org.springframework.batch.core.partition.support.MultiResourcePartitioner for multiple files, and you can also implement the Partitioner interface yourself to define custom partitioning logic.
Partition processing: once the data is partitioned, the different slices are assigned to different step executors, which must then be handed to the partition handler; the handler can execute the partitioned job steps either locally or remotely. The PartitionHandler interface defines the partition-processing logic. By default the Spring Batch framework provides the local multithreaded partition handler org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler, and you can also implement the PartitionHandler interface yourself to define custom partition-processing logic.
The Spring Batch framework ships with support for file partitioning: the implementation class org.springframework.batch.core.partition.support.MultiResourcePartitioner provides default support for partitioning by file name, assigning different files to different partitions — which improves processing speed and efficiency when a large number of small files must be processed.
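A sketch of file partitioning (the input path and the fileReader's dependencies are hypothetical; MultiResourcePartitioner exposes each file in the step context under the key fileName):

```xml
<batch:job id="partitionJob">
    <batch:step id="masterStep">
        <!-- one worker StepExecution per matched file, run on a thread pool -->
        <batch:partition step="workerStep" partitioner="filePartitioner">
            <batch:handler grid-size="5" task-executor="partitionTaskExecutor"/>
        </batch:partition>
    </batch:step>
</batch:job>

<bean id="filePartitioner"
      class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
    <property name="resources" value="file:/data/input/*.csv"/>
</bean>

<batch:step id="workerStep">
    <batch:tasklet>
        <batch:chunk reader="fileReader" writer="itemWriter" commit-interval="100"/>
    </batch:tasklet>
</batch:step>

<!-- late binding: each partition reads the file assigned to it -->
<bean id="fileReader" scope="step"
      class="org.springframework.batch.item.file.FlatFileItemReader">
    <property name="resource" value="#{stepExecutionContext['fileName']}"/>
    <property name="lineMapper" ref="lineMapper"/>
</bean>
```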
The example assigns different files to different job steps: using MultiResourcePartitioner means each file is assigned to its own partition. If other partitioning rules are needed, you can customize and extend by implementing the Partitioner interface — interested readers could, for instance, implement their own database-based partitioning capability.
To summarize: the batch framework provides four distinct extensibility capabilities, each with its own usage scenario; choose according to your actual business needs.
Shortcomings of the batch-processing framework and how to enhance it
The Spring Batch framework offers four different ways to monitor jobs, but judging from current usage none of them is particularly friendly:
Viewing the database directly — for managers, genuinely hard to look at;
Implementing custom queries through the API — a programmer's paradise, but an operator's hell;
The provided web console for job monitoring and operations — its current features are too bare to use directly in production;
JMX queries — far too unfriendly for non-developers;
Moreover, for enterprise-level batch applications, the batch framework alone only covers the rapid development and execution of batch jobs.
Enterprises need a unified batch-processing platform to handle complex enterprise batch applications. Such a platform must solve unified job scheduling, centralized management and control of batch jobs, unified monitoring of batch jobs, and other capabilities.
So what would a complete solution look like?
The enterprise batch platform can build on the Spring Batch framework and introduce a scheduling framework to schedule tasks according to the enterprise's needs;
Enrich the current Spring Batch Admin framework (Spring Batch's management and monitoring platform, currently rather weak) to provide unified job management and to strengthen job monitoring, alerting, and related capabilities;
Integrate sensibly with the enterprise's organization, permission-management, and authentication systems to strengthen the platform's permission control and security management over jobs.