In large enterprises, complex business processes, huge data volumes, varied data formats, and complex data interactions mean that not every operation can be handled through an interactive interface. Some operations require reading large quantities of data on a schedule and then carrying out a series of subsequent processing steps. This is "batch processing."
Batch applications typically have the following characteristics:
- Large data volumes, ranging from tens of thousands to millions or even hundreds of millions of records;
- The whole process is fully automated, with certain interfaces reserved for custom configuration;
- Such applications typically run periodically, for example daily, weekly, or monthly;
- High requirements on data-processing accuracy, calling for fault-tolerance mechanisms, rollback mechanisms, and thorough logging and monitoring.
What is Spring Batch
Spring Batch is a lightweight, comprehensive batch-processing framework designed to help large enterprises develop robust batch applications. Spring Batch provides many of the reusable features needed for processing large volumes of data, such as log tracking, transaction management, job execution statistics, job restart, and resource management. It also provides optimization and partitioning techniques for high-performance batch-processing tasks.
Its core features include:
- Transaction management
- Block-based processing
- Declarative input/output operations
- Start, stop, restart a task
- Retry/Skip a task
- Web-based administration interface
The author works in the CRM department of a large foreign financial company. In our daily work we frequently develop batch-processing applications and have accumulated rich experience with Spring Batch. I have recently taken the time to summarize these experiences.
Using Spring Batch 3.0 and Spring Boot
It is recommended that you use the latest Spring Batch 3.0 version. Compared with Spring Batch 2.2, it brings the following improvements:
- Support for the JSR-352 standard
- Support for Spring 4 and Java 8
- Enhanced Spring Batch Integration functionality
- Support for JobScope
- Support for SQLite
Support for Spring 4 and Java 8 is a major boost. It makes it possible to use the Spring Boot components introduced alongside Spring 4, giving development efficiency a qualitative leap. Introducing the Spring Batch framework now requires only one line in build.gradle:
```groovy
compile("org.springframework.boot:spring-boot-starter-batch")
```
With the enhanced Spring Batch Integration functionality, we can easily integrate with other components of the Spring family, invoke jobs in several different ways, and perform remote partitioning and remote chunk processing.
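For instance, beyond scheduled triggers, a job can be launched programmatically through the JobLauncher that Spring Boot auto-configures. The sketch below is only an illustration: the importJob bean and the run.timestamp parameter name are our own assumptions, not part of the original setup.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ImportJobRunner {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job importJob;   // hypothetical job bean, used only for illustration

    public void run() throws Exception {
        // Pass a timestamp so each launch creates a new JobInstance.
        JobParameters parameters = new JobParametersBuilder()
                .addLong("run.timestamp", System.currentTimeMillis())
                .toJobParameters();
        JobExecution execution = jobLauncher.run(importJob, parameters);
        System.out.println("Exit status: " + execution.getExitStatus());
    }
}
```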
With JobScope support, we can inject context information of the current job instance into an object at any time. As long as we set a bean's scope to job scope, information such as jobParameters and jobExecutionContext can be used whenever needed.
```xml
<bean id="..." class="..." scope="job">
    <property name="name" value="#{jobParameters[input]}" />
</bean>

<bean id="..." class="..." scope="job">
    <property name="name" value="#{jobExecutionContext['input.name']}.txt" />
</bean>
```
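The same job-scoped wiring can also be expressed in Java config. The following is a minimal sketch under our own assumptions (a FlatFileItemReader bean and an "input" job parameter, mirroring the XML above); it is not taken from the original configuration:

```java
import org.springframework.batch.core.configuration.annotation.JobScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class JobScopedBeans {

    // The bean is created lazily for each job execution, so the "input"
    // job parameter can be resolved at launch time.
    @Bean
    @JobScope
    public FlatFileItemReader<String> reader(
            @Value("#{jobParameters['input']}") String inputFile) {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource(inputFile));
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }
}
```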
Configuration using Java Config instead of XML
We used to configure both jobs and steps in XML, but over time we found many problems:
- The number of XML files grows rapidly, the configuration blocks are long and complex, and readability is poor;
- XML files lack syntax checking, so some low-level errors are only discovered when running integration tests;
- IDE support for navigating code referenced in XML files is limited.
We have come to realize that using pure Java classes for configuration is more flexible and type-safe, and IDE support is better. The fluent syntax used when building a job or step is also more intuitive than XML.
```java
@Bean
public Step step() {
    return stepBuilders.get("step")
            .<Partner, Partner>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .listener(logProcessListener())
            .faultTolerant()
            .skipLimit(10)
            .skip(UnknownGenderException.class)
            .listener(logSkipListener())
            .build();
}
```
In this example the step configuration is clear at a glance: the reader/processor/writer components, and which listeners are configured.
Using in-memory databases in local integration tests
Spring Batch requires database support at run time, because it needs a set of schemas in the database to store statistics about job and step runs. In local integration tests we can use the in-memory repository provided by Spring Batch to store task execution information, which avoids having to configure a database locally and speeds up job execution.
```xml
<bean id="jobRepository"
      class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager"/>
</bean>
```
We also add a dependency on HSQLDB in build.gradle:
```groovy
runtime('org.hsqldb:hsqldb:2.3.2')
```
Then add the DataSource configuration in the test class:
```java
@EnableAutoConfiguration
@EnableBatchProcessing
@DataJpaTest
@Import({DataSourceAutoConfiguration.class, BatchAutoConfiguration.class})
public class TestConfiguration {
}
```
And add the database initialization setting in the application.properties configuration:
```properties
spring.batch.initializer.enable=true
```
Reasonable use of the chunk mechanism
Spring Batch uses a chunk-based mechanism when configuring a step: data is read one item at a time and processed one item at a time, and once a certain number of items has accumulated they are written out to the writer in one go. This maximizes write efficiency, and the whole transaction is scoped to the chunk.
When we write data to a file, a database, and so on, we can set the chunk size appropriately to maximize write efficiency. But in scenarios where the write operation actually invokes a web service or sends a message to a message queue, we need to set the chunk size to 1. This way each write is handled promptly, and an exception inside a chunk does not cause the service to be invoked repeatedly or the message to be sent repeatedly when a retry occurs.
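As an illustration of the second case, a writer that pushes each item to an external system might look like the sketch below; the Partner type comes from the earlier step example, while CrmServiceItemWriter and sendToCrmService are hypothetical names of our own.

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;

// Hypothetical writer that pushes each item to an external web service.
public class CrmServiceItemWriter implements ItemWriter<Partner> {

    @Override
    public void write(List<? extends Partner> items) throws Exception {
        // With chunk(1) this list holds a single item, so a retry after a
        // failure never re-sends items that have already succeeded.
        for (Partner item : items) {
            sendToCrmService(item);
        }
    }

    private void sendToCrmService(Partner item) {
        // call the web service or post to the message queue here
    }
}
```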
Use listeners to monitor job execution and handle problems in a timely manner
Spring Batch provides a number of listeners to fully monitor job execution.
At the job level, Spring Batch provides the JobExecutionListener interface, which supports additional processing at the start or end of a job. At the step level, Spring Batch provides the StepExecutionListener, ChunkListener, ItemReadListener, ItemProcessListener, ItemWriteListener, and SkipListener interfaces; the retry and skip operations also come with RetryListener and SkipListener.
Typically we implement a JobExecutionListener for each job, and in its afterJob method we output job execution information, including execution time, job parameters, exit code, the steps executed, and the details of each step. This keeps developers, testers, and operations staff informed about the overall job execution.
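A minimal sketch of such a listener might look like the following; the class name and the exact fields logged are our own choices:

```java
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.listener.JobExecutionListenerSupport;

// Logs a summary of the finished job: parameters, exit code and per-step counts.
public class JobReportListener extends JobExecutionListenerSupport {

    @Override
    public void afterJob(JobExecution jobExecution) {
        System.out.println("Job " + jobExecution.getJobInstance().getJobName()
                + " finished with exit status " + jobExecution.getExitStatus().getExitCode());
        System.out.println("Parameters: " + jobExecution.getJobParameters());
        System.out.println("Started: " + jobExecution.getStartTime()
                + ", ended: " + jobExecution.getEndTime());
        for (StepExecution step : jobExecution.getStepExecutions()) {
            System.out.println("Step " + step.getStepName()
                    + ": read=" + step.getReadCount()
                    + ", written=" + step.getWriteCount()
                    + ", skipped=" + step.getSkipCount());
        }
    }
}
```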
If a step performs skip operations, we also implement a SkipListener for it and record the skipped data entries in it for later processing.
There are two ways to implement a listener: one is to implement the corresponding interface, such as the JobExecutionListener interface; the other is to use annotations. In practice we find annotations better, because with an interface you have to implement all of its methods, while with annotations you only need to annotate the relevant method.
The following class takes the interface approach. We can see that only the first method is actually used; the second and third are not, yet we still have to provide empty implementations.
```java
public class CustomSkipListener implements SkipListener<String, String> {

    @Override
    public void onSkipInRead(Throwable t) {
        // business logic
    }

    @Override
    public void onSkipInWrite(String item, Throwable t) {
        // no need
    }

    @Override
    public void onSkipInProcess(String item, Throwable t) {
        // no need
    }
}
```
With annotations, this can be shortened to:
```java
public class CustomSkipListener {

    @OnSkipInRead
    public void onSkipInRead(Throwable t) {
        // business logic
    }
}
```
Using retry and skip to enhance the robustness of batch jobs
Exceptions are unavoidable when processing millions of records. If an exception causes the entire batch to terminate, the remaining data cannot be processed. Spring Batch has built-in retry and skip mechanisms to help us handle various exceptions with ease. Exceptions suitable for retry are those that may disappear over time, for example the database currently holds a lock and cannot be written to, or the web service is currently unavailable or overloaded. We can configure the retry mechanism for such exceptions. Other exceptions should not be retried, such as file-parsing errors, because they will keep failing no matter how many times they are retried.
We can set the skip option for specified exceptions to ensure that subsequent data can continue to be processed even when retries repeatedly fail, without failing the entire step. We can also configure the skipLimit option to terminate the entire job promptly once the number of skipped entries reaches a certain threshold.
Sometimes we need to do something in each retry interval, such as lengthening the wait time or restoring the operating environment; Spring Batch provides BackOffPolicy for this purpose. Below is an example of a step configured with the retry mechanism, the skip mechanism, and a BackOffPolicy.
```java
@Bean
public Step step() {
    return stepBuilders.get("step")
            .<Partner, Partner>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .faultTolerant()
            .retryLimit(5)
            // the concrete exception types and backOffPolicy bean are illustrative
            .retry(ServiceUnavailableException.class)
            .backOffPolicy(backOffPolicy())
            .skipLimit(10)
            .skip(ServiceUnavailableException.class)
            .listener(logSkipListener())
            .build();
}
```
Use a custom decider to implement job flow
Job execution is not necessarily sequential; we often need to decide the next step based on a step's output data or execution result. We used to put such judgments in the downstream step, which could cause some steps to run without actually doing anything. For example, one step records failed data entries in a report, and the next step checks whether a report was generated: if so, it sends the report to the designated contacts; if not, it does nothing. In such cases, the job flow can be implemented through the decider mechanism. In Spring Batch 3.0 the decider has been separated from the step and sits at the same level as the step.
```java
public class ReportDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        if (report.isExist()) {
            return new FlowExecutionStatus("SEND");
        }
        return new FlowExecutionStatus("SKIP");
    }
}
```
The decider can then be used in the job configuration. This makes the whole job flow clearer and easier to understand.
```java
public Job job() {
    return new JobBuilder("petstore")
            .start(orderProcess())
            .next(reportDecider)
            .on("SEND").to(sendReportStep)
            .on("SKIP").end()
            .build()
            .build();
}
```
Use a variety of mechanisms to accelerate job execution
Batch processing deals with large amounts of data, and the execution time window is usually tight, so there are many situations in which we need to speed up job execution. Generally there are four ways to achieve this:
- Multi-threaded execution of tasks within a single step
- Parallel execution of different steps
- Parallel execution of the same step
- Remote execution of chunk tasks
Multi-threaded execution of tasks within a single step can be achieved with the help of a TaskExecutor. This approach is suitable when the reader and writer are thread-safe and stateless. We can also set the number of concurrent threads.
```java
public Step step() {
    return stepBuilders.get("step")
            .tasklet(tasklet)
            .throttleLimit(20)
            .build();
}
```
The tasklet in the example above needs to be configured with a TaskExecutor. Spring Batch provides a simple multi-threaded TaskExecutor for us to use: SimpleAsyncTaskExecutor.
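For a chunk-oriented step, the multi-threaded configuration might look like the sketch below; it reuses the reader/processor/writer beans from the earlier examples, and the chunk size and throttle limit are assumptions of our own:

```java
@Bean
public Step multiThreadedStep() {
    return stepBuilders.get("multiThreadedStep")
            .<Partner, Partner>chunk(10)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            // SimpleAsyncTaskExecutor runs each chunk on its own thread;
            // throttleLimit caps the number of concurrent threads.
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .throttleLimit(20)
            .build();
}
```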
Parallel execution of different steps is easy to implement in Spring Batch; here is an example:
```java
public Job job() {
    return jobBuilders.get("parallelSteps")
            .start(step1)
            .split(asyncTaskExecutor).add(flow1, flow2)
            .next(step3)
            .build();
}
```
In this example, step1 is executed first, then flow1 and flow2 are executed in parallel, and finally step3 is executed.
Spring Batch provides PartitionStep to enable parallel processing of the same step across multiple processes. With PartitionStep and a PartitionHandler, one step can be expanded to many slave steps that run in parallel.
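A sketch of a locally partitioned step follows; the partitioner() and slaveStep() beans, the grid size, and the executor are assumptions for illustration rather than code from the article:

```java
@Bean
public Step masterStep() {
    return stepBuilders.get("masterStep")
            // "slaveStep" is the name under which the partitioned executions are recorded
            .partitioner("slaveStep", partitioner())
            .step(slaveStep())
            .gridSize(4)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}
```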
Remote execution of chunk tasks splits a step's processor work across multiple processes, which communicate with each other through some middleware (for example, messaging). This approach suits scenarios where the processor is the bottleneck while the reader and writer are not.
Conclusion
Spring Batch offers a sensible abstraction of batch-processing scenarios and encapsulates a great deal of practical functionality; developing batch applications with it can have a multiplier effect. In the process of using it, we still need to keep distilling best practices of our own in order to deliver high-quality, maintainable batch applications that meet the demanding requirements of enterprise-class software.