A Comprehensive Analysis of the Large-Data Batch Processing Framework Spring Batch

Source: Internet
Author: User
Tags: error handling, Java SE


Today, discussion of microservice architecture is in full swing, but enterprise architectures contain not only large volumes of OLTP transactions but also plenty of batch transactions. In financial institutions such as banks, there can be tens of thousands of batch jobs to process every day. For OLTP, the industry has a large number of open source frameworks and excellent architectural designs to rely on, but in the batch processing area such frameworks are rare. It is time to get to know the best frameworks and designs of the batch processing world, and today I will take Spring Batch as an example to explore it.

A first look at typical batch processing scenarios

Exploring the domain model and key architecture

Achieving job robustness and scalability

Deficiencies and enhancements of the batch processing framework

Typical batch processing business scenarios

Reconciliation is a typical batch processing business scenario: every financial institution's transactions and cross-host-system business involve reconciliation processes, for example large- and small-value payments, UnionPay transactions, People's Bank of China exchanges, cash management, POS business, ATM business, securities company fund accounts, and reconciliation between securities companies and the securities clearing company.

Here is an example scenario from one bank's day-end batch run.

The requirements involved include:

Each unit of the batch needs error handling and fallback;

Each unit runs on a different platform;

Branch selection is required;

Each unit needs monitoring and access to its processing log;

Multiple trigger rules should be provided: trigger by date, by calendar, or by cycle;

In addition, typical batch processing applies to the following business scenarios:

Periodically committed batch tasks (day-end processing)

Concurrent batch processing: processing tasks in parallel

Staged, enterprise-message-driven processing

Large-scale parallel processing

Manual or scheduled restart

Sequential processing of dependent tasks (extensible to workflow-driven batches)

Partial processing: skipping records (for example, on rollback)

Whole-batch transactions

Unlike OLTP-style transactions, a batch job has two typical characteristics: batch execution and automatic (unattended) execution. The former can handle the import, export, and business logic calculation of large volumes of data; the latter can execute batch tasks automatically, without human intervention.

Besides its basic functions, a batch job also needs to address the following points:

Robustness: the program does not crash because of invalid or bad data;

Reliability: tracking, monitoring, logging, and the related processing strategies (retry, skip, restart) make batch jobs execute reliably;

Scalability: concurrency and parallelism scale the application vertically and horizontally to meet the performance demands of processing massive amounts of data;

Given that the industry lacks better batch frameworks, Spring Batch is one of the industry's few outstanding batch processing frameworks (developed in Java); both SpringSource and Accenture contributed their wisdom to it.

Accenture has rich industry-level experience in batch architecture; it contributed a proprietary batch architecture framework that had been developed and used for decades, providing a wealth of reference experience for Spring Batch.

SpringSource brought deep technical understanding and the Spring framework's programming model, drawing on language features of JCL (Job Control Language) and COBOL. In 2013, batch processing was incorporated into the specification system as JSR-352 and included in Java EE 7. This means that all Java EE 7 application servers have batch processing capability; the first application server to implement the specification was GlassFish 4. Of course, it can also be used in Java SE.

But the most important point is this: the JSR-352 specification borrows heavily from the design ideas of the Spring Batch framework; from their core models and concepts it can be seen that the two conceptual models are essentially identical. The complete JSR-352 specification can be downloaded from https://jcp.org/aboutJava/communityprocess/final/jsr352/index.html.

The Spring Batch framework enables lightweight, robust parallel processing applications, supporting transactions, concurrency, monitoring, and vertical and horizontal scaling, and provides unified interfaces for management and task handling.

The framework provides core capabilities such as the following, so that developers can focus on business processing:

Clear separation between the batch execution environment and the application

Common core services provided as interfaces

Simple default implementations of the core execution interfaces, usable out of the box

Configuration, customization, and extension services based on the Spring framework

All default implementations of core services can easily be extended or replaced without affecting the infrastructure layer

A simple deployment model, built with Maven

Key domain model and architecture of batch processing

Let's start with the Hello World example, a typical batch job.

A typical job is divided into three parts: job reading, job processing, and job writing, which is also a classic three-step architecture. The entire batch framework is largely centered around Reader, Processor, and Writer. In addition, the framework provides a job scheduler and a job repository (the latter holds the job's metadata and supports two modes, in-memory and database).
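As a minimal sketch of such a three-part job, assuming Spring Batch's Java-based configuration (the job name, step name, and hard-coded input list are illustrative):

```java
import java.util.Arrays;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class HelloJobConfig {

    @Bean
    public Job helloJob(JobBuilderFactory jobs, Step helloStep) {
        return jobs.get("helloJob").start(helloStep).build();
    }

    @Bean
    public Step helloStep(StepBuilderFactory steps) {
        return steps.get("helloStep")
                // read, process, and write in chunks of 2 items per transaction
                .<String, String>chunk(2)
                .reader(new ListItemReader<>(Arrays.asList("World", "Spring Batch")))
                .processor((ItemProcessor<String, String>) item -> "Hello, " + item)
                .writer(items -> items.forEach(System.out::println))
                .build();
    }
}
```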

The complete domain concept model is shown in the following figure:

The Job Launcher (job scheduler) is the capability, provided by the Spring Batch infrastructure layer, to run jobs: given a job name and job parameters, you can execute a job through the Job Launcher.

You can invoke a batch task from a Java program through the Job Launcher, or from the command line, or from another framework (such as the scheduling framework Quartz).
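A sketch of such an invocation from a plain Java program, assuming the configuration class from the earlier sketch (with no DataSource bean, @EnableBatchProcessing typically falls back to an in-memory job repository):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;

public class HelloJobMain {
    public static void main(String[] args) throws Exception {
        AnnotationConfigApplicationContext ctx =
                new AnnotationConfigApplicationContext(HelloJobConfig.class);

        JobLauncher launcher = ctx.getBean(JobLauncher.class);
        Job job = ctx.getBean("helloJob", Job.class);

        // Job parameters identify the job instance: a new "runDate" value
        // creates a new JobInstance, the same value refers to the same one.
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", "2017-06-01")
                .toJobParameters();

        JobExecution execution = launcher.run(job, params);
        System.out.println("Status: " + execution.getStatus());
    }
}
```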

The Job Repository stores the metadata of job executions (metadata here means Job Instance, Job Execution, Job Parameters, Step Execution, Execution Context, and so on) and provides two default implementations.

One stores the metadata in memory, the other in a database. With metadata stored in a database, you can monitor the execution state of a batch job at any time, see whether an execution succeeded or failed, and restart a job after a failure. A Step represents one complete stage in a job; a job can consist of one or more steps.

The runtime model of the batch framework is also very simple:

Job Instance is a runtime concept: every execution of a job involves a Job Instance.

A Job Instance can come from two sources: it may be obtained from the Job Repository according to the given job parameters, and if no Job Instance matching those parameters exists in the Job Repository, a new Job Instance is created.

A Job Execution represents a handle to one run of a job, and a run may succeed or fail. The corresponding Job Instance is completed only when an execution succeeds; therefore, after failed executions, one Job Instance may correspond to multiple Job Executions.

Summing up the typical conceptual model of batch processing: the design is very concise, with ten concepts that fully support the entire framework.

The core capabilities provided at the job level include abstraction and inheritance of jobs (similar to object-oriented concepts) and the ability to restart a job after an abnormal execution.

At the job level, the framework also provides job orchestration, including sequential, conditional, and parallel orchestration.

A job is configured with multiple steps. Different steps can execute in sequence or selectively according to conditions (a condition is usually determined by a step's exit status), and jump rules are defined by the next element or the decision element;

To improve execution efficiency across multiple steps, the framework can execute steps in parallel (declared with split; in this case the steps usually must not depend on each other, otherwise business errors are likely). A step contains all the information needed by an actual batch task run; it can be a very simple business implementation or a very complex business process, and a step's complexity is usually determined by the business.
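A sketch of conditional and sequential orchestration with the Java-config flow API (assuming steps stepA, stepB, and errorStep are defined elsewhere in the same configuration class; the exit-status patterns play the role of next/decision rules):

```java
@Bean
public Job orchestratedJob(JobBuilderFactory jobs, Step stepA, Step stepB, Step errorStep) {
    return jobs.get("orchestratedJob")
            .start(stepA)
            .on("FAILED").to(errorStep)      // conditional jump on stepA's exit status
            .from(stepA).on("*").to(stepB)   // default transition: run stepB next
            .end()
            .build();
}
```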

Each step is composed of an ItemReader, an ItemProcessor, and an ItemWriter; depending on the business, the ItemProcessor can be omitted. The framework provides a large number of ItemReader and ItemWriter implementations, supporting data sources such as flat files, XML, JSON, databases, and messages.
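For instance, one of the built-in flat-file readers could be configured roughly as follows (the Transaction bean with accountId and amount properties, and the CSV layout, are hypothetical):

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.core.io.FileSystemResource;

public FlatFileItemReader<Transaction> transactionReader() {
    FlatFileItemReader<Transaction> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("transactions.csv"));

    // map each delimited line onto the properties of a Transaction bean
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames("accountId", "amount");

    BeanWrapperFieldSetMapper<Transaction> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Transaction.class);

    DefaultLineMapper<Transaction> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);
    reader.setLineMapper(lineMapper);
    return reader;
}
```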

The framework also provides step-level capabilities for restart, transactions, concurrency, commit interval, exception skipping, retry, and completion policy. Flexible step configuration can satisfy common business requirements. The three-part pattern (Reader, Processor, Writer) is a classic abstraction in batch processing.

As a batch-oriented process, a step provides the ability to read, process, and commit repeatedly.

In a chunk operation, the commit-interval property sets the number of records read before a commit. Setting a suitable commit-interval value reduces the commit frequency and the pressure on resources; each commit of a step is a complete transaction. By default, Spring's declarative transaction management model makes transaction orchestration convenient. The following is an example of declaring transactions on a step:
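A minimal sketch, assuming Java-based configuration, the transactionReader() from the earlier sketch, and an available PlatformTransactionManager; the attribute values (propagation, isolation, 30-second timeout, commit-interval of 100) are illustrative:

```java
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.interceptor.DefaultTransactionAttribute;

@Bean
public Step transactionalStep(StepBuilderFactory steps, PlatformTransactionManager txManager) {
    // fine-grained transaction settings for this step
    DefaultTransactionAttribute attribute = new DefaultTransactionAttribute();
    attribute.setPropagationBehavior(DefaultTransactionAttribute.PROPAGATION_REQUIRED);
    attribute.setIsolationLevel(DefaultTransactionAttribute.ISOLATION_DEFAULT);
    attribute.setTimeout(30);  // seconds

    return steps.get("transactionalStep")
            .<Transaction, Transaction>chunk(100)   // commit-interval: one transaction per 100 items
            .reader(transactionReader())
            .writer(items -> { /* write the chunk */ })
            .transactionManager(txManager)
            .transactionAttribute(attribute)
            .build();
}
```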

The framework's support capabilities for transactions include:

Chunks support transaction management, with the number of records per commit set by commit-interval;

Fine-grained transaction configuration for each tasklet: isolation level, propagation behavior, and timeout;

Rollback and no-rollback control, supported by skippable-exception-classes and no-rollback-exception-classes;

Transaction-level configuration for JMS queues;

In addition, Spring Batch makes a very concise abstraction of the framework's data model.

Only six tables are used to store all of the metadata (including job and step instances, contexts, and executions), which makes subsequent monitoring, restart, retry, and state recovery possible.

BATCH_JOB_INSTANCE: the job instance table, storing instance information of jobs.

BATCH_JOB_EXECUTION_PARAMS: the job parameters table, storing the parameter information of each job execution; the parameters effectively identify the corresponding job instance.

BATCH_JOB_EXECUTION: the job execution table, storing execution information of jobs, such as creation time, start time, end time, the executed job instance, and execution status.

BATCH_JOB_EXECUTION_CONTEXT: the job execution context table, storing the context information of each job execution.

BATCH_STEP_EXECUTION: the step execution table, storing information of each step execution, such as start time, end time, execution status, read/write counts, and skip counts.

BATCH_STEP_EXECUTION_CONTEXT: the step execution context table, storing the context information of each step execution.
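With the database-backed repository and these default table names, a simple monitoring query can be issued directly; a sketch using Spring's JdbcTemplate (the output format is illustrative):

```java
import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

public void printRecentExecutions(DataSource dataSource, String jobName) {
    JdbcTemplate jdbc = new JdbcTemplate(dataSource);
    jdbc.query(
        "SELECT e.JOB_EXECUTION_ID, e.START_TIME, e.END_TIME, e.STATUS, e.EXIT_CODE "
      + "FROM BATCH_JOB_EXECUTION e "
      + "JOIN BATCH_JOB_INSTANCE i ON e.JOB_INSTANCE_ID = i.JOB_INSTANCE_ID "
      + "WHERE i.JOB_NAME = ? ORDER BY e.START_TIME DESC",
        rs -> {
            // one line per execution: id, start -> end, status / exit code
            System.out.printf("%d %s -> %s [%s / %s]%n",
                rs.getLong("JOB_EXECUTION_ID"), rs.getTimestamp("START_TIME"),
                rs.getTimestamp("END_TIME"), rs.getString("STATUS"),
                rs.getString("EXIT_CODE"));
        },
        jobName);
}
```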

Achieving job robustness and scalability

Batch processing requires jobs to be robust: jobs usually process data in bulk, unattended, which requires that exceptions, errors, and execution progress be handled effectively during job execution.

A robust job typically has the following features:

1. Fault tolerance

Non-fatal exceptions during job execution should be handled with effective fault tolerance by the execution framework rather than failing the whole job; usually only fatal exceptions may terminate the execution of a job.

2. Traceability

Wherever an error occurs during job execution, an effective record is needed so that the error point can be handled efficiently later. For example, any record rows skipped during execution need to be logged effectively, so that application maintainers can process the skipped records.

3. Restartability

If job execution fails because of an exception, it should be possible to restart the job at the point of failure rather than from the beginning.

The framework supports all of the capabilities above, including Skip (skip the processing of a record), Retry (retry a given operation), and Restart (restart a failed job from the error point):

Skip: during data processing, if some records do not match the required format, skip allows those row records to be bypassed so that the processor can handle the remaining rows smoothly.

Retry: retries the given operation several times. In some cases, failures are caused by transient exceptions, such as network connection problems or concurrency conflicts. For such failures, a retry can avoid a one-off failure: by the next attempt the network may be back to normal and the contention gone. The retry capability thus effectively shields transient exceptions.

Restart: after a job execution fails, the job can be completed by restarting it. During a restart, the batch framework lets the job continue from the point where the last execution failed rather than from scratch, which greatly improves execution efficiency.
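A sketch combining these capabilities on one fault-tolerant step (the exception types and limits are illustrative; restartability itself comes from the database-backed job repository recording the step state):

```java
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.dao.DeadlockLoserDataAccessException;

@Bean
public Step faultTolerantStep(StepBuilderFactory steps) {
    return steps.get("faultTolerantStep")
            .<Transaction, Transaction>chunk(100)
            .reader(transactionReader())
            .writer(items -> { /* write the chunk */ })
            .faultTolerant()
            .skip(FlatFileParseException.class).skipLimit(10)              // skip malformed lines
            .retry(DeadlockLoserDataAccessException.class).retryLimit(3)   // retry transient failures
            .build();
}
```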

For scalability, the framework provides the following four modes:

Multithreaded Step: execute a single step with multiple threads;

Parallel Step: execute multiple steps in parallel using multiple threads;

Remote Chunking: execute distributed chunk operations on remote nodes;

Partitioning Step: partition the data and execute the partitions separately;

Let's take a look at the first one, the multithreaded step:

By default, the batch framework uses a single thread to execute a job. The framework also provides thread pool support (the multithreaded step mode), which allows chunks to be processed in parallel during step execution; parallel here means that the same step is executed in parallel by a thread pool. Using the tasklet attribute task-executor, an ordinary step easily becomes a multithreaded step.

An example implementation of a multithreaded step:
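A minimal sketch, assuming Java-based configuration (in XML the equivalent is the tasklet's task-executor attribute); SimpleAsyncTaskExecutor and the throttle limit of 4 are illustrative choices:

```java
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Step multiThreadedStep(StepBuilderFactory steps) {
    return steps.get("multiThreadedStep")
            .<Transaction, Transaction>chunk(100)
            .reader(transactionReader())            // must be thread-safe in this mode
            .writer(items -> { /* write the chunk */ })
            .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))
            .throttleLimit(4)                       // at most 4 concurrent chunk executions
            .build();
}
```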

Note that most of the ItemReader, ItemWriter, and other components provided by the Spring Batch framework are not thread-safe.

A thread-safe step can be achieved by explicit extension.

Here is a demonstration of such an extended implementation:

Requirement: implement a thread-safe step that batch-processes a data table, with restart capability enabled, i.e. the state of the batch is recorded at the point of execution failure.

For the database read component JdbcCursorItemReader in this example: when designing the database table, add a flag field to the table that marks whether the current record has been read and processed successfully. If processing succeeds, set flag=true; then, on the next re-read, records that were already read and processed successfully are skipped.
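A sketch of this flag-based approach (the TRANSACTION table, its PROCESSED column, and the Transaction bean are hypothetical; the flag provides the restart safety, so the reader's own saved state is disabled):

```java
import javax.sql.DataSource;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.jdbc.core.JdbcTemplate;

// Reader: only pick up rows that have not been processed yet.
public JdbcCursorItemReader<Transaction> flaggedReader(DataSource dataSource) {
    JdbcCursorItemReader<Transaction> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    reader.setSql("SELECT ID, ACCOUNT_ID, AMOUNT FROM TRANSACTION WHERE PROCESSED = 'N'");
    reader.setRowMapper(new BeanPropertyRowMapper<>(Transaction.class));
    reader.setSaveState(false);  // restart state comes from the flag, not the execution context
    return reader;
}

// Writer: mark each row as done in the same chunk transaction, so a
// restarted execution skips records that were already processed.
public ItemWriter<Transaction> flaggingWriter(JdbcTemplate jdbc) {
    return items -> {
        for (Transaction t : items) {
            // ... business write ...
            jdbc.update("UPDATE TRANSACTION SET PROCESSED = 'Y' WHERE ID = ?", t.getId());
        }
    };
}
```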

The multithreaded step provides the ability for multiple threads to execute one step, but this scenario is not actually used very much in real business.

A more common business scenario is that different steps in a job have no clear ordering and can therefore be executed in parallel.

Parallel Step: provides the ability to scale horizontally within a single node.

Usage scenario: job steps A and B are executed by different threads, and only after both have finished is step C executed.

The framework provides the parallel step capability: with the split element you can define parallel job flows and assign a thread pool to them.
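A sketch of this A/B-then-C scenario with the Java-config equivalent of split (the flow names and SimpleAsyncTaskExecutor are illustrative; stepA, stepB, and stepC are assumed to exist):

```java
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.support.SimpleFlow;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Job parallelJob(JobBuilderFactory jobs, Step stepA, Step stepB, Step stepC) {
    Flow flowA = new FlowBuilder<SimpleFlow>("flowA").start(stepA).build();
    Flow flowB = new FlowBuilder<SimpleFlow>("flowB").start(stepB).build();

    // run flowA and flowB on separate threads; continue only when both finish
    Flow splitFlow = new FlowBuilder<SimpleFlow>("splitFlow")
            .split(new SimpleAsyncTaskExecutor())
            .add(flowA, flowB)
            .build();

    return jobs.get("parallelJob")
            .start(splitFlow)
            .next(stepC)
            .build()    // builds the FlowJobBuilder
            .build();   // builds the Job
}
```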

The effect of the parallel step mode: each job step processes different records in parallel; in the original example, three job steps work on different data of the same table.

Parallel step provides horizontal processing within one node, but as the processing volume of a job grows, a single node may no longer be able to handle it; at that point, we can combine multiple machine nodes to complete one job through remote step.

Remote Chunking: the remote step technique essentially separates the read and the write of items; the read logic is usually executed in one node, while the write is distributed to other nodes for execution.

Remote chunking is a technique for dividing the work of a step; it does not require a clear understanding of the structure of the data being processed.

Any input source can be read by a single process and, after dynamic splitting, sent to remote worker processes as chunks.

The remote processes implement a listener pattern: they receive the request, process the data, and asynchronously return the processing results. The transport ensures that each request is delivered from the sender to exactly one consumer.

On the master node, the job step is responsible for reading the data and sending it over the remoting layer to the designated remote nodes for processing; after processing completes, the master collects the execution results from the remote side.

In the Spring Batch framework, the remote step task is accomplished through two core interfaces: ChunkProvider and ChunkProcessor.

ChunkProvider: produces batch chunk operations from a given ItemReader;

ChunkProcessor: obtains the chunk operations produced by the ChunkProvider and executes the concrete write logic;

Spring Batch has no default implementation for remote step, but we can use a Spring Integration (SI) or AMQP implementation to provide the remote communication capability.

Example of remote chunking mode based on SI:

The local step node is responsible for reading the data and sending requests to the remote step via a MessagingGateway; the remote step provides a queue listener which, when a message arrives in the request queue, fetches the request information and hands it to a chunk handler for processing.
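A condensed sketch of the two sides, assuming the spring-batch-integration module; the Spring Integration wiring of the request/reply channels to an actual message broker is omitted, and the channel beans are assumed to exist:

```java
import org.springframework.batch.core.step.item.SimpleChunkProcessor;
import org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter;
import org.springframework.batch.integration.chunk.ChunkProcessorChunkHandler;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.integration.core.MessagingTemplate;
import org.springframework.messaging.PollableChannel;

// Master side: used as the step's ItemWriter, it sends each chunk to the
// request channel and collects the asynchronous replies.
@Bean
public ChunkMessageChannelItemWriter<Transaction> chunkWriter(
        MessagingTemplate requestTemplate, PollableChannel replies) {
    ChunkMessageChannelItemWriter<Transaction> writer = new ChunkMessageChannelItemWriter<>();
    writer.setMessagingOperations(requestTemplate);  // template bound to the request channel
    writer.setReplyChannel(replies);                 // results come back here
    return writer;
}

// Worker side: a service activator on the request queue passes each chunk
// request to this handler, which runs the real processor and writer.
@Bean
public ChunkProcessorChunkHandler<Transaction> chunkHandler(
        ItemProcessor<Transaction, Transaction> processor, ItemWriter<Transaction> writer) {
    ChunkProcessorChunkHandler<Transaction> handler = new ChunkProcessorChunkHandler<>();
    handler.setChunkProcessor(new SimpleChunkProcessor<>(processor, writer));
    return handler;
}
```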

Finally, let's look at the last mode, the partitioning step: the partitioning mode requires some knowledge of the structure of the data, such as primary key ranges or the names of the files to be processed.

The advantage of this mode is that the processor of each partition element can run like a single step of an ordinary Spring Batch task, without implementing any special or new pattern, which makes it easy to configure and test.

Partitioning brings the following benefits:

Partitioning enables finer-grained scaling;

High-performance data splitting can be achieved on the basis of partitioning;

Partitioning is generally more scalable than remoting;

The processing logic after partitioning supports both local and remote modes;

A typical partitioning operation can be divided into two stages: data partitioning and partition processing.

Data partitioning: the data is split sensibly according to particular rules (for example by file name, by a unique data identifier, or by a hashing algorithm), and for each slice an execution context (ExecutionContext) and a step executor (StepExecution) are generated. Custom partitioning logic is produced through the Partitioner interface. The Spring Batch framework provides org.springframework.batch.core.partition.support.MultiResourcePartitioner as the default implementation for multiple files; you can also extend the Partitioner interface to implement custom partitioning logic.

Partition processing: after data partitioning, the different data slices have been assigned to different step executors; next, the work is handed over to the partition handler, which can run the assigned jobs locally or remotely. The PartitionHandler interface defines the partition processing logic. The Spring Batch framework provides org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler as the default local multithreaded partition handler; you can also extend the PartitionHandler interface to implement custom partition processing logic.

The Spring Batch framework provides support for partitioning by file: the implementation class org.springframework.batch.core.partition.support.MultiResourcePartitioner provides default support for file partitions, assigning files to different partitions by file name and processing them in parallel, which improves processing speed and efficiency; it suits scenarios with many small files to process.

The example assigns different files to different job steps: using MultiResourcePartitioner means each file is assigned to its own partition. If you have other partitioning rules, you can customize and extend by implementing the Partitioner interface. Interested readers can also implement database-based partitioning themselves.
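A sketch of file partitioning with a local multithreaded handler (the input file pattern and worker step are illustrative; each partition's ExecutionContext carries its file under the key fileName, which the worker step's reader can reference):

```java
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws Exception {
    Resource[] files = new PathMatchingResourcePatternResolver()
            .getResources("file:input/trans-*.csv");

    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(files);          // one partition (ExecutionContext) per file

    TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
    handler.setStep(workerStep);              // each partition runs its own workerStep execution
    handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
    handler.setGridSize(files.length);

    return steps.get("partitionedStep")
            .partitioner("workerStep", partitioner)
            .partitionHandler(handler)
            .build();
}
```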

To sum up, the batch framework provides four different scalability capabilities, each with its own usage scenario; we can choose among them according to actual business needs.

Deficiencies and enhancements of the batch processing framework

The Spring Batch framework provides four different ways of monitoring, but from the perspective of current usage none of them is very friendly:

Viewing the metadata directly through the database is, for administrators, truly hard to look at;
