DataStage Sequence Articles
DataStage One, install
DataStage Two, InfoSphere Information Server process start and stop
DataStage Three, Configuring ODBC
DataStage Error Collection (continuously updated)
DataStage Four and Five contain many screenshots and were inconvenient to publish; readers interested in studying them, please contact me to request them.
DataStage Six, Installing and deploying a clustered environment
1 Purpose of the configuration file
The configuration file is read when a parallel job starts. If the $APT_CONFIG_FILE parameter is set in the job properties, DS reads the configuration file named by that parameter; if not, it reads the file named by the project-level $APT_CONFIG_FILE parameter; if neither is set, the default file default.apt is read. The configuration file tells DS how to allocate the system resources the job needs, such as logical processing node information, temporary (scratch) storage, and dataset storage; in some cases you can also configure more advanced resources, such as buffer storage. The advantage of using a configuration file is that when resources change (for example, nodes or disks are added), you do not have to modify or redesign the job for it to remain usable: you only modify the configuration file and set the $APT_CONFIG_FILE parameter in the job or project properties. There are two important factors to consider when creating a configuration file: logical processing nodes and optimizing parallelism.
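The lookup order described above can be sketched as a small helper. This is an illustration only, not a DataStage API: `resolve_config_file`, its parameters, and the default.apt path are hypothetical stand-ins.

```python
# Sketch of the $APT_CONFIG_FILE resolution order described above:
# job property first, then project property, then default.apt.
# All names and paths here are illustrative assumptions.
DEFAULT_APT = "/opt/IBM/InformationServer/Server/Configurations/default.apt"

def resolve_config_file(job_params, project_params, default=DEFAULT_APT):
    """Return the configuration file the engine would read for a job."""
    if job_params.get("APT_CONFIG_FILE"):
        return job_params["APT_CONFIG_FILE"]       # job property wins
    if project_params.get("APT_CONFIG_FILE"):
        return project_params["APT_CONFIG_FILE"]   # then project property
    return default                                 # otherwise default.apt

print(resolve_config_file({}, {"APT_CONFIG_FILE": "/etc/ds/4node.apt"}))
# -> /etc/ds/4node.apt
```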
2 Logical processing nodes
In a parallel job, a configuration file can define one or more logical processing nodes that the engine uses when the job runs. For example, the following defines two logical nodes:
```
{
  node "node1" {
    fastname "dsconductor01"
    pools ""
    resource disk "/disk2/IBM/EngineTier/Server/Datasets" {pools ""}
    resource scratchdisk "/disk2/IBM/EngineTier/Server/Scratch" {pools ""}
  }
  node "node2" {
    fastname "dsconductor01"
    pools ""
    resource disk "/disk2/IBM/EngineTier/Server/Datasets" {pools ""}
    resource scratchdisk "/disk2/IBM/EngineTier/Server/Scratch" {pools ""}
  }
}
```
Note: the processing nodes node1 and node2 are logical nodes, not physical nodes; they do not represent the number of physical CPUs, and one or more logical nodes can be defined on a single physical machine. The number of logical nodes determines how many parallel processes are created and how many resources are used when the job runs: the more logical nodes, the more UNIX processes are generated and the more memory and disk space consumed, for example by a sort operation. IBM's official documentation recommends creating a number of logical nodes equal to half the number of physical CPUs, but the right number depends on system configuration, resource availability, resource sharing, hardware, and job design; for example, if a job performs heavy IO or fetches a large amount of data from a database, you may want to define more logical nodes to complete the operation.
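IBM's rule of thumb above (logical nodes ≈ half the physical CPUs) is easy to compute as a starting point. The sketch below is only illustrative; the `io_heavy` doubling is my own assumption standing in for the "IO-heavy jobs may justify more nodes" advice, not an IBM formula.

```python
# Starting-point estimate of logical node count, per the guideline above.
# The io_heavy adjustment is an illustrative assumption, not an IBM rule.
def suggested_logical_nodes(physical_cpus, io_heavy=False):
    """Half the physical CPUs, at least 1; more for IO-heavy jobs."""
    nodes = max(1, physical_cpus // 2)
    return nodes * 2 if io_heavy else nodes

print(suggested_logical_nodes(8))                  # -> 4
print(suggested_logical_nodes(8, io_heavy=True))   # -> 8
```

Treat the result as a first draft of the configuration file, to be tuned against actual CPU, memory, and disk contention.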
3 Optimizing parallelism
The degree of parallelism depends on the number of logical processing nodes defined in the configuration file. Parallelism should be optimized for the system and the job design rather than maximized: more parallelism may distribute the load better, but it also increases the number of processes and the resources they consume, which raises the overall system load. More parallelism is therefore not always better; CPU, memory, disk, and job characteristics must all be considered so that parallelism improves efficiency without overloading the system.
**Note:** if a stage contains a join or a sort, you may use a partitioning method such as hash partitioning. If the number of distinct key values is close to or equal to the number of partitions, as in the data shown below, setting the degree of parallelism to that same number can significantly improve overall performance.
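A rough illustration of the point: with a hash partitioner, rows sharing a key always land on the same partition, so when the number of distinct key values matches the partition count, each logical node can own roughly one key's worth of data. The sketch below is a stand-in for the engine's partitioner, for illustration only (md5 substitutes for whatever hash function DataStage actually uses; the column names are invented):

```python
import hashlib

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # md5 gives a deterministic hash for this demo
        digest = hashlib.md5(str(row[key]).encode()).hexdigest()
        partitions[int(digest, 16) % num_partitions].append(row)
    return partitions

# 3 distinct keys hashed into 3 partitions: every "north" row ends up
# on the same partition, so 3-way parallelism matches the data's split.
rows = [{"region": r, "amount": i}
        for i, r in enumerate(["north", "south", "east"] * 4)]
parts = hash_partition(rows, "region", 3)
print(sum(len(p) for p in parts))  # -> 12 (no rows lost)
```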
4 Example of creating a configuration file
When DS is installed, the system creates the default.apt file, which is applied by default to every DS project. It is created with the following rules:
- number of logical nodes = 1/2 the number of physical CPUs
- disk and scratchdisk resources use subdirectories under the DS installation directory
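Concretely, on a small machine a freshly generated default.apt might look something like the following; the fastname and paths are illustrative, and your installation directory will differ:

```
{
  node "node1" {
    fastname "yourhostname"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}
```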
To optimize a parallel job you may want to create a new configuration file that makes full use of hardware and system resources, because different jobs need different resources. For example, a job that contains sorting and stages that save data to disk files requires heavy IO; by assigning multiple scratchdisks according to the system layout, sorting efficiency can be improved considerably. As an example, I built the following job:
- Select_bigdata reads 100 million rows from the source table;
- Sort_bigdata performs a complex sort on the source data;
- Tran_bigdata transforms the sorted data;
- finally, the data is saved to the target table.
I defined the following configuration file for the job:
```
/* begin of config */
{
  node "node1" {
    fastname "domain01"
    pools ""
    resource disk "/data/sharedata/datasets" {pools ""}
    resource scratchdisk "/data/scratch01" {pools ""}
  }
  node "node2" {
    fastname "domain01"
    pools ""
    resource disk "/data/sharedata/datasets" {pools ""}
    resource scratchdisk "/data/scratch02" {pools ""}
  }
  node "node3" {
    fastname "domain01"
    pools ""
    resource disk "/data/sharedata/datasets" {pools ""}
    resource scratchdisk "/data/scratch03" {pools ""}
  }
  node "node4" {
    fastname "domain01"
    pools ""
    resource disk "/data/sharedata/datasets" {pools ""}
    resource scratchdisk "/data/scratch04" {pools ""}
  }
}
/* end of entire config */
```
The configuration file defines 4 logical nodes, so when the job runs DS creates 4 parallel processes, and each logical node is assigned a different scratchdisk; the parallel processes can therefore write their temporary sort data across 4 scratchdisks. Before the job starts, checking the system's DS processes shows only one connected dsapi_slave:
```
phantom printer segments!
DSnum  Uid   Pid    Ppid   C  Stime  Tty  Time      Command
       sywu  12431  12430  0  11:02  ?    00:00:07  dsapi_slave 7 6 0 4
```
When the job starts running, DS reads the configuration file, then calculates and allocates resources:
```
4 phantom printer segments!
DSnum  Uid   Pid    Ppid   C  Stime  Tty  Time      Command
49932  sywu  15604  15564  0  11:52  ?    00:00:00  phantom DSD.OshMonitor Clu
49937  sywu  15599  15564  0  11:52  ?    00:00:00  phantom sh -c '/software/I
49972  sywu  15564  12431  1  11:52  ?    00:00:00  phantom DSD.RUN ClusterRea
53105  sywu  12431  12430  0  11:02  ?    00:00:10  dsapi_slave 7 6 0 4
```
These parallel processes handle different operations at the same time:
```
[domain01:dsadm] ps -ef | grep 15604
sywu  15604 15564  0 11:52 ?     00:00:00 phantom DSD.OshMonitor ClusterReadBigDataAndSaveToTab 15603 MSEVENTS.FALSE
dsadm 16816 17037  0 11:52 pts/0 00:00:00 grep 15604

[domain01:dsadm] ps -ef | grep 15599
sywu  15599 15564  0 11:52 ?     00:00:00 phantom sh -c '/software/IBM/InformationServer/Server/DSEngine/bin/OshWrapper RT_SCTEMP/ClusterReadBigDataAndSaveToTab.fifo RT_SC3/OshExecuter.sh R dummy -f RT_SC3/OshScript.osh -monitorport 13400 -pf RT_SC3/jpfile -impexp_charset UTF-8 -string_charset UTF-8 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF'
sywu  15602 15599  0 11:52 ?     00:00:00 /software/IBM/InformationServer/Server/DSEngine/bin/OshWrapper RT_SCTEMP/ClusterReadBigDataAndSaveToTab.fifo RT_SC3/OshExecuter.sh R dummy -f RT_SC3/OshScript.osh -monitorport 13400 -pf RT_SC3/jpfile -impexp_charset UTF-8 -string_charset UTF-8 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF
dsadm 16819 17037  0 11:53 pts/0 00:00:00 grep 15599

[domain01:dsadm] ps -ef | grep 15564
sywu  15564 12431  0 11:52 ?     00:00:00 phantom DSD.RUN ClusterReadBigDataAndSaveToTab 0/0/1/0/0/0/0/
sywu  15599 15564  0 11:52 ?     00:00:00 phantom sh -c '/software/IBM/InformationServer/Server/DSEngine/bin/OshWrapper RT_SCTEMP/ClusterReadBigDataAndSaveToTab.fifo RT_SC3/OshExecuter.sh R dummy -f RT_SC3/OshScript.osh -monitorport 13400 -pf RT_SC3/jpfile -impexp_charset UTF-8 -string_charset UTF-8 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF'
sywu  15604 15564  0 11:52 ?     00:00:00 phantom DSD.OshMonitor ClusterReadBigDataAndSaveToTab 15603 MSEVENTS.FALSE
dsadm 16822 17037  0 11:53 pts/0 00:00:00 grep 15564

[domain01:dsadm] ps -ef | grep 12431
sywu  12431 12430  0 11:02 ?     00:00:10 dsapi_slave 7 6 0 4
sywu  15564 12431  0 11:52 ?     00:00:00 phantom DSD.RUN ClusterReadBigDataAndSaveToTab 0/0/1/0/0/0/0/
dsadm 16826 17037  0 11:53 pts/0 00:00:00 grep 12431
```
Each parallel process writes temporary data to its corresponding scratchdisk:
```
[domain01:dsadm] ls /data/scratch01
tsort50d0a7cb  tsort50db8yql  tsort50d_dflj  tsort50dfy9vy  tsort50djuws6  tsort50dppmts
tsort50dtwskn  tsort50dxazhz  tsort50d0k6rw  tsort50dbd7lr  tsort50ddl1hj  tsort50dgm1pk
tsort50dk3cqp  tsort50dps_7p  tsort50dtwx6t  ...

[domain01:dsadm] ls /data/scratch02
tsort40d0a7cb  tsort40dbd7lr  tsort40ddrqkv  tsort40dgogbu  tsort40dkj8k3  tsort40d__rs8
tsort40duz4zu  tsort40d0k6rw  tsort40dbemco  tsort40degdyo  tsort40dgvgzs  tsort40dsamwf
tsort40dvawri  tsort40dzpg0g  ...

[domain01:dsadm] ls /data/scratch03
tsort90d0a7cb  tsort90dbd7lr  tsort90ddrqkv  tsort90dgogbu  tsort90dkj8k3  tsort90d__rs8
tsort90duz4zu  tsort90d0k6rw  tsort90dbemco  tsort90degdyo  tsort90dgvgzs  tsort90dsamwf
tsort90dvawri  tsort90dzpg0g  ...

[domain01:dsadm] ls /data/scratch04
tsort70d0a7cb  tsort70dbd7lr  tsort70ddrqkv  tsort70dgogbu  tsort70dkj8k3  tsort70d__rs8
tsort70duz4zu  tsort70d0k6rw  tsort70dbemco  tsort70degdyo  tsort70dgvgzs  tsort70dsamwf
tsort70dvawri  tsort70dzpg0g  tsort70d_0okx  tsort70dbo642  tsort70dep8o7  tsort70dh6oyi
tsort70dl9gnu  tsort70dsg2rc  tsort70dvur3y  tsort70dzrp8k  ...
```
4.1 SMP server configuration file
On a shared-memory, multi-processor system such as an SMP server, assume the system has 4 CPUs and 4 separate file system disks (/fdisk01, /fdisk02, /fdisk03, /fdisk04). To make better use of parallel resources, create the following configuration file:
```
/* begin of config */
{
  node "node1" {
    fastname "domain01"
    pools ""
    resource disk "/fdisk01/disk" {}
    resource disk "/fdisk02/disk" {}
    resource disk "/fdisk03/disk" {}
    resource disk "/fdisk04/disk" {}
    resource scratchdisk "/fdisk01/scratch" {}
    resource scratchdisk "/fdisk02/scratch" {}
    resource scratchdisk "/fdisk03/scratch" {}
    resource scratchdisk "/fdisk04/scratch" {}
  }
  node "node2" {
    fastname "domain01"
    pools ""
    resource disk "/fdisk01/disk" {}
    resource disk "/fdisk02/disk" {}
    resource disk "/fdisk03/disk" {}
    resource disk "/fdisk04/disk" {}
    resource scratchdisk "/fdisk01/scratch" {}
    resource scratchdisk "/fdisk02/scratch" {}
    resource scratchdisk "/fdisk03/scratch" {}
    resource scratchdisk "/fdisk04/scratch" {}
  }
  node "node3" {
    fastname "domain01"
    pools ""
    resource disk "/fdisk01/disk" {}
    resource disk "/fdisk02/disk" {}
    resource disk "/fdisk03/disk" {}
    resource disk "/fdisk04/disk" {}
    resource scratchdisk "/fdisk01/scratch" {}
    resource scratchdisk "/fdisk02/scratch" {}
    resource scratchdisk "/fdisk03/scratch" {}
    resource scratchdisk "/fdisk04/scratch" {}
  }
  node "node4" {
    fastname "domain01"
    pools ""
    resource disk "/fdisk01/disk" {}
    resource disk "/fdisk02/disk" {}
    resource disk "/fdisk03/disk" {}
    resource disk "/fdisk04/disk" {}
    resource scratchdisk "/fdisk01/scratch" {}
    resource scratchdisk "/fdisk02/scratch" {}
    resource scratchdisk "/fdisk03/scratch" {}
    resource scratchdisk "/fdisk04/scratch" {}
  }
}
/* end of entire config */
```
Such a configuration is useful when a job is complex, the right resource allocation is hard to determine in advance, and its stages require heavy IO: DS spreads work across the specified disks and scratchdisks, minimizing IO contention.
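The benefit of listing several scratchdisks per node comes from the engine cycling through them, so the temporary files of a large sort are spread across spindles instead of queueing on one. A toy sketch of that round-robin behavior (an illustration of the idea, not the engine's actual allocation code; file names are invented):

```python
from itertools import cycle

def assign_scratch_files(file_names, scratchdisks):
    """Spread temporary files across scratchdisks round-robin."""
    disks = cycle(scratchdisks)
    return {name: next(disks) for name in file_names}

disks = ["/fdisk01/scratch", "/fdisk02/scratch",
         "/fdisk03/scratch", "/fdisk04/scratch"]
plan = assign_scratch_files([f"tsort{i:02d}" for i in range(8)], disks)
print(plan["tsort00"], plan["tsort04"])  # both -> /fdisk01/scratch
```

With four scratchdisks, eight temp files land two per disk, which is the IO-spreading effect the SMP configuration above aims for.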
5 Summary
In summary, create a configuration file appropriate to the complexity of the job, the stages it contains, and the volume of data. Different operations need different system and hardware resources: sorting big data or saving it to files may need more IO resources, while transformation and logic processing may need more CPU. Create configuration files with these factors in mind. The more logical nodes a configuration file defines, the more efficient a parallel job may be, but the higher the load placed on the system.
-- The End (2015-11-11)
DataStage Seven, Allocating resources in DS using configuration files