1. Data-related considerations
Analyzing the source data is an important step in determining data quality, storage methods, source data types, and the preparation required.
1.1 Clean and merge data
Data preprocessing can improve performance:
• If the source data contains only the information the model requires, Transformer reads it faster. For example, if a data source contains unused columns, Transformer spends extra time processing them even though they do not appear in the model.
• Merging data can reduce the number of records read. The fewer records Transformer reads, the less time it takes to generate the PowerCube.
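As an illustration of the second point, here is a minimal sketch of pre-merging a delimited text source so that rows sharing the same dimension values become a single record before Transformer reads the file. This is plain Python, not a Cognos tool, and the column names are illustrative:

```python
# Pre-merge rows of a delimited source: rows with identical dimension
# values (State, City) are combined into one record, so the consumer
# reads fewer records. Column layout (measure, state, city) is illustrative.
def merge_records(rows):
    """Sum the measure for rows with identical (state, city) keys."""
    merged = {}
    for measure, state, city in rows:
        key = (state, city)
        merged[key] = merged.get(key, 0) + int(measure)
    return [(total, state, city) for (state, city), total in merged.items()]

rows = [
    ("1", "Wisconsin", "Green Bay"),
    ("2", "Wisconsin", "Green Bay"),
    ("3", "Wisconsin", "Appleton"),
]
print(merge_records(rows))  # three input records become two
```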
1.1.1 Tips
• When designing a data source for use in Transformer, include only the columns required to build the model, to minimize processing time. Every column that is not required adds to the time needed to process the data source.
• Where possible, maintain the category structure in the Transformer model to avoid the redundant processing required to rebuild it.
• If building the category structure of the model takes a long time, we recommend generating PowerCubes from a saved model that already contains the categories.
1.2 Timing
The timing controls (on the data source property sheet) determine when Transformer processes each data source.
A structured data source must be executed at least once to create the category structure in the model. After that, if it does not need to run during PowerCube generation (no new categories are being added from the data source, and the model containing the categories has been saved), you can set the data source's timing so that it is not re-executed during cube creation.
Other structured data sources represent a changing structure that must be refreshed each time the PowerCube is generated. Set the timing of this type of data source so that it runs during the category-generation phase of cube creation.
Transactional data sources supply new data for the level values each time the PowerCube is generated. They are executed during cube creation to provide the measure values.
1.3 Verify category uniqueness and maximize data access speed
On the data source property sheet there are two settings for uniqueness verification. The default is Verify Category Uniqueness. This setting is recommended for data sources whose columns are associated with levels marked unique in a dimension. These are usually structured data sources.
If Verify Category Uniqueness is set and Transformer detects two categories with the same source value in a level marked unique (a level property), the following error is returned:
(TR2317) The level 'City' is designated as unique. Source value 'Green Bay' was used in an attempt to create a category in the path (By State, Illinois, Green Bay). 'Green Bay' already exists in level 'City' in the path (By State, Wisconsin, Green Bay).
(TR0136) A uniqueness violation was detected. The process has been aborted.
In this example, the City level of the By State dimension is marked unique. The error indicates that a second instance of Green Bay was found in the City level (under Illinois in this example). For example, suppose your source data is as follows:
Measure, State, City
1, Wisconsin, Green Bay
2, Wisconsin, Appleton
3, Illinois, Green Bay
When Unique is not selected on the City level, the dimension view displays as follows:
When Unique is selected on the City level, processing is interrupted and the dimension view displays as follows:
If you are certain that the values in the model's data sources map to unique categories in the level, you can set the Maximize Data Access Speed property instead. With this property enabled, uniqueness verification is kept to a minimum and data source processing is faster: Transformer no longer checks every incoming value against existing category values, which can improve performance significantly.
Warning! If Maximize Data Access Speed is enabled and the data violates uniqueness, Transformer does not notify you. This can result in missing categories and inaccurate values in the PowerCube.
In the same example, if Maximize Data Access Speed is enabled and the City level is marked unique, Transformer does not warn you that Green Bay exists in two different states (Wisconsin and Illinois). The result in PowerPlay is as follows:
Note: The preceding crosstab does not contain Illinois.
If you remove the Unique setting from the City level and rebuild the cube, the result in PowerPlay is:
Note: When Maximize Data Access Speed is set, uniqueness verification is not performed.
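Before enabling Maximize Data Access Speed, it can be worth checking the source data yourself for uniqueness violations. A minimal sketch in plain Python (not part of Transformer; the (measure, state, city) layout matches the example above):

```python
# Flag values of a "unique" level (City) that appear under more than one
# parent (State) -- exactly the situation Transformer no longer reports
# once Maximize Data Access Speed is enabled.
from collections import defaultdict

def find_uniqueness_violations(rows):
    """rows: (measure, state, city) tuples. Returns {city: [states]} for
    cities that occur under more than one state."""
    parents = defaultdict(set)
    for _measure, state, city in rows:
        parents[city].add(state)
    return {city: sorted(states)
            for city, states in parents.items() if len(states) > 1}

rows = [
    (1, "Wisconsin", "Green Bay"),
    (2, "Wisconsin", "Appleton"),
    (3, "Illinois", "Green Bay"),
]
print(find_uniqueness_violations(rows))  # {'Green Bay': ['Illinois', 'Wisconsin']}
```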
1.4 Multi-processing
If the computer that generates the PowerCube has two CPUs, you can use the multi-processing function. Enabling it can greatly improve overall PowerCube generation performance in the data-read phase.
Multi-processing applies only to the following data source types:
• Impromptu Query Definition (IQD)
• Delimited-field text
• Delimited-field text with column titles
You can set this option in the data source properties dialog box:
1.5 Incremental update
If rebuilding the entire cube every time is not feasible, incremental update is a good solution. An incremental update adds only the latest data to the existing PowerCube and does not reprocess the earlier data. Because only a small amount of data is processed, an incremental update is much faster than rebuilding the whole PowerCube.
Consider the incremental update function only if the PowerCube structure (dimensions, levels, and so on) is static. If the structure changes, you must regenerate the cube with all the data.
We recommend that you recreate the PowerCube periodically. When a cube is first created, the auto-partitioning function distributes the categories of the dimensions and levels across multiple partitions. After that, all new categories are added to partition 0. If many categories are added over time, PowerCube users will eventually encounter performance problems. Regenerating the PowerCube with all current categories lets Transformer design a new partitioning scheme. The following example shows a schedule with a complete rebuild after every four incremental updates:
Build  Processing behavior
1      Initial load (full)
2      Incremental update 1
3      Incremental update 2
4      Incremental update 3
5      Incremental update 4
6      Full load (initial data plus incremental updates 1 through 4)
7      Incremental update 5
8      Incremental update 6
...
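The schedule above follows a simple rule: the first build is a full load, and every fifth build thereafter (that is, after every four incremental updates) is a full rebuild. A sketch in Python (the function name is ours, not a Transformer API):

```python
# Decide whether a given build should be a full rebuild or an
# incremental update, with a full rebuild after every four incrementals.
def is_full_build(build_number, incrementals_between=4):
    """Build 1 is a full load; then every (incrementals_between + 1)th build."""
    return (build_number - 1) % (incrementals_between + 1) == 0

schedule = ["full" if is_full_build(n) else "incremental" for n in range(1, 9)]
print(schedule)  # builds 1 and 6 are full loads
```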
1.6 Setting the Transformer environment
This section lists settings to consider for optimal Transformer performance on Windows NT:
• WriteCacheSize: depending on the amount of available memory, the write cache value can have a positive or negative effect on PowerCube generation time. Performance is best when there is enough physical memory for the write cache to grow to the same size as the PowerCube.
You can modify this setting in Configuration Manager under Services > PowerPlay Data Services > Cache. The default value is 8192 KB (8 MB). Adjust the value in increments of 1024. On a large system, increasing the write cache to 32768 (32 MB) or 65536 (64 MB) can improve performance; increasing it to a very large value (for example, 102400, or hundreds of megabytes), however, can hurt performance.
• SortMemory: sets the amount of physical memory available for sorting data. Transformer sorts data during data merging and auto-partitioning.
The value is the number of 2 KB blocks used for sorting. For example, a value of 5120 provides 5120 × 2 KB = 10 MB of memory. The default is 512. You can change it in Configuration Manager under Services > UDA > General. Setting it to 5120 is a good choice.
• TempFileDirs: specifies where Transformer creates temporary sort files. These files are created whenever Transformer performs a sort operation.
You can change the location in Configuration Manager under Services > UDA > General. You can specify multiple directories, separated by semicolons.
• MaxTransactionNum: Transformer inserts checkpoints at each stage of PowerCube generation. This maximum-transactions-per-commit setting limits the number of records held before a checkpoint is inserted. The default is MaxTransactionNum=500000; the value is the maximum number of records Transformer processes before committing changes to the PowerCube. You can change the default on the General tab of the Transformer Preferences dialog box.
If an error occurs during cube generation (for example, "(TR0112) There isn't enough memory available"), reduce the MaxTransactionNum value so that commits happen more often and disk space is released sooner.
Raising the value (for example, to 800000) can improve cube generation time; the result depends on the environment.
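Note that the two memory settings above use different units: WriteCacheSize is expressed in KB, while SortMemory counts 2 KB blocks. A small Python sketch of the conversions (the helper names are ours, for illustration only):

```python
# Convert the raw setting values to megabytes for easy comparison.
def write_cache_mb(kilobytes):
    """WriteCacheSize is in KB: 8192 -> 8 MB (the default)."""
    return kilobytes / 1024

def sort_memory_mb(blocks):
    """SortMemory is a count of 2 KB blocks: 5120 -> 10 MB."""
    return blocks * 2 / 1024

print(write_cache_mb(65536))  # 64.0 (MB)
print(sort_memory_mb(5120))   # 10.0 (MB)
```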
Note: The ReadCacheSize setting is not relevant to Transformer. It applies only to PowerPlay Enterprise Server and the PowerPlay client.
1.7 Preference file settings
Several preference settings are available. The following are the most common:
• ModelWorkDirectory=<path>
Sets the location where temporary files are created during model design. These files are used to recover the model from the last strategic checkpoint if a severe error occurs during cube creation. Their file extension is .qyi. The default is the value of ModelSaveDirectory.
• DataWorkDirectory=<path1;path2;...>
Sets the locations where Transformer creates temporary work files while generating a cube. You can list multiple drives to avoid operating-system file-size limits: Transformer writes its temporary files to the drives and directories in the path list you specify and treats them as one logical file, regardless of which drive each is on. The default is the value of CubeSaveDirectory.
• DataSourceDirectory=<path>
For data source files other than IQD files and Architect models, specifies where Transformer searches for the files. The default is the current working directory.
• CubeSaveDirectory=<path>
Specifies where Transformer saves cubes. The default is the value of ModelSaveDirectory.
• ModelSaveDirectory=<path>
Sets the location where Transformer saves models. The default is the current working directory.
The following example shows these settings in a Transformer log file:
PowerPlay Transformer Wed Sep 19 09:39:17 2001
LogFileDirectory=C:/transformer/logs
ModelSaveDirectory=C:/transformer/models/
DataSourceDirectory=C:/transformer/data/
CubeSaveDirectory=E:/transformer/cubes/
DataWorkDirectory=D:/temp/
ModelWorkDirectory=E:/temp/
The following example shows how to specify the preference file on the command line:
trnsfrmr -n -fC:/preferences.prf model.mdl
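The preference file is a plain-text list of name=value settings like those shown in the log above. A sketch of what C:/preferences.prf might contain (the paths are illustrative, not required values):

```text
ModelSaveDirectory=C:/transformer/models/
CubeSaveDirectory=E:/transformer/cubes/
DataWorkDirectory=D:/temp;E:/temp
ModelWorkDirectory=E:/temp/
```

Here DataWorkDirectory lists two directories separated by a semicolon, as described above, so the temporary work files can span two drives.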
1.7.1 Tips
• Preference settings specified on the command line override and take precedence over other settings. For example, if environment settings are defined in the rsserver.sh file, a preference file specified on the command line overrides them.
• Environment variables such as TMPDIR, TEMP, and TMP can define where Transformer creates temporary files. Transformer uses the first of these variables that is defined. They are system environment variables defined by the operating system.
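The lookup order described in the second tip can be sketched as follows (a plain-Python illustration, not Transformer's actual code):

```python
# Return the first defined of TMPDIR, TEMP, TMP -- the order in which
# the temporary-file location is resolved, per the tip above.
import os

def temp_dir(env=None):
    env = os.environ if env is None else env
    for name in ("TMPDIR", "TEMP", "TMP"):
        if name in env:
            return env[name]
    return None

print(temp_dir({"TMP": "C:/tmp", "TEMP": "D:/tmp"}))  # 'D:/tmp' -- TEMP wins over TMP
```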