Use the Sample parameter of DataPump for data sampling.
Data Pump Datapump is a convenient data logical backup and restoration tool launched by Oracle. Compared with the traditional Exp/Imp, Datapump provides many advantages in both performance and functionality. In practice, big data sampling is a frequent business requirement. For a massive data table, you can extract representative data volumes to complete the corresponding work.
In the select statement, we can use the sample statement to implement some corresponding functions. With Datapump, we can also back up data tables in the sampling mode.
1. Environment Description
We chose Oracle 11gR2 for testing. The specific version is 11.2.0.4.
SQL> select * from v $ version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0-64bit Production
PL/SQL Release 11.2.0.4.0-Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0-Production
NLSRTL Version 11.2.0.4.0-Production
Create a data table SM_TEST.
SQL> create table scott. sm_test as select * from dba_objects;
Table created
SQL> select count (*) from scott. sm_test;
COUNT (*)
----------
133391
2. Test the experiment
There are few official interpretations of the Sample parameter of Expdp.
SAMPLE
Percentage of data to be exported.
Datapump can work in three modes: full database mode, Schema mode, and table mode. You can use the Sample parameter to set the three modes. The following describes how to set the sampling ratio by specifying the simplest ratio value.
[Oracle @ localhost ~] $ Expdp \ "/as sysdba \" dumpfile = TEST_1.dmp tables = scott. sm_test sample = 50
Export: Release 11.2.0.4.0-Production on Mon Nov 23 13:50:02 2015
Copyright (c) 1982,201 1, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0-64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYS". "SYS_EXPORT_TABLE_01": "/******** as sysdba" dumpfile = TEST_1.dmp tables = scott. sm_test sample = 50
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 8 MB
Processing object type TABLE_EXPORT/TABLE
.. Exported "SCOTT". "SM_TEST" 6.402 MB 66587 rows
Master table "SYS". "SYS_EXPORT_TABLE_01" successfully loaded/unloaded
**************************************** **************************************
Dump file set for SYS. SYS_EXPORT_TABLE_01 is:
/U01/app/admin/sicsdb/dpdump/TEST_1.dmp
Job "SYS". "SYS_EXPORT_TABLE_01" successfully completed at Mon Nov 23 13:50:11 2015 elapsed 0 00:00:07
In the table mode where the exported data table name is specified, the percentage parameter of sample is used to determine the content of a specific table. About 60 thousand of the exported data is exactly 0.13 million of the total data of 50%, which is a reasonable result.
If you run in schema mode and specify the proportional share directly, what is the result?
SQL> select count (*) from scott. emp;
COUNT (*)
----------
14
[Oracle @ localhost ~] $ Expdp \ "/as sysdba \" dumpfile = TEST_2.dmp schemas = scott sample = 50
Export: Release 11.2.0.4.0-Production on Mon Nov 23 13:51:31 2015
Copyright (c) 1982,201 1, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0-64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYS". "SYS_EXPORT_SCHEMA_01": "/******** as sysdba" dumpfile = TEST_2.dmp schemas = scott sample = 50
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 8.093 MB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE
Processing object type SCHEMA_EXPORT/TABLE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/TABLE/INDEX
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/VIEW
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
.. Exported "SCOTT". "SM_TEST" 6.406 MB 66611 rows
.. Exported "SCOTT". "DEPT" 5.859 KB 1 rows
.. Exported "SCOTT". "EMP" 8.359 KB 9 rows
.. Exported "SCOTT". "SALGRADE" 5.835 KB 3 rows
.. Exported "SCOTT". "BONUS" 0 KB 0 rows
Master table "SYS". "SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
**************************************** **************************************
Dump file set for SYS. SYS_EXPORT_SCHEMA_01 is:
/U01/app/admin/sicsdb/dpdump/TEST_2.dmp
Job "SYS". "SYS_EXPORT_SCHEMA_01" successfully completed at Mon Nov 23 13:51:48 2015 elapsed 0 00:00:16
The sampling ratio takes effect within the range of all data tables in the Schema. All data tables are sampled at a rate of 50%.
Finally, if we export multiple data tables and only sample some data tables, how do we specify the sample parameter?
[Oracle @ localhost ~] $ Expdp \ "/as sysdba \" dumpfile = TEST_3.dmp schemas = scott sample = scott. sm_test: 30
Export: Release 11.2.0.4.0-Production on Mon Nov 23 13:52:50 2015
Copyright (c) 1982,201 1, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0-64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SYS". "SYS_EXPORT_SCHEMA_01": "/******** as sysdba" dumpfile = TEST_3.dmp schemas = scott sample = scott. sm_test: 30
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 4.987 MB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE
Processing object type SCHEMA_EXPORT/TABLE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/TABLE/INDEX
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/VIEW
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
.. Exported "SCOTT". "SM_TEST" 3.861 MB 40146 rows
.. Exported "SCOTT". "DEPT" 5.960 KB 5 rows
.. Exported "SCOTT". "EMP" 8.562 KB 14 rows
.. Exported "SCOTT". "SALGRADE" 5.859 KB 5 rows
.. Exported "SCOTT". "BONUS" 0 KB 0 rows
Master table "SYS". "SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
**************************************** **************************************
Dump file set for SYS. SYS_EXPORT_SCHEMA_01 is:
/U01/app/admin/sicsdb/dpdump/TEST_3.dmp
Job "SYS". "SYS_EXPORT_SCHEMA_01" successfully completed at Mon Nov 23 13:53:00 2015 elapsed 0 00:00:09
In the above experiment, if you only sample a specific data table, you need to specify the data table name: sampling ratio.
3. Conclusion
According to my estimation, the sample parameter is mainly implemented by the sample statement that is passed to the select statement in expdp.