This is an original work from the "Deep Blue blog". You are welcome to repost it; please credit the source when you do, otherwise the author reserves the right to pursue copyright liability.
Deep Blue blog: http://blog.csdn.net/huangyanlong/article/details/46041735
A temporary fix for deduplicating 800,000+ rows of data
Recently, while deduplicating a business table of a few million rows, the delete operation hung. The following is a brief review.
1. Query the row count of the business table. It returns a total of more than 2.5 million records.
SQL> select count(*) from tb_bj_banker_etl;
2552381
2. Query the number of duplicate rows that should be removed from the table. There are more than 830,000.
SQL> select count(*) from tb_bj_banker_etl
       where id in (select id from tb_bj_banker_etl group by id having count(*) > 1)
         and rowid not in (select max(rowid) from tb_bj_banker_etl group by id having count(*) > 1);
830099
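As an aside (a sketch I am adding here, not part of the original post): the same removable-duplicate count can be computed in one pass with an analytic function, avoiding the two GROUP BY subqueries, assuming id is the only deduplication key:
SQL> select count(*)
       from (select row_number() over (partition by id order by rowid) rn
               from tb_bj_banker_etl)
      where rn > 1;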
3. So, before leaving work that evening, I executed the following script to remove the duplicates:
SQL> delete from tb_bj_banker_etl
       where id in (select id from tb_bj_banker_etl group by id having count(*) > 1)
         and rowid not in (select max(rowid) from tb_bj_banker_etl group by id having count(*) > 1);
SQL> commit;
4. The next day, when I arrived on site, I found that the statement launched the night before was still executing in PL/SQL Developer.
My first thought: a whole night was not enough to delete 800,000+ duplicate rows? Something had to be wrong.
I suspected a table lock.
So I queried whether any session was holding a lock on the table:
SELECT A.OWNER,             -- owner of the locked object
       A.OBJECT_NAME,       -- object name
       B.XIDUSN, B.XIDSLOT, B.XIDSQN,
       B.SESSION_ID,        -- session holding the lock
       B.ORACLE_USERNAME,   -- Oracle username of the locking session
       B.OS_USER_NAME,      -- OS login name of the locking session
       B.PROCESS,
       B.LOCKED_MODE,
       C.MACHINE,           -- machine name of the locking session
       C.STATUS,            -- session status
       C.SERVER,
       C.SID,
       C.SERIAL#,
       C.PROGRAM            -- client tool used by the locking session
  FROM ALL_OBJECTS A, V$LOCKED_OBJECT B, SYS.GV_$SESSION C
 WHERE A.OBJECT_ID = B.OBJECT_ID
   AND B.PROCESS = C.PROCESS
 ORDER BY 1, 2;
The results below show that only the session that issued the deduplication statement holds a lock on the table; no other user is locking it. Did that mean the statement really was still executing? With that question in mind, I started troubleshooting.
1 BJHYL tb_bj_banker_ETL 15 18 9000 913 BJHYL Administrator 4036:972 3 WORKGROUP\backdb ACTIVE   DEDICATED 913 3381  plsqldev.exe
2 BJHYL tb_bj_banker_ETL 15 18 9000 913 BJHYL Administrator 4036:972 3 WORKGROUP\backdb INACTIVE DEDICATED 649 41791 plsqldev.exe
3 BJHYL tb_bj_banker_ETL 15 18 9000 913 BJHYL Administrator 4036:972 3 WORKGROUP\backdb INACTIVE DEDICATED 817 27777 plsqldev.exe
4 BJHYL tb_bj_banker_ETL 15 18 9000 913 BJHYL Administrator 4036:972 3 WORKGROUP\backdb INACTIVE DEDICATED 841 1981  plsqldev.exe
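In hindsight, one way to check whether such a session is actually making progress (a sketch I am adding here, not a step from the original troubleshooting) is to watch its undo usage: for a delete that is still running, USED_UREC in V$TRANSACTION keeps growing between samples, while a genuinely stuck or rolling-back session behaves differently. The SID 913 below is taken from the lock listing above:
SQL> select s.sid, s.serial#, t.used_ublk, t.used_urec  -- undo blocks/records held by the transaction
       from v$transaction t, v$session s
      where t.addr = s.taddr
        and s.sid = 913;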
5. Deduplicating in batches
Since the direct, single-statement deduplication would not complete, I thought of removing the duplicates in batches and tried it:
First pass:
delete from tb_bj_banker_etl
 where id in (select id from tb_bj_banker_etl group by id having count(*) > 1)
   and rowid not in (select max(rowid) from tb_bj_banker_etl group by id having count(*) > 1)
   and rownum <= 100000;
commit;
Second pass:
delete from tb_bj_banker_etl
 where id in (select id from tb_bj_banker_etl group by id having count(*) > 1)
   and rowid not in (select max(rowid) from tb_bj_banker_etl group by id having count(*) > 1)
   and rownum <= 100000;
commit;
.....................
Eighth pass:
delete from tb_bj_banker_etl
 where id in (select id from tb_bj_banker_etl group by id having count(*) > 1)
   and rowid not in (select max(rowid) from tb_bj_banker_etl group by id having count(*) > 1);
commit;
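For completeness, the eight manual passes could equally be driven by a small PL/SQL loop; this is a sketch of the same batching idea, not part of the original fix:
begin
  loop
    delete from tb_bj_banker_etl
     where id in (select id from tb_bj_banker_etl
                   group by id having count(*) > 1)
       and rowid not in (select max(rowid) from tb_bj_banker_etl
                          group by id having count(*) > 1)
       and rownum <= 100000;       -- same 100,000-row batch size as above
    exit when sql%rowcount = 0;    -- stop once no duplicate rows remain
    commit;                        -- commit per batch to keep each transaction's undo small
  end loop;
  commit;
end;
/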
Result: by splitting the 800,000+ row deduplication into batches of 100,000 rows and deleting one batch at a time, the whole job finished in a little over 140 seconds in total, and the goal of removing the 800,000+ duplicates was achieved. Why the direct, single-statement delete hung like a zombie, however, remains to be tracked down and analyzed later.
Series links_20150523:
Blue growth notes - chasing DBA (1): traveling the road through Shandong
Blue growth notes - chasing DBA (2): Install! Install! Long-lost memories that awakened my new understanding of DBAs
Blue growth notes - chasing DBA (3): when importing and exporting data on antique hardware becomes a problem
Blue growth notes - chasing DBA (4): recalling the sorrows of youth while exploring Oracle installation (10g and 11g on Linux)
Blue growth notes - chasing DBA (5): neither technology nor business: the annoying application systems
Blue growth notes - chasing DBA (6): doing things and being human: small technology, great humanity
Blue growth notes - chasing DBA (7): basic commands, the foundation stones
Blue growth notes - chasing DBA (8): picking up SP reports again, recalling Oracle's STATSPACK experiments
Blue growth notes - chasing DBA (9): chasing the DBA dream, new plans, new departure
Blue growth notes - chasing DBA (10): flying-knife defense, familiarity rather than expertise: playing with the WebSphere middleware
Blue growth notes - chasing DBA (11): it's easy to go home and wake up
Blue growth notes - chasing DBA (12): seven days and seven gains of SQL
Blue growth notes - chasing DBA (13): coordinating with hardware vendors, six stories: what you see of servers, storage, and switches......
Blue growth notes - chasing DBA (14): an unforgettable "cloud" ending, and the start of a hadoop deployment
Blue growth notes - chasing DBA (15): who says FTP is "simple"?
Blue growth notes - chasing DBA (16): a DBA also drinks, and also gets rejected