Jul 24, 2014

Block corruption & Correction: a case study

Logical Block corruption & Correction: a case study

In one of a production database, I have received following errors in UI ( front-end) and back-end as well though mail. I started investigation though logs and conclude about logical corruption with few requested logs. Please go through my analysis.

1) Error received Front End :

Timestamp: 7/22/2014 4:21:34 PM
Message: OraDataServiceProvider.AddIssueDetails Message : ORA-01476: divisor is equal to zero
ORA-06512: at "SALES.F_UPDATEINVENOTARY", line 517
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
ORA-06512: at "SALES.P_ADDISSUE", line 334
ORA-06512: at line 1
----------------------------------------
Timestamp: 7/22/2014 4:21:34 PM
Message: OraDataServiceProvider.AddIssueDetails StackTrace :    at Oracle.DataAccess.Client.OracleException.HandleErrorHelper(Int32 errCode, OracleConnection conn, IntPtr opsErrCtx, 
OpoSqlValCtx* pOpoSqlValCtx, Object src, String procedure, Boolean bCheck)
   at Oracle.DataAccess.Client.OracleException.HandleError(Int32 errCode, OracleConnection conn, String procedure, 
IntPtr opsErrCtx, OpoSqlValCtx* pOpoSqlValCtx, Object src, Boolean bCheck)
   at Oracle.DataAccess.Client.OracleCommand.ExecuteNonQuery()
   at eHIS.OracleDataAccessHelper.ORACLEHelper.ExecuteNonQuery(String connectionString, CommandType cmdType, String 
sqlCommandText, OracleParameter[] parameterArray)
   at eHIS.SALES.DataAccess.OraDataServiceProvider.AddIssueDetails(String Issue, String& IssueCode)

2)Error received back End : ( in alert log)

2.1) from alert Log:

Tue Jul 22 17:05:49 2014
Errors in file /u01/app/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_11731176.trc  (incident=82147):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

2.2) from related .trc file:

DDE: Problem Key 'ORA 600 [4511]' was flood controlled (0x6) (incident: 82875)
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
DDE: Problem Key 'ORA 600 [4511]' was flood controlled (0x6) (incident: 82876)
*** 2014-07-22 16:44:40.284
*** CLIENT ID:() 2014-07-22 16:44:40.284

ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []

2.3) from related incident .trc file:

Dump continued from file: /u01/app/oracle/diag/rdbms/prod/PROD2/trace/PROD2_ora_13893936.trc
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 95001 (ORA 600 [4511]) ========
*** 2014-07-22 16:22:26.385
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=8ax21gq9d3rb1) -----
UPDATE BOOKSTORE I SET I.QOHISTORY = (SELECT SUM(QTY) FROM QOHISTORY WHERE BOOKCODE = :B2 AND STORECODE = :B1 ), 

I.HOLDQOHISTORY=I.HOLDQOHISTORY+ABS(:B3 ) WHERE I.BOOKCODE = :B2 AND I.STORECODE = :B1 
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
7000003c7a76dd8       241  function SALES.F_UPDATEINVENOTARY
7000004b1c7e110       334  procedure SALES.P_ADDISSUE
7000004aecaf568         1  anonymous block


3) Issue Description:

From the above errors I investigated related procedure and function. From the error, it seems there is a transaction issue with "BOOKSTORE" table. Same issue also logged by support team. But ORA-00600 is related to so many bugs. But this error which is clearly described in related incident .trc file that no update is happening in "BOOKSTORE" table. So initially it is clear. But we need to investigate more and confirm about the issue.

Note: From all all alert logs ( both node) and incident file, I came to know two table corrupted logically. BOOKSTORE and PURCHASEITEMS.

4) Confirmation investigations:

Here are some confirmation investigations:

4.1) Analyze the structure for validation:

e.g.,
While running "analyze table SALES.PURCHASEITEMS validate structure online;", we are getting below error:

SQL> analyze table SALES.PURCHASEITEMS validate structure online;
analyze table SALES.PURCHASEITEMS validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file

Then I started tracing this with following way:

Analyze the table and upload tracefile generated if analyze fails, trace name should end in _ANALYZE.trc.

SQL> alter session set tracefile_identifier='ANALYZE';
SQL> analyze table SALES.PURCHASEITEMS validate structure online;

Here are some contents from generated .trc file:

*** 2014-07-17 21:33:09.884
*** SESSION ID:(408.36503) 2014-07-17 21:33:09.884
*** CLIENT ID:() 2014-07-17 21:33:09.884
*** SERVICE NAME:(SYS$USERS) 2014-07-17 21:33:09.884
*** MODULE NAME:(sqlplus@ehdb2 (TNS V1-V3)) 2014-07-17 21:33:09.884
*** ACTION NAME:() 2014-07-17 21:33:09.884

Block Checking: DBA = 185398234, Block Type = KTB-managed data block
data header at 0x70000017729c0ac
kdbchk: row locked by non-existent transaction
        table=0   slot=19
        lockid=4   ktbbhitc=5
Block header dump:  0x0b0cf3da
 Object id on Block? Y
 seg/obj: 0x122c8  csc: 0x0f.e2b866de  itc: 5  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0xb0cf300 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x001d.011.000237cd  0x01001e9e.4eb3.2a  C---    0  scn 0x000e.fa71110c
0x02   0x0177.008.000004cd  0x0100022d.0092.11  C---    0  scn 0x000e.5f6767bd
0x03   0x0092.00b.0003c7b9  0x00c0d3e4.7168.0b  C---    0  scn 0x000e.aebfb808
0x04   0x000d.01f.0006477b  0x0107008d.c219.25  C---    0  scn 0x000e.d03f9c6f
0x05   0x0017.001.00036313  0x01017ea0.67db.36  C---    0  scn 0x000e.fa70d161
bdba: 0x0b0cf3da
data_block_dump,data header at 0x70000017729c0ac
===============
tsiz: 0x1f50
hsiz: 0x8c
pbl: 0x70000017729c0ac
     76543210
flag=--------
ntab=1
nrow=61
frre=-1
fsbo=0x8c
fseo=0x952
avsp=0x909
tosp=0x924
...........
..........

But this may not understandable directly unless we do readable form. If you have Oracle support ID, then you can raise SR and upload this trace file. If without error stucture validated, probably there is other cause. Then let us proceed

some other tests.

4.2) Using RMAN method:

check the files used by these two tables PURCHASEITEMS and BOOKSTORE with RMAN to see how many blocks are affected

$ rman target / nocatalog

b) RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup check logical validate datafile x,y,z;
}

/** where x,y,z should be replaced by the file#

Output:

$ rman target / nocatalog

RMAN>

run { 
allocate channel d1 type disk; 
allocate channel d2 type disk; 
allocate channel d3 type disk; 
allocate channel d4 type disk; 
backup check logical validate datafile 44,45,46; 
} 2> 3> 4> 5> 6> 7> 

allocated channel: d1
channel d1: SID=39 instance=PROD1 device type=DISK

allocated channel: d2
channel d2: SID=424 instance=PROD1 device type=DISK

allocated channel: d3
channel d3: SID=629 instance=PROD1 device type=DISK

allocated channel: d4
channel d4: SID=1358 instance=PROD1 device type=DISK

Starting backup at 22-JUL-14
channel d1: starting full datafile backup set
channel d1: specifying datafile(s) in backup set
input datafile file number=00044 name=+DATA/prod/datafile/sales01.dbf
channel d2: starting full datafile backup set
channel d2: specifying datafile(s) in backup set
input datafile file number=00045 name=+DATA/prod/datafile/sales_index01.dbf
channel d3: starting full datafile backup set
channel d3: specifying datafile(s) in backup set
input datafile file number=00046 name=+DATA/prod/datafile/sales_indx_01.dbf
channel d3: backup set complete, elapsed time: 00:00:16
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
46   OK     0              309          18640           24375064492
  File Name: +DATA/prod/datafile/sales_indx_01.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0               
  Index      0              17782           
  Other      0              549             

channel d2: backup set complete, elapsed time: 00:01:19
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
45   OK     0              37946        664640          88021468605
  File Name: +DATA/prod/datafile/sales_index01.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0               
  Index      0              616016          
  Other      0              10678          

channel d1: backup set complete, elapsed time: 00:01:32
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
44   FAILED 0              48213        963536          88021469422
  File Name: +DATA/prod/datafile/sales01.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       7              907742          
  Index      0              1               
  Other      0              7580          

validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_21823514.trc for details
Finished backup at 22-JUL-14
released channel: d1
released channel: d2
released channel: d3
released channel: d4

RMAN> 

RMAN> quit


Recovery Manager complete.


Here, RMAN reported no. of blocks corrupted with this statement, "Block Type Blocks Failing Blocks Processed". i.e., here we have some confirmation about block corruption. Next we will do some acid tests for max clarity.

4.3) Using DBMS packages:

Pre-requisites:

a) create 'REPAIR_TABLE' table :
Examples: Building a Repair Table or Orphan Key Table
The ADMIN_TABLE procedure is used to create, purge, or drop a repair table or an orphan key table.
A repair table provides information about the corruptions that were found by the CHECK_OBJECT procedure and how these will be addressed if the FIX_CORRUPT_BLOCKS procedure is run. Further, it is used to drive the execution of the FIX_CORRUPT_BLOCKS procedure.

Example: Creating a Repair TableThe following example creates a repair table for the users tablespace.

SET SERVEROUTPUT ON
BEGIN
DBMS_REPAIR.ADMIN_TABLES (
     TABLE_NAME => 'REPAIR_TABLE',
     TABLE_TYPE => dbms_repair.repair_table,
     ACTION     => dbms_repair.create_action,
     TABLESPACE => 'USERS');
END;
/

b) Detecting Corruption :

The CHECK_OBJECT procedure checks the specified object, and populates the repair table with information about corruptions and repair directives. You can optionally specify a range, partition name, or subpartition name when you want to check a portion of an object.

Validation consists of checking all blocks in the object that have not previously been marked corrupt. For each block,  the transaction and data layer portions are checked for self consistency. During CHECK_OBJECT, if a block is encountered that has a corrupt buffer cache header, then that block is skipped.

The following is an example of executing the CHECK_OBJECT procedure for the SALES.PURCHASEITEMS table.

SET SERVEROUTPUT ON
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
     SCHEMA_NAME => 'SALES',
     OBJECT_NAME => 'PURCHASEITEMS',
     REPAIR_TABLE_NAME => 'REPAIR_TABLE',
     CORRUPT_COUNT =>  num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/

output:

number corrupt: 2

PL/SQL procedure successfully completed

set serveroutput on
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
REPAIR_TABLE_NAME => 'REPAIR_TABLE',
corrupt_count => num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/

output:
number corrupt: 5

PL/SQL procedure successfully completed

Here it is confirmed, no. of corrupted blocks.

Run the below view to find comoplete information:

SQL> SELECT OBJECT_NAME, BLOCK_ID, CORRUPT_TYPE, MARKED_CORRUPT,CORRUPT_DESCRIPTION, REPAIR_DESCRIPTION
FROM REPAIR_TABLE;
OR
select object_id,object_name,BLOCK_ID, CORRUPT_TYPE, CORRUPT_DESCRIPTION from REPAIR_TABLE;

see: corrupt_blocks_output.xls



5) Corrective actions:

5.1) I can recommend some simplest method which needs to perform in downtime: Create copy tables after disbling constraints and truncate the currupted tables and then re-insert them. Example:


Step_1: take the counts
Check counts:

select count(1) from SALES.PURCHASEITEMS -- 491609
select count(1) from SALES.copy_PURCHASEITEMS -- 491609
select count(1) from SALES.BOOKSTORE -- 168806
select count(1) from  SALES.copy_BOOKSTORE -- 168806

Step_2: Take logical backup -- when downtime starts, to avoid any other issues if entire schema size is less or take only required tables.

$ expdp directory=DATA_PUMP dumpfile=sales_22Jul14.dmp logfile=sales_22Jul14_exp.log schemas=SALES parallel=3 exclude=statistics cluster=NO

Step_3: Create copy table for two issued tables:

create table copy_PURCHASEITEMS as select * from SALES.PURCHASEITEMS;
alter table SALES.PURCHASEITEMS disable constraint FK_PURCHASEITEMS_PO;
truncate table SALES.PURCHASEITEMS;
insert into SALES.PURCHASEITEMS select * from SALES.copy_PURCHASEITEMS;
alter table SALES.PURCHASEITEMS enable constraint FK_PURCHASEITEMS_PO;

create table SALES.copy_BOOKSTORE as select * from SALES.BOOKSTORE ;
truncate table SALES.BOOKSTORE;
insert into SALES.BOOKSTORE select * from SALES.copy_BOOKSTORE;


Step_4: Analyze tables:

exec dbms_stats.gather_table_stats(ownname => 'SALES',tabname => 'PURCHASEITEMS',cascade => TRUE);
exec dbms_stats.gather_table_stats(ownname => 'SALES',tabname => 'BOOKSTORE',cascade => TRUE);


5.2) Using DBMS Packages

Fixing Corrupt Blocks

Use the FIX_CORRUPT_BLOCKS procedure to fix the corrupt blocks in specified objects based on information in the repair table that was generated by the CHECK_OBJECT procedure. Before changing a block, the block is checked to ensure that the block is still corrupt. Corrupt blocks are repaired by marking the block software corrupt. When a repair is performed, the associated row in the repair table is updated with a timestamp.

This example fixes the corrupt block in table SALES.PURCHASEITEMS that was reported by the CHECK_OBJECT procedure.

SET SERVEROUTPUT ON
DECLARE num_fix INT;
BEGIN
num_fix := 0;
DBMS_REPAIR.FIX_CORRUPT_BLOCKS (
     SCHEMA_NAME => 'SALES',
     OBJECT_NAME=> 'PURCHASEITEMS',
     OBJECT_TYPE => dbms_repair.table_object,
     REPAIR_TABLE_NAME => 'REPAIR_TABLE',
     FIX_COUNT=> num_fix);
DBMS_OUTPUT.PUT_LINE('num fix: ' || TO_CHAR(num_fix));
END;
/

Simple & strait workaround:

-- start DBMS API to skip corrupt block flag

BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.SKIP_FLAG);
END;
/

-- Create a repair table

> create table SALES.BOOKSTORE_REPAIR as select * from SALES.BOOKSTORE;

-- Check counts

select count(*) from SALES.BOOKSTORE;
select count(*) from SALES.BOOKSTORE_REPAIR;

-- If the count is close 

>truncate table SALES.BOOKSTORE;
>insert into SALES.BOOKSTORE select * from SALES.BOOKSTORE_REPAIR;

Note: In my experience, I found some junk data updated in table and anonyms row inserted in the table. Manually data corrected and anaonymous record deleted with a pl/sql programming. Here are few errors:
error-1:
insert into SALES.BOOKSTORE
select * from SALES.servicerequestdetails_repair ;

ORA-12899: value too large for column "SALES"."BOOKSTORE"."EXTRACHARGEFLAG" (actual: 83, maximum: 5)

error-2:
insert into SALES.BOOKSTORE
select * from SALES.BOOKSTORE_repair ;

ORA-12899: value too large for column "SALES"."BOOKSTORE"."CONSULTATIONDATE" (actual: 120, maximum: 7)

etc.

-- after correction, check the counts again.

>select count(*) from SALES.BOOKSTORE;
>select count(*) from SALES.BOOKSTORE_REPAIR;

-- Again set "no skip" flag

BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.NOSKIP_FLAG);
END;
/

-- Analyze the table


exec DBMS_STATS.gather_table_stats(ownname=> 'SALES',tabname=>'BOOKSTORE',estimate_percent => 100,cascade=>TRUE,degree=>1,granularity=>'AUTO',method_opt=>'FOR ALL INDEXED COLUMNS SIZE AUTO');

-- To validate database and backup sets




Start RMAN and connect to a target database.
Execute the VALIDATE command with the desired options.
For example, to validate all datafiles and control files (and the server parameter file if one is in use), execute the following command at the RMAN prompt:
RMAN> VALIDATE DATABASE;
e.g.,
RMAN> VALIDATE DATABASE;

Starting validate at 15-MAY-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=547 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00053 name=/oradata/datafiles/PROD/PAYROLL01.dbf
......
.....

-- Validate backupset:
Alternatively, you can validate a particular backup set by using the form of the command shown in the following example (sample output included).
RMAN> VALIDATE BACKUPSET 22;

Below query can be used to find any corrupted block is avialbe in database or not:

set head on; 
set pagesize 2000 
set linesize 250 
select * from v$database_block_corruption;
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file# 
, greatest(e.block_id, c.block#) corr_start_block# 
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block# 
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) 
- greatest(e.block_id, c.block#) + 1 blocks_corrupted 
, null description 
FROM dba_extents e, v$database_block_corruption c 
WHERE e.file_id = c.file# 
AND e.block_id <= c.block# + c.blocks - 1 
AND e.block_id + e.blocks - 1 >= c.block# 
UNION 
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file# 
, header_block corr_start_block# 
, header_block corr_end_block# 
, 1 blocks_corrupted 
, 'Segment Header' description 
FROM dba_segments s, v$database_block_corruption c 
WHERE s.header_file = c.file# 
AND s.header_block between c.block# and c.block# + c.blocks - 1 
UNION 
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file# 
, greatest(f.block_id, c.block#) corr_start_block# 
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block# 
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) 
- greatest(f.block_id, c.block#) + 1 blocks_corrupted 
, 'Free Block' description 
FROM dba_free_space f, v$database_block_corruption c 
WHERE f.file_id = c.file# 
AND f.block_id <= c.block# + c.blocks - 1 
AND f.block_id + f.blocks - 1 >= c.block# 
order by file#, corr_start_block#; 






Note: If you have "NOLOGING" block corruption, you can follow aboev method, but in alert log message will come. Below query can be used to find "NOLOGGING" block corruptions. 


select * from v$database_block_corruption 
where CORRUPTION_TYPE='NOLOGGING';




You can use below query since when this corruption occured:

select file#, block#, first_time, next_time
from   v$archived_log, v$database_block_corruption
where  CORRUPTION_CHANGE# between first_change# and next_change#
  and CORRUPTION_TYPE='NOLOGGING';


Fix: Keep your database in FORCE LOGGING mode.

sql> connect sys as sysdba
sql> alter database force logging;

If you are getting LOB segments as corrupted block, then first find which segment it refers. Use below query:
e.g., Assume, 'SYS_LOB0004142294C00005$$' came as corrupted segment.

select * from dba_lobs where segment_name='SYS_LOB0004142294C00005$$'

Thanks
Post comments, if you have some doubts...

2 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete

Translate >>