How to repair ASM disk header

What is ASM disk header corruption?

ASM disk header corruption is the loss of, or damage to, the disk header, which contains the metadata essential for the operation and availability of an Oracle ASM disk group. Imagine that a disk is corrupted in an ASM diskgroup that holds critical business data, and the diskgroup is configured with "EXTERNAL" redundancy, meaning that if one disk is inaccessible the whole diskgroup becomes unavailable. Now what? Is your only option to recreate the diskgroup and restore it from backup, or is there a faster way to recover the data? The first thing to find out is whether the data on the ASM disk is completely lost or only the header is corrupted.

Hopefully for you, this is just a header corruption, because in that case there is a way out! The steps below show how to recover from ASM header corruption.

Most of the data in the ASM disk header is of interest to that disk only. However, some information in the ASM disk header is relevant to the whole disk group, and some is even relevant to the whole cluster. That is why it is very important that Oracle keeps a backup of the header on the same disk.


In this example, both the DATA and FLASH diskgroups failed to mount after a server reboot, and the production database became unavailable. When the ASM header status of a disk that should belong to a diskgroup shows PROVISIONED (normally, PROVISIONED means the disk is not part of a disk group and may be added to one with the ALTER DISKGROUP statement), follow the steps below to resolve the issue:

Step 1: Check the header status of the ASM disks:

SQL> connect / as sysasm

SQL>  select GROUP_NUMBER, HEADER_STATUS, PATH from v$asm_disk where GROUP_NUMBER in (2,4);

           2    MEMBER       ORCL:DATA_1

           2    MEMBER       ORCL:DATA_2

           2    PROVISIONED       ORCL:DATA_3

           4    MEMBER       ORCL:FLASH_1

           4    MEMBER       ORCL:FLASH_2

           4    PROVISIONED       ORCL:FLASH_3
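If the query output has been saved to a file, the affected disks can be picked out mechanically. The following is only a sketch: the helper name list_provisioned is hypothetical, and it assumes whitespace-separated rows in the order GROUP_NUMBER, HEADER_STATUS, PATH, as in the listing above.

```shell
# Hypothetical helper: print the PATH of every disk whose HEADER_STATUS
# is PROVISIONED. Reads rows of "GROUP_NUMBER HEADER_STATUS PATH" on stdin;
# blank lines are ignored because they have no second field.
list_provisioned() {
    awk '$2 == "PROVISIONED" { print $3 }'
}

# Example usage, assuming the query output was saved to asm_disks.txt:
#   list_provisioned < asm_disks.txt
```

Every disk printed by this filter is a candidate for the backup and repair steps that follow.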

Now shut down the CRS stack (including the ASM instance) on all nodes:

# crsctl stop crs -f


Step 2: As a precaution, take a 10 MB dd backup of each disk that has the issue. Write each backup to its own file so that one does not overwrite another.

# dd if=/dev/oracleasm/disks/DATA_3 of=/tmp/DATA_3.dd bs=1M count=10

# dd if=/dev/oracleasm/disks/FLASH_3 of=/tmp/FLASH_3.dd bs=1M count=10
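When several disks are affected, a small helper can derive the backup file name from the device name, so that no backup overwrites another. This is a minimal sketch: the function name backup_disk_head is hypothetical, and the block size is spelled out in bytes (1048576, i.e. 1 MiB) rather than as 1M.

```shell
# Hypothetical helper: copy the first 10 MiB of a disk into its own backup file.
# $1 = source device or file, $2 = directory that receives the backup.
backup_disk_head() {
    src=$1
    dest_dir=$2
    # Name the backup after the source so DATA_3 and FLASH_3 never collide.
    dest="$dest_dir/$(basename "$src").dd"
    # 10 blocks of 1 MiB; print the backup path only if dd succeeded.
    dd if="$src" of="$dest" bs=1048576 count=10 2>/dev/null && printf '%s\n' "$dest"
}

# Example usage (as root, with the CRS stack stopped):
#   for d in /dev/oracleasm/disks/DATA_3 /dev/oracleasm/disks/FLASH_3; do
#       backup_disk_head "$d" /tmp
#   done
```

The helper prints the path of each backup it writes, which makes it easy to confirm that every affected disk was covered before running any repair.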

Step 3: Repair the device header in the following way.

Repeat the following example procedure for all affected devices.

Check the AU (Allocation Unit) size of the disk as follows; it is needed for the “aus” option of the kfed command:

# kfed read /dev/oracleasm/disks/FLASH_3 | grep ausize ## You will get output like below

kfdhdb.ausize:               1048576 ; 0x0bc: 0x00100000
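To avoid retyping the value, the AU size can be parsed out of the kfed output. The sketch below assumes the output line has the format shown above; the function name extract_ausize is hypothetical.

```shell
# Hypothetical helper: print the AU size found in `kfed read` output on stdin.
# Matches the line whose first field is "kfdhdb.ausize:" and prints the
# decimal value in the second field. Prints nothing if no such line exists.
extract_ausize() {
    awk '$1 == "kfdhdb.ausize:" { print $2; exit }'
}

# Example usage (as root):
#   aus=$(kfed read /dev/oracleasm/disks/FLASH_3 | extract_ausize)
#   echo "AU size: $aus"
```

If the function prints nothing, the ausize line was not present in the output, and the AU size should be confirmed another way before attempting any repair.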

Step 4: Now execute the repair command:

# kfed repair /dev/oracleasm/disks/FLASH_3 aus=1048576 ## aus must match the value returned by the grep command above

The above command needs to be executed for every disk that has the issue, for example:

# kfed repair /dev/oracleasm/disks/DATA_3 aus=1048576

Step 5: Once the repair command has been executed on every affected disk, start the ASM instance and then mount the DATA and FLASH diskgroups.

SQL> connect / as sysasm

SQL>  select GROUP_NUMBER, HEADER_STATUS, PATH from v$asm_disk where GROUP_NUMBER in (2,4);

Validate the HEADER_STATUS column; it should now be MEMBER for every disk:

           2    MEMBER       ORCL:DATA_1

           2    MEMBER       ORCL:DATA_2

           2    MEMBER       ORCL:DATA_3

           4    MEMBER       ORCL:FLASH_1

           4    MEMBER       ORCL:FLASH_2

           4    MEMBER       ORCL:FLASH_3

SQL> alter diskgroup DATA mount;

SQL> alter diskgroup FLASH mount;


Step 6: Shut down the CRS stack cleanly as root:

# crsctl stop crs [-f]

Step 7: Once CRS is shut down cleanly on all nodes, start CRS serially on all cluster nodes as root:

# crsctl start crs

Note: The above recovery procedure only works for the symptom described here (0xaa55 at the 510th location of the device header). Many other causes can produce a PROVISIONED status, and for those the procedure above may not work; in such cases, open a case with Oracle Support.
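One way to look at the bytes mentioned in the note is to dump them with dd and od. This is a sketch under an assumption: it takes "0xaa55 on 510th location" to mean a little-endian 16-bit value at byte offset 510, which would appear on disk as the two bytes 55 aa. The function name sig_at_510 is hypothetical.

```shell
# Hypothetical helper: print the two bytes at byte offset 510 of a device
# or file as four lowercase hex digits, in on-disk order.
sig_at_510() {
    # Read 2 single-byte blocks after skipping 510, then hex-dump them
    # without the offset column and strip the whitespace od adds.
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Example usage (as root):
#   sig_at_510 /dev/oracleasm/disks/DATA_3
# Under the assumption above, the symptom would show up as "55aa".
```

This only reads from the device, so it can be run before deciding whether the repair procedure applies.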

Conclusion:

Oracle provides a very powerful “kfed” utility to read and write the ASM disk header directly. This utility helps to identify and fix ASM disk header corruption quickly. Without it, you might end up completely rebuilding the ASM disks and restoring the database after an ASM disk header corruption.
