How to replace a failed ASM disk?
Oracle Automatic Storage Management (ASM) was introduced in Oracle 10g. ASM provides advanced storage management features such as disk I/O re-balancing, volume management and easy database file name management. It can also mirror data for high availability and redundancy in the event of a disk failure (mirroring is optional). With mirroring enabled, ASM guarantees that the data extents (table and index row data, etc.) on one disk are mirrored on one other disk (normal redundancy) or on two other disks (high redundancy).
A few times I have faced ASM disk failures when redundancy (mirroring) was enabled and none of them resulted in an issue for an end user. ASM automatically detects the disk failure and services Oracle SQL requests by retrieving information from the mirrored (other) disk. Such a failure is handled gracefully and entirely managed by Oracle. I am very impressed by the fault tolerance capability in ASM.
But the Oracle DBA must soon work with the system administrator to replace the failed disk. If the mirrored disk also fails before the replacement, SQL issued by end users will fail with errors, because both the primary and the mirrored disk have failed.
This post assumes that you are using ASM redundancy (Normal or High) and that you are not using ASMLib program. The commands and syntax could be different if you are using ASMLib.
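To confirm that a disk group really is mirrored, its redundancy type can be read from v$asm_diskgroup on the ASM instance. This is only a minimal sketch; the column list is my own choice and is not required by the steps that follow.

```sql
-- Run on the ASM instance.
-- TYPE shows the redundancy of each disk group: EXTERN, NORMAL or HIGH.
-- The steps in this post assume NORMAL or HIGH.
select group_number, name, type, state, total_mb, free_mb
from v$asm_diskgroup;
```

A disk group showing EXTERN has no ASM mirroring, so the graceful handling of a disk failure described above does not apply to it.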
Drop the failed disk
1) alter diskgroup #name# drop disk #disk name#;
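The disk group name and disk name used in the command above can be looked up in v$asm_disk and v$asm_diskgroup. The query below is a sketch under the assumption that the failed disk still appears in these views; the aliases diskgroup_name and disk_name are mine.

```sql
-- Run on the ASM instance.
-- Lists every disk that belongs to a disk group, with its status columns,
-- so that the failed disk can be identified by name and path.
select g.name as diskgroup_name,
       d.name as disk_name,
       d.path,
       d.failgroup,
       d.mount_status,
       d.mode_status,
       d.state
from v$asm_disk d
join v$asm_diskgroup g on g.group_number = d.group_number
order by g.name, d.name;
```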
Caution: Do NOT physically remove the failed disk from the server's disk enclosure YET. The above command returns immediately, but ASM then starts a lengthy re-balance operation in the background. The disk should be physically removed only after the header_status of the failed disk becomes FORMER, which happens once the re-balance operation has completed. You can monitor the progress of the re-balance operation by querying v$asm_operation.
select state,power,group_number,EST_MINUTES from v$asm_operation;
After a few minutes or hours the re-balance operation will complete (the above query returns no rows). Then verify that the header_status is now FORMER and ask the system administrator to physically remove the disk from the disk enclosure. The LED for the failed disk should be turned off, which indicates the physical location of the failed disk in the enclosure.
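The header_status check can be done with a query on v$asm_disk. This is a sketch that assumes the disk is still visible to the ASM instance under its device path; #disk path# is a placeholder in the same style as the commands above, and a dropped disk may no longer show its old disk name, which is why the lookup is by path.

```sql
-- Run on the ASM instance after v$asm_operation returns no rows.
-- Replace #disk path# with the device path of the failed disk.
-- The disk is ready to be pulled once HEADER_STATUS shows FORMER.
select group_number, disk_number, name, path, header_status, mount_status
from v$asm_disk
where path = '#disk path#';
```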
How to decrease the ASM re-balance operation time
While the above ASM re-balancing operation is in progress, the DBA can make it complete sooner by raising the 'ASM power', for example with the command below.
alter diskgroup #name# rebalance power 8;
The default power is 1 (i.e. ASM starts one re-balance background process, called an ARB process, to handle the re-balancing work). The above command dynamically starts 8 ARB processes (ARB0 to ARB7), which can dramatically decrease the time to re-balance. The maximum power limit in 11g R1 is 11 (up to 11 ARB processes can be started).
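To see whether the new power has taken effect, a wider version of the monitoring query can be used. This is a sketch only; the extra columns are ones I find useful, and the estimates ASM reports are approximate.

```sql
-- Run on the ASM instance while the re-balance is in progress.
-- POWER is the requested power and ACTUAL is the power currently in use.
-- SOFAR and EST_WORK show units moved so far and estimated total work,
-- EST_RATE and EST_MINUTES show the estimated rate and remaining minutes.
select group_number, operation, state, power, actual,
       sofar, est_work, est_rate, est_minutes
from v$asm_operation;
```

Running it a few times before and after changing the power gives a rough idea of how much the estimated remaining time has dropped.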
Conclusion
None of the above maintenance operations (disk drop, disk add) causes downtime for the end user, so they can be completed during normal business hours. The re-balance operation can slightly degrade performance while it runs, so consider increasing the power limit to let it complete quickly.