A few years ago, I was managing a database system that had been upgraded from SQL Server 7.0 to SQL Server 2000.
The upgrade went well and all issues were resolved except for one – we were experiencing very bad performance when attempting to read data from the primary replica of a mirrored server.
This occurred even though the secondary replica had current copies of all the data and each transaction on the primary replica log was being queued up waiting for a disk write to occur.
I asked our storage vendor if there might be an issue with blocking caused by some other process on our UNIX cluster as they suggested this as a possibility since they did not have visibility into any activity within SQL Server itself.
Details About SOL server recovery:
In my troubleshooting work dealing with these sorts of problems, I often like to begin using the Performance Monitor (Perfmon) tool.
One of the reports I like to run right away is the “SQL: Buffer Manager – Page Life Expectancy (seconds)” report since it gives me an idea of what sort of overall performance impact might be occurring due to memory pressure and/or disk queuing.
In this particular case, page life expectancy was constantly well below 20 seconds as our system was being pounded with full table scans from client applications for which we had not initially planned as part of normal business activity.
This level of IO-intensive processing was causing a severe bottleneck on disk I/O that eventually caused a major storage subsystem failure.
The drive array responded poorly to these sorts of loads and, as a result, the entire storage system eventually failed.
In this case, we were fortunate that a complete set of system backups had been made on daily basis to an offsite location.
In addition, all of our documentation was stored in a redundant location as well.
Unfortunately, the SQL Server 2000 license key information was stored on the same server as the databases, and as such it was unavailable after the failure as well as all data that was within these databases at the time they were lost due to system/storage/SAN failure.
Fortunately for us – and you – I am going to explain how you can protect yourself from similar sorts of situations if your database servers are configured with shared disks and you use transactions and mirrored databases so your users experience an acceptable level of service even in the event of a shared storage component (e.g., disk, RAID controller, SAN) failure.
The reason why this solution works is that it is leveraging existing functionality within SQL Server to implement production-level recovery procedures for protecting your data in the event of a system/storage/SAN failure without requiring you to purchase additional software or hardware components just for the purpose of recovering from such failures.
I am not suggesting that these steps are 100% bulletproof, but if configured properly they will protect you from some very common types of failures that have been responsible for major downtime over the years.
When using mirroring with transaction logs on multiple disk volumes, there are four primary areas where things can go wrong:
1. Loss of power/UPS to SAN/disks
2. System failure (e.g., corrupted OS, hardware issues)
3. Shared disk failure (RAID array, etc.)
4. SAN controller failure
Most large database servers are configured with redundant power supplies and battery backup units to protect against loss of power which should provide sufficient time for a controlled shutdown that can be initiated by the system administrator via an automated process or manually if need be so this is usually not much of an issue unless you happen to have your server out in the middle of nowhere without utilities but even then there are plans available for these sorts of contingencies as well.
But let’s say you do not have any kind of redundancy in place to power down the system, for whatever reason.
In this case, you should still be able to initiate a controlled shutdown by using either the Windows “Shutdown” command or pressing and holding the power button until all activity on the server stops so it is not likely that data loss will occur in this scenario until there is another outage which allows the disks to spin up again without any kind of protection/containment.
The two main areas where I see people fail to account for system failures and storage component failures leading to data loss are:
1. Lack of redundancy in power supplies and/or UPS units, and
2. Not protecting your SQL Servers with RAID controller disk arrays as part of your standard server build process.