When the agent starts a backup, the load average climbs to 150 or more and the server hangs. This can happen on recent CentOS releases, especially if the machine is already under a moderate I/O load.
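To confirm that a server is showing this symptom, the load average can be read from /proc/loadavg while a backup is running. The following is a minimal Python sketch, not part of the R1Soft tooling. The function names are hypothetical, and the threshold of 150 is simply the figure quoted above.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def is_overloaded(text, threshold=150.0):
    """True if the 1-minute load average is at or above the threshold.

    The default of 150 is an assumption taken from the symptom description.
    """
    one_minute = parse_loadavg(text)[0]
    return one_minute >= threshold


def read_loadavg(path="/proc/loadavg"):
    """Read the raw load average line from a Linux system."""
    with open(path) as handle:
        return handle.read()


# Example on a Linux host:
#     print(is_overloaded(read_loadavg()))
```

Running the check at intervals during a backup window shows whether the load rises only while the snapshot is active.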
The bug is in the kernel, not the CDP driver. The high load average and I/O wait seen in some environments can be reproduced without using R1Soft CDP. The issue appears to involve a complex interaction between I/O scheduling and some storage controllers (device drivers), and it surfaces once there is enough disk I/O on the system. The extra disk I/O needed to maintain the CDP snapshot can be enough to provoke the system into exposing the kernel bug.
Consider trying the 2.6.32 kernel. In the few cases so far where customers with this issue updated to 2.6.32, the issue went away.
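To see whether a server is already running 2.6.32 or later, the kernel release string (as printed by `uname -r`) can be compared against that version. This is a rough Python sketch with hypothetical function names. Note that distribution kernels often backport fixes to older version numbers, so the result is only an indicator.

```python
import platform
import re


def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a release string such as '2.6.18-164.el5'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_at_least(release, minimum=(2, 6, 32)):
    """True if the release is the given minimum version or newer."""
    return kernel_version_tuple(release) >= minimum


# Example on the running system:
#     print(is_at_least(platform.release()))
```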
The latest changes to the CFQ scheduler may also resolve the issue.
The latest kernel, which is mainly the one in RHEL 5.5 (beta), should contain the new CFQ changes:
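Because the issue involves I/O scheduling, it helps to know which scheduler a disk is actually using. On Linux this is exposed in /sys/block/<device>/queue/scheduler, where the active scheduler is shown in square brackets. The Python sketch below parses that file. The function names and the device name `sda` are assumptions for illustration.

```python
import re


def active_scheduler(text):
    """Return the scheduler marked with [brackets], or None if none is marked.

    Example content: 'noop anticipatory deadline [cfq]'
    """
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_scheduler(device="sda"):
    """Read the active scheduler for a block device (device name is an assumption)."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as handle:
        return active_scheduler(handle.read())


# Example on a Linux host:
#     print(read_scheduler("sda"))
```

If this reports `cfq` on an affected server, the CFQ changes mentioned above are the ones relevant to that disk.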
Note: R1Soft is working to re-design the CDP 3 Linux snapshot driver to manage its I/O differently, so it will be less prone to triggering this kernel scheduling bug.
|Page: Linux Agent — High Load Average, IO Wait Issues (Archived Knowledge Base 2.0) Labels: kernel, centos|
|Page: High Load within CentOS Kernel (Archived Knowledge Base 2.0) Labels: centos, kernel|
|Page: Linux Agent — CentOS 5.3 Upgrade Kills Agent (Archived Knowledge Base 2.0) Labels: kernel, buagent, centos|
|Page: Linux Agent — CentOS and RHE Kernel Panics (Archived Knowledge Base 2.0) Labels: kernel, centos|
|Page: Linux Agent — Installing on CentOS with Xen Enabled (Archived Knowledge Base 2.0) Labels: kernel, xen, centos, install|