Current by David Wartell
on Sep 25, 2008 15:27.

| *Disk I/O Impact for Large Files* | {color:#ff0000}{*}Very bad for performance as whole files are backed up for even a small change{*}{color} |

h3. Checksums

Checksums can be used to compute Deltas.  The main advantage of the checksum method is that it provides granularity finer than whole files.  All checksum methods have some concept of a block, where a block is either fixed length or variable length.  In the fixed-length case a file is broken into fixed-length byte ranges, or blocks, of for example 4 KB.  A signature called a checksum is then computed over each 4 KB block.  The most popular checksums for this purpose are [MD4|http://en.wikipedia.org/wiki/MD4] and [MD5|http://en.wikipedia.org/wiki/MD5].  For more information on checksums see: [http://en.wikipedia.org/wiki/Cryptographic_hash_function]
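
The fixed-length case can be sketched in a few lines of Python.  This is only an illustration, not any vendor's implementation: the helper names {{block_checksums}} and {{changed_blocks}} are made up for this example, and MD5 with a 4 KB block is used simply because the text mentions both.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB, the example block size used in the text


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one MD5 hex digest per fixed-length block of data."""
    return [
        hashlib.md5(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Return indexes of blocks that differ or exist in only one version."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]


original = bytes(3 * BLOCK_SIZE)   # three zero-filled blocks
modified = bytearray(original)
modified[BLOCK_SIZE + 10] = 0xFF   # change one byte inside block 1

delta = changed_blocks(block_checksums(original), block_checksums(bytes(modified)))
print(delta)  # only block 1 is reported as changed
```

A one-byte change marks a single 4 KB block as changed instead of the whole file, which is the granularity advantage described above.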

The most popular algorithm for computing deltas in backup applications is the [rsync|http://en.wikipedia.org/wiki/Rsync#Algorithm] algorithm.  _Many_ commercial backup applications use the rsync algorithm to compute deltas for backup purposes.  One of the best known is [Evault|http://www.evault.com/].  Evault actually has a [patent on the process|http://www.google.com/patents?id=4Gl7AAAAEBAJ] of using variable-length block deltas for incremental backup purposes.  Some backup application vendors, like [Vembu|http://www.vembu.com], actually [brag about being based on the rsync algorithm|http://www.vembu.com/storegrid/rsync-incremental-backup.html].
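
One ingredient of the rsync algorithm is a weak "rolling" checksum that can be slid along a file one byte at a time without re-reading the whole window.  The sketch below follows the formulas published in the rsync technical report (two 16-bit sums); the function names and the 16-byte window are choices made for this example, and the strong checksum that rsync pairs with the weak one is left out.

```python
M = 1 << 16  # both running sums are kept modulo 2^16


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the two sums (a, b) of the weak checksum directly."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window forward by one byte in constant time."""
    a_new = (a - out_byte + in_byte) % M
    b_new = (b - n * out_byte + a_new) % M
    return a_new, b_new


data = bytes(range(256)) * 2
n = 16  # window length, chosen arbitrarily for the demonstration

a, b = weak_checksum(data[:n])
matches = True
for k in range(1, len(data) - n + 1):
    # byte k-1 leaves the window, byte k+n-1 enters it
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    matches = matches and (a, b) == weak_checksum(data[k:k + n])

print(matches)  # the rolled value agrees with a direct recomputation
```

The rolling update only makes the comparison cheaper per offset; the file contents still have to be read in full.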

There are two major challenges with using checksums to compute deltas.
# The entire file system tree must be indexed and walked.
# The backup application must read the full contents of every file to compute deltas.  This can take many hours or days on a large data set.
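
Both costs show up in even a minimal checksum-based scanner.  The sketch below is a hypothetical illustration ({{index_tree}} and {{changed_files}} are invented names): it has to visit every directory and read every byte of every file, even though only one file actually changed.

```python
import hashlib
import os
import tempfile


def index_tree(root: str, block_size: int = 4096) -> dict[str, list[str]]:
    """Walk the whole tree and checksum every block of every file."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digests = []
            with open(path, "rb") as handle:
                # every file is read in full, changed or not
                while chunk := handle.read(block_size):
                    digests.append(hashlib.md5(chunk).hexdigest())
            index[os.path.relpath(path, root)] = digests
    return index


def changed_files(previous: dict, current: dict) -> list[str]:
    """Return files that are new or whose block checksums differ."""
    return sorted(path for path in current if previous.get(path) != current[path])


with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(b"unchanged")
    before = index_tree(root)

    with open(os.path.join(root, "b.txt"), "wb") as handle:
        handle.write(b"modified")
    after = index_tree(root)

print(changed_files(before, after))  # only b.txt changed, yet both files were read twice
```

The work done by {{index_tree}} grows with the total size of the data set, not with the amount of change.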

| *Requires Time Consuming Walk of Entire File System Tree to Compute Deltas* \\ | {color:#ff0000}{*}Yes{*}{color} |
| *Delta Granularity* \\ | {color:#66cc66}{*}Block{*}{color} |
| *Accuracy in Identifying Changes* \\ | {color:#66cc66}{*}Perfect unless optimized by checking file attributes{*}{color}\\ |
| *Disk I/O Impact for Small Files* \\ | {color:#ff0000}{*}Must Read All Files to Compute Deltas{*}{color}\\ |
| *Disk I/O Impact for Large Files* | {color:#ff0000}{*}Must Read All Files to Compute Deltas{*}{color} |

h3. near-Continuous Delta Method (CDP)

The most efficient method for computing Deltas is the near-Continuous or CDP method.  R1Soft happens to be the only example of a near-Continuous Delta method for both Windows and Linux platforms.  The near-Continuous method works by using a Disk Volume Device Driver.  The device driver is situated between the file system (e.g. NTFS) and the Disk Volume (e.g. Logical Disk Volume 1).

By locating a device driver between the file system and the raw Disk Volume, the application is able to identify changed Disk Blocks in real time without any measurable performance impact.  It's really quite a simple concept.  In Windows this kind of Device Driver is called an Upper Volume Filter Driver.  R1Soft's Linux CDP implementation also uses a device driver.  Linux does not have an official filter driver API, though the form and function of the Linux driver are very similar to those of the Windows CDP driver.

_"Why spend hours reading data from the Disk just to compute Deltas when you can watch them happen for free?"_, says David Wartell, R1Soft Founder.

With the near-Continuous method of Delta computation a fixed-length block size is used that in practice usually corresponds to the file system block size.  Typically this fixed block size is 4 KB, but it can vary in different environments or implementations.  As writes to the Disk are _observed_, the number of each changed block is recorded in a specialized in-memory data structure.
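
The bookkeeping itself is simple arithmetic.  As a user-space illustration only (a real implementation lives in a kernel driver, and {{blocks_touched}} is a name invented here), each observed write can be mapped from its byte offset and length to the block numbers it touches.  A plain Python set stands in for the specialized in-memory structure.

```python
BLOCK_SIZE = 4096  # typical value from the text; it can vary


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the block numbers covered by a write of length bytes at offset."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


changed: set[int] = set()

# three observed writes, given as (byte offset, byte length)
for offset, length in [(0, 100), (4090, 10), (8192, 4096)]:
    changed.update(blocks_touched(offset, length))

print(sorted(changed))  # [0, 1, 2]
```

The second write is only 10 bytes long but straddles a block boundary, so it marks both block 0 and block 1 as changed.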

R1Soft Linux Agents version 1.0 employ a [bitmap|http://en.wikipedia.org/wiki/Bitmap] for this purpose, where a region in memory uses 1 [bit|http://en.wikipedia.org/wiki/Bit] to describe the state of each disk block.  Bitmaps are also commonly used in image file formats.  With a 4 KB block size there are 26,214,400 Disk Blocks per 100 [GB|http://en.wikipedia.org/wiki/Gigabyte] of Disk Volume size.  That corresponds to 26,214,400 bits or 3,276,800 bytes (3.125 [MB|http://en.wikipedia.org/wiki/Megabyte]) of memory to track _all_ Deltas made to 100 GB of raw Disk capacity.
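
The figures above can be reproduced with a small bitmap class.  This is a sketch of the general technique, not R1Soft's code: the class name {{BlockBitmap}} and its methods are invented for this example, and 100 GB is taken to mean 100 x 1024^3 bytes, which is the reading that yields 26,214,400 blocks.

```python
class BlockBitmap:
    """One bit per disk block: 1 means the block changed since the last sync."""

    def __init__(self, volume_bytes: int, block_size: int = 4096):
        self.block_count = -(-volume_bytes // block_size)   # ceiling division
        self.bits = bytearray(-(-self.block_count // 8))    # 8 blocks per byte

    def mark(self, block: int) -> None:
        if not 0 <= block < self.block_count:
            raise IndexError("block number outside the volume")
        self.bits[block >> 3] |= 1 << (block & 7)

    def is_changed(self, block: int) -> bool:
        return bool(self.bits[block >> 3] & (1 << (block & 7)))

    def changed_blocks(self):
        """Yield the numbers of all marked blocks in ascending order."""
        for byte_index, byte in enumerate(self.bits):
            if byte:  # skip bytes with no bits set
                for bit in range(8):
                    if byte & (1 << bit):
                        yield byte_index * 8 + bit


bitmap = BlockBitmap(100 * 1024**3)      # 100 GB volume, 4 KB blocks
print(bitmap.block_count)                # 26214400 blocks
print(len(bitmap.bits))                  # 3276800 bytes
print(len(bitmap.bits) / 1024**2)        # 3.125 MB

bitmap.mark(5)
bitmap.mark(26_214_399)
print(list(bitmap.changed_blocks()))     # [5, 26214399]
```

Marking a block is a single OR on one byte, and the memory cost is fixed by the volume size regardless of how many blocks change.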

R1Soft 2.0 Windows Agents and later use a new proprietary data structure, developed by R1Soft, for tracking deltas.  This new data structure is based on a [Tree|http://en.wikipedia.org/wiki/Tree_data_structure] so that in the average case only 200 or 300 [KB|http://en.wikipedia.org/wiki/Kilobyte] of memory is used to track _all_ Deltas per 100 GB of raw Disk capacity.  R1Soft is making this new, more efficient data structure available to its Linux CDP technology with the release of Continuous Data Protection Server 3.0.
!r1-cdp-in-thestack.png|width=32,height=32!
 

h4. near-Continuous (CDP) Change Tracking Overhead

+Changes Tracked in Memory+
* R1Soft CDP 2.0 Change Log - 3 MB of memory used per 100 GB of raw Disk Volume capacity where CDP is enabled
* R1Soft CDP 3.0 - 200 to 300 KB on average per 100 GB of raw Disk Volume capacity

+Disk I/O Overhead Caused by CDP Change Tracking+
* So small it is not measurable
* No delay to I/O
* The Windows and Linux kernels do very little extra work (a few dozen extra assembly language operations)

h4. Does near-Continuous (CDP) Deal with a Server Rebooting? (deltas are tracked in memory)

+Yes, a Reboot or Crash Automatically Triggers an Integrity Check to Re-Sync CDP+
* A secure random key is kept in Windows and Linux kernel memory.
* A copy of this secure key is stored in the data repository with each synchronization.
* At the start of each synchronization operation the key in the data repository is compared to the key in kernel memory.  If the keys differ then there was a reboot or crash of the server and an integrity check is required to re-sync CDP.
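
The key comparison described in the list above can be modeled in user space as follows.  Everything here is hypothetical: the {{Agent}} and {{Repository}} classes, the 32-byte key length, and the choice to treat a first synchronization with no stored key like a mismatch are assumptions made for the sketch, not details taken from the product.

```python
import hmac
import secrets


class Agent:
    """Stands in for kernel-side state; the key exists only in memory."""

    def __init__(self):
        # a fresh random key is generated every time the "kernel" starts
        self.session_key = secrets.token_bytes(32)


class Repository:
    """Stands in for the data repository that keeps a copy of the key."""

    def __init__(self):
        self.stored_key = None


def needs_integrity_check(agent: Agent, repo: Repository) -> bool:
    """True when the in-memory key does not match the stored copy."""
    if repo.stored_key is None:  # assumption: no stored key means no trusted state
        return True
    return not hmac.compare_digest(agent.session_key, repo.stored_key)


def synchronize(agent: Agent, repo: Repository) -> str:
    mode = "integrity check" if needs_integrity_check(agent, repo) else "near-continuous"
    repo.stored_key = agent.session_key  # store a copy with each synchronization
    return mode


repo = Repository()
agent = Agent()
first = synchronize(agent, repo)    # no stored key yet
second = synchronize(agent, repo)   # keys match, change tracking is trusted
agent = Agent()                     # simulated reboot: in-memory key is regenerated
third = synchronize(agent, repo)    # keys differ, so a re-sync is required

print(first, second, third, sep=" | ")
```

Because the key is never written to the server's own disk, it cannot survive a reboot or crash, so a match shows that the in-memory change tracking has been running without interruption since the last synchronization.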
+Re-Syncing CDP+
* Requires a full block scan of the Disk Volume and fails safe to the checksum method of computing Deltas, since the change tracking data structure is lost on reboot.
* Compares the MD5 checksum of each used Disk Block to the checksum in the last completed recovery point synchronization.
* Only deltas are sent over the network and stored.
* After the Re-Sync (Integrity Check), CDP is back in sync and the near-Continuous Delta method is used.
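
A re-sync of this kind can be approximated as follows.  The sketch makes simplifying assumptions: the volume is an in-memory byte string, every block is scanned rather than only the used ones, and the names {{checksum_blocks}} and {{resync}} are invented for illustration.

```python
import hashlib

BLOCK_SIZE = 4096


def checksum_blocks(volume: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """MD5 digest of every fixed-length block of the volume."""
    return [
        hashlib.md5(volume[i:i + block_size]).digest()
        for i in range(0, len(volume), block_size)
    ]


def resync(volume: bytes, stored: list[bytes], block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Full scan: return only the blocks whose checksum differs from the stored one."""
    deltas = {}
    for number, digest in enumerate(checksum_blocks(volume, block_size)):
        if number >= len(stored) or stored[number] != digest:
            start = number * block_size
            deltas[number] = volume[start:start + block_size]
    return deltas


last_recovery_point = bytes(4 * BLOCK_SIZE)        # four zero-filled blocks
stored = checksum_blocks(last_recovery_point)      # checksums kept from the last sync

current = bytearray(last_recovery_point)
current[2 * BLOCK_SIZE] = 0x01                     # block 2 changed while tracking was lost

deltas = resync(bytes(current), stored)
print(sorted(deltas))  # only block 2 needs to be sent and stored
```

The scan reads the whole volume once, but the amount of data sent and stored is still limited to the blocks that actually differ.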
 

h4. Overview of near-Continuous (CDP) Delta Backup Method

| *Requires Time Consuming Walk of Entire File System Tree to Compute Deltas* \\ | {color:#66cc66}{*}No{*}{color}\\ |
| *Delta Granularity* \\ | {color:#66cc66}{*}Block{*}{color} |
| *Accuracy in Identifying Changes* \\ | {color:#66cc66}{*}Perfect{*}{color} \\ |
| *Disk I/O Impact for Small Files* \\ | {color:#66cc66}{*}Only Deltas when CDP in sync{*}{color}\\ |
| *Disk I/O Impact for Large Files* | {color:#66cc66}{*}Only Deltas when CDP in sync{*}{color} |
