{page-title}\\

h3. Introduction

After reading this document you will be able to size up any backup software and have an expectation for how it will perform in your environment.  Most backup technology was created before the Internet boom, when the average business simply turned off the lights at the end of the day.  Consistent backups are simple when no one is using a computer system, and there is little need for high performance backup when you can take a server down for 12 hours to perform the nightly backup.

Does your business shut down access to all of its computing systems at 5 PM every day?  Strange as it may sound, that is the assumption most backup applications are built around.
\\

What backup technology do you use? 
* Did you know Legato Networker was first released in 1990 when the largest hard disk available was 270 MB?
 
* Did you know GNU Tar was written in 1986 when a gigantic hard disk was 70 MB?
 
* Did you know the author of [rsync|http://samba.anu.edu.au/rsync/], Andrew Tridgell, developed rsync for general purpose file transfer and mirroring over slow WAN links?  Tridgell explains in his PhD thesis that while the rsync algorithm would be highly useful for optimizing tape backup, the rsync application itself is not intended to be used as a backup application.
[Efficient Algorithms for Sorting and Synchronization (PhD thesis)|http://samba.org/~tridge/phd_thesis.pdf]

{panel}
h4. Contents

{children:depth=99}
{panel}



h2. Backup Methods


h3. Full Backup

A full backup copies all data from disk to backup medium every backup operation.

Full backup uses the most disk I/O, network, and storage resources because on _every_ backup operation _all_ data is read from disk, sent across the network, and stored on the backup medium.
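The following is a minimal sketch of the idea (the paths and destination are hypothetical, and real backup software writes to tape or a managed repository rather than simply copying files): every run re-reads and re-stores all data, no matter how little has changed.

{code:python}
import shutil
from pathlib import Path

def full_backup(source_dir, backup_dir):
    """Copy every file from source_dir to backup_dir, regardless of
    whether it changed since the last run."""
    source = Path(source_dir)
    for path in source.rglob("*"):
        if path.is_file():
            target = Path(backup_dir) / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)   # reads the whole file every time

# Hypothetical example: every nightly run re-reads and re-stores all data.
full_backup("/var/www", "/mnt/backups/full-2008-09-25")
{code}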

\\ !fullbackup-diskio.png|width=401,height=212!\\

h3. Incremental Backup

An incremental backup works by taking a periodic full backup, followed by one or more "incremental" backups that contain only the changes made to files since the last full or incremental backup run.  Typically an incremental backup is scheduled so that one full backup is performed weekly, for example every Monday, and a daily incremental backup is run on the other four working days.

h4. Incremental Backup Method

\\ !incremental-backup.png|width=32,height=32!
 
 

h4. Restore with Incremental Method

Restoring from an incremental backup involves multiple steps.  First the full backup is restored.  Then each incremental backup is restored in chronological order until the desired restore point is reached.
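As a rough illustration of the restore chain (the data below is hypothetical and each backup set is modeled as a simple dictionary), the full backup is applied first and then each incremental in order:

{code:python}
def restore_incremental(full_backup, incrementals, restore_point):
    """Each backup set is modeled as a dict of {filename: contents}."""
    state = dict(full_backup)                      # step 1: restore the full backup
    for backup in incrementals[:restore_point]:    # step 2: replay each incremental
        state.update(backup)                       # later changes overwrite earlier ones
    return state

full = {"a.txt": "v1", "b.txt": "v1"}
incrementals = [{"a.txt": "v2"}, {"b.txt": "v2"}, {"a.txt": "v3"}]

# Restoring to the third incremental requires the full backup plus all three sets.
print(restore_incremental(full, incrementals, 3))  # {'a.txt': 'v3', 'b.txt': 'v2'}
{code}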

!incremental-restore.png|width=32,height=32!
 
\\

h4. Incremental Backup Disk I/O and Storage Impact

With incremental backup there is a savings in network I/O and storage consumed compared to full backup.  Usually once a week a complete full backup is scheduled, and this has the same impact as the full backup method.  On the other days of the week an incremental backup is performed and there is some savings.  Surprisingly, there is often no savings in server Disk I/O.  The reason is that backup applications must compute some kind of check sums or deltas in order to accurately determine changed files.  In order to compute these deltas _all_ files and all their data must be read from disk, consuming precious Disk I/O resources that grow linearly with the size of the data set.

Some applications rely on the last modified date or a Windows-specific archive bit.  This allows _some_ savings on server disk I/O, as only files that have the archive bit set or a modified date newer than the last backup are read.  The biggest disadvantage with this approach is that large files always have their complete contents backed up as if the entire file has changed.
!incremental-backup-diskio.png|width=32,height=32!
 

h3. Differential Backup

Differential backup works by taking a Full Backup and then taking one or more "differential" backups, where each differential backup contains the differences between the current state and the original full backup.  The advantage of differential backup is that only the full backup and a single differential are required to restore the state of files.
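A rough sketch under the same simplified model (hypothetical data, each backup set modeled as a dictionary): because each differential is cumulative, a restore needs only the full backup plus one differential, at the cost of repeating changes in every differential set.

{code:python}
full = {"a.txt": "v1", "b.txt": "v1"}
differentials = [
    {"a.txt": "v2"},                    # day 1: everything changed since the full
    {"a.txt": "v2", "b.txt": "v2"},     # day 2: cumulative, repeats day 1's change
]

def restore_differential(full_backup, differential):
    state = dict(full_backup)
    state.update(differential)          # one differential is enough
    return state

print(restore_differential(full, differentials[1]))  # {'a.txt': 'v2', 'b.txt': 'v2'}
{code}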

h4. Differential Backup Method

\\ !differential-backup-method.png|width=32,height=32!

h4. Restore with Differential Method

\\ !differential-restore.png|width=32,height=32!
 

h4. Differential Backup Disk I/O and Storage Impact

The disadvantage of differential backup compared to incremental is that it requires more storage space, as many changes or differences are duplicated in each differential backup run in a cumulative fashion.  Typically a differential backup is scheduled so that one full backup is performed weekly, for example every Monday, and a daily differential backup is run on the other four working days.

!differential-backup-diskio.png|width=32,height=32!
 
 

h3. Virtual Full Backup

The Virtual Full Backup method was pioneered by R1Soft and overcomes limitations of the other known backup methods. 

The chief drawbacks to all of the other backup methods that Virtual Full Backups overcome are:
* The need for a periodic Full Backup, e.g. every week.  The weekly Full Backup that other methods require is extremely time consuming and intensive on critical server Disk I/O resources.
* Incremental Backup methods make restoring very complex as each incremental must be replayed to return files to a particular point in time.
* Incremental and Differential Backup methods provide no long term archiving or convenient rotation of old backups.  For example, you cannot simply point and click to delete an Incremental backup without breaking other backup sets and losing data.
* The other backup methods are based on tape and do not properly take advantage of disk-based storage.  For example, a delta or unchanged file is stored in multiple Incremental backup sets.  This makes sense when the storage medium is tape and the data needs to be duplicated on each tape.  Many disk-based backup products use Virtual Tape Library (VTL) software to make a file appear like a large tape.  If the storage medium is disk there is no need to store a delta or unchanged file more than once\!  Virtual Full Backups overcome this limitation.

h4. Initial Replica

The Virtual Full Method involves one initial Full backup called an Initial Replica.  It is called an Initial Replica because it is performed _only one time_ ever, as long as the storage medium is not replaced.  For example, if the Virtual Full Backups are stored on a NAS device at \\myNAS\backups then the Initial Replica is performed only once until a new storage location is used.  This is very different from Incremental or Differential backups where the full backup is performed periodically, for example weekly.

!virutal-full-backup-replica.png|width=32,height=32!
 



h4. Disk Safe: a Database for Block Deltas

The Virtual Full Backup method requires that a database be used to store block deltas.  R1Soft calls this a Disk Safe®.  While in theory an SQL-type database could be used, in practice they do not handle large amounts of binary block data efficiently, so a custom-tailored data store is used.  The important function of the block delta database is to map which versions of blocks are allocated and used by particular recovery points.  This facilitates the efficient retrieval of blocks belonging to a recovery point as well as the process of merging.  Merging involves deleting a recovery point along with the deltas specific to that recovery point that are not needed by the recovery points that remain.
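The sketch below is a simplified illustration of the idea only, not R1Soft's actual Disk Safe format: a small in-memory index maps (recovery point, block number) to a stored delta, so blocks for any recovery point can be retrieved and deltas shared between recovery points can be identified during merging.

{code:python}
class DeltaStore:
    """Simplified stand-in for a block delta database (illustrative only)."""

    def __init__(self):
        self.deltas = {}        # delta_id -> raw block bytes
        self.index = {}         # recovery_point -> {block_number: delta_id}

    def add_recovery_point(self, point, changed_blocks, previous=None):
        """Record only the blocks that changed; unchanged blocks keep
        pointing at the delta already stored for the previous point."""
        mapping = dict(self.index.get(previous, {}))
        for block_number, data in changed_blocks.items():
            delta_id = (point, block_number)
            self.deltas[delta_id] = data
            mapping[block_number] = delta_id
        self.index[point] = mapping

    def read_block(self, point, block_number):
        return self.deltas[self.index[point][block_number]]
{code}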
\\ !block-delta-database.png|width=32,height=32!
 
\\
\\

h4. Merging (Deleting) Recovery Points

In a Virtual Full Backup each recovery point is comprised of many deltas.  A delta represents the state of a disk volume block at a particular point in time, and each delta can be used by one or more recovery points.  Since the relationship between Deltas and Recovery Points is maintained by the Disk Safe it is possible to delete any unwanted Recovery Point.
# The process of deleting a recovery point is actually a merge between the unwanted recovery point and the closest remaining recovery point created after the unwanted recovery point was taken.
!recovery-point-merge-step1.png|width=32,height=32!
# The two recovery points are compared.  Recall that not all parts of the Disk are used all the time, and as files are created and deleted some portions of the Disk which were used become unused and available to store data.

As the two recovery points are compared two types of deltas can be discarded:

A) deltas associated with blocks that _are_ allocated in recovery point N (the unwanted recovery point) and _Not_ allocated in Recovery Point N+1.

B) deltas for blocks that are allocated in both recovery points but that changed from N to N+1, so that the corresponding deltas in N are superseded and can be discarded.

!recovery-point-merge-step2.png|width=32,height=32!
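Continuing the simplified Disk Safe sketch from above (again, not R1Soft's actual merge implementation), the same outcome can be expressed as discarding every delta that only the unwanted recovery point still references:

{code:python}
def merge_recovery_point(store, unwanted):
    """Merge (delete) an unwanted recovery point from a DeltaStore."""
    old_mapping = store.index.pop(unwanted)     # drop point N from the index
    still_referenced = {
        delta_id
        for mapping in store.index.values()
        for delta_id in mapping.values()
    }
    for delta_id in old_mapping.values():
        # In this simplified model, deltas become orphaned when a later point
        # overwrote the block (rule B above); a fuller model would also drop
        # deltas for blocks no longer allocated in N+1 (rule A above).
        if delta_id not in still_referenced:
            del store.deltas[delta_id]
{code}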

h4. Advanced Recovery Point Archiving Rules

Merging recovery points allows for archiving rules.  For example a backup policy or schedule can be defined to support the following example configurations:
* Synchronize Hourly and Keep only the last X Recovery Points
* Keep the Last X Hourly Recovery Points, Y Daily Recovery Points and Z Monthly Recovery Points
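As an illustration of the second example policy (the counts and helper below are hypothetical, not a documented R1Soft API), a retention rule reduces to deciding which recovery points to keep and merging away the rest:

{code:python}
def points_to_keep(points, hourly=24, daily=7, monthly=12):
    """points: list of recovery-point timestamps (datetime objects), newest first."""
    keep = set(points[:hourly])                                   # last X hourly points
    seen_days, seen_months = set(), set()
    for ts in points:
        if ts.date() not in seen_days and len(seen_days) < daily:
            seen_days.add(ts.date()); keep.add(ts)                # newest point of each day
        month = (ts.year, ts.month)
        if month not in seen_months and len(seen_months) < monthly:
            seen_months.add(month); keep.add(ts)                  # newest point of each month
    return keep

# Any recovery point not returned by points_to_keep() would be merged (deleted).
{code}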



h4. Synchronization Process

After the Initial Replica the process is very different from the other methods.  Every time a backup job is run a Synchronization is performed.  During the synchronization process a point-in-time snapshot of the Disk Volume is created and Deltas are computed based on the last completed synchronization.  The Deltas are block level and are usually computed at a level below the file system, very close to the raw disk structure.  These deltas represent the low level changes made to the disk volume since the last recovery point.

A Recovery Point appears like a Full backup to the user.  That is only in appearance, as each recovery point contains only the block level Deltas or changes made since the last Synchronization.  The method for computing deltas can vary by implementation.  R1Soft uses a near-Continuous Delta method described here *FIXME LINK TO CDP TECH DOC*.  Deltas could potentially be computed by comparing all disk volume blocks every synchronization; however, this would be undesirable due to the time needed to do the comparison and the heavy impact on disk I/O.
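A hedged sketch of one synchronization pass, reusing the simplified DeltaStore from above (the snapshot object and its read_block method are hypothetical stand-ins for the real snapshot and driver interfaces):

{code:python}
def synchronize(store, snapshot, changed_block_numbers, point, previous_point):
    """snapshot: object exposing read_block(n); changed_block_numbers: the block
    numbers the change-tracking driver recorded since the last synchronization."""
    deltas = {n: snapshot.read_block(n) for n in changed_block_numbers}
    store.add_recovery_point(point, deltas, previous=previous_point)
    # Only len(changed_block_numbers) blocks were read from the live disk,
    # yet `point` is usable as if it were a complete full backup.
{code}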

!virtual-full-backup-process.png|width=32,height=32!
 
 

h4. Restore with Virtual Full Backup Method

Restores are easy with Virtual Full Backups.  The Disk Safe makes each Virtual Full Backup appear as a complete Full backup even though it is comprised of Deltas taken across many or even all recovery points.  Each Recovery Point is then usable as a complete Volume or Disk image.  R1Soft has special software capable of reading raw Disk blocks into a usable file system like NTFS on Windows or ext3 and reiserfs on Linux, so that files can be browsed and restored seamlessly, exactly as they appeared on the live server at the point in time the synchronization was performed.

Behind the scenes, the work of reconstructing a recovery point looks like the image below, where Deltas are read from the Disk Safe to compose a complete Full backup.
\\ !virtual-full-backup-restore.png|width=32,height=32!
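In the same simplified model (again hypothetical, not R1Soft's actual code), materializing a recovery point just means resolving every block to the delta that last wrote it:

{code:python}
def materialize(store, point):
    """Return {block_number: data} for every block of the recovery point."""
    return {
        block_number: store.deltas[delta_id]
        for block_number, delta_id in store.index[point].items()
    }

# A file-system driver (e.g. for NTFS or ext3) can then interpret these raw
# blocks so individual files can be browsed and restored.
{code}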

h4. Virtual Full Backup Disk I/O and Storage Impact

The impact of Virtual Full Backups on performance is as small as possible since only the deltas between synchronizations are read from the live server Disk(s).  And unlike the other methods, each Delta is only ever stored one time on the backup storage medium.
\\ !virtual-full-backupio.png|width=32,height=32!
 
\\

h2. Methods for Computing Deltas in Backup Applications

While there are hundreds of different backup applications, all of them use one of several known methods for computing Deltas.

Deltas are simply defined as the data that has changed since the last backup run.  Defining them any further than that depends on how the backup application computes deltas.  A delta could be a raw disk block, a variable-length portion of a file, or even a complete file depending on the method.

h3. File Attributes

The simplest form of computing Deltas for backup purposes is by using file attributes.  All files have a last modified date that, in theory, records when they were last changed.  In Windows there is also the concept of an archive bit, which is set by the operating system when a file is created or modified and cleared by the backup application when the file is backed up; files whose archive bit is clear are skipped on the next backup operation.

The biggest disadvantage with this approach is that large files always have their complete contents backed up as if the entire file has changed.  This method also provides no guarantee that all changes are detected, as last modified dates and archive bits can easily be changed by any user or application to appear as if there have been no changes.  In Windows a user can simply set or clear the archive bit by right clicking on a file and changing its properties.  In Linux and Unix a user can set the last modified date to any time they desire using the [touch utility|http://linux.about.com/od/commands/l/blcmdl1_touch.htm].  Windows has several utilities available for overriding the last modified date and an [overview is available here|http://www.online-tech-tips.com/computer-tips/how-to-change-the-last-modified-date-creation-date-and-last-accessed-date-for-files-and-folders/].
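A minimal sketch of attribute-based change detection (the catalog of previously recorded times is hypothetical): a file is re-read in full whenever its modification time differs from what the last backup recorded, which is exactly why one changed byte in a huge file costs a full re-read, and why a forged timestamp hides a change entirely.

{code:python}
import os

def changed_since_last_backup(path, last_backup_mtimes):
    """last_backup_mtimes: {path: mtime recorded by the previous backup run}."""
    mtime = os.path.getmtime(path)
    # Trusts metadata that any user or application can reset with touch or the
    # file's Properties dialog, so changes are not guaranteed to be detected.
    return last_backup_mtimes.get(path) != mtime
{code}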
| *Requires Time Consuming Walk of Entire File System Tree to Compute Deltas* \\ | {color:#ff0000}{*}Yes{*}{color} |
| *Delta Granularity* \\ | {color:#ff0000}{*}File{*}{color} |
| *Accuracy in Identifying Changes* \\ | {color:#ff0000}{*}Not Guaranteed{*}{color} \\ |
| *Disk I/O Impact for Small Files* \\ | {color:#66cc00}{*}Small, as files are only read or skipped based on their attributes{*}{color} \\ |
| *Disk I/O Impact for Large Files* | {color:#ff0000}{*}Very bad for performance as whole files are backed up for even a small change{*}{color} |

h3. Check Sums

Check sums can be used to compute Deltas.  The main advantage of the check sum method is that granularity beyond complete files is provided.  Check sum methods all have some concept of a block, where a block is either fixed length or variable length.  In the fixed length case a file is broken into fixed length byte ranges or blocks, for example 4 KB.  A signature called a check sum is then computed for each 4 KB block.  The most popular check sums for this purpose are [MD4|http://en.wikipedia.org/wiki/MD4] and [MD5|http://en.wikipedia.org/wiki/MD5].  For more information on check sums see: [http://en.wikipedia.org/wiki/Cryptographic_hash_function]

The most popular algorithm for computing deltas in backup applications is the [rsync|http://en.wikipedia.org/wiki/Rsync#Algorithm] algorithm.  _Many_ commercial backup applications use the rsync algorithm to compute deltas for backup purposes.  One of the best known is [Evault|http://www.evault.com/].  Evault actually has a [patent on the process|http://www.google.com/patents?id=4Gl7AAAAEBAJ] of using variable length block deltas for incremental backup purposes.  Some backup application vendors like [Vembu|http://www.vembu.com] actually [brag about being based on the rsync algorithm|http://www.vembu.com/storegrid/rsync-incremental-backup.html].

There are two major challenges with the process of using check sums to compute deltas.
# The entire file system tree must be indexed and walked
# The backup application must read all the contents of all files to compute deltas.  This can take many hours or days on a large data set.
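A minimal sketch of the fixed-length variant (4 KB blocks hashed with MD5; the stored catalog of previous check sums is hypothetical).  Note that every byte of every file must still be read just to find out which blocks changed:

{code:python}
import hashlib

BLOCK_SIZE = 4 * 1024

def block_checksums(path):
    """Return a list of MD5 digests, one per 4 KB block of the file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            sums.append(hashlib.md5(block).hexdigest())
    return sums

def changed_blocks(path, previous_sums):
    """Compare against the check sums recorded by the last backup run."""
    current = block_checksums(path)
    return [i for i, digest in enumerate(current)
            if i >= len(previous_sums) or previous_sums[i] != digest]
{code}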

| *Requires Time Consuming Walk of Entire File System Tree to Compute Deltas* \\ | {color:#ff0000}{*}Yes{*}{color} |
| *Delta Granularity* \\ | {color:#66cc66}{*}Block{*}{color} |
| *Accuracy in Identifying Changes* \\ | {color:#66cc66}{*}Perfect unless optimized by checking file attributes{*}{color}\\ |
| *Disk I/O Impact for Small Files* \\ | {color:#ff0000}{*}Must Read All Files to Compute Deltas{*}{color}\\ |
| *Disk I/O Impact for Large Files* | {color:#ff0000}{*}Must Read All Files to Compute Deltas{*}{color} |

h3. near-Continuous Delta Method (CDP)

The most efficient method for computing Deltas is the near-Continuous or CDP method.  R1Soft happens to be the only example of a near-Continuous Delta method for both Windows and Linux platforms.  The near-Continuous method works by using a Disk Volume Device Driver.  The device driver is situated between the file system (e.g. NTFS) and the Disk Volume (e.g. Logical Disk Volume 1).

By locating a device driver between the file system and the raw Disk Volume the application is able to identify changed Disk Blocks in real time with negligible performance impact.  It's really quite a simple concept.  In Windows this kind of Device Driver is called an Upper Volume Filter Driver.  R1Soft's Linux CDP implementation also uses a device driver.  Linux does not have an official filter driver API, though the form and function is very similar to the Windows CDP driver.

_"Why spend hours reading data form the Disk just to compute Deltas when you can watch them happen for free?"_, says David Wartell, R1Soft Founder.

With the near-Continuous method of Delta computation a fixed length block size is used that in practice usually corresponds to the file system block size.  Typically this fixed block size is 4 KB but it can vary in different environments or implementations.  As writes to the Disk are _observed_, the number of each changed block is recorded in a specialized in-memory data structure.

R1Soft Linux Agents version 1.0 employ a [bitmap|http://en.wikipedia.org/wiki/Bitmap] for this purpose, where a region in memory uses 1 [bit|http://en.wikipedia.org/wiki/Bit] to describe the state of each disk block.  Bitmaps are more commonly known for their use in image file formats.  With a 4 KB block size there are 26,214,400 Disk Blocks per 100 [GB|http://en.wikipedia.org/wiki/Gigabyte] of Disk Volume size.  That corresponds to 26,214,400 bits or 3,276,800 bytes (3.125 [MB|http://en.wikipedia.org/wiki/Megabyte]) of memory to track _all_ Deltas made to 100 GB of raw Disk capacity.
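A hedged user-space sketch of the bitmap idea (the real tracking happens inside a kernel driver, and this is not R1Soft's code): one bit per 4 KB block means a 100 GB volume needs 26,214,400 bits, or 3.125 MB, of memory.

{code:python}
BLOCK_SIZE = 4 * 1024
VOLUME_SIZE = 100 * 1024 ** 3                      # 100 GB
NUM_BLOCKS = VOLUME_SIZE // BLOCK_SIZE             # 26,214,400 blocks

bitmap = bytearray(NUM_BLOCKS // 8)                # 3,276,800 bytes = 3.125 MB

def record_write(byte_offset):
    """Called for every observed write: mark the containing block as changed."""
    block = byte_offset // BLOCK_SIZE
    bitmap[block // 8] |= 1 << (block % 8)

def changed_block_numbers():
    """Blocks that must be read at the next synchronization (illustrative scan)."""
    return [b for b in range(NUM_BLOCKS) if bitmap[b // 8] & (1 << (b % 8))]
{code}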

R1Soft 2.0 Windows Agents and later use a new proprietary data structure for tracking deltas developed by R1Soft.  This new data structure is based on a [Tree|http://en.wikipedia.org/wiki/Tree_data_structure] so that in the average case only 200 or 300 [KB|http://en.wikipedia.org/wiki/Kilobyte] of memory is used to track _all_ Deltas per 100 GB of raw Disk capacity.  R1Soft is making this new more efficient data structure available to its Linux CDP technology with the release of Continuous Data Protection Server 3.0.
!r1-cdp-in-thestack.png|width=32,height=32!
 



h4. near-Continuous (CDP) Change Tracking Overhead

+Changes Tracked in Memory+
* R1Soft CDP 2.0 Change Log - 3 MB of memory used per 100 GB of raw Disk Volume capacity where CDP is enabled
* R1Soft CDP 3.0 - 200 - 300 KB on average per 100 GB of raw Disk Volume capacity

+Disk I/O Overhead Caused by CDP Change Tracking+
* So small it is not measurable
* No delay to I/O
* The Windows and Linux kernels do very little extra work (a few dozen extra assembly language operations)

h4. Does near-Continuous (CDP) Deal with a Server Rebooting? (deltas are tracked in memory)

+Yes, a Reboot or Crash Automatically Triggers an Integrity Check to Re-Sync CDP+
* A secure random key is kept in Windows and Linux kernel memory.
* A copy of this secure key is stored in the data repository with each synchronization.
* At the start of each synchronization operation the key in the data repository is compared to the key in kernel memory.  If the keys differ then there was a reboot or crash of the server and an integrity check is required to re-sync CDP.
+Re-Syncing CDP+
* Requires a full block scan of the Disk Volume; CDP fails safe to the check sum method of computing Deltas since the change tracking data structure is lost on reboot.
* Compares the MD5 check sum of each used Disk Block to the check sum in the last completed recovery point synchronization.
* Only deltas are sent over the network and stored.
* After Re-Sync (Integrity Check)  CDP is back in Sync and near-Continuous Delta method is used.
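A hedged sketch of the re-sync decision described above (the volume object, its methods, and the key handling are hypothetical stand-ins for the real kernel and repository interfaces):

{code:python}
def needs_integrity_check(kernel_key, key_saved_with_last_sync):
    """The key in kernel memory is lost on reboot or crash, so a mismatch with
    the copy saved at the last synchronization means the in-memory change log
    is gone and CDP must re-sync."""
    return kernel_key is None or kernel_key != key_saved_with_last_sync

def compute_deltas(kernel_key, saved_key, volume, last_point_checksums):
    if needs_integrity_check(kernel_key, saved_key):
        # Fail safe to the check sum method: compare the MD5 of every used
        # block against the last completed recovery point; only differing
        # blocks become deltas.
        return [block for block, digest in volume.block_checksums()
                if last_point_checksums.get(block) != digest]
    # Otherwise the normal near-Continuous method applies: the driver already
    # knows exactly which blocks changed.
    return volume.changed_blocks_from_driver()
{code}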
 

h4. Overview of near-Continuous (CDP) Delta Backup Method

| *Requires Time Consuming Walk of Entire File System Tree to Compute Deltas* \\ | {color:#66cc66}{*}No{*}{color}\\ |
| *Delta Granularity* \\ | {color:#66cc66}{*}Block{*}{color} |
| *Accuracy in Identifying Changes* \\ | {color:#66cc66}{*}Perfect{*}{color} \\ |
| *Disk I/O Impact for Small Files* \\ | {color:#66cc66}{*}Only Deltas when CDP in sync{*}{color}\\ |
| *Disk I/O Impact for Large Files* | {color:#66cc66}{*}Only Deltas when CDP in sync{*}{color} |


\\