h4. Computing Deltas with Check Sums

rsync is an efficient method for synchronizing a set of files over the network.  rsync can compute the differences between two file trees efficiently using an ingenious specialized method of [computing deltas|TP:Categories of Backup Software] [using checksums|TP:Computing Deltas - Check Sums] called the rsync algorithm.  Deltas are the parts of files that have changed from one version of a file set to another.  rsync defines deltas as variable length blocks or chunks of data.

Using check sums to determine the differences between files is nothing new.  What rsync added is the idea of a weak and a strong check sum.  rsync breaks each file into small fixed size chunks of data called blocks.  Each block is read into memory, where rsync computes the weak check sum ([adler-32|http://en.wikipedia.org/wiki/Adler-32]) of the block.  The adler-32 check sum requires very little CPU time to compute.  The weak check sum of a block in the new version of a file is compared to the weak check sum of the corresponding block in the older version of the file.  If the weak check sums are different, the block has changed and is copied from the source to the destination.
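
The exact implementation lives in the rsync source, but a minimal Python sketch of the per-block weak check sum looks roughly like this.  The 4096 byte block size is an assumption (rsync picks its own block size based on file size), and rsync's real implementation uses a rolling variant of adler-32 that this sketch omits:

{code:language=python}
import zlib

BLOCK_SIZE = 4096  # assumed block size; rsync chooses its own based on file size

def weak_checksums(path, block_size=BLOCK_SIZE):
    """Compute the adler-32 weak check sum of every fixed size block in a file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sums.append(zlib.adler32(block))
    return sums
{code}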

By its design the adler-32 check sum is "weak", meaning there is a fair chance that two blocks of data that are not the same could compute to the same check sum.  If that happened, rsync would not detect that the block changed and the destination file would be corrupt.  To overcome this, matching weak check sums are verified with a strong check sum ([MD4|http://en.wikipedia.org/wiki/MD4]), for which it is almost impossible to produce a match on two different blocks.  If the strong check sum is also the same, the block is determined to not have changed.  If it is different, the block is determined to have changed and is updated in the destination file set.

The whole idea of the rsync algorithm is to copy as little data as possible across a slow network link to update an older version of a set of files on another computer.  rsync does a great job of this, as it only sends deltas, the parts of files that have actually changed.

In doing this rsync also aims to reduce the CPU time needed to compute check sums.  This is accomplished by avoiding the more expensive strong check sum (MD4) as much as possible.  Remember, deltas are just the parts of a file that have changed.  This is a great idea and it works very well.
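
To make the weak/strong two-step concrete, here is an illustrative Python sketch of the comparison described above.  It follows the simplified block-by-block description in this article rather than rsync's actual wire protocol, and it substitutes MD5 for MD4 because MD4 is often unavailable in modern Python hashlib builds:

{code:language=python}
import hashlib
import zlib

def block_changed(old_block, new_block):
    """Two step comparison in the spirit of the rsync algorithm: the cheap
    weak check sum rules out most blocks, and only a weak match triggers
    the more expensive strong check sum."""
    if zlib.adler32(old_block) != zlib.adler32(new_block):
        return True   # weak check sums differ: the block definitely changed
    # Weak check sums can collide, so confirm the match with a strong hash.
    # rsync uses MD4; MD5 is substituted here for availability.
    return hashlib.md5(old_block).digest() != hashlib.md5(new_block).digest()
{code}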

h4. rsync Backup Resources

* rsync backup recipes - [http://www.mikerubel.org/computers/rsync_snapshots/]
* incremental backups with rdiff-backup - [http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backup/]
* rsync incremental backup script - [http://www.rustyparts.com/ribs.php]
* official rsync examples - [http://samba.anu.edu.au/rsync/examples.html]
* rsync backup documentation written by Kevin Korb - [http://www.sanitarium.net/golug/rsync_backups.html]


h4. Challenges with rsync Backups

The real challenge, as any server administrator knows, is that on a modern server CPU time is usually plentiful while disk I/O is very precious.  rsync must read all of the data in the file set from disk to compute the check sums.  This means the rsync operation takes longer and longer as the data grows, no matter how little data has actually changed.  This is why your server's I/O wait time skyrockets, along with waiting processes and load average, when rsync runs on a large data set.

For more information see: [TP:Why Are Server Backups So Painful?|TP:Why Are Server Backups So Painful?]


*rsync is File Based and Not Block Based*

Another challenge is that rsync works at the [level of files and directories|TP:File Based Backup Technology] like any other program.  This might seem confusing considering the previous paragraphs discussed how rsync divides files into blocks.  What it means is that rsync examines each file and must create an index of the path to every file and directory on the server.  For a server with a fairly large number of files this can be an enormous task all on its own.  It is also the reason people often have problems with rsync consuming large amounts of memory on larger servers.  This is different from [block based backup software|TP:Block Based Backup Technology], which bypasses the file system to read low level disk or volume blocks.
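
A rough way to see the cost of that index is to count the entries rsync would have to track for a given tree.  A small sketch, assuming a POSIX-style path (the example path is hypothetical):

{code:language=python}
import os

def count_index_entries(root):
    """Count every directory and file rsync would have to include in its
    file list.  Memory use and scan time grow with this count, no matter
    how little data has actually changed."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        total += 1 + len(filenames)   # the directory itself plus its files
    return total

# print(count_index_entries("/var/www"))   # hypothetical path
{code}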

*rsync Keeps only One Copy of Data*

On its own rsync keeps only one copy of the data.  A basic feature of backup and archiving is the ability to keep multiple historical archive points as defined by some policy.  There are some very clever scripts that combine rsync with Linux file system hard links to create an effective system for keeping multiple rsync archive points.  A great resource for how to script incremental backups with rsync is [http://www.mikerubel.org/computers/rsync_snapshots/].
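
One common pattern from those scripts is rsync's --link-dest option, which hard links files that have not changed against the previous snapshot so each archive point looks like a full copy.  A minimal sketch, assuming rsync is installed and using hypothetical source and destination paths:

{code:language=python}
import subprocess
from datetime import datetime

SOURCE = "/home/"           # hypothetical source tree
BACKUP_ROOT = "/backups"    # hypothetical snapshot root

def take_snapshot():
    """Create a timestamped snapshot directory.  Unchanged files are hard
    linked against the previous snapshot instead of being copied, so each
    snapshot looks like a full backup while only changed files use space."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    dest = f"{BACKUP_ROOT}/{stamp}"
    subprocess.run(
        ["rsync", "-a", "--delete",
         f"--link-dest={BACKUP_ROOT}/latest", SOURCE, dest],
        check=True)
    # Point the "latest" symlink at the snapshot just taken.
    subprocess.run(["ln", "-sfn", dest, f"{BACKUP_ROOT}/latest"], check=True)

if __name__ == "__main__":
    take_snapshot()
{code}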


*rsync Is Not Able to Get a Consistent Copy of Data*

*CDP Uses a near-Continuous Method of Computing Deltas*

As you read above, a major drawback to rsync is the time and disk I/O needed to compute deltas using check sums.  Continuous Data Protection solves this problem with a proprietary device driver that can [track or see changes to the disk or volume|TP:Computing Deltas - near-Continuous (CDP)] at a very low level while the system is running.
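
The driver itself is proprietary, but the idea can be illustrated with a conceptual sketch: keep one flag per block and flip it as writes happen, so the set of changed blocks is already known when a backup starts.  Everything below is illustrative only and is not R1Soft's actual interface:

{code:language=python}
class ChangeTracker:
    """Conceptual illustration only: remember which blocks of a volume have
    been written since the last synchronization, so a backup can read just
    those blocks instead of scanning the whole volume."""

    def __init__(self, total_blocks):
        self.dirty = bytearray(total_blocks)   # one flag per block

    def on_write(self, block_number):
        # Called, conceptually, from the write path as each change happens.
        self.dirty[block_number] = 1

    def changed_blocks(self):
        # The deltas are already known before the backup operation starts.
        return [i for i, flag in enumerate(self.dirty) if flag]

    def reset(self):
        # Clear the flags after a successful synchronization.
        self.dirty = bytearray(len(self.dirty))
{code}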

Imagine, if you will, that the data on your disk is like the earth's surface.  Most of the surface is not changing, yet at the same time forests are being cut down, buildings and roads are being constructed, and houses are being re-painted.  Every time rsync needs to make a map of your files or hard disk it must walk the entire earth to identify all the little changes, which of course takes forever.  Continuous Data Protection, on the other hand, works like a giant database that is instantly updated every time a change is made.  Ask it at any time what the map looks like and it has an up to date answer.

For more information see: [TP:Computing Deltas - near-Continuous (CDP)|TP:Computing Deltas - near-Continuous (CDP)]


*CDP Is Block Based and Works at the Level of the Linux Kernel*

Continuous Data Protection is [block based|TP:Block Based Backup Technology], bypassing the file system.  It is not concerned with the number of files.  It does not index files on the file system.  It only reads deltas.  It knows what the deltas are before the backup operation starts because of the [near-Continuous method for computing deltas|TP:Computing Deltas - near-Continuous (CDP)].

*CDP Creates a Point-in-Time Snapshot of the Volume*
*CDP Uses Virtual Full Backups for Unlimited Recovery Point Archiving*

CDP works by doing a full backup only once.  Once and only once.  From then on, every backup operation is really a "synchronization" where only deltas are copied.  Through proprietary technology called the "Disk Safe", R1Soft is able to store deltas in a highly efficient manner.  Only block level deltas are ever stored, and they can optionally be compressed.  Each backup or synchronization appears as a virtual Full Backup even though it is made up of only deltas.  For more details see: [TP:Backup Method - Virtual Full Backup|TP:Backup Method - Virtual Full Backup]
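
The Disk Safe format is proprietary, so the following is only a conceptual sketch of a virtual full backup: one initial full copy plus a series of block level delta sets, where any recovery point can be presented as a full image by replaying the deltas up to that point.  The data structures are assumptions for illustration:

{code:language=python}
def reconstruct_point(base_image, recovery_points, point_index):
    """Conceptual sketch of a virtual full backup.  'base_image' is the one
    and only full copy (a dict of block_number -> bytes) and each entry in
    'recovery_points' holds only the block deltas captured at that
    synchronization.  Replaying the deltas in order yields what looks like
    a full backup for the chosen point in time."""
    image = dict(base_image)
    for deltas in recovery_points[:point_index + 1]:
        image.update(deltas)
    return image
{code}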

h2. Technology Comparison