NWC Print
July 2008

Data De-duping: An Antidote to Bloated Storage

 

Backup devices are filled with redundant data. These tools can help thin out the ranks and regain space

 By Howard Marks

Just a few years ago, disk-to-disk backup seemed almost too good to be true. Powered by inexpensive ATA (and later SATA) disk drives, D2D, whether implemented as virtual tape libraries or as a backup-to-disk option in your favorite backup application, made backups faster, eliminated mechanical failures in tape drives and libraries, and made it easier to deal with the continuous chorus of calls to the helpdesk for individual file restores.

Today, our disk-backup devices are filling up, and there’s not enough space or power in the data center to add another petabyte of backup space, so we’re keeping only two to three days’ worth of backups on disk, when we’d like to keep a month’s worth. Problem is, there’s too much duplicate data in our backup sets. The good news is, vendors—smelling money, of course—are promising that their new data de-duplication products can provide 20-to-1, even 300-to-1 reductions in the amount of data we need to store. Can it be? Let’s take a look.

De-duplication technology lets you store more backup data on a given set of disks. This can extend the period you keep disk backups and reduce your data center power and cooling costs. If you de-dupe data before sending it across the WAN, you can save on bandwidth, making online off-site backups practical at companies that used to rely on tape. The only drawback to data de-duplication is that it can slow down the backup process.

 

 Point of Origin

Duplicate data makes its way into backups in two realms: the temporal, as your backup program backs up the same file from the same directory multiple times, and the geographical, as the same files are backed up from multiple locations in your network. Most networks have a surprising amount of duplicate data, from the holiday party invitation PDF 56 users saved to their home directories to the 3 GB of Windows files on the system drive of every server.

One solution to file duplication in the temporal realm is incremental backup. Although we’re big fans of this, especially the incremental-forever approach used by Tivoli Storage Manager and others, we don’t consider incremental backups to be data de-duplication any more than we consider RAID disaster recovery. Incremental backups fall in the realm of duplicate avoidance.

The most basic form of data de-duplication is the file-level single-instance store found in CAS (content-addressable storage) devices, such as EMC’s Centera. As each file is stored on a CAS system, the device generates a hash of the file’s contents; should a file with the same hash already exist, rather than saving another copy, the system just creates another pointer to the copy it already has.
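The pointer-instead-of-copy idea can be sketched in a few lines of Python. This is a toy illustration, not how Centera or any other CAS product is implemented; the class name, the use of SHA-256 and the in-memory dictionaries are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level single-instance store (hypothetical, in memory only)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file contents, stored once
        self.pointers = {}  # file path -> content hash

    def put(self, path, data):
        """Store a file; return True only if its contents were new."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        # Duplicate or not, the path simply points at the stored copy.
        self.pointers[path] = digest
        return is_new

    def get(self, path):
        return self.blobs[self.pointers[path]]


store = SingleInstanceStore()
invite = b"%PDF-1.4 holiday party invitation"
for user in range(56):
    store.put(f"/home/user{user}/invite.pdf", invite)

print(len(store.pointers), "paths,", len(store.blobs), "stored copy")
```

Fifty-six users saving the same invitation cost one stored copy plus 56 small pointers.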

Microsoft’s latest version of Windows Storage Server, the OEM NAS (network-attached storage) version of Windows Server, uses a slightly different approach to eliminating duplicate files. Rather than identify duplicates as they’re written, WSS runs a background process, the SIS (single-instance storage) Groveler, which identifies duplicate files using a partial file hash function followed by a full binary comparison, moves one copy to a common storage area and replaces the files in their original locations with links to the copy in the common store.
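A rough sketch of that two-stage check, in Python: a cheap partial signature narrows the candidates, and only a full byte comparison confirms a duplicate. The signature used here (file size plus a hash of the first and last 4 KB) is an assumption for illustration and is not the Groveler's actual function; the names `partial_signature` and `grovel` are hypothetical.

```python
import hashlib
from collections import defaultdict


def partial_signature(data, sample=4096):
    """Cheap candidate filter: size plus a hash of the head and tail bytes."""
    digest = hashlib.sha256(data[:sample] + data[-sample:]).hexdigest()
    return (len(data), digest)


def grovel(files):
    """files maps path -> bytes. Returns (common_store, links)."""
    candidates = defaultdict(list)
    for path, data in files.items():
        candidates[partial_signature(data)].append(path)

    common_store = {}  # store id -> contents kept once
    links = {}         # original path -> store id
    for paths in candidates.values():
        groups = []  # (contents, [paths]) confirmed by full comparison
        for path in paths:
            data = files[path]
            for group_data, group_paths in groups:
                if group_data == data:  # full binary comparison
                    group_paths.append(path)
                    break
            else:
                groups.append((data, [path]))
        for group_data, group_paths in groups:
            if len(group_paths) > 1:
                store_id = len(common_store)
                common_store[store_id] = group_data
                for p in group_paths:
                    links[p] = store_id
    return common_store, links


files = {
    "a.doc": b"A" * 10000,
    "b.doc": b"A" * 10000,
    # Same size, same head and tail, different middle: a false candidate.
    "c.doc": b"A" * 5000 + b"B" + b"A" * 4999,
}
common_store, links = grovel(files)
```

The third file passes the partial check but fails the full comparison, so it is left alone. That is the reason for the second stage.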

Although file-level SIS can save some space, things get really interesting if we eliminate not only duplicate files but also duplicate data within files. Think of Outlook’s lowly .PST file. A typical user may have a 300-MB or larger .PST holding all his e-mail from time immemorial; every day he receives one or more new messages, and since his .PST file is changed that day, your backup program includes it in the incremental backup even though there are only 25 KB of changes in the 300-MB file.

A de-duping product that could identify that 25 KB of new data and store it without the rest of the baggage could save lots of disk space. Extend that concept so that duplicate data, such as the 550-KB attachment that’s in 20 users’ .PST files, can be eliminated, and you could achieve staggering data-reduction factors. One such group of solutions is the de-duping backup targets pioneered by Data Domain. To a backup application, these devices look like a VTL (virtual tape library) or NAS device. They take their data from the backup app and do their de-duplication magic on it transparently.

 Modus Operandi

Vendors have taken three basic approaches to the data de-duplication process. The hash-based approach, used by Data Domain, FalconStor Software in its VTL software and Quantum in its new DXi-series appliances, breaks the data stream from the backup app into blocks and generates a hash for each block, using SHA-1, MD5 or a similar algorithm. If the hash for a new block matches a hash that’s in the device’s hash index, the data has already been backed up, and the device just updates its tables to say the data exists in the new location too.
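In outline, the hash-based approach looks something like the following Python sketch. It is a simplified model, not any vendor's implementation: the fixed 4-KB block size, the class name `DedupeTarget` and the in-memory index are assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class DedupeTarget:
    """Toy hash-based de-duplicating backup target (hypothetical)."""

    def __init__(self):
        self.index = {}    # block hash -> position in self.blocks
        self.blocks = []   # each unique block, stored once
        self.backups = {}  # backup name -> ordered list of block positions

    def write(self, name, stream):
        recipe = []
        for offset in range(0, len(stream), BLOCK_SIZE):
            block = stream[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            position = self.index.get(digest)
            if position is None:
                # New data: store the block and index its hash.
                position = len(self.blocks)
                self.blocks.append(block)
                self.index[digest] = position
            # Known or new, the backup just records where the block lives.
            recipe.append(position)
        self.backups[name] = recipe

    def read(self, name):
        return b"".join(self.blocks[p] for p in self.backups[name])


target = DedupeTarget()
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
tuesday = monday + b"\xff" * BLOCK_SIZE  # one new block of data
target.write("monday", monday)
target.write("tuesday", tuesday)
print(len(target.blocks))  # 11 unique blocks stored for 21 blocks written
```

Both backups restore in full, but the second one adds only the single block that changed.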

The hash-based approach has a built-in scalability issue. To quickly tell if a given block of data has been backed up, the device should hold the hash index in memory. As the number of backed-up blocks grows, so does the index. Once the index grows beyond the device’s ability to hold it in memory, performance falls off, as disk searches are much slower than memory searches. As a result, most hash-based systems are self-contained appliances balancing the amount of memory with the amount of disk space for storing data so the hash table never grows too big.
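A back-of-the-envelope calculation shows how quickly the index grows. The figures below are assumptions, not vendor specifications: 8-KB blocks, and 28 bytes per index entry (a 20-byte SHA-1 digest plus an 8-byte block location).

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def index_size_bytes(stored_bytes, block_size=8 * 1024, entry_bytes=20 + 8):
    """Estimate hash-index size for a given amount of unique stored data.

    block_size and entry_bytes are illustrative assumptions.
    """
    blocks = -(-stored_bytes // block_size)  # round up to whole blocks
    return blocks * entry_bytes


print(index_size_bytes(10 * TIB) / GIB)  # 35.0
```

Under these assumptions, 10 TB of unique data needs roughly 35 GB of index, which is why memory and disk capacity have to be sized together.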

The second approach, content-aware de-duplication, relies on the backup appliance being aware of the data format it’s recording. It can use the file-system metadata embedded in the backup data to identify files; it then does byte-by-byte comparisons with other versions in its data repository to create a delta file of the changes in this version compared with the first version stored. This approach avoids the possibility of a hash collision (see “Don’t Fear Collisions,” below), but requires the use of a supported backup app so the device can extract metadata.
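The delta step can be illustrated with Python's standard `difflib` module. This is only a sketch of the general idea of storing a new version as differences against an old one; real products use their own comparison engines, and the function names and the copy/data operation format here are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # bytes already stored
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # bytes that must be stored
        # "delete" needs no entry: those old bytes are simply not copied
    return ops


def apply_delta(old, ops):
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old_pst = b"mailbox: " + b"old message. " * 50
new_pst = old_pst + b"one new message"
delta = make_delta(old_pst, new_pst)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(len(new_pst), "bytes in the file,", literal_bytes, "bytes of new data")
```

Because the comparison is byte-for-byte, no hashes are involved. The trade-off is that the device has to know which earlier file to compare against.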

ExaGrid Systems’ InfiniteFiler is an example of a content-aware de-duplication device that uses its knowledge of common backup apps such as CommVault Galaxy and Symantec Backup Exec to identify files from the source system as they’re backed up. After the backup is completed, it identifies files that have been backed up multiple times and generates deltas. Multiple InfiniteFilers can be combined into a grid supporting up to 30 TB of backup data. The de-duping approach ExaGrid uses does a good job of storing the one new message in a 1-GB .PST file, but it can’t eliminate duplicate data across different files, like the same attachment in four .PSTs.

Sepaton’s DeltaStor for its VTLs also uses the content-aware approach, but compares the new file both with previous versions from the same location and with versions backed up from other locations so it can eliminate geographical duplicates.

The third approach, used by Diligent Technologies in its ProtecTier VTL, divides data into blocks like the hash-based products but uses a proprietary algorithm to determine if given blocks are similar to one another. It then does a byte-by-byte compare of the data in similar blocks to determine if the block has been backed up.

 Hardware or Software

In addition to their de-duping approach, backup targets differ in their physical architectures. Data Domain, ExaGrid and Quantum make monolithic appliances that contain their disk arrays. The Data Domain and Quantum appliances can have NAS or VTL interfaces, while ExaGrid is always a NAS. Diligent and FalconStor sell their products as software, running on an Intel or Opteron server, to create a VTL gateway to external storage.

Although a backup appliance with a VTL interface may seem more sophisticated and could be easier to integrate into an existing tape-based backup environment, using a NAS interface gives your backup application more control over virtual media management. When a backup file reaches the end of its retention period, some backup apps, including Symantec’s NetBackup, can delete the file from their disk repository. When a de-duping NAS appliance sees the deletion, it can re-allocate its free space and hash index. Since you don’t delete tapes, there’s no way to release space on a VTL until the virtual tape is overwritten.

Of course, there is a price to pay for fitting 25 TB of data in a 1-TB bag, and not just in dollars. All the work of slicing your data into chunks and indexing it to remove the duplicates does slow things down more than just a little. A midrange VTL like an Overland REO 9000 can back up data at 300 MBps or better. Diligent has been able to achieve 200-MBps backup rates on its ProtecTier in third-party benchmarks, but that required a quad Opteron server front-ending an array of more than 100 disk drives.

Other vendors address the problem by de-duping the data as a separate process that runs after the backup. On a system running FalconStor’s VTL software, data is written from the backup app to a compressed but not de-duped virtual tape file. Then a background process chunks the data, removes the duplicates and creates a virtual virtual tape that is an index of which de-duped data blocks were on the original virtual tape. Once the data from a virtual tape is de-duped, the space it occupied is returned to the available space pool. Sepaton’s DeltaStor and ExaGrid also perform their de-duping as a post-backup process.

Although post-processing can boost backup speeds, it has its own costs. A system that does post-process de-duping must have enough disk space to hold a full set of standard backups in addition to its de-duped data. If you’re looking to keep to a weekly full/daily incremental backup schedule, you may need a couple of times more disk space on a system that de-dupes in the background to hold those full backups until it can digest them.

Just because the de-duping is running in the background, don’t ignore de-duping performance. If your VTL hasn’t finished digesting the weekend’s backups by the time you start backing up your servers again on Monday night, you may not be happy with the results. Disk space may not be available or the de-duping process may slow down your backups.

 Bandwidth Conservation

Saving disk space on a backup appliance isn’t the only application of subfile de-duping technology. A new generation of backup applications, including Asigra’s Televaulting, EMC’s Avamar Axion and Symantec’s NetBackup PureDisk, use hash-based data de-duplication to reduce the bandwidth needed to send backups across a WAN.

First, like any conventional backup application making an incremental backup, these use the usual methods like archive bits, last-modified dates and the file system change journal to ID the files that have changed since the last backup. They then slice, dice and julienne the file into smaller blocks and calculate hashes for each block.

The hashes are then compared with a local cache of the hashes of blocks that have been backed up at the local site. The hashes that don’t appear in the local cache, along with the file system metadata, are then sent to the central backup server, which compares the hashes with its hash tables. The backup server sends back a list of the hashes that it hasn’t seen before; the server being backed up then sends the data blocks represented by those hashes to the central server for safekeeping.
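The exchange described above can be modeled as a short Python sketch. It is a simplification under stated assumptions: the class and method names are hypothetical, the 1-KB block size is arbitrary, file system metadata is left out, and the "network" is just a method call. It does not reproduce the wire protocol of Televaulting, Avamar or PureDisk.

```python
import hashlib

BLOCK_SIZE = 1024  # assumed block size, for illustration only


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class CentralServer:
    """Toy central backup server holding one copy of each block."""

    def __init__(self):
        self.blocks = {}  # block hash -> block data

    def unknown_hashes(self, hashes):
        # Reply with only the hashes this server has never seen.
        return [h for h in hashes if h not in self.blocks]

    def store(self, blocks_by_hash):
        self.blocks.update(blocks_by_hash)


class BranchClient:
    """Toy remote server being backed up across the WAN."""

    def __init__(self, server):
        self.server = server
        self.local_cache = set()  # hashes already backed up from this site

    def backup(self, data):
        """Back up `data`; return the number of data bytes sent."""
        by_hash = {hashlib.sha256(b).digest(): b for b in split_blocks(data)}
        # Step 1: skip anything the local cache says is already backed up.
        candidates = [h for h in by_hash if h not in self.local_cache]
        # Step 2: send hashes only; the server says which ones it lacks.
        wanted = self.server.unknown_hashes(candidates)
        # Step 3: send just the blocks behind those hashes.
        self.server.store({h: by_hash[h] for h in wanted})
        self.local_cache.update(by_hash)
        return sum(len(by_hash[h]) for h in wanted)


server = CentralServer()
deck = b"".join(i.to_bytes(4, "big") * 256 for i in range(100))  # 100 blocks
first_branch = BranchClient(server)
second_branch = BranchClient(server)
sent_first = first_branch.backup(deck)
sent_second = second_branch.backup(deck)
print(sent_first, sent_second)  # 102400 0
```

The second branch ships only hashes: every block it holds is already at the central site.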

These backup solutions could reach even higher data-reduction levels than the backup targets by de-duplicating not just the data from the set of servers that are backed up to a single target or even a cluster of targets but across the entire enterprise. If the CEO sends a 100 MB PowerPoint presentation to all 500 branch offices, it will be backed up from the one whose backup schedule runs first. All the others will just send hashes to the home office and be told, “We already got that, thanks.”

This approach is also less susceptible to the scalability issues that affect hash-based systems. Since each remote server only caches the hashes for its local data, that hash table shouldn’t outgrow available space, and since the disk I/O system at the central site is much faster than the WAN feeding the backups, even searching a huge hash index on disk is much faster than sending the data.


Although Televaulting, Avamar Axion and NetBackup PureDisk all share a similar architecture and are priced based on the size of the de-duplicated data store, there are some differences. NetBackup PureDisk uses a fixed 128-KB block size, whereas Televaulting and Avamar Axion use variable block sizes, which should result in greater de-duplication. PureDisk can be managed from NetBackup, and Symantec promises greater integration in the future, which we hope means de-duplication integrated into data center backup jobs. Asigra also markets Televaulting for service providers so small businesses that don’t want to set up their own infrastructure can take advantage of de-duplication too.
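Why variable block sizes should find more duplicates can be shown with a small Python experiment. Fixed blocks are cut at set offsets, so inserting a single byte at the front of a file shifts every block that follows. Variable (content-defined) blocks are cut wherever the data itself matches a pattern, so the boundaries move with the content. The boundary rule below, a CRC-32 of the previous 16 bytes with its low 8 bits equal to zero, is an assumption chosen for simplicity and is not any vendor's algorithm; the block sizes are far smaller than real products use.

```python
import random
import zlib


def fixed_chunks(data, size=256):
    """Cut at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0xFF):
    """Cut wherever a checksum of the preceding `window` bytes hits a pattern."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old_chunks, new_chunks):
    """Count distinct chunks present in both versions."""
    return len(set(old_chunks) & set(new_chunks))


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
new = b"!" + old  # the same file with one byte inserted at the front

shared_fixed = shared(fixed_chunks(old), fixed_chunks(new))
shared_variable = shared(variable_chunks(old), variable_chunks(new))
print("fixed blocks reused:", shared_fixed)
print("variable blocks reused:", shared_variable)
```

With fixed blocks the one-byte insertion leaves nothing to reuse, while the variable scheme resynchronizes at the first boundary and reuses nearly everything after it.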

Backup targets, including FalconStor’s VTL, Quantum’s DXi series and Data Domain’s appliances that can replicate data after it has been de-duped, can see the same kind of bandwidth reductions for branch data center off-site backups and disaster recovery of applications that don’t require real-time replication.

Data de-duplication is here to stay for at least a while. We spoke to several users who report they really do get 20-to-1 and greater data-reduction factors without making major changes to their backup processes. Small organizations can use the new-generation backup programs from Asigra, EMC and Symantec to replace their conventional backup solutions. Midsize organizations can use backup targets in the data center. Large enterprises with very high backup performance needs may have to wait for the next generation.
