The differences between data deduplication technology for data backups and primary data storage

Date: Aug 04, 2010

Data deduplication technology reduces data and saves space in disk-based data backups. Recently, data deduplication has been used in primary data storage systems as well. How is data deduplication for disk-based backup different from data deduplication for primary data storage? What can we expect in the future?

In this interview, Preston discusses the differences between data deduplication for disk-based data backups and primary data storage with Rich Castagna, editorial director of TechTarget's Storage Media Group.

Castagna: Data deduplication technology has been a game-changer for disk-based backup. Can it do the same for primary data storage?

Preston: Data deduplication technology is so appropriate for backups because a typical set of backups contains so much duplicate data. Primary data contains much less duplicate data, so we're not going to get the same data reduction ratios. However, if done correctly, data deduplication can save a lot of power, cooling and disk drives in primary data storage.
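To put rough numbers on that point, here is a minimal Python sketch that compares the dedupe ratio of a single primary volume with that of a store holding ten full backups of it. Everything in it is an assumption made for illustration: the 4 KB block size, the 10% duplication in the primary volume, the 1% weekly change rate, and the helper names `make_block` and `dedupe_ratio`. It is not a model of any vendor's product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def make_block(tag):
    """Build a deterministic 4 KB block whose content depends on `tag`."""
    return hashlib.sha256(str(tag).encode()).digest() * (BLOCK_SIZE // 32)


def dedupe_ratio(blocks):
    """Logical block count divided by the count of unique blocks."""
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)


# Hypothetical primary volume: 1,000 blocks, 900 of them distinct.
primary = [make_block(i % 900) for i in range(1000)]

# Hypothetical backup store: ten weekly full backups of that volume,
# with 10 blocks (1%) carrying new content in each backup.
backups = []
for week in range(10):
    snapshot = list(primary)
    for i in range(10):
        snapshot[i] = make_block(f"week{week}-block{i}")
    backups.extend(snapshot)

print(f"primary ratio: {dedupe_ratio(primary):.2f}:1")  # about 1.11:1
print(f"backup ratio:  {dedupe_ratio(backups):.2f}:1")  # 10.00:1
```

Under these assumed numbers the repeated full backups dedupe at 10:1 while the primary volume barely exceeds 1:1, which is the gap Preston describes.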

Castagna: How much of a hit does dedupe put on the performance of a data storage system?

Preston: In the backup world, there was a bit more acceptance of some sort of performance hit because of the competition dedupe was seeing with backups. In fact, there was a little too much tolerance in the backup world for deduplication that caused a restore speed hit. With primary data storage, only a very small hit is acceptable. Anything more than that changes the user experience, and end users will notice and complain. Essentially, there are some technologies that have either no performance impact or negligible performance impact on primary data storage, and that's something you can expect from a primary storage vendor.

Castagna: NetApp builds data deduplication into its operating system. But other vendors like Ocarina Networks and Storwize Inc. [recently acquired by IBM Corp.] have dedupe appliances. Are there advantages or disadvantages to either approach?

Preston: In the case of NetApp, I think the answer is yes: having deduplication built into its operating system is an advantage. This is because they already have checksums of all the blocks that they store and they're already good at moving pointers around. Also, they can keep snapshots for weeks at a time without affecting performance, and that's because of how Write Anywhere File Layout (WAFL) works. All they have to do is compare the checksums of blocks to see if they match; if they do, they replace the duplicate blocks with pointers. In the end, what they get is negligible performance impact and pretty decent data deduplication. However, this method probably won't get as high a dedupe ratio with disk-based backup as other products specifically designed for certain types of data.

But with primary data storage, NetApp should be able to dedupe very well, relatively efficiently and with minimal performance impact.
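The checksum-and-pointer idea described above can be sketched in a few lines of Python. This is a generic illustration, not NetApp's implementation: the `DedupeStore` class, the 4 KB block size and the use of SHA-256 as the checksum are all assumptions, and the byte-for-byte comparison that guards against checksum collisions is an added safeguard that the interview does not mention.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class DedupeStore:
    """Toy block store that keeps one physical copy of each unique block."""

    def __init__(self):
        self.blocks = []        # physical blocks actually stored
        self.by_checksum = {}   # checksum -> index into self.blocks

    def write(self, data):
        """Store data and return a list of pointers (block indexes)."""
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            checksum = hashlib.sha256(block).digest()
            index = self.by_checksum.get(checksum)
            # Reuse the existing block only if checksum and bytes both match.
            if index is None or self.blocks[index] != block:
                index = len(self.blocks)
                self.blocks.append(block)
                self.by_checksum[checksum] = index
            pointers.append(index)
        return pointers

    def read(self, pointers):
        """Reassemble the original data by following the pointers."""
        return b"".join(self.blocks[i] for i in pointers)


store = DedupeStore()
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_b = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE
ptrs_a = store.write(file_a)
ptrs_b = store.write(file_b)
print(ptrs_a, ptrs_b, len(store.blocks))  # [0, 1] [0, 2] 3
```

The two files hold four logical blocks, but the shared first block is stored once, so only three physical blocks are kept and both files read back unchanged.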

The other two products, which are appliances, are very different. The Storwize product is just compression in and out. Storwize argues that they're kind of like a compression chip for a tape drive. Like NetApp, Storwize also has a minimal performance impact, which is achieved because they're doing compression inline, meaning that they don't have to read and write as much data.
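As a rough illustration of why inline compression can reduce I/O, the following Python sketch compresses data before it reaches a pretend disk and decompresses it on the way back. It uses the standard `zlib` module as a stand-in; the `InlineCompressor` class and the sample data are hypothetical and say nothing about how the Storwize appliance works internally.

```python
import zlib


class InlineCompressor:
    """Toy write path that compresses data before it is stored."""

    def __init__(self):
        self.disk = {}            # name -> compressed bytes "on disk"
        self.bytes_written = 0    # bytes that actually reached the disk

    def write(self, name, data):
        compressed = zlib.compress(data)
        self.disk[name] = compressed
        self.bytes_written += len(compressed)

    def read(self, name):
        # Fewer bytes are read from disk, then expanded in memory.
        return zlib.decompress(self.disk[name])


appliance = InlineCompressor()
payload = b"log line: status=OK\n" * 1000  # highly repetitive sample data
appliance.write("app.log", payload)

print(len(payload), appliance.bytes_written)
```

With repetitive data like this, far fewer bytes are written than the application handed over, and the read path returns the original bytes. Data that is already compressed would see little or no reduction.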

Ocarina is designed for specific types of data -- mainly the types of data that no one is going after such as imaging and certain types of document files. In other words, they have a very content-aware deduplication approach for certain types of files that other dedupe vendors have basically ignored. This type of appliance will work well for people who have a lot of those files, but it won't do anything for something like VMware.

Castagna: Does deduping live data have an effect on using data deduplication during backups?

Preston: If you back up that data in the typical fashion, meaning you copy it en masse, as with incremental backups, then the answer is no. In the case of NetApp and the vendors that are following them, where you use products from NetApp like SnapMirror and SnapVault, the answer would be yes. You would get data deduplication benefits on both the primary copy and the secondary copy. But when you back it up using traditional methods, the data is re-duplicated and passed to your backup apps, so it gets no benefit there.

Castagna: Some people think that data deduplication technology will become a basic part of primary data storage systems. Do you think that will happen, and if so, how soon?

Preston: I think the people who say that have never designed a data deduplication system. Doing dedupe is hard and doing dedupe right is much harder. If you look at all the major vendors, not a single one of them wrote dedupe from scratch for backups. Only one vendor, NetApp, has done it for primary data storage. Most vendors went out and licensed or acquired one or more dedupe products. I'm sure there are people working on making data deduplication technology a part of primary data storage, but I haven't talked to anyone who said it's coming out anytime soon. Most vendors are just now getting their dedupe story for backup down, not for primary storage. It will ultimately happen. There will be data reduction techniques in primary storage, but the degree to which it becomes ubiquitous is anyone's guess. And how long? Definitely multiple years.

