Data reduction is a hot topic in enterprise data storage today, and it's easy to understand why. Reducing your data not only saves space, it also reduces the number of cables, switch ports, and power and cooling equipment you need.
Data deduplication technology
Data deduplication is an effective data reduction technology because it eliminates duplicate files, blocks or blocklets. Deduplication works especially well on secondary storage because so much duplication shows up in data backup sets, snapshots and replicas. Dedupe ratios typically range from 10:1 to as much as 500:1 depending on the type of data and the timeframe, which is why most data backup software, virtual tape libraries (VTLs) and backup target storage appliances now have built-in data deduplication.
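To make the mechanism concrete, here's a minimal sketch of fixed-block deduplication in Python. The function names and the 4 KB block size are illustrative assumptions, not any vendor's implementation: each block is hashed, duplicates are stored only once, and an ordered index of hashes lets the original data be rehydrated.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy per unique block.

    Returns (store, index): a dict of unique blocks keyed by hash, plus the
    ordered list of hashes needed to reconstruct the original data.
    """
    store = {}
    index = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicate blocks are stored once
        index.append(digest)
    return store, index

def rehydrate(store, index):
    """Reassemble the original data from the block store and index."""
    return b"".join(store[d] for d in index)

# Two "backups" of identical data: the second copy adds no new blocks,
# which is why backup sets dedupe so well.
backup = b"A" * 4096 * 10 + b"B" * 4096
store, idx1 = dedupe_blocks(backup)
_, idx2 = dedupe_blocks(backup)  # same data again
assert rehydrate(store, idx1) == backup
print(len(idx1) + len(idx2), "logical blocks ->", len(store), "stored blocks")
```

Here the ten identical "A" blocks collapse to one stored block, so two full copies of the data (22 logical blocks) reduce to just 2 stored blocks, an 11:1 ratio on this toy data.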
But most data storage administrators are underwhelmed when target storage data deduplication is applied to primary data storage systems. Systems such as EMC Corp. Celerra, EMC Data Domain, ExaGrid EX series, NetApp FAS and NetApp V-Series don't produce anywhere near the same results on primary data as they do on backup, snapshot or replicated data; actual production users and vendor testing report results roughly an order of magnitude lower. Results are worse because primary data contains far less duplicate data, and dedupe algorithms also tend to have problems with compressed data. Data compression changes the makeup of data blocks, severely reducing block or blocklet duplication. And much unstructured primary data, such as Microsoft Office files (.pptx, .docx, .xlsx), JPEGs, MPEGs, PDFs and ZIPs, is already compressed.
The other issue with data deduplication is application and user performance: both write and read response times can suffer. Write performance can be negatively affected by either inline deduplication or post-processing deduplication. Inline deduplication adds latency because it dedupes the data before it's written; post-processing deduplication consumes data storage system cycles while it dedupes, which slows down overall system performance. Read performance is also hurt by the latency of rehydrating the data. The performance degradation may not matter for some data, such as server virtual machine golden system images and ISO files, which also happen to produce the best dedupe results.
Data compression and primary storage
Data compression (such as StorWize Inc.'s STN appliances) does a little better than deduplication with primary data, as reported by production users and vendor testing. Compression removes redundancy from data, achieving the same or even slightly better primary data reduction than dedupe with little or no impact on performance. However, data compression also yields very little reduction on already compressed data. The biggest gains come from structured data, such as databases and email, as well as uncompressed files.
Content-aware compression, such as Ocarina Networks' optimizer, is another type of primary data storage reduction technology that surpasses both data deduplication and compression. It's a post-processing technology that decompresses files from their native format, removes duplicate storage objects, then recompresses them in their native formats. Or if they were not already compressed, it can compress the files after removing duplicate storage objects.
The two downsides to this technology are its upfront and total cost of ownership, and its requirement for a reader/decoder that lets the user or application read the data after it has been deduplicated and compressed. The reader can reside on the user's workstation, the server, the application or the storage system.
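The decompress/dedupe/recompress cycle described above can be sketched as follows. This is an illustrative toy that assumes zlib-compressed container files as a stand-in for the format-specific decoders real products ship; it is not Ocarina's actual optimizer:

```python
import hashlib
import zlib

def content_aware_optimize(compressed_files):
    """Toy content-aware post-processing (assumes zlib containers).

    1. Decompress each file back to its native content.
    2. Deduplicate the raw content, which now exposes its duplication.
    3. Recompress one copy of each unique object.
    """
    unique = {}     # content hash -> recompressed object
    manifests = []  # per-file pointer to its stored object
    for blob in compressed_files:
        raw = zlib.decompress(blob)                 # step 1
        digest = hashlib.sha256(raw).hexdigest()
        unique.setdefault(digest, zlib.compress(raw, 9))  # steps 2-3
        manifests.append(digest)
    return unique, manifests

# Two files with identical underlying content but different compressed
# encodings: block-level dedupe on the compressed bytes would miss the
# match, while content-aware processing collapses them to one object.
doc = b"quarterly report body " * 500
files = [zlib.compress(doc, 9), zlib.compress(doc, 1)]
unique, manifests = content_aware_optimize(files)
print(len(files), "files ->", len(unique), "stored objects")
```

The key idea the sketch captures is that deduplicating *after* decompression finds matches that the differing compressed encodings would otherwise hide.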
Source-level post-processing data reduction
The third type of primary data storage reduction is source-level post-processing data reduction, which is geared toward SMBs. This type of data reduction is also content-aware, but it actually shrinks application files by stripping out each file's "excess baggage" and converting embedded graphics to the most appropriate file format and resolution, which greatly reduces file size with no effect on visual content integrity. Source-level file optimization technology is deployed on file servers and/or desktops and is priced well for SMBs. It supports Microsoft Office files and JPEGs, and its effectiveness is similar to that of other content-aware technologies. Plus, it doesn't require any special software to open or read files that have been optimized; any user can read and modify an optimized Microsoft Office or JPEG file without additional software on their workstation.
However, source-level post-processing data reduction is limited to the file types it supports. It doesn't work with other file types or databases. Additionally, it has to be installed on the file servers or the desktops.
Overall, each of these primary data storage reduction technologies can be effective for SMB primary data reduction. Each provides solid results, but each has its downsides as well, and every data storage environment is different. The key is to choose the primary data reduction technology that will deliver the biggest reduction for the lowest total cost of ownership in your environment.
About the author: Marc Staimer is the founder, senior analyst, and CDS of Dragon Slayer Consulting in Beaverton, OR. The consulting practice of 12+ years has focused in the areas of strategic planning, product development, and market development. With more than 29 years of marketing, sales and business experience in infrastructure, storage, server, software, and virtualization, he's considered one of the industry's leading experts. Marc can be reached at firstname.lastname@example.org.
This was first published in April 2010