
Why logical deduplication is completely illogical

Joshua Stenhouse

Recently I’ve been seeing a resurgence of staffers at legacy data protection vendors quoting huge deduplication ratios in LinkedIn posts. Over the years I’ve seen 30:1, 100:1, even 150:1. The post below comes in at a magic 55:1, claiming “my 55:1 dedupe is better than vendor x 3:1 to 12:1 dedupe”.

[Image: LinkedIn post claiming a 55:1 dedupe ratio]

What’s not being said is that they are comparing logical deduplication to actual deduplication, or how this logical ratio is being calculated. Why do this? Because it implies they are better with “numbers”, and this chart of deduplication ratios would seem to back up the claim:

[Image: chart of vendor deduplication ratios]

However, the problem with logical deduplication ratios is that in a forever incremental backup world, where you only ever take one full backup and only changes thereafter, they are completely illogical. In fact, they’re bullshit. Why? Let me explain…

In a typical legacy backup environment, you take a weekly full and then daily incremental backups of only the changed data, retaining them locally for 30 days. Maybe you’ve already switched to incremental forever. But irrespective of method, logical data reduction is, unbelievably, calculated as though a full backup is being taken, transferred, and deduped every day, plus the daily change:

Total Data Protected = (Full Backup x Days Retention) + (Daily Change x Days Retention)

Logical Deduplication Ratio = Total Data Protected / Deduplicated Data Stored

E.g. (100TB VM full backup x 30) + (5TB change per day x 30) = 3,150TB protected; 3,150TB / 57TB stored = 55:1
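
To make the arithmetic explicit, here’s a minimal Python sketch of the logical calculation using the numbers above (100TB full, 5TB daily change, 30-day retention, 57TB stored). The variable names are just illustrative, not from any vendor tool or the spreadsheet below:

```python
# Logical dedupe ratio: pretends a full backup is taken, sent, and deduped every day
full_backup_tb = 100    # size of one full VM backup
daily_change_tb = 5     # changed data per day
retention_days = 30     # local retention
stored_tb = 57          # deduplicated storage actually consumed

logical_protected_tb = (full_backup_tb * retention_days) + (daily_change_tb * retention_days)
logical_ratio = logical_protected_tb / stored_tb
print(f"{logical_protected_tb}TB / {stored_tb}TB = {logical_ratio:.0f}:1")  # 3150TB / 57TB = 55:1
```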

For Virtual Machines (VMs), worst case, you are doing a weekly full, but many customers have already switched to incremental forever. So how can you calculate and compare dedupe based on data that isn’t even being sent? The simple answer is you can’t; hence it’s bullshit. The actual deduplication ratio should be calculated and compared using:

Total Data Protected = Total Full Backups + (Daily Change x Days Retention)

Actual Deduplication Ratio = Total Data Protected / Deduplicated Data Stored

E.g. 1 x 100TB VM full backup + (5TB change per day x 30) = 250TB protected; 250TB / 57TB stored = 4.39:1
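
And the same sketch for the actual ratio, counting only the one full that is ever sent plus the daily changes (again, illustrative Python with assumed variable names, not the spreadsheet itself):

```python
# Actual dedupe ratio: one full backup plus the incrementals that are really sent
full_backup_tb = 100
daily_change_tb = 5
retention_days = 30
stored_tb = 57

actual_protected_tb = full_backup_tb + (daily_change_tb * retention_days)
actual_ratio = actual_protected_tb / stored_tb
print(f"{actual_protected_tb}TB / {stored_tb}TB = {actual_ratio:.2f}:1")  # 250TB / 57TB = 4.39:1
```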

So, in my example, is your 55:1 logical dedupe better than my 4.39:1 actual dedupe, as suggested in the LinkedIn post? No: both solutions are using 57TB of deduplicated storage. But it is a valid sales tactic to avoid talking about the real problems in infrastructure like cost, complexity, recovery, leveraging automation, cloud, and self-service.

Next time you see a post or hear this kind of claim, call them out! To help you calculate your own actual deduplication ratios, and cut out the logical crap, I’ve created a simple Excel spreadsheet below:

ActualDedupeCalculator.xlsx

At this point, a vendor might say, “Ahhh, but Joshua, you seem to be talking about VMs where forever incremental is easy. What about NAS, Oracle, and SQL data where I send a full backup every night?” If your current backup solution or chosen methodology can’t do or isn’t doing incremental forever, then it’s time to ditch it and modernize.

There’s simply no need to do the error-prone, bandwidth-hogging, time-sapping heavy lifting of constant full backups for SQL, Oracle, NAS, or VMs anymore! And irrespective of method, the most important figure is always how much deduplication storage is required, not the mythical logical dedupe ratio.

Agree with me? Disagree? Feel free to leave a comment and discuss. Hope you enjoyed this post,

Joshua

  1. No1

    E.g. (100TB VM full backup x 30) + (5TB change per day x 30) = 3,150TB protected; 3,150TB / 57TB stored = 55:1

    30 full backups plus 30 incremental backups? Please check the logic before you post your bullshit out!!

    • Joshua Stenhouse

      Hey! First of all, thank you for being the first person to comment. I love that you are so passionate about dedupe ratios! This example assumes 30-day retention, and it was the only way I could calculate a 55:1 dedupe ratio on a realistic backup storage footprint. Unless we are saying that a 98%+ data reduction actually means you can protect 100TB of used VM storage and reduce it by 98% to 2TB of storage required, which we all know isn’t true! Otherwise, please share with me your name, company, and how I should be calculating logical dedupe if I’m wrong. I’ll happily correct it…

  2. 100% agree, Joshua. Great name, by the way.

    I’ve written similar posts, including directly calling out one vendor (Simplivity) and their misleading guarantee.

    http://www.joshodgers.com/2017/06/07/dare2compare-part-1-hpesimplivitys-101-data-reduction-hyperguarantee-explained/

    As well as what I believe should be reported as deduplication savings.

    http://www.joshodgers.com/2015/01/03/deduplication-ratios-what-should-be-included-in-the-reported-ratio/

    The last point I would make is that any vendor worth considering has very similar data efficiencies for the same dataset/configuration, so data efficiency is rarely a significant factor when considering a product/storage for either primary or secondary data.

  3. Jason Walker

    Ha! This is perfect. It combats an old EMC tactic of scaring/tempting customers into trialing any of their dedupe methodologies versus whatever was in place. So glad you posted about it; it is really about how small the data package is post-full, subsequent backup speed, and backend footprint. It’s also about how one can scale a larger set of source-based data while maintaining dedupe with as little hassle as possible internally for an extended period of time (7 years, inf, etc.). Dedupe ratios are so 2005. Thanks for calling it out; I never get tired of seeing someone exposing marketing fluff, especially on this topic!

  4. Tom

    In those 30 days of backups you should have 4 full backups and 26 dailies, but it’s all about the point you are trying to make.

    Completely agree. I’ve called out Dell EMC on the rates for Avamar dozens of times. Prior to our switching over, I was seeing overall rates of 15-18:1, which was driven by our DBAs taking daily fulls and keeping them for 90 days. We switched to Avamar and the rates magically increased to over 100:1, hitting the high mark at 146:1. If the numbers were realistic I should have seen a pretty substantial drop in storage use, but as expected I never did. In all reality, taking our daily changes, my backup storage is eating 1/4 to 1/3 of the space the systems backed up take on the storage array, and there is a lot on the array I don’t back up, so the numbers are worse.

    I understand why they come out with these numbers, and I can’t completely fault them for it. Typical backups would have restored the weekend full and the dailies up through the day being restored to. That means if you deleted something Monday and had to restore the system Thursday, that data would be back. Avamar doesn’t restore the data deleted Monday; instead you see what the system looked like on the day restored from. Which is nice in that respect; I’ve been bitten by the restore of everything, including data that was intentionally deleted previously due to a specific business process.

    Technically it is all crap anyway. Granted, we don’t use them any longer, but take the old PST files for Outlook: as soon as you open Outlook it would change the date on the PST, so it would be a “changed” file even though you never added or removed data. The next backup would pick it up, but it would dedupe to nothing since it was identical to the previous one. Think about all the systems and databases that are just sitting there spinning but no one is using them. Don’t deny it, we have all experienced sprawl of abandoned systems.

    You really can’t fault vendors for manipulating the numbers to gain that competitive edge; it is hard to unseat the incumbent, so whatever tricks you can use to sway the unfortunately uninformed are fair game. It isn’t specific to IT either: think about cars and published MPG; how many of those figures were taken on a dynamometer to avoid air resistance before they standardized the testing process?

  5. Frank

    Josh, this is a great summary. Very clear and easy to digest. I knew about the mythical “fulls” but this wraps it up nicely. Going to add reading stuff on your site to the weekly routine.

  6. Shai

    Hi Joshua,

    I love your approach, although the marketing teams of large vendors might like it less…
    I think that the discussion of dedupe is going the wrong way. Playing with dedupe “ratios” has always been a marketing exercise that is much harder to prove in reality. The dedupe axiom suggests that one will always get a lower dedupe ratio (and thus need more dedupe hardware) than planned.

    My point is completely different. The world is changing and data behaviour is changing as well. Online companies use databases that tend to show a daily change rate of >10%, archives (like Exchange) grow incredibly big and their internal housekeeping makes them dedupe-unfriendly, and, above all, more data is encrypted for privacy reasons. All of those trends and data properties make the dedupe ratio much lower in those cases than for general IT purposes.
    Now, if you wish to have any kind of backup to the cloud and try to put your dedupe engine in the cloud, you are prone to pay tons of money, as those engines will cost around $10k/year just for the cloud VM. Such an engine normally takes care of

  7. Daniel Banche

    Great post. I’m bookmarking this, it will save me from numerous debates. Thanks!
