    How to Market RAID 6 When Customers Need Safety

    IT Discussion
    raid risk
    • DustinB3403 @scottalanmiller

      @scottalanmiller said in How to Market RAID 6 When Customers Need Safety:

      Semi-tongue in cheek

      Sure that's semi? Seems pretty honest to me.

      • scottalanmiller @Dashrender

        @Dashrender said in How to Market RAID 6 When Customers Need Safety:

        and second - write hole in ZFS?

        ZFS uses variable stripe widths to overcome the write hole. Why no one else has implemented this, I am not sure (backward compatibility concerns, perhaps?). It has been a decade since Sun solved the write hole problem, yet today the ZFS implementation of parity RAID is still the only one that has solved it. Most people avoid the problem with battery-backed cache, flash cache or insane UPS systems, so it does not come up that often. But the risk is real.

        • Dashrender @scottalanmiller

          But what is a write hole?

          • coliver @Dashrender

            @Dashrender said in How to Market RAID 6 When Customers Need Safety:

            But what is a write hole?

            It's when two disks, in a RAID6, don't match the other members of the array. RAID1 and RAID5 have this issue as well but with a single drive.

            • Dashrender @coliver

              @coliver said in How to Market RAID 6 When Customers Need Safety:

              It's when two disks, in a RAID6, don't match the other members of the array. RAID1 and RAID5 have this issue as well but with a single drive.

              If that happens in RAID 1/10 as well, then how is it solved?

              • coliver @Dashrender

                @Dashrender said in How to Market RAID 6 When Customers Need Safety:

                If that happens in RAID 1/10 as well, then how is it solved?

                From my understanding it doesn't happen on RAID1 often, only when there is a drive/array misconfiguration. However, it is common on RAID5/6. I'm not sure of the exact mechanism, but it has something to do with built-in drive caching.

                • scottalanmiller @Dashrender

                  @Dashrender said in How to Market RAID 6 When Customers Need Safety:

                  Can you go into more detail on the two failures - RAID 10 losing a second drive in the same pair vs. potential loss of a third drive in RAID 6, whatever the comparison is that makes RAID 10 still safer in the two drive loss scenario...

                  This one gets complex because there are so many factors involved. I'll start with a list:

                  • RAID 10 is more likely, at the same capacity, to experience the first drive failure because it has more disks than RAID 6 (except in the four-disk scenario, where they are even). So RAID 10 starts with more "recovery events" than RAID 6 does. Even the pro-RAID 6 people always skip this, which is surprising.
                  • Once a single drive has been lost, now we have a degraded array. During this time, there is lost performance but negligible impact to the array in terms of risk. But there is exposure until the failed drive is replaced.
                  • Once a drive is replaced, RAID 10 rapidly mirrors back to that drive and returns the array to healthy. The time frame here is extremely small and the operation is simple. The reliability of this process is so close to 100% that it cannot be measured on any real-world system (80,000 array years of sampling with zero failures leaves no way to gain statistical knowledge). RAID 6, on the other hand, begins a very complex rebuild operation that takes more time. How much more you have to determine for yourself, but it is always longer than RAID 10. In the real world, it is typical for the rebuild to take days or weeks instead of hours. The difference can be staggering. This provides a many times larger window for a second drive to fail. That alone only raises the risk by a few hundred percent in most cases, and many times a risk of near zero is still pretty low. What is significant is that parity RAID arrays have been shown, and are well known, to induce additional drive failures during the rebuild operation (it is believed because of the increased wear and tear from a long-running, highly intensive operation). So the chances of secondary drive failure skyrocket from "essentially impossible" to "not at all unlikely."
                  • If a second drive fails on RAID 10, there is only impact if the second drive is a member of the same mirrored pair. This takes the already incredibly low chance of secondary drive failure and reduces it dramatically. (Mirrored Pair testing... 160,000 array years, no dual drive failures!!) So, for all intents and purposes, two disk failure on RAID 1 does not exist when there is no external damaging actor and the failed drive is replaced promptly.
                  • If multiple drives fail on RAID 10 but no two of them share a mirrored pair, each rebuilds concurrently and independently, and they do not contribute to a general increase in array-level risk: the repair window remains tiny, each pair heals independently and one failure does not trigger another.
                  • If a second drive fails in RAID 6, all of the risks that led to the second drive failure increase again. Now the burden on the remaining disks takes another jump up beyond the original burden of a single disk failure. And the window in which the array is rebuilding increases dramatically, typically to about double. So the array then has an even longer repair window with an ever increasing chance of yet another disk failing. If any additional disk fails before one of the failed disks has been rebuilt, the array is lost completely. If any additional disk fails after one, but not both, of the failed disks has been rebuilt, the lengthy and risky process of rebuilding begins again. In the real world, on a moderately large array with a triple disk failure where one disk had been repaired before the third failed, we could literally see rebuild times creeping over the three-month mark!
                  • A bigger risk than a third drive failing is hitting a URE (unrecoverable read error) during the lengthy dual disk failure rebuild. The standard parity RAID implementations will treat this no differently than a failed disk, as the stripe is bad, and will drop the entire array, resulting in total loss. Even low-URE enterprise drives become extremely susceptible to this in a large RAID 6 array rebuild process, and if we end up in the triple failure scenario, the URE risks nearly double again.
                  • The largest risk with RAID 6, and the one that is totally ignored, is that in most cases performance becomes unacceptably slow, or the array even disconnects entirely, during a rebuild operation. There are many factors involved here, so we cannot say this across all cases, but very few people measure their environment to see what the impact would be. Having a RAID array offline or nearly offline for days, weeks or, in the triple failure example, as much as an entire season likely means that giving up on the array immediately and restoring from backup would have been the better choice: a few-hour outage with minimal data loss, rather than a scenario where the system is offline for 90 days and on the 89.9th day hits a URE and all of that time is lost.
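
To put rough numbers on two of the points in the list above, here is a back-of-the-envelope sketch that is not part of the original post. It assumes equal-size drives, independent bit errors and a datasheet-style URE rate of one per 1e14 bits read. The function names and the 12 TB figure are made up for illustration, and real drives do not fail this neatly.

```python
import math


def disks_needed(data_disks):
    """Disks required for the same usable capacity (equal-size drives assumed).

    RAID 10 mirrors every data disk; RAID 6 adds two parity disks.
    """
    return {"raid10": 2 * data_disks, "raid6": data_disks + 2}


def p_ure_during_read(terabytes_read, bits_per_ure=1e14):
    """Chance of at least one URE while reading `terabytes_read` TB.

    Assumption: each bit fails independently with probability
    1 / bits_per_ure, so P = 1 - (1 - 1/bits_per_ure) ** bits.
    log1p/expm1 keep the arithmetic stable for such tiny probabilities.
    """
    bits = terabytes_read * 1e12 * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


# Same usable capacity, different disk counts (even only at four disks).
for n in (2, 4, 8):
    print(n, "data disks ->", disks_needed(n))

# Hypothetical rebuild that has to read 12 TB from the surviving disks.
print("1e14 drives:", round(p_ure_during_read(12), 3))
print("1e15 drives:", round(p_ure_during_read(12, bits_per_ure=1e15), 3))
```

The model only matters once redundancy is gone, as in the dual disk failure rebuild described above, because that is when a single unreadable sector can no longer be reconstructed from parity.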
                  • scottalanmiller @Dashrender

                    @Dashrender said in How to Market RAID 6 When Customers Need Safety:

                    But what is a write hole?

                    From Sun's 2005 paper addressing it: "RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite delivered on the RAID promise -- and can't -- due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero -- it's that equation that allows you to reconstruct data when a disk fails. The problem is that there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage."

                    RAID Z and the Write Hole
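
The mechanism in that quote can be shown with a toy model. This is a minimal sketch and not any real RAID implementation: the "disks" are four-byte strings, the stripe is three data blocks plus one XOR parity block, and the helper names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, lost_index):
    """Rebuild one missing block by XOR-ing every surviving block."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Healthy stripe: three data blocks plus parity, so all disks XOR to zero.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]
assert xor_blocks(stripe) == bytes(4)
assert reconstruct(stripe, lost_index=0) == b"AAAA"

# Torn write: new data reaches disk 1, but power is lost before the
# matching parity update reaches the parity disk.
stripe[1] = b"XXXX"
stripe_is_consistent = xor_blocks(stripe) == bytes(4)

# Nothing looks wrong until a disk fails. Rebuilding disk 0 from the
# stale parity silently returns data that was never written.
rebuilt = reconstruct(stripe, lost_index=0)
print("stripe consistent:", stripe_is_consistent)
print("disk 0 recovered correctly:", rebuilt == b"AAAA")
```

Both checks print False. Note that the damaged block belongs to disk 0, which was not even being written when the power failed; that is what makes the write hole hard to spot.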

                    • scottalanmiller @coliver

                      @coliver said in How to Market RAID 6 When Customers Need Safety:

                      From my understanding it doesn't happen on RAID1 often, only when there is a drive/array misconfiguration. However, it is common on RAID5/6. I'm not sure of the exact mechanism, but it has something to do with built-in drive caching.

                      Its full name is the RAID 5 Write Hole. It does not exist in mirrored RAID; it is a risk only for parity RAID.

                      • coliver @scottalanmiller

                        @scottalanmiller said in How to Market RAID 6 When Customers Need Safety:

                        Its full name is the RAID 5 Write Hole. It does not exist in mirrored RAID; it is a risk only for parity RAID.

                        That's good to know. So it has to do with the parity bit in parity RAID devices. I'll have to look at it more.

                        • DustinB3403

                          So the RAID 5 Write Hole is active on all parity arrays?

                          Which means any parity array should be avoided at all cost... doesn't it?

                          • scottalanmiller @coliver

                            @coliver said in How to Market RAID 6 When Customers Need Safety:

                            That's good to know. So it has to do with the parity bit in parity RAID devices. I'll have to look at it more.

                            Yeah, has to do with the way that it writes.

                            • scottalanmiller @DustinB3403

                              @DustinB3403 said in How to Market RAID 6 When Customers Need Safety:

                              So the RAID 5 Write Hole is active on all parity arrays?

                              Which means any parity array should be avoided at all cost... doesn't it?

                              No, because, like losing multiple disks in RAID 10, it's just not a real-world risk. I've been involved in an awful lot of array failures over the years and never once was it because of the write hole. Write holes are rare even when the circumstances allow them to happen, and almost no enterprise system allows that. Any enterprise-class hardware RAID protects against the write hole; that's why we have battery-backed cache and NVRAM caches on them. ZFS protects against this in the Solaris, FreeBSD and OpenIndiana worlds.

                              The risk really only exists with Linux MD RAID, non-ZFS RAID on BSD, Windows Software RAID, FakeRAID controllers and other such situations. The big enterprise software RAID vendors have stated that they assume you will maintain power to your system, in which case the write hole cannot happen. If you want to use software parity RAID without ZFS, then you need to either accept the write hole risk or ensure continuous power to the box, the same as the battery does for a hardware RAID cache.

                              • bbigford

                                I once asked a vendor who were pitching an appliance that supported RAID0+1 and RAID1+0, "what would you recommend between the two, to a potential customer?" They said it didn't matter as they are both the same thing.

                                We didn't go with that vendor.

                                • scottalanmiller @bbigford

                                  @BBigford said in How to Market RAID 6 When Customers Need Safety:

                                  I once asked a vendor who were pitching an appliance that supported RAID0+1 and RAID1+0, "what would you recommend between the two, to a potential customer?" They said it didn't matter as they are both the same thing.

                                  We didn't go with that vendor.

                                  Amazing. Now that's just stupid. Losing a sale over not knowing your own product is ridiculous.

                                  • DustinB3403 @bbigford

                                    @BBigford said in How to Market RAID 6 When Customers Need Safety:

                                    I once asked a vendor who were pitching an appliance that supported RAID0+1 and RAID1+0, "what would you recommend between the two, to a potential customer?" They said it didn't matter as they are both the same thing.

                                    We didn't go with that vendor.

                                    RAID10 vs RAID0+1

                                    • scottalanmiller @DustinB3403

                                      @DustinB3403 said in How to Market RAID 6 When Customers Need Safety:

                                      RAID10 vs RAID0+1

                                      Or, you know...

                                      http://www.smbitjournal.com/2014/07/comparing-raid-10-and-raid-01/
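
For anyone who wants to see why the two layouts are not the same thing without reading the article, a small enumeration makes the point. This is a sketch under stated assumptions and is not taken from the linked article: RAID 10 is modelled as mirrored pairs striped together, and RAID 0+1 as two stripe sets mirrored against each other, where one failed disk takes its whole stripe set offline (the behaviour of a basic implementation). The function names are hypothetical.

```python
from itertools import combinations


def raid10_survives(failed, n_disks):
    """RAID 10: disks (0,1), (2,3), ... are mirrored pairs.

    The array is lost only if both members of the same pair have failed.
    """
    return not any({2 * i, 2 * i + 1} <= failed for i in range(n_disks // 2))


def raid01_survives(failed, n_disks):
    """RAID 0+1: first half is one stripe set, second half is its mirror.

    Assumption: any failed disk takes its whole stripe set offline, so the
    array survives only while at least one stripe set is untouched.
    """
    half = n_disks // 2
    first, second = set(range(half)), set(range(half, n_disks))
    return not (failed & first) or not (failed & second)


def survival_rate(survives, n_disks, n_failed=2):
    """Fraction of all n_failed-disk failure combinations the array survives."""
    combos = [set(c) for c in combinations(range(n_disks), n_failed)]
    return sum(survives(c, n_disks) for c in combos) / len(combos)


for n in (4, 8):
    print(n, "disks:",
          "RAID 10", round(survival_rate(raid10_survives, n), 3),
          "RAID 0+1", round(survival_rate(raid01_survives, n), 3))
```

Under this model, with eight disks, 24 of the 28 possible two-disk failures are survivable on RAID 10 against 12 of 28 on RAID 0+1, and the gap widens as the array grows.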

                                      • DustinB3403 @scottalanmiller

                                        @scottalanmiller said in How to Market RAID 6 When Customers Need Safety:

                                        Or, you know...

                                        http://www.smbitjournal.com/2014/07/comparing-raid-10-and-raid-01/

                                        TL;DR: pictures are prettier 😛
