    IT Survey: Preemptive Drive Replacement in RAID Arrays

    IT Discussion
    Tags: storage, raid, winchester drive, survey
    44 Posts 11 Posters 12.4k Views
    • MattSpeller @Dashrender

      @Dashrender exactly, there are much better ways to set that kinda thing up - I think we're still looking for a scenario where dude-buddy-guy from SW forums would be right. He may just be 100% wrong.

      • MattSpeller

        I confess to enjoying "devil's advocate" and thought experiments a lot

        • Dashrender @MattSpeller

          @MattSpeller said:

          @Dashrender exactly, there are much better ways to set that kinda thing up - I think we're still looking for a scenario where dude-buddy-guy from SW forums would be right. He may just be 100% wrong.

          Well, again, my friend's suggested reason, a lack of personnel resources in times of emergency, could be a reason.

          • Deleted74295 Banned

            I will tell you 3 concrete facts.

            1. You must never reboot the servers. Constant up time is vital.

            2. Don't install updates, Microsoft will only break the server to force you to upgrade to the latest version.

            3. Linux is not safe for production. Too complicated and too buggy.

            Why are these facts true?

            Because my experience, training and mentors have fostered a closed minded set of views in my mind and because of this I need to ignore all propaganda. I am not here to listen and learn, I am only here to teach others of the correct way of doing things.

            Yes my job security might be at risk because I am not open to new ideas or learning new concepts but I'm irreplaceable here.

            • Deleted74295 Banned

              Oh by the way.

              If you use mixed operating systems (à la 7/Vista/8.1/10), when you get Cryptolocker or other malware, the damage is limited to one group of operating systems.

              • MattSpeller @Deleted74295

                @Breffni-Potter lol 10/10

                • nadnerB

                  Well played @Breffni-Potter 😉

                  • scottalanmiller @DustinB3403

                    @DustinB3403 said:

                    I've heard of doing it every 2 - 3 years, but not as a part of routine maintenance.

                    What is the schedule for routine maintenance where you heard this?

                    2-3 years would definitely constitute routine maintenance. I think even for people doing this, 2-3 years seems extremely short.

                    • scottalanmiller @DustinB3403

                      @DustinB3403 said:

                      To follow up, I've never performed it either. But I have heard people say that they replace their drives to avoid the urgent rush of a RAID being degraded because of a failed drive.

                      But there is an urgent rush anyway, they didn't avoid one. And they create more of them. It's literally the same as crashing your car to avoid accidents, preemptively.

                      • DustinB3403 @scottalanmiller

                        @scottalanmiller Oh I completely agree, and said something very similar to that analogy when I heard this.

                        • scottalanmiller @DustinB3403

                          @DustinB3403 said:

                          Some people simply don't want to understand what has to be performed to rebuild the array when you replace drives just to replace them.

                          But they have to understand that to do the replacement. A preemptive replacement is a full failure as well, just with a human breaking the array rather than the drive failing and breaking it. Full knowledge of how to repair the array is needed and is identical in both cases.

                          The extra knowledge needed with preemptive replacement is knowing when you can safely do it: if you did it while another drive had already failed, you could easily turn a degraded array into a fully lost array.

                          • scottalanmiller @Drew

                            @Drew said:

                            I'm guessing this isn't exactly what you're referring to but I thought I'd add my experience anyway. I guess it depends on what you mean by "perfectly healthy". One manufacturer might consider a drive perfectly healthy while another might not.

                            Meaning, no use of failure indicators at all. Just replacing drives because you replace them, not because there is any indication of issues.

                            • scottalanmiller @Dashrender

                              @Dashrender said:

                              His reason was that the labor pool for emergency repair may be too small to handle all the emergencies that are happening. Of course there are tons of mitigations for this, but I thought the general idea had merit.

                              Nope, this would make that worse too since it increases the chances of drive failure in addition to the extra maintenance. For the scenario you mention a hot spare (or many) would help, but doing this would hurt.

                              • DustinB3403

                                I guess the better way to have said that, Scott, is that they don't understand the additional risk they are putting the system under by replacing a drive just to replace it as a way to avoid a failed array.

                                By replacing the drive, they put more stress on the array to rebuild onto the new drive, and more and more as they go down the line replacing each drive in the array, until they're on new spinning rust.

                                • scottalanmiller @MattSpeller

                                  @MattSpeller said:

                                  @Dashrender Also maintenance on exceptionally expensive to access sites (think weather station in Greenland or something)

                                  Same problem. Preemptively replacing healthy, burned-in drives adds risk because of the bathtub curve problem: new drives sit in the high early-failure part of the curve. So this is exactly the kind of site where you would also avoid doing this.
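
To make the bathtub curve argument concrete, here is a minimal sketch. It models a drive's cumulative hazard as a falling "infant mortality" Weibull term, a constant random-failure term, and a rising wear-out Weibull term. Every parameter value and function name below is an assumption chosen only to give the curve a bathtub shape; none of it is measured drive data or anything stated in this thread.

```python
import math

# Hypothetical bathtub-curve parameters (time in years); illustrative only.
INFANT_SHAPE, INFANT_SCALE = 0.5, 400.0   # shape < 1: hazard falls with age
RANDOM_RATE = 0.01                        # constant background failures per year
WEAR_SHAPE, WEAR_SCALE = 4.0, 10.0        # shape > 1: hazard rises with age


def cumulative_hazard(age_years: float) -> float:
    """Cumulative hazard H(t): two Weibull terms plus a constant-rate term."""
    infant = (age_years / INFANT_SCALE) ** INFANT_SHAPE
    wear = (age_years / WEAR_SCALE) ** WEAR_SHAPE
    return infant + RANDOM_RATE * age_years + wear


def failure_probability(start_age: float, end_age: float) -> float:
    """Probability that a drive alive at start_age fails before end_age."""
    delta = cumulative_hazard(end_age) - cumulative_hazard(start_age)
    return 1.0 - math.exp(-delta)


p_keep = failure_probability(3.0, 4.0)  # keep the burned-in 3-year-old drive
p_swap = failure_probability(0.0, 1.0)  # swap in a brand-new drive instead
print(f"keep burned-in drive: {p_keep:.3f}  swap in new drive: {p_swap:.3f}")
```

Under these made-up parameters the new drive is the riskier one over the following year, and that is before counting the rebuild the swap forces on the rest of the array. Different parameters, such as a drive already deep into wear-out, would shift the comparison.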

                                  • scottalanmiller @Deleted74295

                                    @Breffni-Potter said:

                                    For the hard to access station, they should have spares on a shelf, but in theory, when you buy a drive and store it for 3 years, what happens with the warranty if you put it in and it dies after a month?

                                    You could do hot spares or even cold spares in a chassis so that you can do many drive replacements without needing to be physically at the location.

                                    Although how many places are remote AND unmanned?

                                    • scottalanmiller @MattSpeller

                                      @MattSpeller said:

                                      @Breffni-Potter spares are a luxury unless you use them on a regular basis

                                      That would only be the case in a situation where you were not comparing to preemptive replacement which is many times (orders of magnitude most likely) more expensive than spares, even tons of spares, even 100% of spares. Preemptive replacement of healthy drives means you have to have spares and use them over and over again, even when the original drives have not failed!

                                      So every luxury of spares PLUS the luxury of just throwing out good drives for the fun of throwing them out!
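
A rough way to see the size of that gap is to count the drives each approach uses up. The sketch below assumes a constant annual failure rate and assumes preemptive swapping does not reduce failures at all. The array size, failure rate, time horizon, swap interval, and the `drives_consumed` helper are all hypothetical.

```python
import math


def drives_consumed(drive_count, annual_failure_rate, years,
                    preemptive_interval_years=None):
    """Expected replacement drives used over the period.

    Assumes a constant annual failure rate, which is a simplification.
    """
    # Drives that fail and must be replaced under either policy.
    failures = drive_count * annual_failure_rate * years
    if preemptive_interval_years is None:
        return failures
    # Preemptive policy: every drive is also swapped once per full interval.
    swaps = drive_count * math.floor(years / preemptive_interval_years)
    return failures + swaps


# Hypothetical 8-drive array, 3% annual failure rate, 10-year horizon.
reactive = drives_consumed(8, 0.03, 10)
preemptive = drives_consumed(8, 0.03, 10, preemptive_interval_years=3)
print(f"replace on failure: {reactive:.1f} drives")
print(f"preemptive every 3 years: {preemptive:.1f} drives")
```

With these assumed numbers the preemptive policy uses roughly ten times as many drives, and each extra swap is also an extra rebuild.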

                                      • scottalanmiller @Dashrender

                                        @Dashrender said:

                                        @Breffni-Potter said:

                                        For the hard to access station, they should have spares on a shelf, but in theory, when you buy a drive and store it for 3 years, what happens with the warranty if you put it in and it dies after a month?

                                        It would be out of warranty. But this wouldn't be the situation that @MattSpeller is describing. If they only visit the site, say, once every 3 months, presumably they would bring drives with them.

                                        But really, you wouldn't set up a system that relied on this type of solution in this scenario; you'd choose something with more robustness built in. Though I can't tell you what that would look like. Perhaps two or even three equal-sized arrays kept in sync with redundant data paths, etc. If the data is that important, but you can only visit the site once every three months, you can't just use the day-to-day setup in most cases.

                                        I'm not sure that is true. You might balance SSD and Winchester drives to alter robustness for the scenario in question, but avoiding RAID might not make sense. With the MTBF of a RAID 10 array reaching into the tens and hundreds of thousands of years even without spares, going with a RAID 10, even a large one, with a number of hot spares could give an unmanned station decades of reliable operation before arrays need to be replaced, likely longer than the equipment is viable.
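
For a sense of where figures like that come from, here is a back-of-the-envelope sketch using the textbook mean-time-to-data-loss approximation for mirrored pairs, MTTF² / (2 × pairs × rebuild time). It assumes independent drive failures at a constant rate, which is optimistic for real arrays. The drive MTTF, rebuild time, array size, and the `raid10_mttdl_hours` helper are assumptions for illustration, not numbers from this thread.

```python
HOURS_PER_YEAR = 8766  # 365.25 days


def raid10_mttdl_hours(drive_count, drive_mttf_hours, rebuild_hours):
    """Approximate mean time to data loss for RAID 10.

    Data is lost when the partner of a failed drive also fails before
    the rebuild onto a replacement (or hot spare) completes.
    """
    if drive_count < 2 or drive_count % 2:
        raise ValueError("RAID 10 needs an even number of drives (at least 2)")
    pairs = drive_count // 2
    return drive_mttf_hours ** 2 / (2 * pairs * rebuild_hours)


# Hypothetical: 8 drives, 1,000,000 h rated MTTF, 24 h rebuild to a hot spare.
mttdl_years = raid10_mttdl_hours(8, 1_000_000, 24) / HOURS_PER_YEAR
print(f"approximate MTTDL: {mttdl_years:,.0f} years")
```

Because rebuild time sits in the denominator, a hot spare that starts rebuilding immediately improves the figure directly, while a rebuild that waits months for a site visit cuts it by the same factor.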

                                        • scottalanmiller @Dashrender

                                          @Dashrender said:

                                          @MattSpeller said:

                                          @Dashrender exactly, there are much better ways to set that kinda thing up - I think we're still looking for a scenario where dude-buddy-guy from SW forums would be right. He may just be 100% wrong.

                                          Well, again, my friend's suggested reason, a lack of personnel resources in times of emergency, could be a reason.

                                          No, it makes investment in spares make sense but still doesn't justify preemptive, it would do the opposite. Having spares is what you do when you don't have available labour, not burning them up and throwing them out.
