    What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?

DustinB3403 @Obsolesce

      @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

      Would you consider looking for duplicate files from the server directory by directory, rather than everything all at once?

      Maybe scan in 500,000 file chunks and start reducing it little by little manually.

      This would likely be the only way to do it.

I've used a few different tools (Windows ones) that could scan directories and compare for hash matches. I'm sure there is a better Linux alternative.
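Something like this would do the same job on Linux — a rough Python sketch (the /mnt/nas/share path and the 1 MiB read size are just placeholders), hashing one directory tree at a time so the NAS can be worked through in chunks:

```python
# Rough sketch: find duplicate files under one directory tree by size + SHA-256.
# Paths and buffer sizes are placeholders, not anything specific to the OP's NAS.
import hashlib
import os
from collections import defaultdict

def file_sha256(path, bufsize=1024 * 1024):
    """Hash a file in 1 MiB chunks so large files never have to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return {(size, sha256): [paths]} for every group with more than one file."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip unreadable or vanished files
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size can't be a duplicate, so skip the expensive hash
        for path in paths:
            groups[(size, file_sha256(path))].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    # Run it one share/subdirectory at a time instead of the whole 200 TB at once.
    for (size, digest), paths in find_duplicates("/mnt/nas/share").items():
        print(f"{len(paths)} copies, {size} bytes, sha256 {digest[:12]}...:")
        for p in paths:
            print("   ", p)
```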

HelloWill

I was hoping there was some type of server migration software or enterprise deduplication software that would be able to crawl all our data, store the results in some kind of database, and then allow us to parse the results.

When you throw 10 million files at traditional duplicate cleaners, they tend to blow up. Then, after you clean some parts up, guess what... you have to rescan and wait.

There has to be a better way. Block-level deduplication solves part of the storage size equation, but it doesn't address the root cause of the problem, which is poor data governance. The challenge is going from messy to organized in an efficient manner.

        Has anybody used this, or know of something similar?
        http://www.valiancepartners.com/data-migration-tools/trucompare-data-migration-testing/

DustinB3403 @HelloWill

@hellowill The biggest issue is that you have way too many files and not enough resources to scan and dedupe the system live.

Your only reasonable approach is to do this in smaller chunks, since we can reasonably assume you don't have a TB+ of RAM to throw at this job, nor anywhere to store the updated files.

dafyre

I like Scott's idea of storing the file hashes in a database, but jeez... for 10 million files, you're looking at a huge DB just to store the file list!

scottalanmiller @dafyre

              @dafyre said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

I like Scott's idea of storing the file hashes in a database, but jeez... for 10 million files, you're looking at a huge DB just to store the file list!

Yeah, not trivial, even just on that part, although not all that bad. I've got databases with way more data than that "per entry" and over a million entries, and it takes nothing to do complex queries against it (MariaDB). So simpler, smaller data at ten times the size should remain really trivial, especially as they could use an even simpler database type with zero relations.
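A single flat table is all it would take — a sketch with SQLite (the file name, table, and columns here are illustrative assumptions, not anything anyone in the thread is actually running):

```python
# Sketch of a flat, zero-relation hash catalog: one row per file, one index on the hash.
# The database path, table name, and columns are illustrative assumptions only.
import sqlite3

conn = sqlite3.connect("file_catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path   TEXT PRIMARY KEY,
        size   INTEGER NOT NULL,
        sha256 TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_files_sha256 ON files (sha256)")

def record(path, size, digest):
    """Insert or update one file's entry; safe to re-run after a partial rescan."""
    with conn:  # commits automatically on success
        conn.execute(
            "INSERT OR REPLACE INTO files (path, size, sha256) VALUES (?, ?, ?)",
            (path, size, digest),
        )

def duplicate_sets():
    """Yield (sha256, [paths]) for every hash stored more than once."""
    rows = conn.execute("""
        SELECT sha256, GROUP_CONCAT(path, '||')
        FROM files
        GROUP BY sha256
        HAVING COUNT(*) > 1
    """)
    for digest, joined in rows:
        yield digest, joined.split("||")
```

Ten million rows of path + hash is roughly a few gigabytes on disk, and the duplicate lookup is a single GROUP BY.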

anthonyh @scottalanmiller

This would be a bit more work to set up initially, as it would probably mean moving away from FreeNAS, but it might be worth considering. Of course, you'd need somewhere to stage your 200 TB of data, which would be a huge feat in itself. But, just in case you might be in the market to build a new box....

                I've been considering XFS + duperemove (https://github.com/markfasheh/duperemove) for some of my storage needs.

                Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block by block basis and compare those hashes to each other, finding and categorizing blocks that match each other. When given the -d option, duperemove will submit those extents for deduplication using the Linux kernel extent-same ioctl.

                Duperemove can store the hashes it computes in a 'hashfile'. If given an existing hashfile, duperemove will only compute hashes for those files which have changed since the last run. Thus you can run duperemove repeatedly on your data as it changes, without having to re-checksum unchanged data.

What's nice about duperemove is that it's an "out of band" process, so to speak. You can run it during off-peak utilization and start/stop the process at will, and it doesn't require RAM the way ZFS dedupe does.
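For anyone who wants to script it, a rough sketch of that workflow (the -r/-d/--hashfile options come straight from the description above; the mount point and hashfile location are placeholders):

```python
# One incremental duperemove pass, wrapped in Python so it can be scheduled off-peak.
# /mnt/archive and the hashfile path are placeholder assumptions.
import subprocess

def dedupe_pass(mount_point="/mnt/archive", hashfile="/var/lib/dedupe/archive.hash"):
    """Run duperemove recursively; unchanged files are skipped thanks to the hashfile."""
    subprocess.run(
        [
            "duperemove",
            "-r",                      # recurse into subdirectories
            "-d",                      # actually submit matching extents for dedupe
            f"--hashfile={hashfile}",  # persist hashes so the next run only rescans changes
            mount_point,
        ],
        check=True,
    )

if __name__ == "__main__":
    dedupe_pass()
```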

dafyre @anthonyh

@anthonyh said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

I've been considering XFS + duperemove (https://github.com/markfasheh/duperemove) for some of my storage needs.

Is that an XFS-only thing, or can it work with other filesystems?

Edit: A quick glance at their GitHub doesn't say anything about which filesystems are required.

anthonyh @dafyre

@dafyre said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

Is that an XFS-only thing, or can it work with other filesystems?

You know, I'm not 100% sure. I'm only familiar with this method of deduplication on BtrFS and XFS.

anthonyh @dafyre

@dafyre said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

Is that an XFS-only thing, or can it work with other filesystems?

Edit: A quick glance at their GitHub doesn't say anything about which filesystems are required.

                      If my understanding is correct, this would work with filesystems that support reflinks.
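As a rough way to check, you can try a reflink clone yourself — this is what cp --reflink=always does under the hood, and it only succeeds on filesystems that can share extents (the paths below are placeholders):

```python
# Probe reflink support by attempting a FICLONE clone (the mechanism behind cp --reflink).
# Source/destination paths are placeholders; failure errnos indicate no reflink support.
import errno
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int) on Linux

def reflink_clone(src, dst):
    """Clone src to dst so both files share the same on-disk extents (no data copied)."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())

if __name__ == "__main__":
    try:
        reflink_clone("/mnt/archive/big.iso", "/mnt/archive/big-clone.iso")
        print("reflinks supported: clone created without duplicating any blocks")
    except OSError as e:
        if e.errno in (errno.EOPNOTSUPP, errno.ENOTTY, errno.EXDEV):
            print("this filesystem does not support reflinks")
        else:
            raise
```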

dafyre @anthonyh

@anthonyh said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

If my understanding is correct, this would work with filesystems that support reflinks.

I wonder if this would work for the OP's use case. I think he's on FreeNAS though.

scottalanmiller @anthonyh

@anthonyh said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

What's nice about duperemove is that it's an "out of band" process, so to speak. You can run it during off-peak utilization and start/stop the process at will, and it doesn't require RAM the way ZFS dedupe does.

                          Very nice. I need to play with that.

anthonyh @dafyre

@dafyre said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

I wonder if this would work for the OP's use case. I think he's on FreeNAS though.

I don't think so, since FreeNAS uses ZFS exclusively (according to my quick Google search... I am not a FreeNAS user).

I believe the OP would need to build a NAS using software that supports a filesystem with reflinks.

Though it looks like ZFS on BSD (IIRC FreeNAS is based on FreeBSD) might support reflinks...

So I really don't know!

scottalanmiller @anthonyh

                              @anthonyh said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

Though it looks like ZFS on BSD (IIRC FreeNAS is based on FreeBSD) might support reflinks...

Yes, FreeNAS is just an older version of FreeBSD (not very old, just a little).

scottalanmiller

                                More on reflinks.

scottalanmiller

ZFS does not have reflinks, and there are no plans to add them. It's a BtrFS feature backported to XFS on Linux.
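On XFS, reflink support is a mkfs-time option (reflink=1), so it's worth confirming an existing filesystem actually has it. A quick sketch (the mount point is a placeholder, and xfs_info from xfsprogs has to be installed):

```python
# Check whether an existing XFS mount was created with reflink support.
# The mount point is a placeholder assumption; requires xfsprogs (xfs_info) installed.
import subprocess

def xfs_has_reflink(mount_point="/mnt/archive"):
    """Return True if xfs_info reports reflink=1 for the given mount."""
    info = subprocess.run(
        ["xfs_info", mount_point], capture_output=True, text=True, check=True
    ).stdout
    return "reflink=1" in info

if __name__ == "__main__":
    print("reflink enabled:", xfs_has_reflink())
```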

anthonyh @scottalanmiller

                                    @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

ZFS does not have reflinks, and there are no plans to add them. It's a BtrFS feature backported to XFS on Linux.

                                    That's what I thought, but I didn't have the data to back it up.

StrongBad

ZFS has a lot of similar stuff built in; I don't think they want to do it two ways. It's not often that people want the extra reflinks functionality.

anthonyh @StrongBad

                                        @strongbad said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

ZFS has a lot of similar stuff built in; I don't think they want to do it two ways. It's not often that people want the extra reflinks functionality.

Yeah. ZFS's deduplication functionality is good... just resource intensive. I've talked to guys who build out large storage arrays using ZFS and deduplication, and it gets complicated (at least from my ZFS novice point of view) if you want it to perform well.

scottalanmiller @anthonyh

@anthonyh said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

Yeah. ZFS's deduplication functionality is good... just resource intensive. I've talked to guys who build out large storage arrays using ZFS and deduplication, and it gets complicated (at least from my ZFS novice point of view) if you want it to perform well.

ZFS was never built for performance (Sun said this directly). It was for low-cost, giant scale with good reliability and durability. So it's not at all surprising that it doesn't handle performance well while doing a feature like dedupe.

                                          It's also 13 years old and the granddaddy of its type of product.

matteo nunziati @scottalanmiller

                                            @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                            @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                            @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

@dbeato would need 256 GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

How did you come up with 256 GB of RAM needed?

That FreeNAS article recommends 5 GB of RAM per 1 TB of deduped data...
Considering he has 200 TB of data he'd want to dedupe, that's at least 1 TB of RAM to start.

This is because dedupe on ZFS/FreeNAS is much more RAM-intensive than on other filesystems (and also because 200 TB is a ton of data).

What caused it to balloon so much recently? Traditionally it has been 1 GB per 1 TB.

                                            https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

The FreeBSD ZFS page stated up to 5 GB per 1 TB last time I checked.
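Back-of-the-envelope, using the two rules of thumb mentioned above:

```python
# Rough ZFS dedupe RAM sizing for 200 TB, using the rules of thumb quoted in this thread.
data_tb = 200
for label, gb_per_tb in [("older 1 GB per TB rule", 1), ("FreeNAS ~5 GB per TB guideline", 5)]:
    print(f"{label}: {data_tb * gb_per_tb} GB of RAM for {data_tb} TB of data")
# -> 200 GB under the old rule, ~1,000 GB (about 1 TB) under the 5 GB/TB guideline.
```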
