    What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?

    IT Discussion
    • NashBrydges

      Saw your post on the 🌶 last evening.

      Issue #1: using FreeNAS at all for production storage
      Issue #2: using FreeNAS for such a LARGE production storage

      How have you been running the dedupe scanners? Via a PC connected to the shares on the FreeNAS server? I'm not familiar enough with FreeBSD to know if there are commands that can be run from shell to check for dupes.

      • scottalanmiller

        I've never used Duff, but it should run there. But the scale might be problematic.

        • scottalanmiller

          At that size, there is no simple way to handle this. The file comparison workload is enormous. You need a checksum for over 10,000,000 files, which alone is no small task, and then you need to compare every file to every other file: that's on the order of 1x10^14 MD5 comparisons.

          If you can find any way to limit these comparisons, that might help. But the number of them is so high that probably no normal tool will tackle it.
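
          For example, a minimal sketch (in Python) of one way to limit those comparisons, assuming the share is mounted at a hypothetical path like /mnt/tank: only files that share a byte size with at least one other file get hashed, since files of different sizes can never be duplicates.

          import hashlib
          import os
          from collections import defaultdict

          ROOT = "/mnt/tank"  # hypothetical mount point for the share

          def md5(path, chunk=1024 * 1024):
              h = hashlib.md5()
              with open(path, "rb") as f:
                  for block in iter(lambda: f.read(chunk), b""):
                      h.update(block)
              return h.hexdigest()

          # Pass 1: group paths by file size; a file with a unique size can't have a duplicate.
          by_size = defaultdict(list)
          for dirpath, _dirs, files in os.walk(ROOT):
              for name in files:
                  path = os.path.join(dirpath, name)
                  try:
                      by_size[os.path.getsize(path)].append(path)
                  except OSError:
                      pass  # unreadable or vanished file, skip it

          # Pass 2: only hash the candidates that share a size with another file.
          dupes = defaultdict(list)
          for size, paths in by_size.items():
              if len(paths) < 2:
                  continue
              for path in paths:
                  try:
                      dupes[(size, md5(path))].append(path)
                  except OSError:
                      pass

          for (size, digest), paths in dupes.items():
              if len(paths) > 1:
                  print(digest, size, paths)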

          • scottalanmiller

            What might work is building a database (stored elsewhere) that holds all of the MD5s and sorts them alphabetically. Then you only need to use either the database's own duplicate checking and/or check against neighbouring values. Then you'll know where duplicates are possible.
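
            A minimal sketch of that idea in Python, assuming a small SQLite database kept off the NAS (the paths and table layout here are made up for illustration): the index stands in for the alphabetical sort, and a GROUP BY query does the duplicate checking.

            import hashlib
            import os
            import sqlite3

            DB = "/var/tmp/file_hashes.db"  # hypothetical location off the NAS
            ROOT = "/mnt/tank"              # hypothetical share mount

            con = sqlite3.connect(DB)
            con.execute("""CREATE TABLE IF NOT EXISTS files (
                             path TEXT PRIMARY KEY,
                             size INTEGER,
                             md5  TEXT)""")
            con.execute("CREATE INDEX IF NOT EXISTS idx_md5 ON files (md5)")

            def md5(path, chunk=1024 * 1024):
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(chunk), b""):
                        h.update(block)
                return h.hexdigest()

            # Walk once and record every file's hash in the database.
            for dirpath, _dirs, files in os.walk(ROOT):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                                    (path, os.path.getsize(path), md5(path)))
                    except OSError:
                        pass  # unreadable file, skip it
                con.commit()

            # Let the database do the duplicate checking.
            for digest, count in con.execute(
                    "SELECT md5, COUNT(*) FROM files GROUP BY md5 HAVING COUNT(*) > 1"):
                print(digest, count)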

            • StrongBad

              Wow that is a lot of files. That's going to take forever.

              You need to run this from a PC, not from the server? Is this a SAN then, not a NAS?

              • HelloWill

                I can run any software from either a workstation or the server; however, running things directly on FreeNAS makes me nervous because I'm not sure how it will react.

                The files are shared as a NAS, although we could connect via iSCSI or similar

                • scottalanmiller @HelloWill

                  @hellowill said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                  I can run any software from either a workstation or the server; however, running things directly on FreeNAS makes me nervous because I'm not sure how it will react.

                  The files are shared as a NAS, although we could connect via iSCSI or similar

                  That would corrupt the data. If it is shared as a NAS, then you need to run everything from the server. That rules out iSCSI. iSCSI would corrupt or just delete all of your data, since it would need to format the space as a new drive before mounting it.

                  • dbeato @HelloWill

                    @hellowill said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                    We have a large FreeNAS server that is loaded with files. I am looking for advice on the best way to get things cleaned up, and I know there's tons of duplicates.

                    File Types:

                    • Images
                    • Text
                    • Videos

                    File Counts:

                    • 10,000,000+ Files
                    • 200+ TB

                    I've tried running many other duplicate scanners, but they haven't been easy: the scanners crash when their logs get too big, it's hard to get context, and it takes days to scan without checksums (computing MD5 checksums for the files takes a really long time). And to top it off, they only run on one PC, so I can't even enlist the rest of the team to help clean up.

                    I need a way for us to easily scan files, identify duplicates, and ideally save scan results and checksums so that we don't need to keep re-scanning the same files again and again. I like Beyond Compare, but it only helps after the duplicates have been identified.

                    What do you guys do to scan this much data and make sense of it / organize it?

                    I am sure you have checked ZFS deduplication, correct? http://www.freenas.org/blog/freenas-worst-practices/

                    The way this is set up, it should be spread out, not just on one NAS device.

                    • scottalanmiller @dbeato

                      @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                      • dbeato @scottalanmiller

                        @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                        @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                        I know; in other words, I was just saying that FreeNAS and deduplication don't work well together...

                        • scottalanmiller @dbeato

                          @dbeato said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                          @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                          @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                          I know; in other words, I was just saying that FreeNAS and deduplication don't work well together...

                          I see, yes, it's a bit of a dilemma. In reality, nothing works great with dedupe; it's a difficult thing to do at large scale.

                          • Obsolesce @scottalanmiller

                            @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                            @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                            How did you get 256GB of RAM needed?

                            That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
                            Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

                            This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)
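
                            A rough back-of-the-envelope for where a figure like that comes from, assuming roughly 320 bytes of in-memory dedup-table entry per unique block and a 64 KiB average block size (both are assumptions, not measurements of this pool):

                            # All figures here are assumptions for illustration, not measurements.
                            data_bytes = 200 * 1024**4   # ~200 TiB of data
                            avg_block  = 64 * 1024       # assumed average block/record size
                            ddt_entry  = 320             # assumed bytes of RAM per dedup-table entry

                            blocks  = data_bytes // avg_block
                            ram_gib = blocks * ddt_entry / 1024**3
                            print(f"{blocks:,} blocks -> ~{ram_gib:,.0f} GiB of RAM for the dedup table")
                            # ~3.4 billion blocks -> ~1,000 GiB, i.e. roughly the 5 GB per TB rule of thumb.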

                            • scottalanmiller @Obsolesce

                              @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                              @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                              @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                              How did you get 256GB of RAM needed?

                              That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
                              Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

                              This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)

                              What caused it to balloon so much recently? Traditionally it has been 1GB per 1TB.

                              https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

                              • Obsolesce

                                Would you consider looking for duplicate files from the server directory by directory, rather than everything all at once?

                                Maybe scan in 500,000 file chunks and start reducing it little by little manually.
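
                                A minimal sketch of how the tree could be carved into roughly 500,000-file work units (the mount point and chunk size are placeholders), so each chunk can be scanned separately or handed to a different person:

                                import os

                                ROOT = "/mnt/tank"      # hypothetical share mount
                                CHUNK_TARGET = 500_000  # files per work unit, per the suggestion above

                                def count_files(path):
                                    total = 0
                                    for _dirpath, _dirs, files in os.walk(path):
                                        total += len(files)
                                    return total

                                # Greedily pack top-level directories into chunks of roughly CHUNK_TARGET files.
                                chunks, current, current_count = [], [], 0
                                for entry in sorted(os.scandir(ROOT), key=lambda e: e.name):
                                    if not entry.is_dir():
                                        continue
                                    n = count_files(entry.path)
                                    if current and current_count + n > CHUNK_TARGET:
                                        chunks.append(current)
                                        current, current_count = [], 0
                                    current.append((entry.path, n))
                                    current_count += n
                                if current:
                                    chunks.append(current)

                                for i, chunk in enumerate(chunks, 1):
                                    total = sum(n for _path, n in chunk)
                                    print(f"chunk {i}: {total} files")
                                    for path, n in chunk:
                                        print(f"  {path} ({n} files)")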

                                • Obsolesce @scottalanmiller

                                  @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                  @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                  @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                  @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                                  How did you get 256GB of RAM needed?

                                  That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
                                  Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

                                  This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)

                                  What caused it to balloon so much recently? Traditionally it has been 1GB per 1TB.

                                  https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

                                  That's just for the ZFS file system itself.

                                  If using deduplication, then 5GB per TB. Dedup has its own requirements.

                                  • scottalanmiller @Obsolesce

                                    @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                    @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                    @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                    @scottalanmiller said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                    @dbeato would need 256GB of RAM to attempt that with ZFS. That's a lot of RAM on a NAS.

                                    How did you get 256GB of RAM needed?

                                    That FreeNAS article recommends 5GB RAM per 1 TB of deduped data...
                                    Considering he has 200TB of data he'd want to dedup, that's at least 1TB of RAM to start.

                                    This is because dedup on ZFS/FreeNAS is much more RAM intensive than all other file systems. (and also because 200TB is a ton of data)

                                    What caused it to balloon so much recently? Traditionally it has been 1GB per 1TB.

                                    https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys

                                    That's just for the ZFS file system itself.

                                    If using deduplication, then 5GB per TB. Dedup has its own requirements.

                                    Oh right, poop. Yeah that's a lot of RAM needed.

                                    • DustinB3403 @Obsolesce

                                      @tim_g said in What's the Best Way to Deduplicate & Organize Files/Folders on a 200 TB NAS?:

                                      Would you consider looking for duplicate files from the server directory by directory, rather than everything all at once?

                                      Maybe scan in 500,000 file chunks and start reducing it little by little manually.

                                      This would likely be the only way to do it.

                                      I've used a few different tools (Windows ones) that could scan directories and compare for hash matches. I'm sure there is a better Linux alternative.

                                      • HelloWill

                                        I was hoping there was some type of server migration software or enterprise deduplication software that would be able to crawl all our data, store the results in some type of database and then allow us to parse the results.

                                        When you throw 10MM files at traditional duplicate cleaners, they tend to blow up. Then, after you clean some parts up, guess what... you have to rescan and wait.

                                        There has to be a better way. Block-level deduplication solves part of the storage size equation, but it doesn't address the root cause of the problem in the first place, which is poor data governance. The challenge is going from messy to organized in an efficient manner.

                                        Has anybody used this, or know of something similar?
                                        http://www.valiancepartners.com/data-migration-tools/trucompare-data-migration-testing/
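
                                        A minimal sketch of the "save the checksums so unchanged files never get re-scanned" part, building on the SQLite idea earlier in the thread (paths and table layout are made up for illustration): files whose size and mtime match the stored record are skipped instead of re-hashed.

                                        import hashlib
                                        import os
                                        import sqlite3

                                        DB = "/var/tmp/file_hashes.db"  # hypothetical location off the NAS
                                        ROOT = "/mnt/tank"              # hypothetical share mount

                                        con = sqlite3.connect(DB)
                                        con.execute("""CREATE TABLE IF NOT EXISTS files (
                                                         path TEXT PRIMARY KEY,
                                                         size INTEGER,
                                                         mtime REAL,
                                                         md5 TEXT)""")

                                        def md5(path, chunk=1024 * 1024):
                                            h = hashlib.md5()
                                            with open(path, "rb") as f:
                                                for block in iter(lambda: f.read(chunk), b""):
                                                    h.update(block)
                                            return h.hexdigest()

                                        hashed = skipped = 0
                                        for dirpath, _dirs, files in os.walk(ROOT):
                                            for name in files:
                                                path = os.path.join(dirpath, name)
                                                try:
                                                    st = os.stat(path)
                                                    row = con.execute("SELECT size, mtime FROM files WHERE path = ?",
                                                                      (path,)).fetchone()
                                                    if row == (st.st_size, st.st_mtime):
                                                        skipped += 1  # unchanged since last scan, keep stored checksum
                                                        continue
                                                    con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                                                                (path, st.st_size, st.st_mtime, md5(path)))
                                                    hashed += 1
                                                except OSError:
                                                    continue  # unreadable or vanished file, skip it
                                            con.commit()

                                        print(f"hashed {hashed} new/changed files, skipped {skipped} unchanged files")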

                                        • DustinB3403 @HelloWill

                                          @hellowill the biggest issue is you have way too many files and not enough resources to scan and dedup the system live.

                                          Your only reasonable approach is to do this in smaller chunks at a time, since we can reasonably assume you don't have a TB+ of RAM to throw at this job, nor anywhere to store the updated files.

                                          • dafyre

                                            I like Scott's idea of storing the file hashes in a database but jeez... for 10m files, you're looking at a huge DB just to store the file list!
