    Dell Server Not Recognizing Memory

    • NashBrydgesN
      NashBrydges
      last edited by

      Here's a weird one. A new client has a Dell PE-R720XD SFF with 24 x 16GB sticks occupying every available slot in the server. As part of my inventory discovery work, I noticed that 6 slots do not recognize the memory installed in them. I checked all the modules and they are all the same Samsung ECC RDIMMs, 16GB at 1333MHz, so it isn't a compatibility thing. I spent a few hours moving modules around, but the same slots fail to recognize memory regardless of which stick is in them. It appears to be channel related because all the unavailable slots are on Processor 2, Channels 0 and 1. Essentially, 2 channels are not recognizing memory on that second processor.

      The weird thing is that the server is running "perfectly". I add the quotes because while there are no errors and all VMs are working well with no degradation in performance, there is obviously an issue.

      B1 = Processor 2 Channel 0
      B2 = Processor 2 Channel 1
      B5 = Processor 2 Channel 0
      B6 = Processor 2 Channel 1
      B9 = Processor 2 Channel 0
      B10 = Processor 2 Channel 1
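
The channel pattern is easier to see when the slots are grouped. The snippet below is only a small sketch in Python that uses the slot list exactly as given above; the helper name `group_by_channel` is made up for illustration and has nothing to do with any Dell tooling.

```python
from collections import defaultdict

# Slot-to-channel mapping exactly as listed in the post.
unrecognized = {
    "B1": ("Processor 2", 0),
    "B2": ("Processor 2", 1),
    "B5": ("Processor 2", 0),
    "B6": ("Processor 2", 1),
    "B9": ("Processor 2", 0),
    "B10": ("Processor 2", 1),
}


def group_by_channel(slots):
    """Group slot names by their (processor, channel) pair."""
    groups = defaultdict(list)
    for slot, key in slots.items():
        groups[key].append(slot)
    return dict(groups)


groups = group_by_channel(unrecognized)
for (cpu, channel), names in sorted(groups.items()):
    print(f"{cpu} channel {channel}: {', '.join(names)}")
```

All six slots fall into just two groups, which is why the fault looks tied to the channels rather than to individual modules.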

      To make sure I wasn't missing anything, I checked the manual and for 2 processor setups, the memory currently installed should work properly. I've also reseated every single module just in case.

      118b5837-ff65-40f7-ad07-6d2c33ec50c6-image.png

      There are absolutely no log entries indicating any issues with memory going back over a year and the server has been rebooted a number of times since I've been looking at the memory issue.

      I've also run the Dell diagnostics utility on boot-up and everything checked out ok with a PASS on everything.

      Before I start dismantling the server to diagnose, any thoughts as to what to test next?

      These are the troublesome slots.

      2a3b8553-831e-468e-9648-4018de2f4ccd-image.png

      • DanpD
        Danp @NashBrydges
        last edited by

        @NashBrydges said in Dell Server Not Recognizing Memory:

        Samsung ECC RDIMMs @ 16GB 1333Mhz memory

        Did you notice this in the manual?

        NOTE: 16 GB quad-rank RDIMMs are not supported.

        Are you able to determine the specific part number for these DIMMs?

        • 1
          1337 @Danp
          last edited by 1337

          @Danp said in Dell Server Not Recognizing Memory:

          @NashBrydges said in Dell Server Not Recognizing Memory:

          Samsung ECC RDIMMs @ 16GB 1333Mhz memory

          Did you notice this in the manual?

          NOTE: 16 GB quad-rank RDIMMs are not supported.

          Are you able to determine the specific part number for these DIMMs?

          I'd check all the small numbers on the DIMMs.

          It's possible that someone screwed up and didn't notice.

          6x16GB of RAM that is not working adds up to 96GB of missing RAM. That's a significant share of the server's total RAM.

          It's also possible that one CPU is faulty. That's extremely rare, though not impossible. I believe the DIMMs are connected directly to the CPU's internal memory controller.

          It's a slightly odd memory configuration, so it's not unlikely that the memory has been upgraded during the server's lifetime. Normally it's better to use only 8 DIMMs per CPU and, if you need more than 16x16GB, to use 32GB LRDIMMs instead. You can't mix RDIMMs and LRDIMMs though, which is another way to screw up.

          • 1
            1337 @NashBrydges
            last edited by

            @NashBrydges said in Dell Server Not Recognizing Memory:

            I've also run the Dell diagnostics utility on boot-up and everything checked out ok with a PASS on everything.

            The diagnostics utility can't test what the CPU can't recognize or find, so it's of limited value here.

            • NashBrydgesN
              NashBrydges @Danp
              last edited by

              @Danp I did, yeah. No quad-rank DIMMs.

              • NashBrydgesN
                NashBrydges @1337
                last edited by

                @Pete-S That's what I also thought. I will have to spend some more time digging all the module numbers out tomorrow once I'm back there. There has to be something mismatched somewhere. Can't imagine anything else at this point.

                • 1
                  1337 @NashBrydges
                  last edited by

                  @NashBrydges said in Dell Server Not Recognizing Memory:

                  The weird thing is that the server is running "perfectly". I add the quotes because while there are no errors and all VMs are working well with no degradation in performance, there is obviously an issue.

                  This is what is to be expected when the CPU doesn't recognize the memory.

                  What you have is one CPU with full memory bandwidth and 192GB of memory, and the other CPU with 96GB of memory and probably only half the memory bandwidth. So the server is less performant than it would normally be.
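
The arithmetic behind those numbers can be sketched in a few lines of Python. The figures come from the thread (16GB DIMMs, 24 slots across 2 processors, 2 channels not detected on the second CPU). The 4-channels-per-CPU value is inferred from 6 dead slots spread over 2 channels, and the bandwidth estimate assumes every channel contributes equally, which is a simplification. `cpu_summary` is a hypothetical helper name.

```python
DIMM_GB = 16           # from the thread: 16GB RDIMMs in every slot
SLOTS_PER_CPU = 12     # from the thread: 24 slots across 2 processors
CHANNELS_PER_CPU = 4   # inferred: 6 dead slots over 2 channels = 3 slots per channel


def cpu_summary(dead_channels):
    """Return (usable GB, fraction of memory bandwidth) for one CPU.

    Assumes each channel contributes an equal share of bandwidth.
    """
    slots_per_channel = SLOTS_PER_CPU // CHANNELS_PER_CPU
    live_channels = CHANNELS_PER_CPU - dead_channels
    capacity_gb = live_channels * slots_per_channel * DIMM_GB
    bandwidth_fraction = live_channels / CHANNELS_PER_CPU
    return capacity_gb, bandwidth_fraction


cpu1 = cpu_summary(dead_channels=0)   # first CPU, all channels detected
cpu2 = cpu_summary(dead_channels=2)   # second CPU, channels 0 and 1 missing
missing_gb = cpu1[0] - cpu2[0]

print(f"CPU 1: {cpu1[0]}GB at {cpu1[1]:.0%} bandwidth")
print(f"CPU 2: {cpu2[0]}GB at {cpu2[1]:.0%} bandwidth")
print(f"Missing: {missing_gb}GB")
```

Under those assumptions the second CPU ends up with 96GB and half of its channels, which matches the estimate above.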

                  • 1
                    1337 @NashBrydges
                    last edited by 1337

                    @NashBrydges said in Dell Server Not Recognizing Memory:

                    @Pete-S That's what I also thought. I will have to spend some more time digging all the module numbers out tomorrow once I'm back there. There has to be something mismatched somewhere. Can't imagine anything else at this point.

                    If possible you should be prepared to swap the CPUs.

                    What kind of CPUs are in there? E5-26xx V2 something perhaps? V1 is probably more likely.


                    Troubleshooting time quickly adds up, so it might be time to consider what to do if the problem can't be solved with the easy steps, like inspecting the RAM and reseating it.

                    The R720 is well past its expected lifespan at this point. It's very much a possibility that the server is on the verge of catastrophic failure and this is the first sign.

                    • NashBrydgesN
                      NashBrydges @1337
                      last edited by

                      @Pete-S The modules have all been reseated and swapped around to other slots, and it's still the same thing. The same 6 slots remain unidentified (or unoccupied, according to iDRAC).

                      The CPUs are E5-2650 v1.

                      I've already had the conversation with the owner. Looks like we're going to keep things as they are since everything is operating normally (apart from the obvious missing RAM). We have good, tested backups and another server we can migrate the workload to in under an hour should something fail. He's unwilling to spend the cash on a new server, and a deep diagnosis would be pretty pricey given my time, so...status quo for now.

                      • NashBrydgesN
                        NashBrydges @NashBrydges
                        last edited by

                        @NashBrydges Guess you can take a horse to water but you can't force him to drink.

                        • DanpD
                          Danp @NashBrydges
                          last edited by

                          @NashBrydges Did you try switching the positions of existing CPUs?
