Jimmy9008

    Posts

    • RE: Looking to Buy a SAN

      We recently looked at moving our workload to Azure, led by the Dev team. They looked at ways they could leverage various Azure services and technologies (no idea which, it's their job), and the most economical option they came back with was $23,500 CAD per month. I presume that's based on them leveraging the most economical services in the best way. No idea though, as I'm not involved on that side.

      Our new server infrastructure which Dev use comes to just below $350k CAD which is set to last 5 years until the next replacement.

      That's just below $6,000 CAD per month, far less than the $23,500 CAD that Azure would cost. Even if we take IT staff wages into account, we're still at a much lower number for dedicated systems with no contention.

      For email, CMS, etc., sure. For what our Dev team uses, nope. And that's them leading that costing, not IT.
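
      A quick back-of-the-envelope check of that comparison (a sketch using only the figures above; straight-line amortisation, ignoring power, colo, and financing):

          # Rough cost comparison from the numbers above (all CAD).
          capex = 350_000                      # new server infrastructure
          months = 5 * 12                      # five-year replacement cycle
          on_prem_monthly = capex / months     # ~5,833 per month
          azure_monthly = 23_500               # Dev team's most economical Azure quote

          print(f"On-prem: ~${on_prem_monthly:,.0f}/month vs Azure: ${azure_monthly:,}/month")
          print(f"Azure works out to roughly {azure_monthly / on_prem_monthly:.1f}x the cost")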

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      So, 100 GB is actually 300 GB.

      Not for the drives themselves. I'm assuming some kind of hardware/software RAID, so that 100 GB gets split among the drives the data goes to, according to RAID level.

      Blocks that are accessed more frequently (read data) don't really count as much, as I'm sure there is caching in multiple places.

      If I have a VM on the CSV using 100 GB, the whole point of having the vSAN is that every byte exists on the vSAN partners to avoid any downtime on failure. So, it really is copied entirely three times.

      Yeah, I know that. But what you showed concern about before was the wear on the drives. If you pick out one random drive, it's not getting 3x the data written.

      Yes, correct. I misunderstood what you were saying. Either way, I am sure we are fine on writes; I'm not worried about them. I do however think it's silly to use more writes than needed, for data that doesn't need HA, just because we can. Keeping data that can tolerate long downtime off of the vSAN naturally causes less wear on the disks... why would we want to add wear where it's not needed...

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      How much data changes every day? Do you have 100 GB of changes per day? 1 TB?

      Last time I ran Live Optics (a week or two ago), we were at around 6 TB of changes per day.

      What are your drives warrantied at? What's the DWPD or whatever? The idea is they only need to last the 5 years / X DWPD, or whatever period they're rated for, anyway.

      1 DWPD. I'm not so worried about the writes; I just would like to avoid additional writes where they're not really needed.

      What drive model are they? How many drives per server? What RAID level is being used? Is it HW/SW RAID? If HW, which card?

      PERC H740P with 8 GB cache. Drives: MTFDDAK1T9TDN, 14 per server, in RAID 6.
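
      Assuming those are 1.92 TB drives (the capacity the model string suggests) and full-stripe RAID 6 writes, a rough endurance check against the ~6 TB/day of changes mentioned above, with 3-way replication landing a full copy on each host, might look like this:

          # Back-of-the-envelope SSD endurance check (capacities and overheads are assumptions).
          drives_per_server = 14
          drive_tb = 1.92                    # assumed capacity for MTFDDAK1T9TDN
          dwpd = 1                           # rated endurance per the thread
          changes_tb_per_day = 6             # Live Optics churn; replication puts a full copy on each host
          raid6_overhead = drives_per_server / (drives_per_server - 2)   # ~1.17x for full-stripe writes

          budget = drives_per_server * drive_tb * dwpd     # ~26.9 TB/day rated per server
          estimate = changes_tb_per_day * raid6_overhead   # ~7.0 TB/day estimated per server
          print(f"Rated write budget: {budget:.1f} TB/day per server")
          print(f"Estimated writes:   {estimate:.1f} TB/day per server")

      On those assumptions the writes sit at roughly a quarter of the rated budget, which fits the "fine on writes" conclusion; small random writes amplify more under RAID 6 than full stripes do, so the real figure would be somewhat higher.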

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      So, 100 GB is actually 300 GB.

      Not for the drives themselves. I'm assuming some kind of hardware/software RAID, so that 100 GB gets split among the drives the data goes to, according to RAID level.

      Blocks that are accessed more frequently (read data) don't really count as much, as I'm sure there is caching in multiple places.

      If I have a VM on the CSV using 100 GB, the whole point of having the vSAN is that every byte exists on the vSAN partners to avoid any downtime on failure. So, it really is copied entirely three times.

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      Another thought: CSV data is replicated to all three hosts, so 100 GB is actually 300 GB, and 1 TB is actually 3 TB. Why would it make sense to put VMs (applications) whose long downtime the company can sustain into an area where they take up 3x the space, on expensive SSDs? Why not put the application you don't care about on one host, where it takes up a single lot of space, leaving the rest for things the company does care about...

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      How much data changes every day? Do you have 100 GB of changes per day? 1 TB?

      Last time I ran Live Optics (a week or two ago), we were at around 6 TB of changes per day.

      What are your drives warrantied at? What's the DWPD or whatever? The idea is they only need to last the 5 years / X DWPD, or whatever period they're rated for, anyway.

      1 DWPD. I'm not so worried about the writes; I just would like to avoid additional writes where they're not really needed.

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Dashrender said in Windows Failover Clustering... what are your views and why?:

      @Jimmy9008 said in Windows Failover Clustering... what are your views and why?:

      @Dashrender said in Windows Failover Clustering... what are your views and why?:

      Oh, I see your points for sure; I was only asking if a non-CSV-based VM could cause a performance bottleneck on a single host - the CSV itself could get stalled out, causing a problem for all CSV-based VMs. Now you have another issue to consider when troubleshooting CSV-based issues.

      Now - perhaps you have so many IOPS that this isn't a real issue; it was only a thought.

      Oh, I see. As another item on the list of reasons not to add everything to cluster storage unless the VM needs to be HA? Yeah, I'll add that to my list.

      So, do you agree option 2 is the way to go? Only add to CSV where needed...

      I'm not agreeing or disagreeing - I don't know enough to have an opinion... but my question, I think, leans more toward option 1 - because then the whole system would be affected equally by the mentioned problem, instead of just a single node, which on one end might cause a failover and this node being kicked from the cluster, all the way up to crashing the whole cluster (but damn, I would hope not).

      If a single non-CSV VM could cause a bottleneck on a host, couldn't you argue that the same VM would also cause a bottleneck on the CSV? If anything, as CSVFS carries additional overhead from being a clustered file system, it's more likely that having the VM on the CSV would create a performance issue, no?

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Obsolesce said in Windows Failover Clustering... what are your views and why?:

      How much data changes every day? Do you have 100 GB of changes per day? 1 TB?

      Last time I ran Live Optics (a week or two ago), we were at around 6 TB of changes per day.

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Dashrender said in Windows Failover Clustering... what are your views and why?:

      Oh, I see your points for sure; I was only asking if a non-CSV-based VM could cause a performance bottleneck on a single host - the CSV itself could get stalled out, causing a problem for all CSV-based VMs. Now you have another issue to consider when troubleshooting CSV-based issues.

      Now - perhaps you have so many IOPS that this isn't a real issue; it was only a thought.

      Oh, I see. As another item on the list of reasons not to add everything to cluster storage unless the VM needs to be HA? Yeah, I'll add that to my list.

      So, do you agree option 2 is the way to go? Only add to CSV where needed...

      posted in IT Discussion
      Jimmy9008
    • RE: Windows Failover Clustering... what are your views and why?

      @Dashrender said in Windows Failover Clustering... what are your views and why?:

      Is the local physical storage all part of the pool for the CSV? If so, could you be spiking a single server's storage with a VM on that host, which could then cause a performance delay for the whole CSV?

      The storage currently looks like this: each server has 21 TB of SSD as $V. There are 5 CSVs/vSAN images on $V, each of 3 TB. That uses 15 TB of $V storage to provide the vSAN to the WFC, leaving ~6 TB usable on each host (18 TB total) as non-CSV, non-HA local storage. The vSAN could of course be expanded into this rather than left as is, but I don't think all of it should be CSV.
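
      For what it's worth, that layout tallies up as follows (a quick sketch; the 3-way replication means each CSV terabyte consumes a terabyte on every host):

          # Capacity layout per the description above.
          hosts = 3
          ssd_per_host_tb = 21
          csv_images = 5
          csv_image_tb = 3

          csv_per_host_tb = csv_images * csv_image_tb             # 15 TB of each host's $V feeds the vSAN
          local_per_host_tb = ssd_per_host_tb - csv_per_host_tb   # ~6 TB of non-HA local space per host

          print(f"Per host: {csv_per_host_tb} TB CSV + {local_per_host_tb} TB local")
          print(f"Cluster:  {hosts * ssd_per_host_tb} TB raw SSD, "
                f"{csv_per_host_tb} TB effective CSV, {hosts * local_per_host_tb} TB local")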

      posted in IT Discussion
      Jimmy9008
    • Windows Failover Clustering... what are your views and why?

      Hi folks,

      We have a Windows Failover Cluster here using Starwind vSAN over three hosts, all local SSD storage. I'm looking for your thoughts on best practice for where VMs should sit, as we have a few differences of opinion internally and would like some external thoughts...

      The two lines of thought currently are:

      1. We have a Windows Failover Cluster, so we should add all VMs to cluster storage and all VMs to the cluster, for management and HA across the board.

      2. We have a Windows Failover Cluster; we should add VMs that need HA to cluster storage and the cluster, and add all other VMs that do not need HA to the cluster, but with their storage local to one of the servers, off of CSV storage (not HA).

      I'm in the second group. Where are you and why?

      My thoughts are (maybe wrong):

      • Data in the CSV will replicate over all three nodes because of Starwind, so each write to a CSV is actually threefold. If all VMs are on CSV storage and writing to all three hosts, we could considerably shorten the life of our SSDs (see the sketch after this list).

      • CSVFS comes with a performance hit: each write has to be committed to every server, which takes time, and a clustered file system carries its own overhead on top. Adding everything to the cluster and CSV just lowers overall performance for no reason, as some applications do not need HA.

      • If all VMs are on the CSV, the Starwind sync channel will have more work to do, possibly introducing additional performance issues even though we do not need HA for every service.

      • We can still add a VM to the cluster but keep its storage local to one of the three servers, off of CSV. Where VMs don't require HA, that's fine, as we can manage them all from WFC whilst keeping performance (keeping unneeded load off the CSVs).

      • We save space on the CSV for future HA server requirements. If the CSV is used for everything, the vSAN space will fill up quickly with non-HA VMs, and when we finally need a new HA system, we won't have room.

      • Some applications are naturally HA by design, so as long as one VM runs on Hyper-V on each of the three hosts, the application stays up without any of the VMs being in cluster/CSV storage. So why take up CSV space?
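
      As a rough illustration of the write-wear point from the first bullet (the 6 TB/day churn figure comes from elsewhere in the thread; the 50/50 split between HA and non-HA churn is purely hypothetical):

          # Illustrative daily-write comparison of the two options (assumed figures).
          changes_tb_per_day = 6     # total churn per Live Optics
          ha_fraction = 0.5          # hypothetical share of churn from VMs that truly need HA
          replicas = 3               # Starwind writes every CSV byte to all three hosts

          option1 = changes_tb_per_day * replicas                   # everything on CSV
          option2 = (changes_tb_per_day * ha_fraction * replicas    # HA VMs on CSV
                     + changes_tb_per_day * (1 - ha_fraction))      # the rest written once, locally
          print(f"Option 1 (all on CSV):            {option1:.0f} TB/day across the cluster")
          print(f"Option 2 (CSV only where needed): {option2:.0f} TB/day across the cluster")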

      Logically, to me... this all adds up to using the CSV only for VMs that the company says need to be highly available, and leaving everything else on the local array outside of CSV storage.

      What do you folks think?

      Best,
      Jim

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      @notverypunny said in File transfer drop:

      So a couple of things I'd be looking at if it were me:

      • RAID card config: write-through / write-back will have performance impacts (but should hit S2019 and W10 equally)
      • Network vs storage:
        -- iperf3 only runs in memory, so it completely removes storage from the troubleshooting equation. If you see the same type of drop-off when testing with iperf3, you know that there's a networking gremlin somewhere that needs to be dealt with.
        -- something like LANSpeedTest actually writes and reads a file on the far-end storage, so it should provide the same results as your typical file transfer. You can also set the transfer size arbitrarily, in case you want to test something bigger than the static file you have.
      • What's actually running in the OS at the same time
        -- use something like Process Hacker to see what else might be using the network or other I/O when your file transfer slows to a crawl.
        -- maybe there are security configs being applied to your servers, and not the W10 guests, that aren't being taken into consideration.

      I'm not sure if these will help...

      The physical servers work fine over the network. Full speed ahead! So it can't be RAID settings, a network issue, or storage. Physical <-> physical is perfect. What is the point of testing with iperf? I'm saying already that physical <-> physical is perfect...

      The issue is with the VMs. From a VM on host A to a VM on host B, I'm seeing much slower speeds. From physical A to physical B, it's fine.
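
      (For reference, the iperf3 test suggested above runs entirely between the two VMs rather than the physical boxes, which is the path still in question. A minimal run, with placeholder hostnames, would be:)

          # On VM2, start an iperf3 server:
          iperf3 -s

          # On VM1, run a 30-second test against VM2:
          iperf3 -c vm2.example.local -t 30

          # Add -R to test the reverse direction without swapping roles:
          iperf3 -c vm2.example.local -t 30 -R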

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      One thing I have found is that if the VMs are given 50 GB RAM, they get a solid transfer of ~300 MB/s. I guess a percentage of VM RAM is used as a cache, and once that's full, the network speed drops. Not sure though, and no idea what that cache is or how to edit it. Just a guess.
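
      The numbers reported elsewhere in the thread are at least consistent with that guess; a quick estimate of the apparent cache from the observed behaviour (a sketch, not a measurement):

          # Estimate the apparent write cache from the reported transfer figures.
          fast_mb_s = 600            # initial VM <-> VM speed with 10 GB of VM RAM
          seconds_before_drop = 10   # slowdown reportedly hits after ~10 seconds
          apparent_cache_gb = fast_mb_s * seconds_before_drop / 1000
          print(f"~{apparent_cache_gb:.0f} GB buffered before the slowdown")   # ~6 GB

      With 10 GB of RAM, ~6 GB is most of the guest's free memory; with 50 GB of RAM the whole test file would fit in cache, which would explain the "no slowdown" result.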

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      @Dashrender said in File transfer drop:

      Is it the VMQ issue?

      Yes, VMQ disabled on all hosts and VMs

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      @Dashrender said in File transfer drop:

      @Jimmy9008 said in File transfer drop:

      @scottalanmiller said in File transfer drop:

      You left out the parts that tend to matter, like the storage. I'm guessing you've got a spinner in the chain there somewhere. If so, yup, that's as expected.

      The servers are Hyper-V hosts. Each has only one VM (vm1 and vm2, referred to in the original post). The physicals each have dual 2.1 GHz 22-core procs, 768 GB RAM, and 11 x 600 GB SSDs in RAID 5 behind a hardware RAID card - a Dell card, I believe, with 8 GB cache.

      From host to host over the network, I get a solid 1.6 GB/s. From VM <-> VM, the transfer starts at ~600 MB/s then drops to ~30 MB/s on a 6 GB file with 1 GB left, each and every time. The VMs are both Server 2019, with 200 GB disks on the array, access to the 10 GbE network, and 10 GB RAM assigned...

      Dual 22-core processors (44 cores total)? Damn, those VMs cost like $5,000 each in licensing (well, per pair of VMs). Definitely expensive. Or did I miss something?

      (hint - I made up the $5000 number, but it's definitely going to be WAY more expensive than the default 16 core setup at $800 for two VMs)

      Currently there are only two test VMs. There will eventually be hundreds of VMs.

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      To add: I just did this test with two new Windows 10 VMs on the hosts - a solid ~600 MB/s transfer, no slowdown...

      WTF is 2019 doing!

      posted in IT Discussion
      Jimmy9008
    • RE: File transfer drop

      @scottalanmiller said in File transfer drop:

      You left out the parts that tend to matter, like the storage. I'm guessing you've got a spinner in the chain there somewhere. If so, yup, that's as expected.

      The servers are Hyper-V hosts. Each has only one VM (vm1 and vm2, referred to in the original post). The physicals each have dual 2.1 GHz 22-core procs, 768 GB RAM, and 11 x 600 GB SSDs in RAID 5 behind a hardware RAID card - a Dell card, I believe, with 8 GB cache.

      From host to host over the network, I get a solid 1.6 GB/s. From VM <-> VM, the transfer starts at ~600 MB/s then drops to ~30 MB/s on a 6 GB file with 1 GB left, each and every time. The VMs are both Server 2019, with 200 GB disks on the array, access to the 10 GbE network, and 10 GB RAM assigned...

      posted in IT Discussion
      Jimmy9008
    • File transfer drop

      Hi folks,

      Can anybody else verify this issue? If you have two Windows Server 2019 VMs on two different hosts on a 10-gigabit network, and you copy a large file (>10 GB) from \\vm1\c$ to \\vm2\c$ between the VMs, does it start fast at ~600 MB/s and then drop to 30 MB/s for the rest of the transfer after about 10 seconds?

      Best,
      Jim

      posted in IT Discussion
      Jimmy9008
    • RE: What Are You Doing Right Now

      It's the bad timing that gets me: on Monday we get the servers that are replacing these, which have been on order for a while, and we had approval to replace the UPS a while back but couldn't due to COVID-19 with the office closed! Oh well, we will survive it 🙂

      Have a good weekend folks!

      posted in Water Closet
      Jimmy9008
    • RE: What Are You Doing Right Now

      Fighting a large outage we have been having all week! Bad timing!

      I did not even know this was possible, but our UPS had a large failure on Monday and sent a surge to the back-end hardware. I thought the point of a UPS was to protect downstream hardware from power surges, but hey... it happened.

      We lost the UPS, two switch stacks (core and edge) of three switches each, two old blade chassis with 16 servers in each, a load of old FC and iSCSI SANs, and one new server host. We have gone from 35 servers (32 blade servers and 3 standalone servers) to only 2 standalone servers. Luckily, though, our backups are rock solid.

      We had been planning for about three months to remove the blades and get a new UPS! Really bad timing. The hardware that replaces everything that failed... is due for delivery on Monday!

      Getting there, though. Currently fighting with one of the three newer hosts to spread the load.

      But yeah, fun week!

      posted in Water Closet
      Jimmy9008