Troubleshooting Slow Logins on Windows 2000 Terminal Services with roaming profiles / Outlook freezing when opening an attachment from Exchange Server (On Terminal Services and Fat Client).
How are these 2 problems related? First let me describe the environment.
Windows 2000 Active Directory
1x Raid 1 / Raid 5 with 2x Pentium 3 / 1gb Ram server that acts as Windows 2000 Exchange AND File server with SP3 (100Mb/s network)
4x Raid 1 with 2 x Pentium 3 / 1gb Ram Terminal Servers (Load Balanced with 3rd party software, RDP only / no Citrix). Each server averages approx 35-40 simultaneous users, mostly using Terminal Emulation to Unix and Outlook to Exchange Server. All users have Roaming Profiles and use reasonably extensive Group Policies to lock down desktops. 3 x have SP4 and 1 x has SP3 but problem was apparent across all 4 servers.Additionally:
Approx 30 x Fat Clients connecting to Exchange server
Approx 200 Fat Clients using POP3 for Exchange
The terminal servers sit on the same subnet as the Exchange / File Server connected by a 100Mb/s HP Procurve Switch.I will skip the initial testing stages, which is all standard stuff. eg: Memory Usage, CPU usage, Network Congestion etc. Needless to say, all of this did not point to any problems.
Then I got hold of Brian Madden’s “Terminal Server Performance Tuning” document and the first leaf out of that book that I took was to start logging the user login process by adding the following registry entry:
Key: HKLMSoftwareMicrosoftWindows NTCurrentVersionWinlogon
Data: 10002 (Hex)
From there, you check for a â€œuserenv.logâ€ file in the %System-Root%DebugUserMode folder
**Remember to turn the loggin off when you’re finished**
The log for each user login is approx 10 pages, so obviously I can’t provide the complete logs.
What I found interesting (or disturbing) in the logfile were the instances below:
USERENV(cab8.d1b0) 14:16:21:015 RecurseDirectory: Adding fileserver1profilesuser1WINDOWS
USERENV(cab8.d1b0) 14:16:57:000 ReconcileFile: fileserver1profilesuser1ntuser.dat
USERENV(d284.d278) 14:20:17:531 RecurseDirectory: Adding fileserver1profilesuser1WINDOWS
USERENV(d284.d278) 14:20:17:671 ReconcileFile: fileserver1profilesuser1ntuser.dat
USERENV(9100.9154) 15:48:16:125 RecurseDirectory: Adding fileserver1profilesuser1WINDOWS
USERENV(9100.9154) 15:48:54:984 ReconcileFile: fileserver1profilesuser1ntuser.dat
Here we see 3 login instances. In sessions 1 and 3, we see a 36 and 38 second delay between the Recurse and the Reconcile, whereas sessons 2: the delay is just more than .140 of a second.. a bit of a difference. But why?
So, I started testing file transfers (2 folders containing 60 files = 60mb) across our network.
Here, our problem raised it’s ugly head on occasion but not always replicable and without any apparent changes occuring. Here’s a brief description of the tests and an outline of what was happening.
fs1->ts1 – ts4
When attempting to copy files from the FS Share to the TS drives, Explorer would initially go blank as if I had not initiated the transfer. Then after 5-10 seconds it would show me “37 minutes” left to transfer. 37 minutes across a 100Mb/s network for 60mb??!! Ok, we have a problem.
But check this out. Whilst that 37 minute transfer is in progress, I can copy those same set of files and folders from fs1 -> dc1 or to a client machine (same switch, same subnet) in less than 20 seconds AND I could copy a similar sized folder/file set from ts1-ts2 (or any combination of Terminal Servers) between each other in 20 seconds or less. All of this whilst the fs1->ts(x) file transfer was still snail pacing along.
Additionally, when running perfmon to watch network activity and see what was happening, everything was idle except for a quick spike every 38-40 seconds (very close to the 36-38 second delay experienced in the login log)
What made it more difficult is that the problem only occured once you had a medium-high number of clients connected (across 15 branches, 1000kms away) and I couldn’t kick them off anytime I needed to make a change or reboot (oh, how I love Windows reboots).
In the meantime, I was searching for articles on Google, Forums, Newsgroups and Microsoft KB looking for any information on this – you know the drill.
Now, Windows Server has a feature in Network Connection (Properties)->File and Printer Sharing for Microsoft Networks (Properties) called “Server Optimization” with 4 choices:
- Minimize Memory Used
- Maximize Throughput for Sharing
- Maximize Throughput for Network Applications
Microsoft recommends number 3 if it is a File Server but number 4 if it is an Exchange or SQL server. When I approached the problem, the fs1 (Exchange/File Server) was set to 4.
If you want to read more about it, check out these articles.
None of the above had any effect on my problem so I decided in our situation, to leave it on 2: Balance
In doing this, the following day, myself and a number of users got the following error when logging off
“The Network BIOS Command Limit Has Been Reached” ..Yay, at least we are getting an actual error now. :)
A quick search for that gave me Microsoft KB 810886
I describe the important bits below:
After you install Windows 2000 Service Pack 3 (SP3), you may receive the following error message:
The network BIOS command limit has been reached.
This issue may occur if both of the following conditions are true:
Windows 2000-based clients submit multiple and simultaneous long-term requests against a file server that runs Server Message Block (SMB) Service. One example of a long-term request is a client using the FindFirstChangeNotification call to monitor a server share for changes. The MaxCmds registry value setting on the client is lower than 50, or the MaxMpxCt registry value setting on the server is lower than 50.
The maximum number of concurrent outstanding network requests between a SMB client and the server is determined when a client/server session is negotiated. The maximum value that a client supports is determined by the MaxCmds registry value. The maximum value that a server supports is determined by the MaxMpxCt registry value. For a particular client and server pair, the number of concurrent outstanding network requests is the lesser of these two values.
To track the number of concurrent outstanding SMB network requests on a client, add the Current Commands counter in the Redirector performance object to Performance Monitor.
OK, so let’s add the “Current Commands” counter from the Redirector object (*On the ts1-4 NOT on fs1) and guess what? It would hit a ceiling limit of 49 with the occasional jump to 52/53 when trying to perform some operations. When I would try a 60mb file transfer from fs1->ts(x), the current commands would jump but no Bytes get transferred. The Current Commands counter drops back to 49 every 40 seconds, then I would get a spike and some data transferred, it would jump back above 50 and my data would stop getting transferred.
So following the directions outlined in article Microsoft KB 810886 I ended up with the following:
My File/Exchange Server now has the following values:
“MaxWorkItems”=dword 8192 (decimal)
“MaxMpxCt”=dword 512 (decimal)
My Terminal Servers have the following values:
“MaxCmds”=dword 512 (decimal)
None of these Dword Keys existed, so I had to add them manually and they do not take effect until a reboot. My observations of the Terminal Servers since changing these keys and rebooting is the I can no longer replicate the slow / pausing file transfer. The “Current Commands” counter in Perfmon averages about 210 for 35 users.
The 6:1 ratio seems quite consistent in our environment (10 users / 60 Current Commands) and this may help you adjust your current commands ratio on your Terminal Servers.
Additionally, the “Current Commands” perfmon counter ALWAYS shows 0-1 on the File/Exchange Server, so unless somebody knows otherwise, I can’t help you discover a reasonable figure for that.
Basically, I just hope this information can help someone as it has taken me an unquantifiable amount of hours to research / debug / sort out this problem. If I can pull anymore information out of these servers, I will keep this updated.