So this is an interesting one… Yesterday we noticed that one of the Exchange 2010 servers we manage was running low on drive space on the transaction log volume. My first thought was the usual suspect: VSS had freaked out and the logs weren’t getting truncated appropriately. This, after all, happens sometimes. But a quick review of the backups showed that everything had completed successfully in recent history. Odd. Looking at the transaction logs again, I realized they were all from today, and that a new transaction log was being created every couple of seconds! Not cool! Exchange 2010 can be transaction log heavy, but not to the tune of a log per second; that’s just crazy.
I checked to make sure the server wasn’t an open relay, and it wasn’t. I checked that the firewall was locked down to only allow inbound email from their spam filter provider; it was. I checked the queue… empty. Literally no email coming or going from the Exchange server (a small shop of about 20 users, so this isn’t all that unusual at a given time). I then watched the Exchange database and noticed that it wasn’t growing at all. So something was asking the Exchange server to do something that generated a ton of logs, but it wasn’t sending or receiving email, and there wasn’t an influx of mail being dumped into a user’s mailbox from a PST import or something to that effect.
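If you want a number rather than eyeballing the folder, here is a minimal Python sketch that buckets the log files in a directory by the minute they were last written. It is only an illustration, not something I ran on the server in question: the function names are made up for this example, and it assumes that a closed log's last-modified time is a fair stand-in for when it was generated.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path


def logs_per_minute(log_dir, pattern="*.log"):
    """Count files matching `pattern` in `log_dir`, bucketed by last-modified minute.

    Assumption: a closed transaction log's modification time approximates
    the moment it was filled and rolled over.
    """
    counts = Counter()
    for path in Path(log_dir).glob(pattern):
        if not path.is_file():
            continue
        stamp = datetime.fromtimestamp(path.stat().st_mtime)
        # Truncate to the minute so every log written in that minute shares a key.
        counts[stamp.replace(second=0, microsecond=0)] += 1
    return counts


def busiest_minutes(counts, top=5):
    """Return the `top` (minute, count) pairs, busiest first."""
    return counts.most_common(top)
```

Point `logs_per_minute` at the log folder for the database (the path is whatever your environment uses) and print `busiest_minutes(...)`. A healthy small server should show a handful of logs per minute at most; dozens per minute is the pattern described above.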
Here’s where it gets fun. I logged into the server after hours and rebooted it. At that point I was assuming it was a hung Exchange process or something similar, so I didn’t bother to check whether it was still exhibiting the behavior when I first logged in. After the reboot, however, the Exchange server started acting normally again: one or two transaction logs per minute or so. Chalk it up to something being freaked out. Out of curiosity, I logged back in this morning and it was doing it again… and after a brief discussion with one of my coworkers, who had also checked it earlier this morning, we determined that it must have started again around 8:30. It HAS to be a user-related event.
I did some digging and found a few tools that can help in this situation; one stood out from the rest. Our old friend ExMon (the Exchange Server User Monitor). It turns out the Exchange Team has kept this tool up to date, and it works just fine with Exchange 2010. Simply install the MSI, browse to the install directory, run the .reg file to create the appropriate registry settings, and we’re off to the races. Here is what you get:
A listing of all your current Exchange users and exactly what they’re doing from a resource perspective. Note that my top talker is currently chewing on 68% of the CPU and asking the server to process about 19 MB of data… compared to the 716 KB of its next closest competitor, and around 4 KB from everyone else. Now, I will say that it isn’t necessarily unusual for a user to show up at the top of this list for one reporting period using a nice chunk of CPU. This will happen if the user has just opened Outlook or has just performed a Send / Receive and there was a decent amount of email to download. However, this user was sitting at around 60-90% of the CPU for over 20 minutes. THAT is not normal.
So what was the culprit? This is a recent Exchange migration (we moved them from a hosted solution to their own Exchange server), and the PSTs had been imported. For whatever reason, this user’s PST import had gotten stuck. The only sign that something was wrong was that the user’s Outlook constantly said “Folder is waiting to update”. Oddly enough, you could reboot the machine, and as soon as the user logged in again and launched Outlook, the loop would start all over again. Deleting the user’s Outlook profile and recreating it solved the issue.
Another tool you’ll want in your toolbox is the Troubleshoot-DatabaseSpace.ps1 script provided with Exchange 2010 SP1. This can be an extremely useful tool in an emergency (i.e. you’re in very real danger of running out of disk space in the near future); for my situation, however, it would have been overkill, as we easily had a couple of days’ worth of disk space available. The big takeaway from the linked article is #4: if the script finds an offending mailbox, it will lock it out for 6 hours, meaning the user will no longer have access to their email. Again, since we had some time and disk space to spare, and the customer hadn’t noticed anything from a performance standpoint, this would have caused more disruption than we were currently experiencing.
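To decide whether you are in "emergency" territory or have a couple of days to investigate, a rough back-of-the-envelope estimate helps. The sketch below is illustrative only: the function name and the example figures are made up, and it assumes the 1 MB transaction log size that Exchange 2010 uses, a steady log rate, and no truncation from a successful backup in the meantime.

```python
LOG_SIZE_MB = 1  # Exchange 2010 transaction logs are 1 MB each


def hours_until_full(free_gb, logs_per_minute, log_size_mb=LOG_SIZE_MB):
    """Estimate hours until the log volume fills at a constant log rate.

    Assumptions: the rate stays constant and no backup truncates the logs.
    """
    if logs_per_minute <= 0:
        return float("inf")  # nothing is being written, so the volume never fills
    mb_per_hour = logs_per_minute * log_size_mb * 60
    return free_gb * 1024 / mb_per_hour


# Hypothetical figures: 80 GB free, one log every two seconds (30 per minute).
print(round(hours_until_full(80, 30), 1))  # 45.5
```

With those made-up numbers you would have a little under two days, which is the kind of margin that makes a careful investigation a better choice than an automatic mailbox lockout.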