Permissions. It’s Always the Permissions.

Lately, much of my work time has been spent upgrading the servers that keep the company network running. As the Extended Support End Date for Server 2003 has crept closer and closer, it has become essential to roll out updated hardware running the latest version of Windows Server. There are almost two years left before Server 2003 loses its support, but the earlier I can get things upgraded, the better. Now, any server upgrade can be a scary proposition, but my latest upgrade was particularly worrisome. It was time to upgrade the Primary Domain Controller (PDC), or perhaps more correctly, the Domain Controller (DC) that has the PDC emulator role assigned to it.

For those not familiar with Windows networking, here’s a simple overview. DCs are the servers set up to store all network information. These servers work together to run a system called Active Directory (AD). Many items are stored in AD, but the most familiar to an average user would be user logins. Without AD, you wouldn’t be able to log in and access files and services on the network. So, not only was I upgrading one of the core machines behind “the network,” but I was upgrading the one that handles the bulk of the networking workload. That makes for a lot of potential for something to go wrong.

Quick side note: I didn’t actually “upgrade” the old server. Instead, I migrated all of the server’s roles to a new server. I’m just using the word “upgrade” since most readers will understand that term.

The best defense against something going wrong is, of course, research. I always make sure to thoroughly research and plan out any major software installation or upgrade. One of the best web searches you can perform is “issue after installing X,” where X is the software you are installing. That will often turn up a large library of issues other users have had and, hopefully, the solutions they found for those issues. Knowing what can go wrong (even if it’s unlikely) can only help when installing and configuring a piece of software. I had done my homework, I had fully configured my new DC, I had transferred all applicable roles to the new server, and I had run with this configuration for weeks. Nothing had gone wrong, and so the time had come to remove the old server from the network. I uninstalled AD from the old server, effectively demoting it to the role of a normal member server, then completely shut down and removed the old server from the network.
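Side note for anyone doing a similar migration: if you want to confirm which DC actually ended up holding the PDC emulator (and the other FSMO roles) after the transfer, netdom from the Server 2003 Support Tools will list the current role holders. It’s a quick sanity check before retiring the old box:

netdom query fsmo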

Exchange Stops Working

Well, that was unexpected. With the removal of the old DC, Exchange went down. The Exchange service was left completely unable to start. What came next was a frantic, multi-hour search for a solution. Why did Exchange die? All had been working well on the new DC for weeks. Why would Exchange care about the removal of what was, at that point, nothing more than a Backup DC? It was time to dig into the Event logs on the Exchange server. First, I enabled a bit of advanced logging in Exchange:

Exchange System Manager → Find the appropriate server under Administrative Groups → Right-click and choose Properties → Diagnostics Logging tab → MSExchangeDSAccess (I was seeing many errors related to this source) → Set Topology to Medium logging level.
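If you’d rather pull the relevant entries out of the Application log from the command line, Server 2003’s built-in eventquery.vbs script can filter by source; something along these lines should work (treat the exact filter syntax as approximate):

cscript %SystemRoot%\system32\eventquery.vbs /L Application /FI "Source eq MSExchangeDSAccess"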

Here’s a sampling of the errors I was seeing:

Event Type:    Error
Event Source:    MSExchangeIS
Event Category:    General
Event ID:    1121
Description:
Error 0x80004005 connecting to the Microsoft Active Directory.

Event Type:    Error
Event Source:    MSExchangeSA
Event Category:    RFR Interface
Event ID:    9074
Description:
The Directory Service Referral interface failed to service a client request. RFRI is returning the error code:[0x3f0].

Event Type:    Error
Event Source:    MSExchangeDSAccess
Event Category:    Topology
Event ID:    2103
Description:
Process MAD.EXE (PID=####). All Global Catalog Servers in use are not responding:
DC1.fully.qualified.domain.name
DC2.fully.qualified.domain.name

It immediately became clear that the Exchange server was having trouble accessing the remaining DCs. The Exchange server wasn’t left completely without DC access, though. If the server had been unable to access AD at all, it wouldn’t have been able to log in to the domain or access the network, and all appeared to be working normally on that front. Not only on the Exchange server, but everywhere else on the network: workstations had no issues with network access, and no one had any issues logging into the domain. The issue appeared to lie with Exchange itself, and since the Exchange server was able to connect to the domain, the problem had to be with the Global Catalog (GC) service on the specific servers noted in the error logs, not with AD in general.

A GC server is a DC that has the GC service installed on it. GC is designed to catalog all of the objects on the network, and then return information about those objects to any machines that request it. One way of confirming that the issue was limited to GC was to check the Directory Access configuration in the Exchange System Manager:

Exchange System Manager → Find the appropriate server under Administrative Groups → Right-click and choose Properties → Directory Access tab
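As a cross-check against whatever the Directory Access tab reports, dsquery (run on a DC, or on any machine with the admin tools installed) will list every DC that is actually advertising itself as a GC:

dsquery server -isgc

Comparing that output against the Directory Access tab helps separate “the GC role isn’t really there” from “Exchange just can’t talk to it.”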

In my case, the “Automatically discover servers” option was checked, and all DCs were detected, but none were showing up as GCs. That was just further confirmation that the issue was specifically related to GC access. Before shutting down the old DC, I had been running three GCs: one on the old PDC, one on another old DC that had been in place for years, and one on the new DC. Now, with the old PDC shut down, Exchange seemed to be having trouble connecting to the two remaining GCs. Exchange relies very heavily upon the GC service, so it’s essential that it be able to access at least one DC with the GC service installed.

I first tried to troubleshoot the connectivity issue, exploring various DNS-related possibilities. Was the Exchange server unable to determine the correct IP address for the GCs? Was the internal firewall blocking access to the two GCs? I could find no faults in connectivity. The Exchange server could reach and receive a response from both GCs. It was even using one of the GCs as its authentication server. I ended up scouring countless articles and forum posts in search of an answer, coming up empty with each new attempt at a fix. Here are a few of the leads I was following (the basic reachability checks I ran are sketched after the link list):

http://technet.microsoft.com/en-us/library/cc771844.aspx
http://support.microsoft.com/kb/895858
http://forums.msexchange.org/m_140134600/printable.htm
http://www.petri.co.il/forums/showthread.php?t=21738
http://support.microsoft.com/kb/316300
http://support.microsoft.com/kb/250570
http://support.microsoft.com/default.aspx?scid=kb;en-us;q316790
http://support.microsoft.com/kb/315457
http://www.petri.co.il/forums/showthread.php?t=7005
http://www.eventid.net/display-eventid-2103-source-MSExchangeDSAccess-eventno-1421-phase-1.htm
http://support.microsoft.com/kb/281537
http://technet.microsoft.com/en-us/library/ff360140%28v=exchg.140%29.aspx
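For anyone retracing my steps, these are the sorts of reachability checks I ran while chasing the DNS and firewall theories (the server and domain names are the same placeholders used in the event log excerpt above):

nslookup -type=SRV _gc._tcp.fully.qualified.domain.name
nltest /dsgetdc:fully.qualified.domain.name /GC
portqry -n DC2.fully.qualified.domain.name -e 3268

The SRV lookup confirms that the GCs are published in DNS, nltest (from the Support Tools) asks the DC locator to actually hand back a GC, and portqry (a separate Microsoft download) confirms that the GC LDAP port, 3268, is reachable through any internal firewalls. Everything came back clean, which is what finally pushed me away from the connectivity theory.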

Eventually, I got to thinking about permissions. Since I was convinced that the GCs were fully reachable, the next area to examine was whether the Exchange server was allowed to communicate with the GCs. That would typically be the first area I would check, but I was thrown off in this case. One of the GCs that Exchange was failing to connect to had been in place for years. What could have gone wrong to make even that machine fail?
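Incidentally, a quick way to see exactly which accounts have been granted which user rights on a given server is to dump its local security policy to a text file (a rough sketch; the output path is arbitrary):

secedit /export /cfg C:\rights.inf

The [Privilege Rights] section of the exported file lists each right by its internal name (for example, SeSecurityPrivilege) along with the SIDs of the accounts that hold it. Keep that section in mind.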

A Long-Misconfigured Server Was the Culprit

It turned out that the older GC (the one that had been running for years) was lacking an important configuration, and it had probably been that way from day one. This would be a good time to note that I inherited this network, and every day working with it brings the potential for a new, baffling observation. As I’ve worked to upgrade the various network servers, I’ve been able to correct various oddities and mistakes, and this issue was just another oddity that needed fixing.

As I previously stated, Exchange relies heavily upon GC. In order to work with the GC service on a DC, the Exchange account needs to be granted the SeSecurityPrivilege right on that DC. Per Microsoft, this right is required to support various Exchange security functions, including the ability to report which Windows accounts are being used to gain access to mailboxes. In a typical network, this permission would be set via the Group Policy object that is applied to the network’s DCs. In the case of my network, that Group Policy object had been disabled, and the various DC permissions and configurations had been applied via each DC’s local Group Policy (another one of those oddities). So, the solution was to give the “Exchange Enterprise Servers” group the “Manage auditing and security log” permission on each GC.

This is an essential part of the Exchange setup process. In the case of my network, the permission had been applied to the two DCs that existed at the time Exchange was set up, and only one of those servers was a GC. Later, when another GC was added, the correct permission setting was never applied to that server. Exchange had been unable to communicate with that GC server for years. If the first GC had ever gone down, Exchange would have failed, just as it did when I purposefully took down the first GC. It had only been through sheer luck that the issue had never arisen in the past. Once I removed the first GC from the network, neither the second GC nor my newly set up GC had the correct permission setting applied to them. The only DC that had the correct permission was the one without the GC role installed on it, leaving the Exchange server without an accessible GC server.

Exchange came roaring back to life as soon as the remaining GCs were given the “Manage auditing and security log” permission. This article goes into detail regarding how to configure the permission: http://support.microsoft.com/kb/896703. As you can see in the article, the “Manage auditing and security log” permission is typically applied as part of Exchange’s domain preparation work. In my case, I manually added the permission (the local-policy path I used is sketched at the end of this post).

In the end, research and planning just aren’t enough to keep a network running. The ability to quickly determine what is going wrong, and then know where to look for a solution, is an essential skill. These random flukes and issues are all part of the job, and should be planned for accordingly. In other words, know that something will go wrong, and schedule plenty of time for any post-upgrade fixes.
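For reference, since my DCs were getting these settings from local policy rather than from a domain-level GPO, here is roughly where the setting lives on each GC (exact wording may vary slightly by OS version):

gpedit.msc → Computer Configuration → Windows Settings → Security Settings → Local Policies → User Rights Assignment → Manage auditing and security log → add the “Exchange Enterprise Servers” group

A gpupdate /force afterwards (or waiting for the next policy refresh) applies the new right; if DSAccess doesn’t pick the GCs back up on its own, restarting the Exchange services should do it.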