Could DNS be YOUR Problem?

The Domain Name System (DNS) is one of the most fundamental mechanisms of distributed systems. It is so basic and so widespread that it has become nearly invisible. As a service so obscured from view, it may seem inconsequential. Nothing could be further from the truth. It is, in fact, one of the main underpinnings of everything we do, so it can wreak havoc when it malfunctions.

We just take it for granted that our DNS is functioning flawlessly behind the scenes, but is it? Many performance and availability issues are misdiagnosed because of our blind faith in DNS. It is without doubt one of technology’s great innovations, but nothing in IT should be trusted so blindly. It turns out many DNS installations are misconfigured.

Let us first examine what is truly happening when an application transports data from point A to point B. Please bear with the Networking 101 basics, as many may need a refresher.

Ultimately, every such transaction sends an IP packet across the network. That packet carries two IP addresses, one for the source and one for the destination. This is important, because the wrong destination address means the packet cannot reach its intended recipient. Destination addresses are almost always obtained from DNS.

If moe.stooge.com wants to slap curly.stooge.com, he needs to identify curly as the destination. If he gets an incorrect address from DNS, he may end up slapping the hapless larry.stooge.com instead! Addresses provided by DNS must be correct. I think everyone understands that point.

Addresses are correct only if the DNS database is properly configured. Maintaining that database is a critical task, so strong management systems must be in place. Often they are not. In fact, I’ve seen people managing DNS by editing their database files with a text editor! If this is your situation, I implore you to implement better methods and tools!

DNS is also a distributed system, and a large organization is likely to run many DNS servers. In these distributed environments, the hierarchical interaction and inevitable data replication make DNS an inherently complex system. This complexity can result in data stagnation, configuration drift, fractured hierarchies, and synchronization problems. Some are easily diagnosed, such as a fractured hierarchy, where the system may stop working altogether. Most fall into those mysterious intermittent or slow categories that make diagnosis much more difficult.

The two most common problems are address mapping discrepancies and server timeouts.

Address mappings must agree across the various servers in use. Much of the DNS data is replicated across servers that are distributed by geography, business unit, or some other demarcation that makes sense. If you want an address for curly.stooge.com, all servers must resolve it to exactly the same IP address. It is possible (indeed likely, if management methods, policies and tools are weak) that some of this data will become corrupted. Such drift must be quickly identified and resolved. Ideally, DNS management systems prevent it from occurring in the first place.
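To make the idea of a mapping discrepancy concrete, here is a minimal Python sketch. It assumes you have already collected, by whatever means you have available, the answers each server gives for each name. The server names, the sample addresses, and the function name are all hypothetical, invented for illustration only.

```python
from typing import Dict, Set

# server name -> {hostname -> set of addresses that server returned}
Answers = Dict[str, Dict[str, Set[str]]]


def find_mapping_discrepancies(answers: Answers) -> Dict[str, Dict[str, Set[str]]]:
    """Return every hostname for which the servers do not all agree."""
    # Collect every hostname that any server knows about.
    hostnames = set()
    for records in answers.values():
        hostnames.update(records)

    discrepancies = {}
    for host in sorted(hostnames):
        # A server with no record for the host counts as an empty answer,
        # which is itself a disagreement with servers that do have one.
        per_server = {
            server: records.get(host, set())
            for server, records in answers.items()
        }
        distinct_answers = {frozenset(addrs) for addrs in per_server.values()}
        if len(distinct_answers) > 1:
            discrepancies[host] = per_server
    return discrepancies


# Hypothetical snapshot: two servers that disagree about curly.
SAMPLE = {
    "ns-east": {
        "curly.stooge.com": {"192.0.2.10"},
        "larry.stooge.com": {"192.0.2.11"},
    },
    "ns-west": {
        "curly.stooge.com": {"192.0.2.11"},
        "larry.stooge.com": {"192.0.2.11"},
    },
}

if __name__ == "__main__":
    for host, per_server in find_mapping_discrepancies(SAMPLE).items():
        print(host, per_server)
```

In this made-up snapshot, ns-west would send moe's slap to larry's address. A check along these lines, run regularly, is one way drift might be caught before users notice it.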

Server timeouts are an often overlooked issue, but one that can severely impact application performance. Most IP configurations allow multiple DNS servers to be defined for lookups. The principle is simple. If the main server fails to respond, the other(s) can be consulted as a backup.
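The fallback principle can be sketched in a few lines of Python. This is a simulation, not a real resolver: the query function is a hypothetical stand-in that returns None when a server does not answer within its timeout, and the server names and address are invented.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A query function takes (server, hostname) and returns an address,
# or None if the server did not respond before the timeout expired.
Query = Callable[[str, str], Optional[str]]


def resolve_with_fallback(
    name: str, servers: Sequence[str], query: Query
) -> Tuple[Optional[str], List[str]]:
    """Ask each server in order; return (address, servers that did not respond)."""
    unresponsive: List[str] = []
    for server in servers:
        address = query(server, name)
        if address is not None:
            return address, unresponsive
        # This server timed out, so fall through to the next one.
        unresponsive.append(server)
    return None, unresponsive


def fake_query(server: str, name: str) -> Optional[str]:
    """Hypothetical stand-in: 'ns1' never answers, any other server does."""
    if server == "ns1":
        return None
    return "192.0.2.10"


if __name__ == "__main__":
    print(resolve_with_fallback("curly.stooge.com", ["ns1", "ns2"], fake_query))
```

The lookup still succeeds, which is exactly why this kind of failure goes unnoticed: the only symptom is the time spent waiting on the server that never replied.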

Server timeouts can kill performance because the timeout interval is usually long relative to the overall application transaction. Default timeouts vary, but are typically measured in seconds. While a 2-second default may seem quick, most DNS requests are answered in milliseconds. An application with a normal response time of 4 seconds now takes 6 seconds with only one timeout. Timeouts often cascade, adding many seconds to the overall response time. We have become a very impatient society, so every second counts when it comes to end-user satisfaction.
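The cascade arithmetic is simple enough to write down. The sketch below assumes a deliberately crude model in which every timed-out lookup attempt adds one full timeout interval to the transaction; the function names and the figures in the example are illustrative, not measurements.

```python
def response_time(base_seconds: float, timeout_seconds: float,
                  timed_out_attempts: int) -> float:
    """Transaction time when each timed-out attempt adds one full timeout."""
    return base_seconds + timeout_seconds * timed_out_attempts


def slowdown_factor(base_seconds: float, timeout_seconds: float,
                    timed_out_attempts: int) -> float:
    """How many times slower the transaction is than its normal response time."""
    return response_time(base_seconds, timeout_seconds,
                         timed_out_attempts) / base_seconds


if __name__ == "__main__":
    # A 4-second transaction that hits three 2-second timeouts along the way.
    print(response_time(4.0, 2.0, 3))
    print(slowdown_factor(4.0, 2.0, 3))
```

Under these assumptions, three timeouts turn a 4-second transaction into a 10-second one, two and a half times slower, without a single lookup actually failing.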

I have worked with many clients whose performance problems were eventually diagnosed as DNS, but one stands out as legendary. I won’t mention the company, but it is a big one and their application performance was awful. They had examined countless performance reports to no avail. They were employing poor practices here as well, because each performance report was too narrowly focused.

I suggested examining their syslogs. An extraordinary amount of wonderful information is logged to syslog that can be mined for patterns. DNS is one service that can report to syslog. Luckily they were already feeding syslog events that included DNS messages into their management console, so we looked there. For some reason they had overlooked this, but one glance immediately told the whole story.

The syslog was riddled with DNS events. In fact, and this is NOT an exaggeration, 99.97% of their syslog events were DNS failures! There were thousands upon thousands each day! I asked how they managed their DNS. It was an all-too-typical manual approach undertaken by multiple IT groups on multiple server types (Windows, Unix, Linux, and some you thought were extinct) and synchronization was deplorable. Timeouts and incorrect resolutions were occurring all the time!
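If you want to run the same sanity check on your own logs, a rough Python sketch follows. The pattern used to spot a DNS failure and the sample log lines are assumptions invented for illustration; real messages differ by DNS server, resolver, and logging configuration, so the pattern would need to be adapted to what your systems actually emit.

```python
import re
from typing import Iterable, Tuple

# Assumed pattern: a line that mentions DNS and a failure keyword.
# Adjust this to match the messages your own servers really log.
DNS_FAILURE = re.compile(r"\bdns\b.*\b(timed out|failed|servfail)\b",
                         re.IGNORECASE)


def dns_failure_share(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (dns_failures, total_events, share of events that are DNS failures)."""
    total = 0
    failures = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        if DNS_FAILURE.search(line):
            failures += 1
    share = failures / total if total else 0.0
    return failures, total, share


# Invented sample lines, not taken from any real system.
SAMPLE_LINES = [
    "Jan 10 09:00:01 app01 resolver: DNS lookup for curly.stooge.com timed out",
    "Jan 10 09:00:02 app01 resolver: DNS lookup for larry.stooge.com failed",
    "Jan 10 09:00:03 app01 sshd: session opened for user moe",
    "Jan 10 09:00:04 app02 resolver: DNS lookup for curly.stooge.com timed out",
]

if __name__ == "__main__":
    failures, total, share = dns_failure_share(SAMPLE_LINES)
    print(f"{failures} of {total} events are DNS failures ({share:.2%})")
```

To try it on a real file, pass an open file object instead of the sample list, since the function accepts any iterable of lines.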

I recommended a relatively inexpensive DNS management product (from Infoblox, if I recall correctly) to replace their haphazard manual methods. I also advised the most senior people that they needed to consolidate ownership and accountability for DNS immediately (significant political barriers were a major root cause – go figure!). Failure to take these actions would continue to erode senior management’s faith in IT and punitive outsourcing was a distinct possibility.

I am happy to report that they followed my advice and they were quickly able to reconcile the DNS issues. Application services improved dramatically and confidence in IT took a big step in the right direction. This problem existed for years and evolved (devolved?) into a monster that plagued them for months. All it took was a different perspective on something simple but shrouded and therefore never even considered as suspect. Once identified, the resolution was rapid.

When trying to understand the complex nature of today’s IT systems, you must question the integrity of every aspect of those systems. Assume nothing is truly working. If you ignore any single element, the results can be devastating.
