Data-Driven Domain Name System (DNS)
Network designers are challenged with endpoint selection. How do you get eyeballs to the correct endpoint in multi-data centre environments? A recent podcast from Ivan Pepelnjak pointed me to a company called “NSONE”, based in New York. CEO Kris Beevers joined Ivan to discuss their Domain Name System (DNS) based products. The NSONE product set offers high-performance DNS, enabling customers to take complete control over where their end users are routed. They describe themselves as “air traffic control” for your site, offering probing mechanisms that extract real-time data from your infrastructure for automatic traffic management. NSONE offers a data-driven DNS approach, optimizing traffic management to and from the data centre. So what is data-driven DNS, and how can NSONE add value?
Data-driven DNS is about optimizing DNS, making it more intelligent by enabling efficient routing towards data centres. It governs how end users are directed to services and on what criteria. Traffic engineering across multiple data centres points users to different locations based on different metrics. NSONE have a platform for DNS. They offer a global anycast network consisting of 18 PoPs across 6 continents, sold in a SaaS (Software as a Service) model. They also provide private DNS services, topologically different for each customer. The private DNS platform is dedicated entirely to your application and managed by NSONE experts.
Back to Basics – DNS Fundamentals
DNS is a naming system that is both hierarchical and distributed. A single name can also be assigned to multiple machines (for example, www.abc.com maps to 10.10.10.10 and 10.10.10.20).
DNS servers are machines that respond to DNS queries sent by clients, translating between names and IP addresses. There are differences between an authoritative DNS server and a caching server. A caching-only server is a name server that does not have any zone files and is not authoritative for any domain. Caching speeds up the name-resolution process, and it is both a positive and a negative element of DNS. On the positive side, caching reduces delay and the number of DNS packets transmitted. On the negative side, it can produce stale records, resulting in applications connecting to invalid IP addresses and increasing the time applications take to fail over to secondary services. The Time-to-Live (TTL) field plays an important role in DNS: it controls how long a record may be stored in the cache. Choosing the right TTL per application is an important task. A short TTL generates many queries, while a long TTL means record changes are picked up slowly. DNS proxies and DNS resolvers usually honor the TTL set on a record, as they should. However, applications do not necessarily honor the TTL, which becomes problematic during failover events.
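The TTL trade-off can be illustrated with a minimal sketch of a TTL-bound cache. The `TtlCache` class and its injectable `clock` parameter are hypothetical names chosen for this example; real resolvers are considerably more involved.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TtlCache:
    """Toy DNS cache: each entry expires once its TTL has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, name: str, addresses: List[str], ttl: float) -> None:
        # Store the answer together with its absolute expiry time.
        self._entries[name] = (self._clock() + ttl, list(addresses))

    def get(self, name: str) -> Optional[List[str]]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, addresses = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller re-queries upstream.
            del self._entries[name]
            return None
        return addresses
```

Until the TTL runs out, `get` keeps returning the cached addresses even if the authoritative record has changed, which is exactly the stale-record window described above.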
Site-Selection Considerations – Load Balance Data Centres?
DNS is used to perform site selection. Multi-data centre deployments use different IP endpoints in each data centre, and DNS-based load balancing is used to send clients to one of the data centres. Start simply with random DNS responses, then slowly migrate to geolocation-based DNS load balancing. There are many load balancing strategies, and different designs match different requirements.
Try to combine the site selector (the device that monitors the data centres) with routing, such as Route Health Injection (RHI), to overcome the limitation of cached DNS entries. DNS is used on the outside to distribute load among data centres, and an Interior Gateway Protocol (IGP) is used to reroute traffic internal to the data centre. Avoid false positives by tuning the site selector accordingly.
DNS is not always the best way to fail over a data centre. DNS failover can quickly influence 90% of incoming data centre traffic within the first few minutes. If you want 100% of traffic, you will probably need additional routing tricks, such as advertising the IP of the secondary data centre with conditional route advertisements or some other form of route injection.
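As a rough sketch of the simplest strategy, random DNS responses, the snippet below picks one endpoint per query, optionally weighted. The `DATA_CENTRES` table and `pick_endpoint` function are hypothetical, and the addresses come from documentation-only ranges.

```python
import random
from typing import Dict, Optional

# Hypothetical endpoints (documentation address ranges) and relative weights.
DATA_CENTRES: Dict[str, int] = {
    "192.0.2.10": 1,     # data centre A
    "198.51.100.10": 1,  # data centre B
}


def pick_endpoint(weights: Dict[str, int],
                  rng: Optional[random.Random] = None) -> str:
    """Return one endpoint at random, in proportion to its weight."""
    rng = rng or random.Random()
    endpoints = list(weights)
    return rng.choices(endpoints,
                       weights=[weights[e] for e in endpoints],
                       k=1)[0]
```

Setting a weight to 0 drains a data centre without removing it from the table, which can be handy while migrating towards a smarter policy.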
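One common way to tune a monitor against false positives is to require several consecutive failed probes before declaring a site down, and several successes before bringing it back. The sketch below assumes that approach; `SiteHealth`, `fail_threshold` and `rise_threshold` are hypothetical names, not settings of any particular site selector.

```python
class SiteHealth:
    """Flips state only after several consecutive contradicting probes."""

    def __init__(self, fail_threshold: int = 3, rise_threshold: int = 2) -> None:
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict current state

    def record_probe(self, success: bool) -> bool:
        """Feed one probe result and return the (possibly updated) state."""
        if success == self.healthy:
            # Probe agrees with the current state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.fail_threshold if self.healthy else self.rise_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

A single dropped probe therefore never triggers a failover; the cost is that a real outage is detected a few probe intervals later.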
The Application is Changing
The application has changed and DNS needs to be more intelligent. Users look up an “A” record for www.XYX.com and there are two answers. When you have more than one answer, you have to think more about zone file presentation: what you offer, and based on what criteria or metrics. Previously, DNS was a viable solution with BIND. You had a primary/secondary server redundancy model with a very static configuration. People weren’t building applications with distributed data centre requirements. Application requirements started to change in the early 2000s with anycast DNS. DNS with anycast became more reliable and offered faster failover. Nowadays performance is more of an issue. How quickly can you spit out an answer?
Ten years ago, having the same application in two geographically dispersed data centres was a big deal. Now, you can spin up active-active applications in dispersed MS Azure and Amazon locations in seconds. Tapping new markets in different geographic locations takes seconds. The barriers to deploying applications in multiple data centres have fallen, and we can now deploy multiple environments with ease.
Geographic Routing and Smarter Routing Decisions
Geographic routing is where you try to figure out where a user is coming from based on a GeoIP database. From this information, you can direct requests to the closest data centre. This doesn’t always work, and you may experience performance problems.
NSONE wants to add intelligence to how you offer locations to end users and direct them to end targets. They take in all kinds of network telemetry about the customer’s infrastructure and what is happening on the Internet right now, and then make smarter routing decisions. They analyse information about the end-user application to get an idea of what’s going on: where are you, and how fast are your pipes? The more they know, the more granular the routing decisions can be. Are your servers overloaded, and how close to saturation are your Internet or WAN pipes? They get this information through an API-driven approach, not by dropping agents on servers. NSONE also works with existing monitoring software, such as Catchpoint, CloudHelix and Amazon CloudWatch. All of these feed and integrate into their DNS product.
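A minimal sketch of the geographic step, assuming the GeoIP lookup has already turned the client address into coordinates. The `DATA_CENTRE_LOCATIONS` table and both function names are hypothetical, and great-circle distance is used here as a stand-in for "closest".

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

# Hypothetical data centre locations.
DATA_CENTRE_LOCATIONS: Dict[str, Coord] = {
    "new-york": (40.71, -74.01),
    "london": (51.51, -0.13),
    "singapore": (1.35, 103.82),
}


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_data_centre(client: Coord, sites: Dict[str, Coord]) -> str:
    """Return the name of the site with the smallest distance to the client."""
    return min(sites, key=lambda name: haversine_km(client, sites[name]))
```

Note that this only minimizes distance on the globe; it knows nothing about the network path, which is why it can still produce poor performance.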
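To give a feel for how telemetry might shape an answer, here is a small sketch that drops endpoints that are down or near saturation before any selection takes place. It is not NSONE's API; the `Endpoint` record, the `load` field and the `max_load` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    address: str
    up: bool      # result of the last health check
    load: float   # reported utilisation, 0.0 (idle) to 1.0 (saturated)


def eligible_endpoints(endpoints: List[Endpoint],
                       max_load: float = 0.8) -> List[Endpoint]:
    """Keep endpoints that are up and below the load threshold."""
    healthy = [e for e in endpoints if e.up]
    unsaturated = [e for e in healthy if e.load < max_load]
    # If every healthy endpoint is saturated, fall back to all healthy
    # ones rather than returning an empty DNS answer.
    return unsaturated or healthy
```

The fallback is a design choice: an overloaded endpoint is usually still a better answer than no answer at all.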
Geographical Location – Solution!
The first problem with geographical location is network performance: geographic proximity says little about how close things actually are on the network. The second is that you are looking at the resolving DNS server and not the actual client. You receive the IP address of the DNS resolver, not the end client’s IP address. Also, at times the user is using a DNS server that is not located where they are.
The first solution is an extension to the DNS protocol, “EDNS Client Subnet”. This gets the DNS resolver to forward information about the end user too, including part of the end user’s IP address.
Google and OpenDNS will forward the first 3 octets of the IP address, attempting to provide geographic routing based on the IP address of the actual end user and not the DNS resolver.
If you are trying to optimize response times or minimize packet loss, you should measure the metrics you are trying to optimize and then make a routing decision based on them. Capture all the information and then turn it into routing data.
Sending users to the “best” server varies from application to application; what “best” means really depends on the application. Some applications depend heavily on response times. Others, for example a streaming company, don’t care about the RTT of returning the first MPEG file. It depends on the application and what routing you want.
DNS pinning in browsers is enabled due to security problems with DNS spoofing. Browsers that don’t honor the TTL get stuck with the same IP for up to 15 minutes. Applications should always honor the TTL for the reasons mentioned at the start of the post.
There is no notion of session stickiness with DNS. DNS has no sessions, but what you can do is use consistent route hashing so the same clients go to the same data centre. Route hashing optimizes cache locality; it’s like stickiness for DNS. Steer a given user to the same data centre based on source IP address or “EDNS client subnet” information.
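Forwarding only the first 3 octets amounts to truncating the client address to a /24 prefix. The sketch below shows that truncation with Python's standard `ipaddress` module; the `client_subnet` function is a hypothetical helper, and the /56 default for IPv6 is an assumption added for completeness.

```python
import ipaddress


def client_subnet(address: str, v4_prefix: int = 24, v6_prefix: int = 56) -> str:
    """Truncate a client address to the prefix a resolver might forward."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False zeroes the host bits instead of raising an error.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network)
```

For example, `client_subnet("203.0.113.77")` yields `203.0.113.0/24`: enough for the authoritative server to locate the user roughly, without exposing the full address.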
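A small sketch of turning measurements into a routing decision, assuming round-trip-time samples in milliseconds have already been collected per data centre. The `best_by_rtt` function and the sample layout are hypothetical; the median is used here so that one slow probe does not dominate the choice.

```python
from statistics import median
from typing import Dict, List


def best_by_rtt(samples: Dict[str, List[float]]) -> str:
    """Return the endpoint with the lowest median RTT (milliseconds)."""
    # Endpoints without any measurements are ignored.
    medians = {name: median(rtts) for name, rtts in samples.items() if rtts}
    if not medians:
        raise ValueError("no RTT samples available")
    return min(medians, key=medians.get)
```

The same shape works for packet loss or any other metric: measure the thing you care about, reduce it to one number per endpoint, and pick on that number.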
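One possible way to get this kind of stickiness is rendezvous (highest-random-weight) hashing, sketched below. The post does not say which hashing scheme is used in practice, so treat this as an illustrative choice; `sticky_data_centre` and the data centre names are hypothetical.

```python
import hashlib
from typing import Iterable


def sticky_data_centre(client_key: str, data_centres: Iterable[str]) -> str:
    """Map a client key (e.g. its /24 subnet) to one data centre, repeatably."""

    def score(dc: str) -> int:
        # Hash the (client, data centre) pair; the highest score wins.
        digest = hashlib.sha256(f"{client_key}|{dc}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(data_centres, key=score)
```

The same client key always lands on the same data centre, and removing a data centre only remaps the clients that were assigned to it, so cached data elsewhere stays warm.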