Enabling Dual Stack on LinkedIn CDNs
January 19, 2018
A few months ago, LinkedIn surpassed the 50% IPv6 traffic milestone. In this post, we will look into the methodology we adopted to measure performance as we enabled IPv6 on our content delivery networks (CDNs), and share some key results of our performance analysis. We hope this information will help readers who are undertaking similar networking changes.
The Edge SRE team runs LinkedIn’s edge infrastructure, including four external CDNs, an in-house CDN, three Domain Name System (DNS) platforms, and all LinkedIn points of presence (PoPs). We build and manage tools to automate all aspects of our Edge stack.
We began to transfer our network traffic from IPv4 to IPv6 for several reasons, including the fact that the internet is running out of IPv4 addresses, and that IPv6 can be faster than IPv4, especially on mobile networks (the source of a majority of our member traffic). In 2013, we enabled IPv6 dual stack on our production mail servers. In 2014, we enabled it across all our data centers and CDNs, except for our CDNs in China. However, due to limited IPv6 coverage on some of our CDNs, the performance of their dual stack networks was not on par with the IPv4-only networks. In 2016, LinkedIn onboarded two new CDN partners, but we decided to hold off on enabling IPv6 until we had analyzed the performance of their dual stack networks and addressed any issues that we found.
Enabling IPv6 on a third-party CDN is trivial. A CDN will typically convert your provisioned CNAME to support a dual stack configuration or will provide a new CNAME that is dual stack enabled. For our pre-ramp performance analysis, our CDNs provided dual stacked test CNAMEs to work with.
Our objective was to enable IPv6 without impacting our members. Site reliability is important to LinkedIn and part of our "Members First" company values. We wanted to ensure that there was no negative impact to member experience on the site as a result of us starting to serve content over dual stack networks.
During the pre-ramp phase, we used a mix of third-party real-user measurement (RUM) through Cedexis and synthetic monitoring through Catchpoint:
- We used Cedexis to measure the performance and availability, as experienced by members, of a test object on our CDNs. We grouped the results by country and then by the ASNs that carry a majority of LinkedIn’s traffic.
- We used Catchpoint to dig deeper into performance and availability issues that surfaced during the experiment.
We uncovered a number of possible member-impacting issues over the course of this testing.
We noticed DNS resolution issues over IPv6 with one of our CDN partners in a major region in India. We worked closely with the provider on this issue, and they eventually set up an IPv6-enabled DNS PoP in the region, which significantly improved resolution times.
From this experience, we determined that it was important to monitor network timing metrics, such as DNS, connect, SSL, request, and response times, when evaluating how the shift to IPv6 affects members.
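As an illustration of the metrics listed above, here is a minimal sketch of how one timing entry could be split into those phases. It assumes the input is a dictionary carrying the standard Navigation Timing / Resource Timing field names (`domainLookupStart`, `connectEnd`, and so on) in milliseconds; the function name `timing_phases` and the sample values are hypothetical and are not taken from our tooling.

```python
from typing import Dict


def timing_phases(entry: Dict[str, float]) -> Dict[str, float]:
    """Split one timing entry into network phases, in milliseconds."""
    # secureConnectionStart is 0 when no TLS handshake was measured
    # (for example, plain HTTP or a reused connection).
    tls_start = entry.get("secureConnectionStart", 0)
    ssl = entry["connectEnd"] - tls_start if tls_start > 0 else 0.0
    return {
        "dns": entry["domainLookupEnd"] - entry["domainLookupStart"],
        # connect covers TCP setup plus the TLS handshake, if any.
        "connect": entry["connectEnd"] - entry["connectStart"],
        "ssl": ssl,
        "request": entry["responseStart"] - entry["requestStart"],
        "response": entry["responseEnd"] - entry["responseStart"],
    }


# Hypothetical sample entry, for illustration only.
sample = {
    "domainLookupStart": 10, "domainLookupEnd": 35,
    "connectStart": 35, "secureConnectionStart": 60, "connectEnd": 110,
    "requestStart": 111, "responseStart": 180, "responseEnd": 195,
}
print(timing_phases(sample))
```

Comparing these per-phase values between IPv4 and dual stack measurements makes it easier to see whether a regression comes from DNS, connection setup, or the response itself.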
Optimizing CDN usage
One of our CDN partners had limited IPv6 coverage on their PoPs, and targeted measurements showed a clear corresponding performance degradation over IPv6 for members in the affected areas. LinkedIn uses RUM-based DNS to steer traffic to CDN providers and work around such performance issues. With RUM DNS, member browsers and mobile clients report how fast each CDN is for a given network, and members on the same network are then steered to the best-performing CDN. As a result, RUM DNS steering can be used to limit member impact if there are issues.
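The steering idea can be sketched in a few lines. This is a simplified illustration, not the logic of any particular RUM DNS product: it assumes each measurement is a hypothetical `(asn, cdn, latency_ms)` tuple, scores each CDN by median latency per network, and ignores CDNs with too few samples. The CDN names and the ASNs (taken from the documentation range) are placeholders.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

# (ASN, CDN name, latency in ms) as reported by a client; hypothetical shape.
Sample = Tuple[int, str, float]


def best_cdn_per_network(samples: Iterable[Sample],
                         min_samples: int = 3) -> Dict[int, str]:
    """Pick, for each ASN, the CDN with the lowest median latency."""
    latencies: Dict[Tuple[int, str], List[float]] = defaultdict(list)
    for asn, cdn, latency_ms in samples:
        latencies[(asn, cdn)].append(latency_ms)

    best: Dict[int, Tuple[float, str]] = {}
    for (asn, cdn), values in latencies.items():
        if len(values) < min_samples:
            continue  # too little data to trust this CDN on this network
        score = median(values)
        if asn not in best or score < best[asn][0]:
            best[asn] = (score, cdn)
    return {asn: cdn for asn, (_, cdn) in best.items()}


# Placeholder measurements for two networks and two CDNs.
measurements = [
    (64500, "cdn-a", 40), (64500, "cdn-a", 42), (64500, "cdn-a", 45),
    (64500, "cdn-b", 80), (64500, "cdn-b", 30), (64500, "cdn-b", 90),
    (64501, "cdn-a", 120), (64501, "cdn-a", 130), (64501, "cdn-a", 125),
    (64501, "cdn-b", 60), (64501, "cdn-b", 65), (64501, "cdn-b", 70),
]
print(best_cdn_per_network(measurements))
```

Using the median rather than the mean keeps a single slow or fast outlier from flipping the decision for a whole network.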
We found that members in a certain geography were being misrouted over IPv6 to distant CDN edges, rather than to the edges closest to them. Some providers’ dual stack network maps differ from their IPv4-only network maps, which can result in routing discrepancies and affect performance. We worked with the CDN partner, who in turn worked with their upstream provider, to fix the routing in these instances. This illustrated the importance of identifying incorrect routes when troubleshooting IPv6 issues.
These headers give us the ability to slice the RUM data by client IP version.
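A minimal sketch of this kind of slicing follows. It assumes each RUM record has already been reduced to a hypothetical `(client_ip, load_time_ms)` pair; how the client address is carried (header names, log format) is not shown here and would depend on the CDN. The addresses come from the documentation ranges.

```python
import ipaddress
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_by_ip_version(
        records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group load times by client IP version and return the median of each."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for client_ip, load_time_ms in records:
        # ip_address() parses both families; .version is 4 or 6.
        version = ipaddress.ip_address(client_ip).version
        buckets[f"IPv{version}"].append(load_time_ms)
    return {label: median(values) for label, values in buckets.items()}


# Placeholder records, for illustration only.
records = [
    ("192.0.2.10", 300.0), ("198.51.100.7", 340.0),
    ("2001:db8::1", 280.0), ("2001:db8::2", 320.0),
]
print(median_by_ip_version(records))
```

Note that an IPv4-mapped IPv6 address such as `::ffff:192.0.2.1` would be counted as IPv6 by this sketch, so real data may need that case handled explicitly.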
Findings of post-ramp analysis
- In North America and Europe, the performance of IPv6 dual stacked networks was on par with IPv4 networks, and in some cases, was better.
- In India, there is still room for improvement because providers have limited PoP coverage in this region; however, with modern browsers, we rely on Happy Eyeballs to fall back to IPv4 when needed.
- We need increased IPv6 support from carriers in China. According to APNIC, IPv6 usage in China was still less than 2% in 2017.
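For readers unfamiliar with the Happy Eyeballs fallback mentioned in the findings above, the sketch below illustrates one part of the idea described in RFC 8305: the client orders its candidate addresses so that the two families alternate, starting with IPv6, and then tries them in that order with a short delay between attempts. This is a simplified illustration only; it covers the ordering step, not the staggered connection attempts, and the function name `interleave_families` is hypothetical.

```python
import ipaddress
from itertools import zip_longest
from typing import List


def interleave_families(addresses: List[str]) -> List[str]:
    """Order addresses so IPv6 and IPv4 alternate, IPv6 first."""
    v6 = [a for a in addresses if ipaddress.ip_address(a).version == 6]
    v4 = [a for a in addresses if ipaddress.ip_address(a).version == 4]
    ordered: List[str] = []
    # zip_longest pads the shorter family with None once it runs out.
    for six, four in zip_longest(v6, v4):
        if six is not None:
            ordered.append(six)
        if four is not None:
            ordered.append(four)
    return ordered


# Placeholder addresses from the documentation ranges.
print(interleave_families(["192.0.2.1", "192.0.2.2", "2001:db8::1"]))
```

Because an IPv4 address sits right behind the first IPv6 address, a client on a network with broken IPv6 connectivity only waits for the short attempt delay before trying IPv4.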
We’ve analyzed the Navigation Timing API data, and we’re beginning to dig into the Resource Timing API data for new discoveries and optimizations. We’ve completed our analysis for desktop members, but we want to surface the same data for mobile members. As we continue to sift through the data, we’ll have a better understanding of which areas we can concentrate on for improvement. As IPv6 adoption continues to grow, we expect the performance and availability of IPv6 networks to surpass those of IPv4 networks.