IPv6 Inside LinkedIn Part III: The Elephant in the Room
November 10, 2016
Coauthor: Tim Crofts
The LinkedIn site has been available to the public over IPv6 since 2014, and our employees have been able to browse the internet over IPv6 for even longer. In Part I of this series, we explained why we decided to move our internal network over to IPv6. In Part II, we looked at the challenges we faced when we began to enable dual stack in our data centers (with the goal of one day removing IPv4 altogether). In this final post, we’ll look at how to install and manage servers (and other devices) on IPv6-only networks, and consider IPv6 from a software engineering point of view. We have not yet built an IPv6-only data center, but this post describes our progress towards that goal.
IPv6 in large deployments
One person, or a small team, can install, configure, and manage a limited number of devices (servers, switches, routers, etc.) manually. In any large network, with hundreds or even thousands of devices, it gets more complicated, and you will want to start automating some of these functions. This is one dimension of scaling your infrastructure. And as teams grow, more and more people are doing the same work on the same devices, so it’s necessary to have tools that ensure knowledge is transmitted, changes are reviewed, etc. This is another dimension of scaling.
There are several considerations that must be addressed when operating at scale, including provisioning, out-of-band management (IPMI), configuration management, monitoring, and decommissioning.
The following paragraphs highlight some of the gaps we have identified in our ability to perform the above functions in an environment with IPv6. In order to make this discussion more concrete, some topics are focused on the software we have experience with, with references to open bug reports when known. We would like to thank all of our partners and the community for helping to address these gaps.
Provisioning
When building a large network of servers, several different teams are usually involved in the build process. Data center technicians rack and cable devices, while network and systems engineers configure them and make them operational on the network.
You want as few manual processes and touch points as possible; in the ideal scenario, you build one router and one server manually, and all of the other network equipment and servers will be auto-installed and configured from these two devices. This is called Zero Touch Provisioning (ZTP). To be fully automatic, a device (when powered on) will need to know it is not configured, look for its configuration on the network, and then install itself.
On the network side, there are protocols, such as Open Network Install Environment (ONIE), for bare metal switches. On the server side, there are protocols like Preboot eXecution Environment (PXE). These two technologies function in a similar manner. If a device is not configured, it will obtain an IP address (usually via DHCP) and then, via a DHCP option, use TFTP to download a small image or script that will be executed to retrieve a better and larger image with configuration files via a more complex protocol, such as HTTP. Installation and configuration of the device can then take place. This bootstrap function is part of the hardware, located in firmware on motherboards and network cards. Since this firmware has limited space, the TFTP client is preferred, because it is simple and small. However, TFTP may not work well with today’s large operating system images, in part because it uses UDP which, unlike TCP, is not designed for reliability. For this reason, a small image is loaded first, which then allows something like HTTP (a better but more complex protocol) to download the larger image.
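For illustration, the IPv4 side of this flow is typically driven by two DHCP directives, as in this ISC dhcpd sketch (addresses and filenames are hypothetical, not our production config):

```
subnet 10.0.0.0 netmask 255.255.255.0 {
    range 10.0.0.100 10.0.0.200;
    next-server 10.0.0.5;        # TFTP server holding the boot image
    filename "pxelinux.0";       # small network boot program fetched over TFTP
}
```

The small boot program named in `filename` is what later fetches the full installer image over a richer protocol.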
The IPv4 process described above also needs to work as well as, or better than, it does today on IPv6 for a dual stack environment to succeed. However, adding a second network stack to the firmware may cause the firmware to exceed its storage capacity. To work around this, the traditional BIOS is being supplanted by the Unified Extensible Firmware Interface (UEFI), a better framework for today’s world. However, UEFI is relatively new, and support for IPv6 has only been defined in the most recent versions. Flashing newer firmware onto older or current devices is complex and creates downtime; in fact, it can often brick the device. For this reason, it is ideal for the hardware to be shipped with newer firmware that supports UEFI and IPv6. As a result, the ability to boot over IPv6 may only be available in upcoming servers that arrive out-of-the-box with the latest firmware.
On an IPv6-only network, UEFI Network Boot may only be able to get an IPv6 address using autoconfiguration, and not DHCPv6. This may require changes in your provisioning architecture if you don’t control which IP address the server is using (on IPv4, the IP address is given to the device using DHCP, based on rules set in the centralized DHCP configuration). Some TFTP clients have been found to be limited when using IPv6 and are only capable of accessing files within their own local /64 network, which means images cannot be retrieved from a different network, and devices cannot be bootstrapped from another data center across the globe.
For all of these reasons, provisioning servers (or network devices, for that matter) over an IPv6-only network is not straightforward. Options for the selection of the right devices and network cards are limited. Devices can still be provisioned over IPv4 and then operate in an IPv6-only environment, but this involves running multiple IP transport stacks. Therefore, when migrating from a dual-stack data center to an IPv6-only data center, it is very important to ensure the devices can be provisioned over IPv6 only; otherwise, an IPv4 network will have to remain just to be able to re-image devices.
We are currently testing server provisioning over IPv6 only, and working with our vendors to get it working. Anyone interested in one day running an IPv6-only environment should check with their vendors too, as the hardware lifecycle is at least three years, and it could become desirable to migrate to an IPv6-only environment during that timespan.
UEFI Network Boot over IPv6
Using new network firmware on Supermicro, we are able to UEFI Network Boot over IPv6 using grub2. This is how we do it.
We first set up a DHCPv6 server with the following option:
option dhcp6.bootfile-url "tftp://[2001:DB8::245]/bootx64.efi";
We set up TFTP and radvd servers on [2001:DB8::245]. When booting over IPv6, UEFI Network Boot receives the DHCPv6 option, downloads bootx64.efi over TFTP, and executes it. We tried to load Linux and the initrd over the network over IPv6 using either HTTP or TFTP with grub2.02-beta2; however, the downloaded files failed to execute. Grub requires more development on IPv6 to have feature parity with its behavior on IPv4.
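For reference, a minimal radvd configuration for this setup might look like the following sketch (the interface name is an assumption). Setting the Managed flag tells clients to use DHCPv6, which is what delivers the bootfile-url option above:

```
interface eth0
{
    AdvSendAdvert on;
    AdvManagedFlag on;        # clients should use DHCPv6 for addresses/options
    prefix 2001:DB8::/64
    {
        AdvOnLink on;
        AdvAutonomous off;    # addresses come from DHCPv6, not SLAAC
    };
};
```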
To solve this problem, everything needed was packed into a memdisk included inside bootx64.efi; Grub2 just needed to chainload this image (Linux + initrd).
To create the memdisk:
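The exact commands will vary by distribution, but a sketch using grub2-mkstandalone (which embeds the listed files in a memdisk inside the resulting EFI binary; paths here are illustrative) could look like:

```
# Pack grub.cfg, kernel, and initrd into a single EFI image.
# Inside grub, these files appear under (memdisk)/boot/...
grub2-mkstandalone -O x86_64-efi -o bootx64.efi \
    "boot/grub/grub.cfg=grub.cfg" \
    "boot/vmlinuz=vmlinuz" \
    "boot/initrd.img=initrd.img"
```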
The bootx64.efi image is relatively big, about 42M in our case, mainly because of the size of our initrd.img. Needless to say, TFTP is not optimal for downloading such a big file, as it relies on UDP instead of TCP (TCP provides the retransmission and congestion control that UDP lacks). Although it should be possible per the UEFI specification, we have yet to be successful in using HTTP for UEFI Network Boot over IPv6. Until this works correctly, TFTP is a workable solution.
In this configuration, UEFI will download the image, chainload to the vmlinuz in the memdisk (no need to download it separately from the network using Grub), and this kernel will contact the kickstart server and start the configuration of the server.
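The grub.cfg carried inside the memdisk can then be as small as the following sketch (the kickstart URL and kernel arguments are illustrative assumptions):

```
# grub.cfg inside the memdisk: boot the embedded kernel and point
# Anaconda at the kickstart server over IPv6
linux (memdisk)/boot/vmlinuz inst.ks=http://[2001:DB8::245]/ks.cfg
initrd (memdisk)/boot/initrd.img
boot
```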
Anycast for VIPs
While many load balancers support IPv6, we took a different approach from the available options in order to improve scalability. We configure VIPs on servers by adding IPs on the local interface, and then advertise these IPs via BGP sessions using bird and bird6 to the connected switch. Because this is an internal network, we can choose the level of aggregation for the advertised network: /24 for IPv4 and /128 for IPv6. However, you need to be very careful because some network gear may not be able to hold more than 128 routes for /128 prefixes, so it may be a better strategy to reserve a complete /64 even for one IPv6 address used as anycast. This allows us greater flexibility by eliminating the requirement that the servers be in the same network segment as the load balancer.
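As a hedged sketch of this setup (the ASNs, addresses, and filter style are assumptions, not our production configuration), a bird6 configuration that announces a /128 anycast VIP to the connected switch could look like:

```
# bird6.conf: advertise an anycast VIP bound to the loopback
router id 10.0.0.1;

protocol kernel { scan time 10; }
protocol device { scan time 10; }

protocol static {
    route 2001:DB8:100::1/128 via "lo";   # the anycast VIP
}

protocol bgp tor {
    local as 65001;
    neighbor 2001:DB8:1::1 as 65000;      # the connected switch
    export where source = RTS_STATIC;     # only advertise the VIP routes
    import none;
}
```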
We are using CFEngine to configure the anycast VIPs and we had to modify some RedHat network scripts to make the configuration similar on IPv4 and IPv6. This is described in the section on CFEngine.
IPMI
IPMI is a very important part of detecting hardware problems before they have noticeable effects, and of being able to query a machine’s status even when the operating system is down.
Intelligent Platform Management Interface (IPMI) is a standard to access a server without having an operating system installed. It allows an administrator to remotely control and view the screen, for instance, while the server is booting up, or to monitor the health of various hardware components like the hard drive. This access can be “out of band” or “side band.” “Out of band” means that an interface is reserved for this management and is in its own network. In a “side band” design, the same physical interface is used for normal network traffic and management in separate networks.
Because management may happen before any operating system is installed, IPMI must be able to be automatically configured on an IPv6-only network. Later on, once the operating system is installed, ipmitool on the server will allow the user to query and set hardware components.
ipmitool on Linux has supported IPv6 since version 1.8.14, which was released in May 2015. However, it still needs to find its way into some Linux distros; for instance, the right version is not yet present in RedHat 7, so we need to create our own build.
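With a recent enough build, the BMC’s IPv6 LAN parameters can be inspected via the lan6 subcommand added in 1.8.14; for example (the channel number is hypothetical, and parameter names can vary by BMC firmware):

```
# Show the IPv6 configuration of BMC channel 1
ipmitool lan6 print 1

# Restrict the output to the static address parameters
ipmitool lan6 print 1 static_addr
```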
In addition, you need a network card with the appropriate firmware to support IPMI IPv6 extensions. Such a network card would also likely be able to do UEFI and PXE over IPv6. We have been trialling a Redfish-compatible firmware as specified by the Distributed Management Task Force with some of our vendors.
For long-term support, administrators may have to update the Linux version currently running to a supported version with complete IPv6 support; this may have downstream impacts, not to mention the difficulty of flashing firmware.
Configuration management
Another part of managing servers is being able to configure them. There are many tools available: Chef, Puppet, Foreman, etc. We are using CFEngine. While CFEngine works over IPv6, the software has built-in IPv4 support functions for configurations that are not yet implemented for IPv6. We created a CFEngine module that gathers IPv6 information from the host and allows us to feed it into any CFEngine sequence we want. This module is saving us a tremendous amount of time when configuring servers with IPv6: setting up the network interface, creating the right iptables rules, updating software configuration files with IPv6 support, etc.
The automation method we use relies heavily on Linux virtual interfaces. This required us to patch the network-scripts on Linux to make them manageable by CFEngine. The difficulty is with additional IP addresses on an interface: with IPv4, this is handled natively via virtual interface configs, while for IPv6, it is just a variable containing all additional IPs. Our patch allowed us to use virtual interfaces with IPv6. We contributed our modifications to RedHat for consideration.
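On RedHat-style systems, the asymmetry looks like this: an extra IPv4 address gets its own virtual-interface file, while extra IPv6 addresses are crammed into a single variable (the addresses below are examples):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0:0  (one file per extra IPv4 address)
DEVICE=eth0:0
IPADDR=10.0.0.42
NETMASK=255.255.255.255

# In ifcfg-eth0, all extra IPv6 addresses share one variable:
IPV6ADDR_SECONDARIES="2001:DB8:100::1/128 2001:DB8:100::2/128"
```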
IPv6 support for CFEngine is tracked here, on a support ticket that describes minor bugs. CFEngine developers have reported that the software works in IPv6-only environments. For us, it is working well in a dual-stack environment, and we see IPv6 traffic, but we have not yet verified that it indeed works in an IPv6-only environment. We have always worked very closely with the CFEngine developers to provide feedback on our experience, and they are very responsive; some of this feedback finds its way quickly into future releases, so we are confident that any issues we find will be promptly addressed.
Other configuration systems have various levels of support for IPv6. What is important is to test them in real-life scenarios and provide feedback so that each system is always improving.
Monitoring (and metrics)
Over the years, we have built our own monitoring and alerting services, called inGraphs and AutoAlerts, described here.
This software provides a framework for various collectors to get metrics from a server, a network device, an application, or a service, and to send this information to be graphed. This process may also generate alerts if the data deviates from what is expected (hard drive full, CPU usage, too much traffic on an interface, too many requests to a service, etc.). We monitor many metrics per device to allow us to understand trends and daily and weekly cycles, as well as to inform us when something is degrading before it becomes critical. These metrics are also useful for post-mortem analysis.
inGraphs and AutoAlerts are just a framework for us to deploy scripts and probes that help us measure the differences between our operations and services, be it on IPv4 or IPv6. Any organization can do the same, building into its own monitoring system new graphs and alerts that show when a service performs differently depending on the protocol stack.
As we transition from IPv4-only to a dual-stack environment, and ultimately to IPv6 only, we need metrics that allow us to understand whether a service is operating correctly over IPv4 and IPv6. Do we observe a difference in terms of latency, queries per second, etc.? Even though we may not change the product itself, this means having tools and instrumentation that collect the data needed to compare performance on IPv4 and IPv6. For instance, we found that some network equipment might offer traffic breakdowns for IP, ICMP, TCP, or UDP, but not the split between IPv4 and IPv6 traffic. This is a much-needed metric to measure progress and success, and it will be essential when the time comes to remove IPv4.
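On Linux hosts, one place to get this split is the kernel’s per-protocol counters: /proc/net/snmp holds the IPv4 counters and /proc/net/snmp6 the IPv6 ones. A small sketch of a collector for the IPv6 side (the function names are ours, not from inGraphs; the file format is one "Counter value" pair per line):

```python
def parse_snmp6(text):
    """Parse /proc/net/snmp6-style content ('Name value' per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            try:
                stats[parts[0]] = int(parts[1])
            except ValueError:
                pass  # skip malformed lines
    return stats

def ipv6_octets(path="/proc/net/snmp6"):
    """Return (in_octets, out_octets) for IPv6 traffic on this host."""
    with open(path) as f:
        stats = parse_snmp6(f.read())
    return stats.get("Ip6InOctets", 0), stats.get("Ip6OutOctets", 0)
```

Comparing these against the totals from /proc/net/snmp gives a rough IPv6 share of traffic per host.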
A large company like LinkedIn cannot move to IPv6 without measuring many metrics to understand whether the issues we discover are due to differences between our IPv4 and IPv6 implementations. Sometimes changes are not directly obvious, and sometimes they are very small in percentage terms but still important; for instance, 0.1% of 400 million members is still 400,000 members.
Decommissioning
Eventually you will need to decommission machines. This requires removing them from operations, wiping the data, removing them from the network and, finally, unracking them. In an IPv6-only environment, the software needs to be able to do all of these operations on an IPv6-only network.
Planning early for how decommissioning will happen, while architecting the provisioning of machines, will help you in the future. This is something we have put in place on IPv4 and are looking into how to execute in an IPv6-only environment.
The elephant in the room (software support)
There are many pieces of software that need to be IPv6-capable. We will take Hadoop as an example, since it usually accounts for a large number of deployed devices in many organizations, and then look at other software and strategies as well.
Many large companies have to store and analyze big data, and one of the most common systems used for this today is Hadoop, which usually requires lots of servers for data storage and computation. Today, Hadoop does not support IPv6. The community has created a development fork to get IPv6 support, and once this fork is tested, it will be merged back into the main branch. This is how the Hadoop community tends to work when new features are added; it gives a safe development environment for both the feature and the main branch.
Engineers at Facebook have been contributing to the effort to add IPv6 support to Hadoop, and we are looking at how we can participate in this effort. The project is progressing well so far. This is awesome work from all involved with Hadoop, but needless to say, as with all open source projects, we believe the community always welcomes more help.
Once Hadoop supports IPv6, we could easily have large deployments of IPv6-only machines in many organizations, saving millions of IPv4 addresses for each organization.
We look forward to being able to test Hadoop with IPv6 support, and to reporting and helping fix bugs as we find them.
Many tools and pieces of software need to be modified in order to handle IPv6 and its data structures. For regular socket connections, there are a few strategies to take into consideration. By default, the operating system will return the preferred IP address for a hostname (IPv4 or IPv6); this preference is usually IPv6 when the device has a global IPv6 address on one of its interfaces. However, this preference can be changed to always prefer IPv4. The other strategy is to get all the addresses of a hostname and deal with them in the software in a way that differs from the operating system’s default. For instance, Java has a couple of flags to set so that you can enable connections over IPv6. For more examples, you can find a good tutorial on Python and sockets here.
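The second strategy can be sketched in Python: iterate over every address getaddrinfo returns, IPv6 or IPv4, instead of trusting only the operating system’s preference. (For Java, the knobs mentioned above are the java.net.preferIPv4Stack and java.net.preferIPv6Addresses system properties.)

```python
import socket

def connect_any(host, port, timeout=5.0):
    """Try each address returned for host (IPv6 and IPv4 alike)
    until one connection succeeds, rather than relying solely on
    the operating system's preference order."""
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock  # first address that works wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError("no addresses found for %r" % host)
```

Code written this way keeps working whether the host resolves to IPv4 addresses, IPv6 addresses, or both.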
This is just one example of how some software is still not ready for IPv6. Disabling IPv6 in the operating system is still a frequent workaround for any problems with IPv6 support! This is a pity, because it does not help solve the real problems.
In some situations, instead of fixing the needed software to work with IPv6, a task that may depend on an external third party, you can provide a wrapper or proxy around it. For instance, Apache in proxy mode and Nginx have both been extensively used as IPv6 frontends to IPv4-only backends. This is how many sites have been offering connectivity over IPv6. The same approach can be used for internal sites, making people more familiar with IPv6 and increasing IPv6 traffic internally, thereby increasing visibility so that people know that coding for IPv4 only is no longer an option.
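The proxy pattern is only a few lines of configuration; an illustrative nginx sketch (backend address and port are assumptions):

```
# nginx as an IPv6 frontend for an IPv4-only backend
server {
    listen [::]:80;                     # accept IPv6 clients
    location / {
        proxy_pass;   # backend speaks only IPv4
    }
}
```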
Preparing applications for IPv6
The American Registry for Internet Numbers (ARIN) has published documentation on how to ensure developers are writing network-agnostic code.
There is still some road ahead, but we are on the cusp of full IPv6 support for all devices and software. There is no longer any domain that is unaffected by the need for IPv6. It is no longer the work or the need of a few; many are recognizing it.
Migrating all of our software to connect over IPv6 is the task we are tackling over the next few months. We will also continue to work on the way we provision devices. Once this is done, it will be time to start disposing of IPv4.
Finally, we would like to acknowledge some of the external people who helped us in this endeavour. There are many more, and we apologize to anyone we have forgotten or could not name: Fred Baker, John Brzozowski, Vint Cerf, Lorenzo Colitti, Jason Fesler, Lee Howard, Pradeep Kathail, Martin Levy, Christopher Morikang, Paul Saab, Mark Townsley, Eric Vyncke, Dan Wing, and Jan Zorz.