IPv6 Inside LinkedIn Part II
Back to the Future
August 9, 2016
Coauthor: Tim Crofts
In Part I of this series, we explained why we decided to move our internal network over to IPv6. The LinkedIn site has been available to the public over IPv6 since 2014, and our employees have been able to browse the internet over IPv6 for even longer. While many parts of our network have been IPv6 for a while, until recently our internal data center network was still on IPv4. In this post, we’ll explain from a network operator point of view why we needed to create a global network with IPv4 and IPv6, as well as the challenges we faced when we began to enable dual stack in our data centers with the goal to one day remove IPv4.
Network design: Back to the future
With the exception of link-local addresses and the now-deprecated site-local addresses, all IPv6 addresses are globally routable. With IPv4 RFC1918 space, there was a certain level of comfort in knowing that these private networks should not be able to “leak” to the global internet. With IPv6, however, this isn’t guaranteed, because global IPv6 space is routable everywhere. Data center designs now need to implement more robust security policies, as traffic can originate both internally and externally. It’s no longer feasible to have a simple policy that only permits or denies RFC1918 addresses.
When all addresses are globally routable, packets may take different paths to reach the same destination. If the design is not careful, the forward path may go outbound via one firewall while the return path comes back via another. Firewalls are stateful, meaning they remember the state of an internal machine connecting to an external machine in order to automatically allow the return traffic. It is very difficult, and potentially insecure, to share state across firewalls. Moreover, some portions of a path may cross the public internet, a less secure network, rather than stay within the internal network.
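The asymmetric-path problem described above can be illustrated with a toy model. This is a minimal sketch, not a description of any real firewall: the `Flow` and `StatefulFirewall` names are hypothetical, and the addresses come from the 2001:db8::/32 documentation prefix.

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    """A simplified connection identifier (hypothetical, for illustration)."""
    src: str
    dst: str
    sport: int
    dport: int

    def return_flow(self) -> "Flow":
        # The reply swaps source and destination.
        return Flow(self.dst, self.src, self.dport, self.sport)


class StatefulFirewall:
    """Toy firewall: records outbound flows and admits only matching replies."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._state: Set[Flow] = set()

    def outbound(self, flow: Flow) -> bool:
        # Remember the connection so the reply can be allowed later.
        self._state.add(flow)
        return True

    def inbound(self, flow: Flow) -> bool:
        # Allow inbound traffic only if it is the reply to a known flow.
        return flow.return_flow() in self._state


fw_a = StatefulFirewall("fw-a")
fw_b = StatefulFirewall("fw-b")

request = Flow("2001:db8:1::10", "2001:db8:ffff::80", 40000, 443)
reply = request.return_flow()

fw_a.outbound(request)
symmetric_ok = fw_a.inbound(reply)   # reply comes back through the same firewall
asymmetric_ok = fw_b.inbound(reply)  # reply comes back through a different firewall
print(symmetric_ok, asymmetric_ok)
```

In this model the reply is accepted only by the firewall that saw the request leave, which is why the forward and return paths need to be designed together.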
With IPv6, internal traffic can no longer be identified simply by checking whether the destination IP falls within RFC1918 space. Internal and external traffic need to be separated, even if only virtually. To achieve this separation, we must return to the network designs that were used before NAT was invented. In other words, we are “back to the future.”
NAT allows a machine deep inside a data center to access the internet directly; essentially, a one-to-one NAT can be established. In our data centers, we decided to have an IPv6 range that we do not advertise to the internet, which means that all machines in that range access the internet via a set of proxies or gateways in our DMZ. To keep the architecture simple, we do not want internal machines to be able to access the internet directly over IPv4 if they cannot do so over IPv6. Because we are running dual stack, we forbid any NAT setup on IPv4. All internal machines now need to go through multi-homed machines, whether over IPv4 or IPv6.
With Border Gateway Protocol (BGP), you can advertise the IPv4 and IPv6 routes known by one router to its neighbor router over either IPv4 or IPv6. Because our goal at LinkedIn is to eventually remove IPv4 altogether, we decided to not cross network stacks, and our IPv6 route advertisements are done exclusively over IPv6-based BGP sessions.
Adding a new network stack requires that its security be at least equivalent to the existing IPv4 security. One approach could be to use a filter to deny all IPv6 traffic and work from there. However, once a device becomes IPv6 aware, you may break services if the device cannot reach the destination over IPv6 and the application either does not gracefully fall back to IPv4 or does not do so in a timely manner. Therefore, we took a different approach.
Another option was to simply convert our existing access control lists (ACLs) from IPv4 to IPv6, but the two protocols differ in ways that matter for such a conversion. For instance, converting ICMP filters to ICMPv6 filters to allow ping packets to reach machines is straightforward, but as explained in Part I of this series, ICMPv6 Packet Too Big (PTB) is unique to IPv6 and very important for communications over encapsulating tunnels, or when using jumbo packets. An IP packet normally has a maximum size of 1,500 bytes on Ethernet. However, tunnels encapsulate a packet within a packet, translators such as NAT64 change the size of the packet header, and jumbo packets (bigger than 1,500 bytes) can also be used to allow faster transmission over high-speed links. To avoid hard-to-diagnose connectivity issues, it is very important to ensure that PTB messages are generated, received, and processed. Because the PTB message does not exist in IPv4 (and because of other similar differences), we could not simply convert our existing ACLs to IPv6 versions, or they would have lacked rules to authorize PTB messages.
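The gap left by a one-to-one conversion can be sketched as follows. This is an illustration only, not our actual ACL tooling: the `IcmpRule` type and `convert_icmp_rules` function are hypothetical, while the message type numbers are the standard ones (ICMPv4 echo request 8 and echo reply 0, ICMPv6 echo request 128, echo reply 129, and Packet Too Big 2).

```python
from dataclasses import dataclass
from typing import List

# ICMPv4 type -> ICMPv6 type for the echo messages used by ping.
ICMP4_TO_ICMP6 = {8: 128, 0: 129}  # echo request, echo reply
ICMPV6_PACKET_TOO_BIG = 2


@dataclass(frozen=True)
class IcmpRule:
    """Hypothetical, simplified ACL entry for one ICMP message type."""
    family: str  # "ipv4" or "ipv6"
    icmp_type: int
    action: str = "permit"


def convert_icmp_rules(v4_rules: List[IcmpRule]) -> List[IcmpRule]:
    """Translate ping rules to ICMPv6 and add the PTB rule a plain copy would miss."""
    v6_rules = [
        IcmpRule("ipv6", ICMP4_TO_ICMP6[rule.icmp_type], rule.action)
        for rule in v4_rules
        if rule.icmp_type in ICMP4_TO_ICMP6
    ]
    ptb_rule = IcmpRule("ipv6", ICMPV6_PACKET_TOO_BIG, "permit")
    if ptb_rule not in v6_rules:
        v6_rules.append(ptb_rule)
    return v6_rules


v4_ping_rules = [IcmpRule("ipv4", 8), IcmpRule("ipv4", 0)]
v6_rules = convert_icmp_rules(v4_ping_rules)
print([rule.icmp_type for rule in v6_rules])
```

The explicit PTB rule at the end is the part that has no source rule in the IPv4 ping filter, so it has to be added deliberately rather than derived.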
Finally, we have ACLs to protect the environment, machines, and so forth. When you enable IPv6 on two devices, the communication between them may break if you do not create an IPv6 ACL rule set that’s equivalent to the IPv4 rule set before you dual stack the devices.
Additionally, we use Virtual IPs (VIPs) to connect a client to many servers via a single IP address. The configuration of these VIPs is slightly different on IPv4 and IPv6. For instance, some load balancers that handle the VIPs are configured to rewrite the Ethernet part of the packet to send it to the final destination server. This server needs to be able to handle the packet in its native protocol, either IPv4 or IPv6; therefore, if a VIP is dual stack, it means that all the machines represented by the VIP need to be dual stack as well. On the load balancer configs, the configurations of VIPs on IPv4 and IPv6 are two separate configuration statements, but we did not want to name the VIPs differently in DNS. Likewise, to avoid technical debt, we do not want to end up with a hostname such as vipname-v6 when we will ultimately have an IPv6-only environment. Thus, on the DNS side, the VIP names will have an AAAA record added when they are ready to serve IPv6 traffic.
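The rule that a VIP name only gains an AAAA record once every machine behind it is dual stack can be written as a small check. This is a sketch under assumptions: the `Backend` type, its fields, and the hostnames are hypothetical and do not reflect our load balancer or DNS tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of one server behind a VIP."""
    hostname: str
    has_ipv4: bool
    has_ipv6: bool


def vip_dns_records(backends: List[Backend]) -> List[str]:
    """Return the record types the single VIP name should carry."""
    records = []
    # The load balancer forwards packets in their native protocol, so a
    # record type is only safe if every backend can handle that protocol.
    if backends and all(b.has_ipv4 for b in backends):
        records.append("A")
    if backends and all(b.has_ipv6 for b in backends):
        records.append("AAAA")
    return records


partially_migrated = [
    Backend("app1.example.com", True, True),
    Backend("app2.example.com", True, False),
]
fully_migrated = [
    Backend("app1.example.com", True, True),
    Backend("app2.example.com", True, True),
]
print(vip_dns_records(partially_migrated), vip_dns_records(fully_migrated))
```

The VIP keeps one name throughout; only the set of record types attached to it changes as the backends are migrated.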
From all these points, we realized that creating an IPv6 addressing scheme for the devices would help us do things like define IP rules based on machines being able to access other machines, when not all machine hostnames have a DNS AAAA record (or for that matter a reverse DNS record).
What IPv6 addresses should we give devices?
As explained in Part I, we did not want to add a DNS AAAA record on hostnames because with such a record, connections to the servers would be preferred over IPv6. We would have had to certify all the software we use for IPv6 before being able to dual stack servers.
We also did not want to embed the IPv4 address in the IPv6 address (for example: 2620:abcd:efef::192.168.1.1) because:
The above example, on the interface, will be represented as 2620:abcd:efef::c0a8:0101, so the IPv4 address is no longer readable at a glance;
It does not solve the problem of IPv4 address exhaustion;
It creates technical debt when we remove IPv4.
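The first point can be checked with Python's standard `ipaddress` module, which accepts the dotted form when parsing but stores and prints the address in hexadecimal. This is only a small demonstration using the example address above.

```python
import ipaddress

# The dotted-quad suffix is accepted on input...
addr = ipaddress.IPv6Address("2620:abcd:efef::192.168.1.1")

# ...but the address is stored as 128 bits and printed in hexadecimal.
compressed = str(addr)
exploded = addr.exploded

# Recovering the IPv4 address requires decoding the last 32 bits.
embedded_v4 = ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)

print(compressed)
print(exploded)
print(embedded_v4)
```

The compressed form also drops leading zeros in each group, so the last group appears as `101` rather than `0101`.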
However, we still need to be able to know the IPv6 address of a machine given its IPv4 address when no AAAA records are set for the hostname in order to easily convert our IPv4 ACLs into IPv6 ACLs. (ACLs control which machines are authorized to access a specific machine.)
Our solution to this problem is to pair each IPv4 network with an IPv6 network and use the last two octets of the IPv4 address, in hexadecimal format, as the final quibble of the IPv6 address (IPv6 addresses are written as eight quibbles of four hexadecimal digits, or 16 bits, each, separated by colons). We chose the last two octets because some of our smallest IPv4 networks that could be paired with an IPv6 network had a /23 mask, whose nine host bits do not fit in a single octet (the majority were /24 or /25 to fit a cabinet). We used the subnet pairing option that we have in our IP Address Management system (IPAM) for this. Using this method, the IPv6 subnet can be obtained based on the current IPv4 subnet allocated for a particular VLAN.
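One way to read the mapping just described is sketched below. The `paired_ipv6` function is hypothetical, the subnet pairing that our IPAM provides is replaced by an explicit argument, and the prefixes are documentation and private examples rather than our real allocations.

```python
import ipaddress


def paired_ipv6(ipv4: str, paired_ipv6_subnet: str) -> ipaddress.IPv6Address:
    """Derive a host's IPv6 address from its IPv4 address and the paired subnet."""
    v4 = ipaddress.IPv4Address(ipv4)
    net6 = ipaddress.IPv6Network(paired_ipv6_subnet)
    # Keep only the last two octets (16 bits) of the IPv4 address.
    last_two_octets = int(v4) & 0xFFFF
    # Use them as the final 16-bit group of the IPv6 address.
    return net6[last_two_octets]


# Assumed pairing: 10.20.2.0/23 <-> 2001:db8:0:1::/64
host_v6 = paired_ipv6("10.20.3.77", "2001:db8:0:1::/64")
print(host_v6)
```

Here the octets 3 and 77 become 0x03 and 0x4d, so the final group is `034d`, printed as `34d` once leading zeros are dropped. Given an IPv4 ACL entry and the subnet pairing, the matching IPv6 entry can be computed without any AAAA lookup.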
To simplify ACLs, routing aggregation, and network boundaries between internal and external, we decided to use one single large IPv6 network across all our data centers, present and future. This method also simplifies identification of internal IPv6 traffic because we can identify it just by looking at the block from which the address comes.
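With a single block, the internal-versus-external test reduces to one containment check. The sketch below assumes a placeholder block (the 2001:db8::/32 documentation prefix) and a hypothetical `is_internal` helper, since the real prefix is not given here.

```python
import ipaddress

# Placeholder for the single internal block; not our real prefix.
INTERNAL_BLOCK = ipaddress.IPv6Network("2001:db8::/32")


def is_internal(address: str) -> bool:
    """Return True when the address belongs to the internal block."""
    return ipaddress.IPv6Address(address) in INTERNAL_BLOCK


print(is_internal("2001:db8:12:34::1"))
print(is_internal("2606:4700::1111"))
```

An ACL or routing policy can apply the same idea with a single prefix match instead of a list of per-site ranges.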
To make things easy, we also decided that the interfaces of all our routers connected to servers will have fe80::1 as the link-local IPv6 address. In all LinkedIn data centers, our servers always use the eth0 interface to reach the default gateway, so the default gateway is always fe80::1 via eth0, or fe80::1%eth0. We do not need to rely on Router Advertisement messages to establish the default gateway. With fe80::1 as the gateway on every network segment, no special scripts or tools are needed to parse the routing table and learn the default gateway.
We keep all IPv6 addresses static because dynamic IPv6 addresses would need to be maintained in the DNS so that clients can find the server to connect to. With many servers, that would mean a lot of updates for machines that are basically static, running 24/7. Any dynamic scheme, for IPv6 addresses or the default gateway, obliges the system to keep the dynamic state up by continuously sending information about the network. With a static state, there is no need to ensure timely information is broadcast to servers and devices so they can keep their network state.
The scheme just described applies to our server IPs. Our network devices use a more traditional scheme, where point-to-point links have unique networks carved out of larger blocks, and loopback addresses come out of dedicated blocks of IPv6 space.
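Because the server gateway is the same on every segment, the static default route for a server can be generated without inspecting the network. The sketch below is an assumption about how such a helper might look: `default_route_command` is hypothetical, and it only builds the `ip -6 route` command string for Linux iproute2 rather than running it.

```python
import ipaddress

GATEWAY = ipaddress.IPv6Address("fe80::1")
INTERFACE = "eth0"


def default_route_command(
    gateway: ipaddress.IPv6Address = GATEWAY, interface: str = INTERFACE
) -> str:
    """Build the iproute2 command for a static IPv6 default route."""
    # A link-local gateway is only meaningful together with an interface.
    if not gateway.is_link_local:
        raise ValueError(f"{gateway} is not a link-local address")
    return f"ip -6 route add default via {gateway} dev {interface}"


scoped_gateway = f"{GATEWAY}%{INTERFACE}"
command = default_route_command()
print(scoped_gateway)
print(command)
```

The same string is valid for every server in every data center, which is the point of fixing both the gateway address and the interface name.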
On our servers, we prefer to use static IP addresses for both IPv4 and IPv6. We use the IPAM tool mentioned above to record all our networks and hostnames. When we provision, we know in which cabinet each host is located and which port it’s connected to. IPAM feeds into our DNS, so we do not use Dynamic DNS to map IPs to hostnames. Furthermore, our application stack relies on discovery services that map services to hostnames. Because of these factors, using static IP addresses makes more sense: we control the IP assigned, it will not change, and our application stack can rely on a persistent mapping of names to IPs.
Unlike our IPv4 strategy, we will not use NAT66 to support proxy and other DMZ functions that need to access the internet over IPv6. Rather, all hosts in the DMZ will use dual-homed connectivity to provide both internal IPv6 connectivity within the DC and external IPv6 connectivity to the internet through a firewall.
In order to support LinkedIn’s ultimate goal of an IPv6-only data center, we need to ensure that other services, such as Terminal Access Controller Access Control System (TACACS), Network Time Protocol (NTP), System Logging (Syslog), Simple Network Management Protocol (SNMP), and sFlow all support IPv6 sources, and that there is feature parity with IPv4. In general, the application layer needs to support IPv6, but the tools to manage all these devices also need to understand IPv6. Finally, all devices need to be able to be provisioned over an IPv6-only network. Zero Touch Provisioning (ZTP) still needs some work on IPv6 and is burdened by a lot of legacy behavior. The higher-level question of how to make software and applications run on IPv6 will be covered in Part III.
The following people contributed to this blog post through their participation in our AAAA team:
Zaid Ali, Sriram Akella, Andrey Bibik, Donaldo Carvalho, Brian Davies, Bo Feng, David Fontaine, Prakash Gopinadham, David Hoa, Sanaldas KB, Henry Ku, Prasanth Kumar, Vikas Kumar, Tommy Lee, Leigh Maddock, Navneet Nagori, Marijana Novakovic, Ved Prakash Pathak, Stephanie Schuller, Chintan Shah, Harish Shetty, Andrew Stracner, Veerabahu Subramanian, Shawn Zandi, Andreas Zaugg, David Paul Zimmerman, Paul Zugnoni.