Rethinking Path Validation: Pt. 1, New Requirements

BGP, or the Border Gateway Protocol, is a widely used protocol that allows very large networks, such as the Internet, to scale. It is built on the concept of "transitive trust," which is described below. While BGP was originally developed for Internet routing, it is now used in some large institutional networks as well. This two-part post discusses one proposed model for improving BGP security, based on a talk given by LinkedIn Network Architect Russ White at NANOG 66. Today’s post explains how BGP works in general, and then proposes how BGP can be reused to provide better security, as long as certain requirements are met. The second part, to come later, will describe an architecture that meets these requirements.

As it’s difficult to secure what you cannot describe, it’s best to begin looking at the problem of BGP security with an accurate description of BGP. While most engineers would describe BGP as a routing protocol, this seems to fall short of the mark, as BGP uses policy as its primary metric, only falling back to the “shortest path” when multiple available paths have the same policy weight. So perhaps a more complete definition would be: BGP is a distributed system that describes peering relationships, policy, and reachability grounded in transitive trust.
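
The "policy first, shortest path second" ordering can be sketched in a few lines of Python. This is a deliberately reduced model, not the full BGP decision process: it assumes policy is expressed as a single local preference value (higher wins) and ignores the other tie-breakers real implementations apply. The `Route` class, the `best_path` function, and AS65005 are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Route:
    """One candidate path to a prefix (hypothetical, simplified model)."""
    prefix: str
    as_path: Tuple[int, ...]   # leftmost = neighbor AS, rightmost = origin AS
    local_pref: int = 100      # assumed policy weight; higher is preferred


def best_path(candidates: Iterable[Route]) -> Route:
    """Pick by policy first, then by shortest AS Path.

    Real BGP applies further tie-breakers that are omitted here.
    """
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


prefix = "2001:db8:0:1::/64"
short = Route(prefix, (65002, 65000))
long_preferred = Route(prefix, (65003, 65005, 65000), local_pref=200)
long_equal = Route(prefix, (65003, 65005, 65000))

# Policy outranks path length: the longer path wins when it is preferred.
policy_choice = best_path([short, long_preferred])
# With equal policy weight, the shorter AS Path wins.
tie_choice = best_path([short, long_equal])
```

In this sketch the longer path through the hypothetical AS65005 is selected whenever policy prefers it, which is why describing BGP purely as a shortest-path routing protocol falls short.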

As an example, consider the network below:

[Figure: ASN Graph]

Assume you’re looking at the BGP table at Routers H and K in AS65004. What would you be able to tell from the entry for 2001:db8:0:1::/64?

First, that 2001:db8:0:1::/64 is originated by AS65000. In more common terms, this means that AS65000 “owns” this destination, or rather that some router within AS65000 is actually directly connected to 2001:db8:0:1::/64.

Second, that AS65002 and AS65003 (assuming there is no policy) both have paths to this destination, as they’re both connected to AS65000.

Both of these things are known transitively; AS65004 cannot see whether or not some router in AS65000 is actually connected to 2001:db8:0:1::/64, nor can AS65004 check the connection between AS65002 and AS65000 or AS65003 and AS65000. Routers H and K must assume these things are true based on BGP advertisements—or rather, AS65004 must trust AS65002 and AS65003 to tell it the truth about the connectivity in the network. Everything AS65004 knows about reachability to 2001:db8:0:1::/64 is, therefore, known transitively. AS65004 believes it can trust what AS65000 is advertising because AS65004 trusts AS65002, and AS65002 (apparently) trusts AS65000.
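
To make the split between what is observed and what is merely believed concrete, here is a minimal Python sketch. It assumes the AS Path is a plain sequence of AS numbers with the origin on the right and that the route was learned from AS65002; `origin_as` and `split_by_trust` are hypothetical helper names, not part of any BGP implementation.

```python
from typing import List, Tuple

Link = Tuple[int, int]


def origin_as(as_path: Tuple[int, ...]) -> int:
    """The rightmost AS in the path is the one claiming to originate the prefix."""
    return as_path[-1]


def split_by_trust(local_as: int, as_path: Tuple[int, ...]) -> Tuple[List[Link], List[Link]]:
    """Split the links implied by an AS Path into (observed, trusted).

    Only the first link, the local AS's own peering session, is something
    the local AS can see for itself. Every later link is a claim that was
    relayed by a neighbor and is accepted on transitive trust.
    """
    hops = (local_as,) + tuple(as_path)
    links = list(zip(hops, hops[1:]))
    return links[:1], links[1:]


# Assumed entry at Router H for 2001:db8:0:1::/64, learned via AS65002.
as_path = (65002, 65000)
claimed_origin = origin_as(as_path)
observed, trusted = split_by_trust(65004, as_path)
```

Here `observed` holds only the AS65004 to AS65002 peering, while the AS65002 to AS65000 link and the origin claim itself both land on the trusted side.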

The entire point of BGP security is to augment these transitive trust relationships with something a little more solid. For instance, origin validation (in other words, some way to prove which AS legitimately originates a route) would provide a way for AS65004 to independently verify that AS65000 actually owns 2001:db8:0:1::/64, without trusting AS65002 or AS65003. This would allow AS65004 to distinguish between advertisements claiming reachability to 2001:db8:0:1::/64 from AS65001 and from AS65000. In other words, if both AS65000 and AS65001 claim to be connected to the same destination, AS65004 should be able to tell which of the two is telling the truth without relying on transitive trust through AS65002 and AS65003. Origin validation appears to be a somewhat solved problem in technical terms through the Resource Public Key Infrastructure (RPKI). Although widespread deployment of the RPKI is still in question, this post won’t deal with the problem of origin validation.
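
For orientation only, the kind of check origin validation performs can be sketched as follows. The three outcomes follow the prefix origin validation rules in RFC 6811; the `VRP` tuple, the `validate_origin` function, and the single authorization entry are assumptions made up for this example, and a real deployment would obtain validated data from an RPKI validator rather than a hard-coded list.

```python
import ipaddress
from typing import Iterable, NamedTuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class VRP(NamedTuple):
    """A validated authorization: this ASN may originate this prefix."""
    prefix: Network
    max_length: int
    asn: int


def validate_origin(prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid', or 'not-found' for an announced route."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp in vrps:
        # subnet_of() raises TypeError across IP versions, so check first.
        if route.version != vrp.prefix.version:
            continue
        if not route.subnet_of(vrp.prefix):
            continue
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by some authorization but matching none of them is "invalid";
    # no covering authorization at all is "not-found".
    return "invalid" if covered else "not-found"


# Hypothetical authorization data: only AS65000 may originate the prefix.
vrps = [VRP(ipaddress.ip_network("2001:db8:0:1::/64"), 64, 65000)]

from_65000 = validate_origin("2001:db8:0:1::/64", 65000, vrps)
from_65001 = validate_origin("2001:db8:0:1::/64", 65001, vrps)
unknown = validate_origin("2001:db8:0:2::/64", 65000, vrps)
```

With this data the advertisement originated by AS65000 is accepted and the competing claim from AS65001 is flagged, without AS65004 consulting AS65002 or AS65003.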

But what questions can you ask about the AS Path? The primary question seems to be this: If Router H sends a packet towards some destination in 2001:db8:0:1::/64, will the packet actually reach the destination? A second question—much harder to answer—is: Will such a packet pass through a third party, not listed in the AS Path, who is not supposed to be in the path? This second question is harder for several reasons, including:

  • The AS Path at Router H may not actually describe the path the packet will take for valid reasons. For instance, aggregation, local policy, and the availability of multiple equal cost paths can legitimately cause traffic to flow along a path not listed in the AS Path.
  • It’s difficult for Router H to know, for instance, what AS65003’s policy is towards AS65000, or AS65000’s policy is towards AS65003. This information can be inferred, of course, from the existence of an advertisement, but here the operator is reaching into murky waters.

Two points, then, are worth noting:

  • While connectivity can be verified, the actual path a packet takes through a packet switched network cannot be verified in any meaningful way.
  • The more clearly a policy is stated and known, the more readily it can be enforced. Inferring policy transitively across a set of autonomous systems is a tricky game; there is some level to which it can be done, but reaching beyond this level is difficult (if not impossible).

With this background in mind, let’s turn to more practical requirements. What, specifically, would a deployable system look like? To ask this in a different way, what are the operational requirements that must be met for any system that reduces our reliance on transitive trust in relation to the AS Path?

First, it would be nice to reuse BGP. This isn’t a hard and fast requirement—in fact, there are those who object to it, because we seem to be piling just about everything into BGP, from flow policy to layer two reachability, to transitive IGP cost, and so on. In this particular case, though, BGP was actually designed to provide inter-AS reachability at a global scale. So while it might turn out that BGP isn’t used to distribute information about policy and reachability in a more secure way, it seems like a good candidate to start with.

Second, it would be nice, if we’re reusing BGP, to reuse existing ways of indicating policy. This specifically refers to the reuse of communities, which are widely used to filter and indicate route preferences within and between autonomous systems.
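
As a reminder of how communities carry policy today, here is a small Python sketch. A standard community is a 32-bit value conventionally written as two 16-bit halves, `ASN:value`, and `NO_EXPORT` is one of the well-known values defined in RFC 1997. The `65004:200` community and the mapping to a local preference are hypothetical, chosen only to show the pattern of a route tag driving a preference.

```python
from typing import Iterable

NO_EXPORT = 0xFFFFFF01  # well-known community from RFC 1997


def encode_community(asn: int, value: int) -> int:
    """Pack the two 16-bit halves of a standard community into 32 bits."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("standard communities use two 16-bit fields")
    return (asn << 16) | value


def format_community(community: int) -> str:
    """Render a 32-bit community in the usual ASN:value notation."""
    return f"{community >> 16}:{community & 0xFFFF}"


# Hypothetical local convention: routes tagged 65004:200 are preferred.
PREFER = encode_community(65004, 200)


def local_pref_for(communities: Iterable[int], default: int = 100) -> int:
    """Map route tags to a preference (illustrative values only)."""
    return 200 if PREFER in set(communities) else default


no_export_text = format_community(NO_EXPORT)
tagged = local_pref_for([PREFER, NO_EXPORT])
untagged = local_pref_for([NO_EXPORT])
```

The point of the requirement is that operators already express policy this way, so a path validation system that can carry or reference the same tags avoids introducing a second policy language.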

Third, it would be ideal if any proposed system to validate the AS Path would interact with the existing RPKI origin authentication system, which is proposed and partially deployed (though, again, there are still questions about its widespread deployment). The interaction here would be that the RPKI could be used to provide the public/private key pairs necessary to cryptographically secure any proposed path validation system. There are a number of operators who object to this premise, so we’ll leave this as an open question here, to be resolved later.

Fourth, we cannot replace the existing external BGP (eBGP) speakers (routers that interconnect different autonomous systems) along autonomous system borders. This might seem like a trivial point to some, but it’s crucial to the operation of many large providers. Replacing the peering routers across an entire AS is expensive in terms of capital costs (CAPEX), but CAPEX pales in comparison to operational costs (OPEX). Even if some governmental organization were to make hundreds of millions of dollars available for such a replacement, the real cost would still dwarf that amount in terms of service disruptions alone.

Fifth, we cannot release new information about the structure or connectivity of an AS into the wild, for security reasons. This rules out systems that use any sort of per-router key, as such systems go beyond existing connectivity information, exposing actual peering relationships on a per-eBGP speaker basis. From the perspective of LinkedIn, it’s fine if someone can infer that LinkedIn peers with some provider in Brussels. It’s not fine if someone can infer there are two specific eBGP peering routers in Brussels on the LinkedIn edge, and the set of remote autonomous systems those two routers peer with. This sort of information opens up an entirely new attack surface which would most likely offset any gain in security from validating the AS Path.

Sixth, the operation of any proposed system within an autonomous system can be suggested, but not mandated. Each autonomous system has its own internal operational processes and tempo. These operational processes are deeply intertwined with the structure of the company’s business. Mandating a particular operational model is unenforceable and overly restrictive.

Seventh, the system must provide incremental value if it is deployed incrementally. While there is some realistic lower bound for usefulness, the bound should be as low as possible, and each additional participant should increase overall security for everyone who participates.

Eighth, the system must allow operators to hide specific connectivity information, or rather constrain such information within a tightly bounded set of peers. For instance, some organizations will contract with a provider to provide a backup link, but they don’t want the existence of the backup link to be widely known for various reasons. In this case, it should be possible to advertise the connection only when it is being used. Another use case is several organizations that want to connect to a single provider to reach one another, but don’t want the global Internet to know about this private connectivity. While this could also be handled through some sort of VPN, any proposed BGP security solution needs to be able to support this configuration.

These eight requirements might seem overwhelming; can a system actually be designed that will meet them all? One possible solution that would meet many (if not all) of these requirements is to build an AS level graph of interconnections and policies across an overlay on top of the per destination reachability information carried in BGP. In the next post, we’ll work through this kind of system to see how it might work.
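
As a rough illustration of the general idea, and not the architecture the second part describes, an AS level graph could be used to check whether every hop in a received AS Path corresponds to an interconnection that both sides have declared. The graph contents below are assumed from the example network in this post, and `path_matches_graph` is a hypothetical name. Note that the graph works at AS granularity only, so it reveals nothing about individual eBGP speakers.

```python
from typing import Dict, Sequence, Set


def path_matches_graph(local_as: int, as_path: Sequence[int],
                       graph: Dict[int, Set[int]]) -> bool:
    """Check each adjacent pair in the path against declared AS-level links.

    A link counts only if both ends list each other, so a single AS
    cannot claim a connection its supposed neighbor has not declared.
    """
    hops = [local_as, *as_path]
    for a, b in zip(hops, hops[1:]):
        if b not in graph.get(a, set()) or a not in graph.get(b, set()):
            return False
    return True


# Declared interconnections, assumed from the example network.
graph = {
    65000: {65002, 65003},
    65002: {65000, 65004},
    65003: {65000, 65004},
    65004: {65002, 65003},
}

declared = path_matches_graph(65004, (65002, 65000), graph)
# AS65001 has declared no link to AS65002, so this path is rejected.
undeclared = path_matches_graph(65004, (65002, 65001), graph)
```

A sketch this small says nothing about how the graph is distributed, signed, or kept private among a limited set of peers; those are exactly the questions the requirements above raise.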