LinkedIn NYC Tech Talk series: Engineering Excellence Meetup

August 28, 2019

We regularly play host to a series of meetups here at the LinkedIn office in the Empire State Building. Open to the community, these events cover a range of topics—from distributed systems, web and mobile development, to machine learning—and are a great way for engineers to meet, share notes, and learn from each other on various technical topics.

At our latest meetup in July, we were joined by keynote speakers from LinkedIn and Facebook, and over 150 attendees from the tech community here in NYC. The presentations centered on how engineers deal with the challenges of productivity and efficacy when building some of the largest-scale systems in the world.

For those who couldn't make the event, here’s a short recap of the keynotes from the Engineering Excellence Meetup.

Deceptive problems in distributed systems

To kick off the night, Swapnil Ghike took the stage to share some of the most common challenges that come with running large-scale distributed systems. He discussed the famous CAP theorem, which describes how a shared-data system can have at most two of the following three properties: consistency, availability, and tolerance to network partitions.

In an interactive session, Swapnil worked with the audience in addressing the following questions when designing a real-time messaging application.

How do you provide exactly-once delivery guarantee?
How do you prevent decoration failures?
Is it okay sometimes to miss an event?
How do you ensure First In First Out (FIFO) ordered delivery?
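
To make the first and last of these questions concrete, here is a minimal sketch of one common approach, not necessarily the one Swapnil presented: let the network deliver messages at least once, then have the receiver drop duplicates and reorder by a per-sender sequence number. The `Message` and `Receiver` names and the sequence-number scheme are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    seq: int   # per-sender sequence number starting at 1 (assumed scheme)
    body: str


@dataclass
class Receiver:
    """Hands each sender's messages to the application once, in order."""
    next_seq: dict = field(default_factory=dict)   # sender -> next expected seq
    pending: dict = field(default_factory=dict)    # sender -> {seq: Message}
    delivered: list = field(default_factory=list)  # bodies given to the app

    def receive(self, msg: Message) -> None:
        expected = self.next_seq.get(msg.sender, 1)
        if msg.seq < expected:
            return  # duplicate of a message that was already delivered
        # Buffer the message; a retransmit of a buffered seq just overwrites it.
        buffered = self.pending.setdefault(msg.sender, {})
        buffered[msg.seq] = msg
        # Flush everything that is now contiguous with what was delivered.
        while expected in buffered:
            self.delivered.append(buffered.pop(expected).body)
            expected += 1
        self.next_seq[msg.sender] = expected


# Messages arrive out of order, and one of them arrives twice.
rx = Receiver()
for m in [
    Message("A", 2, "world"),
    Message("A", 1, "hello"),
    Message("A", 1, "hello"),
    Message("A", 3, "!"),
]:
    rx.receive(m)
print(rx.delivered)  # ['hello', 'world', '!']
```

Note that this gives "effectively once" behavior at the receiver rather than a true exactly-once guarantee on the wire, and it keeps state per sender, which is part of the operational overhead the talk warns about.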

Case Study: Instant messaging with client B disconnected

His recommendations at the end were to design distributed systems for availability, consistency, operations, and performance with the assumption that anything can fail at any time. The key takeaway is to watch out for maintenance and operational overhead. Because major design choices are extremely difficult to undo, Swapnil recommends working out and simplifying requirements before building.

A full replay of his talk can be viewed here.

Swapnil Ghike leading an interactive session with the audience

Using ML for developer productivity at Facebook

Next up, Facebook’s Stephen Fink shared his excitement about how machine learning is poised to be a significant force multiplier in developer productivity.

If you are a developer, you are probably familiar with Stack Overflow and likely use it quite frequently. Developers love it because even if they're not familiar with a certain programming language, they can type in questions in natural language and receive related answers in the form of actual code snippets. In some cases, however, using traditional search engines or Stack Overflow is not efficient, especially when a company has a specific code base written in a less common programming language such as Hack. How does one solve the problem of code search at the scale of Facebook?

Stephen shared a tool that his team has been working on called Neural Code Search (NCS). It enables engineers to find the right code snippet that answers most natural language questions. For example, "How do I calculate a time zone difference in Hack?"

One important thing to note here is that this technology goes beyond simple grep search. The method bodies do not exactly match the whole query phrase, but rather, encapsulate the overall semantic meaning of the query.

The underlying concept that drives NCS is embeddings, which are vector representations of code. Their key characteristic is that semantically similar pieces of code are placed closer together in the vector space.

When a query comes in, it is tokenized and represented as a vector in order to compare it to the rest of the vectors created during the model generation step.
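
As a rough illustration of that query flow, the sketch below averages word vectors into a single vector per method and per query, then picks the method with the highest cosine similarity. Everything in it is an assumption made for the example: the three-dimensional word vectors are hand-picked rather than learned, and the method names are hypothetical. A real system would learn the vectors from a corpus and use a dedicated library for the nearest-neighbor search.

```python
import math
import re

# Hypothetical hand-picked word vectors; a real system learns these.
WORD_VECTORS = {
    "time":       (1.0, 0.0, 0.0),
    "zone":       (0.9, 0.1, 0.0),
    "timezone":   (0.9, 0.1, 0.0),
    "offset":     (0.8, 0.2, 0.0),
    "difference": (0.7, 0.3, 0.0),
    "read":       (0.0, 0.9, 0.1),
    "file":       (0.0, 1.0, 0.0),
    "sort":       (0.0, 0.0, 1.0),
    "list":       (0.0, 0.1, 0.9),
}


def embed(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# "Model generation": embed each (hypothetical) method body once, up front.
METHODS = {
    "getTimezoneOffset": ["timezone", "offset"],
    "readFile": ["read", "file"],
    "sortList": ["sort", "list"],
}
INDEX = {name: embed(tokens) for name, tokens in METHODS.items()}


def search(query):
    """Tokenize the query, embed it, and return the closest method name."""
    query_vec = embed(re.findall(r"[a-z]+", query.lower()))
    return max(INDEX, key=lambda name: cosine(query_vec, INDEX[name]))


best = search("How do I calculate a time zone difference in Hack?")
print(best)  # getTimezoneOffset
```

The query and the winning method share no exact token ("time zone difference" versus "timezone offset"), which is the sense in which this kind of search goes beyond grep.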

NCS is built using Facebook's open source tools: fastText (for word embeddings) and FAISS (for cosine similarity vector search). Stephen encouraged everyone to take a look at these tools along with the published paper, which explains NCS in more detail.

Distributed tracing: A Facebook case study

To wrap up the night, Facebook’s Michael Bevilacqua-Linn presented on tackling end-to-end performance tracking in large distributed systems. He started off by discussing the previous tracing systems at Facebook such as FBTrace, which uses a node-based model very similar to the one standardized in OpenTelemetry.

Michael discussed the evolution of several tracing systems and tools at Facebook, some of which were developed in parallel, resulting in a lot of duplicated work across teams. Canopy emerged as the single system that handles both broad distributed traces and single-node traces, giving detailed performance data.

Comparison view: compare populations of traces for performance regressions

One of the main benefits in switching to Canopy is the Trace Processing feature, which enables the creation of aggregate datasets for trace comparison and visualization.
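
To give a feel for what comparing populations of traces means, here is a small sketch that is not Canopy's API. It assumes a simplified span record (the `Span` fields are hypothetical), aggregates mean duration per span name, and reports how that mean shifts between a baseline population and a candidate population.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Span:
    trace_id: str             # groups spans belonging to one request
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def mean_duration_by_name(spans):
    """Aggregate dataset: span name -> mean duration in milliseconds."""
    buckets = defaultdict(list)
    for span in spans:
        buckets[span.name].append(span.duration_ms)
    return {name: mean(durations) for name, durations in buckets.items()}


def compare(baseline, candidate):
    """Change in mean duration (ms) for span names present in both sets."""
    before = mean_duration_by_name(baseline)
    after = mean_duration_by_name(candidate)
    return {name: after[name] - before[name] for name in before if name in after}


baseline = [
    Span("t1", "a", None, "request", 0.0, 100.0),
    Span("t1", "b", "a", "db_query", 10.0, 40.0),
    Span("t2", "c", None, "request", 0.0, 120.0),
    Span("t2", "d", "c", "db_query", 10.0, 50.0),
]
candidate = [
    Span("t3", "e", None, "request", 0.0, 160.0),
    Span("t3", "f", "e", "db_query", 10.0, 90.0),
]

deltas = compare(baseline, candidate)
print(deltas)  # {'request': 50.0, 'db_query': 45.0}
```

In this toy data, most of the 50 ms request regression is explained by the 45 ms increase in `db_query`, which is the kind of conclusion a comparison view is meant to surface.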

Since effective observability requires high quality telemetry, Michael recommended that everyone try out OpenTelemetry. OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software. He also recommended joining the W3C Distributed Tracing Working Group.

Michael Bevilacqua-Linn and Stephen Fink fielding questions from attendees

Acknowledgements

For this latest event, I’d like to give a big thanks to the hosts, Abdullah Haydar, Siva Visakan Sooriyan, and Anita Desai, for organizing the meetup. Many thanks to the speakers (Swapnil Ghike, Stephen Fink and Michael Bevilacqua-Linn), the volunteers from LinkedIn, and all the attendees for making the meetup a success!

To stay up to date on our latest events, please follow the @LinkedInEng Twitter account and join our LinkedIn NYC Tech Talks Meetup page. I’m looking forward to seeing both returning and new faces at the next event!
