LinkedIn NYC Tech Talk Series: Machine Learning and Data Science Meetup

On the rainy evening of Sept. 12, we continued the technical meetup series hosted by LinkedIn’s NYC engineering group in the Empire State Building with the third meetup in the NYC Tech Talk series. This event focused on Machine Learning and Data Science. We were joined by keynote speakers from LinkedIn, Weill Cornell Medicine, and Google, along with 145 outside attendees. We’d like to thank everyone who attended the meetup. For those who were not able to attend, here is a short recap of the three keynote presentations.

Inferring enterprise relationships from professional social networking data and beyond
Events and relationships among enterprise entities, such as funding rounds and mergers and acquisitions, can provide valuable insight into a company’s needs in marketing, sales, and recruiting. However, collecting and compiling this type of data at scale is often very hard. In the first talk of the evening, Xiaoqiang Luo, a senior engineering manager on LinkedIn’s sales and company relevance team, presented a novel approach to harvesting hundreds of thousands of company relationship data points, with high precision, from news articles.


The key, he explained, was to automatically curate a large training dataset, starting with LinkedIn’s internal data, that could be used to train a model to learn the relationships between companies. Observing that the work experience entries in member profiles often contain rich company relationship information annotated by the members themselves, the NYC data team aligned this natural language text with crowdsourced labels to build the training data. With this sizable training set, they framed relationship extraction as a structured prediction problem, in which the structured output domain is the set of all possible relation hypotheses. They then developed a model pipeline, comprising an event model, field extractors, and a hypothesis generation and classification model, and applied the trained pipeline to abundant news articles to extract high-quality relationship data.
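
To make the structured prediction framing concrete, here is a minimal, self-contained sketch (not LinkedIn’s actual pipeline): candidate relation hypotheses are enumerated for a sentence, each hypothesis is scored, and the highest-scoring one is kept. The company names, relation types, and trigger-word scorer are all illustrative; a production system would use trained field extractors and a learned classifier.

```python
# Illustrative sketch of relation extraction as structured prediction.
from itertools import permutations

RELATIONS = ["acquired", "invested_in", "partnered_with", "no_relation"]

TRIGGERS = {
    "acquired": ["acquired", "acquisition", "bought"],
    "invested_in": ["invested", "funding", "round"],
    "partnered_with": ["partnership", "partnered"],
    "no_relation": [],
}

def generate_hypotheses(companies):
    """Enumerate every ordered company pair combined with every relation type."""
    for a, b in permutations(companies, 2):
        for rel in RELATIONS:
            yield (a, rel, b)

def score_hypothesis(sentence, hypothesis):
    """Toy scorer: count relation trigger words in the sentence.
    A real system would score hypotheses with a trained classifier."""
    a, rel, b = hypothesis
    text = sentence.lower()
    if a.lower() not in text or b.lower() not in text:
        return float("-inf")
    if rel == "no_relation":
        return 0.1  # weak prior so it wins only when no trigger matches
    base = sum(text.count(t) for t in TRIGGERS[rel])
    if base == 0:
        return float("-inf")
    # Crude directional preference: the subject company is mentioned first.
    order_bonus = 0.5 if text.find(a.lower()) < text.find(b.lower()) else 0.0
    return base + order_bonus

def extract_relation(sentence, companies):
    """Return the highest-scoring relation hypothesis for the sentence."""
    return max(generate_hypotheses(companies), key=lambda h: score_hypothesis(sentence, h))

print(extract_relation("Acme announced it has acquired Widgets Inc. last week.",
                       ["Acme", "Widgets Inc."]))
# -> ('Acme', 'acquired', 'Widgets Inc.')
```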

During his closing remarks, Xiaoqiang restated the importance of obtaining training data at scale: with enough high-quality labeled data, you can do a lot of great work even with a relatively simple model.

For more information about AI and Machine Learning at LinkedIn, please check out this post by Deepak Agarwal.

Trajectories of health: Chasing factors that drive health progression
Yiye Zhang, an assistant professor at Weill Cornell Medicine, Cornell University, presented an interesting talk on how to build and analyze health trajectories using Electronic Health Records (EHR) from millions of patients. The goal of these trajectories is to assist with clinical decision-making for the optimal treatment of patients.


Healthcare is not just a big, three-trillion-dollar industry; it’s also something that impacts every one of us. As technology evolves, the healthcare industry evolves with it, with the help of huge amounts of data. EHR data contains rich information on a patient’s medical history, conditions, diagnoses, medications, procedures, and more. One particular problem Prof. Zhang was interested in was how to create time-ordered health trajectories for millions of patients from EHR data, and how to determine the association between patients’ health and their social determinants from these trajectories.

The process of building these trajectories can be summarized in four steps. First, multidimensional information was incorporated into trajectory patterns: every diagnosis, procedure, and drug was treated as an item, and these items were grouped into nodes, then further grouped into super nodes, in an effort to reduce the complexity of the data. Second, similarity among different patients’ trajectories was measured based on their longest common subsequences and used as the metric for their commonality. Third, subgroups of trajectories were identified such that patients in the same group saw similar health changes. Finally, common trajectories were extracted for further analysis, such as exploring the patterns that are important for patients’ health and finding the optimal treatment for them.
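
As a concrete illustration of the similarity step, here is a minimal sketch (not Prof. Zhang’s implementation) that scores two simplified trajectories by the length of their longest common subsequence, normalized by the length of the longer trajectory. The node labels are made up.

```python
# Illustrative sketch of trajectory similarity via longest common subsequence.

def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def trajectory_similarity(a, b):
    """Normalize LCS length by the longer trajectory so the score falls in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Hypothetical trajectories built from grouped diagnosis/procedure/drug nodes.
p1 = ["chest_pain", "ecg", "statin", "follow_up"]
p2 = ["chest_pain", "ecg", "beta_blocker", "follow_up"]
print(trajectory_similarity(p1, p2))  # -> 0.75
```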

Prof. Zhang then demonstrated the use of these trajectories with applications to two conditions: heart disease and abdominal pain. Patients were clustered into several groups based on their trajectories, and certain clusters correlated well with occupation and other factors. The trajectories also presented a clear view of how patients’ conditions progress over time.

Perspective API
Discussing things we care about can be difficult. The threat of abuse and harassment online may stop us from expressing ourselves freely. But what if machine learning technology could help improve conversations online? CJ Adams, a product manager at Google, introduced a powerful tool that makes it easier to host better conversations using machine learning.


Perspective API, the tool developed by Google, is an open source project that aims to improve participation in, and the quality of, online discussions at scale. It uses a convolutional neural network (CNN) operating at both the word and sentence level to determine how close a comment is to the toxic comments in its training set and to produce a toxicity score. Currently it supports three use cases: moderation, which helps editors process comments faster by quickly identifying toxic elements; authorship, which helps writers understand the impact of what they write; and readership, which helps readers make sense of the comments.
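
For readers who want to try it, here is a minimal sketch of requesting a toxicity score from the API’s commentanalyzer endpoint. The request and response fields follow the public documentation at the time of writing; the API key is a placeholder and error handling is kept to a minimum.

```python
# Minimal sketch of scoring a comment with the Perspective API.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key from the Perspective API site
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text):
    """Send a comment to Perspective and return its summary TOXICITY score (0-1)."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("You are a wonderful collaborator."))
```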

At the end of the talk, CJ elaborated on how Perspective API measures and mitigates bias. Quite often, certain identity terms are heavily represented in toxic language: the training data contains too few examples of these terms in positive, non-toxic phrases and too many in toxic phrases. This causes bias in the model, resulting in phrases containing these identity terms sometimes being misclassified as "toxic." To reduce this kind of bias, it is important to rebalance the existing data with additional training data. For this purpose, Google created Project Respect, which lets researchers submit training samples.
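
One simple way to think about that rebalancing is sketched below. This is illustrative only, not Google’s actual process: it counts how often each identity term appears under each label, then oversamples non-toxic examples containing terms that skew toxic. The identity terms and the tiny dataset are made up.

```python
# Illustrative sketch of rebalancing training data for identity terms.
import random
from collections import Counter

IDENTITY_TERMS = ["gay", "muslim", "feminist"]  # illustrative only

def term_balance(data):
    """Count how often each identity term appears under each label."""
    counts = {t: Counter() for t in IDENTITY_TERMS}
    for text, label in data:
        for t in IDENTITY_TERMS:
            if t in text.lower():
                counts[t][label] += 1
    return counts

def rebalance(data, target_ratio=1.0):
    """Duplicate non-toxic examples for terms that skew toxic until the
    non-toxic/toxic ratio for that term reaches target_ratio."""
    augmented = list(data)
    counts = term_balance(data)
    for term, c in counts.items():
        deficit = int(c["toxic"] * target_ratio) - c["non_toxic"]
        pool = [ex for ex in data if term in ex[0].lower() and ex[1] == "non_toxic"]
        if deficit > 0 and pool:
            augmented.extend(random.choices(pool, k=deficit))
    return augmented

# Tiny made-up dataset of (text, label) pairs.
data = [
    ("friendly sentence mentioning gay people", "non_toxic"),
    ("toxic sentence mentioning gay people #1", "toxic"),
    ("toxic sentence mentioning gay people #2", "toxic"),
    ("have a nice day", "non_toxic"),
]
balanced = rebalance(data)
print(len(data), "->", len(balanced))  # one extra non-toxic example is added
```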

Acknowledgements

Back in April, our NYC-based engineering team started hosting a series of technical meetup events to engage local tech experts in sharing their knowledge, best practices, and culture. Thanks to the strong support from the community, the two previous events were well received and brought together hundreds of passionate engineers from across the NYC area.

For this latest event, big thanks to the hosts, Siva Visakan Sooriyan, Xiaoqiang Luo, and Anita Desai, for organizing the meetup. And many thanks to the speakers, Xiaoqiang Luo, Yiye Zhang, and CJ Adams, the volunteers from LinkedIn, and all the attendees for making the meetup a success!

To stay up to date on our latest events, please join our group page.