Ron Bekkerman and Matan Gavish
In the 17th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2011)
We present a document classification system that employs lazy learning from labeled phrases, and argue that the system can be highly effective whenever the following property holds: most of the information about document labels is captured in phrases. We call this property near sufficiency. Our research contribution is twofold: (a) we quantify the near sufficiency property using the Information Bottleneck principle and show that it is easy to check on a given dataset; (b) we reveal that in all practical cases—from small-scale to very large-scale—manual labeling of phrases is feasible: natural language constrains the number of common phrases composed over a vocabulary to grow only linearly with the size of the vocabulary. Both these contributions provide a firm foundation for the applicability of the phrase-based classification (PBC) framework to a variety of large-scale tasks. We deployed the PBC system on the task of job title classification, as part of LinkedIn's data standardization effort. The system significantly outperforms its predecessor in terms of both precision and coverage. It is currently being used in LinkedIn's ad targeting product, and more applications are being developed. We argue that PBC excels in high explainability of the classification results, as well as in low development and low maintenance costs. We benchmark PBC against existing high-precision document classification algorithms and conclude that it is most useful in multilabel classification.
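The core idea of lazy learning from labeled phrases can be sketched in a few lines: keep a dictionary of manually labeled phrases and, at classification time, return every label whose phrase occurs in the document. The phrase list and the simple substring matching below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of phrase-based classification (PBC) as lazy learning.
# The labeled phrases here are hypothetical examples, not real data.
labeled_phrases = {
    "software engineer": "Engineering",
    "machine learning": "Engineering",
    "account executive": "Sales",
    "sales manager": "Sales",
}

def classify(document):
    """Return every label whose phrase occurs in the document (multilabel)."""
    text = document.lower()
    return {label for phrase, label in labeled_phrases.items() if phrase in text}

print(classify("Senior Software Engineer, Machine Learning"))  # {'Engineering'}
```

Because there is no training step, adding or correcting a labeled phrase takes effect immediately, which is one source of the low maintenance cost and high explainability noted above: each predicted label traces back to a specific matched phrase.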