Pegasus Data Language: Evolving schema definitions for data modeling

Yingjie (Nicki) Bi

Staff Software Engineer at LinkedIn

November 19, 2020

Pegasus Data Schema (PDSC) is a Pegasus schema definition language that has been used for data modeling with Rest.li services for years. It's the underlying language that helps define data models, describe the data returned by REST endpoints, and generate derivative schemas for other uses, such as XML schemas and various database schemas. However, writing PDSC files is hard and often error-prone because they can be verbose and lack integrated development environment (IDE) support.

With this in mind, we are excited to announce the general availability of Pegasus Data Language (PDL) as a new Pegasus schema definition language to replace PDSC. In open sourcing PDL, we hope to help Rest.li developers adopt a new standard for defining schemas in a more user-friendly format. In this blog post, we discuss the differences between PDL and PDSC, the process of migrating existing PDSC to PDL, and future plans for PDL.

Why Pegasus Data Language?

As the successor to PDSC, PDL was developed to address user experience issues with PDSC and decrease the number of errors while writing schemas. PDL boasts a multitude of improvements, including a Java-like syntax that makes it more readable and rich editor support bundled with LinkedIn’s IntelliJ plugin.

Core features

Although PDSC has a syntax that is a subset of JSON, it lacks readability. Therefore, in developing PDL, we knew a priority would be to design it to be easier to read. PDL also features extra shorthand that developers can leverage to write fewer lines of code and make their schemas easier to understand.

Java-like syntax
PDL and PDSC are fully compatible. For the same schema, developers write less in PDL than they would in PDSC.

Side-by-side comparison of PDSC and PDL for the same schema
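
As a minimal sketch of the kind of difference this comparison illustrates, the PDL file below defines a small record. The namespace, record name, and fields (`com.example.models`, `Member`, and so on) are hypothetical and chosen only for illustration. The trailing comment shows roughly what the same record looks like in PDSC.

```pdl
namespace com.example.models

/**
 * A hypothetical member record, for illustration only.
 */
record Member {
  /** Unique identifier of the member. */
  id: long

  /** Display name. Optional fields use the optional keyword. */
  displayName: optional string

  /** Number of connections. Defaults are assigned with an equals sign. */
  connectionCount: int = 0

  /** Skills. Collections take their item type in brackets. */
  skills: array[string]
}

// Roughly equivalent PDSC (Member.pdsc), shown for comparison:
// {
//   "type": "record",
//   "name": "Member",
//   "namespace": "com.example.models",
//   "doc": "A hypothetical member record, for illustration only.",
//   "fields": [
//     { "name": "id", "type": "long",
//       "doc": "Unique identifier of the member." },
//     { "name": "displayName", "type": "string", "optional": true,
//       "doc": "Display name. Optional fields use the optional keyword." },
//     { "name": "connectionCount", "type": "int", "default": 0,
//       "doc": "Number of connections. Defaults are assigned with an equals sign." },
//     { "name": "skills", "type": { "type": "array", "items": "string" },
//       "doc": "Skills. Collections take their item type in brackets." }
//   ]
// }
```

In this sketch the PDL version reads top to bottom like a class declaration, while the PDSC version spends much of its length on JSON punctuation and repeated keys such as "name" and "type".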

Support for import statements
In PDSC, all references to types outside of the schema's own namespace must be written as fully qualified type names. In PDL, however, import statements allow the user to specify types that can then be referenced by their simple name rather than their full name. This reduces the amount of redundant text in schemas that refer to the same type numerous times.

Side-by-side comparison of PDSC and PDL with the latter using import statements
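
A minimal sketch of an import in use follows. The type `com.example.common.AuditStamp` and the record `Document` are hypothetical names used only for illustration.

```pdl
namespace com.example.models

// Import the external type once so it can be referenced by its simple name.
import com.example.common.AuditStamp

/** A hypothetical record that refers to the same external type several times. */
record Document {
  /** When the document was created. */
  created: AuditStamp

  /** When the document was last modified. */
  lastModified: AuditStamp

  /** When the document was deleted, if it has been. */
  deleted: optional AuditStamp
}
```

Without the import, each of the three fields would have to spell out `com.example.common.AuditStamp` in full, as every such reference must in PDSC.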

Shorthand for custom properties
In PDSC, custom properties are arbitrary values keyed by any name that is not a reserved keyword. In PDL, custom properties are cleaner and use a more Java-like, annotation-style syntax.
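
As a minimal sketch, custom properties in PDL are written with an `@` prefix directly above the element they describe. The property names below (`owner`, `storage`, `indexed`) are hypothetical and carry no special meaning. The path form on the second property is shorthand for a nested JSON value.

```pdl
namespace com.example.models

/** A hypothetical record carrying custom properties. */
@owner = "data-team"
// Path shorthand: equivalent to @storage = { "table": "profiles" }
@storage.table = "profiles"
record Profile {
  /** Properties can also be attached to individual fields. */
  @indexed = true
  headline: string
}

// In PDSC the record-level properties would appear as extra keys in the
// record's JSON object, next to reserved keys such as "type" and "name":
//   "owner": "data-team",
//   "storage": { "table": "profiles" }
```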

Cleaner enum declarations
In PDSC, the metadata of enum symbols must be specified in individual mappings that are separate from the main symbol list. As a result, defining complex enums in PDSC is unintuitive, and the result can be hard to read and maintain. In contrast, defining enum symbol metadata in PDL is quite intuitive: each doc string, custom property, and deprecation annotation can be placed right alongside the symbol.

Side-by-side comparison of PDSC and PDL with enum declarations
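
The difference can be sketched with a small enum. The enum name, symbols, and the `color` property are hypothetical. The trailing comment shows roughly how the same metadata is spread across separate mappings in PDSC.

```pdl
namespace com.example.models

/** A hypothetical enum, for illustration only. */
enum Fruit {
  /** A crisp fruit. */
  @color = "red"
  APPLE

  /** A tropical fruit. */
  BANANA

  @deprecated = "Use APPLE instead."
  PEAR
}

// Roughly equivalent PDSC (Fruit.pdsc), shown for comparison:
// {
//   "type": "enum",
//   "name": "Fruit",
//   "namespace": "com.example.models",
//   "doc": "A hypothetical enum, for illustration only.",
//   "symbols": [ "APPLE", "BANANA", "PEAR" ],
//   "symbolDocs": {
//     "APPLE": "A crisp fruit.",
//     "BANANA": "A tropical fruit."
//   },
//   "symbolProperties": { "APPLE": { "color": "red" } },
//   "deprecatedSymbols": { "PEAR": "Use APPLE instead." }
// }
```

In the PDL form, everything known about `APPLE` sits directly above the symbol. In the PDSC form, a reader has to look up the symbol in up to three separate maps.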

Benefits of PDL

Rich editor (IntelliJ) support
For PDSC, we have had limited IDE support. The experience of writing PDSC is similar to writing plain JSON text in an IDE.

Screenshot showing the limited IDE support in PDSC

With rich IDE (IntelliJ) support, PDL makes writing schemas much easier and reduces the likelihood of human error. The IntelliJ PDL plugin features include, but are not limited to:

1. Syntax highlighting, including warnings for errors and deprecation

Syntax highlighting makes the Pegasus schema more readable.

IntelliJ screenshot showing syntax highlighting

The error highlighting feature helps developers detect mistakes while they are writing a schema, rather than finding the errors later at build time.

IntelliJ screenshot showing error highlighting

2. Autocomplete suggestions for keywords and schema names

Autocomplete helps developers find the available keywords and schema names by offering suggestions as they write a schema.

IntelliJ screenshot showing autocomplete

3. Auto-suggestions to fix common errors, such as a mismatch between the file name and the schema name

IntelliJ screenshot showing auto-suggestions

IDE support helps developers reduce human errors while writing Pegasus schemas and improves the overall developer experience.

4. Readable syntax and rich editor support that reduce build-time errors

Graph showing failure counts before and after the move to PDL

As seen in the chart above, the move to PDL significantly reduced GenerateDataTemplate failure counts. GenerateDataTemplate is a Gradle task that generates Java classes from Pegasus schemas. The orange line indicates the failure counts over 2019, while the blue line indicates the failure counts in 2020. Compared year over year, the chart shows a clear drop in GenerateDataTemplate failures since the PDL migration began in the spring of 2020.

Migrating existing PDSC to PDL

Because PDL is fully compatible with PDSC, migrating PDSC to PDL can be done through an automated job. We provided a command line tool to help developers migrate existing PDSC files to the PDL format: running ./gradlew convertToPdl converts all existing schemas in a project from PDSC to PDL.

Considering that most repositories within LinkedIn used PDSC, it was important that the migration process be seamless. In addition to the command line tool, we introduced an automated migration tool that converted all the existing PDSC files to the PDL format and generated code reviews for the schema changes across all code repositories at LinkedIn. The schema owners only needed to review the generated Review Board requests and give the green light for the schema changes. Once a code review was approved, the change was automatically committed to the mainline. By leveraging the automated migration tool, we were able to migrate 90% of LinkedIn's PDSC schemas to PDL in one quarter.

Future plans

Our biggest focus in continuing to expand PDL will be to provide more comprehensive IntelliJ plugin support. The PDL IntelliJ plugin will be updated to provide extension points that allow other teams to implement context-specific autocomplete suggestions for custom schema annotations. Moreover, since the PDL schema format is more developer friendly and comes with comprehensive IntelliJ plugin support, we plan to completely replace all other schema formats (e.g., Avro) at LinkedIn with PDL. In the future, developers at LinkedIn will be able to use PDL to define any kind of schema, which can then be translated from PDL into other formats using the infrastructure we have built. Future updates can be found on GitHub.

Acknowledgements

A special thanks to the Rest.li development team for working on this project: Karthik Balasubramanian, Evan Williams, Junchuan Wang, and Aman Gupta. Thanks to the management team for supporting this project: Heather McKelvey, Goksel Genc, Maxime Lamure, and Qunzeng Liu.
