Smart Argument Suite: Seamlessly connecting Python jobs

Co-authors: Jun Jia and Alice Wu

Introduction

AI solutions are commonly built by composing different jobs, such as data processing, model training, and evaluation, into workflows that are then submitted to an orchestration engine for execution. At large companies such as LinkedIn, there may be hundreds of thousands of such executions per day, submitted and executed by multiple teams and engineers. At that scale, even small improvements in the tools used by machine learning engineers add up to significant productivity gains, which highlights the need for robust productivity infrastructure.

In most cases, these jobs are launched via the command line interface (CLI). Passing arguments through the CLI becomes a producer and consumer problem: on the workflow generation side, you need to produce a set of arguments that are passed to the CLI to launch the jobs; on the other side, the launched jobs consume the arguments passed from the CLI. We built Smart Argument Suite (smart-arg) to make this process standard, smooth, and safe while also being human-friendly.
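
To make the producer and consumer sides concrete, here is a minimal sketch of what passing arguments looks like when done by hand with only the standard library. The option names (learning_rate, layers) are hypothetical and chosen for illustration; this shows the problem smart-arg addresses, not smart-arg itself.

```python
import argparse

# Producer side (workflow generation): the command line is assembled by hand.
learning_rate = 0.01
layers = [128, 64]
argv = ["--learning_rate", str(learning_rate), "--layers", *map(str, layers)]

# Consumer side (the launched job): the same options are declared a second time,
# and nothing ties the two declarations together.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, required=True)
parser.add_argument("--layers", type=int, nargs="+", required=True)
args = parser.parse_args(argv)
```

The option names and types appear twice, once on each side, so a rename or a type change on one side can silently break the other.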

[Figure: GIF showing smart-arg coding in action]

Designing the Smart Argument Suite

Most popular open source AI packages, such as TensorFlow and PyTorch, are available in Python, as are orchestration engine SDKs such as Airflow, Kubeflow, and Cloudflow (a.k.a. Azkaban-ng). Many Python packages are available for CLI argument parsing, including argparse in the Python standard library, and all of them help on the consumer side. However, to the best of our knowledge, none of them offers any functionality on the producer side. Engineers at LinkedIn developed this slim Python library (smart-arg) to address both sides of the problem: producing a human-friendly CLI representation of the arguments, and consuming them consistently.

[Figure: smart-arg code example]

For ease of discussion, we assume the argument container is defined as a class in Python, and we call the conversion of an instance of such a class to and from a CLI-compatible form “serialization and deserialization” (SerDe).

[Figure: flowchart showing serialization and deserialization]
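
The round trip between a class instance and its CLI form can be sketched with the standard library alone. The container TrainArgs and the helpers to_argv and from_argv below are hypothetical names used for illustration; they are a simplified stand-in for the idea, not smart-arg's implementation, and they only handle simple scalar field types.

```python
import argparse
from typing import List, NamedTuple, get_type_hints


class TrainArgs(NamedTuple):  # hypothetical argument container
    model_name: str
    learning_rate: float
    epochs: int


def to_argv(args: TrainArgs) -> List[str]:
    """Serialize: emit one '--field value' pair per field."""
    argv: List[str] = []
    for name, value in args._asdict().items():
        argv += [f"--{name}", str(value)]
    return argv


def from_argv(argv: List[str]) -> TrainArgs:
    """Deserialize: build an argparse parser from the class's type annotations."""
    parser = argparse.ArgumentParser()
    for name, field_type in get_type_hints(TrainArgs).items():
        parser.add_argument(f"--{name}", type=field_type, required=True)
    return TrainArgs(**vars(parser.parse_args(argv)))


original = TrainArgs(model_name="ranker", learning_rate=0.05, epochs=3)
argv = to_argv(original)      # producer side
restored = from_argv(argv)    # consumer side
```

Because both directions are derived from the same class definition, the producer and the consumer cannot drift apart on option names or types.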

Why smart-arg?

There are many excellent existing choices for parsing command lines, such as Click, docopt, TAP, or the bare metal argparse/optparse, so you may be wondering, “Why smart-arg?”

The answer is simple: smart-arg is not (just) a command line parser. It’s also for creating and passing typed arguments through CLI as seamlessly as passing arguments through function calls. Its design goal is to hide all the low-level parsing/deserialization and the additional serialization work and let users directly work with typed Python objects. 

Why not Click, docopt, or …?

These options are perfectly fine for command line parsing and invocation of command line applications.

They can parse the command line (deserialization), but to the best of our knowledge, none of them offers a way to create command lines (serialization) programmatically. Their intended use case is for a human to manually type in those commands to run the utilities.

If you work with orchestration engine SDKs to create workflow pipelines or prefer not to manually type the command line, or just simply don’t want to worry about the parsing, smart-arg is here for you!

Principles

When designing our solution, we knew that we wanted to specifically address a few pain points. We formed these into principles that dictated how we approached the creation of smart-arg.

It should be simple
We wanted using our tool to be as simple as defining an argument container object and passing it through a function call. We felt it should give the user peace of mind when passing arguments through the CLI. It should let the user simply focus on how to define an argument container class that makes sense, instead of how to create a CLI using a raw argument parsing tool, such as argparse, or how to compose the command line correctly.

smart-arg allows you to simply define your argument container class “ArgClass” as a NamedTuple or dataclass and annotate it with the decorator @arg_suite. Then, voilà: for an instance arg_class, “arg_class.__to_argv__()” gives the serialized form for the CLI, while “ArgClass.__from_argv__()” deserializes the command line into the corresponding “ArgClass” instance.

It should be safe
We wanted our tool to have a verifiable and testable systematic SerDe process with certain safety guarantees, including type-safety. We wanted it to help users minimize human errors around the argument handling. Given that Python is a dynamic language, our solution would need to maximize the utilization of all the existing tools to improve type-safety.

smart-arg deploys the well-trusted Python standard library argparse under the hood for deserializations and keeps the corresponding serialization process well-tested.

smart-arg enables IDE code autocompletion and type hints by building on the commonly used, typed, and immutable NamedTuple and dataclass from the Python standard library, which helps users spot errors early. It also validates each field value against its declared type, both when parsing with argparse and when instantiating the container class.
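
As a rough illustration of validating field values against their declared types, the sketch below checks a NamedTuple instance with isinstance. JobArgs and validate_field_types are hypothetical names, and the check only covers plain classes such as int or str (not generics such as List[int]); smart-arg's own validation may work differently.

```python
from typing import NamedTuple, get_type_hints


class JobArgs(NamedTuple):  # hypothetical argument container
    input_path: str
    num_workers: int


def validate_field_types(args: JobArgs) -> None:
    """Raise TypeError if a field value is not an instance of its declared type."""
    for name, declared in get_type_hints(type(args)).items():
        value = getattr(args, name)
        if not isinstance(value, declared):
            raise TypeError(
                f"{name}: expected {declared.__name__}, got {type(value).__name__}"
            )


# A well-typed instance passes silently.
validate_field_types(JobArgs(input_path="/data/in", num_workers=4))

# NamedTuple itself does not enforce annotations, so a wrong type slips through
# construction and is only caught by the explicit check.
try:
    validate_field_types(JobArgs(input_path="/data/in", num_workers="4"))
    message = ""
except TypeError as err:
    message = str(err)
```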

It should be human-friendly
We have mentioned this phrase multiple times now. Why? Because it’s important! There is always a need for human intervention with the workflows, whether by an AI engineer, a devops practitioner, or an SRE. We need to make it easy for people to do inspection or debugging on the serialized form.

smart-arg serializes an argument container class instance to a sequence of strings that is compatible with the standard Python library argparse, which can be easily inspected by human eyes.

It should be extensible
A user should be able to extend the support to the argument container classes when desirable.

They should also be able to extend the support to their own types of the fields of any argument container classes.

smart-arg supports NamedTuple and dataclass out of the box, and other classes by implementing a simple interface. To extend the support to any additional field types, type handlers can be implemented for the SerDe process.
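
One way to picture per-type handlers is a registry that maps a field type to a pair of functions, one for each direction of the SerDe process. Everything named below (HANDLERS, register_handler, serialize, deserialize) is hypothetical and only illustrates the pattern; it is not smart-arg's handler interface.

```python
from datetime import date
from typing import Any, Callable, Dict, Tuple

# Maps a field type to a (to_string, from_string) pair.
Handler = Tuple[Callable[[Any], str], Callable[[str], Any]]
HANDLERS: Dict[type, Handler] = {
    int: (str, int),
    str: (str, str),
}


def register_handler(field_type: type, to_str: Callable[[Any], str],
                     from_str: Callable[[str], Any]) -> None:
    """Extend the SerDe process to a new field type."""
    HANDLERS[field_type] = (to_str, from_str)


def serialize(value: Any) -> str:
    to_str, _ = HANDLERS[type(value)]
    return to_str(value)


def deserialize(field_type: type, text: str) -> Any:
    _, from_str = HANDLERS[field_type]
    return from_str(text)


# Adding support for dates takes one registration and no change to the core logic.
register_handler(date, date.isoformat, date.fromisoformat)

text = serialize(date(2021, 3, 1))
restored = deserialize(date, text)
```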

Implementation and usage

[Figure: chart showing the components of smart-arg]

The general working principle of smart-arg

For each supported argument container class (NamedTuple or dataclass by default), there is a proxy class to define the communication to the actual container class. For any supported types, there are corresponding TypeHandlers to specify the SerDe process for those types.

Users only need to define a Python NamedTuple or dataclass that declares all the argument options, if the arguments are not already modeled this way, and then decorate the container class with @arg_suite. With the decorator in place, smart-arg dynamically decomposes every field (with experimental support for nested container classes) into corresponding argparse arguments; argparse is the standard library module that Python users commonly rely on to parse command line arguments. When parsing a command line, smart-arg composes an instance of the defined container class (either NamedTuple or dataclass).

The result is mostly type safe, because smart-arg casts each command line string to the type declared in the decorated Python class. Referencing an option is also much easier than before, because the IDE autocompletes and offers hints whenever a user accesses an argument option. So, users can finally say “goodbye” to the miserable experience of memorizing all the argument options.

In addition, smart-arg provides several add-ons, such as systematic post-validation and user-defined post-initialization of the arguments. It is also extensible: users can define their own parsing behaviors by extending the base classes that smart-arg provides. In short, smart-arg is a simple, safe, user-friendly, extensible Python library that can benefit the day-to-day work of AI engineers and others.
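
Post-initialization and post-validation can be illustrated with the standard dataclass hook __post_init__. The sketch below uses only the standard library and a hypothetical EvalArgs class; how smart-arg wires in its own post-processing and validation hooks is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalArgs:  # hypothetical argument container
    model_dir: str
    output_dir: Optional[str] = None
    batch_size: int = 32

    def __post_init__(self) -> None:
        # Post-initialization: derive a default from another field.
        # A frozen dataclass blocks normal assignment, hence object.__setattr__.
        if self.output_dir is None:
            object.__setattr__(self, "output_dir", f"{self.model_dir}/eval")
        # Post-validation: reject values that are well-typed but not meaningful.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")


args = EvalArgs(model_dir="/models/a")
```

Keeping these checks on the container class means they run the same way whether the instance is built in code or reconstructed from a command line.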

Caveats
The SerDe process is not as universally applicable to arbitrary argument container classes as a general-purpose standard such as JSON. However, we believe the provided default type support already covers the majority of use cases.

To preserve human-friendliness, the serialized form keeps all user inputs and field values intact and inspectable rather than encoding them to be fully CLI compatible, so there is a chance that the CLI might be confused by special characters, such as quotation marks.
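
The special-character caveat can be reproduced with the standard shlex module, which tokenizes a string the way a POSIX shell would. The argument values are made up for illustration, and shlex.join is shown only as one possible way to escape on POSIX shells; it is not something smart-arg is stated to do.

```python
import shlex

argv = ["--query", 'say "hi" to everyone', "--limit", "10"]

# Joining with plain spaces keeps the text readable but loses the token boundaries:
# the shell strips the quotation marks and splits the value on whitespace.
naive_command = " ".join(argv)
naive_tokens = shlex.split(naive_command)

# shlex.join (Python 3.8+) quotes each token so that the round trip is lossless.
quoted_command = shlex.join(argv)
quoted_tokens = shlex.split(quoted_command)
```

Here naive_tokens no longer matches argv, while quoted_tokens does, at the cost of a command line with extra quoting that is slightly harder to read.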

Current status and future work

smart-arg has been released to PyPI, and the source code is on GitHub. It is already being battle-tested in LinkedIn's open source AI solutions: the deep personalization framework GDMix and the deep NLU ranking and classification framework DeText.

There is still work that we foresee in the future, such as:

  • Adding escaping to make the serialization safer with CLI. Please reach out if you have a good solution to this problem, or better yet: create a PR! 

  • Expanding beyond the language boundaries; for example, there are many Scala Spark jobs in LinkedIn’s AI ecosystem, and it is desirable that we be able to seamlessly integrate between Python and JVM worlds.

We’re looking forward to collaboration with the open source community to make smart-arg a useful tool.

Acknowledgements

Thanks to our open-source guru Christopher Eppstein for the help all along the open source journey, and Python veteran Barry Warsaw for providing valuable feedback to improve the quality of the project all around.