Microsoft open-sources ONNX Runtime model to speed up Google’s BERT

Microsoft Research AI today said it plans to open-source an optimized version of Google’s popular BERT natural language model, designed to work with the ONNX Runtime inference engine. Microsoft uses the same model to lower BERT’s latency when powering language representation for the Bing search engine. The model, which “delivers its largest improvement in search experience” for Bing users, was detailed in a blog post last fall.

This means developers can deploy BERT at scale using ONNX Runtime and an Nvidia V100 GPU with as little as 1.7 milliseconds of latency, a level of performance previously available in production only to large tech companies, a company spokesperson told VentureBeat in an email.
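For readers who want a sense of what deploying such a model looks like, below is a minimal sketch of running an ONNX-format BERT model with the onnxruntime Python package. The model file name and input tensor names are illustrative assumptions; they depend on how the model was exported.

```python
# Sketch: running an ONNX-exported BERT model with ONNX Runtime.
# "bert-base.onnx" and the input names below are hypothetical examples.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "bert-base.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs for a batch of one sequence of length 128.
batch, seq_len = 1, 128
inputs = {
    "input_ids": np.zeros((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}

outputs = session.run(None, inputs)
print([o.shape for o in outputs])
```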

Microsoft joined Facebook to create ONNX in 2017 to fuel interoperability across AI hardware like semiconductors and software like machine learning frameworks. The BERT-optimized tool joins a number of ONNX Runtime accelerators, such as ones for Nvidia TensorRT and Intel’s OpenVINO. Using the ONNX standard means the optimized models can run with PyTorch, TensorFlow, and other popular machine learning frameworks.
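As a rough illustration of that interoperability, the sketch below exports a PyTorch BERT model to the ONNX format so it can then be served by ONNX Runtime. It assumes the Hugging Face transformers package is available; the checkpoint name, output file, and axis labels are illustrative.

```python
# Sketch: exporting a PyTorch BERT model to ONNX (assumes transformers is installed).
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Trace with a dummy batch of one 128-token sequence.
dummy_ids = torch.zeros(1, 128, dtype=torch.long)
torch.onnx.export(
    model,
    (dummy_ids,),
    "bert-base.onnx",
    input_names=["input_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
    opset_version=11,
)
```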

The work is the result of a collaboration between Azure AI and Microsoft AI and Research.

“Since the BERT model is mainly composed of stacked transformer cells, we optimize each cell by fusing key sub-graphs of multiple elementary operators into single kernels for both CPU and GPU, including Self-Attention, LayerNormalization and Gelu layers. This significantly reduces memory copy between numerous elementary computations,” Microsoft senior program manager Emma Ning said today in a blog post.
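In practice, ONNX Runtime applies fusions like these through its graph optimization passes. A minimal sketch of turning those optimizations on is shown below; the file paths are illustrative, and saving the optimized graph is optional but makes the fused operators visible for inspection.

```python
# Sketch: enabling ONNX Runtime graph optimizations, which apply fused kernels
# of the kind described in the post. File names are hypothetical.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally write out the optimized (fused) graph for inspection.
opts.optimized_model_filepath = "bert-base.optimized.onnx"

session = ort.InferenceSession("bert-base.onnx", opts)
```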

This is the most recent leap forward in natural language for Microsoft, but not its first attempt to make Google’s BERT better. About a year ago, Microsoft AI researchers also released MT-DNN, a Transformer-based model that set new performance records on the GLUE language understanding benchmark.

Top minds in machine learning who spoke with VentureBeat about trends for 2020 called the advances that Transformer-based models like BERT and MT-DNN brought to natural language tasks such as text generation one of the most important stories in AI in 2019.

In other natural language developments at Microsoft, last month at NeurIPS in Vancouver, Microsoft and Zhejiang University shared FastSpeech, a model that seeks to improve the performance of text-to-speech models. In summer 2019, Microsoft introduced Icecaps, a toolkit for creating conversational AI assistants with multiple personas.
