Uber details Fiber, a framework for distributed AI model training

A preprint paper coauthored by Uber AI scientists and Jeff Clune, a research team leader at San Francisco startup OpenAI, describes Fiber, an AI development and distributed training platform for methods including reinforcement learning (which spurs AI agents to complete goals via rewards) and population-based learning. The team says that Fiber expands the accessibility of large-scale parallel computation without the need for specialized hardware or equipment, enabling non-experts to reap the benefits of approaches such as genetic algorithms, in which populations of agents evolve rather than individual members.

As the researchers point out, increasing computation underlies many recent advances in machine learning, with more and more algorithms relying on distributed training to process enormous amounts of data. (OpenAI Five, OpenAI’s Dota 2-playing bot, was trained on 256 graphics cards and 128,000 processor cores on Google Cloud.) But reinforcement learning and population-based methods pose challenges for reliability, efficiency, and flexibility that existing frameworks fall short of satisfying.

Fiber addresses these challenges with a lightweight strategy to handle task scheduling. It leverages cluster management software for job scheduling and tracking, doesn’t require preallocating resources, and can dynamically scale up and down on the fly, allowing users to migrate from one machine to multiple machines seamlessly.

Fiber comprises an API layer, backend layer, and cluster layer. The first layer provides basic building blocks for processes, queues, pools, and managers, while the backend handles tasks like creating and terminating jobs on different cluster managers. As for the cluster layer, it taps different cluster managers to help manage resources and keep tabs on different jobs, reducing the number of items Fiber needs to track.
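The paper describes that API layer as mirroring Python’s standard multiprocessing module, so existing single-machine code needs little change to run on a cluster. Below is a minimal sketch of what that might look like, assuming a multiprocessing-style fiber.Pool as described; the evaluate function is a hypothetical stand-in for evaluating one member of a population.

```python
# Minimal sketch, assuming Fiber's multiprocessing-style API (fiber.Pool
# mirroring multiprocessing.Pool); evaluate() is a hypothetical placeholder.
import fiber


def evaluate(seed):
    # Stand-in for a fitness evaluation of one population member.
    return seed * seed


def main():
    # Workers in the pool can be scheduled as cluster jobs rather than
    # local OS processes, so the same code scales beyond one machine.
    pool = fiber.Pool(processes=4)
    print(pool.map(evaluate, range(16)))


if __name__ == "__main__":
    main()
```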

Fiber introduces the concept of job-backed processes, where processes can run remotely on different machines or locally on the same machine, and it makes use of containers to encapsulate the running environment (e.g., required files, input data, and dependent packages) of current processes to ensure everything is self-contained. Helpfully, Fiber does this while directly interacting with computer cluster managers, eliminating the need to configure it on multiple machines.
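A short sketch of how job-backed processes might be used is below, assuming fiber.Process keeps multiprocessing’s start/join interface as the paper describes; the rollout function is a hypothetical placeholder.

```python
# Minimal sketch, assuming fiber.Process mirrors multiprocessing.Process;
# rollout() is a hypothetical worker function, not part of Fiber itself.
import fiber


def rollout(worker_id):
    # In a real RL setup this would run environment episodes and report results.
    print(f"worker {worker_id} running, possibly on a remote machine")


def main():
    # Each Process is backed by a job submitted to the cluster manager, so the
    # same script works locally or across many machines without reconfiguration.
    procs = [fiber.Process(target=rollout, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```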

In experiments, Fiber had a response time of a couple of milliseconds. With a population size of 2,048 workers (e.g., processor cores), it scaled better than two baseline techniques, with run time gradually decreasing as the number of workers increased (in other words, a run with the full 2,048 workers finished faster than one with 32 workers).

“[Our work shows] that Fiber achieves many goals, including efficiently leveraging a large amount of heterogeneous computing hardware, dynamically scaling algorithms to improve resource usage efficiency, reducing the engineering burden required to make [reinforcement learning] and population-based algorithms work on computer clusters, and quickly adapting to different computing environments to improve research efficiency,” wrote the coauthors. “We expect it will further enable progress in solving hard [reinforcement learning] problems with [reinforcement learning] algorithms and population-based methods by making it easier to develop these methods and train them at the scales necessary to truly see them shine.”

Fiber’s reveal comes after the release of SEED RL, a Google framework that scales AI model training to thousands of machines. Google said that it could facilitate training at millions of frames per second per machine while reducing costs by up to 80%, potentially leveling the playing field for startups that couldn’t previously compete with large AI labs.
