
Determined AI Announces the Future of AI Infrastructure

venkat k

We are entering the golden age of AI. Model-driven, statistical AI has already made remarkable progress on applications such as computer vision, speech recognition, and machine translation, with countless more use cases on the horizon.

If AI is the linchpin of a new era of innovation, why does it feel like the infrastructure it is built on is stuck in the 20th century? Worse, why is the sophisticated AI tooling that does exist locked inside the walls of a few billion-dollar tech companies, unavailable to everyone else?

Today we are officially introducing Determined AI, which enables AI engineers everywhere to focus on models rather than infrastructure. Determined AI is backed by GV (formerly Google Ventures), Amplify Partners, CRV, Haystack, SV Angel, and The House Fund.

We are in the dark age of AI infrastructure

AI, and especially deep learning (DL), is becoming a critical computational workload for businesses across all industries. For example, DL has dramatically improved the performance of autonomous vehicles at Waymo, powers the speech synthesis behind Siri, Apple's personal assistant, and has revolutionized Facebook's ability to understand user sentiment. These applications, launched by some of the most advanced technology companies, speak to the power of DL, but that power needs to be available to a far wider range of businesses and developers.

These organizations share a key advantage in harnessing the power of deep learning: they have built sophisticated, AI-native infrastructure for internal use. Everyone else has to make do with existing tools, which are a poor fit for AI-based application development because it differs fundamentally from traditional software development. In practice, most engineers today are forced to cobble together narrow point solutions on top of generic infrastructure into ad hoc, multi-stage workflows. These point solutions lead to enormous complexity and huge losses of time and productivity. Consequently, organizations that depend on advances in AI (anyone working with vision, speech, or natural language) are at risk without a radically new approach to AI infrastructure.

Our focus:

At Determined AI, our goal is to enable deep learning development at the speed of thought. We build specialized software that directly addresses the challenges DL developers face every day. Here is what engineers can expect from Determined AI:

Fast:

In traditional software development, a user changes a file and recompiles almost instantly, but in DL, creating a new high-quality model can take hundreds of thousands of GPU-hours. Our DL-aware scheduling system enables cluster sharing with workload elasticity, fault tolerance, and sub-second scheduling latency.
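The scheduling idea can be sketched in a few lines. This is a minimal illustration under assumed simplifications (one step runs at a time, a fair-share policy keyed on GPU time consumed), not Determined AI's actual scheduler: because DL training decomposes into many short steps, the cluster can be reallocated between steps with low latency instead of waiting for a multi-hour job to finish.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical sketch of DL-aware fair-share scheduling: jobs are split
# into short preemptible "steps" (e.g. a few training batches), and the
# scheduler repeatedly runs one step of the least-served job.

@dataclass(order=True)
class Job:
    gpu_seconds_used: float            # priority key: least-served runs first
    name: str = field(compare=False)
    steps_left: int = field(compare=False)

def run_cluster(jobs, step_seconds=1.0):
    """Return the order in which job steps are executed."""
    heap = list(jobs)
    heapq.heapify(heap)
    trace = []
    while heap:
        job = heapq.heappop(heap)      # job with the least GPU time so far
        trace.append(job.name)         # run one short step of this job
        job.steps_left -= 1
        job.gpu_seconds_used += step_seconds
        if job.steps_left > 0:
            heapq.heappush(heap, job)  # requeue with updated priority
    return trace

trace = run_cluster([Job(0.0, "vision-model", 2), Job(0.0, "nlp-model", 2)])
print(trace)  # the two jobs' steps interleave, sharing the cluster fairly
```

Because preemption happens at step boundaries rather than job boundaries, a newly submitted job starts receiving GPU time within one step duration, which is the intuition behind sub-second scheduling latency.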

Seamless:

Today, DL researchers are forced to cobble together narrow tools on top of generic, inefficient infrastructure. The results are painful: DL modelers have to worry about tasks such as setting up parameter servers, configuring MPI, and understanding Kubernetes. We started from the problems DL researchers face today and worked backward to provide a dramatically easier, seamless, integrated environment than traditional tools offer.

Fully Interoperable:

Determined AI frees your DL investment from the risk of cloud or hardware lock-in. You get a best-in-class DL solution that runs on whatever hardware you choose, works with all the popular DL frameworks, and integrates with your existing enterprise software environment.

Building the first AI-native infrastructure platform

Given how important deep learning has become, and how much DL differs from traditional computational workloads, it is time to rethink how we build AI infrastructure from the ground up.

To achieve this, we started by gathering the right people. We believe that building AI-native infrastructure requires a rare combination of skills: a deep understanding of modern AI workloads, along with expertise in building large-scale, data-intensive systems. Our team consists of world-leading experts in both domains, and an important first step was to create an environment where people from both groups can collaborate and co-design the system.

Next, we follow two key design principles.

Holistic Design:

Traditional DL tools usually focus on solving a single narrow problem (e.g., distributed training of a single neural network). This forces DL researchers to integrate disparate tools to complete their work, often building elaborate ad hoc plumbing along the way. In contrast, we have thought carefully about the key high-level tasks DL researchers need to perform, such as training a single model, hyperparameter tuning, batch inference, experimentation, collaboration, reproducibility, and deployment to a production environment, and we have built first-class support for those workflows directly into our software.
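To make one of those workflows concrete, consider reproducibility. The sketch below is purely illustrative (the function names and config fields are ours, not Determined AI's API): the idea is that an experiment is fully described by its configuration, so the same config always yields the same experiment ID and the same random initialization.

```python
import hashlib
import json
import random

# Hypothetical sketch of first-class reproducibility: derive a stable
# experiment ID from the configuration, and seed all randomness from it.

def experiment_id(config: dict) -> str:
    """Stable ID: identical configs always hash to the same ID."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def run_experiment(config: dict):
    exp_id = experiment_id(config)
    rng = random.Random(config["seed"])   # seeded RNG: reruns are identical
    initial_weights = [rng.gauss(0, 0.1) for _ in range(4)]
    return exp_id, initial_weights

config = {"model": "cnn", "learning_rate": 0.01, "seed": 42}
id1, w1 = run_experiment(config)
id2, w2 = run_experiment(dict(config))    # rerun from the same config
assert (id1, w1) == (id2, w2)             # fully reproducible
```

When the platform owns the config, the seed, and the ID in this way, collaboration and experiment tracking come almost for free: two researchers running the same config are provably running the same experiment.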

Specialization:

Deep learning requires applying enormous computational resources to large data sets. Although systems such as Spark and Hadoop work well for traditional analytics workloads, they do not address the unique challenges posed by deep learning. AI-native infrastructure must provide seamless integration with efficient GPU and TPU access, high-performance networking, and deep learning toolkits such as TensorFlow and PyTorch.

Combining these principles, by building a holistic platform specialized for the unique challenges of deep learning, yields huge improvements in both performance and usability. For example, many companies employ cluster schedulers such as Kubernetes, Mesos, or YARN, which can be used to run deep learning workloads.

However, traditional cluster schedulers and popular DL frameworks were designed independently, resulting in poor performance and usability. In contrast, we have developed a specialized GPU scheduler that natively understands critical deep learning workloads, including distributed training, hyperparameter tuning, and batch inference. This yields dramatically better performance: for example, our software performs hyperparameter tuning up to 50x faster than traditional methods! In addition, DL workloads on our platform automatically support fault tolerance and dynamic elasticity, and can scale from on-premise resources to on-demand cloud capacity.
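Speedups of this magnitude typically come from adaptive search methods in the successive-halving/Hyperband family rather than from raw hardware. The sketch below is a simplified successive-halving illustration under a toy scoring function, not Determined AI's implementation: instead of training every configuration to completion, all candidates train briefly, the best half survive, and survivors receive a doubled budget.

```python
# Simplified successive-halving sketch (an illustration, not the product's
# API): spend most of the compute budget on the most promising candidates.

def successive_halving(configs, train, total_rungs=3):
    budget = 1
    while len(configs) > 1 and total_rungs > 0:
        scored = [(train(c, budget), c) for c in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
        configs = [c for _, c in scored[: max(1, len(scored) // 2)]]
        budget *= 2                    # survivors train twice as long
        total_rungs -= 1
    return configs[0]

# Toy "training" function: score grows with budget, peaks at lr = 0.1.
def train(config, budget):
    return budget * (1.0 - abs(config["lr"] - 0.1))

candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
best = successive_halving(candidates, train)
print(best)  # → {'lr': 0.1}
```

The efficiency gain comes from the budget schedule: poor configurations are discarded after a cheap trial, so most GPU-hours go to the few candidates that remain competitive at larger budgets.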

The road ahead

Although we announced the company today, our software product has been running on customers' GPU clusters for over a year. Our customers have told us that we have already saved hundreds of engineering hours and hundreds of thousands of GPU-hours across their teams. However, there is still much work to be done to reinvent the software stack for the AI-native era, and we are excited to build that future together with our customers.

There are many reasons to be optimistic about the enormous potential of AI, but to realize that potential, AI development must become as accessible as software development is today. People need to be able to apply AI to the problems they are working to solve, and we are excited to be part of that journey.
