Powered by Solana

Putting together one million GPUs in a DePIN (decentralized physical infrastructure network)

io.net Cloud is a state-of-the-art decentralized computing network that gives machine learning engineers access to distributed cloud clusters at a small fraction of the cost of comparable centralized services.

Modern machine learning models frequently leverage parallel and distributed computing. To optimize performance or scale to larger datasets and models, it's crucial to harness the power of multiple cores across several systems. Training and inference processes are not just simple tasks running on a single device, but often involve a coordinated network of GPUs that work in synergy.

Unfortunately, due to the shortage of GPUs in the public cloud, obtaining access to distributed computing resources presents several challenges. Some of the most prominent are:

  • Limited Availability: It can often take weeks to get access to hardware through cloud services like AWS, GCP or Azure, and popular GPU models are often unavailable.
  • Poor Choice: Users have little choice in terms of GPU hardware, location, security level, latency, etc.
  • High Costs: Getting good GPUs is extremely expensive, and projects can easily spend hundreds of thousands of dollars per month on training and inference.

io.net solves this problem by aggregating GPUs exclusively from underutilized sources such as independent data centers, crypto miners, and crypto projects like Filecoin and Render. These resources are combined within a Decentralized Physical Infrastructure Network (DePIN), giving engineers access to massive amounts of computing power in a system that is accessible, customizable, cost-efficient and easy to implement.

With io.net, teams can scale their workloads across a network of GPUs with minimal adjustments. The system handles orchestration, scheduling, fault tolerance, and scaling, and supports a variety of tasks such as preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving. It is designed to serve general-purpose computation for Python workloads.

The io.net offering is purpose-built for four core functions:

  • Batch Inference and Model Serving: Performing inference on incoming batches of data can be parallelized by exporting the architecture and weights of a trained model to the shared object store. io.net allows machine learning teams to build out inference and model serving workflows across a distributed network of GPUs.
  • Parallel Training: CPU/GPU memory limitations and sequential processing workflows present a massive bottleneck when training models on a single device. io.net leverages distributed computing libraries to orchestrate and batch training jobs so that they can be parallelized across many distributed devices using data and model parallelism.
  • Parallel Hyperparameter Tuning: Hyperparameter tuning experiments are inherently parallel, and io.net leverages distributed computing libraries with advanced hyperparameter tuning features that make it simple to checkpoint the best result, optimize scheduling, and specify search patterns.
  • Reinforcement Learning: io.net uses an open-source library for reinforcement learning, which provides support for production-level, highly distributed RL workloads alongside a simple set of APIs.
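
To see why hyperparameter tuning parallelizes so naturally, here is a minimal sketch using only the Python standard library. It is not io.net's API and does not use the distributed computing libraries mentioned above: the function names `validation_loss` and `grid_search` are hypothetical, the loss is a made-up stand-in for a real training run, and local threads stand in for remote GPU workers. Each trial is independent, so all of them can be dispatched at once and the best result picked at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(learning_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in for one training run; returns a made-up loss."""
    return (learning_rate - 0.01) ** 2 + abs(batch_size - 64) / 1000


def grid_search(learning_rates, batch_sizes, max_workers=4):
    """Evaluate every combination concurrently and return the best trial."""
    # Every (learning_rate, batch_size) pair is an independent trial.
    grid = list(product(learning_rates, batch_sizes))
    # Assumption: threads play the role of workers in a GPU cluster.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(lambda cfg: validation_loss(*cfg), grid))
    # Keep the configuration with the lowest loss.
    best_loss, best_cfg = min(zip(losses, grid))
    return best_cfg, best_loss


if __name__ == "__main__":
    cfg, loss = grid_search([0.001, 0.01, 0.1], [32, 64, 128])
    print(cfg, loss)
```

On a real cluster, each trial would be a full training job scheduled on its own device, but the shape of the workflow is the same: fan out independent trials, collect their scores, keep the best.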

It all started at the Solana Hackathon in February 2023 and the Solana Austin Hacker House.