Our Mission: Putting together one million GPUs in a DePIN - decentralized physical infrastructure network.
io.net Cloud is a state-of-the-art decentralized computing network that allows machine learning engineers to access distributed Cloud clusters at a small fraction of the cost of comparable centralized services.
Modern machine learning models frequently leverage parallel and distributed computing. It is crucial to harness the power of multiple cores across several systems to optimize performance or scale to larger datasets and models. Training and inference processes are not just simple tasks running on a single device but often involve a coordinated network of GPUs that work in synergy.
Unfortunately, due to the need for more GPUs in the public cloud, obtaining access to distributed computing resources presents several challenges. Some of the most prominent are:
- Limited Availability: It can often take weeks to get access to hardware using cloud services like AWS, GCP or Azure, and popular GPU models are often unavailable.
- Poor Choice: Users have little choice regarding GPU hardware, location, security level, latency and other options.
- High Costs: Getting good GPUs is extremely expensive, and projects can easily spend hundreds of thousands of dollars monthly on training and inferencing.
io.net solves this problem by aggregating GPUs from underutilized sources such as independent data centres, crypto miners, and crypto projects like Filecoin, Render and others. These resources are combined within a Decentralized Physical Infrastructure Network (DePIN), giving engineers access to massive amounts of computing power in a system that is accessible, customizable, cost-efficient and easy to implement.
With io.net, teams can scale their workloads across a network of GPUs with minimal adjustments. The system handles orchestration, scheduling, fault tolerance, and scaling and supports a variety of tasks such as preprocessing, distributed training, hyperparameter tuning, reinforcement learning, and model serving. It is designed to serve general-purpose computation for Python workloads.
io.net offering is purpose-built for four core functions:
- Batch Inference and Model Serving: Performing inference on incoming batches of data can be parallelized by exporting the architecture and weights of a trained model to the shared object-store. io.net allows machine learning teams to build out inference and model-serving workflows across a distributed network of GPUs.
- Parallel Training: CPU/GPU memory limitations and sequential processing workflows present a massive bottleneck when training models on a single device. io.net leverages distributed computing libraries to orchestrate and batch-train jobs such that they can be parallelized across many distributed devices using data and model parallelism.
- Parallel hyperparameter tuning: Hyperparameter tuning experiments are inherently parallel, and io.net leverages distributed computing libraries with advanced Hyperparam tuning for checkpointing the best result, optimizing scheduling, and specifying search patterns simply.
- Reinforcement learning: io.net uses an open-source reinforcement learning library, which supports production-level, highly distributed RL workloads alongside a simple set of APIs.
Updated about 1 month ago