Troubleshooting Guide
GPU-Enabled Docker Platform Troubleshooting Guide
This guide helps you troubleshoot a GPU-enabled Docker platform setup on Linux and Windows. It provides a structured approach to confirm that the platform is correctly configured and functioning.
Initial Setup for Linux
- Proper Installation Process:
- Run the setup from the website for the first time. This step installs the necessary drivers.
- After the first installation, reboot your Linux system.
- Perform the setup process a second time, which will install Docker.
Verifying the Setup on Linux and Windows
- Confirmation Command:
- To verify that your setup is functioning correctly, execute:
docker run --gpus all nvidia/cuda:11.0.3-base-ubuntu18.04 nvidia-smi
- The output should be similar to what nvidia-smi prints when run directly on the host.
- This command checks whether Docker can access your GPU.
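The check above can be wrapped in a small function that reports a clear pass or fail. This is only a sketch and not part of the official setup: the name verify_gpu_docker is hypothetical, the added --rm flag (which removes the test container afterwards) is our choice, and it assumes docker is on the PATH and the current user may run it (otherwise prefix the docker call with sudo).

```shell
# Hypothetical helper: runs the confirmation command from this guide and
# reports whether Docker could reach the GPU.
verify_gpu_docker() {
    image="nvidia/cuda:11.0.3-base-ubuntu18.04"
    # --rm removes the short-lived test container once nvidia-smi exits.
    if docker run --rm --gpus all "$image" nvidia-smi; then
        echo "PASS: Docker can access the GPU."
    else
        echo "FAIL: confirmation command did not succeed." >&2
        return 1
    fi
}
```

After defining the function, call verify_gpu_docker. A non-zero exit status means the confirmation command failed and the troubleshooting steps apply.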
Troubleshooting Failed Setup on Linux
- Using the Reset Script (end of page):
- If the confirmation command fails, use the reset_drivers_and_docker script:
chmod +x reset_drivers_and_docker.sh
./reset_drivers_and_docker.sh
- After running the script, restart your device.
- Rerun the setup from the website. After an automatic restart, rerun the setup to complete the installation.
- If the confirmation command fails again after these steps, seek assistance on the community support channel.
Stopping the Platform
- Windows (Using PowerShell):
- To stop and remove all containers, execute:
docker ps -a -q | ForEach { docker rm -f $_ }
- Linux (Using Terminal):
- To stop and remove all containers, use:
sudo docker stop $(sudo docker ps -a -q); sudo docker rm $(sudo docker ps -a -q)
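On Linux, docker stop and docker rm print an error when they are given an empty list of IDs, which happens if no containers exist. The sketch below guards against that case. The function name stop_all_containers is hypothetical, and it assumes the current user can run docker without sudo (add sudo to each docker call otherwise).

```shell
# Hypothetical helper: stop and remove every container, and do nothing
# when no containers exist.
stop_all_containers() {
    # -a includes stopped containers, -q prints only the IDs.
    ids="$(docker ps -a -q)"
    if [ -z "$ids" ]; then
        echo "No containers to remove."
        return 0
    fi
    # $ids is intentionally unquoted so each ID becomes its own argument.
    docker stop $ids && docker rm $ids
}
```

Call stop_all_containers before rerunning the docker run -d ... command from the website.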
Restarting the Platform After Reboot
- After rebooting your computer or server, the platform will require a restart.
- Rerun the same command provided on the website during the initial setup (it looks like docker run -d ...).
NOTE (IMPORTANT):
Make sure you are not running two instances of io-worker-vc.
How to check this:
docker ps
If two containers are running the same image, io-worker-vc, the platform will malfunction. In that case the output looks like the following:
~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
87b1b066bdfa ionetcontainers/io-worker-monitor "tail -f /dev/null" 3 seconds ago Up 2 seconds agitated_hawking
7033c1b8feba ionetcontainers/io-worker-vc "sudo -E /srp/invoke…" 8 seconds ago Up 8 seconds friendly_ritchie
67f699e12c2e ionetcontainers/io-worker-vc "sudo -E /srp/invoke…" 10 seconds ago Up 8 seconds sleepy_feynman
How to fix this?
Stop and remove all Docker containers (see Stopping the Platform above), then run the docker run -d ... command from the website only ONCE to start the platform normally.
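The duplicate check can also be done without reading the table by eye. The sketch below counts how many running containers use a given image. The function name count_image is hypothetical. It relies on docker ps --format '{{.Image}}', which prints one image name per line, and treats a name with a tag suffix (for example :latest) as a match too.

```shell
# Hypothetical helper: reads image names (one per line) on stdin and prints
# how many match the image given as the first argument, with or without a tag.
count_image() {
    awk -v img="$1" '
        $0 == img || index($0, img ":") == 1 { n++ }
        END { print n + 0 }
    '
}

# Usage (requires Docker):
#   docker ps --format '{{.Image}}' | count_image ionetcontainers/io-worker-vc
```

A result greater than 1 means more than one io-worker-vc container is running, and the fix described above applies.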
Extra Guides
Which ports need to be open on the firewall for the platform to function properly? (both Linux and Windows)
- TCP: 443, 25061, 5432, 80
- UDP: 80, 443, 41641, 3478
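As an illustration for Linux hosts, the sketch below prints one firewall command per listed port. It rests on two assumptions that this guide does not confirm: that the host uses ufw as its firewall, and that allowing inbound traffic on these ports is what is required. The function name print_ufw_rules is hypothetical, and the function only prints commands without executing anything.

```shell
# Ports listed in this guide.
TCP_PORTS="443 25061 5432 80"
UDP_PORTS="80 443 41641 3478"

# Hypothetical helper: prints one ufw command per port. Nothing is executed.
# Assumes ufw is the firewall in use and that inbound rules are needed.
print_ufw_rules() {
    for port in $TCP_PORTS; do
        echo "sudo ufw allow ${port}/tcp"
    done
    for port in $UDP_PORTS; do
        echo "sudo ufw allow ${port}/udp"
    done
}
```

Run print_ufw_rules to review the commands first, then run the ones that fit your firewall policy.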
How can I verify that the program has started successfully?
- Run the following command in PowerShell (Windows) or a terminal (Linux). You should have 2 Docker containers running at all times:
docker ps
- In case there are no containers, or only 1 container, running after the docker run -d ... command from the website:
- Stop the platform (see Stopping the Platform above for the command) and restart it with the command from the website.
- If this still doesn't work, reach out to our Discord community for help: https://discord.com/invite/kqFzFK7fg2
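The container-count rule above can be turned into a quick check. This is a sketch under the guide's statement that exactly 2 containers should be running. The function name platform_status and its messages are hypothetical.

```shell
# Hypothetical helper: takes the number of running containers and prints
# what to do next. Returns non-zero unless exactly 2 are running.
platform_status() {
    count="$1"
    if [ "$count" -eq 2 ]; then
        echo "OK: 2 containers running."
    elif [ "$count" -lt 2 ]; then
        echo "RESTART: only $count container(s) running."
        return 1
    else
        echo "CHECK: $count containers running; look for duplicates."
        return 1
    fi
}

# Usage (requires Docker):
#   platform_status "$(docker ps -q | wc -l | tr -d ' ')"
```

A RESTART result corresponds to the stop-and-restart step above, and a CHECK result suggests looking for a duplicated io-worker-vc container.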
reset_drivers_and_docker.sh:
Create a new file called "reset_drivers_and_docker.sh" and paste in the code snippet below.
#!/bin/bash
# Stop all running Docker containers
echo "Stopping all running Docker containers..."
docker stop $(docker ps -a -q)
# Remove all Docker containers
echo "Removing all Docker containers..."
docker rm $(docker ps -a -q)
# Remove all Docker images
echo "Removing all Docker images..."
docker rmi $(docker images -q)
# Uninstall Docker Engine, CLI, and Containerd
echo "Uninstalling Docker..."
sudo apt-get purge -y docker-engine docker docker.io docker-ce docker-ce-cli containerd containerd.io
# Remove Docker's storage volumes
echo "Removing Docker storage volumes..."
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
# Remove Docker group
sudo groupdel docker
# Remove Docker's configuration files
echo "Removing Docker configuration files..."
sudo rm -rf /etc/docker
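# Remove any leftover Docker files
# WARNING: this deletes every file and directory whose name contains "docker",
# including this script itself. Review before running.
sudo find / -name '*docker*' -exec rm -rf {} \;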
# Uninstall NVIDIA Docker
echo "Uninstalling NVIDIA Docker..."
sudo apt-get purge -y nvidia-docker
# Uninstall NVIDIA drivers
echo "Uninstalling NVIDIA drivers..."
sudo apt-get purge -y '*nvidia*'
# Remove any remaining NVIDIA directories
sudo rm -rf /usr/local/nvidia/
# Update the package lists
echo "Updating package lists..."
sudo apt-get update
# Autoremove any orphaned packages
echo "Removing unused packages and cleaning up..."
sudo apt-get autoremove -y
sudo apt-get autoclean
# Rebuild the kernel module dependencies
echo "Rebuilding kernel module dependencies..."
sudo depmod
# Inform the user that a reboot is required
echo "Uninstallation complete. Please reboot your system."