New GPU server setup (in the making 👷🏼)
Prerequisites:
- NVIDIA drivers installed
- OS compatible with Docker (check the docs)
Docker installation
First things first, install docker:
# Uninstall conflicting packages
sudo apt remove $(dpkg --get-selections docker.io docker-compose docker-compose-v2 docker-doc podman-docker containerd runc | cut -f1)
# Setup docker's apt repository
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
# Install the docker packages
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
After installation, verify that docker is running:
sudo systemctl status docker
If not, start it manually:
sudo systemctl start docker
Verify the installation by running the hello-world image:
sudo docker run hello-world
Docker GPU
To use the GPU inside Docker containers, we need to install the NVIDIA Container Toolkit.
Install the prerequisites for the instructions below:
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
ca-certificates \
curl \
gnupg2
Configure the production repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
Install the NVIDIA Container Toolkit packages:
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Finally, configure Docker to use the NVIDIA runtime and restart it:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
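To confirm that containers can actually see the GPU, a quick check is to run nvidia-smi in a throwaway container. The sketch below assumes that the `ubuntu` image can be pulled and that Docker has been configured for the NVIDIA runtime (for example with `nvidia-ctk runtime configure --runtime=docker`). The function name `gpu_smoke_test` and the `DOCKER` override variable are illustrative and not part of the existing setup.

```shell
# Command used to invoke Docker (assumption: sudo is required).
# Override with DOCKER=docker if your user may call docker directly.
DOCKER="${DOCKER:-sudo docker}"

# Run nvidia-smi in a temporary container with all GPUs exposed.
# A table listing the host GPUs means the toolkit is working.
gpu_smoke_test() {
    # $DOCKER is intentionally unquoted so "sudo docker" splits into two words.
    $DOCKER run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
}
```

Call `gpu_smoke_test` once Docker has been restarted. An error about an unknown runtime usually means the runtime configuration step has not been applied yet.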
Copying Needed Files
On a new server we need to copy a list of files from a server that already has them:
- Docker and next free port scripts
- pyenv.tar (pyenv image)
- GPU monitor and summary scripts
- User creation script (opkssh)
Docker and next free port scripts
You can copy the entire /usr/local/bin folder, making sure that only the scripts you need end up on the new server. Usually all the scripts in /usr/local/bin on peak or apex are the ones to transfer, so you can copy from there. Here's a list of the files to copy:
docker-bash.sh docker-ps-all.sh docker-run-pyenv_ffmpeg.sh next-free-port.tcl
Dockerfile_ffmpeg docker-ps.sh docker-run-pyenv.sh opkssh
docker-logs.sh docker-rm.sh docker-start.sh
docker-ls.sh docker-run-pyenv.bak docker-stop.sh
And the way to do it:
scp -r /usr/local/bin/ <user>@<server>.ti.bfh.ch:~
Once copied, move them to their respective location. Be careful not to replace the existing bin directory on the new server:
# On new server
cd ~/bin
sudo mv docker-*.sh <other_script> <etc..> /usr/local/bin/
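After moving the files, it can help to check that nothing from the list above was left behind. The following is a minimal sketch: the file names are taken from the list above, while the `EXPECTED_SCRIPTS` variable and the `missing_scripts` function are hypothetical helpers introduced for illustration.

```shell
# File names from the list of files to copy.
EXPECTED_SCRIPTS="docker-bash.sh docker-ps-all.sh docker-run-pyenv_ffmpeg.sh next-free-port.tcl
Dockerfile_ffmpeg docker-ps.sh docker-run-pyenv.sh opkssh
docker-logs.sh docker-rm.sh docker-start.sh
docker-ls.sh docker-run-pyenv.bak docker-stop.sh"

# Print every expected file that is missing from the given directory
# (default: /usr/local/bin). Returns 0 if nothing is missing, 1 otherwise.
missing_scripts() {
    local dir="${1:-/usr/local/bin}" rc=0 name
    for name in $EXPECTED_SCRIPTS; do
        if [ ! -e "$dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `missing_scripts` without arguments checks /usr/local/bin; running `missing_scripts ~/bin` checks the copied folder before anything is moved.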
Pyenv.tar (pyenv image)
This pyenv image is essential to get things running for the first time. We obtain it by saving the existing image on one of the active servers to a tar archive:
sudo docker save -o pyenv.tar local:pyenv
Then copy it to the new server with scp, the same way as before:
scp /path/to/pyenv.tar <user>@<server>.ti.bfh.ch:~
On the new server, recreate the image from the archive:
sudo docker load -i /path/to/pyenv.tar
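Since the image archive can be large, comparing checksums on both servers is a cheap way to catch a truncated transfer before loading it. This is a sketch that assumes `sha256sum` (GNU coreutils) is available on both machines; the helper names `file_sha256` and `verify_sha256` are made up for this example.

```shell
# Print only the SHA-256 hash of a file (without the file name).
file_sha256() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed only if the hash of the file matches the expected value.
verify_sha256() {
    local file="$1" expected="$2"
    [ "$(file_sha256 "$file")" = "$expected" ]
}
```

On the source server, run `file_sha256 pyenv.tar` and note the hash. On the new server, run `verify_sha256 ~/pyenv.tar <hash>` and only continue with `docker load` if it succeeds.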
GPU monitor and summary scripts
These scripts, gpu-monitor.sh and gpu-summary.sh, are located in /usr/local/sbin/. Copy them with scp the same way as before:
scp /usr/local/sbin/gpu-*.sh <user>@<server>.ti.bfh.ch:~
Then move them to /usr/local/sbin/ on the new server.
User creation script
This script creates users with their home directories and adds their entries to the auth_id file (opkssh). Copy it the same way as before:
scp -r /usr/local/sbin/user_creation.sh <user>@<server>.ti.bfh.ch:~
Dependencies and Additional configs
Some scripts need additional packages to work correctly. Install them with:
# TODO: when reviewing this how-to, verify the gawk dependency.
sudo apt update
sudo apt install jq
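To see quickly which tools are still missing on a new server, a small helper like the one below can be used. It is a sketch only: `missing_commands` is a hypothetical name, and which commands to pass is up to you (jq comes from this guide, gawk is still unverified, and tclsh is merely a guess based on next-free-port.tcl).

```shell
# Print each given command that is not found in PATH.
# Returns 0 if all commands exist, 1 if at least one is missing.
missing_commands() {
    local rc=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_commands jq gawk tclsh` prints the names that still need to be installed and prints nothing when everything is present.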
The scripts copied earlier are usually already executable. If not, make them so:
sudo chmod +x <file>
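Instead of checking each file by hand, the executable bit can be fixed for several files at once. The sketch below uses a hypothetical helper `ensure_executable` and an optional `SUDO` variable; both are illustrative names, not part of the existing scripts.

```shell
# Add the executable bit to every given regular file that lacks it
# and print the files that were changed.
# Set SUDO=sudo when the files are owned by root.
ensure_executable() {
    local f
    for f in "$@"; do
        if [ -f "$f" ] && [ ! -x "$f" ]; then
            # ${SUDO:-} expands to nothing unless SUDO is set.
            ${SUDO:-} chmod +x "$f" && echo "$f"
        fi
    done
}
```

A possible call is `SUDO=sudo ensure_executable /usr/local/bin/docker-*.sh /usr/local/sbin/gpu-*.sh`. Files that are already executable are left untouched and not printed.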
gpu-monitor.sh needs to run continuously in the background. Start it, then run gpu-summary.sh:
sudo nohup /usr/local/sbin/gpu-monitor.sh &
sudo bash /usr/local/sbin/gpu-summary.sh
MOTD
To set up a MOTD, create a motd file in /etc, using the following as a template:
______ __
/ / ___| ___ _ ____ _____ _ _\ \
/ /\___ \ / _ \ '__\ \ / / _ \ '__\ \
\ \ ___) | __/ | \ V / __/ | / /
\_\____/ \___|_| \_/ \___|_| /_/
┌──────────────────────────────────────────────────────┐
│ WELCOME: Please read this message-of-the-day to get  │
│ started!                                             │
├──────────────────────────────────────────────────────┤
│ Getting started with Docker on <server name>         │
│                                                      │
│ On the first login create your shared folder, in     │
│ which you can put your code and data:                │
│ 'mkdir ~/dworkspace'                                 │
│ 'chmod -R 777 ~/dworkspace'                          │
│                                                      │
│ To see all available commands, please type:          │
│ 'sudo -l'                                            │
│ or visit:                                            │
│ https://infra.pages.ti.bfh.ch/mlmp/src/running_code/ │
│                                                      │
│ To instantiate a docker image:                       │
│ 'sudo docker-run-pyenv.sh'                           │
│                                                      │
│ To see your docker instance UUID:                    │
│ 'sudo docker-start.sh'                               │
│                                                      │
│ To list the docker instances:                        │
│ 'sudo docker-ps.sh'                                  │
│                                                      │
│ To (re)activate the docker instance:                 │
│ 'sudo docker-start.sh UUID'                          │
│                                                      │
│ To stop the docker instance:                         │
│ 'sudo docker-stop.sh UUID'                           │
│                                                      │
│ To remove the docker instance:                       │
│ 'sudo docker-rm.sh UUID'                             │
│                                                      │
│ To connect to a running docker instance:             │
│ 'sudo docker-bash.sh UUID'                           │
│                                                      │
│ In case of issues:                                   │
│ ask in the MLMP Teams channel 'General'              │
├──────────────────────────────────────────────────────┤
│ NOTE: Cooperative GPU/CPU usage                      │
│ Check the Support Channel and Wiki on MS Teams for   │
│ more Information and help: Team MLMP                 │
│                                                      │
│ Check also the public calendar mlmp.ti@bfh.ch        │
└──────────────────────────────────────────────────────┘
│ Scroll up for the getting started guide.             │
To create the server name in ASCII art, use: https://www.asciiart.eu/text-to-ascii-art
Allow users to run the docker scripts with sudo by editing the sudoers file:
sudo visudo
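The MOTD tells users to call the docker scripts through sudo, so the sudoers configuration has to allow exactly those scripts. The sketch below only prints candidate rules. The group name `mlmp-users`, the NOPASSWD choice, the drop-in file name, and the helper `render_sudoers` are all assumptions made for illustration; the actual rules should be taken from apex or peak.

```shell
# Scripts that the MOTD advertises to users (expected in /usr/local/bin).
USER_SCRIPTS="docker-run-pyenv.sh docker-start.sh docker-ps.sh docker-stop.sh docker-rm.sh docker-bash.sh"

# Print one sudoers rule per script for the given group.
# NOPASSWD is an assumption; remove it if users should type their password.
render_sudoers() {
    local group="$1" script
    for script in $USER_SCRIPTS; do
        printf '%%%s ALL=(root) NOPASSWD: /usr/local/bin/%s\n' "$group" "$script"
    done
}

# Possible usage (hypothetical group and file name):
#   render_sudoers mlmp-users > /tmp/docker-scripts
#   sudo visudo -cf /tmp/docker-scripts \
#     && sudo install -m 0440 /tmp/docker-scripts /etc/sudoers.d/docker-scripts
```

Checking the generated file with `visudo -cf` before installing it avoids locking yourself out of sudo with a syntax error. Listing each script explicitly, rather than using a wildcard, keeps the set of allowed commands easy to audit.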
Set up opkssh. For this, please refer to the opkssh-setup section.
Take the config of apex or peak as reference.
Create a test user and test the docker scripts. Follow the docker step-by-step guide and check that you can use the GPU from inside docker.