But it is like a game of Jenga: trying to update or exchange one brick of the system often results in the complete collapse of the stack. This creates a situation where "never change a running system" seems a better option than taking the risk of updating and being locked out of a working system for hours or days.
The challenge increases if multiple users share the system and/or different frameworks need to be installed. For example, Tensorflow 1.15 relies on CUDA 10.0, but Pytorch 1.4 is available for CUDA 10.1 or CUDA 9.2. What a mess.
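The conflict described above can be made concrete with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the function name are assumptions made for this example, and the version data is limited to the two cases mentioned in the text.

```python
# CUDA versions each framework build is available for, taken from the text above.
# The dictionary layout and function name are illustrative, not part of any tool.
SUPPORTED_CUDA = {
    ("Tensorflow", "1.15"): {"10.0"},
    ("Pytorch", "1.4"): {"10.1", "9.2"},
}


def common_cuda_versions(frameworks):
    """Return the CUDA versions that every requested framework build supports."""
    version_sets = [SUPPORTED_CUDA[framework] for framework in frameworks]
    return set.intersection(*version_sets) if version_sets else set()


wanted = [("Tensorflow", "1.15"), ("Pytorch", "1.4")]
shared = common_cuda_versions(wanted)
if shared:
    print("A single host installation works with CUDA", sorted(shared))
else:
    print("No single CUDA version satisfies both frameworks")
```

With the two frameworks from the example, the intersection is empty, so one host-wide CUDA installation cannot serve both.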
A working GPU deep learning setup relies on the following:
- OS version
- NVIDIA driver version
- CUDA version
- cuDNN version
- Additional CUDA libraries and their version
- Python version
- Python packages and their required system libraries
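To see which versions of these components are present on a given machine, a small report script can help. The following is a minimal sketch, not part of any AIME tooling: the helper names are hypothetical, it assumes `nvidia-smi` and `nvcc` are on the PATH when a driver or CUDA toolkit is installed, and it does not try to detect cuDNN or other CUDA libraries.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def run_tool(command):
    """Return the stdout of a command, or None if the tool is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def package_version(name):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_stack(packages=("torch", "tensorflow")):
    """Collect the versions of the stack components that can be queried."""
    driver = run_tool(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"])
    nvcc = run_tool(["nvcc", "--version"])
    # nvcc prints several lines; keep the one that names the release.
    cuda = next((line.strip() for line in (nvcc or "").splitlines() if "release" in line), None)
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        # nvidia-smi prints one line per GPU; the first one is enough here.
        "nvidia_driver": driver.splitlines()[0] if driver else None,
        "cuda_toolkit": cuda,
    }
    for name in packages:
        report[f"package:{name}"] = package_version(name)
    return report


for component, version in describe_stack().items():
    print(f"{component}: {version or 'not found'}")
```

Components that cannot be queried are reported as "not found" instead of raising an error, so the script also runs on a machine without a GPU.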
There are some approaches which try to help with this problem:
Conda is the more advanced version of Python's venv. It helps with setting up a virtual environment for your Python packages and also takes into account the required system libraries (apt packages). To some extent, Conda provides deep learning framework versions built for different CUDA versions.
Pros:
- more powerful than Python venv
- nice for switching among different Python package setups for different projects which rely on compatible deep learning framework versions
- well supported by PyTorch
Cons:
- not well supported by all major deep learning frameworks
- stuck when different system libraries or driver versions are needed
Virtual machines are the all-round solution for multi-user and multi-version abstraction problems: set up a virtual machine for each user, deep learning framework or project.
Pros:
- safest option to separate different users and framework setups
- dedicated resource management possible
Cons:
- only available for GPUs that have virtual machine driver support (Tesla, Quadro)
- very resource intensive
- some performance loss
- expensive to maintain
The next generation of virtualization is containers, as provided by Docker. Only what is required is virtualized, and as little as possible. All required resources are bundled in so-called containers, which can have a completely different version stack installed than the host system. Interfaces allow the container to interact directly with the host system.
Pros:
- leaner than complete virtual machines
- Docker containers available for most deep learning frameworks
- works on all NVIDIA GPUs
- bare metal performance
- very flexible in use and configuration
Cons:
- hard to get started, as there are few conventions on how to use Docker 'correctly'
- can get messy and confusing without conventions
- no built-in multi-user features
AIME machine learning container management system
Easily install, run and manage Docker containers for the most common deep learning frameworks.
- Set up and run a specific version of Tensorflow, Pytorch or Mxnet with one simple command
- Run different versions of machine learning frameworks and required libraries in parallel
- manages required libraries (CUDA, cuDNN, cuBLAS, etc.) in containers, without compromising the host installation
- Clear separation of user code and framework installation, test your code with a different framework version in minutes
- multi session: open and run many shell sessions on a single container simultaneously
- multi user: separate container space for each user
- multi GPU: allocate GPUs per user, container or session
- Runs with the same performance as a bare metal installation
- Repository of all major deep learning framework versions as containers
So how does it work?
The essential commands take you through the complete tasks of creating, opening, starting/stopping and deleting your own machine learning containers.
Create a machine learning container
mlc-create container_name framework version [-w=workspace]
Create a new machine learning container
Available frameworks and versions:
Pytorch: 2.1.0, 2.0.1, 2.0.0, 1.13.1, 1.13.0, 1.12.1-jlab, 1.12.1, 1.12.0, 1.11.0, 1.10.2-aime, 1.10.0, 1.9.0, 1.8.0, 1.7.1, 1.7.0, 1.7.0-nvidia
Tensorflow: 2.14.0, 2.13.1, 2.13.0, 2.12.0, 2.11.0, 2.10.1, 2.10.0, 2.9.2-jlab, 2.9.0, 2.8.0, 2.7.0, 2.6.1, 2.5.0, 2.4.1, 2.4.0, 2.3.1-nvidia, 1.15.4-nvidia
For example, to create a container named 'my-container' with Tensorflow 1.15.0 and the user's home directory mounted as workspace, use:
> mlc-create my-container Tensorflow 1.15.0 -w=/home/admin
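When containers are created from a script, it can help to assemble the call from its parts. The sketch below follows the synopsis `mlc-create container_name framework version [-w=workspace]` given above. The Python function names and the `dry_run` parameter are hypothetical, and by default nothing is executed.

```python
import shutil
import subprocess


def mlc_create_command(container_name, framework, version, workspace=None):
    """Build the argument list for mlc-create following the synopsis above."""
    command = ["mlc-create", container_name, framework, version]
    if workspace is not None:
        # The workspace is optional and passed as -w=<directory>.
        command.append(f"-w={workspace}")
    return command


def create_container(container_name, framework, version, workspace=None, dry_run=True):
    """Return the command; run it only if dry_run is False and mlc-create exists."""
    command = mlc_create_command(container_name, framework, version, workspace)
    if not dry_run and shutil.which("mlc-create") is not None:
        subprocess.run(command, check=True)
    return command


command = create_container("my-container", "Tensorflow", "1.15.0", workspace="/home/admin")
print(" ".join(command))
```

Keeping `dry_run=True` as the default makes it possible to review the generated command line before anything is created.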
Open a machine learning container
To open the created machine learning container "my-container", use:
> mlc-open my-container
[my-container] starting container
[my-container] opening shell to container
________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/
You are running this container as user with ID 1000 and group 1000, which should map to the ID and group for your user on the Docker host. Great!
[my-container] admin@aime01:/workspace$
The container runs with the access rights of the user. To use privileged rights within the container, for example to install packages with 'apt', use 'sudo'. By default no password is needed for sudo; to change this behaviour, set a password with 'passwd'.
Multiple instances of a container can be opened with mlc-open. Each instance runs in its own process.
To exit an opened shell to the container, type 'exit' on the command line. Exiting the last open shell automatically stops the container.
List available machine learning containers
will list all available containers for the current user
will output for example:
Available ml-containers are:
CONTAINER          FRAMEWORK                   STATUS
[torch-vid2vid]    Pytorch-1.2.0               Up 2 days
[tf1.15.0]         Tensorflow-1.15.0           Up 8 minutes
[mx-container]     Mxnet-1.5.0                 Exited (137) 1 day ago
[tf1-nvidia]       Tensorflow-1.14.0_nvidia    Exited (137) 1 week ago
[tf1.13.2]         Tensorflow-1.13.2           Exited (137) 2 weeks ago
[torch1.3]         Pytorch-1.3.0               Exited (137) 3 weeks ago
[tf2-gpt2]         Tensorflow-2.0.0            Exited (137) 7 hours ago
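A listing like the one above is easy to process in a script, for example to find out which containers are currently running. The following sketch assumes that each container is printed on its own line in the form `[name] framework status`, as the example output suggests. The parser and its names are hypothetical and not part of the AIME tools.

```python
import re

# One row per container: [name], framework, then the free-text status.
LISTING_ROW = re.compile(r"^\[(?P<name>[^\]]+)\]\s+(?P<framework>\S+)\s+(?P<status>.+?)\s*$")


def parse_container_listing(text):
    """Turn the listing text into a list of dicts, skipping header lines."""
    rows = []
    for line in text.splitlines():
        match = LISTING_ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


sample_listing = """\
Available ml-containers are:
CONTAINER          FRAMEWORK            STATUS
[torch-vid2vid]    Pytorch-1.2.0        Up 2 days
[tf1.15.0]         Tensorflow-1.15.0    Up 8 minutes
[mx-container]     Mxnet-1.5.0          Exited (137) 1 day ago
"""

containers = parse_container_listing(sample_listing)
running = [c["name"] for c in containers if c["status"].startswith("Up")]
print(running)  # prints ['torch-vid2vid', 'tf1.15.0']
```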
List active machine learning containers
Show all currently running ml-containers and their CPU and memory usage:
> mlc-stats
Running ml-containers are:
CONTAINER          CPU %     MEM USAGE / LIMIT
[torch-vid2vid]    4.93%     8.516GiB / 63.36GiB
[tf1.15.0]         7.26%     9.242GiB / 63.36GiB
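To judge how much room is left on the host, the per-container figures from this output can be summed up. This is a minimal sketch under two assumptions: every row looks like the example above, and memory is always reported in GiB (rows in other units are ignored). All names in the code are hypothetical.

```python
import re

# [name], CPU percentage, then "used GiB / limit GiB".
STATS_ROW = re.compile(
    r"^\[(?P<name>[^\]]+)\]\s+(?P<cpu>[\d.]+)%\s+"
    r"(?P<used>[\d.]+)GiB\s*/\s*(?P<limit>[\d.]+)GiB\s*$"
)


def parse_stats(text):
    """Parse stats rows into dicts with numeric CPU and memory values."""
    rows = []
    for line in text.splitlines():
        match = STATS_ROW.match(line.strip())
        if match:
            rows.append({
                "name": match["name"],
                "cpu_percent": float(match["cpu"]),
                "used_gib": float(match["used"]),
                "limit_gib": float(match["limit"]),
            })
    return rows


sample_stats = """\
Running ml-containers are:
CONTAINER          CPU %     MEM USAGE / LIMIT
[torch-vid2vid]    4.93%     8.516GiB / 63.36GiB
[tf1.15.0]         7.26%     9.242GiB / 63.36GiB
"""

stats = parse_stats(sample_stats)
total_used = sum(row["used_gib"] for row in stats)
# Both containers report the same limit here, which is taken as the host memory.
host_limit = max(row["limit_gib"] for row in stats)
print(f"{total_used:.3f} GiB of {host_limit:.2f} GiB in use ({total_used / host_limit:.1%})")
```

For the two containers in the example this adds up to 8.516 + 9.242 = 17.758 GiB of the 63.36 GiB limit.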
Start machine learning containers
to explicitly start a container
mlc-start is a way to start the container to run installed background processes, like an installed web server, on the container without the need to open an interactive shell to it.
For opening a shell to the container just use 'mlc-open', which will automatically start the container if the container is not already running.
Stop machine learning containers
mlc-stop container_name [-Y]
to explicitly stop a container.
mlc-stop on a container is comparable to shutting down a computer: all active processes and open shells to the container will be terminated.
To force a stop on a container use:
mlc-stop my-container -Y
Remove/Delete a machine learning container
to remove the container.
Warning: the container will be deleted unrecoverably; only data stored in the /workspace directory will be kept. Only use this to clean up containers which are no longer needed.
Update ML Containers
to update the container management system to the latest version.
The container system and container repo will be updated to the latest version. Run this command to check if new framework versions are available. On most systems privileged access (sudo password) is required to do so.
With these basic but powerful commands it is possible to create, open and manage your deep learning containers.
You can install apt and Python packages within the container without compromising the host system. An additional venv to manage your Python packages is possible but not really necessary. Just create a new container to experiment with a new setup.
Open multiple instances of your containers and run your deep learning sessions. Share data and source between the container and host through the mounted workspace directory.
If you are working on a workstation, edit and manage your data and sources with your favourite editor running on the desktop of the host system and test your changes without pushing and pulling.
AIME machines come pre-installed with the AIME machine learning container management system. Just log in and the above commands are available to get started in seconds.
AIME ML containers is also available as an open source project on GitHub.