Background
A while ago, I used CoCalc to complete the Machine Learning in Python course labs with my teammates. Its real-time collaboration is fantastic. However, its modified Jupyter kernel sometimes acts strangely, which pushed me to look for alternatives.
CoLab is very easy to use, and it stores the .ipynb file in Google Drive, so we can easily share it with friends and sync it across devices; the mechanism is very similar to Google Docs. It provides a decent CPU quota, 12 GB of RAM, and a sufficient GPU quota for a small project. Since it is totally free, it is an excellent product for data science beginners!
As our neural networks grew more complex, however, CoLab's free resources were no longer enough.
Fortunately, Dr. Larson approved my access request to SMU ManeFrame II, an HPC cluster that includes 36 accelerator nodes, each with an NVIDIA P100 GPU and 256 GB of DDR4-2400 RAM. The hardware is blazing fast for compute-heavy work. It is based on CentOS 7.x and uses Slurm to manage user workloads.
Although we could simply port-forward the plain HPC Jupyter server to our local machine, it would be even better if we could use the HPC together with CoLab. Thankfully, the Google CoLab team added a local runtime feature, and we can use the jupyter_http_over_ws plugin they published to make it work.
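For reference, installing and enabling the extension looks roughly like this (a minimal sketch, assuming pip is available in the environment that will run the notebook server; we set such an environment up in the walkthrough below):
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws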
Walkthrough
I will use SMU ManeFrame II as the example, but this procedure should work on other HPC clusters that use Slurm. You can also skip the HPC part and use it on an AWS GPU EC2 instance, etc.
Conda Environment
First of all, we need to create a Conda virtual environment so we can easily manage library versions and install packages. On ManeFrame II, the system GPU conda environment lives at /hpc/applications/anaconda/3/envs/tensorflow_gpu. Therefore we can simply clone this environment into our home directory and use the -n parameter to name it tf; you can change the name to whatever you like.
conda create -n tf --clone="/hpc/applications/anaconda/3/envs/tensorflow_gpu"
After setting up the conda environment, you can update Python to the latest version with:
conda update -n tf python
You can update TensorFlow or Keras in the same way; see the example after the next command. And if you want to install a new package like OpenCV, use:
conda install -y -c conda-forge -n tf opencv
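For example, to bump TensorFlow and Keras in the tf environment (the exact package names here, such as tensorflow-gpu versus tensorflow, are an assumption and depend on how the cloned environment was built):
conda update -n tf tensorflow-gpu keras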
Since Jupyter is running on the server side, you may want to use a password instead of a token for convenience:
jupyter notebook --generate-config
jupyter notebook password
The two commands above will create the Jupyter folder if necessary, generate a notebook configuration file, then prompt you for a password and record its hash in jupyter_notebook_config.json.
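If you want to confirm the hashed password was recorded, you can inspect that file (the default ~/.jupyter location is an assumption based on standard Jupyter behavior):
cat ~/.jupyter/jupyter_notebook_config.json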
Request GPU Node
Then we can write an sbatch file to let Slurm request a node for us:
#!/bin/bash
# Job name, output file (%j expands to the job ID), GPU partition,
# one GPU, 250 GB of RAM, and exclusive use of the node
#SBATCH -J jupyter
#SBATCH -o tf_jupyter_%j.out
#SBATCH -p gpgpu-1
#SBATCH --gres=gpu:1
#SBATCH --mem=250G
#SBATCH --exclusive

# Load the TensorFlow module and activate our cloned conda environment
module purge
module load tensorflow
source activate tf

login_node="login05.m2.smu.edu"
# Avoid Jupyter runtime-directory issues on the compute node
unset XDG_RUNTIME_DIR
port=60099

# Start Jupyter on the compute node, allowing requests from CoLab
jupyter notebook --no-browser --NotebookApp.allow_origin='https://colab.research.google.com' --port=${port} &
# Give Jupyter time to start, then reverse-tunnel its port back to the login node
sleep 30s
ssh -N -R ${port}:localhost:${port} ${login_node} &
# Keep the job alive while both background processes run
wait
I saved the sbatch file at ~/script/jupyter_gpgpu.sbatch; you can change the login_node or port number to suit your needs. After that, submit the request with the sbatch command:
sbatch ~/script/jupyter_gpgpu.sbatch
The request then sits in the Slurm queue waiting for resources; you can check its status with the following command:
squeue -u `whoami`
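Once the job shows as running, you can also follow the Jupyter server log that the #SBATCH -o line writes, for example (the job ID here is just a placeholder for your actual one):
tail -f tf_jupyter_1234567.out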
Connect to CoLab
After Slurm has allocated the requested resources, we can simply use SSH to forward the remote ManeFrame II Jupyter port to our local machine. In this case, since we set the login node to login05 and the port to 60099, we can use this command on our local machine:
ssh -L 60099:localhost:60099 your_username@login05.m2.smu.edu
Then open http://localhost:60099 in your web browser and log in with the password you set, so that Jupyter writes its credential cookie to your browser. After that, open CoLab, click the down-arrow icon on the CONNECT button, choose to connect to a local runtime, and enter 60099 in the prompted modal:
Click the CONNECT button; if everything goes well, it will connect to the HPC runtime and you are all set.
CoLab will work like a charm, and you can enjoy your training!