Create a VM with Jupyter and DataHub

Step-by-step guide to get a Virtual Machine for Jupyter and DataHub in your cloud provider

Overview

This tutorial describes how to start a Virtual Machine in the EGI Federation that runs a browser-accessible Jupyter server with DataHub spaces mounted. This setup can be used in the EGI Federation or at any other provider that synchronises images from AppDB but is not part of the federation.

Requirements

This tutorial assumes you have:

  • A valid EGI account: learn how to create one in Check-in.
  • Access to a cloud provider where the Jupyter DataHub VM is available. Alternatively, this VM can be run on your computer using a virtualisation tool like VirtualBox.

Create a VM with Jupyter and DataHub

Step 1: Start your VM

Start your VM on your cloud provider or virtualisation tool. You can check the tutorial on how to start a VM to learn how to do this on EGI’s Federated Cloud infrastructure.
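
If your provider exposes an OpenStack interface, launching the instance from the command line may look like the sketch below; the image, flavour, network and key names are only illustrative and depend on your site:

# illustrative only: adjust image, flavour, network and key names to your site
$ openstack server create \
    --image "EGI Jupyter DataHub" \
    --flavor m1.medium \
    --key-name my-ssh-key \
    --network my-network \
    jupyter-datahub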

This VM does not contain any default credentials; in order to access it you need an SSH key. Check this FAQ entry for more information. If you are starting this VM on VirtualBox, you will need to pass some valid context for cloud-init; see here how to prepare it.
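
For VirtualBox, one common way to provide that context is a NoCloud seed ISO built with cloud-localds (from the cloud-image-utils package). The snippet below is only a minimal sketch that injects an SSH public key for the default ubuntu user; adapt it to your own key and hostname:

# on your workstation, assuming cloud-image-utils is installed
$ cat > user-data <<'EOF'
#cloud-config
users:
  - name: ubuntu
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... your-key-comment
EOF
$ cat > meta-data <<'EOF'
instance-id: jupyter-datahub-vbox
local-hostname: jupyter-datahub
EOF
$ cloud-localds seed.iso user-data meta-data
# attach seed.iso to the VM as a CD/DVD drive before the first boot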

The VM image is ready to listen on port 22 for SSH access and port 80 for the notebooks interface. Make sure you have those ports open in your security groups, otherwise you will not be able to reach the Jupyter notebooks.
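
For example, on an OpenStack-based provider the rules could be added with the CLI as sketched below (the security group name depends on your project; port 443 is only needed once you enable HTTPS in the next step):

# open SSH, HTTP and HTTPS in the default security group
$ openstack security group rule create --protocol tcp --dst-port 22 default
$ openstack security group rule create --protocol tcp --dst-port 80 default
$ openstack security group rule create --protocol tcp --dst-port 443 default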

Once your instance is ready, assign it a public IP so you can reach it from your computer.
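
On OpenStack this is typically done with a floating IP; a sketch, where the external network name (here public) varies per site:

# allocate a floating IP and attach it to the instance
$ openstack floating ip create public
$ openstack server add floating ip jupyter-datahub <allocated floating IP>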

Step 2: Get a hostname and certificate for your VM

Your VM is now reachable, but it serves plain HTTP, which is not secure. If you try to connect to it with your browser, you will get a warning like the one shown in the screenshot below:

[Screenshot: browser warning about the insecure HTTP connection]

You must enable HTTPS to encrypt requests and responses, making access to your VM secure.

First, you need a valid DNS name for your VM. You can use the FedCloud Dynamic DNS to register one; see the Dynamic DNS docs for more information on the service. Once you have your name ready, point it to your VM’s public IP.
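
If you do not update the record through the web portal, it can usually be refreshed with a simple HTTP request. The sketch below assumes the nsupdate-style API described in the Dynamic DNS docs, using the hostname and update secret obtained when registering the name; check those docs for the exact endpoint:

# point the DNS record at your VM's public IP
# (hostname and secret come from the Dynamic DNS portal)
$ curl -u "<your registered name>:<update secret>" \
    "https://nsupdate.fedcloud.eu/nic/update?myip=<your VM's IP>"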

Second, you need to get a certificate to enable HTTPS. The VM already has certbot installed; you just need to run it with the hostname you have allocated and your email address as shown here:

# log into your VM
$ ssh ubuntu@<your VM's IP>
# now request the certificate
$ sudo certbot --nginx -d <your registered name> -m <your email>
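
Let’s Encrypt certificates expire after 90 days. certbot normally configures automatic renewal; you can check that renewal will work with a dry run:

$ sudo certbot renew --dry-run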

Finally, open your browser and go to https://<your registered name>/ to check that Jupyter is up. Follow the next steps to get the credentials to access the service.
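
You can also verify the certificate from the command line, for example:

# should return response headers without certificate warnings
$ curl -I https://<your registered name>/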

Step 3: Get your token for the Jupyter server

Your VM spawns a Jupyter notebook server when it starts. The server runs as an unprivileged user named jovyan, with the software installed using micromamba. It uses a randomly generated token for authentication, which you can obtain by logging into the VM and becoming jovyan:

$ ssh ubuntu@<your VM's IP>
# become jovyan user and activate the default environment
$ sudo su - jovyan
$ micromamba activate
$ jupyter server list --jsonlist | jq -r .[].token
<your token>

Step 4: Start your notebooks

Now point your browser to https://<your registered name>/ and you will be able to enter the token to get started with Jupyter.

You can install additional packages with micromamba from a terminal started in Jupyter or via SSH. For example, to install TensorFlow:

$ micromamba activate
$ micromamba install -c conda-forge tensorflow
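
If you prefer to keep extra packages out of the default environment, you can create a dedicated environment and expose it to Jupyter as a kernel; a sketch, where the environment name ml is just an example:

# create a separate environment and register it as a Jupyter kernel
$ micromamba create -n ml -c conda-forge python=3.11 ipykernel tensorflow
$ micromamba activate ml
$ python -m ipykernel install --user --name ml --display-name "Python (ml)"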

Step 5: Mount DataHub spaces

Log into EGI’s DataHub and create a token for mounting your data in the VM.

You will also need the IP or hostname of the Oneprovider closest to you for the spaces you are interested in accessing. This information is easily obtainable via DataHub’s web interface.

Go to your Jupyter session in your browser and edit the mount.sh file in your home directory. Set the ONECLIENT_ACCESS_TOKEN and ONECLIENT_PROVIDER_HOST values to get access to DataHub:

[Screenshot: editing mount.sh in the Jupyter interface]
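
After editing, the relevant lines should look similar to the sketch below (the values are placeholders and the exact layout of mount.sh may differ):

# excerpt of mount.sh after editing; values are placeholders
export ONECLIENT_ACCESS_TOKEN="<your DataHub access token>"
export ONECLIENT_PROVIDER_HOST="<your closest Oneprovider host>"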

Open a terminal from the launcher screen and execute the mount.sh script:

[Screenshot: running the mount.sh script from a Jupyter terminal]

You should now see a datahub folder with all your spaces available directly from your Jupyter interface.
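
If you prefer the command line, a similar mount can be performed directly with oneclient; this is only a sketch, assuming the oneclient command is available in the image, and the mount point is illustrative:

# mount your DataHub spaces manually
$ mkdir -p ~/datahub
$ oneclient -H <your closest Oneprovider host> -t <your DataHub access token> ~/datahub
# unmount when done
$ oneclient -u ~/datahub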