Neither one nor Many
Software engineering blog about my projects, geometry, visualization and music.
Not really a "proper" blog post, but maybe useful to someone. These are basically the notes I took for myself after a few attempts to deploy it properly. It's easy to get some steps wrong if you are not very familiar with Cloudera Manager (CM).
I am not going into detail with every caveat, but https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_install.html is what I used as a basis for installing Cloudera Data Science Workbench (CDSW) on top of CM.
In my case I am using OpenStack nodes with the following layout:
Headnode: master.cm.cluster
Computenodes: node00[1-6].cm.cluster (10.141.0.0/24)
All with 8 GiB memory, 4 vCPUs and 80 GiB disks.
Note that you have to assign two additional volumes of 500 GiB to one of the computenodes.
I created two 200 GiB volumes instead (you only get a warning that this is probably not enough, but for demo purposes it is fine) and assigned them to node002.cm.cluster, where I will deploy CDSW.
Versions used
Supported OS currently is CentOS 7.2 (apparently CDSW does not support 7.3)
Make sure port 7180 will be reachable on your Headnode so you can configure Cloudera Manager later.
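A quick way to verify this from your workstation is a small reachability check. This is only a minimal sketch: the function name port_open is my own, and it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened
# within 3 seconds. Relies on bash's built-in /dev/tcp pseudo-device.
port_open() {
    local host="$1" port="$2"
    # Host and port are passed as positional arguments ($0 and $1 of the
    # inner bash) so they are never interpreted as shell code.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, port_open master.cm.cluster 7180 && echo "CM reachable" would print the message only when the Cloudera Manager port accepts connections.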
Easiest for me is to just copy & paste the commands I prepared for the Docker image. Run them on the Headnode first, then do the same on the Computenodes.
If you use something like Bright Cluster Manager you probably just do the computenode stuff once in the desired computenode image and start an image transfer.
I install the free version and I don't install Spark, because version 1.6.0 is not supported by CDSW. You can uninstall it and replace it later if you already deployed it.
I am not going to write out every detail, but you need to install:
Please note that in the case of Spark 2 you also need to install the CSD (Custom Service Definition), or you won't find "Spark 2" when you do "Add new Service" inside CM. This is described in the above links.
It is important that the Spark2 gateway role is assigned to the computenode where you intend to run CDSW.
Now that you hopefully have spark2-submit working on your node002.cm.cluster node, you can start to deploy CDSW there.
Write down where the additional 200 GiB volumes are, for example /dev/vdc and /dev/vdd.
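To find the volumes, something like lsblk -o NAME,SIZE,TYPE,MOUNTPOINT lists all block devices with their sizes. Before putting the paths into the config it does not hurt to double check them. A minimal sketch, where check_block_devices is a name I made up and /dev/vdc and /dev/vdd are just the example paths from my setup:

```shell
# Hypothetical helper: verify that every given path is an existing block device.
# Returns non-zero if at least one path is missing or not a block device.
check_block_devices() {
    local dev rc=0
    for dev in "$@"; do
        if [ -b "$dev" ]; then
            echo "ok: $dev"
        else
            echo "missing or not a block device: $dev" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

On node002 this would be called as check_block_devices /dev/vdc /dev/vdd.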
Simply run yum localinstall /path/to/cloudera-data-science-workbench-1.0.1-1.el7.centos.x86_64.rpm and follow the instructions.
It should print out details on how to proceed (and you will probably have to click OK on a lot of warnings).
The instructions include editing cdsw.conf, which should be easy:
[root@node002 ~]# grep = /etc/cdsw/config/cdsw.conf
DOMAIN="node002.cm.cluster"
MASTER_IP="10.141.0.2"
DOCKER_BLOCK_DEVICES="/dev/vdd"
APPLICATION_BLOCK_DEVICE="/dev/vdc"
TLS_ENABLE=""
# You must also set TLS_ENABLE=true above to enable and enforce termination.
TLS_CERT=""
TLS_KEY=""
HTTP_PROXY=""
HTTPS_PROXY=""
ALL_PROXY=""
NO_PROXY=""
KUBE_TOKEN=05023c.3a168925213858dc
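Since a typo here only shows up later during cdsw init, a small sanity check on the file can save a round trip. The sketch below is not part of CDSW. check_cdsw_conf is a hypothetical helper, and the list of keys that must be non-empty is my assumption, based only on the values I filled in above.

```shell
# Hypothetical helper: report keys in cdsw.conf that are unset or empty.
# The key list is an assumption taken from the example config, not an
# official list of required settings.
check_cdsw_conf() {
    local conf="$1" key value rc=0
    for key in DOMAIN MASTER_IP DOCKER_BLOCK_DEVICES APPLICATION_BLOCK_DEVICE; do
        # Take the last uncommented assignment of the key and strip the quotes.
        value=$(sed -n "s/^${key}=//p" "$conf" | tail -n 1 | tr -d '"')
        if [ -z "$value" ]; then
            echo "not set: $key" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Usage would be check_cdsw_conf /etc/cdsw/config/cdsw.conf, which prints nothing and returns 0 when all four values are filled in.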
If a previous cdsw init failed, just run cdsw reset first.
One of the steps that failed during cdsw init in my case was Docker. I edited the Docker service file (use systemctl status docker to find its location) and removed the --storage-opt parameters from the ExecStart line in /etc/systemd/system/docker.service, which originally looked like this:
ExecStart=/usr/bin/docker daemon \
--log-driver=journald \
-s devicemapper \
--storage-opt dm.basesize=100G \
--storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool \
--storage-opt dm.use_deferred_removal=true \
--iptables=false
Leaving:
ExecStart=/usr/bin/docker daemon \
--log-driver=journald \
-s devicemapper \
--iptables=false
Make this change before running cdsw init.
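If you would rather script that edit than do it by hand, the following is a rough sketch. strip_storage_opts is a name I made up, and it assumes each --storage-opt argument sits on its own line, as in the unit file shown above. It keeps a .bak copy next to the file.

```shell
# Hypothetical helper: delete all "--storage-opt ..." lines from a unit file.
# Assumes one --storage-opt argument per line; a backup is written to <file>.bak.
strip_storage_opts() {
    local unit="$1"
    cp -p "$unit" "$unit.bak" || return 1
    # Rewrite the unit file from the backup, dropping the --storage-opt lines.
    sed '/^[[:space:]]*--storage-opt[[:space:]]/d' "$unit.bak" > "$unit"
}
```

On node002 that would be strip_storage_opts /etc/systemd/system/docker.service, followed by systemctl daemon-reload so systemd picks up the changed unit file.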
Please write down the kubeadm join command that is suggested, and have fun waiting while running watch cdsw status.
I forgot to write down the details for adding further nodes, but I think you can just yum localinstall the CDSW rpm on them first to get all dependencies like Docker and Kubernetes. Then there is the following command you can execute:
kubeadm join --token=05023c.3a168925213858dc 10.141.0.2
This was printed earlier when we ran cdsw init on node002.
In my case I did not properly configure DNS etc., as it's just a test setup. So I needed to add the more important hostnames to my hosts file:
10.141.0.2 node002 node002.cm.cluster livelog.node002.cm.cluster consoles.node002.cm.cluster
(For console access a random hostname is used, so you may have to add more entries in case you stumble upon a hostname that does not resolve.)
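Because you may end up adding such entries more than once, here is a minimal sketch of a helper that only appends a line when it is not there yet. add_hosts_entry is a hypothetical name, and you need to be root to write to /etc/hosts.

```shell
# Hypothetical helper: append an entry to a hosts file unless the exact
# same line is already present.
add_hosts_entry() {
    local hosts_file="$1" entry="$2"
    # -x matches the whole line, -F treats the entry as a literal string.
    if ! grep -qxF -- "$entry" "$hosts_file"; then
        printf '%s\n' "$entry" >> "$hosts_file"
    fi
}
```

For the line above: add_hosts_entry /etc/hosts "10.141.0.2 node002 node002.cm.cluster livelog.node002.cm.cluster consoles.node002.cm.cluster". Running it a second time leaves the file unchanged.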
If your computenodes are not directly reachable, you might need to start a tunnel to the cluster, like this:
function tun {
sshuttle -r root@$1 10.141.0.0/16
}
tun <IP>