Build your own computing cluster on ChameleonCloud

Social scientists also run heavy computational jobs. In one of my projects, I need to analyze the psychological state of a few billion Telegram messages. ChameleonCloud provides hosts with up to 64 cores (or “threads”, sometimes “workers”, yes these terms are confusing but CS folks to blame). But even with parallel computing on the best server, the job will run for years, and I need this project for tenure.

Build a computing cluster on ChameleonCloud

So I started to build my own computing cluster on ChameleonCloud with dask and joblib. Here is a good reference to start. Things have been changed but the reference is not updated. Here is what I did.

Step 1: Start a few hosts on CC and install necessary Python packages (ref).

conda install dask distributed -c conda-forge
conda install bokeh
pip install paramiko joblib

Step 2: Specify a host as “scheduler,” and we will call all the other hosts “worker hosts,” and make sure the scheduler can communicate with all the other hosts through SSH.

  • See if you have a key on the scheduler host, ls .ssh/ should show a pair of files named id_rsa and id_rsa.pub; otherwise, create a key with ssh-keygen (you can just push enter when question prompts).
  • Copy the content in id_rsa.pub to all the hosts’ ~/.ssh/authorized_keys files (including scheduler host).
  • For non-CS people, the above two steps seem strange. An intuitive explanation is, id_rsa is your key, and id_rsa.pub is the lock. You put the lock on all of the worker hosts. Therefore, your scheduler can use the key to open all the worker hosts, that is, communicating through SSH.

Step 3: Start the distributed computing system using dask-ssh. It’s reference is useful but with caveats which I will detail later. As you can see I can also specify the scheduler host as a worker host.

dask-ssh \
    --scheduler 10.52.2.120 \
    --ssh-username root \
    --ssh-private-key ~/.ssh/id_rsa \
    --nprocs 48 \
    --nthreads 1 \
    --memory-limit 0 \ # No limit on memory.
    10.52.2.120 10.52.3.108 10.52.3.62 10.52.2.164

You should be good now, if not, very common. Just spend some time Googling. If everything looks good, you can have a computing cluster like this:

Input:

from dask.distributed import Client
import joblib
client = Client("10.52.2.120:8786")
client

Output:

Client
Scheduler: tcp://10.52.2.120:8786
Dashboard: http://10.52.2.120:8787/status

Cluster
Workers: 288
Cores: 288
Memory: 1.16 TB

Pretty fancy. Now you can easily parallel your job on this small cluster, for example:

with joblib.parallel_backend('dask'):
    %time _ = Parallel(n_jobs=-1)(delayed(a_function)(param=param) for param in param_list)

You can monitor the system via: http://scheduler_ip:8787/status. A few things were quite confusing to me and spent me a lot of time. Probably you will need to handle them as well.

How many --nprocs and --nthreads should I start?

The terms (e.g., workers, process, jobs, threads, cores …) are chaos. Sometimes different names mean the same thing, but sometimes the same name indicates different things, and you don’t you are in which situation. Here are my lessons:

  • --nprocs is the number of jobs you want to start per host. You can set it as the number of a host’s cores (e.g., 48).
  • --nthreads is the number of threads per --nprocs. I explicitly specify it as 1.
  • For example, I have 6 worker hosts, each has 48 processes, this gives me 288 “workers”–this number will be the number for n_jobs in joblib if you specify n_jobs=-1 (i.e., use all available workers/processes/threads/whatever the name is).

When to use computing cluster?

You should use a cluster for heavy-duty jobs. Scheduling tasks adds additional time (one millisecond/task). What I really mean here is, each job is a heavy-duty job, and you have many heavy-duty jobs; NOT, you have a billion easy jobs, and the entire job is heavy-duty.

But what if you have a billion easy jobs? My task is one like this: Analyzing one Telegram message is easy, but I have a few billion messages. In this case, you can pack a million jobs as one task, and this task will be a heavy-duty job. Now we are talking about efficiency, and this page is a good one.

Through the status dashboard, this is a job before chunking. Each job takes about less than a second.

Worker status before chunking. Each horizontal line is a worker, yellow color indicates the CPU is working on the job.

This is a job after chunking. Each job takes about 10 seconds.

Worker status after chunking. Each horizontal line is a worker, yellow color indicates the CPU is working on the job.

Eventually, you may want to optimize the code to have the yellow lines as continuous and as long as possible.

Leave a Reply