Social scientists also run heavy computational jobs. In one of my projects, I need to analyze the psychological state of a few billion Telegram messages. ChameleonCloud provides hosts with up to 64 cores (or "threads", sometimes "workers"; yes, these terms are confusing, and CS folks are to blame). But even with parallel computing on the best single server, the job would run for years, and I need this project for tenure.
Build a computing cluster on ChameleonCloud
So I started to build my own computing cluster on ChameleonCloud with `dask` and `joblib`. Here is a good reference to start. Things have changed since, but the reference has not been updated. Here is what I did.
Step 1: Start a few hosts on CC and install necessary Python packages (ref).
conda install dask distributed -c conda-forge
conda install bokeh
pip install paramiko joblib
Step 2: Specify one host as the "scheduler" (we will call all the other hosts "worker hosts"), and make sure the scheduler can communicate with all the other hosts through SSH.
- See if you already have a key on the scheduler host: `ls .ssh/` should show a pair of files named `id_rsa` and `id_rsa.pub`; otherwise, create a key with `ssh-keygen` (you can just press Enter at every prompt).
- Copy the content of `id_rsa.pub` into every host's `~/.ssh/authorized_keys` file (including the scheduler host itself).
- For non-CS people, the above two steps may seem strange. An intuitive explanation: `id_rsa` is your key, and `id_rsa.pub` is the lock. You put the lock on all of the worker hosts, so your scheduler can use the key to open all the worker hosts, that is, communicate through SSH. (A quick connectivity check is sketched right after this list.)
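Since `paramiko` is already installed, here is a minimal sketch (my addition, not from the original reference) to verify from the scheduler that every host is reachable over SSH before launching anything. The IPs and the `root` username mirror the `dask-ssh` command below; adjust them to your own hosts.

# Quick SSH connectivity check, run on the scheduler host.
# The host list, username, and key path are assumptions based on this post's setup.
import os
import paramiko

hosts = ["10.52.2.120", "10.52.3.108", "10.52.3.62", "10.52.2.164"]
key_path = os.path.expanduser("~/.ssh/id_rsa")

for host in hosts:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="root", key_filename=key_path)
    _, stdout, _ = ssh.exec_command("hostname")
    print(host, "->", stdout.read().decode().strip())
    ssh.close()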
Step 3: Start the distributed computing system using `dask-ssh`. Its reference is useful but comes with caveats, which I will detail later. As you can see below, I also specify the scheduler host as a worker host.
# --memory-limit 0 means no limit on memory.
dask-ssh \
    --scheduler 10.52.2.120 \
    --ssh-username root \
    --ssh-private-key ~/.ssh/id_rsa \
    --nprocs 48 \
    --nthreads 1 \
    --memory-limit 0 \
    10.52.2.120 10.52.3.108 10.52.3.62 10.52.2.164
You should be good now; if not, that is very common, just spend some time Googling. If everything looks good, you will have a computing cluster like this:
Input:
from dask.distributed import Client
import joblib

client = Client("10.52.2.120:8786")
client
Output:
Client
    Scheduler: tcp://10.52.2.120:8786
    Dashboard: http://10.52.2.120:8787/status
Cluster
    Workers: 288
    Cores: 288
    Memory: 1.16 TB
Pretty fancy. Now you can easily parallelize your job on this small cluster, for example:
from joblib import Parallel, delayed

with joblib.parallel_backend('dask'):
    %time _ = Parallel(n_jobs=-1)(delayed(a_function)(param=param) for param in param_list)
You can monitor the system via http://scheduler_ip:8787/status. A few things were quite confusing to me and cost me a lot of time. You will probably need to handle them as well.
How many `--nprocs` and `--nthreads` should I start?
The terms (e.g., workers, processes, jobs, threads, cores …) are chaotic. Sometimes different names mean the same thing, sometimes the same name indicates different things, and you don't know which situation you are in. Here are my lessons:
- `--nprocs` is the number of worker processes you want to start per host. You can set it to the number of a host's cores (e.g., 48).
- `--nthreads` is the number of threads per process. I explicitly set it to 1.
- For example, I have 6 worker hosts, each with 48 processes; this gives me 288 "workers". This number is what `n_jobs` in `joblib` will use if you specify `n_jobs=-1` (i.e., use all available workers/processes/threads/whatever the name is). A quick way to check the count is sketched right after this list.
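Here is a small sketch (my addition) to confirm how many workers the scheduler actually sees, which is the number `n_jobs=-1` will fan out to. The scheduler address is assumed to match the Client example above.

# Count the workers registered with the scheduler.
from dask.distributed import Client

client = Client("10.52.2.120:8786")
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)  # with 6 hosts x 48 processes each, this prints 288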
When to use a computing cluster?
You should use a cluster for heavy-duty jobs. Scheduling tasks adds overhead (about one millisecond per task). What I really mean is: each job is a heavy-duty job, and you have many heavy-duty jobs; NOT that you have a billion easy jobs and only the entire workload is heavy-duty.
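To see why this matters, here is a back-of-the-envelope calculation (my own arithmetic, using the ~1 ms/task figure above):

# Rough estimate of pure scheduling overhead for a billion unchunked tasks.
n_tasks = 1_000_000_000
overhead_per_task = 0.001  # seconds, roughly 1 ms per scheduled task
print(n_tasks * overhead_per_task / 86400)  # ~11.6 days spent on scheduling alone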
But what if you have a billion easy jobs? My task is exactly like this: analyzing one Telegram message is easy, but I have a few billion messages. In this case, you can pack a million easy jobs into one task, and that task becomes a heavy-duty job (a sketch of this chunking follows). Now we are talking about efficiency, and this page is a good one.
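A minimal sketch of the chunking idea, assuming a hypothetical `analyze_message` function and an in-memory `messages` list (both illustrative names, not from the original post):

from joblib import Parallel, delayed, parallel_backend

# analyze_message and messages are placeholders for your own function and data.
def analyze_chunk(chunk):
    # One task now covers a million easy jobs, so the ~1 ms scheduling
    # overhead is paid once per chunk instead of once per message.
    return [analyze_message(m) for m in chunk]

chunk_size = 1_000_000
chunks = [messages[i:i + chunk_size] for i in range(0, len(messages), chunk_size)]

with parallel_backend('dask'):
    results = Parallel(n_jobs=-1)(delayed(analyze_chunk)(c) for c in chunks)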
Through the status dashboard, this is a job before chunking. Each task takes less than a second.
This is a job after chunking. Each task takes about 10 seconds.
Eventually, you may want to optimize the code so that the yellow lines are as continuous and as long as possible.