Parallel computing using IPython: Important notes for naive scholars without CS background

Analyzing networks and complex systems demands a lot of computing resources. Although the learning curve is steep, parallel computing is worth the effort; otherwise, even more time is spent waiting. Moreover, in exploratory academic research, we do not know the next step until the current analysis is finished. So the research life cycle becomes: hypothesis -> operationalization -> a LONG TIME coding and debugging -> a LONG TIME waiting for results -> new hypothesis.

With IPython Notebook, parallel computing is easy to set up; however, as I have said before, we do not really understand even the simplest programming technique until we can use it ourselves. I would not be writing this post if I had not had to wait a week for a single result. Playing with parallel computing in IPython is easy, but using it for real jobs is not. Scholars in the social sciences may be less skilled in programming; we are not trained for it. It took me a great deal of effort to make some progress, which CS people may find laughable.

When IPython Notebook (now named Jupyter Notebook) is used for parallel computing, Jupyter starts several remote engines besides the local one we are working in. These remote engines start out blank: the variables and functions defined, and the modules imported, on the local engine are not available on the remote ones. Specifically, the puzzle for me was (yes, was!): how to operate variables, functions, and modules on the remote engines.

  • Operate Variables on Remote Engines

Locally defined variables cannot be used directly on remote engines; if we try, we get an error message like “Local variables *** is not defined.” Defining variables on remote engines, however, is easy.

# Import modules and define DirectView of engines.
import ipyparallel as ipp
c = ipp.Client()
print(c.ids)
dview = c[:]

I started 4 engines in the Notebook (under the IPython Clusters tab), so the code above prints:

[0, 1, 2, 3]

Define variable a = 5 on all four engines.

dview['a'] = 5
print(dview['a'])

This gives the result:

[5, 5, 5, 5]

Change the value of a variable on a specific engine, say engine#0:

c[0]['a'] = c[0]['a'] + 5
print(c[0]['a'])
print(dview['a'])

This gives the result:

10
[10, 5, 5, 5]

We see that the value of a on engine#0 is now 10 (line 1), different from the other three engines (line 2). Easy task.
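
As far as I understand the ipyparallel API, the dictionary-style access used above is shorthand for the view's push and pull methods. Here is a minimal sketch that wraps them in two small helpers. The names set_on_engines and get_from_engines are my own (hypothetical), and view is assumed to be a DirectView such as the dview created earlier.

```python
def set_on_engines(view, **variables):
    """Define each keyword argument as a variable on every engine in the view."""
    # push() takes a dict mapping names to values; block=True waits until done.
    view.push(variables, block=True)


def get_from_engines(view, name):
    """Return the value of `name` from every engine, one list item per engine."""
    return view.pull(name, block=True)


# Usage in the notebook, assuming dview = c[:] from above:
# set_on_engines(dview, a=5, b='text')
# get_from_engines(dview, 'a')
```

Pushing a whole dictionary at once is handy when several variables need to reach the engines before a long job starts.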

  • Operate Functions on Remote Engines

“Remote functions are just like normal functions, but when they are called, they execute on one or more engines, rather than locally.” (Quote from Official Docs) IPython provides two decorators: @dview.remote(block=Boolean) and @dview.parallel(block=Boolean).

dview.execute('import numpy')
@dview.remote(block=True)
def RandomNumNP(i):
    return numpy.random.rand(i)
f = RandomNumNP(1)
print(f)

This gives the result:

[array([ 0.27501591]), array([ 0.23903476]), array([ 0.83110713]), array([ 0.26537037])]

If we change @dview.remote(block=True) to @dview.remote(block=False), the call returns immediately and the result will be (also refer to this post):

<AsyncResult: RandomNumNP>

Obtain the result:

print(f.ready())
print(f.get())

The result will be:

[array([ 0.52439151]), array([ 0.9508556]), array([ 0.89470218]), array([ 0.17898525])]
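
For a job that may run for a long time, it can help to wait only for a limited time instead of blocking the notebook. The sketch below is one way to do that with the wait, ready, and get methods of the AsyncResult object. The helper name fetch_result and the default timeout are my own assumptions, not part of ipyparallel.

```python
def fetch_result(async_result, timeout=10):
    """Return the results if they arrive within `timeout` seconds, else None."""
    # wait() blocks for at most `timeout` seconds.
    async_result.wait(timeout)
    if async_result.ready():
        # get() returns one result per engine, in engine order.
        return async_result.get()
    return None


# Usage in the notebook, with f being the AsyncResult from above:
# values = fetch_result(f, timeout=5)
# if values is None: the engines are still busy, try again later.
```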

@dview.remote does not provide a map function with which we can scatter a list of arguments across the engines. In the case above, the argument “1” is passed to all four engines. @dview.parallel, however, comes with a map function.

dview.execute('import numpy')
@dview.parallel(block=True)
def RandomNumNP(i):
    return numpy.random.rand(i)
RandomNumNP.map([1, 2, 3, 4, 5])

This gives the result:

[array([ 0.10194246]),
 array([ 0.96114591,  0.12823448]),
 array([ 0.41361251,  0.0376582 ,  0.04891329]),
 array([ 0.18110394,  0.0382513 ,  0.01177635,  0.00672713]),
 array([ 0.22964665,  0.58272549,  0.55383465,  0.57941184,  0.73482765])]

The block=Boolean argument works the same way as in .remote.
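
If the decorator feels heavy, the view itself also has a map_sync method that applies a function to each element of a sequence on the engines and returns the results in input order. Here is a minimal sketch; the helper name square_all is hypothetical, and view is again assumed to be a DirectView like dview.

```python
def square_all(view, numbers):
    """Square each number on the engines and collect the results locally."""
    # map_sync() blocks until all engines have finished their share.
    return view.map_sync(lambda x: x ** 2, numbers)


# Usage in the notebook, assuming dview = c[:] from above:
# square_all(dview, range(8))
```

The function passed to map_sync here uses no modules, so nothing needs to be imported on the engines first.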

  • Import Modules on Remote Engines

There are two ways to import modules on remote engines. We have already seen one in the example above (.execute); the other is a with statement.

with dview.sync_imports():
    import pandas as pd
    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np

The output is:

importing pandas on engine(s)
importing matplotlib on engine(s)
importing matplotlib.pyplot on engine(s)
importing numpy on engine(s)

Attention should be paid here: the import ... as ... form does not work in this scenario, because the alias is not defined on the engines. If we use np instead of numpy when defining the function, we get the following error message (but the alias does work with .execute!):

<ipython-input-5-c78f990d2e74> in RandomNumNP()
NameError: global name 'np' is not defined
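
One workaround I find convenient is to put the aliased import inside the function body, so that it runs on the engine each time the function is called. The sketch below uses the standard library's random module instead of numpy only to keep the example self-contained; the function name random_numbers and the alias rnd are my own choices.

```python
def random_numbers(n):
    """Return n random floats; the aliased import happens where the function runs."""
    import random as rnd  # executed on the engine, so the alias exists there
    return [rnd.random() for _ in range(n)]


# Usage in the notebook, assuming dview = c[:] from above:
# dview.apply_sync(random_numbers, 3)
```

The same pattern should work with import numpy as np inside the function, provided numpy is installed where the engines run.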

With these three basic skills, i.e., operation of variables, functions, and modules on remote engines, we will be able to handle more complex tasks using IPython Notebook and parallel computing.