
dask-mpi not working #30

Closed
@andersy005

Description


I have this script dask_mpi_test.py:

from dask_mpi import initialize
initialize()

from distributed import Client 
import dask
client = Client()

df = dask.datasets.timeseries()

print(df.groupby(['time', 'name']).mean().compute())
print(client)
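
For context, my understanding is that dask_mpi.initialize() uses MPI rank 0 for the scheduler, keeps running this client script on rank 1, and turns the remaining ranks into workers. As a sanity check that mpirun and mpi4py agree on the rank layout, here is a small sketch I put together for this report (not something from my original run):

# Sketch only: dask-mpi relies on mpi4py's COMM_WORLD, so mpirun and
# mpi4py need to agree on the number of ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
# With `mpirun -np 4` this should print ranks 0-3 with size 4; dask-mpi
# should then use rank 0 for the scheduler, rank 1 for the client
# script, and ranks 2-3 for workers.
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")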

When I try to run this script with:

mpirun -np 4 python dask_mpi_test.py

I get these errors:
~/workdir $ mpirun -np 4 python dask_mpi_test.py
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://xxxxxx:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.worker - INFO -       Start worker at: tcp://xxxxx:44712
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
  warnings.warn('\n' + msg)
distributed.worker - INFO -       Start worker at: tcp://xxxxxx:36782
distributed.worker - INFO -          Listening to:               tcp://:44712
distributed.worker - INFO -              bokeh at:                      :8789
distributed.worker - INFO -          Listening to:               tcp://:36782
distributed.worker - INFO - Waiting to connect to: tcp://xxxxxx:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -              bokeh at:                     :43876
distributed.worker - INFO - Waiting to connect to: tcp://xxxxx:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.76 GB
distributed.worker - INFO -                Memory:                    3.76 GB
distributed.worker - INFO -       Local Directory: /gpfs/fs1/scratch/abanihi/worker-uoz0vtci
distributed.worker - INFO -       Local Directory: /gpfs/fs1/scratch/abanihi/worker-bb0u_737
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
Traceback (most recent call last):
  File "dask_mpi_test.py", line 6, in <module>
    client = Client()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 640, in __init__
    self.start(timeout=timeout)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 763, in start
    sync(self.loop, self._start, **kwargs)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/utils.py", line 321, in sync
    six.reraise(*error[0])
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/utils.py", line 306, in f
    result[0] = yield future
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 851, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 892, in _ensure_connected
    self._update_scheduler_info())
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.Timeout
$ conda list dask
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
dask                      1.2.0                      py_0    conda-forge
dask-core                 1.2.0                      py_0    conda-forge
dask-jobqueue             0.4.1+28.g5826abe          pypi_0    pypi
dask-labextension         0.3.3                    pypi_0    pypi
dask-mpi                  1.0.2                    py37_0    conda-forge

$ conda list tornado
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
tornado                   5.1.1           py37h14c3975_1000    conda-forge
$ conda list distributed
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
distributed               1.27.0                   py37_0    conda-forge

Is anyone aware of anything that may have changed in a recent update to dask or distributed that would cause dask-mpi to break?
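
In case it helps narrow this down, here is a variant I can try (a sketch only; it assumes initialize() accepts interface/nthreads as in dask-mpi 1.0.x, that Client() takes a timeout in seconds, and 'ib0' is just a placeholder for this system's high-speed interface):

# Diagnostic sketch: raise the connect timeout and pin the network
# interface, in case the default 10s timeout or the interface selection
# is what is failing here.
from dask_mpi import initialize
initialize(interface='ib0', nthreads=1)  # 'ib0' is a placeholder

import dask
from distributed import Client

# Allow more time for scheduler comms before giving up.
dask.config.set({'distributed.comm.timeouts.connect': '60s'})
client = Client(timeout=60)

df = dask.datasets.timeseries()
print(df.groupby(['time', 'name']).mean().compute())
print(client)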

Cc'ing @kmpaul
