mpirun: add cli option to turn off gpu support #51
Signed-off-by: Jeff Squyres <[email protected]>
Homebrew has broken something and I cannot figure out how to fix it. Signed-off-by: Ralph Castain <[email protected]> (cherry picked from commit 2ac45f3)
Fix the issues with the MacOS builds so that they work again in Github Action environments. Signed-off-by: Jeff Squyres <[email protected]>
Fix macos tests
It has been reported (and confirmed) that building against one version of PMIx and then running with another version will cause PRRTE to segfault. This isn't a universal rule. For example, one can switch v5.0 and master without a problem. However, switching v5.0 and v4.2 is a definite segfault. The root cause of the problem is a change in the layout of the base pmix_object_t definition. This renders all PMIx objects binary incompatible when crossing between the v5 and v4 (and below) series. Changing the v5 definition back to match v4 is an overly complex task. The changes were required to accommodate the new shared memory support that was introduced in v5. So instead, we check the runtime version of PMIx against the build version. If the runtime version is incompatible with the build version, then we print an explanatory error message and error out. Signed-off-by: Ralph Castain <[email protected]> bot:notacherrypick dd Signed-off-by: Ralph Castain <[email protected]>
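The check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PRRTE code: the function names and the `LAYOUT_CHANGE_MAJOR` constant are hypothetical, and it models only the single boundary mentioned above (v5 and later versus v4 and below).

```python
# Assumption: the pmix_object_t layout changed at the start of the v5 series,
# so versions on opposite sides of that boundary are binary incompatible.
LAYOUT_CHANGE_MAJOR = 5


def major_of(version: str) -> int:
    """Extract the major number from a dotted version string such as '5.0.3'."""
    return int(version.split(".")[0])


def versions_compatible(build_version: str, runtime_version: str) -> bool:
    """True when both versions sit on the same side of the layout change."""
    build_is_new = major_of(build_version) >= LAYOUT_CHANGE_MAJOR
    runtime_is_new = major_of(runtime_version) >= LAYOUT_CHANGE_MAJOR
    return build_is_new == runtime_is_new


def check_runtime_version(build_version: str, runtime_version: str) -> None:
    """Error out with an explanation instead of segfaulting later."""
    if not versions_compatible(build_version, runtime_version):
        raise RuntimeError(
            f"PMIx runtime version {runtime_version} is binary incompatible "
            f"with build version {build_version}"
        )
```

Failing early with a message is the design choice the commit describes; the alternative of restoring the v4 object layout is ruled out above as overly complex.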
In some recent Slurm versions, the Slurm runtime is inserting custom arguments to the PRRTE launcher's `srun` cmd line without the user being aware of it. In many cases, this may not be a problem - but in some cases (where the user or the system admin needs/wants particular cmd line arguments used) this can cause problems as it happens silently, without the user being aware of it. Make this visible when it happens, and provide a mechanism by which the user/admin can override it. Provide a fairly long help message explaining what happened and offering advice on resolution, along with a param for disabling the warning. Add a param for overriding the "args" param if necessary, along with a caution as to possible consequences. Signed-off-by: Ralph Castain <[email protected]>
Enables build against v1.11.8 and above. Signed-off-by: Ralph Castain <[email protected]>
Add a new cmd line option that corresponds to this attribute. Add the attribute to the prun payload. When received, it will default to including in the job info for the spawned job. Add query support for it. Signed-off-by: Ralph Castain <[email protected]> (cherry picked from commit 3957789)
Changes will need to be made to Open MPI to parse the contents of the OMPI_MCA_mpi_memory_alloc_kinds environment variable to determine how to use the user supplied memory-alloc-kinds information. See section 11.4.3 of the MPI 4.1 standard. Signed-off-by: Howard Pritchard <[email protected]>
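As a rough idea of what the parsing on the Open MPI side could look like, here is a small sketch. It assumes the value is a comma-separated list of kinds, each optionally followed by colon-separated restrictors (my reading of the MPI 4.1 text); the function name and the sample value are illustrative only.

```python
import os


def parse_memory_alloc_kinds(value):
    """Split a memory-alloc-kinds string into (kind, [restrictors]) pairs."""
    kinds = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty fields such as a trailing comma
        kind, *restrictors = entry.split(":")
        kinds.append((kind, restrictors))
    return kinds


# The environment variable name comes from the commit message above;
# an unset variable is treated as "no kinds requested".
requested = parse_memory_alloc_kinds(
    os.environ.get("OMPI_MCA_mpi_memory_alloc_kinds", "")
)
```

For example, `parse_memory_alloc_kinds("system,mpi:alloc_mem")` yields one entry for `system` with no restrictors and one for `mpi` with the restrictor `alloc_mem`.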
Check the runtime version of PMIx
Provide a warning of potentially unknown Slurm params
MPI 4.1: add support for memory-alloc-kinds
If we use one cpu from an object, then we will get a NULL response if we ask for the next object of that type within the remaining cpuset since not all of the cpus in the object are still available. This problem resulted from the recent change to only use available cpus in PRRTE topologies. So instead scan across the cpus, check to see if it is inside the object of interest - if so, then we can bind to that cpu, if not then we keep searching. Signed-off-by: Ralph Castain <[email protected]>
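The repaired search can be illustrated with plain sets of cpu indices. This is only a sketch of the idea in the commit message, not the hwloc-based code; `find_bindable_cpu` and the example cpu numbers are hypothetical.

```python
def find_bindable_cpu(available_cpus, object_cpus):
    """Scan the available cpus and return the first one inside the object.

    Returns None when no available cpu belongs to the object.
    """
    for cpu in sorted(available_cpus):
        if cpu in object_cpus:
            return cpu
    return None


# A core with two hardware threads (cpus 0 and 1); cpu 0 is already in use.
core_cpus = {0, 1}
available = {1, 2, 3}

# Asking for an object that lies entirely within the available set fails,
# which mirrors the NULL response described above.
whole_object_available = core_cpus <= available

# Scanning cpu by cpu still finds a usable cpu inside the core.
chosen = find_bindable_cpu(available, core_cpus)
```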
Repair the binding algorithm
If we are trying to bind to an HWLOC object type that is not defined on a given node, then (a) if the binding policy was specified by user, then error out; and (b) if we are using a default binding policy, then simply do not bind. Signed-off-by: Ralph Castain <[email protected]>
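The two cases in this commit message reduce to a small decision function. The sketch below uses hypothetical names (`resolve_binding_target`, the string object types) and stands in for the real HWLOC queries.

```python
def resolve_binding_target(target_type, node_object_types, user_specified):
    """Decide what to bind to when the target type may be missing on a node.

    Returns the target type when the node defines it, or None to mean
    "do not bind". Raises when the user explicitly asked for a missing type.
    """
    if target_type in node_object_types:
        return target_type
    if user_specified:
        # (a) the user asked for this policy, so failing silently would be wrong
        raise ValueError(
            f"binding target '{target_type}' is not defined on this node"
        )
    # (b) default policy: simply skip binding
    return None
```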
Protect against missing HWLOC object types
Signed-off-by: Luke Robison <[email protected]>
Fix a segfault when no arguments are provided
We currently do not support the LTO optimizer as it is incompatible with our plugin component architecture. So detect it has been specified in configure and error out with an explanation. Includes suggestions from @jsquyres Signed-off-by: Ralph Castain <[email protected]>
Break the multi-loop thru loading of param files that caused us to overwrite values. Defer to the PMIx pmdl components for obtaining envars and for checking MCA param overlaps across projects. Signed-off-by: Ralph Castain <[email protected]>
Python 3.12 no longer allows escapes in regular expressions. Instead, use "r" strings. Signed-off-by: Ralph Castain <[email protected]>
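To illustrate the fix: as far as I know, Python 3.12 emits a `SyntaxWarning` (previously a `DeprecationWarning`) for unrecognized escapes such as `\d` in ordinary string literals, and a raw string avoids the problem because the backslash is kept literally. The helper name and sample input below are made up for the example.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged,
# so no invalid-escape warning is triggered at compile time.
VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def version_tuple(text):
    """Return the first dotted x.y.z version found in text, or None."""
    match = VERSION_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

The raw literal `r"\d"` is the same two-character string as `"\\d"`, so the compiled pattern is unchanged; only the way the source spells it differs.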
Protect against LTO optimizer
docs: update for Python 3.12
Use the PMIx functions to check params
Signed-off-by: Ralph Castain <[email protected]> (from upstream commit e204c73)
The formatting is messed up in places, so try and fix it. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit b4f5884)
We don't really have stale issues, so we can close anything old manually. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 2215b29)
See open-mpi/ompi#12829 for an explanation Signed-off-by: Ralph Castain <[email protected]> (from upstream commit bff20a8)
RTD is rolling out some changes. Per https://about.readthedocs.com/blog/2024/07/addons-by-default/, these are the changes we need to make. Port of open-mpi/ompi#12687 Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 584845f)
Protect against the envar version of the Slurm custom args param
We have deprecated the "socket" object in favor of "package", so we need to extend that treatment to the resource qualifier in the `--map-by ppr` directive. Detect both "socket" and the "skt" shorthand for backward-compatibility reasons. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 7f507df)
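A sketch of the backward-compatible handling, assuming the directive has the form `ppr:N:resource` with optional trailing qualifiers. The function name and warning text are hypothetical and not taken from PRRTE.

```python
import warnings

# Deprecated resource names and their replacement.
DEPRECATED_RESOURCES = {"socket": "package", "skt": "package"}


def normalize_ppr(directive):
    """Rewrite a deprecated ppr resource to its replacement, with a warning."""
    parts = directive.split(":")
    if len(parts) >= 3 and parts[0].lower() == "ppr":
        resource = parts[2].lower()
        if resource in DEPRECATED_RESOURCES:
            replacement = DEPRECATED_RESOURCES[resource]
            warnings.warn(
                f"'{parts[2]}' is deprecated in ppr directives; "
                f"use '{replacement}' instead",
                DeprecationWarning,
            )
            parts[2] = replacement
    return ":".join(parts)
```

Directives that already use `package`, or that are not `ppr` directives at all, pass through unchanged.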
Fix deprecation warnings for ppr on socket objects
We played with alternative implementations over the years, but nothing ever really stuck. So let's simplify the code by removing the framework and the associated loopbacks that were built into it (e.g., if a message cannot be sent, then loop it back into the OOB base to see if another component is available that can send it). The code badly needs reorganization as I've made no attempt to do so here. A pass to see if event steps can be eliminated would also be good - I've cleaned up a few of them, but what remains could use another pair of eyes. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 7b54c48)
We no longer support all the way to pre-StoneAge versions. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit cd8b2f7)
Not sure the compiler is correct in its complaint, but just in case, let's ensure that the variables being used are always initialized. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit d5e580a)
Fix typo that locked the target to rank=0 and instead use the rank provided by user. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit b0d461d)
It isn't perfect and it does raise precedence concerns, but let the OMPI schizo component parse the OMPI param files. Use the PMIx pmdl functions to check for relevant params to convert to PRRTE and PMIx. Restore the overlap detection for when params are set for OMPI frameworks that have a corollary to frameworks in PRRTE and PMIx. This still raises precedence questions. For now, what this will do is have values in the OMPI param files give way to corresponding values specified in PRRTE and PMIx param files. In other words, values in PMIx and PRRTE param files are written into the environment, not overwriting any previously existing envar. OMPI param files are then read, and any params that correspond to PRRTE and PMIx params are written into the environment - without overwriting any that already exist. So if you have a param specified in a PRRTE param file, and you also specify it in the OMPI param file, the PRRTE param file value will "win". User beware - someone is bound to be very unhappy. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit fcbefcc)
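The precedence rule described here amounts to "first writer wins". A minimal sketch, assuming the OMPI params have already been translated to their PRRTE/PMIx names; the function and the parameter names are hypothetical.

```python
def merge_params(existing_env, prrte_pmix_file_params, ompi_file_params):
    """Apply param sources in order without overwriting earlier values."""
    merged = dict(existing_env)  # pre-existing envars always survive
    for source in (prrte_pmix_file_params, ompi_file_params):
        for name, value in source.items():
            merged.setdefault(name, value)  # write only if not already set
    return merged


# Hypothetical parameter names, for illustration only.
env = {"PRTE_MCA_example_a": "from-env"}
prrte_file = {"PRTE_MCA_example_a": "from-prrte", "PRTE_MCA_example_b": "from-prrte"}
ompi_file = {"PRTE_MCA_example_b": "from-ompi", "PRTE_MCA_example_c": "from-ompi"}
result = merge_params(env, prrte_file, ompi_file)
```

In this example `example_a` keeps the value from the environment, `example_b` takes the PRRTE file value over the OMPI one, and `example_c` comes from the OMPI file because nothing set it earlier.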
Collapse the OOB framework
Unlock stdin target
Raise min required HWLOC version to 2.1.0
See long explanation on openpmix#2025 Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 7de93e9)
No longer any challengers to hwloc, so streamline the binding support by re-integrating it back into the odls framework. Signed-off-by: Ralph Castain <[email protected]> (from upstream commit add324c)
Remove the rtc framework
Get takes a (pmix_value_t**), so don't cast it to (void**) Signed-off-by: Ralph Castain <[email protected]> (from upstream commit 16ce9a8) Signed-off-by: George Bosilca <[email protected]>
Restore NUMA tracking
Restore parsing of OMPI param files
Protect against uninitialized var
Silence warning
@edgargabriel check wording in the rstxt file and suggest changes. I'll open a small docs-related commit on ompi as well.
Errr...few things: (a) there is no "disable gpu support" defined in PMIx - I assume you are planning something over there? (b) What are you expecting PMIx to do with this? I don't see PRRTE doing anything. Are we just setting an envar (MCA or otherwise)? (c) why would you put it here, and then include a change to the PRRTE schizo component???
That was me looking at your stuff too much. I was following the memkind example that you did. What does PMIx have to do with memkind support? It seems the same question applies there.
The idea is that one might want to build Open MPI with GPU support, but there may be a performance benefit to disabling all GPU-related paths when using the library on systems without accelerators. Signed-off-by: Howard Pritchard <[email protected]>
771fed4 to 7ef15a1
Sigh - that isn't what I said. I asked what you were expecting PMIx to do with it - pass an envar, add it to the job-level info, ...? It was a simple question. We knew the answer for memkind - just wanted to hear an answer for this one. There are several ways one can deal with these things. Setting an envar is one - another is adding it to the job-level info. The first doesn't require a PMIx key, while the second does. Envars can be a bad way to pass things, while job-level info tends to be more reliable. Sometimes, you just need to stop and think thru the various use-cases (e.g., will this be an inherited characteristic across comm-spawn? Will a tool need to determine the status of this for a given job, or include it in a launch request?) before focusing down on a solution.
Please take a look at openpmix/openpmix#3528 and openpmix#2134 to get a better understanding of what I was trying to communicate. See if that does what you were after.
wording looks good to me
FWIW: I committed the upstream changes as (a) they were of potential interest to another library group, and (b) the tools folks indicated interest in querying the state of GPU support.