[RFC] Merging RABIT into XGBoost. #5995

Closed
@trivialfis

Background

This is an RFC for merging RABIT into XGBoost. RABIT has long provided distributed training support for XGBoost and is integrated as a git submodule. Most of the tests are run in XGBoost, and the XGBoost code base is tightly coupled to RABIT. For example, serialization is built on RABIT's serializable interface, the NCCL unique ID is obtained through RABIT, and quantile merging between workers is based on RABIT's serializable allreduce handler. Because there are more mature MPI solutions for CPU, such as OpenMPI and UCX, RABIT has not received much attention beyond XGBoost. As a result, maintaining RABIT in a separate repository creates more overhead for developers than actual benefit, which is one of the reasons RABIT is rarely updated. We also plan to sunset the allreduce implementation in RABIT in the future and use the widely adopted MPI solutions listed above instead. Merging RABIT into XGBoost first will allow us to do that incrementally, with sufficient tests.

Plan

Future work

  • Rework the OpenMPI backend so that it better supports a new backend (to be decided), possibly adding NCCL as a new backend too. This way we can keep both the old and the new backends enabled for a smooth transition.
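
To make the "both old and new backends enabled" idea concrete, here is a minimal sketch of a pluggable communicator. Everything in it is hypothetical and for illustration only: `Communicator`, `SingleWorkerBackend`, `SimulatedBackend`, and `make_communicator` are not names from RABIT or XGBoost, and the "new" backend just sums in-process buffers rather than talking to real workers.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Communicator(ABC):
    """Hypothetical backend interface; not the actual RABIT or XGBoost API."""

    @abstractmethod
    def allreduce_sum(self, values: List[float]) -> List[float]:
        """Return the element-wise sum of `values` across all workers."""


class SingleWorkerBackend(Communicator):
    """Stand-in for the old backend: with one worker the sum is the input."""

    def allreduce_sum(self, values: List[float]) -> List[float]:
        return list(values)


class SimulatedBackend(Communicator):
    """Stand-in for a new backend: peers are plain in-process buffers."""

    def __init__(self, peers: List[List[float]]) -> None:
        self.peers = peers

    def allreduce_sum(self, values: List[float]) -> List[float]:
        out = list(values)
        for peer in self.peers:
            out = [a + b for a, b in zip(out, peer)]
        return out


# Registry keyed by name so that callers can switch backends via configuration.
_BACKENDS: Dict[str, Callable[..., Communicator]] = {
    "old": SingleWorkerBackend,
    "new": SimulatedBackend,
}


def make_communicator(name: str, **kwargs) -> Communicator:
    """Build the backend registered under `name` (hypothetical helper)."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    return _BACKENDS[name](**kwargs)
```

Because callers only depend on `allreduce_sum`, the same test suite could be run against either backend during the transition, for example `make_communicator("old")` versus `make_communicator("new", peers=[...])`.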

Concerns

If we replace RABIT with another MPI solution and drop single-point model recovery, would it be better not to merge it at all? That seems much cleaner, but as mentioned previously, XGBoost is tightly connected to RABIT, and every change to RABIT must be tested on XGBoost before merging. The replacement won't be trivial, and I would like to do it incrementally and carefully.
