Description
Background
This is an RFC for merging RABIT into XGBoost. RABIT has long provided distributed training support for XGBoost and is integrated as a git submodule. Most of the tests are run on XGBoost, and the XGBoost code base is tightly coupled to RABIT. For example, serialization is built on RABIT's serializable interface, the NCCL unique ID is obtained through RABIT, and quantile merging between workers is based on RABIT's serializable allreduce handler. Because there are more mature MPI solutions for CPU, such as OpenMPI and UCX, RABIT did not get much attention beyond XGBoost. In the end, maintaining RABIT in a separate repository creates more overhead for developers than actual benefit, which is one of the reasons RABIT is rarely updated. We also plan to sunset the allreduce implementation in RABIT in the future and use one of the widely adopted MPI solutions listed above. Merging RABIT into XGBoost first will allow us to achieve that incrementally with sufficient tests.
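To illustrate the kind of coupling mentioned above, here is a minimal in-process sketch of an allreduce over serialized objects: each worker serializes a local summary, the summaries are merged, and every worker receives the same merged result. This is not RABIT's API. The names `allreduce_serialized` and `merge_sorted_unique`, and the use of `pickle`, are assumptions made only for illustration, and real quantile sketches are weighted summaries rather than plain sorted lists.

```python
import pickle
from functools import reduce
from heapq import merge


def merge_sorted_unique(a, b):
    """Hypothetical reducer: merge two sorted lists, dropping duplicates."""
    out = []
    for v in merge(a, b):
        if not out or out[-1] != v:
            out.append(v)
    return out


def allreduce_serialized(local_payloads, reducer):
    """Simulate an allreduce over serialized objects in a single process.

    local_payloads holds one bytes object per (simulated) worker.
    Returns one bytes object per worker, all carrying the same merged result.
    """
    objects = [pickle.loads(p) for p in local_payloads]
    combined = reduce(reducer, objects)
    payload = pickle.dumps(combined)
    return [payload for _ in local_payloads]


# Three simulated workers, each with its own sorted list of cut candidates.
local = [[0.1, 0.5, 0.9], [0.2, 0.5], [0.05, 0.9, 1.3]]
results = allreduce_serialized([pickle.dumps(s) for s in local], merge_sorted_unique)
merged = [pickle.loads(r) for r in results]
```

After the call, every entry of `merged` holds the same list, which is the property the rest of the training code relies on regardless of which communication library performs the reduction.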
Plan
- Merge rabit into XGBoost as a git subtree in a standalone directory. ([DO NOT MERGE] Merge rabit #6001)
- Merge tests into XGBoost. (Refactor rabit tests #6096)
- Change the CMake scripts for better integration and remove the mock target. (Refactor rabit tests #6096)
- Add Winsock support so that the 3 rabit libraries (rabit_robust, rabit_base, rabit_empty) can be reduced to 1 while still supporting distributed training on Windows. (Enable building rabit on Windows #6105, Remove unused RABIT targets. #6110)
- Enable clang-tidy on RABIT. Beyond code style, clang-tidy found some real issues in RABIT's code base, so we should enable it as soon as possible. (Correct style warnings from clang-tidy for rabit. #6095, Merge logging facilities for rabit and xgboost. #6101)
- Drop single point model recovery. ([WIP] Drop single point model recovery. #6112)
Future work
- Rework the OpenMPI backend to better support a new backend (to be decided), and probably add NCCL as a new backend too. This way we can keep both the old and new backends enabled for a smooth transition.
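One way to picture keeping both backends enabled is a small backend interface with interchangeable implementations selected by name. The sketch below is hypothetical and runs in a single process: `Backend`, `SequentialBackend`, `TreeBackend`, `BACKENDS`, and `make_backend` are illustrative names, not XGBoost or RABIT classes, and the two implementations only stand in for an "old" and a "new" communication backend.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface, for illustration only."""

    @abstractmethod
    def allreduce_sum(self, per_worker):
        """Return the element-wise sum of one equal-length vector per worker."""


class SequentialBackend(Backend):
    """Stand-in for the existing backend: fold the workers in one by one."""

    def allreduce_sum(self, per_worker):
        total = list(per_worker[0])
        for vec in per_worker[1:]:
            total = [x + y for x, y in zip(total, vec)]
        return total


class TreeBackend(Backend):
    """Stand-in for a new backend: combine workers pairwise, level by level."""

    def allreduce_sum(self, per_worker):
        level = [list(v) for v in per_worker]
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level) - 1, 2):
                nxt.append([x + y for x, y in zip(level[i], level[i + 1])])
            if len(level) % 2 == 1:
                nxt.append(level[-1])  # odd worker carries over to the next level
            level = nxt
        return level[0]


# Both backends stay registered, so either can be chosen at run time.
BACKENDS = {"legacy": SequentialBackend, "new": TreeBackend}


def make_backend(name):
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None


# Integer inputs keep the comparison exact regardless of summation order.
data = [[1, 2], [3, 4], [5, 6]]
legacy_result = make_backend("legacy").allreduce_sum(data)
new_result = make_backend("new").allreduce_sum(data)
```

With both implementations behind the same interface, the existing tests can be run against each backend and the results compared, which is what makes an incremental switch practical.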
Concerns
If we are going to replace RABIT with another MPI solution and drop single point model recovery anyway, would it be better not to merge it at all? That seems much cleaner, but as mentioned above, XGBoost is tightly coupled to RABIT, and every change to RABIT must be tested on XGBoost before merging. The replacement won't be trivial, and I would like to do it incrementally and carefully.