Commit a61a079

committed
doc.

1 parent a736125 commit a61a079


doc/tutorials/external_memory.rst

Lines changed: 89 additions & 52 deletions
@@ -2,24 +2,27 @@
 Using XGBoost External Memory Version
 #####################################

-XGBoost supports loading data from external memory using builtin data parser. And
-starting from version 1.5, users can also define a custom iterator to load data in chunks.
-The feature is still experimental and not yet ready for production use. In this tutorial
-we will introduce both methods. Please note that training on data from external memory is
-not supported by ``exact`` tree method.
+When working with large datasets, training XGBoost models can be challenging as the entire
+dataset needs to be loaded into memory. This can be costly and sometimes
+infeasible. Starting from 1.5, users can define a custom iterator to load data in chunks
+for running XGBoost algorithms. External memory can be used for both training and
+prediction, but training is the primary use case and it will be our focus in this
+tutorial. For prediction and evaluation, users can iterate through the data themselves,
+while training requires the full dataset to be loaded into memory.
+
+During training, there are two different approaches to external memory support available
+in XGBoost: one for CPU-based algorithms like ``hist`` and ``approx``, and another for the
+GPU-based training algorithm. We will introduce them in the following sections.

-.. warning::
+.. note::

-  The implementation of external memory uses ``mmap`` and is not tested against system
-  errors like disconnected network devices (`SIGBUS`). In addition, Windows is not yet
-  supported.
+  Training on data from external memory is not supported by the ``exact`` tree method.

 .. note::

-  When externel memory is used, the CPU training performance is IO bounded. Meaning, the
-  training speed is almost exclusively determined by the disk IO speed. For GPU, please
-  read on and see the gradient-based sampling with external memory. During benchmark, we
-  used a NVME connected to a PCIE slot, the performance is "usable" with ``hist`` on CPU.
+  The implementation of external memory uses ``mmap`` and is not tested against system
+  errors like disconnected network devices (`SIGBUS`). In addition, Windows is not yet
+  supported.

 *************
 Data Iterator
@@ -28,8 +31,8 @@ Data Iterator
 Starting from XGBoost 1.5, users can define their own data loader using Python or C
 interface. There are some examples in the ``demo`` directory for quick start. This is a
 generalized version of text input external memory, where users no longer need to prepare a
-text file that XGBoost recognizes. To enable the feature, user need to define a data
-iterator with 2 class methods ``next`` and ``reset`` then pass it into ``DMatrix``
+text file that XGBoost recognizes. To enable the feature, users need to define a data
+iterator with 2 class methods: ``next`` and ``reset``, then pass it into the ``DMatrix``
 constructor.

 .. code-block:: python
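The body of the example falls outside this hunk. As a rough illustration of the iterator interface described above (subclassing ``xgboost.DataIter``; the in-memory batches and the ``./cache`` prefix below are made-up placeholders, not the exact demo code), it looks something like this:

.. code-block:: python

    import os

    import numpy as np
    import xgboost


    class Iterator(xgboost.DataIter):
        """Yield pre-computed (X, y) batches one at a time."""

        def __init__(self, batches):
            # ``batches`` could just as well be file paths loaded lazily inside ``next``.
            self._batches = batches
            self._it = 0
            # ``cache_prefix`` tells XGBoost where to place its on-disk cache files.
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data) -> int:
            """Feed the next batch to XGBoost; return 0 when exhausted, 1 otherwise."""
            if self._it == len(self._batches):
                return 0
            X, y = self._batches[self._it]
            input_data(data=X, label=y)
            self._it += 1
            return 1

        def reset(self) -> None:
            """Rewind so XGBoost can iterate over the data again."""
            self._it = 0


    rng = np.random.default_rng(0)
    batches = [
        (rng.normal(size=(1024, 16)), rng.integers(0, 2, size=1024)) for _ in range(4)
    ]
    Xy = xgboost.DMatrix(Iterator(batches))  # data is staged into the external-memory cache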
@@ -73,18 +76,84 @@ constructor.

     # Other tree methods including ``hist`` and ``gpu_hist`` also work, but has some caveats
     # as noted in following sections.
-    booster = xgboost.train({"tree_method": "approx"}, Xy)
+    booster = xgboost.train({"tree_method": "hist"}, Xy)
+
+
+The above snippet is a simplified version of ``demo/guide-python/external_memory.py``.
+For an example in C, please see ``demo/c-api/external-memory/``. The iterator is the
+common interface for using external memory with XGBoost; you can pass the resulting
+``DMatrix`` object to training, prediction, and evaluation.
+
+It is important to set the batch size based on the memory available. A good starting point
+is to set the batch size to 10GB per batch if you have 64GB of memory. It is *not*
+recommended to set small batch sizes like 32 samples per batch, as this can seriously hurt
+performance in gradient boosting.
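To make that sizing concrete, here is a small back-of-the-envelope sketch; the feature count and dtype are assumptions chosen only for illustration:

.. code-block:: python

    # Rough sizing: how many rows fit in a ~10GB batch, assuming dense float32 features.
    n_features = 256                   # assumed width of X
    bytes_per_row = n_features * 4     # 4 bytes per float32 value
    batch_budget = 10 * 1024 ** 3      # ~10GB per batch
    rows_per_batch = batch_budget // bytes_per_row
    print(rows_per_batch)              # ~10 million rows per batch in this example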
+
+***********
+CPU Version
+***********
+
+In the previous section, we demonstrated how to train a tree-based model using the
+``hist`` tree method on a CPU. This method involves iterating through data batches stored
+in a cache during tree construction. For optimal performance, we recommend using the
+``grow_policy=depthwise`` setting, which allows XGBoost to build an entire layer of tree
+nodes with only a few batch iterations. Conversely, using the ``lossguide`` policy
+requires XGBoost to iterate over the data set for each tree node, resulting in slower
+performance.
+
+If external memory is used, the performance of CPU training is limited by IO
+(input/output) speed. This means that the disk IO speed primarily determines the training
+speed. During benchmarking, we used an NVMe drive connected to a PCIe-4 slot; other types
+of storage can be too slow for practical usage. In addition, your system may perform
+caching to reduce the overhead of file reading.
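As a short, hedged illustration of that recommendation (reusing the iterator-backed ``Xy`` from the earlier snippet; the parameter values are placeholders rather than tuned settings):

.. code-block:: python

    # Depth-wise growth keeps the number of passes over the external-memory cache low.
    params = {
        "tree_method": "hist",
        "grow_policy": "depthwise",  # recommended with external memory
        "max_depth": 6,
    }
    booster = xgboost.train(params, Xy, num_boost_round=100)  # Xy built from the iterator above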
+
+**********************************
+GPU Version (GPU Hist tree method)
+**********************************
+
+External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set to
+``gpu_hist``). However, the algorithm used for GPU is different from the one used for
+CPU. When training on a CPU, the tree method iterates through all batches from external
+memory for each step of the tree construction algorithm. On the other hand, the GPU
+algorithm concatenates all batches into one and stores it in GPU memory. To reduce overall
+memory usage, users can utilize subsampling. The good news is that the GPU hist tree
+method supports gradient-based sampling, enabling users to set a low sampling rate without
+compromising accuracy.
+
+.. code-block:: python
+
+  param = {
+    ...
+    'subsample': 0.2,
+    'sampling_method': 'gradient_based',
+  }
+
+For more information about the sampling algorithm and its use in external memory training,
+see `this paper <https://arxiv.org/abs/2005.09148>`_.
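Putting the pieces together, an end-to-end sketch (again reusing the iterator-backed ``Xy``; the 0.2 sampling rate simply mirrors the snippet above, not a tuned recommendation) might look like:

.. code-block:: python

    params = {
        "tree_method": "gpu_hist",            # GPU-based training
        "subsample": 0.2,                     # keep a fraction of rows per tree
        "sampling_method": "gradient_based",  # sample rows by gradient magnitude
    }
    booster = xgboost.train(params, Xy, num_boost_round=100)  # Xy built from the iterator above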
+
+.. warning::
+
+  When the GPU runs out of memory during iteration on external memory, users might
+  receive a segfault instead of an OOM exception.

+*******
+Remarks
+*******

-The above snippet is a simplified version of ``demo/guide-python/external_memory.py``. For
-an example in C, please see ``demo/c-api/external-memory/``.
+When using external memory with XGBoost, data is divided into smaller chunks so that only
+a fraction of it needs to be stored in memory at any given time. It's important to note
+that this method only applies to the predictor data (``X``), while other data, like labels
+and internal runtime structures, are concatenated. This means that memory reduction is most
+effective when dealing with wide datasets where ``X`` is larger compared to other data
+like ``y``, while it has little impact on slim datasets.
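A quick, purely illustrative calculation of why the shape of ``X`` matters (the row and column counts and dtypes are assumptions chosen only to show the ratio, ignoring internal runtime structures):

.. code-block:: python

    # Wide dataset: X dominates, so chunking X removes almost all of the footprint.
    n_rows, n_cols = 10_000_000, 500
    x_bytes = n_rows * n_cols * 4         # float32 features -> ~20 GB
    y_bytes = n_rows * 4                  # float32 labels   -> ~40 MB
    print(x_bytes / (x_bytes + y_bytes))  # ~0.998 of the data is reducible

    # Slim dataset: X is small, so the always-resident parts matter relatively more.
    n_cols = 4
    x_bytes = n_rows * n_cols * 4         # ~160 MB
    print(x_bytes / (x_bytes + y_bytes))  # ~0.8 of the data is reducible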

 ****************
 Text File Inputs
 ****************

-There is no big difference between using external memory version and in-memory version.
-The only difference is the filename format.
+This is the original form of external memory support; users are encouraged to use the
+custom data iterator instead. There is no big difference between the external memory
+version of text input and the in-memory version. The only difference is the filename
+format.

 The external memory version takes in the following `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format:
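The description of the URI format itself falls outside this hunk; for orientation, loading such a file from Python looks roughly like the following (the path and cache name mirror the CLI example further down and are illustrative):

.. code-block:: python

    import xgboost

    # "#dtrain.cache" asks XGBoost to build its external-memory cache under that prefix.
    dtrain = xgboost.DMatrix("../data/agaricus.txt.train?format=libsvm#dtrain.cache")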

@@ -117,35 +186,3 @@ XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to
 more notes about text input formats, see :doc:`/tutorials/input_format`.

 For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train?format=libsvm#dtrain.cache"``.
-
-
-**********************************
-GPU Version (GPU Hist tree method)
-**********************************
-External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
-
-If you are still getting out-of-memory errors after enabling external memory, try subsampling the
-data to further reduce GPU memory usage:
-
-.. code-block:: python
-
-  param = {
-    ...
-    'subsample': 0.1,
-    'sampling_method': 'gradient_based',
-  }
-
-For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally
-the tree method still concatenate all the chunks into 1 final histogram index due to
-performance reason, but in compressed format. So its scalability has an upper bound but
-still has lower memory cost in general.
-
-***********
-CPU Version
-***********
-
-For CPU histogram based tree methods (``approx``, ``hist``) it's recommended to use
-``grow_policy=depthwise`` for performance reason. Iterating over data batches is slow,
-with ``depthwise`` policy XGBoost can build a entire layer of tree nodes with a few
-iterations, while with ``lossguide`` XGBoost needs to iterate over the data set for each
-tree node.
