Using XGBoost External Memory Version
#####################################

- XGBoost supports loading data from external memory using builtin data parser. And
- starting from version 1.5, users can also define a custom iterator to load data in chunks.
- The feature is still experimental and not yet ready for production use. In this tutorial
- we will introduce both methods. Please note that training on data from external memory is
- not supported by ``exact`` tree method.
+ When working with large datasets, training XGBoost models can be challenging as the entire
+ dataset needs to be loaded into memory. This can be costly and sometimes
+ infeasible. Starting from 1.5, users can define a custom iterator to load data in chunks
+ for running XGBoost algorithms. External memory can be used for both training and
+ prediction, but training is the primary use case and it will be our focus in this
+ tutorial. For prediction and evaluation, users can iterate through the data themselves,
+ while training requires the full dataset to be loaded into memory.
+
+ During training, there are two different approaches for external memory support available
+ in XGBoost: one for CPU-based algorithms like ``hist`` and ``approx``, and another for the
+ GPU-based training algorithm. We will introduce them in the following sections.

- .. warning::
+ .. note::

-   The implementation of external memory uses ``mmap`` and is not tested against system
-   errors like disconnected network devices (`SIGBUS`). In addition, Windows is not yet
-   supported.
+   Training on data from external memory is not supported by the ``exact`` tree method.

.. note::

-   When externel memory is used, the CPU training performance is IO bounded. Meaning, the
-   training speed is almost exclusively determined by the disk IO speed. For GPU, please
-   read on and see the gradient-based sampling with external memory. During benchmark, we
-   used a NVME connected to a PCIE slot, the performance is "usable" with ``hist`` on CPU.
+   The implementation of external memory uses ``mmap`` and is not tested against system
+   errors like disconnected network devices (`SIGBUS`). In addition, Windows is not yet
+   supported.

*************
Data Iterator
@@ -28,8 +31,8 @@ Data Iterator
Starting from XGBoost 1.5, users can define their own data loader using Python or C
interface. There are some examples in the ``demo`` directory for quick start. This is a
generalized version of text input external memory, where users no longer need to prepare a
- text file that XGBoost recognizes. To enable the feature, user need to define a data
- iterator with 2 class methods ``next`` and ``reset`` then pass it into ``DMatrix``
+ text file that XGBoost recognizes. To enable the feature, users need to define a data
+ iterator with 2 class methods: ``next`` and ``reset``, then pass it into the ``DMatrix``
constructor.

.. code-block:: python
@@ -73,18 +76,84 @@ constructor.

   # Other tree methods including ``hist`` and ``gpu_hist`` also work, but has some caveats
   # as noted in following sections.
-   booster = xgboost.train({"tree_method": "approx"}, Xy)
+   booster = xgboost.train({"tree_method": "hist"}, Xy)
+
+
+ The above snippet is a simplified version of ``demo/guide-python/external_memory.py``.
+ For an example in C, please see ``demo/c-api/external-memory/``. The iterator is the
+ common interface for using external memory with XGBoost; you can pass the resulting
+ ``DMatrix`` object for training, prediction, and evaluation.
+
+ It is important to set the batch size based on the memory available. A good starting point
+ is to set the batch size to 10GB per batch if you have 64GB of memory. It is *not*
+ recommended to set small batch sizes like 32 samples per batch, as this can seriously hurt
+ performance in gradient boosting.
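+
+ As a rough, back-of-the-envelope illustration of this guideline (the feature count and
+ sizes below are made-up assumptions, not values recommended by XGBoost), the target batch
+ size can be translated into a row count:
+
+ .. code-block:: python
+
+   # Hypothetical sizing sketch assuming dense float32 data with 100 features.
+   n_features = 100                     # assumed width of the dataset
+   bytes_per_row = n_features * 4       # 4 bytes per float32 value
+   target_batch_bytes = 10 * 1024**3    # aim for roughly 10GB per batch
+   rows_per_batch = target_batch_bytes // bytes_per_row
+   print(rows_per_batch)                # about 26.8 million rows per batch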
+
+ ***********
+ CPU Version
+ ***********
+
+ In the previous section, we demonstrated how to train a tree-based model using the
+ ``hist`` tree method on a CPU. This method involves iterating through data batches stored
+ in a cache during tree construction. For optimal performance, we recommend using the
+ ``grow_policy=depthwise`` setting, which allows XGBoost to build an entire layer of tree
+ nodes with only a few batch iterations. Conversely, using the ``lossguide`` policy
+ requires XGBoost to iterate over the data set for each tree node, resulting in slower
+ performance.
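+
+ A minimal sketch of such a configuration (assuming ``Xy`` is the external-memory
+ ``DMatrix`` constructed from the data iterator above) might look like:
+
+ .. code-block:: python
+
+   # ``depthwise`` growth keeps the number of passes over the external-memory cache low.
+   booster = xgboost.train({"tree_method": "hist", "grow_policy": "depthwise"}, Xy)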
+
+ If external memory is used, the performance of CPU training is limited by IO
+ (input/output) speed. This means that the disk IO speed primarily determines the training
+ speed. During benchmarking, we used an NVMe drive connected to a PCIe-4 slot; other types
+ of storage can be too slow for practical usage. In addition, your system may perform
+ caching to reduce the overhead of file reading.
+
+ **********************************
+ GPU Version (GPU Hist tree method)
+ **********************************
+
+ External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set to
+ ``gpu_hist``). However, the algorithm used for GPU is different from the one used for
+ CPU. When training on a CPU, the tree method iterates through all batches from external
+ memory for each step of the tree construction algorithm. On the other hand, the GPU
+ algorithm concatenates all batches into one and stores it in GPU memory. To reduce overall
+ memory usage, users can utilize subsampling. The good news is that the GPU hist tree
+ method supports gradient-based sampling, enabling users to set a low sampling rate without
+ compromising accuracy.
+
+ .. code-block:: python
+
+   param = {
+     ...
+     'subsample': 0.2,
+     'sampling_method': 'gradient_based',
+   }
+
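+ As a minimal usage sketch (assuming ``Xy`` is the external-memory ``DMatrix`` built from
+ the data iterator above), these parameters can be combined with ``gpu_hist`` and passed to
+ ``xgboost.train``:
+
+ .. code-block:: python
+
+   param = {
+     'tree_method': 'gpu_hist',
+     'subsample': 0.2,
+     'sampling_method': 'gradient_based',
+   }
+   # Xy is the DMatrix created from the custom data iterator.
+   booster = xgboost.train(param, Xy)
+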
+ For more information about the sampling algorithm and its use in external memory training,
+ see `this paper <https://arxiv.org/abs/2005.09148>`_.
+
+ .. warning::
+
+   When the GPU runs out of memory during iteration on external memory, users might
+   receive a segfault instead of an OOM exception.

+ *******
+ Remarks
+ *******

- The above snippet is a simplified version of ``demo/guide-python/external_memory.py``. For
- an example in C, please see ``demo/c-api/external-memory/``.
+ When using external memory with XGBoost, data is divided into smaller chunks so that only
+ a fraction of it needs to be stored in memory at any given time. It's important to note
+ that this method only applies to the predictor data (``X``), while other data, like labels
+ and internal runtime structures, are concatenated. This means that memory reduction is most
+ effective when dealing with wide datasets where ``X`` is larger compared to other data
+ like ``y``, while it has little impact on slim datasets.

****************
Text File Inputs
****************

- There is no big difference between using external memory version and in-memory version.
- The only difference is the filename format.
+ This is the original form of external memory support; users are encouraged to use the
+ custom data iterator instead. There is no big difference between the external memory
+ version of text input and the in-memory version. The only difference is the filename
+ format.

The external memory version takes in the following `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format:

@@ -117,35 +186,3 @@ XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to
more notes about text input formats, see :doc:`/tutorials/input_format`.

For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train?format=libsvm#dtrain.cache"``.
-
-
- **********************************
- GPU Version (GPU Hist tree method)
- **********************************
- External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
-
- If you are still getting out-of-memory errors after enabling external memory, try subsampling the
- data to further reduce GPU memory usage:
-
- .. code-block:: python
-
-   param = {
-     ...
-     'subsample': 0.1,
-     'sampling_method': 'gradient_based',
-   }
-
- For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally
- the tree method still concatenate all the chunks into 1 final histogram index due to
- performance reason, but in compressed format. So its scalability has an upper bound but
- still has lower memory cost in general.
-
- ***********
- CPU Version
- ***********
-
- For CPU histogram based tree methods (``approx``, ``hist``) it's recommended to use
- ``grow_policy=depthwise`` for performance reason. Iterating over data batches is slow,
- with ``depthwise`` policy XGBoost can build a entire layer of tree nodes with a few
- iterations, while with ``lossguide`` XGBoost needs to iterate over the data set for each
- tree node.