@@ -22,6 +22,15 @@ GPU-based training algorithm. We will introduce them in the following sections.
The feature is still experimental as of 2.0. The performance is not well optimized.
+ The external memory support has gone through multiple iterations and is still under heavy
+ development. Like the :py:class:`~xgboost.QuantileDMatrix` with
+ :py:class:`~xgboost.DataIter`, XGBoost loads data batch-by-batch using a custom iterator
+ supplied by the user. However, unlike the :py:class:`~xgboost.QuantileDMatrix`, external
+ memory will not concatenate the batches unless the GPU is used (it uses a hybrid approach;
+ more details follow). Instead, it will cache all batches in external memory and fetch them
+ on demand. Go to the end of the document to see a comparison between ``QuantileDMatrix``
+ and external memory.
+
*************
Data Iterator
*************
@@ -113,10 +122,11 @@ External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set
``gpu_hist``). However, the algorithm used for GPU is different from the one used for
CPU. When training on a CPU, the tree method iterates through all batches from external
memory for each step of the tree construction algorithm. On the other hand, the GPU
- algorithm concatenates all batches into one and stores it in GPU memory. To reduce overall
- memory usage, users can utilize subsampling. The good news is that the GPU hist tree
- method supports gradient-based sampling, enabling users to set a low sampling rate without
- compromising accuracy.
+ algorithm uses a hybrid approach. At the beginning of each training iteration, it iterates
+ through the batches and concatenates them into a single block in GPU memory. To reduce
+ overall memory usage, users can utilize subsampling. The GPU hist tree method supports
+ `gradient-based sampling`, enabling users to set a low sampling rate without compromising
+ accuracy.

.. code-block:: python
@@ -134,6 +144,8 @@ see `this paper <https://arxiv.org/abs/2005.09148>`_.
When the GPU is running out of memory during iteration on external memory, the user might
receive a segfault instead of an OOM exception.
+ .. _ext_remarks:
+
*******
Remarks
*******
@@ -142,17 +154,64 @@ When using external memory with XGBoost, data is divided into smaller chunks so
a fraction of it needs to be stored in memory at any given time. It's important to note
that this method only applies to the predictor data (``X``), while other data, like labels
and internal runtime structures are concatenated. This means that memory reduction is most
- effective when dealing with wide datasets where ``X`` is larger compared to other data
- like ``y``, while it has little impact on slim datasets.
+ effective when dealing with wide datasets where ``X`` is significantly larger in size
+ compared to other data like ``y``, while it has little impact on slim datasets.
+
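+ As a hypothetical illustration of the wide-dataset case (the shapes below are made up for
+ the example), the predictor matrix dwarfs the labels, which is where external memory helps:
+
+ .. code-block:: python
+
+     import numpy as np
+
+     # A made-up "wide" dataset: 1M rows, 500 float32 features.
+     n_samples, n_features = 1_000_000, 500
+     x_bytes = n_samples * n_features * np.dtype(np.float32).itemsize
+     y_bytes = n_samples * np.dtype(np.float32).itemsize
+     print(f"X: {x_bytes / 1024**3:.2f} GiB, y: {y_bytes / 1024**2:.2f} MiB")
+     # X: 1.86 GiB, y: 3.81 MiB -- caching X externally accounts for nearly all the savings.
+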
+ As one might expect, fetching data on demand puts significant pressure on the storage
+ device. Today's computing devices can process far more data than a storage device can read
+ in a single unit of time; the gap is measured in orders of magnitude. A GPU is capable of
+ processing hundreds of gigabytes of floating-point data in a split second. On the other
+ hand, a four-lane NVMe device connected to a PCIe-4 slot usually has about 6GB/s of data
+ transfer rate. As a result, training is likely to be severely bounded by your storage
+ device. Before adopting the external memory solution, some back-of-envelope calculations
+ might help you see whether it's viable. For instance, suppose your NVMe drive can transfer
+ 4GB of data per second (a fairly practical number) and you have 100GB of data in the
+ compressed XGBoost cache (which corresponds to a dense float32 numpy array with a size of
+ 200GB, give or take). A tree with depth 8 needs at least 16 iterations through the data
+ when the parameters are right. You need about 14 minutes to train a single tree, without
+ accounting for other overheads and assuming the computation overlaps with the IO. If your
+ dataset happens to be TB-scale, then you might need thousands of trees to get a
+ generalized model. These calculations can help you get an estimate of the expected
+ training time.
+
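+ In other words, the back-of-envelope estimate boils down to a simple ratio, ignoring other
+ overheads and assuming the computation overlaps with the IO:
+
+ .. math::
+
+    T_{\text{tree}} \approx \frac{\text{cache size} \times \text{data passes per tree}}{\text{storage bandwidth}}
+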
+ However, we can sometimes ameliorate this limitation. One should also consider that the OS
+ (mostly the Linux kernel) can usually cache the data in host memory. It only evicts pages
+ when new data comes in and there's no room left. In practice, at least some portion of the
+ data can persist in host memory throughout the entire training session. We take this cache
+ into account when optimizing the external memory fetcher. The compressed cache is usually
+ smaller than the raw input data, especially when the input is dense without any missing
+ values. If the host memory can fit a significant portion of this compressed cache, then
+ the performance should be decent after initialization. Our development so far focuses on
+ two fronts of optimization for external memory:
+
+ - Avoid iterating through the data whenever appropriate.
+ - If the OS can cache the data, the performance should be close to in-core training.

Starting with XGBoost 2.0, the implementation of external memory uses ``mmap``. It is not
- yet tested against system errors like disconnected network devices (`SIGBUS`). Also, it's
- worth noting that most tests have been conducted on Linux distributions.
+ tested against system errors like disconnected network devices (`SIGBUS`). In the face of
+ a bus error, you will see a hard crash and need to clean up the cache files. If the
+ training session is likely to take a long time and you are using solutions like NVMe-oF,
+ we recommend checkpointing your model periodically. Also, it's worth noting that most
+ tests have been conducted on Linux distributions.
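+
+ One simple checkpointing scheme is to train in small chunks of boosting rounds and save
+ the booster between chunks. The snippet below is only a sketch of that idea: the ``Xy``
+ placeholder, the chunk size, and the file names are illustrative assumptions, not a
+ dedicated XGBoost checkpointing API.
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # Assume `Xy` is the external-memory DMatrix built from your custom
+     # xgboost.DataIter, as shown earlier in this document.
+     params = {"tree_method": "hist", "objective": "binary:logistic"}
+     rounds_per_checkpoint, total_rounds = 10, 100
+     booster = None
+     for done in range(0, total_rounds, rounds_per_checkpoint):
+         # Continue from the previous booster, then persist the model so that a
+         # crash loses at most `rounds_per_checkpoint` rounds of work.
+         booster = xgb.train(
+             params, Xy, num_boost_round=rounds_per_checkpoint, xgb_model=booster
+         )
+         booster.save_model(f"checkpoint-{done + rounds_per_checkpoint}.json")
+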
- Another important point to keep in mind is that creating the initial cache for XGBoost may
- take some time. The interface to external memory is through custom iterators, which may or
- may not be thread-safe. Therefore, initialization is performed sequentially.
+ Another important point to keep in mind is that creating the initial cache for XGBoost may
+ take some time. The interface to external memory is through custom iterators, which we
+ cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Using
+ :py:func:`~xgboost.config_context` with `verbosity=2` can give you some information on what
+ XGBoost is doing during the wait if you don't mind the extra output.
+
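+ For example, a minimal sketch (assuming ``it`` is the custom :py:class:`~xgboost.DataIter`
+ instance from the Data Iterator section above):
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # `it` is assumed to be your custom xgboost.DataIter instance.
+     with xgb.config_context(verbosity=2):
+         # Cache construction happens here; the higher verbosity prints what
+         # XGBoost is doing while it iterates through the batches sequentially.
+         Xy = xgb.DMatrix(it)
+
+     booster = xgb.train({"tree_method": "hist"}, Xy)
+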
+ *******************************
+ Compared to the QuantileDMatrix
+ *******************************
+
+ Passing an iterator to the :py:class:`~xgboost.QuantileDMatrix` enables direct
+ construction of the ``QuantileDMatrix`` from data chunks. On the other hand, if it's
+ passed to the :py:class:`~xgboost.DMatrix`, it instead enables the external memory
+ feature. The :py:class:`~xgboost.QuantileDMatrix` concatenates the data in memory after
+ compression and doesn't fetch data during training. On the other hand, the external memory
+ ``DMatrix`` fetches data batches from external memory on demand. Use the
+ ``QuantileDMatrix`` (with an iterator if necessary) when you can fit most of your data in
+ memory. The training would be an order of magnitude faster than using external memory.
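+
+ A minimal sketch of the two construction paths (again assuming ``it`` is your custom
+ :py:class:`~xgboost.DataIter` instance):
+
+ .. code-block:: python
+
+     import xgboost as xgb
+
+     # `it` is assumed to be your custom xgboost.DataIter instance.
+
+     # In-core: batches are quantized, compressed, and concatenated in memory;
+     # nothing is fetched from disk during training.
+     Xy = xgb.QuantileDMatrix(it)
+
+     # External memory: batches are cached on disk and fetched on demand during
+     # training, trading speed for a much smaller memory footprint.
+     # Xy = xgb.DMatrix(it)
+
+     booster = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=10)
+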
****************
Text File Inputs