@amaarora I like the idea behind Accelerate, and I 100% agree with the approach. Sylvain mentioned it to me a while back when I said I was working with TPUs. I had started working on a device wrapper of my own; I considered using Accelerate instead, but there were a few things I preferred about my approach.

My device wrapper is called DeviceEnv (kudos to Sylvain and HF for having a much sexier name...). It similarly combines the device id, distributed initialization, and wrapping of device transfer / DDP etc. in a common interface. I split the optimizer aspect out into a separate abstraction called Updater that deals with backward / loss scaling / grad modification (clipping) / step.

I have DeviceEnvXla and DeviceEnvCuda working so far. DeviceEnvXla works with XLA on TPU, GPU, or CPU; I've been running training locally on 2x GPU w/ AMP using XLA, and on 8x TPU. I'm about to add a DeviceEnvDeepSpeed (it requires a few mods to how I initialize the model/optimizer (updater)). It's on this branch: https://github.com/rwightman/pytorch-image-models/tree/bits_and_tpu ... I was going to msg you and Tanishq this week once I push another commit to squash a few obvious bugs. I still have a number of things to improve.
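For readers following along, the DeviceEnv/Updater split described above might look roughly like the sketch below. Only the names `DeviceEnv`, `DeviceEnvXla`, `DeviceEnvCuda`, and `Updater` come from the comment; every method name and the toy `DeviceEnvCpu` are illustrative guesses, not the actual `bits_and_tpu` API.

```python
from abc import ABC, abstractmethod


class DeviceEnv(ABC):
    """Common interface over device id, distributed init, and model wrapping.

    Concrete envs in the branch would be DeviceEnvXla / DeviceEnvCuda;
    method names here are hypothetical.
    """

    @abstractmethod
    def initialize(self):
        """Set up the device and, if needed, the distributed process group."""

    @abstractmethod
    def wrap_model(self, model):
        """Move the model to the device and wrap for DDP/XLA as appropriate."""


class Updater(ABC):
    """Owns the optimizer side: backward / loss scaling / grad clipping / step."""

    @abstractmethod
    def apply(self, loss):
        """Run backward (scaled under AMP), clip grads, and step the optimizer."""


class DeviceEnvCpu(DeviceEnv):
    """Trivial stand-in showing the shape of a concrete env."""

    def initialize(self):
        self.device = "cpu"  # a real env would also init dist/XLA here

    def wrap_model(self, model):
        return model  # nothing to move or wrap on CPU
```

The appeal of the split is that the training loop only talks to `DeviceEnv` and `Updater`, so swapping CUDA for XLA (or adding DeepSpeed) doesn't touch the loop itself.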
Hey @rwightman! Here's a discussion topic: should we switch to Hugging Face Accelerate for DDP? As someone who's spent quite a bit of time in timm, I believe it's not a straightforward switch, but it could simplify the code. It would get rid of a number of operations performed under the `if args.distributed:` condition inside `train.py`, and would also remove the need for a `DistributedSampler` in the train and eval dataloaders (& more?). Or do you think this isn't needed at this stage?