The config file of SyncNet defines the architectures of the audio and visual encoders. Let's first look at an example of an audio encoder:
```yaml
audio_encoder: # input (1, 80, 52)
  in_channels: 1
  block_out_channels: [32, 64, 128, 256, 512, 1024, 2048]
  downsample_factors: [[2, 1], 2, 2, 1, 2, 2, [2, 3]]
  attn_blocks: [0, 0, 0, 1, 1, 0, 0]
  dropout: 0.0
```
The above architecture accepts a 1 x 80 x 52 image (a mel spectrogram) and outputs a 2048 x 1 x 1 feature map. If the resolution of the input image changes, you need to redefine the `downsample_factors` so that the output has shape D x 1 x 1, which allows it to be used to compute cosine similarity; in other words, the per-axis downsample factors should multiply out to take the input resolution down to 1 x 1. You should also adjust `block_out_channels`: in most cases, deeper networks need more channels to store more features. We recommend reading the EfficientNet paper, which discusses how to balance the depth and width of CNNs. The `attn_blocks` list defines whether each layer has a self-attention block, where 1 indicates presence and 0 indicates absence.
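To sanity-check a new set of `downsample_factors`, you can trace the spatial size through the stages. Below is a minimal Python sketch; the `spatial_size` helper is hypothetical, and it assumes each stage floor-divides height and width by its factor (which reproduces the shapes of the example configs here), so verify against the actual SyncNet blocks if your numbers don't line up:

```python
def spatial_size(input_hw, downsample_factors):
    # Hypothetical helper: follow (H, W) through each downsampling stage.
    # Assumes floor division per stage; a scalar factor applies to both axes,
    # a [h, w] pair applies per axis.
    h, w = input_hw
    for f in downsample_factors:
        fh, fw = (f, f) if isinstance(f, int) else f
        h, w = h // fh, w // fw
    return h, w

# Audio encoder above: an 80 x 52 mel spectrogram is reduced to 1 x 1.
print(spatial_size((80, 52), [[2, 1], 2, 2, 1, 2, 2, [2, 3]]))  # (1, 1)

# Visual encoder below: 128 x 256 frames are also reduced to 1 x 1.
print(spatial_size((128, 256), [[1, 2], 2, 2, 2, 2, 2, 2, 2]))  # (1, 1)
```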
Now let's look at an example of a visual encoder:
```yaml
visual_encoder: # input (48, 128, 256)
  in_channels: 48 # (16 x 3)
  block_out_channels: [64, 128, 256, 256, 512, 1024, 2048, 2048]
  downsample_factors: [[1, 2], 2, 2, 2, 2, 2, 2, 2]
  attn_blocks: [0, 0, 0, 0, 1, 1, 0, 0]
  dropout: 0.0
```
It is important to note that `in_channels` equals `num_frames * image_channels`. For a pixel-space SyncNet, `image_channels` is 3, while for a latent-space SyncNet, `image_channels` equals the `latent_channels` of the VAE you are using, typically 4 (SD 1.5, SDXL) or 16 (FLUX, SD3). In the example above, the visual encoder has an input frame length of 16 and is a pixel-space SyncNet, so `in_channels` is 16 x 3 = 48.
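As a quick illustration of how this multiplication plays out, here is a hedged PyTorch sketch; the tensor layout and the `reshape` call are assumptions about how frames are stacked into channels, not the exact SyncNet data pipeline:

```python
import torch

batch, num_frames, image_channels = 2, 16, 3  # pixel-space example
frames = torch.randn(batch, num_frames, image_channels, 128, 256)  # (B, T, C, H, W)

# Frames are stacked along the channel axis (an assumption here), so the
# encoder's in_channels must equal num_frames * image_channels = 16 * 3 = 48.
encoder_input = frames.reshape(batch, num_frames * image_channels, 128, 256)
print(encoder_input.shape)  # torch.Size([2, 48, 128, 256])
```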