In recent years, we’ve witnessed the rise of foundation models¹ for text (e.g., ChatGPT, Llama), images (e.g., DALL-E, Stable Diffusion 2), and even video (e.g., Stable Video Diffusion).