Diffusion & Multimodal Generative Models
For an in-depth look at a particular model that I designed and trained, read this informal technical report.
In a year at Emittance, inc (“Can of Soup”), I was fortunate to move into the space of training and manipulating diffusion models for image generation, in particular tackling the tighter integration of image models with multimodal contexts and vision LLMs. Working on a team of just two engineers, we trained models for both small-scale experimentation and large-scale foundational work, including a foundational diffusion model. Our stack used Ray for distributed training across clusters of GPUs.
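Our actual training code is proprietary, but to give a sense of the shape of that stack, here is a minimal sketch of a Ray Train data-parallel loop. The model, batch, and hyperparameters below are toy placeholders, not anything we ran:

```python
# Minimal sketch of a Ray Train data-parallel loop. The model, batch,
# and hyperparameters are toy placeholders, not our training stack.
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Toy stand-in for a diffusion model; prepare_model wraps it in DDP
    # and moves it to this worker's device.
    model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
    model = ray.train.torch.prepare_model(model)
    opt = torch.optim.AdamW(model.parameters(), lr=config["lr"])

    for _ in range(config["steps"]):
        # Placeholder batch; a real run would shard a dataset across workers.
        x = torch.randn(32, 64, device=ray.train.torch.get_device())
        loss = ((model(x) - x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        ray.train.report({"loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "steps": 100},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # scale out here
)
trainer.fit()
```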
As the person chiefly tasked with following, explaining, and adopting the latest research, I specialized in and innovated across the following areas:
- novel attention-sharing mechanisms in both DiT- and convolution-based diffusion models;
- vision-language encoding of images into a shared semantic space for image rendering or manipulation (read the technical report);
- face customization, employing facial recognition and other inputs;
- a combination of the leading methods in diffusion-model customization, including ControlNets, IP-Adapter integrations of modalities, multi-region diffusion and inpainting, DDIM inversion (see the sketch after this list), novel inference-time sampling variations, and more;
- learning aligned embedding spaces between models so they can be assembled in novel ways;
- spatialized prompting of transformer-based flow-matching models, in which the input text or embedding can hierarchically and precisely target regions of an image (a generic sketch of region-masked attention also follows this list);
- many experiments in search of novel methods for integrating and manipulating natural-image inputs to diffusion models.
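Most of the implementations above are proprietary, but one ingredient, DDIM inversion, is standard enough to sketch generically: running the deterministic DDIM update in reverse recovers a noise latent that approximately regenerates a given image, which is the usual entry point for editing real images. The noise model and schedule below are placeholders, not anything we shipped:

```python
# Generic DDIM inversion sketch: walk a clean latent x_0 forward through
# the deterministic DDIM updates to recover a noise latent x_T that
# approximately regenerates it. The eps_model here is a placeholder.
import torch


@torch.no_grad()
def ddim_invert(eps_model, x0, alpha_bars, timesteps):
    """Invert x0 to x_T by running the deterministic DDIM step in reverse.

    eps_model(x_t, t) -> predicted noise; alpha_bars[t] is the cumulative
    product of (1 - beta_t) for the diffusion schedule.
    """
    x = x0
    # Step from less-noisy t_prev to more-noisy t (the reverse of sampling).
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
        ab_prev, ab = alpha_bars[t_prev], alpha_bars[t]
        eps = eps_model(x, t_prev)
        # Predicted clean latent from the current (partially noised) latent.
        x0_pred = (x - (1 - ab_prev).sqrt() * eps) / ab_prev.sqrt()
        # Re-noise to level t with the same predicted eps (eta = 0).
        x = ab.sqrt() * x0_pred + (1 - ab).sqrt() * eps
    return x


# Toy usage with a placeholder noise model and schedule.
eps_model = lambda x, t: torch.zeros_like(x)  # stand-in for a trained UNet/DiT
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)
timesteps = torch.arange(0, 1000, 20)  # coarse inversion schedule
x0 = torch.randn(1, 4, 64, 64)         # e.g. a VAE-encoded image latent
x_T = ddim_invert(eps_model, x0, alpha_bars, timesteps)
```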
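Likewise, the spatialized-prompting mechanism itself is not something I can share, but its most basic ingredient, restricting which image tokens may attend to which text tokens, can be illustrated generically. This sketch omits the learned projections a real attention layer would have, and all names and shapes are illustrative:

```python
# Generic region-masked cross-attention: image tokens in a region may only
# attend to the text tokens bound to that region. Purely illustrative; not
# the proprietary mechanism described above.
import torch
import torch.nn.functional as F


def region_cross_attention(img_tokens, txt_tokens, region_mask, num_heads=8):
    """img_tokens: (B, N_img, D); txt_tokens: (B, N_txt, D);
    region_mask: (B, N_img, N_txt) boolean, True where attention is allowed.
    Learned q/k/v projections are omitted for brevity."""
    B, N_img, D = img_tokens.shape
    head_dim = D // num_heads

    def split(x):  # (B, N, D) -> (B, heads, N, head_dim)
        return x.view(B, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(img_tokens), split(txt_tokens), split(txt_tokens)
    # Broadcast the region mask over heads; disallowed pairs are excluded.
    attn_mask = region_mask[:, None, :, :]
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
    return out.transpose(1, 2).reshape(B, N_img, D)


# Toy usage: two image regions, each bound to its own span of text tokens.
B, N_img, N_txt, D = 1, 16, 8, 64
img = torch.randn(B, N_img, D)
txt = torch.randn(B, N_txt, D)
mask = torch.zeros(B, N_img, N_txt, dtype=torch.bool)
mask[:, :8, :4] = True   # first half of the image attends to prompt A
mask[:, 8:, 4:] = True   # second half attends to prompt B
out = region_cross_attention(img, txt, mask)
```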
In addition to the above, I developed an original paired-diffusion concept to generate a large-scale editing dataset spanning a wide range of tasks; to my knowledge, its quality greatly exceeds any prior art.
I also led regular paper explainers, walking through new research to clarify it and to evaluate its potential and limitations.
While all of the work above is proprietary and closed, one small open-source contribution from just before I joined the company sheds light on an area of interest and illustrates my style of work.