Researchers from the University of California, Berkeley, Toyo University, Tokyo Institute of Technology, MIT, and the University of Tsukuba have jointly open-sourced StreamDiffusion, a framework for real-time interactive image generation. Its key technical innovation is to transform traditional sequential denoising into streaming batch denoising, which removes the long waits of conventional interactive generation and delivers smooth, very high-throughput image generation.

In addition, a "residual classifier-free guidance" method is introduced to further improve the efficiency and image quality of streaming batch processing.

According to StreamDiffusion's project history on GitHub, it took only 8 days to reach 6,100 stars and become a top open source project. Its performance and popularity are evident. Commercial use by developers is permitted.

Open source address: https://github.com/cumulo-autumn/StreamDiffusion

Paper address: https://arxiv.org/abs/2312.12491

Demo display: https://github.com/cumulo-autumn/StreamDiffusion/blob/main/assets/demo_03.gif


At present, diffusion models are widely used in image generation and have been successfully commercialized, for example in Midjourney, a benchmark product in this field.

However, diffusion models perform poorly in real-time interaction and require long waits, especially in scenarios involving continuous input.

To solve these problems, the researchers designed a novel input and output approach that converts the original sequential denoising into a batch denoising process.

Simply put, StreamDiffusion is the equivalent of an assembly line for large models: it turns the one-at-a-time, cumbersome denoising and inference process into batch processing.

Streaming batch denoising method

Streaming batch denoising is one of the core functions of StreamDiffusion and the key to achieving real-time interaction.

Traditional interactive diffusion models are executed sequentially: one image is input at a time, and only after all denoising steps are completed is one result image output. This process is then repeated for each additional image.

This creates a major problem: speed and quality cannot both be guaranteed. Generating high-quality images requires more denoising steps, which slows generation, so you cannot have your cake and eat it too.

The core idea of streaming batch denoising is that once the first image has entered the denoising steps, the second image can be accepted without waiting for the first to finish, so that images at different denoising steps are processed together as a batch.

In this way, U-Net only needs to be called repeatedly on a batch of features to advance the whole image generation pipeline efficiently.


In addition, the benefit of the streaming batch denoising method is that each time U-Net is called, multiple images can be pushed forward simultaneously, and U-Net's batch operations are very suitable for GPU parallel computing, so the overall computing efficiency is very high.

Ultimately, the generation time of a single image can be significantly shortened while ensuring quality.
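
The scheduling idea above can be sketched in a few lines of Python. This is only a toy illustration, not StreamDiffusion's actual code: `fake_unet_step` and `stream_batch` are hypothetical names, the "latent" is a plain number, and each loop iteration stands in for one batched U-Net call that advances every in-flight frame by one denoising step.

```python
from collections import deque

def fake_unet_step(latent):
    """Hypothetical stand-in for one denoising step on one batch element."""
    return latent + 1  # pretend each step removes one unit of noise

def stream_batch(frames, n_steps):
    """Admit one new frame per call and advance all in-flight frames together."""
    pending = deque(enumerate(frames))  # frames that have not started yet
    in_flight = deque()                 # items are [frame_id, latent, steps_done]
    finished, unet_calls = [], 0
    while pending or in_flight:
        if pending:
            frame_id, latent = pending.popleft()
            in_flight.append([frame_id, latent, 0])
        # One batched call: every in-flight frame moves forward by one step.
        unet_calls += 1
        for item in in_flight:
            item[1] = fake_unet_step(item[1])
            item[2] += 1
        # The oldest frame leaves the batch once it has completed all steps.
        while in_flight and in_flight[0][2] == n_steps:
            frame_id, latent, _ = in_flight.popleft()
            finished.append((frame_id, latent))
    return finished, unet_calls

results, calls = stream_batch(frames=[0, 0, 0, 0, 0], n_steps=4)
print(calls)  # 8 batched calls, versus 5 * 4 = 20 sequential calls
```

In this toy model, 5 frames with 4 steps each need 5 + 4 - 1 = 8 batched calls. Each call handles up to 4 frames at once, which is the kind of work a GPU can run in parallel.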

Residual classifier-free guidance

In order to strengthen the influence of prompt conditions on results, diffusion models usually use a strategy called "classifier-free guidance".

In the traditional method, computing the negative condition vector requires pairing each input latent vector with a negative condition embedding and making an additional U-Net call at every inference step, which consumes a lot of computing power.

To solve this problem, the researchers proposed the "residual classifier-free guidance" method. Its core idea is to assume a "virtual residual noise" vector that approximates the negative condition vector.


The positive condition vector is calculated first and then used to infer the virtual negative condition vector. This avoids an additional U-Net call at every step to calculate the real negative condition vector, greatly reducing the computation required.

Simply put, the simplest variant uses the original input image encoding as the negative sample, which can be calculated without calling U-Net. The slightly more complicated "one-time negative condition" variant uses U-Net to calculate a negative vector once in the first step and then reuses it to approximate all subsequent negative vectors.
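
A rough numerical sketch may help make this concrete. It rests on assumptions that the article does not spell out: the usual classifier-free guidance combination, and the common forward-noising relation x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, from which a "virtual" noise can be recovered using only the input image encoding x_0. The function names are hypothetical, and the actual method may scale or combine these terms differently.

```python
import math

def cfg(eps_cond, eps_neg, guidance_scale):
    """Usual classifier-free guidance mix of two noise predictions."""
    return [n + guidance_scale * (c - n) for c, n in zip(eps_cond, eps_neg)]

def virtual_residual_noise(x_t, x_0, alpha_bar_t):
    """Noise implied by x_t if x_0 is treated as the clean latent (no U-Net call).

    Assumes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    signal = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(xt - signal * x0) / noise_scale for xt, x0 in zip(x_t, x_0)]

def unet_calls(n_steps, mode):
    """U-Net calls per image for n_steps denoising steps under each strategy."""
    return {
        "standard": 2 * n_steps,             # positive + negative at every step
        "self_negative": n_steps,            # negative derived from x_0, no extra call
        "onetime_negative": n_steps + 1,     # one real negative call, then reused
    }[mode]

# The virtual noise replaces the negative prediction in the guidance formula.
eps_neg = virtual_residual_noise(x_t=[0.52, -0.62], x_0=[0.5, -1.0], alpha_bar_t=0.64)
guided = cfg(eps_cond=[1.0, 2.0], eps_neg=eps_neg, guidance_scale=1.5)
```

The call counts follow directly from the description above: the conventional approach pays for two U-Net evaluations per step, while the two residual variants pay for one per step plus at most a single extra call.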

Assembly line operation

The purpose of this module is to ensure that the bottleneck of the entire system is no longer data format conversion but the inference time of the model itself.

Usually, the input image requires preprocessing such as scaling and format conversion to become a tensor usable by the model; the output tensor also needs post-processing to restore it to the image format, and the entire process consumes a lot of time and computing power.


The pipeline completely separates pre- and post-processing from model inference and runs them in parallel in different threads. The input image is preprocessed and placed in the input queue, while the output tensor is taken from the output queue and post-processed into an image. This way the stages do not have to wait for each other, which speeds up the overall process.


In addition, this method also plays a role in smoothing the data flow. When an input source failure or communication error prevents new images from being transferred temporarily, the queue can continue to provide previously cached images to ensure smooth operation of the model.
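
The queue-based decoupling described in this section can be illustrated with Python's standard `queue` and `threading` modules. This is a minimal sketch under assumed names (`stage`, `run_pipeline`, and the three placeholder functions are hypothetical), not the project's own implementation: each stage runs in its own thread and communicates only through FIFO queues.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down

def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))

def run_pipeline(frames, preprocess, infer, postprocess):
    """Run the three stages in parallel threads connected by queues."""
    raw, inputs, outputs, images = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=stage, args=(preprocess, raw, inputs)),
        threading.Thread(target=stage, args=(infer, inputs, outputs)),
        threading.Thread(target=stage, args=(postprocess, outputs, images)),
    ]
    for t in threads:
        t.start()
    for frame in frames:
        raw.put(frame)
    raw.put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = images.get()
        if item is STOP:
            return results
        results.append(item)

# Placeholder stages: real ones would resize images, call the model, and so on.
out = run_pipeline([1, 2, 3], preprocess=lambda x: x * 2,
                   infer=lambda x: x + 1, postprocess=str)
```

Because every stage is a single thread reading from a FIFO queue, the output order matches the input order, and a slow stage only delays the items behind it rather than blocking the other stages.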

Stochastic similarity filtering

The purpose of this module is to significantly reduce GPU computing power consumption. When consecutive input images are identical or highly similar, repeated inference is meaningless.

Therefore, the similarity filtering module calculates the similarity between the input image and a historical reference frame. If the similarity is higher than the set threshold, subsequent model inference is skipped with a certain probability; if it is lower than the threshold, model inference is performed normally and the reference frame is updated. This probabilistic sampling mechanism allows the filtering strategy to throttle the system smoothly and naturally, reducing average GPU usage.


The filtering effect is pronounced under static input, and the filtering rate drops automatically when the input changes substantially, so the system adapts to the dynamics of the scene.

In this way, the inference load can be adjusted automatically under continuous stream input whose complexity changes over time, saving GPU computing power.
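
A small sketch of such a filter follows. Several details are assumptions made for illustration, not facts from the article: cosine similarity as the similarity measure, a skip probability that rises linearly from 0 at the threshold to 1 at perfect similarity, and a reference frame that is updated whenever inference actually runs. The class and method names are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity of two flat feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SimilarityFilter:
    """Decides, frame by frame, whether model inference can be skipped."""

    def __init__(self, threshold=0.98, rng=None):
        if not 0.0 <= threshold < 1.0:
            raise ValueError("threshold must be in [0, 1)")
        self.threshold = threshold
        self.rng = rng or random.Random()
        self.reference = None  # last frame that was actually processed

    def should_skip(self, frame):
        if self.reference is None:
            self.reference = frame
            return False
        sim = cosine_similarity(frame, self.reference)
        if sim >= self.threshold:
            # Assumed linear ramp: more similar frames are skipped more often.
            skip_prob = min(1.0, (sim - self.threshold) / (1.0 - self.threshold))
            if self.rng.random() < skip_prob:
                return True
        # Inference runs normally, so this frame becomes the new reference.
        self.reference = frame
        return False

flt = SimilarityFilter(threshold=0.98, rng=random.Random(0))
decisions = [flt.should_skip(f) for f in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
```

With a static input the skip probability approaches 1 and almost every frame is dropped before reaching the GPU, while a scene change falls below the threshold and is always processed.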

Experimental data

To test the performance of StreamDiffusion, the researchers ran tests on an RTX 3060 and an RTX 4090.

In terms of efficiency, it achieves a generation frame rate of more than 91 FPS, which is nearly 60 times that of the current state-of-the-art AutoPipeline, and it greatly reduces the number of denoising steps.

In terms of power consumption, under static input, the average power draw of the RTX 3060 and RTX 4090 is reduced by factors of 2.39 and 1.99, respectively.