If you had to sum up the current pace of the AI large-model scene in one word, "DeepSeek speed" would be hard to beat. Less than five days after the release of the text-only V4, the company staged a blitz of three consecutive price cuts. Before the industry could catch its breath, the next trump card had already surfaced: the "complete" V4 with multi-modal capabilities, now officially in the countdown to release.
Core researcher personally reveals: Native visual ability is coming
Chen Xiaokang, a core member of the DeepSeek multi-modal team, recently posted on the X platform, clearly announcing that a "new version of DeepSeek V4" is on the way. Given the current context, this "new version" is almost certainly the long-awaited multi-modal release.
Since V4 launched, the industry's biggest talking point, and its one lingering regret, have pointed to the same thing: only two text-only models, Flash (fast) and Pro (expert), shipped in the first wave. Pure text capability is certainly the foundation, but in today's large-model arena, native multi-modality has long been the ticket to the top tier. Without visual understanding of images and video, a model's ceiling in real, complex scenarios is firmly capped. Launching the multi-modal version is the key step for DeepSeek to close this last gap.
The app has already dropped a hint: it's not about benchmark scores, it's about affordability

Attentive users have noticed that after the recent DeepSeek client update, the model-selection bar quietly gained three independent options: "Quick", "Expert", and "Visual". The first two correspond to V4's Flash and Pro, while the "Visual" option, still shown as "to be activated", is evidently a slot reserved for the upcoming multi-modal V4.
As for where this full-strength multi-modal V4 will rank on raw capability, no concrete data is available yet. But given V4 Pro's dominant performance on text-only benchmarks, the industry broadly expects its visual capabilities to land firmly in the first tier. That, however, has never been what DeepSeek cares about most.
The real trump card: driving down the price of multi-modality
For DeepSeek, which has always taken the unconventional path, chasing leaderboard rankings is not the primary goal. Its real trump card is bringing the price of large multi-modal models down to rock-bottom levels, so that developers and ordinary users can genuinely afford them. Recall that V4 went through three rounds of price cuts within five days of its release. If the cost of multi-modal API calls is likewise driven to the floor, the resulting industry reshuffle will be far more intense than the price war in the text-only field.
One-sentence summary: the arrival of the multi-modal V4 is not just filling in a missing capability; it is the starting point for DeepSeek to overturn the pricing table of the entire multi-modal track.