On June 18, when you open the DeepSeek web page and APP, almost all users will find that there is an image recognition mode to the right of the previous quick mode and expert mode. This means that many users who have not been tested by grayscale can finally use DeepSeek to process images.

At present, DeepSeek has not officially released a public introduction, and the model interface still displays "image understanding function under internal testing." There is speculation that this time it is a full test push. However, Chen Xiaokang, head of the DeepSeek multi-modal team, mentioned on social media today that the visual mode has been officially launched on web pages and applications, "try these new eyes."

It is worth mentioning that just 5 days ago, Chen Xiaokang followed the hot spot and sent the "green duck leg" of Auntie Goose Leg to DeepSeek for identification. Judging from the reply, DeepSeek was able to identify that it was not a goose leg, and also suggested that the green color may be a food safety hazard. “If there had been DeepSeek back then, there wouldn’t have been a ‘Duck War’ this year.” He joked.

In this comment area, some users asked why the visual function was not yet available. At that time, Chen Xiaokang replied, "Only a small number of users can use grayscale (test)." At the end of April this year, the DeepSeek image recognition mode launched a grayscale test, and it was opened to a wide range of users in May. However, many users still did not use it until this time it seemed that it was open to all users for testing.
How effective is DeepSeek in image recognition? A reporter from China Business News got started and experienced it, and the effects were different in different situations.
I sent DeepSeek an architectural drawing of the Bund in Shanghai and asked where it was. DeepSeek gave a normal answer in 16 seconds. It analyzed the four main buildings and also answered that the white arch bridge is "most likely Zhapu Road Bridge", which is a classic photography angle.

However, DeepSeek may not be able to recognize the popular Cape Verdean goalkeeper Vozinha these days. DeepSeek spent more than a minute thinking deeply. During the thinking process, Cape Verde was mentioned several times, but it could not correspond to the specific player. In the end, it gave a completely wrong answer.

This may be because Woznia was not well-known before and was not included in the large model training data. At the same time, DeepSeek's image recognition mode does not have an online search function, so it cannot identify current hot figures.
The reporter noticed that on social platforms, there was feedback from users who had long been covered by grayscale tests. DeepSeek’s image recognition ability exceeded the average level of domestic models, but compared with top overseas models, there was still a gap in complex image understanding and detailed reasoning.
Specifically, in scenarios such as daily screenshots, error messages, tables, papers, and web page content, DeepSeek's image recognition is basically sufficient and very fast. But if it is a more complex picture, such as a multi-layer logic flow chart or a complex data chart, the accuracy will begin to decrease. However, the above-mentioned users believe that considering the price and openness, DeepSeek is still worth using.
Just on April 30, DeepSeek released a report on multi-modal technology, "Thinking with Visaul Primitives", explaining the details behind multi-modal technology. But soon everyone discovered that the official deleted the multi-modal warehouse and the original text of the paper overnight, and the Github interface was already in a "404" status.
At that time, there were many speculations from the outside world. Some believed that DeepSeek was not ready yet, while others believed that the paper revealed too much information. In the paper, DeepSeek believes that the current multi-modal model collapses on complex tasks not because of invisibility (perception gap), but because of "inaccurate pointing" (citation gap). The future of multimodal intelligence is not just about “seeing more pixels”, but about building a precise and unambiguous reference bridge between language and vision.
At present, DeepSeek has not publicly announced the launch of the image recognition mode. The technical details of this mode and more news still need to wait for the official introduction.