On Wednesday, OpenAI announced the rollout of ChatGPT's multi-modal capabilities. As soon as the feature went live, netizens went wild. Let's take a look at just how strong ChatGPT's image-recognition capabilities really are.

01

Take a photo and upload it, and the code will be generated instantly

One netizen recorded a video: they uploaded a photo of a whiteboard from a meeting and asked ChatGPT to write the corresponding code.


You can also upload a hand-drawn sketch and ask ChatGPT to turn it into an HTML web page.


Whoosh, the code came out in no time.

This is exactly the multi-modal capability Greg Brockman demonstrated when GPT-4 was first released earlier this year.


For another example, take a photo of the to-do list in your notebook.


Then ask GPT-4 to build a Python Tkinter GUI from it, and just like that, it is implemented...
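The article doesn't reproduce the generated code, but a minimal sketch of the kind of Tkinter to-do app GPT-4 might produce could look like the following. The task-list logic and widget layout here are assumptions for illustration, not the netizen's actual output:

```python
import tkinter as tk


class TodoStore:
    """Plain-Python task list, kept separate from the GUI so the logic is easy to test."""

    def __init__(self):
        self.tasks = []

    def add(self, text):
        # Ignore blank input; return a copy of the current list.
        text = text.strip()
        if text:
            self.tasks.append(text)
        return list(self.tasks)

    def remove(self, index):
        # Silently ignore out-of-range indices.
        if 0 <= index < len(self.tasks):
            del self.tasks[index]
        return list(self.tasks)


def build_gui(store):
    """Wire the store to a simple Tkinter window: entry box, listbox, two buttons."""
    root = tk.Tk()
    root.title("To-Do List")

    entry = tk.Entry(root, width=40)
    entry.pack(padx=8, pady=4)
    listbox = tk.Listbox(root, width=40)
    listbox.pack(padx=8, pady=4)

    def refresh():
        listbox.delete(0, tk.END)
        for task in store.tasks:
            listbox.insert(tk.END, task)

    def on_add():
        store.add(entry.get())
        entry.delete(0, tk.END)
        refresh()

    def on_done():
        # Delete selected rows back-to-front so earlier indices stay valid.
        for i in reversed(listbox.curselection()):
            store.remove(i)
        refresh()

    tk.Button(root, text="Add", command=on_add).pack(side=tk.LEFT, padx=8, pady=4)
    tk.Button(root, text="Done", command=on_done).pack(side=tk.LEFT, pady=4)
    return root


# To launch the app: build_gui(TodoStore()).mainloop()
```

Separating the data (`TodoStore`) from the widgets keeps the generated GUI code small and lets the list behavior be checked without opening a window.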


02

Ancient scroll manuscripts, translated at a glance

Here is a manuscript page from the 17th-century alchemist Robert Boyle. Can GPT-4 read it?


This is a piece of cake for it.


In e.g. "Catalan Medicinal Manual on Medicinal Mummies".


ChatGPT can also transcribe and translate.


Benjamin Breen, associate professor of history at UCSC, said,

This will have a significant impact on historians. Imagine a custom multi-modal GPT-4 trained on a specific set of manuscripts. It could not only transcribe them, but also translate and classify them. In my opinion, this is a big deal.


03

Chart summarization is also impressive

You can also ask GPT-4 to extract the underlying data from a chart.


It can then write Python code to replicate the chart, and even polish the result.
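As an illustration of what such replication code typically looks like, here is a hedged sketch: the category labels and values below are made-up stand-ins for data read off an uploaded chart, not figures from the article.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data "extracted" from an uploaded bar chart.
data = {"Q1": 120, "Q2": 150, "Q3": 90, "Q4": 180}

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(data.keys(), data.values(), color="steelblue")
ax.set_title("Quarterly Revenue (replicated from chart)")
ax.set_ylabel("Revenue (USD, thousands)")
fig.savefig("replicated_chart.png", dpi=150)
```

Once the data lives in a plain dict like this, tweaking colors, labels, or chart type to "polish" the replica is a one-line change.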


Throw a stock price chart at it, and it can analyze and summarize the trend's characteristics as well.


04

Reading images with a "superior IQ"

Give GPT-4 an abstract picture.

It accurately identifies the metaphor these four images are meant to express: "the importance of communication." That is outrageous.


GPT-4V can even read doctors' handwriting.



Some Japanese netizens even tested ChatGPT with Son Goku from "Dragon Ball."


There are also various "are you human" CAPTCHAs.


Upload a piece of your own work, and GPT-4 can also give you suggestions for improvement.


Some netizens found that GPT-4V gave the correct answer to a question from the Kosmos-1 paper, but made an error in its reasoning process.


With this feature, kids will no longer have to do their homework themselves.


05

A netizen's comprehensive roundup

Beyond the examples above, one netizen wrote a long article documenting their own tests of GPT-4V.


Test 1: Visual Q&A

Give it a meme and see how well GPT-4V understands it.


GPT-4V successfully explains why the meme is funny, mentioning the individual components of the image and how they relate to each other.

It is worth noting that GPT-4V is able to read and respond to the bracketed comments provided.

Still, GPT-4V made one mistake, labeling the item "NVIDIA BURGER" instead of "GPU."

Next, test it with a photo of a coin, an American penny. GPT-4V successfully identifies the coin's origin and denomination.


But what about a picture of multiple coins, with the question "How much money do I have?"

In that case, GPT-4V can only count the coins; it cannot identify their denominations.


Test 2: OCR recognition

Capture a screenshot of text from a web page and upload it; GPT-4V reads the content very well.


Test 3: Math OCR

Mathematical OCR is a specialized form of optical character recognition that targets mathematical equations.

The netizen gave GPT-4V a math problem, presented as a screenshot of a document.

The problem involves calculating the length of a zip line given two angles, with the prompt "solve it" written on the image.



The model recognizes that the problem can be solved with trigonometry, identifies the functions to use, and provides a step-by-step walkthrough of the solution. GPT-4V then gives the correct answer.

That said, the GPT-4V system card states that the model may miss mathematical symbols.

Further tests, including equations or expressions handwritten on paper, suggest the model's ability to answer math questions is still limited.

Test 4: Object Detection

Ask GPT-4V to detect a dog in an image and return the x_min, y_min, x_max, and y_max values for the dog's position. The bounding-box coordinates GPT-4V returned did not match the dog's actual position.


Although GPT-4V is very good at answering questions about images, it cannot replace fine-tuned object-detection models when you need to know exactly where an object is in an image.
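A standard way to quantify how far off such a prediction is: intersection-over-union (IoU) between the predicted box and the ground-truth box. The coordinates below are hypothetical, purely to show the computation; in detection benchmarks, an IoU above roughly 0.5 is typically counted as a correct detection.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero-sized if the boxes don't intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)


# Hypothetical ground-truth dog box vs. a badly offset prediction.
truth = (50, 80, 200, 260)
predicted = (150, 40, 320, 200)
print(round(iou(truth, predicted), 3))  # → 0.124, far below the usual 0.5 threshold
```

A score this low would make the prediction a miss under any common evaluation protocol, which is what "the coordinates do not match the dog's position" amounts to numerically.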

Test 5: CAPTCHAs

GPT-4V can recognize images containing CAPTCHAs, but it often fails the tests themselves.

In a traffic-light grid-selection example, GPT-4V missed several grid cells that contained traffic lights.


Test 6: Crossword Puzzles and Sudoku

In the Sudoku test, GPT-4V recognized the game but misunderstood the structure of the board, and therefore returned inaccurate results.


By the way, ChatGPT's web-browsing feature is back, too.