GPT-4o can be made faster by adding money. The new function "predictive output" completes the original 23-second task in 7 seconds

OpenAI has a new function that directly allows ChatGPT to outputTake off at speed!This function is called"prediction output"(PredictedOutputs), with its blessing, GPT-4o can be faster than the originalup to 5 times. Take programming as an example to feel this feeling:

Why so fast? To sum it up in one sentence:

Skip what you already know and don't have to rebuild it from scratch.

Therefore, "predictive output" is particularly suitable for the following tasks:

Update blog post in documentation

Iterate over previous responses

Rewrite code in existing files

And FactoryAI, which cooperated with OpenAI to develop this function, also showed their data on programming tasks:

Judging from the experimental results, the response time of GPT-4o with the blessing of "predicted output" is 2-4 times faster than before, while maintaining high accuracy.

And the official also stated:

A programming task that originally took 70 seconds to complete now only takes 20 seconds.

It is worth noting that the "prediction output" function currently only supports two models, GPT-4o and GPT-4omini, and is in the form of API.

For developers, this can be said to be good news.

Netizens tested online

As soon as the news came out, many netizens couldn't sit still, and started testing it.

For exampleFounder of FirecrawlEric Ciarla used "predictive output" to experience converting blog posts into SEO (search engine optimization) content, and then he said:

It's really super fast.

It's as simple as adding a prediction parameter to your API call.

Another netizen gave a prompt on top of the existing code:

changethedetailstoberandompiecesoftext.

Change the details to a random text snippet.

Let’s feel the speed:

Some netizens also posted their actual measured data:

All in all, fast, really fast.

How?

OpenAI also introduces the technical details of "prediction output" in its official documentation.

OpenAI believes that in some cases, most of the output of the LLM is known in advance.

If you ask the model to make only minor modifications to some text or code, you can use "predict output" to take existing content as prediction input and get significantly lower latency.

For example, suppose you wantRefactor a piece of C# code, change the Username attribute to Email:

You can reasonably assume that most of the file's contents will not be modified (such as the class's docstring, some existing properties, etc.).

By passing in an existing class file as predictive text, you can regenerate the entire file faster.

Using "predicted output" to generate tokens will greatly reduce the latency of these types of requests.

However, OpenAI officials also gave several notes on the use of "predicted output".

The first is that we just mentioned that only GPT-4o and GPT-4o-mini series models are supported.

Secondly, the following API parameters are not supported when using prediction output:

nvaluesgreaterthan1

logprobs

presence_penaltygreaterthan0

frequency_penaltygreaterthan0

audiooptions

modalitiesotherthantext

max_completion_tokens

tools-functioncallingisnotsupported

In addition, in this document, OpenAI also summarizes several delay optimization methods in addition to "prediction output".

Including "accelerate processing of tokens", "generate fewer tokens", "use fewer input tokens", "reduce requests", "parallelize" and so on.

The document link is at the end of the article, interested friends can check it out~

OneMoreThing

Although the output speed has become faster, there is another note on OpenAI that has triggered discussion among netizens:

When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.

When providing a forecast, any non-final completion tokens provided are charged at the completion token rate.

Some netizens also posted his test results:

"Predicted output" not used: 5.2 seconds, 0.1555 cents

"Predicted output" used: 3.3 seconds, 0.2675 cents

Well, it’s faster and more expensive.

OpenAI official documentation:

https://platform.openai.com/docs/guides/latency-optimization#use-predicted-outputs

Reference links:

[1]https://x.com/OpenAIDevs/status/1853564730872607229

[2]https://x.com/romainhuet/status/1853586848641433834

[3]https://x.com/GregKamradt/status/1853620167655481411