OpenAI launches new solution to directly cut inference costs in half

In previously undisclosed news, OpenAI engineers revealed to some internal colleagues earlier this month that through a series of new technical optimizations, they had found a way toModel inference running costs are reduced by more than halfplan.

openai-has-reportedly-found-a-way-to-cut-inference-costs-in-v0-7vqlfnnfrgah1.webp

After engineers applied this new technology to the ChatGPT scenario where visitors who had not registered a free/paid account accessed ChatGPT, the computing power of NVIDIA graphics cards required during peak periods was only a few hundred yuan. This number was lower than expected. (Of course, OpenAI has set a call frequency limit for this type of anonymous visitors, and the overall usage of ChatGPT by this group is not high.)

At present, OpenAI has not disclosed the specific technical details used for this efficiency improvement. The industry speculates that commonly used optimization methods include: model quantification, key value caching (allowing the model to remember past calculation information and avoid repeated operations), request batch processing (responding to user queries in batches instead of processing them one by one), scheduling requests to low-power lightweight models or model sub-modules to complete responses, etc.

However, when OpenAI launches a new generation model with larger parameters later this year, the cost reduction effect brought by this batch of optimization technologies may be weakened, because the running cost of the large parameter model itself will be significantly higher.

This type of inference optimization technology is calledComputing power doubling technology, which is also the focus of major AI laboratories. Anthropic CEO Dario Amodei has publicly mentioned the concept in podcasts since at least mid-2023. He said at the time that the company strictly limited the scope of internal personnel who knew a single set of computing power optimization solutions. Once the relevant technology was copied by peers, it would give other AI laboratories a competitive advantage. (Computing power doubling technology can also refer to various efficiency optimization methods in the model training phase.)

The importance of this type of optimization technology has become increasingly prominent. Currently, leading AI R&D companies are generally facing a shortage of server computing power. Even if a company signs a contract to build a new or lease a data center, it often takes months or even years from the start of the project to the official launch. (OpenAI is also working with Broadcom to self-develop special chips for large model operation, trying to further reduce inference costs, with the goal of achieving cost reductions compared to Nvidia's commercial chips.)

After the implementation of OpenAI's technology optimization, the market is also paying great attention to how companies will deal with the saved computing power costs. On the one hand, OpenAI can pass on dividends to users: either increase the ChatGPT call limit for paying subscribers, or lower the pricing of model interfaces open to developers. Nowadays, the call price of the old version of the model has dropped to a fraction of the original price, and inference optimization is one of the core reasons.

This will further consolidate OpenAI's market positioning as a cost-effective model service provider. Recently, competing product Anthropic has been controversial due to high model pricing - even though its model output effect is better.

On the other hand, OpenAI can also choose to use cost reduction income to increase its gross profit margin, while the company's gross profit margin is mainly determined by the cost of inference computing power. OpenAI's gross profit margin in the first quarter of this year was 39%, an increase from 33% in the same period last year, but there is still a big gap from the target gross profit margin of 52% at the end of the year.

To hit its annual target, the company needs to achieve an average gross margin of 56% for the rest of the year. Anthropic's revenue surged sharply in the first half of this year, and it is expected to achieve unexpected profits this quarter, which fully confirms the speed of gross profit margin improvement during the industry boom cycle.

At this stage, OpenAI does not have an absolute say in pricing, but this inference optimization technology will significantly broaden its path to gross profit margin improvement.