Meta Platforms executives told Reuters in an interview that the company used public posts on Facebook and Instagram to train some of the features of its new Meta AI virtual assistant, but excluded private posts shared only with family and friends in an effort to respect consumer privacy.
Meta also does not use private chats on its messaging services as training data for its models and has taken steps to filter private details from the public datasets used for training, Nick Clegg, Meta's president of global affairs, said on the sidelines of the company's annual Connect conference this week.
"We've tried to exclude data sets where personal information is overwhelmingly present," Clegg said, adding that the "vast majority" of the data Meta uses for training is publicly available.
Citing LinkedIn as an example, he pointed out that Meta intentionally does not use the site's content due to privacy concerns.
Clegg's comments come as technology companies including Meta, OpenAI and Alphabet's Google have been criticized for using information scraped from the internet to train their artificial intelligence models without permission.
The companies are weighing how to handle private or copyrighted material swept up in that scraping, which their artificial intelligence systems may reproduce, while also facing lawsuits from authors accusing them of copyright infringement.
Chief Executive Officer Mark Zuckerberg unveiled the company's first batch of consumer-facing artificial intelligence tools on Wednesday at Connect, Meta's annual product conference, with Meta AI the most significant among them. This year's conference focused on artificial intelligence, unlike previous conferences, which focused on augmented reality and virtual reality.
Meta says the assistant uses a custom model based on the powerful Llama 2 large language model, which was made available for commercial use in July this year, as well as a new model called Emu that generates images based on text prompts.
The product will be able to generate text, audio and images, and will have access to real-time information through a partnership with Microsoft's Bing search engine. The public Facebook and Instagram posts used to train Meta AI included both text and photos.
A spokesperson for Meta told Reuters that the posts were used to train Emu's image generation capabilities, while the chat feature was based on Llama 2 with the addition of publicly available annotated datasets.
Interactions with Meta AI may also be used to improve future features, the spokesperson said. Meta imposes safety restrictions on what the Meta AI tools can generate, such as prohibiting the creation of realistic images of public figures.
Regarding copyrighted material, Clegg said he expected "a significant amount of litigation" over whether "creative content falls within the existing fair use doctrine," which allows for limited use of protected works for purposes such as commentary, research and parody.
Some companies with image generation tools make it easy to replicate iconic characters like Mickey Mouse, while others pay for the material or intentionally avoid including it in their training data.
OpenAI, for example, signed a six-year deal this summer with content provider Shutterstock to use the company's library of images, videos and music for training.
When asked whether Meta had taken any such steps to avoid reproducing copyrighted images, a Meta spokesperson pointed to the new terms of service, which prohibit users from generating content that violates privacy and intellectual property rights.