Microsoft artificial intelligence researchers accidentally exposed dozens of terabytes of sensitive data, including private keys and passwords, while publishing a bucket of open-source training data on GitHub. In a research note shared with TechCrunch, cloud security startup Wiz said it discovered the GitHub repository, which belongs to Microsoft's artificial intelligence research unit, during its ongoing work on the accidental exposure of cloud-hosted data.
The GitHub repository provides open-source code and artificial intelligence models for image recognition, and it instructs readers to download the models from an Azure Storage URL. However, Wiz found that the URL was configured to grant permissions on the entire storage account, inadvertently exposing additional private data.
The exposure amounted to 38TB of sensitive information, including the personal backups of two Microsoft employees' PCs. It also contained other sensitive personal data, including passwords and secret keys for Microsoft services and more than 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.
According to Wiz, the URL, which had been exposing this data since 2020, was also misconfigured to allow "Full Control" rather than "Read-Only" permissions, meaning anyone who knew where to look could potentially delete, replace, and inject malicious content into the storage account.
Wiz pointed out that the storage account was not directly exposed. Instead, Microsoft AI developers had included an over-permissioned Shared Access Signature (SAS) token in the URL. SAS tokens are an Azure mechanism that lets users create shareable links granting access to data in an Azure Storage account.
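To make the risk concrete, here is a minimal, illustrative sketch (not Microsoft's or Wiz's actual tooling) of how the query parameters of a SAS URL can be inspected for over-broad grants. The parameter names follow Azure's documented SAS conventions (`sp` for signed permissions, `se` for signed expiry, `srt` for signed resource types); the URL, account name, and token values below are made up.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# Hypothetical SAS URL, modeled on Azure's documented query parameters:
#   sp  = signed permissions (r=read, w=write, d=delete, l=list, ...)
#   se  = signed expiry (ISO 8601, UTC)
#   srt = signed resource types for an account SAS (s=service, c=container, o=object)
SAS_URL = (
    "https://examplestorage.blob.core.windows.net/models/model.ckpt"
    "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlac&se=2051-10-01T00:00:00Z&sig=REDACTED"
)

def audit_sas(url: str) -> list[str]:
    """Return a list of warnings about an over-permissioned SAS URL."""
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    warnings = []

    perms = set(params.get("sp", ""))
    if perms - {"r", "l"}:  # anything beyond read/list is risky in a shared link
        warnings.append(f"write-capable permissions granted: sp={params['sp']}")

    if params.get("srt", "") not in ("", "o"):  # scoped wider than single blobs
        warnings.append(f"token scoped beyond individual blobs: srt={params['srt']}")

    expiry = datetime.fromisoformat(params["se"].replace("Z", "+00:00"))
    if (expiry - datetime.now(timezone.utc)).days > 365:
        warnings.append(f"expiry more than a year away: se={params['se']}")

    return warnings

for warning in audit_sas(SAS_URL):
    print("WARNING:", warning)
```

A token like the one above fails all three checks: it can write and delete, it covers the whole account rather than a single blob, and it stays valid for decades, which matches the "Full Control" misconfiguration Wiz described.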
Ami Luttwak, co-founder and chief technology officer of Wiz, said: "Artificial intelligence has unlocked huge potential for technology companies. However, as data scientists and engineers race to put new artificial intelligence solutions into production, the massive amounts of data they handle require additional security checks and safeguards. With many development teams needing to manipulate large amounts of data, share it with their peers, or collaborate on public open-source projects, cases like Microsoft's are increasingly difficult to monitor and avoid."
Wiz said it shared its findings with Microsoft on June 22, and Microsoft revoked the SAS tokens two days later on June 24. Microsoft said it completed its investigation into potential organizational impact on August 16.
"No customer data was exposed, and no other internal services were at risk as a result of this issue," Microsoft Security Response said in a blog post shared ahead of publication.
Microsoft said that, based on Wiz's findings, it has expanded GitHub's secret scanning service, which monitors all changes to public open-source code for plaintext exposure of credentials and other secrets, to detect any SAS tokens that may have overly permissive expirations or privileges.
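The kind of scanning described above can be approximated with a simple pattern match. The sketch below is an illustration only, not GitHub's actual detector: it flags Azure Storage URLs that carry both a signed version (`sv=`) and a SAS signature (`sig=`), which are real SAS query parameters, but the rule, account name, and leaked value are invented for the example.

```python
import re

# Hypothetical detector, loosely modeled on what a secret scanner looks for:
# an Azure Storage URL carrying both a signed version ("sv=") and a SAS
# signature ("sig="). This is an illustration, not GitHub's actual rule set.
SAS_PATTERN = re.compile(
    r"https://\w+\.blob\.core\.windows\.net/"
    r"(?=[^\s\"']*[?&]sv=)"   # token advertises a signed storage version
    r"(?=[^\s\"']*[?&]sig=)"  # ...and carries an HMAC signature
    r"[^\s\"']+"
)

def scan_text(text: str) -> list[str]:
    """Return every candidate SAS URL found in a chunk of committed text."""
    return SAS_PATTERN.findall(text)

# Made-up example of a commit that leaks a full-control, long-lived token:
committed = '''
MODEL_URL = "https://examplestorage.blob.core.windows.net/models/resnet.ckpt?sv=2020-08-04&srt=sco&sp=rwdlac&se=2051-10-01T00:00:00Z&sig=FAKE"
DOCS_URL = "https://example.com/docs"
'''
for url in scan_text(committed):
    print("possible SAS token leak:", url)
```

A real scanner would also validate the signature format and check expiry and permissions before alerting, but even this crude pattern distinguishes a signed SAS URL from an ordinary storage link.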