CERN hosts one of the most ambitious engineering and scientific undertakings in human history. The Large Hadron Collider (LHC) is the world's largest and highest-energy particle accelerator, and scientists use it to probe the structure of the subatomic world. In the process, the LHC produces tens of petabytes of data every year.
CERN recently had to upgrade its back-end IT systems in preparation for the LHC's new experimental phase, Run 3, which is expected to generate 1PB of data per day by the end of 2025. The previous database systems were no longer adequate for the high-cardinality data produced by the collider's major experiments, such as CMS.
The Compact Muon Solenoid (CMS) is a general-purpose detector at the Large Hadron Collider with a broad physics program, ranging from studies of the Standard Model (including the Higgs boson) to searches for extra dimensions and for particles that might make up dark matter. CERN calls the experiment one of the largest scientific collaborations in history, with about 5,500 people from 241 institutions across 54 countries participating.
CMS and the other LHC experiments underwent a major upgrade phase from 2018 to 2022 and are now colliding subatomic particles again for the three-year Run 3 data-taking period. During the shutdown, CERN experts also made significant upgrades to the detector systems and computing infrastructure that support CMS.
Brij Kishor Jashal, a scientist working with CMS, said that his team collected 30TB of monitoring data in 30 days to track the performance of the infrastructure. He explained that Run 3 operates at higher luminosity, which drives a significant increase in data volume. The previous back-end monitoring stack relied on the open source time series database (TSDB) InfluxDB and the monitoring system Prometheus, both of which use compression algorithms to handle this data efficiently.
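The article does not name the specific compression scheme, but TSDBs in this family typically rely on delta-of-delta encoding (popularized by Facebook's Gorilla paper) to shrink regularly spaced timestamps. The Python sketch below illustrates that general idea only; it is not CMS's actual pipeline, and all names and values are hypothetical.

```python
# Illustrative sketch of delta-of-delta timestamp encoding, the technique
# popularized by the Gorilla paper and used in TSDBs of this family.
# NOT CMS's actual pipeline; names and values are hypothetical.

def delta_of_delta_encode(timestamps):
    """Encode timestamps as (header, delta-of-deltas).

    Regularly scraped metrics (e.g. every 15s) produce long runs of
    zeros, which a bit-level encoder can store in a single bit each.
    """
    assert len(timestamps) >= 2, "sketch assumes at least two samples"
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, curr in zip(timestamps[1:], timestamps[2:]):
        delta = curr - prev
        dods.append(delta - prev_delta)  # usually 0 for steady scraping
        prev_delta = delta
    return [first, first_delta], dods

def delta_of_delta_decode(header, dods):
    """Invert the encoding to recover the original timestamps."""
    first, delta = header
    timestamps = [first, first + delta]
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# Samples scraped every 15 seconds, with one second of jitter:
ts = [1000, 1015, 1030, 1046, 1061]
header, dods = delta_of_delta_encode(ts)
print(header, dods)                      # [1000, 15] [0, 1, -1]
assert delta_of_delta_decode(header, dods) == ts
```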
However, InfluxDB and Prometheus ran into performance, scalability, and reliability issues, especially when dealing with high-cardinality data. High cardinality refers to a very large number of unique time series, which arises when metric labels take many distinct values, for example when an application is redeployed many times onto new instances and each new instance identifier spawns a fresh set of series. To address these challenges, the CMS monitoring team chose to replace InfluxDB and Prometheus with the VictoriaMetrics TSDB.
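To make the cardinality problem concrete, here is a minimal Python sketch using the official prometheus_client library; the metric and label names are hypothetical, not taken from CMS's setup. Each distinct combination of label values creates a separate time series, so a label that encodes a per-deployment instance ID multiplies the series count on every redeploy.

```python
# Minimal sketch of label-driven cardinality growth, using the official
# Python client for Prometheus. Metric and label names are hypothetical.
from prometheus_client import Counter, generate_latest

requests_total = Counter(
    "cms_requests_total",            # hypothetical metric name
    "Requests served, by service and instance",
    ["service", "instance_id"],
)

# Every new instance_id value creates a brand-new time series.
# Redeploying 100 services 50 times yields 5,000 series from one metric;
# at larger scale this pattern is what overwhelms a TSDB back end.
for deploy in range(50):
    for service in range(100):
        requests_total.labels(
            service=f"svc-{service}",
            instance_id=f"pod-{service}-{deploy}",  # changes on redeploy
        ).inc()

# One metric, 5,000 distinct series in the exposition output.
exposition = generate_latest().decode()
print(exposition.count("cms_requests_total{"))  # 5000 sample lines
```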
Now VictoriaMetrics serves as both the back-end storage and the monitoring system for CMS, effectively solving the cardinality problem encountered before. Jashal noted that the CMS team is satisfied with the performance of the cluster and its services. While there is still headroom to scale, these services run in high-availability mode on CMS's dedicated Kubernetes cluster for stronger reliability guarantees. CERN's data center relies on OpenStack services running on a cluster of commodity x86 machines.
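One reason such a migration can be gradual is that VictoriaMetrics speaks both sides of the old stack: it accepts the InfluxDB line protocol for ingestion and answers PromQL through a Prometheus-compatible API. The Python sketch below assumes a single-node VictoriaMetrics instance on its default port 8428; the host, metric, and label names are illustrative, not CMS's actual configuration.

```python
# Sketch of talking to a single-node VictoriaMetrics instance, assumed
# to be listening on its default port 8428. Metric and label names are
# illustrative; this is not CMS's actual configuration.
import requests

VM = "http://localhost:8428"  # hypothetical host

# Ingest one sample via the InfluxDB line protocol endpoint that
# VictoriaMetrics exposes, which eases migration away from InfluxDB.
line = "cms_monitoring,cluster=demo events_rate=42.5"
resp = requests.post(f"{VM}/write", data=line, timeout=5)
resp.raise_for_status()

# Read it back through the Prometheus-compatible PromQL API, so existing
# dashboards and alert rules can keep working unchanged. VictoriaMetrics
# names the series "<measurement>_<field>" by default.
resp = requests.get(
    f"{VM}/api/v1/query",
    params={"query": 'cms_monitoring_events_rate{cluster="demo"}'},
    timeout=5,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```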