Google recently open sourced RETVec, a new multilingual text vectorizer, on GitHub. The vectorizer has already been deployed in Gmail to improve the detection of spam and phishing emails while reducing the false positive rate. Google says RETVec is trained to resist character-level manipulations, including insertions, deletions, misspellings, homoglyphs, and LEET substitution. The model is trained on top of a new character encoder that can effectively encode all UTF-8 characters and words.
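To make the idea of a character encoder that covers every UTF-8 character more concrete, here is a minimal sketch that turns each character's Unicode code point into a fixed-width binary vector. It is an illustration only and should not be read as RETVec's actual implementation. The helper name `encode_chars` and the 24-bit width are assumptions chosen for this example (21 bits are already enough for every Unicode code point, so 24 leaves some headroom).

```python
from typing import List

BITS = 24  # assumed width for this sketch; every Unicode code point fits in 21 bits


def encode_chars(text: str, bits: int = BITS) -> List[List[int]]:
    """Map each character to the binary digits of its Unicode code point.

    Hypothetical helper for illustration, not part of the RETVec library.
    """
    vectors = []
    for ch in text:
        code_point = ord(ch)
        # Most significant bit first, zero-padded to a fixed width.
        vectors.append([(code_point >> shift) & 1 for shift in range(bits - 1, -1, -1)])
    return vectors


latin_a = encode_chars("a")[0]          # Latin small letter a, U+0061
cyrillic_a = encode_chars("\u0430")[0]  # Cyrillic small letter a, U+0430, a look-alike of "a"

# Count the bit positions where the two look-alike characters differ.
differing_bits = sum(x != y for x, y in zip(latin_a, cyrillic_a))
```

Because this kind of encoding needs no vocabulary, any character gets a valid vector, including look-alikes borrowed from other scripts. Learning that such look-alikes should be treated similarly is left to the model trained on top of the encoder.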
Why train such a model? Gmail handles tens of millions of emails every day, and possibly billions once spam of all kinds is counted, while spammers keep trying to evade Google's detection systems with tricks such as homoglyphs.
RETVec supports more than 100 languages and is designed to help build more flexible, powerful, and efficient text classifiers for both server-side and on-device use.
According to Google's own figures, applying RETVec to Gmail improved the spam detection rate by 38% over the baseline, cut the false positive rate by 19.4%, and reduced Tensor Processing Unit (TPU) usage by 83%.
Google engineers say that models trained with RETVec offer faster inference thanks to their compact representation. Smaller models lower both computational cost and latency, which is critical for large-scale systems and for models that run on devices.
Vectorization is a technique in natural language processing (NLP) that maps words or phrases in a vocabulary to corresponding numerical representations so that further analysis can be performed, such as sentiment analysis, text classification, and named entity recognition.
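As a rough illustration of what vectorization means in practice, the sketch below maps words to integer ids using a small vocabulary. It is a deliberately simple word-level scheme, not the approach RETVec uses, and the names `build_vocab`, `vectorize`, and `UNKNOWN_ID` as well as the sample sentences are made up for this example. It also shows why character-level manipulations are a problem for vocabulary-based vectorizers: a LEET-style rewrite of a known word no longer matches any vocabulary entry.

```python
from typing import Dict, List

UNKNOWN_ID = 0  # id reserved for words that are not in the vocabulary


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct lowercase word, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # ids start at 1; 0 is reserved
    return vocab


def vectorize(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Replace each word with its id, falling back to UNKNOWN_ID for unseen words."""
    return [vocab.get(word, UNKNOWN_ID) for word in sentence.lower().split()]


corpus = ["claim your free prize", "meeting moved to friday"]
vocab = build_vocab(corpus)

clean = vectorize("free prize", vocab)       # both words are in the vocabulary
obfuscated = vectorize("fr33 pr1ze", vocab)  # LEET-style substitution of the same words
```

Here `clean` carries the ids of two known words, while every word in `obfuscated` collapses to `UNKNOWN_ID`, so a downstream classifier would lose the signal that the text mentions a free prize.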