Building ML Solutions on Top of Pre-trained Embeddings with Noisy/Imbalanced Data

Ching (Chingis)
5 min read · Feb 29, 2024

With the boom of LLM and multi-modal embeddings, building solutions on top of them has become popular for many applications. Being able to reuse one embedding model across multiple projects reduces development and maintenance costs. However, depending on the domain of your problem, this can be challenging. Fine-tuning the model might not be an option because of time or other constraints, and off-the-shelf embeddings might not be discriminative enough for your data, which can lead to a lot of noise and/or overfitting.

In this post, I would like to share some tips and tricks that have helped me build solutions on top of frozen off-the-shelf embeddings. While some of them might seem general and obvious, I have tried to include as much supporting information as possible to give a more comprehensive overview and explanation.

General Tricks

First, the easy step is to start with hyper-parameter tuning, which covers your learning rate, weight decay (L2 regularization), gradient clipping, and other settings that control the magnitude of the update steps. This is a somewhat tedious process; however, it’s a crucial step before you start implementing other solutions on top of that.
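To make the tuning step concrete, here is a minimal sketch of a grid search over learning rate, weight decay, and gradient-clipping norm for a linear head trained on fixed vectors. Everything in it is an assumption made for illustration: the synthetic `make_toy_embeddings` data stands in for real frozen embeddings, the head is a plain logistic regression trained with SGD, and the grid values are arbitrary. It uses only the standard library so it runs as-is; in practice you would likely do the same with your framework's optimizer and clipping utilities.

```python
import itertools
import math
import random


def make_toy_embeddings(n=200, dim=8, seed=0):
    """Hypothetical stand-in for frozen embeddings: noisy 2-class vectors."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        data.append(([rng.gauss(center, 2.0) for _ in range(dim)], label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_head(train, lr, weight_decay, clip_norm, epochs=20, seed=0):
    """SGD logistic regression; the embeddings themselves are never updated."""
    rng = random.Random(seed)
    w, b = [0.0] * len(train[0][0]), 0.0
    order = list(range(len(train)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = train[i]
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            # Log-loss gradient plus the L2 penalty term on the weights.
            grad_w = [err * xj + weight_decay * wj for xj, wj in zip(x, w)]
            grad_b = err
            # Gradient clipping: rescale if the global norm exceeds clip_norm.
            norm = math.sqrt(sum(g * g for g in grad_w) + grad_b * grad_b)
            scale = min(1.0, clip_norm / (norm + 1e-12))
            w = [wj - lr * scale * g for wj, g in zip(w, grad_w)]
            b -= lr * scale * grad_b
    return w, b


def accuracy(model, data):
    w, b = model
    hits = sum(
        (sum(wj * xj for wj, xj in zip(w, x)) + b > 0) == bool(y)
        for x, y in data
    )
    return hits / len(data)


def grid_search(train, val, grid):
    """Train one head per configuration and score it on held-out data."""
    results = []
    for lr, wd, clip in itertools.product(
        grid["lr"], grid["weight_decay"], grid["clip_norm"]
    ):
        model = train_linear_head(train, lr, wd, clip)
        cfg = {"lr": lr, "weight_decay": wd, "clip_norm": clip}
        results.append((accuracy(model, val), cfg))
    return max(results, key=lambda r: r[0]), results


data = make_toy_embeddings()
train, val = data[:150], data[150:]
grid = {"lr": [0.001, 0.01, 0.1], "weight_decay": [0.0, 0.01], "clip_norm": [1.0, 5.0]}
(best_acc, best_cfg), all_results = grid_search(train, val, grid)
print(best_acc, best_cfg)
```

Selecting the configuration on a held-out split rather than on the training data matters here, since the whole point of these settings is to limit how much the head overfits noisy labels.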

