Handling an imbalanced dataset without altering the original dataset

Hi,

I tried several methods to handle imbalanced datasets.

I used a simple single-neuron ANN as a logistic regression model, and a churn dataset.

Under- and oversampling worked well (especially SMOTE), but to get a deeper understanding I wonder if there are even simpler ways to do this than altering the original dataset?

My questions (sorry, there are a lot of them):

  1. Is changing the threshold value after training a model a usual and proper/professional way to handle an imbalanced-dataset classification problem?

    • If so, how do I do it? Just run the trained model over the test data, varying the threshold and calculating F1?
      Is the best threshold the one where the F1 score is highest?
  2. Is it a good idea to find the model with the best AUC score using wandb sweeps, and then find the threshold for that model that maximizes the F1 score?

  3. What if I train a model with a threshold value other than 0.5?

    • Doing wandb sweeps to find the best AUC or F1 score?
  4. Applying class weights helps to improve recall, but in return precision decreases.

    • How do I get this right? How do I maximize the F1 score?
  5. Does it improve my model if I add one or more hidden layers to it?

Thank you!

Can you help me? @bhutanisanyam1

Hi @teamtom

These are all good questions, and I wouldn’t say there are concrete answers that apply in all cases.

When tuning the output of your model, choosing a threshold value is generally a tradeoff between false positives and false negatives, so the right choice depends on the task. If getting the highest F1 is all you care about, then tuning your threshold to achieve this is OK. The one extra consideration is to not tune the threshold on your test set, because that’s considered bad practice: pick it on a separate validation set and keep the test set for the final evaluation.
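To make the threshold idea concrete, here is a minimal sketch of sweeping thresholds and keeping the one with the highest F1. It assumes you already have predicted probabilities from your trained model on a held-out validation split; the labels and probabilities below are made up for illustration, and the function names are hypothetical rather than from any library.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting the positive class for probabilities >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when there is nothing to count
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates=None):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        score = f1_at_threshold(y_true, y_prob, t)
        if score > best_f1:  # ties keep the lowest threshold seen first
            best_t, best_f1 = t, score
    return best_t, best_f1


# Made-up validation labels and model outputs (assumption, not real churn data)
val_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
val_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.25, 0.40, 0.80]

threshold, val_f1 = best_threshold(val_labels, val_probs)
print(f"best threshold on validation: {threshold:.2f}, F1: {val_f1:.3f}")
print(f"F1 at the default 0.5: {f1_at_threshold(val_labels, val_probs, 0.5):.3f}")
```

Once the threshold is chosen this way, you would apply that single fixed value to the test set and report the result there, without adjusting it again.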

I would say questions 1-4 are all pretty similar, and the answer is generally that it depends on how you want your system to behave.
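On the class-weight part of question 4, it may help to see what the weights actually do to the loss. The sketch below uses one common heuristic, weight = n_samples / (n_classes * class_count), together with a weighted binary cross-entropy. The heuristic, the function names, and the toy labels are assumptions for illustration; your framework may compute or apply weights differently.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (a common heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy training labels with an 80/20 imbalance (assumption)
train_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(train_labels)
print(weights)

# An equally confident mistake costs more on the rare class than on the common one
print(weighted_bce([1], [0.1], weights), weighted_bce([0], [0.9], weights))
```

Because mistakes on the rare class are penalised more, the model is pushed towards predicting it more often, which is where the recall gain and precision loss come from. Whether that tradeoff is right still depends on how you want the system to behave.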

For question 5, and for questions like it where you’re considering your model architecture, I would say you are better off using off-the-shelf architectures and understanding how they work by reading the papers that introduced them. The timm library is a good resource for finding image architectures and their papers. Generally, more model capacity is a good thing, but deeper models can cause issues like vanishing gradients, and there are approaches to overcome these, such as skip connections. For those reasons, it can often be a good idea to choose off-the-shelf architectures that will fit on your machine.