Try These Best Practices for Training Machine Learning Models

This post was published on the now-closed HuffPost Contributor platform. Contributors control their own work and posted freely to our site.

What are some best practices for training machine learning models? originally appeared on Quora: the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Xavier Amatriain, VP of Engineering, on Quora:

What are some best practices for training machine learning models? This is a very broad question. However, I will try to summarize some of the best practices I have come across for training and testing ML models. Note that I am focusing my answer on ML models that are trained on user-generated data for user-facing products, which excludes other domains such as image classification or language understanding. I am also focusing only on training and testing, setting aside many other important issues, such as feature engineering, that also affect the process.

Metrics

  • You should pick an offline optimization metric that correlates as well as possible to the product objectives. Many times, a good proxy for the product objectives can be an online A/B test result or some other online metric.
  • You can only know that a metric correlates well to online A/B tests by running different experiments and tracking offline metrics. For example, metrics that tend to correlate well for ranking-related problems are recall@n, NDCG, or MRR (mean reciprocal rank).
  • A good metric:

- Should make it easy to compare different models.

- Should be as easy to understand and interpret as possible.

  • It is a good idea to track your metric(s) per user segment you care about (e.g. new users, stale users, very active users, locales...).
  • Measure your metric on the test set (not training, not validation).
  • It is good to track metrics offline that give you a sense of how much you are “changing” the experience. E.g. You can track Pearson correlation between your new model ranking and the existing one. If you are not changing much, it might not be worth A/B testing at all.
  • In many cases, you might need to track not one, but several offline metrics and decide how to tradeoff between them. As a general rule of thumb, it is good to decide on the tradeoff beforehand.
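The ranking metrics mentioned above can be sketched in a few lines. This is a minimal illustration of recall@n and MRR; the function names are mine, not from any particular library.

```python
# Minimal sketches of two ranking metrics: recall@n and MRR (mean
# reciprocal rank). Illustrative only; libraries like scikit-learn
# offer production versions of similar metrics.

def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of relevant items that appear in the top n of the ranking."""
    top_n = set(ranked_items[:n])
    return len(top_n & set(relevant_items)) / len(relevant_items)

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant item, over all users/queries."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Example: two users, each with a ranked list and a set of relevant items.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x", "z"}]
print(recall_at_n(rankings[0], relevant[0], n=2))  # 1.0 ("b" is in the top 2)
print(mean_reciprocal_rank(rankings, relevant))    # (1/2 + 1/1) / 2 = 0.75
```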

Training data

  • If you remove a subset of users from your dataset, you should expect the resulting model to be worse for that “kind” of user.
  • It is important to capture temporal variations in your training set. For example, if you just train on yesterday's data, you might be training on a Sunday and then trying to apply the results to a Monday. Generally speaking, it is good to get your training data for at least a full week and avoid special holidays where behavior might change dramatically.
  • You likely don't need all the data you have. It is often a good idea to reduce your training data if you can manage to keep metrics constant and reduce training time.
  • On the other hand, adding more (non-redundant) data could improve the accuracy of a more complex model with more parameters. Be careful not to drive yourself to a local optimum by reducing data just because your model is too simple, and then not increasing it when trying a more complex one.
  • Random sampling is not always the best way to reduce your data. By random sampling, you might be getting too many datapoints from your most active users. You might need to stratify the sampling so you capture the variance of users, items, or time.
  • You can weight training data to push the model to value some actions more than others. For example, if you think an action on high quality content is worth twice as much as one on regular content, you can multiply the high quality one by two.
  • Also, beware that natural biases in the actions might require some form of subsampling or weighting to counter them. E.g. your dataset might have many more clicks than likes, but you want the model to weight them similarly.
  • The training data is also biased by the existing model. Ideally, you'd want to introduce some randomness or exploration into what you present so as to get less biased samples. How to do that without affecting engagement is outside the scope of this answer though.
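The stratified-sampling point above can be sketched simply: instead of sampling events uniformly (which over-represents the most active users), cap the number of events kept per user. The field names and the per-user cap are illustrative assumptions, not part of the original answer.

```python
# Hedged sketch of stratified down-sampling by user. Uniform random
# sampling would mostly draw events from the heaviest users; capping
# per user preserves variance across users instead.
import random
from collections import defaultdict

def stratified_sample(events, cap_per_user, seed=0):
    """Keep at most cap_per_user events for each user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user"]].append(event)
    sample = []
    for user_events in by_user.values():
        if len(user_events) <= cap_per_user:
            sample.extend(user_events)
        else:
            sample.extend(rng.sample(user_events, cap_per_user))
    return sample

# A power user with 100 events and a casual user with 2:
events = [{"user": "heavy", "item": i} for i in range(100)]
events += [{"user": "light", "item": i} for i in range(2)]
sample = stratified_sample(events, cap_per_user=10)
print(len(sample))  # 12: the heavy user is capped at 10, the light user keeps 2
```

The same idea extends to stratifying over items or time buckets rather than users.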

Negative sampling

  • The simple idea of treating everything that is shown but not “clicked” as a negative usually does not work. The reason it does not work is that this teaches the model that what it decided was somewhat good is actually worse than what wasn't shown (presentation bias).
  • Many times random sampling from the “catalog” is a suitable strategy if your catalog is large enough. The assumption here is that a random sample from your specific catalog is very unlikely to be a positive example. Note that “catalog” here is defined as the subset of elements that your algorithm could consider.
  • You can combine negative samples from the impressions with random samples and use a hyperparameter to tune the proportion of each.
  • Users with low activity won't click on things by default. Treating them as negatives makes the problems outlined above even worse.
  • Different models deal better or worse with class imbalance, but as a general rule of thumb you roughly need as many positives as negatives. You probably have many more negatives than positives. So, yes, it is ok to down-sample.

Hyperparameter tuning

  • You should do hyperparameter tuning with a validation set that is different from training and testing.
  • Not all your hyperparameters have the same range. Some will almost never move while others can change from one day to the next due to covariate shift.
  • Ideally, you want to tune your hyperparameters every time that you retrain your model. However, you might not want to tune them all since it is expensive.
  • Grid search is simple, but highly inefficient. There are better ways (e.g. Bayesian Optimization).
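To make the last point concrete, here is a toy sketch of random search, a simple step beyond grid search (and a common baseline for smarter methods like Bayesian optimization): with the same budget of trials, random draws usually cover the important dimensions better than a fixed grid. The objective function here is a made-up stand-in for "train, then score on the validation set".

```python
# Toy random search over two hyperparameters. The objective is a
# fabricated stand-in for a real validation score, peaked near
# lr=0.1, reg=0.01 purely for illustration.
import random

def validation_score(lr, reg):
    return -((lr - 0.1) ** 2 + (reg - 0.01) ** 2)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 1.0), "reg": rng.uniform(0.0, 0.1)}
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(n_trials=50)
print(best, score)
```

In practice you would evaluate each candidate on a held-out validation set, never the test set, as noted above.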

Overfitting and Time traveling

  • You should always check your hyperparameters for possible signs of overfitting (e.g. a regularization lambda close to zero).
  • Use a test set from the “future”. E.g. Try to predict the most recent clicks from older ones. Do not randomly sample your test set from the same batch as your training set.
  • Make sure your training features do not contain data from the “future” (aka time traveling). While this might be easy and obvious in some cases, it can get tricky. E.g. you might be using the output of another model developed by someone else as the input of yours. This external model might include information from the date of your test set.
  • If your test metric becomes really good all of a sudden, ask yourself what you might be doing wrong. Chances are you are time traveling or overfitting in some way.
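The "test set from the future" advice above can be sketched as a temporal split: sort events by timestamp and hold out the most recent slice instead of sampling at random. The field names are illustrative assumptions.

```python
# Sketch of a temporal train/test split. Training data is strictly
# older than test data, which guards against one form of time travel.

def temporal_split(events, test_fraction=0.2):
    """Train on the oldest events, test on the most recent ones."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

events = [{"timestamp": t, "clicked": t % 2} for t in range(10)]
train, test = temporal_split(events, test_fraction=0.2)
print(len(train), len(test))  # 8 2
# Every training event precedes every test event:
assert max(e["timestamp"] for e in train) < min(e["timestamp"] for e in test)
```

Note this only covers the split itself; features computed from "future" data (e.g. an upstream model's output) can still leak, as the bullet above warns.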
