Summary: In this second article on adversarial validation, we get to the meat of the matter: what to do when the train and test sets differ. Can we build a better validation set?
Summary: Many data science competitions suffer from a test set that differs markedly from the training set (a violation of the “identically distributed” assumption), which makes it difficult to build a representative validation set. We propose a method for selecting the training examples most similar to the test examples and using them as a validation set. The core idea is to train a probabilistic classifier to distinguish train examples from test examples.
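
The selection step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: the data is synthetic, the classifier is a small hand-written logistic regression so that the example has no dependencies, and the names `fit_logistic`, `predict_proba`, `adversarial_validation_split`, and `val_fraction` are hypothetical. It labels train rows 0 and test rows 1, fits the classifier, and takes the training rows with the highest predicted probability of being "test" as the validation set.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, X):
    """Predicted probability that each row belongs to the test set."""
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def adversarial_validation_split(train_X, test_X, val_fraction=0.2):
    """Return (fit_idx, val_idx), both lists of indices into train_X.

    val_idx holds the training rows the classifier finds most test-like.
    """
    X = train_X + test_X
    y = [0] * len(train_X) + [1] * len(test_X)  # 0 = train, 1 = test
    model = fit_logistic(X, y)
    p_test = predict_proba(model, train_X)
    # Rank training rows from most to least test-like.
    ranked = sorted(range(len(train_X)), key=lambda i: p_test[i], reverse=True)
    n_val = int(len(train_X) * val_fraction)
    return sorted(ranked[n_val:]), sorted(ranked[:n_val])


# Synthetic example (assumption): the test feature is shifted relative to train.
rng = random.Random(0)
train_X = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
test_X = [[rng.gauss(1.5, 1.0)] for _ in range(200)]

fit_idx, val_idx = adversarial_validation_split(train_X, test_X)
```

Here the classifier scores the same training rows it was fitted on, which is tolerable for a one-feature linear model. With a more flexible classifier, it would likely be safer to score each training row with out-of-fold predictions from cross-validation, so that memorized rows do not distort the ranking.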