How Training and Testing Work in Data Mining / Machine Learning

Imagine you wanted to come up with a “classifier” to predict whether Georgia would win a given American football game. You might collect (training) data from all of their games over the previous two seasons, 2007 and 2008, and then come up with rules based on that data. For example: if the quarterback throws for 300+ yards, the running back runs for 100+ yards, and the defense gets 2+ interceptions, then you predict a win. When you test those rules on the same (training) data, voila! They are correct 100% of the time. But you don’t really know whether that classifier (set of rules) generalizes, because you only tested it on data you had already looked at. So you apply the rules to 2009 data (the test set) and see how well they perform there. It could be that the 2009 team is so different from the 2007-2008 teams that the rules no longer hold. In that case the classifier “fit” the training data well but in fact “overfit” it: it captured quirks of those particular games rather than patterns that carry over to new ones, so it didn’t also “fit” the test data. That’s why it’s important to hold out a test set, so you can evaluate the classifier on games it has not seen.
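To make the gap between training and test accuracy concrete, here is a minimal Python sketch of the rule described above. The `Game` record, the `predict_win` and `accuracy` helpers, and every number in the two game lists are made up for illustration; none of it is real Georgia data.

```python
from typing import NamedTuple, Sequence


class Game(NamedTuple):
    passing_yards: int
    rushing_yards: int
    interceptions: int
    won: bool  # the actual outcome we are trying to predict


def predict_win(game: Game) -> bool:
    """The rule from the text: 300+ passing, 100+ rushing, 2+ interceptions."""
    return (
        game.passing_yards >= 300
        and game.rushing_yards >= 100
        and game.interceptions >= 2
    )


def accuracy(games: Sequence[Game]) -> float:
    """Fraction of games where the rule's prediction matches the outcome."""
    correct = sum(predict_win(g) == g.won for g in games)
    return correct / len(games)


# Hypothetical training games (the ones the rule was built from).
train_games = [
    Game(320, 110, 2, True),
    Game(250, 90, 1, False),
    Game(305, 130, 3, True),
    Game(280, 140, 0, False),
]

# Hypothetical held-out games from a later season.
test_games = [
    Game(310, 105, 2, False),  # rule predicts a win, but the team lost
    Game(240, 150, 1, True),   # rule predicts a loss, but the team won
    Game(330, 120, 2, True),
    Game(200, 80, 0, False),
]

train_accuracy = accuracy(train_games)
test_accuracy = accuracy(test_games)
print(f"training accuracy: {train_accuracy:.0%}")
print(f"test accuracy:     {test_accuracy:.0%}")
```

With these invented numbers the rule gets every training game right but only half of the held-out games, which is the overfitting pattern described above.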

Your training data wouldn’t necessarily have to be from 2007-2008. You could pick five games from each of the last 10 seasons and make that your training data, then pick five other games from each of those seasons and make that your test set. The important thing is that the two sets don’t overlap. Or you could use cross-validation: take 10 games from each of the last 10 seasons, use 1 game from each season as the test set, and use the remaining 9 games from each season as the training set. This is repeated 10 times, each time with a different game from each season in the test set and the remaining games in the training set, so every game ends up in the test set exactly once. The classifier’s performance is then typically averaged over the 10 rounds.
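The cross-validation scheme can be sketched in a few lines of Python. This is only an illustration under assumptions: the `season_folds` helper, the season years, and the placeholder game labels are hypothetical, and each season is assumed to contribute exactly 10 games.

```python
from typing import Dict, Iterator, List, Tuple

GameId = Tuple[int, str]  # (season year, game label)


def season_folds(
    seasons: Dict[int, List[str]], n_folds: int
) -> Iterator[Tuple[List[GameId], List[GameId]]]:
    """Yield (train, test) pairs; fold k holds out the k-th game of every season."""
    for k in range(n_folds):
        train: List[GameId] = []
        test: List[GameId] = []
        for year, games in seasons.items():
            for i, game in enumerate(games):
                if i == k:
                    test.append((year, game))
                else:
                    train.append((year, game))
        yield train, test


# Hypothetical data: 10 seasons with 10 placeholder games each.
seasons = {
    year: [f"game{i}" for i in range(1, 11)] for year in range(1999, 2009)
}

folds = list(season_folds(seasons, n_folds=10))
for k, (train, test) in enumerate(folds, start=1):
    # Training and test games never overlap within a fold.
    assert not set(train) & set(test)
    print(f"fold {k}: {len(train)} training games, {len(test)} test games")
```

In each fold you would build the classifier from `train` only and score it on `test`, then combine the 10 scores, for example by averaging them.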
