What if I told you, “You’ll never have enough data”?
Okay, here is a machine learning problem for you: given a few data points from an unknown binary function f of 3 variables a, b, c, find the value of f at new data points.
What do you think is the answer? The pattern that comes to my mind is
f(a, b, c) = c, as it is true for all the given values. But if you think harder, you’ll recognize that f(a, b, c) = (a.b) + c also fits the data, where . is the AND operator and + is the OR operator. Yet the two functions produce different outputs.
Does having more data help? It doesn’t: both candidate functions agree on every remaining data point except the one we want to predict, so no additional observation can tell them apart. And since a binary function of 3 variables has only 2^3 = 8 possible inputs, this is all the data we can ever have!
If you assume both patterns are equally likely, then you have no reason to choose one over the other as the true value of f.
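To make this concrete, here is a small Python sketch (illustrative only, since the original table of given data points is not reproduced here) comparing the two candidate patterns over all 8 possible inputs. They agree everywhere except at (a, b, c) = (1, 1, 0), so any data set that leaves out that one point is consistent with both.

```python
from itertools import product

# Two candidate patterns that both fit the observed data
f1 = lambda a, b, c: c               # f(a, b, c) = c
f2 = lambda a, b, c: (a & b) | c     # f(a, b, c) = (a.b) + c

# Compare them on every possible input of a 3-variable Boolean function
for a, b, c in product([0, 1], repeat=3):
    v1, v2 = f1(a, b, c), f2(a, b, c)
    marker = "  <-- disagreement" if v1 != v2 else ""
    print(f"f({a}, {b}, {c}): {v1} vs {v2}{marker}")

# The two functions differ only at (1, 1, 0); every other observation is
# consistent with both, so extra data from the remaining points cannot
# decide between them.
```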
What is happening here?
Let’s say our input comes from a set X and the output label from a set Y. Then there exist |Y|^|X| functions from X → Y. Each of these is a potential function we could be dealing with, whose output we have to predict.
If we are dealing with a Boolean function on n Boolean variables, then |X| = 2^n and |Y| = 2, hence there are 2^(2^n) possible functions which might have produced the data. Each data point can eliminate at most half of the remaining possibilities, i.e., the patterns which don’t satisfy the observation. Hence we’ll need log2(2^(2^n)) = 2^n observations, that is, every data point you can get for a Boolean function of n variables, including the very point you need to make a prediction on. So you’ll never have enough data. The average performance of the surviving patterns on unseen data points will also be the same, so you have no reason to choose one pattern over another.
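You can brute-force this counting argument for n = 3. The sketch below uses a hypothetical set of observations (the true function is taken to be f(a, b, c) = c, seen on every input except (1, 1, 0)) and shows that each observation halves the candidates, yet the survivors still split evenly on the unseen point.

```python
from itertools import product

n = 3
inputs = list(product([0, 1], repeat=n))        # all 2^n = 8 possible inputs
# A Boolean function is fully described by its truth table: one output bit per
# input, so there are 2^(2^n) = 256 candidate functions for n = 3.
candidates = list(product([0, 1], repeat=len(inputs)))
print("candidate functions:", len(candidates))  # 256

# Hypothetical observations: the value of f(a, b, c) = c on every input
# except the query point (1, 1, 0).
query = (1, 1, 0)
observations = {x: x[2] for x in inputs if x != query}

remaining = candidates
for x, y in observations.items():
    remaining = [t for t in remaining if t[inputs.index(x)] == y]
    print(f"after observing f{x} = {y}: {len(remaining)} candidates left")

# Even after 7 observations, the two surviving truth tables disagree on the
# unseen point, so the data gives no reason to prefer either prediction.
predictions = [t[inputs.index(query)] for t in remaining]
print("predict 0:", predictions.count(0), " predict 1:", predictions.count(1))
```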
This is a special case of the no free lunch theorem, which says that the average performance of any learning algorithm, taken over all possible underlying supervised learning problems, is the same as random guessing. To put it another way, if you assume that the underlying function can be anything, then you have no predictive power at all.
But of course the conditions of the no free lunch theorem do not hold in real life. For example, the space of natural images is closer to piecewise continuous functions than to the set of all possible functions on 1024x1024 pixels (or whatever resolution images you are dealing with). The catch is that the amount of data you need can still be so large that it is practically impossible to collect. Suppose I take an AES encryption function with an unknown 256-bit key, take a very large set of images, and label each image with the first bit of the ciphertext produced by encrypting it. How many images do I need to predict the label of a new image? On the order of 2^128, which is so large that it is practically impossible to get that amount of data.
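For concreteness, here is a minimal sketch of such a labelling rule, assuming the pycryptodome package; the key handling and the label() helper are my own illustration, not part of any standard pipeline.

```python
import os
from Crypto.Cipher import AES   # pip install pycryptodome

SECRET_KEY = os.urandom(32)     # 256-bit key the learner never gets to see

def label(image_bytes: bytes) -> int:
    """Label an image with the first bit of its AES-256 ciphertext."""
    # Pad the image bytes up to the 16-byte AES block size.
    padded = image_bytes + b"\x00" * (-len(image_bytes) % 16)
    ciphertext = AES.new(SECRET_KEY, AES.MODE_ECB).encrypt(padded)
    return ciphertext[0] >> 7   # most significant bit of the first byte

# Two "images" differing in a single byte can get unrelated labels, which is
# exactly what makes this labelling rule hopeless to learn from examples.
print(label(b"\x00" * 1024), label(b"\x01" + b"\x00" * 1023))
```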
Is there any “real” life example where we cannot have enough data to solve the problem as a supervised ML problem? I’d say any complex system: the position of a triple pendulum, a non-trivial prediction about a social system, or genetics, or health as a function of diet. In complex systems there can be a lot of factors, and these factors interact in many different and strange ways. Take diet: what matters is not just what you eat, but in what quantity, in what proportion, in what form, at what time, and it also depends on your body type, genetics, and climate. Taking vitamins through pills is not the same as taking them through fruits, and my mom will tell you that drinking milk in the morning is different from drinking milk at night. These are just the factors we know about; we might find more in the years to come. And we know what happens when you don’t have enough data.