The No-Free-Lunch Theorem

In this section we prove that there is no universal learner. We do this by showing that no learner can succeed on all learning tasks, as formalized in the following theorem:

Theorem 5.1 (No-Free-Lunch) Let $A$ be any learning algorithm for the task of binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}|/2$, representing a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:

  1. There exists a function $f : \mathcal{X} \to \{0,1\}$ with $L_{\mathcal{D}}(f) = 0$.

  2. With probability of at least 1/7 over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge 1/8$.

Note that the two conditions are stated in the PAC model rather than the agnostic PAC model, so the theorem only deals with PAC learnability (i.e., a hypothesis class with no bias is not PAC learnable).

This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. A trivial successful learner in this case would be an ERM learner with the hypothesis class $\mathcal{H} = \{f\}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose sample size satisfies $m \ge \log(|\mathcal{H}|/\delta)/\epsilon$ (Corollary 2.3).
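To see concretely why such an ERM learner succeeds, here is a worked instantiation of Corollary 2.3 with the constants appearing in the theorem (this calculation is not spelled out above): choose any $\epsilon < 1/8$ and $\delta < 1/7$. Then a sample of size

$$ m \;\ge\; \frac{\log(|\mathcal{H}|/\delta)}{\epsilon} $$

guarantees that, with probability at least $1-\delta > 6/7$, every ERM hypothesis $h_S$ over a finite class $\mathcal{H} \ni f$ satisfies $L_{\mathcal{D}}(h_S) \le \epsilon < 1/8$, so this learner does not exhibit the failure described in condition 2.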

Before proving the theorem, let’s first see how it relates to the need for prior knowledge.

COROLLARY 5.2 Let $\mathcal{X}$ be an infinite domain set and let $\mathcal{H}$ be the set of all functions from $\mathcal{X}$ to $\{0,1\}$. Then $\mathcal{H}$ is not PAC learnable.

Proof: Assume, by way of contradiction, that the class is PAC learnable. Then by definition, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$, such that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, if the realizability assumption holds (i.e., there exists $h \in \mathcal{H}$ with $L_{\mathcal{D}}(h) = 0$), then with probability greater than $1-\delta$, when $A$ is applied to a sample $S$ of size $m$, generated i.i.d. by $\mathcal{D}$, $L_{\mathcal{D}}(A(S)) \le \epsilon$. We can arbitrarily choose some $\epsilon < 1/8$ and $\delta < 1/7$. However, applying the No-Free-Lunch theorem, since $|\mathcal{X}| > 2m$, for every learning algorithm (in particular for $A$), there exists a distribution $\mathcal{D}$ such that with probability of at least $1/7 > \delta$, $L_{\mathcal{D}}(A(S)) \ge 1/8 > \epsilon$, which leads to the desired contradiction.

Proof of the No-Free-Lunch Theorem

The goal of the proof is to design a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:

  1. There exists a function $f : \mathcal{X} \to \{0,1\}$ with $L_{\mathcal{D}}(f) = 0$

  2. With probability of at least 1/7 over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge 1/8$

To do so, we intentionally choose a special family of distributions:

Let $C$ be a subset of $\mathcal{X}$ of size $2m$; then there are $T = 2^{2m}$ possible functions from $C$ to $\{0,1\}$, denoted $f_1, \ldots, f_T$. For each such function $f_i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0,1\}$ defined by:

$$\mathcal{D}_i(\{(x, y)\}) = \begin{cases} 1/|C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise.} \end{cases}$$
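In words (a restatement that is not in the original text but follows directly from the definition), $\mathcal{D}_i$ picks an instance uniformly from $C$ and labels it by $f_i$, so the true error of any hypothesis $h$ under $\mathcal{D}_i$ is simply the fraction of points of $C$ on which $h$ disagrees with $f_i$:

$$L_{\mathcal{D}_i}(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}_i}[h(x) \neq y] = \frac{1}{2m}\,\bigl|\{x \in C : h(x) \neq f_i(x)\}\bigr|.$$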

Clearly, $L_{\mathcal{D}_i}(f_i) = 0$; hence, each $\mathcal{D}_i$ satisfies the first condition of the theorem. To prove that the second condition holds, we will show that:

$$\max_{i \in [T]} \; \mathbb{E}_{S \sim \mathcal{D}_i^m}\bigl[L_{\mathcal{D}_i}(A(S))\bigr] \ge \frac{1}{4}.$$

This guarantees that for every algorithm $A$ we can always pick a distribution from this family (namely, the $\mathcal{D}_i$ attaining the maximal expected error) such that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\bigl[L_{\mathcal{D}}(A(S))\bigr] \ge \frac{1}{4}.$$

It can be verified that the preceding suffices for showing

$$\mathbb{P}_{S \sim \mathcal{D}^m}\Bigl[L_{\mathcal{D}}(A(S)) \ge \tfrac{1}{8}\Bigr] \ge \frac{1}{7},$$

which finishes our proof.
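For completeness, here is the short verification, an elementary reverse-Markov-type argument (not written out in the text above): let $Z = L_{\mathcal{D}}(A(S)) \in [0,1]$. Then

$$\mathbb{E}[Z] \;\le\; \frac{1}{8}\,\mathbb{P}\Bigl[Z < \tfrac{1}{8}\Bigr] + 1 \cdot \mathbb{P}\Bigl[Z \ge \tfrac{1}{8}\Bigr] \;\le\; \frac{1}{8} + \frac{7}{8}\,\mathbb{P}\Bigl[Z \ge \tfrac{1}{8}\Bigr],$$

so $\mathbb{E}[Z] \ge 1/4$ forces $\mathbb{P}[Z \ge 1/8] \ge \frac{1/4 - 1/8}{7/8} = \frac{1}{7}$.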

We should point out that it is the learner’s lack of bias that enables us to freely choose a distribution from the family $\{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$. If the learner restricted its hypothesis class to some proper subset of the functions from $\mathcal{X}$ to $\{0,1\}$, we might not be able to choose a distribution that satisfies the first condition with respect to that class.

We now turn to prove the equation:

$$\max_{i \in [T]} \; \mathbb{E}_{S \sim \mathcal{D}_i^m}\bigl[L_{\mathcal{D}_i}(A(S))\bigr] \ge \frac{1}{4}.$$

There are $k = (2m)^m$ possible sequences of $m$ examples from $C$, denoted by $S_1, \ldots, S_k$. Also, if $S_j = (x_1, \ldots, x_m)$, we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m)))$.

If we decide to challenge the learner with $\mathcal{D}_i$, the possible training sets it can receive are $S_1^i, \ldots, S_k^i$. By our definition of $\mathcal{D}_i$, all these training sets have the same probability of being sampled [1]. Therefore,

$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\bigl[L_{\mathcal{D}_i}(A(S))\bigr] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)).$$

Using the facts that "maximum" is larger than "average" and that "average" is larger than "minimum", we have:

$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)) \;=\; \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)).$$

So, at this point, our target quantity is lower bounded by the average true error over all the distributions $\mathcal{D}_1, \ldots, \mathcal{D}_T$, evaluated at the worst-case (minimizing) sample $S_j$. The intuition for the following steps is that since the learner sees only the training data, it has no information about the labels of the unseen points; hence, for each unseen point, the best error rate it can achieve, averaged over all possible labeling functions, is $1/2$.

Fix some $j \in [k]$, write $S_j = (x_1, \ldots, x_m)$, and let $v_1, \ldots, v_p$ be the examples in $C$ that do not appear in $S_j$. Since $|C| = 2m$ and $S_j$ contains at most $m$ distinct instances, it holds that $p \ge m$. Therefore, for every hypothesis $h : C \to \{0,1\}$ and every $i$,

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}_{[h(x) \neq f_i(x)]} \;\ge\; \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \neq f_i(v_r)]} \;\ge\; \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \neq f_i(v_r)]}.$$

Hence,

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} \;=\; \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} \;\ge\; \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]}.$$

For each $r \in [p]$, we can partition all the functions $f_1, \ldots, f_T$ into $T/2$ disjoint pairs, where a pair $(f_i, f_{i'})$ satisfies $f_i(c) \neq f_{i'}(c)$ if and only if $c = v_r$ (i.e., the mappings of $f_i$ and $f_{i'}$ differ only at $v_r$). Since $v_r$ is out-of-sample data, it follows that $S_j^i = S_j^{i'}$, and hence $A(S_j^i) = A(S_j^{i'})$. Therefore,

$$\mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r) \neq f_{i'}(v_r)]} = 1,$$

which yields

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} = \frac{1}{2}.$$
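As a small illustration (this example is not in the original text): take $m = 1$ and $C = \{c_1, c_2\}$, so $T = 4$, and suppose $S_j = (c_1)$, so the single unseen point is $v_1 = c_2$. One pair is $(f_1, f_2)$ with $f_1(c_1) = f_2(c_1) = 0$ but $f_1(c_2) = 0 \neq 1 = f_2(c_2)$; the other pair is defined analogously with $f_3(c_1) = f_4(c_1) = 1$. Within each pair the labeled training sets coincide ($S_j^1 = S_j^2 = ((c_1, 0))$), so the learner outputs the same hypothesis for both members and misclassifies $v_1$ for exactly one of them; averaging the indicator over the four functions gives exactly $1/2$.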

Therefore, combining the preceding inequalities,

$$\max_{i \in [T]} \; \mathbb{E}_{S \sim \mathcal{D}_i^m}\bigl[L_{\mathcal{D}_i}(A(S))\bigr] \;\ge\; \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}(A(S_j^i)) \;\ge\; \frac{1}{2} \cdot \frac{1}{2} \;=\; \frac{1}{4},$$

which finishes the proof.
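As a sanity check on the intuition above (unseen points contribute an average error of $1/2$), the following is a minimal simulation sketch, not taken from the book: it enumerates every labeling function on a small $C$ with $|C| = 2m$, uses a hypothetical memorize-and-predict-0 learner, and computes $\mathbb{E}_{S \sim \mathcal{D}_i^m}[L_{\mathcal{D}_i}(A(S))]$ exactly for each $i$. The learner, the variable names, and the choice $m = 3$ are all illustrative assumptions, not part of the proof.

```python
import itertools
import statistics

# Construction from the proof: a domain C of size 2m, the T = 2^(2m) labeling
# functions f_1, ..., f_T, and for each f_i the distribution D_i that draws x
# uniformly from C and labels it by f_i.
m = 3
C = list(range(2 * m))
functions = list(itertools.product([0, 1], repeat=len(C)))  # each tuple f is a labeling of C

def learner(sample):
    """A hypothetical learner: memorize the training labels, predict 0 on unseen points."""
    memory = {x: y for x, y in sample}
    return lambda x: memory.get(x, 0)

def true_error(h, f):
    """0-1 true error under D_i: the fraction of C on which h disagrees with f."""
    return sum(h(x) != f[x] for x in C) / len(C)

expected_errors = []
for f in functions:  # each labeling function f defines one distribution D_i
    errors = []
    for xs in itertools.product(C, repeat=m):  # all (2m)^m equally likely training sequences
        h = learner([(x, f[x]) for x in xs])
        errors.append(true_error(h, f))
    expected_errors.append(statistics.mean(errors))  # E_{S ~ D_i^m}[L_{D_i}(A(S))]

print(f"average over i of E[L(A(S))] = {statistics.mean(expected_errors):.3f}")
print(f"max over i of E[L(A(S))]     = {max(expected_errors):.3f}  (the proof shows this is >= 1/4)")
```

Running it for other simple learners (e.g., predict the majority training label on unseen points) gives the same qualitative picture: the maximal expected error over the family never drops below $1/4$.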


1. The ease of computing this expectation is the reason we chose a uniform family of distributions: under $\mathcal{D}_i$, every sequence of $m$ instances from $C$ has the same probability $(2m)^{-m}$ of being drawn.
