The No-Free-Lunch Theorem
In this section we prove that there is no universal learner. We do this by showing that no learner can succeed on all learning tasks, as formalized in the following theorem:
Theorem 5.1 (No-Free-Lunch)
Let $A$ be any learning algorithm for the task of binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}|/2$, representing a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:
- There exists a function $f: \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$.
- With probability of at least 1/7 over the choice of $S \sim \mathcal{D}^m$, we have that $L_{\mathcal{D}}(A(S)) \geq 1/8$.
Note that the two conditions are stated in the PAC model rather than the agnostic PAC model, so the theorem only concerns PAC learnability (i.e., a hypothesis class with no bias is not PAC learnable).
This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. A trivial successful learner in this case would be an ERM learner with the hypothesis class $\mathcal{H} = \{f\}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies the equation $m \geq 8\log(7|\mathcal{H}|/6)$.
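As a concrete toy illustration of such a trivial learner, the sketch below runs ERM over a small finite hypothesis class that happens to contain the true labeling function $f$. The domain, the hypothesis class, and the sample size are arbitrary choices made only for this example.

```python
import random

def erm(hypotheses, sample):
    """Return a hypothesis minimizing the empirical 0-1 error on the sample."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in sample))

# Toy domain and true labeling function f (arbitrary choices for illustration).
domain = list(range(100))
f = lambda x: x % 2                                   # label = parity of x

# A small finite hypothesis class that contains f.
H = [f, (lambda x: 0), (lambda x: 1), (lambda x: int(x < 50))]

# Realizable task: x uniform over the domain, labeled by f; draw a sample of size 30.
S = [(x, f(x)) for x in random.choices(domain, k=30)]

h = erm(H, S)
true_error = sum(h(x) != f(x) for x in domain) / len(domain)
print("true error of the ERM output:", true_error)    # 0.0: f has zero empirical error and ties break to the first hypothesis
```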
Before proving the theorem, let’s first see how it relates to the need for prior knowledge.
Corollary 5.2
Let $\mathcal{X}$ be an infinite domain set and let $\mathcal{H}$ be the set of all functions from $\mathcal{X}$ to $\{0, 1\}$. Then $\mathcal{H}$ is not PAC learnable.
Proof:
Assume, by way of contradiction, that $\mathcal{H}$ is PAC learnable. Then by definition, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$ such that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, if the realizability assumption holds (i.e. there exists $f \in \mathcal{H}$ with $L_{\mathcal{D}}(f) = 0$), then with probability greater than $1 - \delta$, when $A$ is applied to a sample $S$ of size $m$ generated i.i.d. by $\mathcal{D}$, we have $L_{\mathcal{D}}(A(S)) \leq \epsilon$. We can arbitrarily choose some $\epsilon < 1/8$ and $\delta < 1/7$. However, applying the No-Free-Lunch theorem, since $|\mathcal{X}| > 2m$, for every learning algorithm (in particular for $A$) there exists a distribution $\mathcal{D}$ such that with probability of at least $1/7 > \delta$ we have $L_{\mathcal{D}}(A(S)) \geq 1/8 > \epsilon$, which leads to the desired contradiction.
Proof of the No-Free-Lunch Theorem
The goal of the proof is to design a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:
- There exists a function $f: \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$.
- With probability of at least 1/7 over the choice of $S \sim \mathcal{D}^m$, we have that $L_{\mathcal{D}}(A(S)) \geq 1/8$.
To do so, we intentionally choose a special family of distributions:
Let $C$ be a subset of $\mathcal{X}$ of size $2m$; then there are $T = 2^{2m}$ functions from $C$ to $\{0, 1\}$, denoted $f_1, \ldots, f_T$. For each such function $f_i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$ defined by:

$$\mathcal{D}_i(\{(x, y)\}) = \begin{cases} 1/|C| & \text{if } y = f_i(x), \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, $L_{\mathcal{D}_i}(f_i) = 0$; hence, each $\mathcal{D}_i$ satisfies the first condition of the theorem.
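To make the construction concrete, here is a small sketch for $m = 2$: it enumerates the $T = 2^{2m} = 16$ labelings of a four-point set $C$, builds each $\mathcal{D}_i$ as an explicit table of probabilities, and checks that $L_{\mathcal{D}_i}(f_i) = 0$. The helper names are arbitrary.

```python
import itertools
from fractions import Fraction

m = 2
C = list(range(2 * m))                                     # the subset C, |C| = 2m
labelings = list(itertools.product([0, 1], repeat=len(C)))
print(len(labelings))                                      # T = 2^{2m} = 16 functions f_i : C -> {0,1}

def make_distribution(f):
    """D_i over C x {0,1}: mass 1/|C| on each (x, f_i(x)), zero elsewhere."""
    return {(x, y): (Fraction(1, len(C)) if y == f[x] else Fraction(0))
            for x in C for y in (0, 1)}

def true_loss(D, h):
    """L_D(h): total probability of the pairs (x, y) on which h(x) != y (0-1 loss)."""
    return sum(p for (x, y), p in D.items() if h(x) != y)

# First condition: every D_i is realized by its own f_i with zero loss.
for f in labelings:
    D = make_distribution(f)
    assert true_loss(D, lambda x: f[x]) == 0
print("L_{D_i}(f_i) = 0 for all i")
```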
To prove that the second condition holds, we will show that:

$$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \geq \frac{1}{4}.$$

This guarantees that for every algorithm $A$ we can always pick a distribution (namely, the $\mathcal{D}_i$ attaining the maximum expected error for that algorithm) such that

$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \geq \frac{1}{4}.$$

It can be verified that the preceding suffices for showing that $\mathbb{P}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S)) \geq 1/8\right] \geq 1/7$, which finishes our proof.
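This last claim is the only step left implicit. It follows from a reverse Markov-type inequality for random variables taking values in $[0, 1]$; a short derivation, written out here for completeness:

```latex
% Let Z = L_{D_i}(A(S)); then Z takes values in [0,1] and E[Z] >= 1/4. For a = 1/8:
\begin{align*}
\mathbb{E}[Z] &\le a\,\Pr[Z < a] + 1\cdot\Pr[Z \ge a]
               = a + (1-a)\,\Pr[Z \ge a], \\
\Pr[Z \ge a] &\ge \frac{\mathbb{E}[Z] - a}{1 - a}
              \ge \frac{1/4 - 1/8}{1 - 1/8} = \frac{1/8}{7/8} = \frac{1}{7}.
\end{align*}
```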
We should point out that it is the learner’s lack of bias that lets us freely choose the challenge distribution from $\{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$. If the learner restricted its hypothesis class to some strict subset of the functions from $C$ to $\{0, 1\}$, we might not be able to choose a distribution that satisfies the first condition with respect to that class (i.e., realizability would no longer be guaranteed).
We now turn to prove the inequality:

$$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \geq \frac{1}{4}.$$

There are $k = (2m)^m$ possible sequences of $m$ examples from $C$, denoted by $S_1, \ldots, S_k$. Also, if $S_j = (x_1, \ldots, x_m)$, we denote by $S_j^i$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S_j^i = \left((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m))\right)$.
If we decide to challenge the learner with $\mathcal{D}_i$, the possible training sets it can receive are $S_1^i, \ldots, S_k^i$. By our definition of $\mathcal{D}_i$, all these training sets have the same probability of being sampled [1]. Therefore,

$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] = \frac{1}{k}\sum_{j=1}^{k} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right).$$

Using the facts that a "maximum" is at least as large as an "average" and that an "average" is at least as large as a "minimum", we have:

$$\max_{i \in [T]} \frac{1}{k}\sum_{j=1}^{k} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) \geq \frac{1}{T}\sum_{i=1}^{T} \frac{1}{k}\sum_{j=1}^{k} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) = \frac{1}{k}\sum_{j=1}^{k} \frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) \geq \min_{j \in [k]} \frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right).$$
So our target is now lower bounded by the average true error (over all the distributions $\mathcal{D}_i$) given a fixed sample sequence $S_j$. The intuition behind the following steps is that since the learner sees at most $m$ of the $2m$ points of $C$ in the training data, it has no information about the unseen points. Hence, averaged over all labelings, the best error rate it can achieve on each unseen point is $1/2$.
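Before formalizing this, here is a tiny exact check of that intuition (illustrative only): whatever rule the learner applies at an unseen point — below, an arbitrary majority vote over the seen labels — exactly half of all labelings disagree with its prediction there.

```python
import itertools

# C has 4 points, the sample shows points (0, 1), and the point v_r = 2 is unseen.
C = range(4)
sample_points, v_r = (0, 1), 2
labelings = list(itertools.product([0, 1], repeat=len(C)))

def predict_unseen(f):
    """Prediction at v_r depends only on the labels the learner actually saw."""
    seen_labels = [f[x] for x in sample_points]
    return 1 if sum(seen_labels) > len(seen_labels) / 2 else 0   # majority vote, ties -> 0

avg_error_at_vr = sum(predict_unseen(f) != f[v_r] for f in labelings) / len(labelings)
print(avg_error_at_vr)   # 0.5
```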
Fix some sample $S_j = (x_1, \ldots, x_m)$ and let $v_1, \ldots, v_p$ be the examples in $C$ that do not appear in $S_j$ (the unseen examples). Since $|C| = 2m$, it holds that $p \geq m$. Therefore, for every function $h: C \to \{0, 1\}$ and every $i$,

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m}\sum_{x \in C} \mathbf{1}_{[h(x) \neq f_i(x)]} \geq \frac{1}{2m}\sum_{r=1}^{p} \mathbf{1}_{[h(v_r) \neq f_i(v_r)]} \geq \frac{1}{2p}\sum_{r=1}^{p} \mathbf{1}_{[h(v_r) \neq f_i(v_r)]}.$$

Hence,

$$\frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) \geq \frac{1}{T}\sum_{i=1}^{T} \frac{1}{2p}\sum_{r=1}^{p} \mathbf{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} = \frac{1}{2p}\sum_{r=1}^{p} \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} \geq \frac{1}{2} \min_{r \in [p]} \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]}.$$
For each $r \in [p]$, we can partition all the functions in $\{f_1, \ldots, f_T\}$ into $T/2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ and every $c \in C$, $f_i(c) \neq f_{i'}(c)$ if and only if $c = v_r$ (i.e. the mappings of $f_i$ and $f_{i'}$ differ only at $v_r$). Since $v_r$ does not appear in the sample, it follows that $S_j^i = S_j^{i'}$, and hence $A(S_j^i) = A(S_j^{i'})$. Therefore,

$$\mathbf{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} + \mathbf{1}_{[A(S_j^{i'})(v_r) \neq f_{i'}(v_r)]} = 1,$$

which yields

$$\frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} = \frac{1}{2}.$$
Therefore, for every $j \in [k]$,

$$\frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) \geq \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4},$$

and combining this with the earlier chain of inequalities,

$$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m}\left[L_{\mathcal{D}_i}(A(S))\right] \geq \min_{j \in [k]} \frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\!\left(A(S_j^i)\right) \geq \frac{1}{4},$$

which finishes the proof.
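As a final sanity check of the second condition itself, the sketch below computes, exactly and for one arbitrary learner (memorize the sample, predict 0 elsewhere), the worst-case probability over the $\mathcal{D}_i$ that the true error is at least $1/8$ when $m = 3$; it comes out at least $1/7$, as the theorem guarantees. All names are illustrative choices, not part of the proof.

```python
import itertools
from fractions import Fraction

m = 3
C = list(range(2 * m))
labelings = list(itertools.product([0, 1], repeat=len(C)))   # all f_i : C -> {0,1}
sequences = list(itertools.product(C, repeat=m))              # all (2m)^m instance sequences, equally likely under D_i^m

def learner(sample):
    """One arbitrary choice of A: memorize the sample, predict 0 on unseen points."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

def failure_probability(f):
    """P_{S ~ D_i^m}[ L_{D_i}(A(S)) >= 1/8 ], computed exactly over all sequences."""
    bad = 0
    for xs in sequences:
        h = learner([(x, f[x]) for x in xs])
        loss = Fraction(sum(h(x) != f[x] for x in C), len(C))
        bad += loss >= Fraction(1, 8)
    return Fraction(bad, len(sequences))

worst = max(failure_probability(f) for f in labelings)
print(worst, worst >= Fraction(1, 7))    # at least 1/7, matching Theorem 5.1
```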