General PAC Learning
In the previous PAC learning model, we made the realizability assumption and focused on binary classification. In the following, we generalize the PAC learning model in these two aspects.
Relaxing the Realizability Assumption
The realizability assumption is two-fold.
- We assume that there is a unique "target labeling function" $f : \mathcal{X} \to \mathcal{Y}$ determining the labels.
- We assume that there exists $h^* \in \mathcal{H}$ such that $L_{(\mathcal{D}, f)}(h^*) = 0$.
Both assumptions might not be realistic.
- The labels might not be determined solely by the features we extract. It is possible to get two data points with the same $x$ but different $y$.
- The hypotheses in the hypothesis class might not be strong enough to fit all the data.
To relax the realizability assumption, we will replace the "target labeling function" with a data-labels generating distribution.
- Formally, from now on, let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$.
- $\mathcal{D}$ is a joint distribution over domain points and labels.
- $\mathcal{D}$ can be viewed as being composed of two parts:
  - A distribution $\mathcal{D}_x$ over unlabeled domain points (sometimes called the marginal distribution).
  - A conditional probability over labels for each domain point, $\mathcal{D}((x, y) \mid x)$.
- Clearly, such modeling allows two data points that share the same feature to belong to different categories (see the sketch below).
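To make the decomposition concrete, here is a minimal Python sketch (the finite domain and the probabilities are made up for illustration) that samples pairs $(x, y)$ by first drawing $x$ from the marginal $\mathcal{D}_x$ and then drawing a label from the conditional $\mathcal{D}(\cdot \mid x)$; running it shows the same $x$ occurring with different labels.

```python
import random

# Hypothetical finite domain with a marginal D_x and a conditional P[y = 1 | x].
marginal = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
p_y1 = {"x1": 0.9, "x2": 0.5, "x3": 0.1}

def sample_joint(m, seed=0):
    """Draw m i.i.d. pairs (x, y) from D: sample x ~ D_x, then y from D(. | x)."""
    rng = random.Random(seed)
    xs = rng.choices(list(marginal), weights=list(marginal.values()), k=m)
    return [(x, int(rng.random() < p_y1[x])) for x in xs]

print(sample_joint(10))  # the same domain point can appear with label 0 and label 1
```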
The Empirical and the True Error Revised
- Previously, the true error was defined as:
  $$L_{(\mathcal{D}, f)}(h) := \underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq f(x)].$$
- Since we no longer have $f$, we redefine the true error as:
  $$L_{\mathcal{D}}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y] := \mathcal{D}(\{(x, y) : h(x) \neq y\}).$$
- The definition of the empirical error remains the same (see the sketch below):
  $$L_S(h) := \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}, \quad \text{where } S = ((x_1, y_1), \dots, (x_m, y_m)).$$
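As a quick illustration, here is a tiny Python sketch (the sample and the threshold hypothesis are made up) that computes $L_S(h)$ as the fraction of sample points on which $h$ disagrees with the label.

```python
# A made-up sample S = ((x_1, y_1), ..., (x_m, y_m)).
S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0), (0.6, 0)]

def empirical_error(h, sample):
    """L_S(h): fraction of sample points on which h(x_i) != y_i."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

h = lambda x: int(x >= 0.5)     # a hypothetical threshold hypothesis
print(empirical_error(h, S))    # 0.2, since only (0.6, 0) is mislabeled
```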
Agnostic PAC Learnability
- Previously, "approximately correct" was expressed in terms of $L_{(\mathcal{D}, f)}(h) \leq \epsilon$.
- Due to the relaxation of the realizability assumption, the learner can no longer guarantee an arbitrarily small true error.
- Instead, the best a learner can do is to approach the minimum true error achievable by the hypothesis class: $\min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h')$.
- For binary classification, the minimum possible true error is achieved by the Bayes optimal predictor:
  $$f_{\mathcal{D}}(x) = \begin{cases} 1 & \text{if } \mathbb{P}[y = 1 \mid x] \geq 1/2, \\ 0 & \text{otherwise.} \end{cases}$$
- Hence, $L_{\mathcal{D}}(f_{\mathcal{D}}) \leq L_{\mathcal{D}}(g)$ for every classifier $g$ (see the sketch below).
- The learning algorithm is expected to find a predictor that is as good as the Bayes optimal predictor, i.e., one that attains the minimum possible true error.
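The inequality above can be checked directly when the distribution is known. The following sketch (with a made-up finite distribution) computes the exact true error of the Bayes optimal predictor and of an arbitrary competing classifier.

```python
# Hypothetical finite distribution: marginal D_x and conditional P[y = 1 | x].
marginal = {"a": 0.4, "b": 0.35, "c": 0.25}
p_y1 = {"a": 0.8, "b": 0.3, "c": 0.55}

def bayes_predictor(x):
    """f_D(x) = 1 iff P[y = 1 | x] >= 1/2."""
    return 1 if p_y1[x] >= 0.5 else 0

def true_error(h):
    """L_D(h) = P_{(x, y) ~ D}[h(x) != y], computed exactly from the known distribution."""
    return sum(px * (p_y1[x] if h(x) == 0 else 1 - p_y1[x]) for x, px in marginal.items())

g = lambda x: 1                        # an arbitrary competing classifier
print(true_error(bayes_predictor))     # 0.2975
print(true_error(g))                   # 0.4375 >= 0.2975, as guaranteed
```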
Based on the new definition of the generalization error, we can formally define the agnostic PAC learning model as follows:
Definition 3.3 (Agnostic PAC learnability) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the $m$-tuple sample),
$$L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon.$$
- If the realizability assumption holds, agnostic PAC learning provides the same guarantee as PAC learning.
- When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error.
- Nevertheless, under the definition of agnostic PAC learning, a learner can still declare success if its error is not much larger than the best error achievable by a predictor from the class $\mathcal{H}$ (see the sketch below).
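One concrete way to pursue this guarantee is to pick the hypothesis with the smallest empirical error on the sample (the ERM rule). The sketch below is only illustrative: the finite class of threshold classifiers and the noisy data source are invented, and no sample-complexity claim is made.

```python
import random

def empirical_error(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypothesis_class, sample):
    """Return a hypothesis with minimal empirical error on the sample."""
    return min(hypothesis_class, key=lambda h: empirical_error(h, sample))

# A hypothetical finite class of threshold classifiers on [0, 1].
H = [(lambda x, t=t: int(x >= t)) for t in [i / 10 for i in range(11)]]

# A non-realizable data source: y = 1[x >= 0.5], flipped with probability 0.1.
rng = random.Random(0)
def draw(m):
    return [(x, int((x >= 0.5) ^ (rng.random() < 0.1))) for x in (rng.random() for _ in range(m))]

h_hat = erm(H, draw(200))
print(empirical_error(h_hat, draw(5000)))  # close to the best achievable error (about 0.1)
```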
The Scope of Learning Problems Modeled
We next extend our model so that it can be applied to a wide variety of learning tasks beyond binary classification.
- Multiclass Classification: We can generalize the binary classification task, where $\mathcal{Y} = \{0, 1\}$, to multiclass classification simply by letting $\mathcal{Y}$ be a larger finite set. An example is document classification.
- Regression: In this task, one wishes to find some simple pattern in the data, namely a functional relationship between the $\mathcal{X}$ and $\mathcal{Y}$ components of the data.
  - As in classification tasks, the learner still gets a finite sequence of $(x, y)$ pairs and is required to output a function from $\mathcal{X}$ to $\mathcal{Y}$.
  - The loss of a hypothesis $h$ should be defined differently. A plausible option is the expected square difference (see the sketch after this list):
    $$L_{\mathcal{D}}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}\,(h(x) - y)^2.$$
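As a small illustration (the data source and the hypothesis below are made up), the expected square difference can be approximated by averaging $(h(x) - y)^2$ over a large i.i.d. sample:

```python
import random

rng = random.Random(0)

def draw(m):
    """Hypothetical regression data: y = 2x + Gaussian noise with std 0.1."""
    return [(x, 2 * x + rng.gauss(0, 0.1)) for x in (rng.uniform(0, 1) for _ in range(m))]

h = lambda x: 2 * x                 # a candidate regression hypothesis
sample = draw(100_000)
# Monte Carlo estimate of L_D(h) = E[(h(x) - y)^2]; it approaches the noise
# variance 0.1 ** 2 = 0.01 because h matches the underlying trend.
print(sum((h(x) - y) ** 2 for x, y in sample) / len(sample))
```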
Generalized Loss Functions
To accommodate a wide range of learning tasks, we generalize our formalism of the loss of a hypothesis as follows:
- Given any set $\mathcal{H}$ (the hypothesis class) and some domain $Z$, let $\ell$ be any function from $\mathcal{H} \times Z$ to the set of nonnegative real numbers, $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$.
  - For classification and regression tasks, $Z = \mathcal{X} \times \mathcal{Y}$.
  - For unsupervised learning tasks, where the true labels are not accessible, $Z = \mathcal{X}$.
- The risk function is then defined to be the expected loss of a classifier $h \in \mathcal{H}$ with respect to $\mathcal{D}$:
  $$L_{\mathcal{D}}(h) := \underset{z \sim \mathcal{D}}{\mathbb{E}}[\ell(h, z)].$$
- The empirical risk is defined as the expected loss over a given sample $S = (z_1, \dots, z_m) \in Z^m$:
  $$L_S(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i).$$
- The loss functions used in the preceding examples of classification and regression tasks are as follows (see the sketch after this list):
  - 0-1 loss ($\ell_{0\text{-}1}$):
    $$\ell_{0\text{-}1}(h, (x, y)) := \begin{cases} 0 & \text{if } h(x) = y, \\ 1 & \text{if } h(x) \neq y. \end{cases}$$
  - Square loss ($\ell_{\mathrm{sq}}$):
    $$\ell_{\mathrm{sq}}(h, (x, y)) := (h(x) - y)^2.$$
- We should note that for classification, the expectation-based definition coincides with the previous definition, since the expectation of the indicator of an event equals the probability of that event. Proof:
  $$L_{\mathcal{D}}(h) = \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}[\ell_{0\text{-}1}(h, (x, y))] = \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y].$$
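To tie the pieces together, here is a minimal sketch of the generalized formalism (the helper names are ours): each loss takes a hypothesis and a point $z = (x, y)$, and the empirical risk is the average loss over the sample. With the 0-1 loss this reproduces the empirical error computed earlier.

```python
def zero_one_loss(h, z):
    """l_{0-1}(h, (x, y)): 1 if h misclassifies the point, 0 otherwise."""
    x, y = z
    return 0 if h(x) == y else 1

def square_loss(h, z):
    """l_sq(h, (x, y)) = (h(x) - y)^2."""
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(h, sample, loss):
    """L_S(h) = (1/m) * sum_i loss(h, z_i)."""
    return sum(loss(h, z) for z in sample) / len(sample)

S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0), (0.6, 0)]
h = lambda x: int(x >= 0.5)
print(empirical_risk(h, S, zero_one_loss))  # 0.2, matching the earlier empirical error
print(empirical_risk(h, S, square_loss))    # also 0.2 here, since h outputs 0 or 1
```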
To summarize, we formally define agnostic PAC learnability for general loss functions as:
Definition 3.4 (Agnostic PAC Learnability for General Loss Functions) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable with respect to a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$ if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $Z$, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the $m$-tuple sample),
$$L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon,$$
where $L_{\mathcal{D}}(h) = \underset{z \sim \mathcal{D}}{\mathbb{E}}[\ell(h, z)]$.