General PAC Learning
In the previous PAC learning model, we made the realizability assumption and focused on binary classification. In the following, we generalize the PAC learning model in these two aspects.
Relaxing the Realizability Assumption
The realizability assumption is two-fold.
- We assume that there is a unique "target labeling function" $f : \mathcal{X} \to \mathcal{Y}$ determining the labels.
- We assume that there exists $h^* \in \mathcal{H}$ such that $L_{(\mathcal{D}, f)}(h^*) = 0$.
Both assumptions might not be realistic.
- The labels might not be determined solely by the features we extract. It is possible to get two data points with the same $x$ but different $y$.
- The hypotheses in the hypothesis class might not be strong enough to fit all the data.
To relax the realizability assumption, we will replace the "target labeling function" with a data-labels generating distribution.
- Formally, from now on, let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$.
- $\mathcal{D}$ is a joint distribution over domain points and labels.
- $\mathcal{D}$ can be viewed as being composed of two parts:
  - A distribution $\mathcal{D}_x$ over unlabeled domain points (sometimes called the marginal distribution).
  - A conditional probability over labels for each domain point, $\mathcal{D}((x, y) \mid x)$.
- Clearly, such modeling allows two data points that share the same feature to belong to different categories (see the sketch below).
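To make the decomposition concrete, here is a minimal Python sketch (the finite domain and the probabilities are made up for illustration) that samples pairs $(x, y)$ by first drawing $x$ from the marginal $\mathcal{D}_x$ and then drawing a label from the conditional $\mathcal{D}(\cdot \mid x)$; running it shows the same $x$ occurring with different labels.

```python
import random

# Hypothetical finite domain with a marginal D_x and a conditional P[y = 1 | x].
marginal = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
p_y1 = {"x1": 0.9, "x2": 0.5, "x3": 0.1}

def sample_joint(m, seed=0):
    """Draw m i.i.d. pairs (x, y) from D: sample x ~ D_x, then y from D(. | x)."""
    rng = random.Random(seed)
    xs = rng.choices(list(marginal), weights=list(marginal.values()), k=m)
    return [(x, int(rng.random() < p_y1[x])) for x in xs]

print(sample_joint(10))  # the same domain point can appear with label 0 and label 1
```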
The Empirical and the True Error Revised
- Previously, the true error was defined as:
  $$L_{(\mathcal{D}, f)}(h) := \underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq f(x)].$$
- Since we no longer have $f$, we redefine the true error as:
  $$L_{\mathcal{D}}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y] := \mathcal{D}(\{(x, y) : h(x) \neq y\}).$$
- The definition of the empirical error remains the same (see the sketch below):
  $$L_S(h) := \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}, \quad \text{where } S = ((x_1, y_1), \dots, (x_m, y_m)).$$
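As a quick illustration, here is a tiny Python sketch (the sample and the threshold hypothesis are made up) that computes $L_S(h)$ as the fraction of sample points on which $h$ disagrees with the label.

```python
# A made-up sample S = ((x_1, y_1), ..., (x_m, y_m)).
S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0), (0.6, 0)]

def empirical_error(h, sample):
    """L_S(h): fraction of sample points on which h(x_i) != y_i."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

h = lambda x: int(x >= 0.5)     # a hypothetical threshold hypothesis
print(empirical_error(h, S))    # 0.2, since only (0.6, 0) is mislabeled
```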
Agnostic PAC Learnability
- Previously, "approximately correct" was expressed in terms of $L_{(\mathcal{D}, f)}(h) \leq \epsilon$.
- Due to the relaxation of the realizability assumption, the learner can no longer guarantee an arbitrarily small true error.
- Instead, the best a learner can do is to approach the minimum true error achievable by the hypothesis class: $\min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h')$.
- For binary classification, the minimum possible true error is achieved by the Bayes optimal predictor:
  $$f_{\mathcal{D}}(x) = \begin{cases} 1 & \text{if } \mathbb{P}[y = 1 \mid x] \geq 1/2, \\ 0 & \text{otherwise.} \end{cases}$$
- Hence, $L_{\mathcal{D}}(f_{\mathcal{D}}) \leq L_{\mathcal{D}}(g)$ for every classifier $g$ (see the sketch below).
- The learning algorithm is expected to find a predictor that is as good as the Bayes optimal predictor, i.e., one that attains the minimum possible true error.
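The inequality above can be checked directly when the distribution is known. The following sketch (with a made-up finite distribution) computes the exact true error of the Bayes optimal predictor and of an arbitrary competing classifier.

```python
# Hypothetical finite distribution: marginal D_x and conditional P[y = 1 | x].
marginal = {"a": 0.4, "b": 0.35, "c": 0.25}
p_y1 = {"a": 0.8, "b": 0.3, "c": 0.55}

def bayes_predictor(x):
    """f_D(x) = 1 iff P[y = 1 | x] >= 1/2."""
    return 1 if p_y1[x] >= 0.5 else 0

def true_error(h):
    """L_D(h) = P_{(x, y) ~ D}[h(x) != y], computed exactly from the known distribution."""
    return sum(px * (p_y1[x] if h(x) == 0 else 1 - p_y1[x]) for x, px in marginal.items())

g = lambda x: 1                        # an arbitrary competing classifier
print(true_error(bayes_predictor))     # 0.2975
print(true_error(g))                   # 0.4375 >= 0.2975, as guaranteed
```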
Based on the new definition of the generalization error, we can formally define the agnostic PAC learning model as follows:
Definition 3.3 (Agnostic PAC learnability) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the $m$-tuple sample),
$$L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon.$$
- If the realizability assumption holds, agnostic PAC learning provides the same guarantee as PAC learning.
- When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error.
- Nevertheless, under the definition of agnostic PAC learning, a learner can still declare success if its error is not much larger than the best error achievable by a predictor from the class $\mathcal{H}$ (see the sketch below).
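One concrete way to pursue this guarantee is to pick the hypothesis with the smallest empirical error on the sample (the ERM rule). The sketch below is only illustrative: the finite class of threshold classifiers and the noisy data source are invented, and no sample-complexity claim is made.

```python
import random

def empirical_error(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypothesis_class, sample):
    """Return a hypothesis with minimal empirical error on the sample."""
    return min(hypothesis_class, key=lambda h: empirical_error(h, sample))

# A hypothetical finite class of threshold classifiers on [0, 1].
H = [(lambda x, t=t: int(x >= t)) for t in [i / 10 for i in range(11)]]

# A non-realizable data source: y = 1[x >= 0.5], flipped with probability 0.1.
rng = random.Random(0)
def draw(m):
    return [(x, int((x >= 0.5) ^ (rng.random() < 0.1))) for x in (rng.random() for _ in range(m))]

h_hat = erm(H, draw(200))
print(empirical_error(h_hat, draw(5000)))  # close to the best achievable error (about 0.1)
```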
The Scope of Learning Problems Modeled
We next extend our model so that it can be applied to a wide variety of learning tasks beyond binary classification.
- Multiclass Classification: We can generalize the binary classification task, where $\mathcal{Y} = \{0, 1\}$, to multiclass classification simply by letting $\mathcal{Y}$ be a larger finite set. An example is document classification.
- Regression: In this task, one wishes to find some simple pattern in the data, namely a functional relationship between the $\mathcal{X}$ and $\mathcal{Y}$ components of the data.
  - As in classification tasks, the learner still gets a finite sequence of $(x, y)$ pairs and is required to output a function from $\mathcal{X}$ to $\mathcal{Y}$.
  - The loss of a hypothesis $h$ should be defined differently. A plausible option is the expected square difference (see the sketch after this list):
    $$L_{\mathcal{D}}(h) := \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}\,(h(x) - y)^2.$$
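As a small illustration (the data source and the hypothesis below are made up), the expected square difference can be approximated by averaging $(h(x) - y)^2$ over a large i.i.d. sample:

```python
import random

rng = random.Random(0)

def draw(m):
    """Hypothetical regression data: y = 2x + Gaussian noise with std 0.1."""
    return [(x, 2 * x + rng.gauss(0, 0.1)) for x in (rng.uniform(0, 1) for _ in range(m))]

h = lambda x: 2 * x                 # a candidate regression hypothesis
sample = draw(100_000)
# Monte Carlo estimate of L_D(h) = E[(h(x) - y)^2]; it approaches the noise
# variance 0.1 ** 2 = 0.01 because h matches the underlying trend.
print(sum((h(x) - y) ** 2 for x, y in sample) / len(sample))
```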
Generalized Loss Functions
To accommodate a wide range of learning tasks, we generalize our formalism of the loss of a hypothesis as follows:
- Given any set $\mathcal{H}$ (the hypothesis class) and some domain $Z$, let $\ell$ be any function from $\mathcal{H} \times Z$ to the set of nonnegative real numbers, $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$.
  - For classification and regression tasks, $Z = \mathcal{X} \times \mathcal{Y}$.
  - For unsupervised learning tasks, where the true labels are not accessible, $Z = \mathcal{X}$.
- The risk function is then defined to be the expected loss of a classifier $h \in \mathcal{H}$ with respect to $\mathcal{D}$:
  $$L_{\mathcal{D}}(h) := \underset{z \sim \mathcal{D}}{\mathbb{E}}[\ell(h, z)].$$
- The empirical risk is defined as the expected loss over a given sample $S = (z_1, \dots, z_m) \in Z^m$:
  $$L_S(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i).$$
- The loss functions used in the preceding examples of classification and regression tasks are as follows (see the sketch after this list):
  - 0-1 loss ($\ell_{0\text{-}1}$):
    $$\ell_{0\text{-}1}(h, (x, y)) := \begin{cases} 0 & \text{if } h(x) = y, \\ 1 & \text{if } h(x) \neq y. \end{cases}$$
  - Square loss ($\ell_{\mathrm{sq}}$):
    $$\ell_{\mathrm{sq}}(h, (x, y)) := (h(x) - y)^2.$$
- We should note that for classification, the expectation-based definition coincides with the previous definition, since the expectation of the indicator of an event equals the probability of that event. Proof:
  $$L_{\mathcal{D}}(h) = \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}[\ell_{0\text{-}1}(h, (x, y))] = \underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y].$$
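To tie the pieces together, here is a minimal sketch of the generalized formalism (the helper names are ours): each loss takes a hypothesis and a point $z = (x, y)$, and the empirical risk is the average loss over the sample. With the 0-1 loss this reproduces the empirical error computed earlier.

```python
def zero_one_loss(h, z):
    """l_{0-1}(h, (x, y)): 1 if h misclassifies the point, 0 otherwise."""
    x, y = z
    return 0 if h(x) == y else 1

def square_loss(h, z):
    """l_sq(h, (x, y)) = (h(x) - y)^2."""
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(h, sample, loss):
    """L_S(h) = (1/m) * sum_i loss(h, z_i)."""
    return sum(loss(h, z) for z in sample) / len(sample)

S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0), (0.6, 0)]
h = lambda x: int(x >= 0.5)
print(empirical_risk(h, S, zero_one_loss))  # 0.2, matching the earlier empirical error
print(empirical_risk(h, S, square_loss))    # also 0.2 here, since h outputs 0 or 1
```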
To summarize, we formally define agnostic PAC learnability for general loss functions as:
Definition 3.4 (Agnostic PAC Learnability for General Loss Functions) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable with respect to a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$ if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $Z$, when running the algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the $m$-tuple sample),
$$L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon,$$
where $L_{\mathcal{D}}(h) = \underset{z \sim \mathcal{D}}{\mathbb{E}}[\ell(h, z)]$.