Vapnik-Chervonenkis theory

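Vapnik-Chervonenkis theory (also known as VC theory) was developed during 1960-1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

VC theory is related to statistical learning theory and to empirical processes. Richard M. Dudley and Vladimir Vapnik, among others, have applied VC theory to empirical processes.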

Introduction

VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory^[1]):

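Theory of consistency of learning processes
- What are the (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle?
Nonasymptotic theory of the rate of convergence of learning processes
- How fast is the rate of convergence of the learning process?
Theory of controlling the generalization ability of learning processes
- How can one control the rate of convergence (the generalization ability) of the learning process?
Theory of constructing learning machines
- How can one construct algorithms that can control the generalization ability?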

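VC theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization.

In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of VC theory, and they are employed in proving generalization. Several techniques that are widely used in empirical processes and VC theory are introduced below. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics.^[2]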

Overview of VC theory in empirical processes

Background on empirical processes

Let $X_{1},\ldots ,X_{n}$ be independent, identically distributed random elements of a measurable space $({\mathcal {X}},{\mathcal {A}})$. For any measure $Q$ on $({\mathcal {X}},{\mathcal {A}})$ and any measurable function $f:{\mathcal {X}}\to \mathbf {R}$, define

Qf=\int f\,dQ

Measurability issues are ignored here; the references give more technical detail. Let ${\mathcal {F}}$ be a class of measurable functions $f:{\mathcal {X}}\to \mathbf {R}$ and define:

\|Q\|_{\mathcal {F}}=\sup\{\vert Qf\vert \ :\ f\in {\mathcal {F}}\}

Define the empirical measure

\mathbb {P} _{n}=n^{-1}\sum _{i=1}^{n}\delta _{X_{i}},

where $\delta$ denotes the Dirac measure. The empirical measure induces a map ${\mathcal {F}}\to \mathbf {R}$ given by:

f\mapsto \mathbb {P} _{n}f={\frac {1}{n}}(f(X_{1})+\cdots +f(X_{n}))
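
The map $f\mapsto \mathbb {P} _{n}f$ is simply a sample average. A minimal Python sketch is given here for illustration; the function name `empirical_measure`, the Uniform(0, 1) data and the indicator function are assumptions made for this example and do not come from the source.

```python
import random

def empirical_measure(sample, f):
    """Return P_n f = (1/n) * sum_i f(X_i) for a list of observations."""
    return sum(f(x) for x in sample) / len(sample)

# Hypothetical data: X_i ~ Uniform(0, 1), and f the indicator of [0, 0.5],
# so that the true value is P f = 0.5.
rng = random.Random(0)
sample = [rng.random() for _ in range(10_000)]
indicator = lambda x: 1.0 if x <= 0.5 else 0.0

p_n_f = empirical_measure(sample, indicator)
print(p_n_f)  # a value near 0.5
```

For a single fixed $f$ the average settles near $Pf$ by the ordinary law of large numbers; the uniform statements discussed next ask for this simultaneously over a whole class of functions.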

Now suppose $P$ is the underlying true distribution of the data, which is unknown. Empirical process theory aims at identifying classes ${\mathcal {F}}$ for which statements such as the following hold:

Uniform law of large numbers:

\|\mathbb {P} _{n}-P\|_{\mathcal {F}}{\underset {n}{\to }}0,

that is, as $n\to \infty$,

\left|{\frac {1}{n}}(f(X_{1})+\cdots +f(X_{n}))-\int f\,dP\right|\to 0

uniformly for all $f\in {\mathcal {F}}$.

Uniform central limit theorem:

\mathbb {G} _{n}={\sqrt {n}}(\mathbb {P} _{n}-P)\rightsquigarrow \mathbb {G} ,\quad {\text{in }}\ell ^{\infty }({\mathcal {F}})

In the former case ${\mathcal {F}}$ is called a Glivenko-Cantelli class, and in the latter case (under the assumption $\forall x,\sup \nolimits _{f\in {\mathcal {F}}}\vert f(x)-Pf\vert <\infty$) the class ${\mathcal {F}}$ is called Donsker or P-Donsker. A Donsker class is Glivenko-Cantelli in probability by an application of Slutsky's theorem.
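
The uniform law of large numbers can be observed numerically for one simple class. The sketch below assumes Uniform(0, 1) data and takes ${\mathcal {F}}$ to be the half-line indicators $1\{x\leq t\}$, so that $\|\mathbb {P} _{n}-P\|_{\mathcal {F}}$ is the largest gap between the empirical distribution function and the true one. The helper name `sup_deviation_uniform` is hypothetical.

```python
import random

def sup_deviation_uniform(sample):
    """Exact sup_t |F_n(t) - t| for data assumed to be Uniform(0, 1).

    The supremum is attained at a data point, either at the jump of the
    empirical distribution function or just before it.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
deviations = {
    n: sup_deviation_uniform([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
for n, d in deviations.items():
    print(n, d)  # the supremum shrinks as n grows
```

This only illustrates the statement for one class and one distribution; it is not a proof that the class is Glivenko-Cantelli.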

These statements are true for a single $f$, by standard LLN and CLT arguments under regularity conditions. The difficulty in empirical processes arises because joint statements are being made for all $f\in {\mathcal {F}}$. Intuitively, then, the set ${\mathcal {F}}$ cannot be too large, and it turns out that the geometry of ${\mathcal {F}}$ plays a very important role.

One way of measuring how big the function set ${\mathcal {F}}$ is uses the so-called covering numbers. The covering number

N(\varepsilon ,{\mathcal {F}},\|\cdot \|)

is the minimal number of balls $\{g:\|g-f\|<\varepsilon \}$ needed to cover the set ${\mathcal {F}}$ (here it is assumed that there is an underlying norm on ${\mathcal {F}}$). The entropy is the logarithm of the covering number.
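
Covering numbers are rarely computed exactly, but an upper bound is easy to obtain for a finite class: build a cover greedily and count its centers. The sketch below is an illustration under stated assumptions, not a method from the source. Functions are represented by their values on a finite grid, the norm is $L_{1}(Q)$ with $Q$ uniform on that grid, and the names `greedy_cover_size` and `l1_empirical` are hypothetical.

```python
def greedy_cover_size(vectors, eps, dist):
    """Size of a greedily built eps-cover, an upper bound on N(eps, F, dist).

    Every vector ends up either as a center or strictly within eps of an
    earlier center, so the open eps-balls around the centers cover the class.
    """
    centers = []
    for v in vectors:
        if all(dist(v, c) >= eps for c in centers):
            centers.append(v)  # v is not covered yet, so it becomes a center
    return len(centers)

def l1_empirical(u, v):
    """L1(Q) distance when Q is the uniform measure on the evaluation points."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

# Hypothetical class: threshold indicators 1{x <= t} evaluated on 100 grid points.
grid = [i / 100 for i in range(100)]
thresholds = [j / 100 for j in range(101)]
vectors = [[1.0 if x <= t else 0.0 for x in grid] for t in thresholds]

sizes = {eps: greedy_cover_size(vectors, eps, l1_empirical) for eps in (0.5, 0.1)}
print(sizes)  # fewer balls are needed at the coarser scale
```

The greedy count depends on the order in which functions are visited, so it should be read as an upper bound on the covering number rather than its exact value.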

Two sufficient conditions are provided below, under which it can be proved that the set ${\mathcal {F}}$ is Glivenko-Cantelli or Donsker.

A class ${\mathcal {F}}$ is P-Glivenko-Cantelli if it is P-measurable with envelope $F$ such that $P^{\ast }F<\infty$ and satisfies:

\forall \varepsilon >0\quad \sup \nolimits _{Q}N(\varepsilon \|F\|_{Q},{\mathcal {F}},L_{1}(Q))<\infty .

The next condition is a version of the celebrated Dudley's theorem. If ${\mathcal {F}}$ is a class of functions such that

\int _{0}^{\infty }\sup \nolimits _{Q}{\sqrt {\log N\left(\varepsilon \|F\|_{Q,2},{\mathcal {F}},L_{2}(Q)\right)}}\,d\varepsilon <\infty

then ${\mathcal {F}}$ is P-Donsker for every probability measure $P$ such that $P^{\ast }F^{2}<\infty$. In the last integral, the notation means

\|f\|_{Q,2}=\left(\int |f|^{2}dQ\right)^{\frac {1}{2}}

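Symmetrization

Most arguments for bounding the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality, which is discussed in the next section), it is presented here.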

Consider the empirical process:

f\mapsto (\mathbb {P} _{n}-P)f={\dfrac {1}{n}}\sum _{i=1}^{n}(f(X_{i})-Pf)

It turns out that there is a connection between the empirical process and the following symmetrized process:

f\mapsto \mathbb {P} _{n}^{0}f={\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(X_{i})

The symmetrized process is a Rademacher process, conditionally on the data $X_{i}$. Therefore, it is a sub-Gaussian process by Hoeffding's inequality.

Lemma (Symmetrization). For every nondecreasing, convex $\Phi :\mathbf {R} \to \mathbf {R}$ and class of measurable functions ${\mathcal {F}}$,

\mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} \Phi \left(2\left\|\mathbb {P} _{n}^{0}\right\|_{\mathcal {F}}\right)
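
The lemma can be checked by simulation in a simple case. The Monte Carlo sketch below assumes $\Phi (x)=x$, Uniform(0, 1) data and the class of half-line indicators $1\{x\leq t\}$; none of these choices come from the source, and the helper names are hypothetical. For this class the supremum of the symmetrized process is the largest absolute partial sum of the signs, taken along the sorted sample.

```python
import random

def empirical_sup(xs_sorted):
    """sup_t |F_n(t) - t| for sorted data assumed to be Uniform(0, 1)."""
    n = len(xs_sorted)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs_sorted))

def rademacher_sup(xs_sorted, rng):
    """sup_t |(1/n) sum_i eps_i 1{X_i <= t}| with independent random signs."""
    n, partial, best = len(xs_sorted), 0, 0
    for _ in xs_sorted:                   # one independent sign per observation
        partial += rng.choice((-1, 1))    # running sum over points <= t
        best = max(best, abs(partial))
    return best / n

rng = random.Random(0)
n, reps = 200, 500
lhs = rhs = 0.0
for _ in range(reps):
    xs = sorted(rng.random() for _ in range(n))
    lhs += empirical_sup(xs) / reps            # estimates E ||P_n - P||_F
    rhs += 2 * rademacher_sup(xs, rng) / reps  # estimates E 2 ||P_n^0||_F

print(lhs, rhs)  # the first estimate should not exceed the second
```

The simulation only estimates both sides for one class; the lemma itself holds for every class of measurable functions.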

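The proof of the symmetrization lemma relies on introducing independent copies of the original variables $X_{i}$ and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation. The proof is given below because of its instructive nature.

Proof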

Introduce the "ghost sample" $Y_{1},\ldots ,Y_{n}$ to be independent copies of $X_{1},\ldots ,X_{n}$ . For fixed values of $X_{1},\ldots ,X_{n}$ one has:

\|\mathbb {P} _{n}-P\|_{\mathcal {F}}=\sup _{f\in {\mathcal {F}}}{\dfrac {1}{n}}\left|\sum _{i=1}^{n}\left(f(X_{i})-\mathbb {E} f(Y_{i})\right)\right|\leq \mathbb {E} _{Y}\sup _{f\in {\mathcal {F}}}{\dfrac {1}{n}}\left|\sum _{i=1}^{n}\left(f(X_{i})-f(Y_{i})\right)\right|

Therefore, by Jensen's inequality:

\Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} _{Y}\Phi \left(\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}\left(f(X_{i})-f(Y_{i})\right)\right\|_{\mathcal {F}}\right)

Taking the expectation with respect to $X$ gives:

\mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} _{X}\mathbb {E} _{Y}\Phi \left(\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}\left(f(X_{i})-f(Y_{i})\right)\right\|_{\mathcal {F}}\right)

Note that adding a minus sign in front of a term $f(X_{i})-f(Y_{i})$ doesn't change the RHS, because it's a symmetric function of $X$ and $Y$ . Therefore, the RHS remains the same under "sign perturbation":

\mathbb {E} \Phi \left(\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}e_{i}\left(f(X_{i})-f(Y_{i})\right)\right\|_{\mathcal {F}}\right)

for any $(e_{1},e_{2},\ldots ,e_{n})\in \{-1,1\}^{n}$. Therefore:

\mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} _{\varepsilon }\mathbb {E} \Phi \left(\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}\left(f(X_{i})-f(Y_{i})\right)\right\|_{\mathcal {F}}\right)

Finally, using first the triangle inequality and then the convexity of $\Phi$ gives:

\mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq {\dfrac {1}{2}}\mathbb {E} _{\varepsilon }\mathbb {E} \Phi \left(2\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(X_{i})\right\|_{\mathcal {F}}\right)+{\dfrac {1}{2}}\mathbb {E} _{\varepsilon }\mathbb {E} \Phi \left(2\left\|{\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(Y_{i})\right\|_{\mathcal {F}}\right)

The last two expressions on the RHS are equal, because each $Y_{i}$ has the same distribution as $X_{i}$, which concludes the proof.

A typical way of proving empirical CLTs first uses symmetrization to pass from the empirical process to $\mathbb {P} _{n}^{0}$ and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

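VC connection

It turns out that there is a fascinating connection between certain combinatorial properties of the set ${\mathcal {F}}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik-Chervonenkis classes of sets, or VC sets for short.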

Consider a collection ${\mathcal {C}}$ of subsets of the sample space ${\mathcal {X}}$. ${\mathcal {C}}$ is said to pick out a certain subset $W$ of the finite set $S=\{x_{1},\ldots ,x_{n}\}\subset {\mathcal {X}}$ if $W=S\cap C$ for some $C\in {\mathcal {C}}$. ${\mathcal {C}}$ is said to shatter $S$ if it picks out each of its $2^{n}$ subsets. The VC-index $V({\mathcal {C}})$ of ${\mathcal {C}}$ (similar to VC dimension + 1 for an appropriately chosen classifier set) is the smallest $n$ for which no set of size $n$ is shattered by ${\mathcal {C}}$.
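
The definitions of picking out and shattering translate directly into a brute-force check for small finite examples. The sketch below uses closed intervals on the real line as a hypothetical collection ${\mathcal {C}}$; the function names and the grid of endpoints are assumptions made for illustration.

```python
def picked_out_subsets(points, sets):
    """All subsets W = S ∩ C of `points` picked out by some C in `sets`.

    Each C is represented by a membership predicate x -> bool.
    """
    return {frozenset(x for x in points if c(x)) for c in sets}

def shatters(points, sets):
    """True if the collection picks out all 2^n subsets of the distinct `points`."""
    return len(picked_out_subsets(points, sets)) == 2 ** len(points)

# Hypothetical collection: closed intervals [a, b] with endpoints on a small grid.
# Pairs with a > b give the empty set.
grid = [k / 2 for k in range(9)]  # 0.0, 0.5, ..., 4.0
intervals = [lambda x, a=a, b=b: a <= x <= b for a in grid for b in grid]

print(shatters([1, 2], intervals))     # True: all four subsets are picked out
print(shatters([1, 2, 3], intervals))  # False: {1, 3} cannot be picked out
```

No three points on a line can be shattered by intervals, since an interval containing the two outer points also contains the middle one, so the VC-index of this collection is 3.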

Sauer's lemma then states that the number $\Delta _{n}({\mathcal {C}},x_{1},\ldots ,x_{n})$ of subsets picked out by a VC class ${\mathcal {C}}$ satisfies:

\max _{x_{1},\ldots ,x_{n}}\Delta _{n}({\mathcal {C}},x_{1},\ldots ,x_{n})\leq \sum _{j=0}^{V({\mathcal {C}})-1}{n \choose j}\leq \left({\frac {ne}{V({\mathcal {C}})-1}}\right)^{V({\mathcal {C}})-1}

That is, the number of picked-out subsets is polynomial, $O(n^{V({\mathcal {C}})-1})$, rather than exponential. Intuitively, this means that a finite VC-index implies that ${\mathcal {C}}$ has an apparently simple structure.
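
The two bounds in Sauer's lemma can be compared with an exact count in a case where the count is known in closed form. The sketch below assumes the collection of intervals on the real line, with VC-index 3; the function names are hypothetical.

```python
from math import comb, e

def sauer_sum(n, vc_index):
    """sum_{j=0}^{V-1} C(n, j), the first bound in Sauer's lemma."""
    return sum(comb(n, j) for j in range(vc_index))

def sauer_power(n, vc_index):
    """(n e / (V - 1))^(V - 1), the cruder bound; assumes n >= V - 1 >= 1."""
    d = vc_index - 1
    return (n * e / d) ** d

def count_interval_subsets(n):
    """Number of subsets of n distinct points on a line picked out by intervals.

    Every non-empty picked-out subset is a run of consecutive points; the
    extra 1 accounts for the empty set.
    """
    return n * (n + 1) // 2 + 1

V = 3  # VC-index of intervals on the line (an assumption of this example)
rows = [(n, count_interval_subsets(n), sauer_sum(n, V), sauer_power(n, V))
        for n in (3, 10, 100)]
for row in rows:
    print(row)  # exact count, binomial-sum bound, power bound
```

For intervals the binomial-sum bound is attained exactly, while $2^{n}$ grows far faster than either bound.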

A similar bound can be shown (with a different constant, same rate) for the so-called VC subgraph classes. For a function $f:{\mathcal {X}}\to \mathbf {R}$ the subgraph is the subset of ${\mathcal {X}}\times \mathbf {R}$ given by $\{(x,t):t<f(x)\}$. A collection ${\mathcal {F}}$ is called a VC subgraph class if all subgraphs form a VC class.

Consider a set of indicator functions ${\mathcal {I}}_{\mathcal {C}}=\{1_{C}:C\in {\mathcal {C}}\}$ in $L_{1}(Q)$ for a discrete empirical type of measure $Q$ (or, equivalently, for any probability measure $Q$). It can then be shown that, quite remarkably, for $r\geq 1$:

N(\varepsilon ,{\mathcal {I}}_{\mathcal {C}},L_{r}(Q))\leq KV({\mathcal {C}})(4e)^{V({\mathcal {C}})}\varepsilon ^{-r(V({\mathcal {C}})-1)}

Further consider the symmetric convex hull of a set ${\mathcal {F}}$: $\operatorname {sconv} {\mathcal {F}}$ is the collection of functions of the form $\sum _{i=1}^{m}\alpha _{i}f_{i}$ with $\sum _{i=1}^{m}|\alpha _{i}|\leq 1$. Then if

N\left(\varepsilon \|F\|_{Q,2},{\mathcal {F}},L_{2}(Q)\right)\leq C\varepsilon ^{-V}

the following is valid for the convex hull of ${\mathcal {F}}$:

\log N\left(\varepsilon \|F\|_{Q,2},\operatorname {sconv} {\mathcal {F}},L_{2}(Q)\right)\leq K\varepsilon ^{-{\frac {2V}{V+2}}}

The important consequence of this fact is that

{\frac {2V}{V+2}}<2,

which is just enough for the entropy integral to converge, and therefore the class $\operatorname {sconv} {\mathcal {F}}$ is P-Donsker.
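
To see why an exponent strictly below 2 is enough, the entropy bound can be substituted into the entropy integral. The short calculation below is a sketch: it writes $K$ for the constant in the entropy bound and assumes the integrand vanishes for $\varepsilon >1$, since a single ball of radius $\varepsilon \|F\|_{Q,2}$ then covers a class bounded by the envelope $F$.

```latex
\int_{0}^{1}\sqrt{K\,\varepsilon^{-\frac{2V}{V+2}}}\,d\varepsilon
  = \sqrt{K}\int_{0}^{1}\varepsilon^{-\frac{V}{V+2}}\,d\varepsilon
  = \sqrt{K}\,\frac{V+2}{2} < \infty
```

The square root halves the exponent, and $\varepsilon ^{-a}$ is integrable near 0 exactly when $a<1$, which here is the condition ${\frac {2V}{V+2}}<2$.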

Finally, an example of a VC-subgraph class is considered. Any finite-dimensional vector space ${\mathcal {F}}$ of measurable functions $f:{\mathcal {X}}\to \mathbf {R}$ is VC-subgraph of index smaller than or equal to $\dim({\mathcal {F}})+2$.

Proof: Take $n=\dim({\mathcal {F}})+2$ points $(x_{1},t_{1}),\ldots ,(x_{n},t_{n})$. The vectors:

(f(x_{1}),\ldots ,f(x_{n}))-(t_{1},\ldots ,t_{n})

are in an $(n-1)$-dimensional subspace of $\mathbf {R} ^{n}$. Take $a\neq 0$, a vector that is orthogonal to this subspace. Therefore:

\sum _{a_{i}>0}a_{i}(f(x_{i})-t_{i})=\sum _{a_{i}<0}(-a_{i})(f(x_{i})-t_{i}),\quad \forall f\in {\mathcal {F}}

Consider the set $S=\{(x_{i},t_{i}):a_{i}>0\}$. This set cannot be picked out: if there were some $f$ such that $S=\{(x_{i},t_{i}):f(x_{i})>t_{i}\}$, the LHS would be strictly positive while the RHS would be non-positive.

There are generalizations of the notion of a VC subgraph class, for example the notion of pseudo-dimension. The interested reader can consult the reference.^[4]

VC inequality

A similar setting is considered, which is more common in machine learning. Let ${\mathcal {X}}$ be a feature space and ${\mathcal {Y}}=\{0,1\}$. A function $f:{\mathcal {X}}\to {\mathcal {Y}}$ is called a classifier. Let ${\mathcal {F}}$ be a set of classifiers. Similarly to the previous section, define the shattering coefficient (also known as the growth function):

S({\mathcal {F}},n)=\max _{x_{1},\ldots ,x_{n}}|\{(f(x_{1}),\ldots ,f(x_{n})),f\in {\mathcal {F}}\}|

Note that there is a 1:1 mapping between each of the functions in ${\mathcal {F}}$ and the set on which the function is 1. One can thus define ${\mathcal {C}}$ to be the collection of subsets obtained from this mapping for every $f\in {\mathcal {F}}$. Therefore, in terms of the previous section, the shattering coefficient is precisely

\max _{x_{1},\ldots ,x_{n}}\Delta _{n}({\mathcal {C}},x_{1},\ldots ,x_{n})

This equivalence, together with Sauer's lemma, implies that $S({\mathcal {F}},n)$ is polynomial in $n$ for sufficiently large $n$, provided that the collection ${\mathcal {C}}$ has a finite VC-index.

Let $D_{n}=\{(X_{1},Y_{1}),\ldots ,(X_{n},Y_{n})\}$ be an observed dataset. Assume that the data is generated by an unknown probability distribution $P_{XY}$. Define $R(f)=P(f(X)\neq Y)$ to be the expected 0/1 loss. Since $P_{XY}$ is unknown in general, one has no access to $R(f)$. However, the empirical risk, given by:

{\hat {R}}_{n}(f)={\dfrac {1}{n}}\sum _{i=1}^{n}\mathbb {I} (f(X_{i})\neq Y_{i})

can certainly be evaluated. One then has the following theorem:

Theorem (VC inequality)

For binary classification and the 0/1 loss function, the following generalization bounds hold:

{\begin{aligned}P\left(\sup _{f\in {\mathcal {F}}}\left|{\hat {R}}_{n}(f)-R(f)\right|>\varepsilon \right)&\leq 8S({\mathcal {F}},n)e^{-n\varepsilon ^{2}/32}\\\mathbb {E} \left[\sup _{f\in {\mathcal {F}}}\left|{\hat {R}}_{n}(f)-R(f)\right|\right]&\leq 2{\sqrt {\dfrac {\log S({\mathcal {F}},n)+\log 2}{n}}}\end{aligned}}

In words, the VC inequality says that as the sample size increases, provided that ${\mathcal {F}}$ has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expected 0/1 risk. Note that the RHS of both inequalities converges to 0, provided that $S({\mathcal {F}},n)$ grows polynomially in $n$.
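
The quantities in the theorem are easy to evaluate numerically. The sketch below computes an empirical 0/1 risk on a tiny made-up dataset and then evaluates the right-hand sides of both bounds. It assumes the class of threshold classifiers $x\mapsto 1\{x>t\}$ on the real line, for which $S({\mathcal {F}},n)=n+1$; the data, the function names and the choice $\varepsilon =0.1$ are hypothetical.

```python
from math import exp, log, sqrt

def empirical_risk(classifier, data):
    """Empirical 0/1 risk: fraction of pairs (x, y) with classifier(x) != y."""
    return sum(1 for x, y in data if classifier(x) != y) / len(data)

def vc_tail_bound(growth, n, eps):
    """RHS of the first inequality: 8 S(F, n) exp(-n eps^2 / 32)."""
    return 8 * growth * exp(-n * eps ** 2 / 32)

def vc_expectation_bound(growth, n):
    """RHS of the second inequality: 2 sqrt((log S(F, n) + log 2) / n)."""
    return 2 * sqrt((log(growth) + log(2)) / n)

# Made-up data and classifier, for illustration only.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]
risk = empirical_risk(lambda x: 1 if x > 0.5 else 0, data)
print(risk)  # one of the five points is misclassified

# Assumption: threshold classifiers, so the growth function is n + 1.
eps = 0.1
sample_sizes = (10_000, 100_000, 1_000_000)
tail = {n: vc_tail_bound(n + 1, n, eps) for n in sample_sizes}
mean = {n: vc_expectation_bound(n + 1, n) for n in sample_sizes}
for n in sample_sizes:
    print(n, tail[n], mean[n])
```

With these choices the tail bound is still larger than 1, and therefore uninformative, at $n=10{,}000$, and it becomes tiny only for larger samples. This reflects the loose constants in the inequality rather than the true behaviour of the deviation.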

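The connection between this framework and the empirical process framework is evident. Here one is dealing with a modified empirical process

\left\|{\hat {R}}_{n}-R\right\|_{\mathcal {F}}

but, not surprisingly, the ideas are the same. The proof of the (first part of the) VC inequality relies on symmetrization and then argues conditionally on the data using concentration inequalities (in particular Hoeffding's inequality). The interested reader can check Theorems 12.4 and 12.5 of the book A Probabilistic Theory of Pattern Recognition, listed in the references.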

References

[1] Vapnik, Vladimir N. (2000). The Nature of Statistical Learning Theory. Information Science and Statistics. Springer-Verlag. ISBN 978-0-387-98780-4.
Vapnik, Vladimir N. (1989). Statistical Learning Theory. Wiley-Interscience. ISBN 978-0-471-03003-4.
[2] van der Vaart, Aad W.; Wellner, Jon A. (2000). Weak Convergence and Empirical Processes: With Applications to Statistics (2nd ed.). Springer. ISBN 978-0-387-94640-5.
[3] Gyorfi, L.; Devroye, L.; Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition (1st ed.). Springer. ISBN 978-0387946184.
See also the references in the articles on Richard M. Dudley, empirical processes, and shattered set.
[4] Pollard, David (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 2. ISBN 978-0-940600-16-4.
Bousquet, O.; Boucheron, S.; Lugosi, G. (2004). "Introduction to Statistical Learning Theory". In O. Bousquet; U. von Luxburg; G. Ratsch (eds.). Advanced Lectures on Machine Learning. Lecture Notes in Artificial Intelligence. Vol. 3176. Springer. pp. 169–207.
Vapnik, V.; Chervonenkis, A. (2004). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory Probab. Appl. 16 (2): 264–280. doi:10.1137/1116025.
