계수형 간결성 문제

컴퓨터 과학에서 계수형 간결성 문제^[1](응용수학에서도 카디널리티 추정 문제로 알려져 있음)는 반복되는 원소가 있는 데이터 스트림에서 구별되는 원소의 수를 찾는 문제다.이것은 수많은 어플리케이션에서 잘 알려진 문제다.요소는 라우터를 통과하는 패킷의 IP 주소, 웹 사이트의 고유한 방문자, 대형 데이터베이스의 요소, DNA 시퀀스의 모티브 또는 RFID/센서 네트워크의 요소를 나타낼 수 있다.

형식 정의

인스턴스:A stream of elements

x_{1},x_{2},\ldots ,x_{s}

with repetitions, and an integer

m

. Let

n

be the number of distinct elements, namely

n= \left\{{x_{1},x_{2},\ldots ,x_{s

}\right\}

그리고 이 요소들을

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

{

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

,

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

e

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

,

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

…,

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

e

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

{\

\{

e_{1},

e_{

2},\ldots,

e_{

n}\n

}\right

\}}

이 되도록 하십시오

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

목표:

m

m

저장

m

단위만 사용하여

n

n

n

의

\widehat {n}

추정치

{\widehat {n}}

{\widehat {n}}

{n

을(를) 찾으십시오(

m\ll n

서 m

m\ll n

{\

displaystyle

m

\ll

n

m\ll n

카디널리티 추정 문제의 예로는 스트림: $a,$ $b,$ $a,$ $c,$ $d,$ $d,$ $d$ , d ${\displaystyle a,b,a,c,$ $n=|\left\{{a,b,c,d}\right\}|=4$ $,d},$ d $a,b,a,c,d,b,d$ 이 $n=|\left\{{a,b,c,d}\right\}|=4$ 에서는 $n=|\left\{{a,b,c,d}\right\}|=4$ = { $n=|\left\{{a,b,c,d}\right\}|=4$ ,b,c $n=|\left\{{a,b,c,d}\right\}|=4$ }, $n=|\left\{{a,b,c,d}\right\}|=4$ = 4 {\ $displaysty$ n= $\{a,b,c,d}\}$ = $4$

순진한 해결책

그 문제에 대한 순진한 해결책은 다음과 같다.

카운터,  $c$ , to 0,  $c\leftarrow 0$   $c\leftarrow 0$   $c\leftarrow 0$ 을 초기화한다. c ← 0  ${\displaystyle c\leftarrow$  0 $c\leftarrow 0$  삽입과 멤버십을 빠르게 수행할 수 있는 해시 테이블이나 검색 트리 등 효율적인 사전 데이터 구조  $D$ 를 초기화한다.각 요소  $x_{i}$   $x_{i}$   ${\$ 에 대해 구성원 자격 조회가 발행된다 $x_{i}$ If  $x_{i}$  is not a member of  $D$  ( $x_{i}\notin D$ ) Add  $x_{i}$  to  $D$           Increase  $c$  by one,  $c\leftarrow c+1$       Otherwise ( $x_{i}\in D$ ) do nothing.출력  $n=c$ =  $n=c$   $n=c$ .

구별되는 원소의 수가 너무 많지 않은 한, $D$ 는 메인 메모리에 들어맞고 정확한 답을 검색할 수 있다.그러나 이 접근방식은 경계 저장소에 대해 확장되지 않으며, 각 요소 x $x_{i}$ ${\$ 에 대해 수행된 계산을 최소화해야 하는 $x_{i}$ 경우.이 경우, 정해진 수의 저장 단위를 사용하는 여러 스트리밍 알고리즘이 제안되었다.

HyperLog 알고리즘

스트리밍 알고리즘

경계 스토리지 제약 조건을 처리하기 위해 스트리밍 알고리즘은 무작위화를 사용하여 고유한 수의 요소( $n$ ${\displaystyle n$ 에 대한 비정확한 추정을 생성한다 $.$ 최첨단 추정기는 해시함수 $h(e_{j})$ ( $h(e_{j})$ e $){\displaystyle$ $e_{j}$ 모든 요소를 해시함수를 사용하여 저차원 데이터 스케치로 $e_{j}$ 해시한다. $Playstyle h(e_{j$ 서로 다른 기법은 그들이 저장하는 데이터 스케치에 따라 분류될 수 있다.

최소/최대 스케치

최소/최대 스케치는^[2]^[3] 최소/최대 해시 값만 저장한다.알려진 최소/최대 스케치 추정기의 예:Chassaing 등에서는 문제에 대한 최소-분산 불편 추정기인 최대 스케치를 제시한다.연속적인 최대 스케치 추정기는 최대우도 추정기다.실제로 선택한 추정기는 HyperLog 알고리즘이다.^[6]

그러한 추정자 뒤에 있는 직관은 각각의 스케치가 원하는 수량에 대한 정보를 담고 있다는 것이다.For example, when every element $e_{j}$ is associated with a uniform RV, $h(e_{j})\sim U(0,1)$ , the expected minimum value of $h(e_{1}),h(e_{2}),\ldots ,h(e_{n})$ is ${\display$ $스타일$ 1 $/(n+1)}$ . $h(e_{j})$ 는 e $h(e_{j})$ $e_{j}$ ${\displaystyle h(e_{j}})$ 가 e j ${\$ 의 모든 모양에 대해 동일함을 $h(e_{j})$ 보장한다 $e_{j}$ 따라서 중복의 존재는 극단적 순서 통계의 가치에 영향을 미치지 않는다.

최소/최대 스케치 외에 다른 추정 기법이 있다.Flajolet 외 에 의한 계수-간결한 추정에 관한 첫 번째 논문은 비트 패턴 스케치를 설명한다^[7].이 경우 원소는 비트 벡터로 해시되며 스케치는 모든 해시 값의 논리적 OR을 보유한다.이 문제에 대한 최초의 무증상-공간적-시간적-최적 알고리즘은 다니엘 M. 케인, 젤라니 넬슨, 데이비드 P에 의해 주어졌다.우드러프.^[8]

맨 아래-m 스케치

Bottom-m 스케치는 $m$ 인m {\ $displaystyle$ m} 최소값을 유지하는 최소 스케치의 $m\geq 1$ 로서 $m\geq 1$ 여기서 $m\geq 1$ $\$ 1 {\ $displaystyle$ m\geq $1}$ 을(를 $)$ 참조한다 $m\geq 1$ count-distick 추정 알고리즘의 이론적 개요는 Cosma et al.^[2]을 참조하고, 비교 시뮬레이션 결과를 통한 실제 개요는^[10] Metwally를 참조한다.

가중 카운트-간결함 문제

가중치 버전에서 각 요소는 가중치와 연관되며 목표는 가중치의 총 합계를 추정하는 것이다.형식적으로.

인스턴스:A stream of weighted elements

x_{1},x_{2},\ldots ,x_{s}

with repetitions, and an integer

m

. Let

n

be the number of distinct elements, namely

{\displaystyle n= \left\{{x_{1},x_{2},\ldo

ts ,x_{s}}\right

이 요소들을

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

{

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

,

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

e

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

,

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

{\

j

\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}

w_{j}

으로 w j

e_{j}

{\

displaystystyle w_{j

}

을

}

의 무게로 한다

w_{j}

e_{j}

목표:

m

{\

displaystyle

m

\

widehat

{w}

의

\widehat {w}

w=\sum _{{j=1}}^{{n}}w_{j}

w=\sum _{j=1}^{n}w_{j}

w

w=\sum _{j=1}^{n}w_{j}

=

w=\sum _{j=1}^{n}w_{j}

w=\sum _{j=1}^{n}w_{j}

j

w=\sum _{j=1}^{n}w_{j}

=

w=\sum _{j=1}^{n}w_{j}

w=\sum _{j=1}^{n}w_{j}

{\

w

=\sum _{j=1}^{n1}w_{j}}}}}

의

m

{\widehat {w}}

를 찾으십시오.

m\ll n

서 m

\

{\

displaystystyle m\ll

n

m\ll n

An example of an instance for the weighted problem is: $a(3),b(4),a(3),c(2),d(3),b(4),d(3)$ . For this instance, $e_{1}=a,e_{2}=b,e_{3}=c,e_{4}=d$ , the weights are $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ = $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ , $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ 3 $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ = $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ , $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ = $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ ${\displaystyle w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ $\sum {w_{j}}=12$ ∑ $\sum {w_{j}}=12$ = $\sum {w_{j}}=12$ ${\displaystyle \sum$ {w_ ${j}}=12$

애플리케이션 예로서, $x_{1},x_{2},\ldots ,x_{s}$ 1, $x_{1},x_{2},\ldots ,x_{s}$ 2 $x_{1},x_{2},\ldots ,x_{s}$ , $x_{1},x_{2},\ldots ,x_{s}$ $x_{1},x_{2},\ldots ,x_{s}$ ${\$ 는 서버가 수신한 IP 패킷일 수 있다 $x_{1},x_{2},\ldots ,x_{s}$ .각 패킷은 $n$ $n$ IP $n$ 흐름 $e_{1},e_{2},\ldots ,e_{n}$ $e_{1},e_{2},\ldots ,e_{n}$ , $e_{1},e_{2},\ldots ,e_{n}$ e $e_{1},e_{2},\ldots ,e_{n}$ , $e_{1},e_{2},\ldots ,e_{n}$ e $e_{1},e_{2},\ldots ,e_{n}$ ${\$ 중 하나에 속한다 $e_{1},e_{2},\ldots ,e_{n}$ $w_{j}$ j ${\$ 중량은 서버의 $e_{j}$ $e_{j}$ $e_{j}$ j ${\$ 에 의해 부과되는 부하일 수 있다 $w_{j}$ .따라서 $\sum _{j=1}^{n}{w_{j}}$ = $\sum _{j=1}^{n}{w_{j}}$ $\sum _{j=1}^{n}{w_{j}}$ $\sum _{j=1}^{n}{w_{j}}$ ${\$ 는 $x_{1},x_{2},\ldots ,x_{s}$ $x_{1},x_{2},\ldots ,x_{s}$ , x $x_{1},x_{2},\ldots ,x_{s}$ , $x_{1},x_{2},\ldots ,x_{s}$ $x_{1},x_{2},\ldots ,x_{s}$ {\ $displaystyle x_{1},\dots ,x_{s}}}$ 이 $x_{1},x_{2},\ldots ,x_{s}$ 속하는 모든 흐름에 의해 서버에 부과되는 총 부하를 나타낸다 $\sum _{j=1}^{n}{w_{j}}$ .

가중 카운트-간결함 문제 해결

가중치 없는 문제에 대한 모든 극단적 순서 통계 추정기(최소/최대 스케치)는 가중된 문제에 대한 추정기로 일반화할 수 있다.^[11]예를 들어 Cohen 등이 제안한 가중 추정기는 가중 문제를 해결하기 위해 연속적인 최대 스케치 추정기를 확장했을 때 얻을 수 있다.^[5]특히 HyperLog 알고리즘을 확장해 가중 문제를 해결할 수 있다.확장된 HyperLog 알고리즘은 가중 문제에 대해 알려진 다른 모든 알고리즘 중에서 통계 정확도와 메모리 사용 측면에서 최고의 성능을 제공한다.

참고 항목

참조

^ Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF). {{cite journal}}:Cite 저널은 필요로 한다. journal=(도움말)
^ ^a ^b Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics. arXiv:0801.3552.
^ Giroire, Frederic; Fusy, Eric (2007). 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). pp. 223–231. CiteSeerX 10.1.1.214.270. doi:10.1137/1.9781611972979.9. ISBN 978-1-61197-297-9.
^ Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science. arXiv:math/0701347. Bibcode:2007math......1347C.
^ ^a ^b Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci. 55 (3): 441–453. doi:10.1006/jcss.1997.1534.
^ ^a ^b Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLoglog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). Analysis of Algorithms.
^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). J. Comput. Syst. Sci. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.
^ Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An Optimal Algorithm for the Distinct Elements Problem". Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS).
^ Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB.
^ Metwally, Ahmed; Agrawal, Divyakant; Abbadi, Amr El (2008), Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic, Proceedings of the 11th international conference on Extending Database Technology: Advances in Database Technology, CiteSeerX 10.1.1.377.4771
^ Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters. 115 (2): 336–342. doi:10.1016/j.ipl.2014.10.009.

[1] Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF). {{cite journal}}:Cite 저널은 필요로 한다. journal=(도움말)

[cosma2011-2] Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics. arXiv:0801.3552.

[3] Giroire, Frederic; Fusy, Eric (2007). 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). pp. 223–231. CiteSeerX 10.1.1.214.270. doi:10.1137/1.9781611972979.9. ISBN 978-1-61197-297-9.

[4] Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science. arXiv:math/0701347. Bibcode:2007math......1347C.

[edithCohen-5] Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci. 55 (3): 441–453. doi:10.1006/jcss.1997.1534.

[hyperloglog-6] Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLoglog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). Analysis of Algorithms.

[7] Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). J. Comput. Syst. Sci. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.

[optimalf0-8] Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An Optimal Algorithm for the Distinct Elements Problem". Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS).

[9] Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB.

[10] Metwally, Ahmed; Agrawal, Divyakant; Abbadi, Amr El (2008), Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic, Proceedings of the 11th international conference on Extending Database Technology: Advances in Database Technology, CiteSeerX 10.1.1.377.4771

[11] Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters. 115 (2): 336–342. doi:10.1016/j.ipl.2014.10.009.

[1]

[2]

[3]

[6]

[7]

[8]

[10]

[11]

[5]

Search