차발-산코프 상수

수학에서 Chvatal-Sankoff 상수는 무작위 문자열의 가장 긴 공통 부분들의 길이를 설명하는 수학 상수다.이러한 상수의 존재는 증명되었지만, 정확한 값은 알 수 없다.그들은 1970년대 중반부터 조사를 시작한 바클라프 차바탈과 데이비드 산코프의 이름을 따서 이름 지어졌다.^[1]^[2]

각 양의 정수 k에 대해 $\gamma _{k}$ $\gamma _{k}$ k ${\$ 가 1개씩 있는데, 여기서 k는 임의 문자열이 그려지는 알파벳의 문자 수입니다.이 숫자들의 순서는 k의 제곱근에 반비례적으로 증가한다.^[3]그러나 일부 저자들은 2진수 알파벳에 대해 이런 식으로 정의된 상수인 $\gamma _{2}$ $\gamma _{2}$ ${\$ 를 참조하기 위해 "Chval-Sankoff 상수"를 쓴다.^[4]

배경

두 문자열 S와 T의 공통 하위열은 문자열이 S와 T에서 모두 같은 순서로 나타나는 문자열이다(꼭 연속해서 나타나는 것은 아님).가장 긴 공통부분을 계산하는 문제는 컴퓨터 과학에서 잘 연구되어 왔다.다항식 시간에 동적 프로그래밍으로 해결할 수 있다.^[5] 이 기본 알고리즘에는 작은 알파벳(Method of Four Russian),^[6] 차이가 적은 문자열,^[7] 일치하는 문자 쌍이 거의 없는 문자열 등에 대한 추가 속도 업이 있다.^[8]더 복잡한 형태의 편집 거리에 대한 이 문제와 그것의 일반화는 생물정보학(DNA와 단백질 서열 비교와 진화 나무 재구성에), 지질학(지층학), 컴퓨터 과학(데이터 비교 및 수정 제어)을 포함하는 영역에 중요한 응용을 가진다.^[7]

이미 Chvátal과 Sankoff에 의해 주어진 무작위 문자열의 가장 긴 공통 부분들을 연구하는 한 가지 동기는 무작위가 아닌 문자열에서 가장 긴 공통 부분들의 계산을 교정하는 것이다.그러한 계산이 무작위로 얻을 수 있는 것보다 상당히 긴 부분 집합을 반환하는 경우, 이 결과로부터 경기가 유의미하거나 유의미하다는 것을 유추할 수 있다.^[1]

정의와 존재

Chval-Sankoff 상수는 다음과 같은 무작위 프로세스의 동작을 설명한다.매개변수 n과 k를 지정하면 동일한 k-심볼 알파벳에서 두 개의 길이-n 문자열 S와 T를 선택하고, 각 문자열의 각 문자는 다른 모든 문자와 독립적으로 임의로 균일하게 선택한다.이 두 문자열 중 가장 긴 공통 부분 집합을 계산하고 $\lambda _{n,k}$ , $\lambda _{n,k}$ ${\$ 값을 이 부분 집합의 길이인 임의 변수가 되도록 $\lambda _{n,k}$ 한다.그러면 $\lambda _{n,k}$ , $\lambda _{n,k}$ ${\$ 의 기대값은 n에 비례하는 (최대 저차 항)이며 $\lambda _{n,k}$ , k번째 Chvátal-Sankoff 상수 $\gamma _{k}$ $\gamma _{k}$ ${\$ 는 $\cHB _{k}$ 비례의 상수이다.^[2]

More precisely, the expected value $E[\lambda _{n,k}]$ is superadditive: for all m and n, ${\displaystyle E[\lambda _{m+n,k}]\geq E[\lambda _{m,k}]+$ $E[\lambda _{n,k$ 길이 m + n의 줄을 길이 m과 n의 하위 문자열로 쪼개서 그 하위 문자열 중에서 가장 긴 공통의 하위 문자열이 발견되면 전체 문자열의 공통 하위 문자열을 얻기 위해 함께 연결될 수 있기 때문이다.그 한계는^[9] 마이클 페케트의 보조정리로부터 따온 것이다.

\gamma _{k}=\lim _{n\to \inflt }{\frac {E[\lambda _{n,k}}}}{n}}}}}}

존재하며, $E[\lambda _{n,k}]/n$ $E[\lambda _{n,k}]/n$ [ $E[\lambda _{n,k}]/n$ n $E[\lambda _{n,k}]/n$ , $E[\lambda _{n,k}]/n$ $E[\lambda _{n,k}]/n$ / n ${\displaystyle$ E $[\lambda _{n,k}]/n}$ 의 우월성과 같다 $E[\lambda _{n,k}]/n$ 이러한 제한 값 $\gamma _{k}$ $\gamma _{k}$ ${\$ 는 $\cHB _{k}$ Chval-Sankoff 상수이다.^[2]

경계

차바탈-산코프 상수의 정확한 값은 아직 알려지지 않았지만, 엄격한 상한과 하한은 증명되었다.

왜냐하면 $\gamma _{k}$ $\gamma _{k}$ $\gamma _{k}$ {\ $displaystyle$ \ $gamma$ $_{$ $E[\lambda _{n,k}]$ $}$ 는 $\cHB _{k}$ 각각 유한 확률 분포에만 의존하는 $E[\lambda _{n,k}]$ E $E[\lambda _{n,k}]$ [ $E[\lambda _{n,k}]$ , k $]{\$ $displaystyle$ $E$ $[\$ $lambda$ $_{$ $E[\lambda _{n,k}]$ $,$ $k}}}}}}}$ 의 정확한 $E[\lambda _{n,k}]$ 을 계산하는 한 방법이 될 것이다 $\cHB _{k}$ $E[\lambda _{n,k}]$ $[\displaystyle E[\lambda _{n,k$ 그러나 이 방법은 n 단위로 기하급수적으로 확장되므로 n의 작은 값에 대해서만 실행할 수 있어 하한선이 약하다.Vlado Danchik은 박사 논문에서 결정론적 유한 자동화를 사용하여 두 입력 문자열의 기호를 판독하고 이러한 입력의 공통적인 연속성을 생성하는 대안적 접근법을 개척했다.무작위 입력에 대한 이 자동화의 동작은 마코프 체인으로 분석할 수 있는데, 마코프 체인의 안정 상태는 n의 큰 값에 대한 공통 부속성의 요소를 찾는 속도를 결정한다.이 비율은 반드시 Chval-Sankoff 상수에 대한 하한이다.^[10]Danchik의 방법을 사용함으로써, 상태 공간이 두 입력 문자열에서 가장 최근의 h 문자를 버퍼링하는 자동화와, 이 접근방식의 값비싼 정상 상태 마르코프 체인 분석을 피하기 위한 추가 기법으로, 루에커(2009)는 $\gamma _{2}\geq 0.788071$ $\gamma _{2}\geq 0.788071$ with $\gamma _{2}\geq 0.788071$ 0 $\gamma _{2}\geq 0.788071$ ${\di$ 를 증명하는 n = 15로 컴퓨터 분석을 수행할 수 있었다. $splaystyle \cHB_{2}\geq$ 0 $.788071$ .

비이항 알파벳에도 유사한 방법이 일반화될 수 있다.k의 다양한 값에 대해 이러한 방식으로 얻은 하한은 다음과 같다.^[4]

k	${\$ 에 대한 하한값
2	0.788071
3	0.671697
4	0.599248
5	0.539129
6	0.479452
7	0.44502
8	0.42237
9	0.40321
10	0.38656

Danchik & Paterson(1995)도 자동이식적 방법을 사용하여 Chvátal-Sankoff 상수에 대한 상한을 증명했고, 다시 Lueker(2009)는 컴퓨터 계산을 통해 이러한 결과를 확장했다.그가 획득한 상한은 $\gamma _{2}\leq 0.826280$ $\gamma _{2}\leq 0.826280$ $\gamma _{2}\leq 0.826280$ $\gamma _{2}\leq 0.826280$ ${\displaystyle \gamma _{2}\leq$ 0. $826280}$ 이었다 $\gamma _{2}\leq 0.826280$ 이 결과는 J. Michael Steel이 이 값이 상한 값보다 크기 때문에 $\gamma _{2}=2/(1+{\sqrt {2}})$ 2 = $\gamma _{2}=2/(1+{\sqrt {2}})$ / ( $\gamma _{2}=2/(1+{\sqrt {2}})$ + $\gamma _{2}=2/(1+{\sqrt {2}})$ ) $\gamma _{2}=2/(1+{\sqrt {2}})$ {\ $displaystyle \gamma _$ {2 $}=2/(1+{\sqrt{2}})}$ 라고 추측하는 것을 반증했다.^[11]비강제적 수치 증거는 $\gamma _{2}$ $\gamma _{2}$ ${\$ }}이 약 $0 .811$ $0.811$ 이며 $\gamma _{2}$ $0.811$ 하한보다 상한에 가깝다는 것을 시사한다.^[12]

k가 무한대로 갈 때 한계에서 $\gamma _{k}$ { $\gamma _{k}$ k {\ $displaystyle \gamma _{$ k}는 k의 제곱근에 반비례적으로 성장한다 $\gamma _{k}$ .더 정확히 ^[3]말하자면

\lim_{k\to \infit }\propert_{k}{\sqrt{k}=2.

LCS 길이 분포

또한 이 값에 대한 기대 연구를 일반화하면서 가장 긴 공통 부분들의 값 분포에 대한 연구가 있었다.예를 들어, 길이 n의 임의 문자열 중 가장 긴 공통 문자열의 길이에 대한 표준 편차는 n의 제곱근에 비례하는 것으로 알려져 있다.^[13]

이러한 종류의 분석을 수행하는 데 있어 한 가지 복잡한 점은 서로 다른 위치 쌍의 문자가 서로 일치하는지 여부를 설명하는 랜덤 변수가 서로 독립적이지 않다는 것이다.기호가 서로 동일한지 여부에 의해 제어되지 않고 대신 확률 1/k와 (k - 1)/k가 0인 독립 랜덤 변수에 의해 허용되는 가장 긴 공통 연속성 문제의 수학적으로 단순화.가장 긴 공통 부분 길이의 늑골은 트레이시-위덤 분포에 의해 제어된다.^[14]

참조

^ ^a ^b Chvatal, Václáv; Sankoff, David (1975), "Longest common subsequences of two random sequences", Journal of Applied Probability, 12 (2): 306–315, doi:10.2307/3212444, JSTOR 3212444, MR 0405531.
^ ^a ^b ^c Finch, Steven R. (2003), "5.20.2 Common Subsequences", Mathematical Constants, Encyclopedia of Mathematics and its Applications, Cambridge University Press, pp. 384–385, ISBN 9780521818056.
^ ^a ^b Kiwi, Marcos; Loebl, Martin; Matoušek, Jiří (2005), "Expected length of the longest common subsequence for large alphabets", Advances in Mathematics, 197 (2): 480–498, arXiv:math/0308234, doi:10.1016/j.aim.2004.10.012, MR 2173842.
^ ^a ^b Kiwi, M.; Soto, J. (2009), "On a speculated relation between Chvátal–Sankoff constants of several sequences", Combinatorics, Probability and Computing, 18 (4): 517–532, arXiv:0810.1066, doi:10.1017/S0963548309009900, MR 2507735, S2CID 10882010.
^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "15.4", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 350–355, ISBN 0-262-53196-8.
^ Masek, William J.; Paterson, Michael S. (1980), "A faster algorithm computing string edit distances", Journal of Computer and System Sciences, 20 (1): 18–31, doi:10.1016/0022-0000(80)90002-1, MR 0566639.
^ ^a ^b Sankoff, David; Kruskal, Joseph B. (1983), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley.
^ Hunt, James W.; Szymanski, Thomas G. (1977), "A fast algorithm for computing longest common subsequences", Communications of the ACM, 20 (5): 350–353, doi:10.1145/359581.359603, MR 0436655, S2CID 3226080.
^ Fekete, M. (1923), "Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten", Mathematische Zeitschrift (in German), 17 (1): 228–249, doi:10.1007/BF01504345, S2CID 186223729.
^ Dančík, Vlado; Paterson, Mike (1995), "Upper bounds for the expected length of a longest common subsequence of two binary sequences", Random Structures & Algorithms, 6 (4): 449–458, doi:10.1002/rsa.3240060408, MR 1368846.
^ Lueker, George S. (2009), "Improved bounds on the average length of longest common subsequences", Journal of the ACM, 56 (3), A17, doi:10.1145/1516512.1516519, MR 2536132, S2CID 7232681.
^ Dixon, John D. (2013), Longest common subsequences in binary sequences, arXiv:1307.2796, Bibcode:2013arXiv1307.2796D.
^ Lember, Jüri; Matzinger, Heinrich (2009), "Standard deviation of the longest common subsequence", The Annals of Probability, 37 (3): 1192–1235, arXiv:0907.5137, doi:10.1214/08-AOP436, MR 2537552, S2CID 8143348.
^ Majumdar, Satya N.; Nechaev, Sergei (2005), "Exact asymptotic results for the Bernoulli matching model of sequence alignment", Physical Review E, 72 (2): 020901, 4, arXiv:q-bio/0410012, Bibcode:2005PhRvE..72b0901M, doi:10.1103/PhysRevE.72.020901, MR 2177365, PMID 16196539, S2CID 11390762.

[cs75-1] Chvatal, Václáv; Sankoff, David (1975), "Longest common subsequences of two random sequences", Journal of Applied Probability, 12 (2): 306–315, doi:10.2307/3212444, JSTOR 3212444, MR 0405531.

[finch-2] Finch, Steven R. (2003), "5.20.2 Common Subsequences", Mathematical Constants, Encyclopedia of Mathematics and its Applications, Cambridge University Press, pp. 384–385, ISBN 9780521818056.

[klm05-3] Kiwi, Marcos; Loebl, Martin; Matoušek, Jiří (2005), "Expected length of the longest common subsequence for large alphabets", Advances in Mathematics, 197 (2): 480–498, arXiv:math/0308234, doi:10.1016/j.aim.2004.10.012, MR 2173842.

[ks09-4] Kiwi, M.; Soto, J. (2009), "On a speculated relation between Chvátal–Sankoff constants of several sequences", Combinatorics, Probability and Computing, 18 (4): 517–532, arXiv:0810.1066, doi:10.1017/S0963548309009900, MR 2507735, S2CID 10882010.

[5] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "15.4", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 350–355, ISBN 0-262-53196-8.

[6] Masek, William J.; Paterson, Michael S. (1980), "A faster algorithm computing string edit distances", Journal of Computer and System Sciences, 20 (1): 18–31, doi:10.1016/0022-0000(80)90002-1, MR 0566639.

[tw-7] Sankoff, David; Kruskal, Joseph B. (1983), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley.

[8] Hunt, James W.; Szymanski, Thomas G. (1977), "A fast algorithm for computing longest common subsequences", Communications of the ACM, 20 (5): 350–353, doi:10.1145/359581.359603, MR 0436655, S2CID 3226080.

[9] Fekete, M. (1923), "Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten", Mathematische Zeitschrift (in German), 17 (1): 228–249, doi:10.1007/BF01504345, S2CID 186223729.

[10] Dančík, Vlado; Paterson, Mike (1995), "Upper bounds for the expected length of a longest common subsequence of two binary sequences", Random Structures & Algorithms, 6 (4): 449–458, doi:10.1002/rsa.3240060408, MR 1368846.

[11] Lueker, George S. (2009), "Improved bounds on the average length of longest common subsequences", Journal of the ACM, 56 (3), A17, doi:10.1145/1516512.1516519, MR 2536132, S2CID 7232681.

[12] Dixon, John D. (2013), Longest common subsequences in binary sequences, arXiv:1307.2796, Bibcode:2013arXiv1307.2796D.

[13] Lember, Jüri; Matzinger, Heinrich (2009), "Standard deviation of the longest common subsequence", The Annals of Probability, 37 (3): 1192–1235, arXiv:0907.5137, doi:10.1214/08-AOP436, MR 2537552, S2CID 8143348.

[14] Majumdar, Satya N.; Nechaev, Sergei (2005), "Exact asymptotic results for the Bernoulli matching model of sequence alignment", Physical Review E, 72 (2): 020901, 4, arXiv:q-bio/0410012, Bibcode:2005PhRvE..72b0901M, doi:10.1103/PhysRevE.72.020901, MR 2177365, PMID 16196539, S2CID 11390762.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

Search

차발-산코프 상수

네임스페이스

더

목차

배경

정의와 존재

경계

LCS 길이 분포

참조