KMP算法详解 + 模板题 poj 3461 Oulipo

KMP

KMP算法是一种改进的字符串匹配算法,由D.E.Knuth,J.H.Morris和V.R.Pratt同时发现,因此人们称它为 克努特——莫里斯——普拉特 操作(简称KMP算法)。KMP算法的关键是利用匹配失败后的信息,尽量减少模式串与主串的匹配次数以达到快速匹配的目的。具体实现就是实现一个next()函数,函数本身包含了模式串的局部匹配信息。时间复杂度O(m+n)。(摘自百度百科)
KMP是一种很神奇的算法,我们可以再很短的时间内求两个串的匹配。

1.比如说这样两个串
a:    abcdabcdeababead
b:    abcdead
我们要匹配a和b两个串。那么我们就逐位比较。
2.当我们发现有不能匹配时,暴力的思想就是将位置往后移一位,重新匹配,但这样的复杂度奇高,而KMP算法的想法是,设法利用这个已知信息,不要把”搜索位置”移回已经比较过的位置,继续把它向后移,这样就提高了效率。
如:abababcdeababead
ababead
此时b串的第五位e和a串中的第五位‘a’并不相同,而我们发现b串中abab的2-前缀和2-后缀相同所以我们可以直接从后缀的ab这里开始匹配

kmp

一直这样,求到正好匹配成功
所以怎么求对要匹配的串的失配值呢??
我们定义一个字符串 fail 值定义为最大的 K 值 (0 ≤ K ≤ N − 1) 满足该字符串 K-前缀等于 K-后缀
fail 数组表示一个字符串所有前缀的 fail 值组成的数组
如:
abcabc 0,0,0,1,2,3
aaaaaa 0,1,2,3,4,5
假设已经求出 fail[1..t-1],现在要求 fail[t]
现在考虑 t = 5 的情况,设 fail[t-1]=p
ababa
由于 fail[4]=2
ababa
ababa
绿色部分同时向后一位得
ababa
ababa
求出了 fail[5]

void init(int m)//s2表示需要匹配的串,m表示s2的长度
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}

KMP:

void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			printf("%d\n", i - p + 1);
			p = fail[p];
		}
	}
}

KMP模板:

#include<iostream>
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;
int fail[233333];
char s2[2333333];
char s1[2333333];
void init(int m)
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}
void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			printf("%d\n", i - p + 1);
			p = fail[p];
		}
	}
}
int main()
{
	scanf("%s", s1+1);
	scanf("%s", s2+1);
	int n = strlen(s1+1);
	int m = strlen(s2+1);
	init(m);
	_kmp(n, m);
	return 0;
}

 
接下来看一道模板题: POJ 3461

Oulipo

Time Limit: 1000MS Memory Limit: 65536K
Total Submissions: 37059 Accepted: 14964

Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter 'e'. He was a member of the Oulipo group. A quote from the book:

Tout avait Pair normal, mais tout s’affirmait faux. Tout avait Fair normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : stir son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…

Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive 'T's is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {'A', 'B', 'C', …, 'Z'} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:

  • One line with the word W, a string over {'A', 'B', 'C', …, 'Z'}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
  • One line with the text T, a string over {'A', 'B', 'C', …, 'Z'}, with |W| ≤ |T| ≤ 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

题解:

这道题的题意是给你两个串,问第一个串在第二个串中出现的次数
很明显,这是一道KMP的裸题。
我们只需要在每匹配成功一次,ans++即可。

#include<iostream>
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;
int ans = 0;
int fail[233333];
char s2[2333333];
char s1[2333333];
void init(int m)
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}
void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			ans++;
			p = fail[p];
		}
	}
}
int main()
{
	int t;
	cin >> t;
	while (t--)
	{
		ans = 0;
		memset(fail, 0, sizeof(fail));
		scanf("%s", s2+1);
		scanf("%s", s1+1);
		int m = strlen(s2+1);
		int n = strlen(s1+1);
		init(m);
		_kmp(n, m);
		cout << ans << endl;
	}
	return 0;
}

 

上一篇
下一篇