KMP算法详解＋模板题 poj 3461 Oulipo

KMP

KMP算法是一种改进的字符串匹配算法，由D.E.Knuth，J.H.Morris和V.R.Pratt同时发现，因此人们称它为克努特——莫里斯——普拉特操作（简称KMP算法）。KMP算法的关键是利用匹配失败后的信息，尽量减少模式串与主串的匹配次数以达到快速匹配的目的。具体实现就是实现一个next()函数，函数本身包含了模式串的局部匹配信息。时间复杂度O(m+n)。（摘自百度百科）
KMP是一种很神奇的算法，我们可以再很短的时间内求两个串的匹配。

1.比如说这样两个串
a: abcdabcdeababead
b: abcdead
我们要匹配a和b两个串。那么我们就逐位比较。
2.当我们发现有不能匹配时，暴力的思想就是将位置往后移一位，重新匹配，但这样的复杂度奇高，而KMP算法的想法是，设法利用这个已知信息，不要把”搜索位置”移回已经比较过的位置，继续把它向后移，这样就提高了效率。
如：abababcdeababead
ababead
此时b串的第五位e和a串中的第五位‘a’并不相同，而我们发现b串中abab的2-前缀和2-后缀相同所以我们可以直接从后缀的ab这里开始匹配

一直这样，求到正好匹配成功
所以怎么求对要匹配的串的失配值呢？？
我们定义一个字符串 fail 值定义为最大的 K 值 (0 ≤ K ≤ N − 1) 满足该字符串 K-前缀等于 K-后缀
fail 数组表示一个字符串所有前缀的 fail 值组成的数组
如：
abcabc 0,0,0,1,2,3
aaaaaa 0,1,2,3,4,5
假设已经求出 fail[1..t-1]，现在要求 fail[t]
现在考虑 t = 5 的情况，设 fail[t-1]=p
ababa
由于 fail[4]=2
ababa
ababa
绿色部分同时向后一位得
ababa
ababa
求出了 fail[5]

void init(int m)//s2表示需要匹配的串，m表示s2的长度
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}

KMP：

void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			printf("%d\n", i - p + 1);
			p = fail[p];
		}
	}
}

KMP模板：

#include<iostream>
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;
int fail[233333];
char s2[2333333];
char s1[2333333];
void init(int m)
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}
void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			printf("%d\n", i - p + 1);
			p = fail[p];
		}
	}
}
int main()
{
	scanf("%s", s1+1);
	scanf("%s", s2+1);
	int n = strlen(s1+1);
	int m = strlen(s2+1);
	init(m);
	_kmp(n, m);
	return 0;
}

接下来看一道模板题： POJ 3461

Oulipo

Time Limit: 1000MS		Memory Limit: 65536K
Total Submissions: 37059		Accepted: 14964

Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter 'e'. He was a member of the Oulipo group. A quote from the book:

Tout avait Pair normal, mais tout s’affirmait faux. Tout avait Fair normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : stir son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…

Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500,000 consecutive 'T's is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet {'A', 'B', 'C', …, 'Z'} and two finite strings over that alphabet, a word W and a text T, count the number of occurrences of W in T. All the consecutive characters of W must exactly match consecutive characters of T. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:

One line with the word W, a string over {'A', 'B', 'C', …, 'Z'}, with 1 ≤ |W| ≤ 10,000 (here |W| denotes the length of the string W).
One line with the text T, a string over {'A', 'B', 'C', …, 'Z'}, with |W| ≤ |T| ≤ 1,000,000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word W in the text T.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN

Sample Output

1
3
0

题解：

这道题的题意是给你两个串，问第一个串在第二个串中出现的次数
很明显，这是一道KMP的裸题。
我们只需要在每匹配成功一次，ans++即可。

#include<iostream>
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;
int ans = 0;
int fail[233333];
char s2[2333333];
char s1[2333333];
void init(int m)
{
	fail[1] = 0;
	for (int i = 2; i <= m; i++)
	{
		int p = fail[i - 1];
		while (p && s2[i] != s2[p + 1])
			p = fail[p];
		if (s2[i] == s2[p+1])
			p++;
		fail[i] = p;
	}
}
void _kmp(int n, int m)
{
	int p = 0;
	for (int i = 1; i <= n; i++)
	{
		while (p && s1[i] != s2[p + 1])
			p = fail[p];
		if (s1[i] == s2[p + 1])
			p++;
		if (p == m)
		{
			ans++;
			p = fail[p];
		}
	}
}
int main()
{
	int t;
	cin >> t;
	while (t--)
	{
		ans = 0;
		memset(fail, 0, sizeof(fail));
		scanf("%s", s2+1);
		scanf("%s", s1+1);
		int m = strlen(s2+1);
		int n = strlen(s1+1);
		init(m);
		_kmp(n, m);
		cout << ans << endl;
	}
	return 0;
}