Here's a nice dynamic programming exercise: find a polynomial-time algorithm that takes as input a sequence of $$n$$ symbols, and a number $$k,$$ and produces as output the number of distinct $$k$$-character subsequences.

For instance, if the input is the string "food" and the number $$k=2,$$ the output should be $$4.$$ There are four distinct two-character subsequences of "food": they are "fo", "fd", "oo", and "od".

ext_1343285:
2012-08-09T12:36:04Z

Construct a suffix tree, "cut" it at level $$k$$ (counting characters, not nodes) and count the leaves.

11011110:
2012-08-09T15:46:04Z

Subsequences, not substrings.

sgkruk:
2012-08-09T18:07:53Z
11011110:
2012-08-09T18:18:22Z

So what is the running time of this?

sgkruk:
2012-08-09T18:40:47Z

Oops. I wrote too fast. This is $$O(2^n)$$ Sorry.

gareth_rees:
2012-08-09T23:05:45Z
Let $$\theta(S,k)$$ be the number of distinct k-character subsequences in the string $$S=s_1\dots s_n.$$ Clearly $$\theta(S,k)=1$$ if $$n=k$$ or $$k=0$$ and $$\theta(S,k)=0$$ if $$n\lt k.$$ Otherwise, suppose that $$S$$ contains $$m$$ distinct characters $$c_1, \dots, c_m$$ such that $$c_i$$ first appears in $$S$$ at position $$p(i).$$ Then $\theta(S,k) = \sum \theta(s_{p(i)+1}\dots s_n, k-1).$ This can be computed by dynamic programming in time $$O(n^2 k).$$ So in your example, \begin{align*} \theta(\text{"food"}, 2) &= \theta(\text{"ood"}, 1) + \theta(\text{"od"}, 1) + \theta(\text{""}, 1)\\ &= (\theta(\text{"od"}, 0) + \theta(\text{""}, 0)) + (\theta(\text{"d"}, 0) + \theta(\text{""}, 0)) + 0\\ &= (1 + 1) + (1 + 1)\\ &= 4. \end{align*}
11011110:
2012-08-09T23:27:15Z
Yes. I had a slightly different formulation in which I was counting the number of sets of $$k$$ positions, ending at some position $$i,$$ that are the lexicographically-first set of positions for the string they form. But I think it amounts to the same thing.
ext_73227:
2012-08-14T20:52:51Z

Cute! Do you know about the problem of finding the most common subsequence? For example in AAABBBCCC, the most common subsequence is ABC, which occurs 27 times.

I found a couple of external links related to the problem that you mentioned, but not this other one: http://www.sciencedirect.com/science/article/pii/S0304397508006245 http://www.spoj.pl/problems/CSUBSEQS/

It would be nice to put this in P or NPC. I'm also wondering: - there might be a tie for the most common subsequence (there are 4 in OOZE that appear twice) but is the longest of the most common ones unique? - is this map unimodal?: $$L$$ → maximum frequency of any length-$$L$$ subsequence

11011110:
2012-08-14T21:08:37Z

I didn't know about that problem before your comment, but it looks like an interesting one. Maybe worth asking again on http://cstheory.stackexchange.com/? I wouldn't be surprised if it turns out to be NPC, though.

ext_73227:
2012-08-14T23:36:09Z

Thanks for the reply, I have disproved my structural questions and posted the remaining main problem here: http://cstheory.stackexchange.com/questions/12307/commonest-subsequence

11011110:
2012-08-14T23:58:56Z

I voted +1 for your question there. Are the counterexamples for the structural questions easy to describe?

ext_73227:
2012-08-15T03:25:36Z

"12312312" disproves unimodality, since the maximums for length two to four are are 6 "12"s, 4 of several length-3 strings, and 5 of several length-4 strings

"crash after this" disproves uniqueness of the longest most frequent one, since the max frequency is 4, but there are two longest ones 'cra this' and 'crafthis'

I didn't try to distill them much, beyond finding them, though.