# Counting distinct subsequences

Here's a nice dynamic programming exercise: find a polynomial-time algorithm that takes as input a sequence of \( n \) symbols, and a number \( k, \) and produces as output the number of distinct \( k \)-character subsequences.

For instance, if the input is the string "food" and the number \( k=2, \) the output should be \( 4. \) There are four distinct two-character subsequences of "food": they are "fo", "fd", "oo", and "od".

### Comments:

**ext_1343285**:

**2012-08-09T12:36:04Z**

Construct a suffix tree, "cut" it at level \( k \) (counting characters, not nodes) and count the leaves.

**11011110**:

**2012-08-09T15:46:04Z**

Subsequences, not substrings.

**sgkruk**:

**2012-08-09T18:07:53Z**

**11011110**:

**2012-08-09T18:18:22Z**

So what is the running time of this?

**sgkruk**:

**2012-08-09T18:40:47Z**

Oops. I wrote too fast. This is \( O(2^n) \) Sorry.

**gareth_rees**:

**2012-08-09T23:05:45Z**

Let \( \theta(S,k) \) be the number of distinct

*k*-character subsequences in the string \( S=s_1\dots s_n. \) Clearly \( \theta(S,k)=1 \) if \( n=k \) or \( k=0 \) and \( \theta(S,k)=0 \) if \( n\lt k. \) Otherwise, suppose that \( S \) contains \( m \) distinct characters \( c_1, \dots, c_m \) such that \( c_i \) first appears in \( S \) at position \( p(i). \) Then \[ \theta(S,k) = \sum \theta(s_{p(i)+1}\dots s_n, k-1). \] This can be computed by dynamic programming in time \( O(n^2 k). \) So in your example, \[ \begin{align*} \theta(\text{"food"}, 2) &= \theta(\text{"ood"}, 1) + \theta(\text{"od"}, 1) + \theta(\text{""}, 1)\\ &= (\theta(\text{"od"}, 0) + \theta(\text{""}, 0)) + (\theta(\text{"d"}, 0) + \theta(\text{""}, 0)) + 0\\ &= (1 + 1) + (1 + 1)\\ &= 4. \end{align*} \]

**11011110**:

**2012-08-09T23:27:15Z**

Yes. I had a slightly different formulation in which I was counting the number of sets of \( k \) positions, ending at some position \( i, \) that are the lexicographically-first set of positions for the string they form. But I think it amounts to the same thing.

**ext_73227**:

**2012-08-14T20:52:51Z**

Cute! Do you know about the problem of finding the most common subsequence? For example in AAABBBCCC, the most common subsequence is ABC, which occurs 27 times.

I found a couple of external links related to the problem that you mentioned, but not this other one: http://www.sciencedirect.com/science/article/pii/S0304397508006245 http://www.spoj.pl/problems/CSUBSEQS/

It would be nice to put this in P or NPC. I'm also wondering: - there might be a tie for the most common subsequence (there are 4 in OOZE that appear twice) but is the longest of the most common ones unique? - is this map unimodal?: \( L \) → maximum frequency of any length-\( L \) subsequence

**11011110**:

**2012-08-14T21:08:37Z**

I didn't know about that problem before your comment, but it looks like an interesting one. Maybe worth asking again on http://cstheory.stackexchange.com/? I wouldn't be surprised if it turns out to be NPC, though.

**ext_73227**:

**2012-08-14T23:36:09Z**

Thanks for the reply, I have disproved my structural questions and posted the remaining main problem here: http://cstheory.stackexchange.com/questions/12307/commonest-subsequence

**11011110**:

**2012-08-14T23:58:56Z**

I voted +1 for your question there. Are the counterexamples for the structural questions easy to describe?

**ext_73227**:

**2012-08-15T03:25:36Z**

"12312312" disproves unimodality, since the maximums for length two to four are are 6 "12"s, 4 of several length-3 strings, and 5 of several length-4 strings

"crash after this" disproves uniqueness of the longest most frequent one, since the max frequency is 4, but there are two longest ones 'cra this' and 'crafthis'

I didn't try to distill them much, beyond finding them, though.