Counting distinct subsequences

Here's a nice dynamic programming exercise: find a polynomial-time algorithm that takes as input a sequence of \( n \) symbols, and a number \( k, \) and produces as output the number of distinct \( k \)-character subsequences.

For instance, if the input is the string "food" and the number \( k=2, \) the output should be \( 4. \) There are four distinct two-character subsequences of "food": they are "fo", "fd", "oo", and "od".

Comments:

ext_1343285:
2012-08-09T12:36:04Z

Construct a suffix tree, "cut" it at level \( k \) (counting characters, not nodes) and count the leaves.

11011110:
2012-08-09T15:46:04Z

Subsequences, not substrings.

sgkruk:
2012-08-09T18:07:53Z

(defun dsubseq (seq k)
  ;; (dsubseq '(F O O D ) 2)
  ;; 4
  (labels ((f (seq k)
             ;;  (format t "~%(f ~a ~a)" seq k)
             (cond ((= k 0) nil)
                   ((= (length seq) k) (list seq))
                   ((= k 1) (mapcar (lambda (x) (list x)) seq))
                   (t (mapcon (lambda (right) 
                                (mapcar (lambda (subseq)
                                          (cons (car right) subseq))
                                        (f (cdr right) (1- k))))
                              seq))))))
  (length (remove-duplicates (f seq k) :test #'equal)))

11011110:
2012-08-09T18:18:22Z

So what is the running time of this?

sgkruk:
2012-08-09T18:40:47Z

Oops. I wrote too fast. This is \( O(2^n) \) Sorry.

gareth_rees:
2012-08-09T23:05:45Z
Let \( \theta(S,k) \) be the number of distinct k-character subsequences in the string \( S=s_1\dots s_n. \) Clearly \( \theta(S,k)=1 \) if \( n=k \) or \( k=0 \) and \( \theta(S,k)=0 \) if \( n\lt k. \) Otherwise, suppose that \( S \) contains \( m \) distinct characters \( c_1, \dots, c_m \) such that \( c_i \) first appears in \( S \) at position \( p(i). \) Then \[ \theta(S,k) = \sum \theta(s_{p(i)+1}\dots s_n, k-1). \] This can be computed by dynamic programming in time \( O(n^2 k). \) So in your example, \[ \begin{align*} \theta(\text{"food"}, 2) &= \theta(\text{"ood"}, 1) + \theta(\text{"od"}, 1) + \theta(\text{""}, 1)\\ &= (\theta(\text{"od"}, 0) + \theta(\text{""}, 0)) + (\theta(\text{"d"}, 0) + \theta(\text{""}, 0)) + 0\\ &= (1 + 1) + (1 + 1)\\ &= 4. \end{align*} \]

11011110:
2012-08-09T23:27:15Z
Yes. I had a slightly different formulation in which I was counting the number of sets of \( k \) positions, ending at some position \( i, \) that are the lexicographically-first set of positions for the string they form. But I think it amounts to the same thing.

ext_73227:
2012-08-14T20:52:51Z

Cute! Do you know about the problem of finding the most common subsequence? For example in AAABBBCCC, the most common subsequence is ABC, which occurs 27 times.

I found a couple of external links related to the problem that you mentioned, but not this other one: http://www.sciencedirect.com/science/article/pii/S0304397508006245 http://www.spoj.pl/problems/CSUBSEQS/

It would be nice to put this in P or NPC. I'm also wondering: - there might be a tie for the most common subsequence (there are 4 in OOZE that appear twice) but is the longest of the most common ones unique? - is this map unimodal?: \( L \) → maximum frequency of any length-\( L \) subsequence

11011110:
2012-08-14T21:08:37Z

I didn't know about that problem before your comment, but it looks like an interesting one. Maybe worth asking again on http://cstheory.stackexchange.com/? I wouldn't be surprised if it turns out to be NPC, though.

ext_73227:
2012-08-14T23:36:09Z

Thanks for the reply, I have disproved my structural questions and posted the remaining main problem here: http://cstheory.stackexchange.com/questions/12307/commonest-subsequence

11011110:
2012-08-14T23:58:56Z

I voted +1 for your question there. Are the counterexamples for the structural questions easy to describe?

ext_73227:
2012-08-15T03:25:36Z

"12312312" disproves unimodality, since the maximums for length two to four are are 6 "12"s, 4 of several length-3 strings, and 5 of several length-4 strings

"crash after this" disproves uniqueness of the longest most frequent one, since the max frequency is 4, but there are two longest ones 'cra this' and 'crafthis'

I didn't try to distill them much, beyond finding them, though.