Chaotic evaluation of Sudoku difficulty

In my papers Nonrepetitive paths and cycles in graphs with application to Sudoku and Solving single-digit Sudoku subproblems I've argued (as have others) for judging the difficulty of Sudoku puzzles by mimicking human reasoning and looking at what sorts of deductions were necessary in the solution. But a recent paper on the arXiv takes a very different approach.

It's titled The Chaos Within Sudoku (arXiv:1208.0370), by Mária Ercsey-Ravasz and Zoltán Toroczkai, and it reports on a system for solving Sudoku puzzles based on transforming a discrete problem (with variables in {-1,1}) into a relaxation in which the variables are allowed to vary continuously on the real number line, setting up a system of differential equations that behaves chaotically except near a valid solution, and then letting the resulting dynamic system run. It's not magic; I assume it will take exponential time on some instances; but they say it eventually will find the solution. More precisely, there is some exponential distribution on the amount of time it will take for this system to switch from chaotic dynamics to convergence to a solution ("exponential" here being unfortunately completely unrelated to the exponential time we'd expect to see for the search for a solution to an NP-complete problem). And what Ercsey-Ravasz and Toroczkai seem to be proposing to do is to use (the logarithm of) the convergence time as a stand-in for the difficulty of a puzzle.

Since seeing this I've corresponded with the authors and shown them a set I've collected of the hardest puzzles my Sudoku software has generated, the ones it can't solve itself without backtracking and therefore rates as "impossible". On the scale of Ercsey-Ravasz and Toroczkai, these puzzles are only moderately difficult, with numerical scores mostly ranging between 2.0 and 2.5; in contrast, the hardest puzzles they collected from other sources have scores in the range from 3.0 to 3.6. I'm left wondering why this is. Some possibilities that come to mind: Maybe it's because we're measuring different things: I'm not looking for puzzles that score highly on their scale, so I'm not finding them. Maybe my program is too weak, and sets the threshold of impossibility too low; if I had a solver with even stronger deduction rules, and if I then collected problems that were impossible for it, I'd get higher scores. Or maybe the randomized search method my program uses to generate puzzles is not focused enough, and a more clever puzzle generation algorithm would yield more consistently difficult puzzles.