ebisu: Half-life does not meaningfully increase after reps in some conditions

Apologies in advance if this is a non-issue. My hunch is this is a failure on my part to understand the methods or documentation.

I’m coming from Anki and Memrise. After each rep, Anki predicts that your “half-life” increases significantly, maybe doubling or more. Memrise seems to follow a similar pattern.

When I test how ebisu behaves under similar conditions, review spacing does increase, but only very gradually.

The following short code assumes I’m reviewing a card every time it hits around 75% success. Suppose I succeed every single time. The code then prints the ratio between the prior review period and the next review period.

import ebisu

# Start with alpha = beta = 3 and a halflife of 1 time unit.
model = (3, 3, 1)

m2pd = ebisu.modelToPercentileDecay
ur = ebisu.updateRecall

new_model = model
last_test_time = 0

for i in range(30):
    # Review at the moment predicted recall decays to 75%, and always pass (1 out of 1).
    test_time = m2pd(new_model, 0.75)
    new_model = ur(new_model, 1, 1, test_time)
    if i > 2:
        # Ratio of this review interval to the previous one.
        print(round(test_time / last_test_time, 3))
    last_test_time = test_time

Based on Ebbinghaus’s work and my own performance in Anki, I’d expect those review periods to more than double every time, but I’m not seeing that. Each interval grows to maybe 110% of the previous one, usually less.

I take your point from another comment that you don’t like scheduling reviews: ebisu’s strength is that it frees you from scheduling.

But this seems like it would still be an issue even with unscheduled reviews: ebisu would predict that very strong memories have fallen into the worst decile much sooner than it should.

I’m probably just missing a core aspect of the algorithm, so sorry for the confusion. Maybe you manually double t after each review, or use t as a coefficient in some other backoff function; I’m not sure.

I’d appreciate a heads-up on where I went wrong, or let me know if this behavior is just expected. Maybe the algorithm is purely backwards-looking and doesn’t try to take into account a rep’s ability to strengthen a memory.

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

I apologize for being literally this guy:

[image: Flash, the sloth from Zootopia]

when it comes to figuring out a solution to this issue (closely related to #43). I experimented with some heavyweight solutions before stepping back to see the big picture and coming up with https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py

Before talking about that, here’s a recap of the problem. I’ve given some explanation in #43 of how I see the problem that @cyphar saw and raised there, but here’s the fundamental issue as I see it.

Suppose we learn a fact at midnight and model our memory with Ebisu model [a, b, t], i.e., recall probability t hours after midnight is Beta(a, b). Then one hour later we do a quiz, and call ebisu.updateRecall to get a new [a2, b2, t2] model. I didn’t realize this all these years until @cyphar patiently broke it down for me on Reddit a few months ago, but the new posterior model still only refers to recall t2 hours after midnight. It doesn’t encode our belief as of now, after an hour has elapsed. Ebisu generates an increasingly accurate estimate of recall relative to midnight without ever re-anchoring the model to an hour after midnight, a day after, etc., which is why we saw

  • slow growth in halflife if you review when recall probability dips to 80%,
  • Ebisu predicting much less than 5% recall for cards that you were getting correct, and
  • finding that the maximum-likelihood estimate of the initial halflife for real cards was 10’000 hours.

So we need some way to convert a posterior for quizzes after midnight to a posterior for quizzes after 1 am, and that’s what both @brownbat above and others have asked for.

Forgive me for being so slow to understand this and to think of a solution! My incompetence at mathematics is truly gargantuan.

I don’t yet have a great solution. But in https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py I have a framework to help evaluate possible ways to do this translation from midnight to after midnight of our belief on recall.

I picked a very Anki-like translation: after ebisu.updateRecall, just boost the resulting model’s halflife by a fixed factor. The code is a bit fancier; see these lines, which show how:

  • for failed quizzes, we don’t boost the output of updateRecall (boost = 1.0),
  • for hard quizzes, we boost halfway between 1.0 and the base boost (1.2 if the base boost is 1.4, i.e., 1.4 - (1.4 - 1)/2),
  • for normal quizzes, we boost the halflife by the base boost (e.g., 1.4),
  • for easy quizzes, we boost the halflife by more than the base boost (1.6, i.e., 1.4 + (1.4 - 1)/2).
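
For concreteness, here’s a minimal sketch of that boost rule as a wrapper around ebisu.updateRecall (the function name, the 0–3 result scale, and the exact arithmetic are my own illustration, not demo.py’s actual API):

import ebisu

def boostedUpdate(model, result, elapsedHours, baseBoost=1.4):
    # result: 0 = failed, 1 = hard, 2 = normal, 3 = easy (illustrative scale)
    successes = 0 if result == 0 else 1
    alpha, beta, halflife = ebisu.updateRecall(model, successes, 1, elapsedHours)
    boosts = [
        1.0,                              # failed: no boost
        1.0 + (baseBoost - 1) / 2,        # hard: halfway between 1.0 and the base boost
        baseBoost,                        # normal: the base boost
        baseBoost + (baseBoost - 1) / 2,  # easy: more than the base boost
    ]
    # Scaling the model's time parameter scales its halflife by the same factor.
    return (alpha, beta, halflife * boosts[result])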

Obviously this can be greatly improved. The goal of Ebisu is to not use magic numbers, to use statistical analysis to estimate these numbers, etc. But https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py includes a bunch of machinery to evaluate this and other proposed ways to update Ebisu posteriors. If you can think of a better way to boost the models after quizzes, we can test it here.

We do this by testing the proposed changes on real data and computing probabilistic likelihoods. In a nutshell, what demo.py does is:

  • loads an Anki database (I had my old collection.anki2, a SQLite database),
  • groups together reviews from the same card (it throws out cards with too few reviews or too few correct reviews),
  • for a given initial model ([initialAlphaBeta, initialAlphaBeta, initialHalflife]) and some baseBoost, it sums up the log-probabilities returned by ebisu.predictRecall for each quiz (see the sketch below). This is the likelihood of that model (initialAlphaBeta, initialHalflife, and baseBoost).
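
Roughly, that per-card likelihood computation looks like this (a sketch under my own assumptions; the variable names, the quiz format, and the handling of failed quizzes are illustrative, not demo.py’s exact code):

import math
import ebisu

def cardLogLikelihood(quizzes, initialAlphaBeta, initialHalflife, baseBoost):
    # quizzes: list of (hoursSinceLastQuiz, passed) pairs for one card (illustrative format)
    model = (initialAlphaBeta, initialAlphaBeta, initialHalflife)
    loglik = 0.0
    for hours, passed in quizzes:
        p = ebisu.predictRecall(model, hours, exact=True)
        # Score this quiz: log(p) for a success, log(1 - p) for a failure
        # (the failure case is my assumption about how misses are scored).
        loglik += math.log(p) if passed else math.log(1.0 - p)
        # Bayesian update, then the fixed Anki-like boost on successes only.
        alpha, beta, t = ebisu.updateRecall(model, int(passed), 1, hours)
        model = (alpha, beta, (t * baseBoost) if passed else t)
    return loglik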

Then we sweep over different values of these parameters, initialAlphaBeta, initialHalflife, and baseBoost, and we can make plots that look like this:

[plot: log-likelihood vs. initial halflife, one curve per boost value from 1.0 to 2.0]

This plot shows, for a range of initial halflives (x axis), and a few different boosts (1.0 to 2.0, shown as different colors), the likelihood for a specific card I had with 27 quizzes (23 of them correct). (I fixed initialAlphaBeta=2 because it doesn’t really matter.) Some notes:

  • The blue curve above corresponds to a boost of 1.0, i.e., the current Ebisu case. We don’t even know the maximum likelihood for that case: the curve is still climbing at an initial halflife of 1000 hours, which is obviously wrong.
  • But for boosts greater than 1.0, we see the likelihood curves peaking. For likelihood, higher is better, and for boosts 1.4 and 1.7 we reach the same maximum likelihood (it’s actually log-likelihood, hence the -14; the value itself doesn’t matter, since the sum of log-probabilities is just the log of the product of probabilities).
  • This tells us that mindlessly boosting the halflife after each review considerably improves the accuracy and believability of the algorithm.

https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py will also generate bigger charts, like this:

[image: larger grid of example likelihood plots]

If you run it as is, demo.py will look for collection.anki2 (which is a file inside the APKG files that Anki generates—APKG is just a zip file, so if you unzip it, you’ll get this collection.anki2 SQLite database plus your images/sounds), load the reviews that correspond to actual undeleted cards, generate a training vs testing set (important for accurately benchmarking competing algorithms), calculate the likelihoods for a bunch of different halflife-boost combinations, and make a few plots.
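
For reference, pulling the raw reviews out of that SQLite file looks roughly like this (a sketch using Anki’s standard revlog schema; not necessarily demo.py’s exact query):

import sqlite3

# revlog.id is the review timestamp in epoch milliseconds, revlog.cid is the card id,
# and ease == 1 means the "Again" button, i.e., a failed review. Joining against the
# cards table drops reviews whose card has since been deleted.
conn = sqlite3.connect("collection.anki2")
rows = conn.execute("""
    select revlog.cid, revlog.id, revlog.ease
    from revlog join cards on revlog.cid = cards.id
    order by revlog.cid, revlog.id
""").fetchall()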

I’m planning on finding a better method to boost the models after Ebisu’s update, but the way I’ll know that they’re better is that they’ll achieve higher likelihoods on more flashcards than worse methods.

Ideas right now:

  • instead of a static baseBoost applied after all quizzes, the boost needs to be dynamic and time-sensitive. I.e., if you review a mature card five times in five minutes, you shouldn’t be boosting the halflife by 1.4**5 = 5.4 (see the sketch after this list).
  • Take into account the most recent reviews, within a time window. If all are successes, boost halflife by some value greater than 1. If all are failures, boost halflife by some value less than 1.
  • For all these “magic numbers”, including the max boost, min boost, number of recent reviews to consider, etc., offer an Ebisu function that calculates these values for a card given your quiz history. I am not at all sure what that API would look like, but the hope is that we can change updateRecall to do a quick, coarse local update of the model, and then maybe once a day or once a week you can run a recalibrate function that takes all quizzes for this flashcard (or all flashcards) and updates these magic numbers by finding the values that maximize likelihood.
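
As one possible shape for that first idea (purely a sketch on my part, not a settled design): scale the boost toward 1.0 when the quiz happened well before the current halflife had elapsed, so rapid-fire reviews of a mature card barely move it.

def timeSensitiveBoost(baseBoost, elapsedHours, currentHalflifeHours):
    # Interpolate between no boost (1.0) and the full baseBoost according to how much
    # of the current halflife had elapsed at quiz time: five reviews of a mature card
    # within five minutes then each multiply the halflife by roughly 1.0, not baseBoost.
    fraction = min(1.0, elapsedHours / currentHalflifeHours)
    return 1.0 + (baseBoost - 1.0) * fraction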

I know the script is pretty long, as is this comment, but I wanted to share some detailed thoughts and code about how I’m planning to evaluate proposed algorithms for boosting models after Ebisu’s posterior update, i.e., detailing how to use likelihood to evaluate parameters and algorithms.

I really like how Ebisu right now is an automatic algorithm that just works given a couple of input numbers, and I’d like to find a way to do this boosting that retains the minimal mathematical nature of Ebisu currently, but we shall see!

I put some setup and run instructions at the top of https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py, if you have time, please check it out!

With the (still in beta) v3 Anki scheduler, you can implement custom scheduling plugins in JavaScript. This would allow you to use custom schedulers even with AnkiDroid and the iOS version of Anki.

I suspect once the work on ebisu is finished, it’d be fairly easy to port the code to the v3 scheduler (and update the existing ebisu add-on). If no-one else is planning to do it, I’d be happy to.

Ah, ok, very helpful!

So the model is:

  1. Assume the forgetting rate is constant,*
  2. Use Bayes to gradually gain confidence in our prediction of that forgetting rate.

And, sure, we all know that in reality the forgetting rate isn’t constant at first. But ebisu’s predictions strengthen slowly over time, and the forgetting rate may eventually hit a plateau and become constant, so the two converge.

The downside: very inaccurate predictions at first, maybe for the majority of the reps in a card’s life. The workaround: ignore that completely and focus on the sorting of cards rather than the prediction numbers. Just do the most at-risk cards first, and don’t bother looking at the odds.

Is that right?

That seems perfectly reasonable for organizing reviews and workflows, and might be all most folks really need.
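
For concreteness, that sorting workflow might look something like this (a minimal sketch; ebisu.predictRecall returns a log-probability by default, which is fine for ranking):

import ebisu

def mostAtRiskFirst(cards):
    # cards: list of (model, hoursSinceLastReview) pairs (illustrative format).
    # Lower predicted recall = more at risk, so sort ascending and review from the front.
    return sorted(cards, key=lambda card: ebisu.predictRecall(card[0], card[1]))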

But it seems like you could pretty easily get much more accurate predictions throughout the entire lifecycle of a card, if you wanted that. It shouldn’t change your workflow much, so maybe it’s just cosmetic. But more accurate predictions seem useful for their own sake, if you can get them cheaply.

To do that… suppose that instead of predicting a fixed half-life, you predict some coefficient that the half-life is multiplied by after each review.

So here’s what that model looks like:

  1. Assume most memories are “well-behaved,”† and decay at a half-life that doubles with each review (i.e., coefficient = 2).
  2. Use Bayes to predict/test/adjust that coefficient.
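
A toy sketch of what such a wrapper might look like, with the “use Bayes” step stubbed out as a crude multiplicative nudge (everything below is hypothetical, not a worked-out estimator):

import ebisu

class CoefficientWrapper:
    # Hypothetical wrapper: an Ebisu model plus a point estimate of the per-review
    # halflife multiplier, starting at the "well-behaved" guess of 2.
    def __init__(self, model=(3.0, 3.0, 1.0), coefficient=2.0):
        self.model = model
        self.coefficient = coefficient

    def update(self, passed, elapsedHours):
        alpha, beta, t = ebisu.updateRecall(self.model, int(passed), 1, elapsedHours)
        # Stretch the halflife by the current coefficient estimate on a success.
        self.model = (alpha, beta, (t * self.coefficient) if passed else t)
        # Placeholder for "use Bayes to adjust that coefficient": nudge it up on
        # successes and down on failures, never letting it drop below 1 (a fixed interval).
        self.coefficient = max(1.0, self.coefficient * (1.05 if passed else 0.8))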

In the beginning, half-life predictions would be much more accurate: early memories (per Ebbinghaus) tend to increase exponentially in duration. In the long run, memories may revert to a fixed interval, like a year, and at that point the coefficient would need to slowly slide back down to 1, giving a fixed interval.

The biggest improvement on your workflow would be that very new cards that you perform very well on will drop in priority more quickly, which should give you more useful work per review.

The biggest risk would be that the coefficient doesn’t revert to 1 quickly enough as the memory matures, leading to very long intervals with no new data to reveal that those cards aren’t getting reviewed enough.

I am NOT/NOT recommending any change to ebisu, which is really well implemented and produces very consistent output for lots of its users now.

But… I might try to implement a wrapper targeting increasing half-lives for one of my own projects. Happy to let you know how it goes if it sounds intriguing at all.

* Well, ok, at least that the half-life period is a consistent interval, even if decay is a curve.
† The first assumption here is also very imperfect! But it should be accurate in more situations, I think many more.