npc_gzip: Problem with accuracy calculation?

https://github.com/bazingagin/npc_gzip/blob/a46991564161023bba3b1267e0e74c69dab8f8eb/experiments.py#L116

It appears that the calc_acc method marks a sample correct if ANY of the labels tied for the highest count (most_count) matches the correct label.

For k=2 (I think the paper always used k=2?), any time there is a 1-1 tie (the top 2 have different labels), the sample is counted correct if EITHER label is correct, unlike a “fair” classifier that would have to commit to a decision before scoring.

This fork has my implementation of the kNN accuracy calculation for comparison; see the file docstring for notes: https://github.com/kts/npc_gzip/blob/main/calc_acc.py
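
For anyone reproducing this, here is a minimal sketch (my own, with hypothetical function names, not the repo's actual calc_acc) contrasting the optimistic tie scoring described above with a “fair” classifier that commits to a single prediction before looking at the test label:

```python
from collections import Counter
import random

def optimistic_score(neighbor_labels, true_label):
    """Scores 1 if ANY label tied for the highest count equals the truth
    (the behavior described above)."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = [lbl for lbl, c in counts.items() if c == top]
    return int(true_label in tied)

def fair_score(neighbor_labels, true_label, rng=random.Random(0)):
    """Majority vote with a random tie-break chosen WITHOUT looking at the truth."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = [lbl for lbl, c in counts.items() if c == top]
    prediction = rng.choice(tied)
    return int(prediction == true_label)
```

With k=2, every 1-1 tie makes optimistic_score count the sample as correct whenever either neighbor's label is correct, which is the inflation discussed in this issue.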

About this issue

  • State: open
  • Created a year ago
  • Reactions: 87
  • Comments: 25 (8 by maintainers)

Most upvoted comments

Thanks for the reply. Sorry if this added extra stress to your thesis day!

Now, I understand that this is the desired behavior rather than a bug.

Why do I call it top-2 accuracy? In the k=2 case, you find the two closest training points to your test point and look at their labels. Let’s call the labels C for correct and W for wrong, depending on whether they equal the test point’s label. There are 4 possibilities:

labels   voting       output
---------------------------------
WW       W wins 2-0   W (the two Ws could be different, but the result is the same)
WC       1-1 tie      C
CW       1-1 tie      C
CC       C wins 2-0   C

So, it’s equivalent to being correct if the correct label is one or both of the top 2. (I did try to mention in the blog that it’s not computing top-k for k>2, because it’s only looking at the labels TIED for the highest count, not all k.)
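
To make the four rows above concrete, here is a tiny self-contained check (the helper name is mine, not from the repo) that the “score if any tied label is correct” rule for k=2 reduces to “the true label appears among the two neighbors”:

```python
from collections import Counter

def optimistic_k2(labels, truth):
    # score 1 if the truth is among the labels tied for the highest count
    counts = Counter(labels)
    top = max(counts.values())
    return int(truth in [lbl for lbl, c in counts.items() if c == top])

for pair in [("W1", "W2"), ("W", "C"), ("C", "W"), ("C", "C")]:
    print(pair, optimistic_k2(list(pair), "C"), "C" in pair)
# the last two columns always agree: correct exactly when "C" is one of the top 2
```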

The problem is with the 1-1 ties. You need some method to break the tie, and in a “fair” classifier you cannot look at the test label. I think if you use k=1, the code is fine (there are never ties), but the results will necessarily be lower (and I believe k=2 was used for the results in the paper).

You’re right, I too have never seen “top-k accuracy” used in the context of kNN, so maybe the wording of “top-2” isn’t very helpful and I shouldn’t use that. It just seemed a concise way to describe the phenomenon described above. (Yes, it’s a bit different here because we’re talking about k points instead of k labels).

I’m not sure I followed everything you wrote, but maybe you agree with this and the main issue is just making that clearer in the paper. That’s possible, but I’d say the main point is: this method uses the test label as part of its decision process, which is not the standard classification setting and can’t be fairly compared to methods that don’t.

Okay, you’re free to use whatever wording you see fit and think readers will understand. I didn’t write the blog post or this issue to make you change the paper; it’s for others trying to recreate the results.

But I do maintain that using this approach (either looking at the test label or running multiple times with randomness and taking the best) makes the comparisons with other baselines unfair - i.e. it’s wrong to say “this beats BERT in these tests” - unless you get those results with a classifier that doesn’t look at the answer.

A probably bigger problem, affecting at least some of the datasets, was pointed out by someone here: https://github.com/bazingagin/npc_gzip/issues/13

Anyways, it’s still an interesting approach and paper. Cheers.

@bazingagin I was really excited about this paper, but I carefully checked the code and here I have to agree with @kts that the test labels are indeed used in the decision making process. The “maximum accuracy” you referred to is essentially computed with two steps for the WC (wrong-correct) tie cases: (1) choose the C label as the prediction using the ground-truth label (there is no other guaranteed way to choose a C label over a W label); (2) calculate the accuracy between the predictions and the ground truth labels.

So in a fair setting, there should never be a “maximum accuracy” (it is unfair itself unless other methods are also reporting the exactly same maximum accuracy); there should be one and only one accuracy between the predictions (each data sample should only have one final prediction) and the test labels. The process of obtaining the predictions should be independent of the test labels, for example, via majority voting or random selection.

Hope this helps. I think it is still a good paper after the fix and result update.

It seems to be the original intention of the authors (link): [tweet screenshot]

The question is: do the authors report top-2 accuracy for all the other methods? I found no code for them in this repository, but I may have missed something. And there seems to be nothing about it in the paper.

Because if you report top-1 accuracy for one set of methods and top-2 accuracy for another set of methods, it is not fair at all.

However, the paper is still very insightful and well-written, even if the numbers are incorrect.

(Edited): You can also compare the results of other methods on popular datasets against other papers. Here is AGNews, for instance: those results are similar to the paper’s and report top-1 accuracy, not top-2.

Okay, last comment on this 😃

Why report the max rather than the mean?

A “fair” classifier can certainly use randomness, and you may want to run it multiple times to get a better estimate of its accuracy - but, for that, use the mean over the trials.

I can define a classifier that ignores the data and randomly picks a class. It has a max possible accuracy of 100%.
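
A toy illustration of that point (entirely my own setup, nothing from the repo): the max over repeated random runs drifts upward, while the mean stays at the chance level.

```python
import random

rng = random.Random(0)
num_classes = 4
truth = [rng.randrange(num_classes) for _ in range(100)]   # 100 test labels

def random_guess_accuracy():
    # a "classifier" that ignores the data entirely
    preds = [rng.randrange(num_classes) for _ in truth]
    return sum(p == y for p, y in zip(preds, truth)) / len(truth)

trials = [random_guess_accuracy() for _ in range(1000)]
print(max(trials))                # well above 25%, and grows with more trials
print(sum(trials) / len(trials))  # stays near the 25% chance level
```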

@twjang Yes, but I guess an issue is that gzip will produce more ties, since its distance function is coarse-grained integer (afaik), while for sentence embeddings it is fine-grained float.

Thank you @kts for opening the issue, and sorry for the late reply! Re: your issue and blog: https://kenschutte.com/gzip-knn-paper/

When the flag rand=False, the logic of the code is to calculate the maximum accuracy when tied, so it does assume an ideal situation in which the guess is always right. This means:

  1. It’s not a bug.
  2. The logic of the code doesn’t report top k accuracy.
  3. This strategy is used for all non-parametric methods like W2V and SentenceBERT.

I want to elaborate on point 2. Let’s first assume “top k accuracy”[1] here means “the percentage of test samples whose nearest k labels contain the ground truth”. Point 2 is pretty clear when k=3: say we have y1=“sports”, y2=“sports”, y3=“tech”; with the code snippet we will have “sports” as the predicted class and we won’t score if the ground truth is “tech”. But for top-3 accuracy, we will score even if the ground truth is “tech”. So clearly, top-3 accuracy will give a higher score than our code snippet. This reasoning can easily be extrapolated to k>3 cases. I hope at this point everyone is at least convinced that, logically, the code snippet doesn’t report top-k accuracy.

So what’s the difference between k=3 and k=2? And why do you think the code is reporting top-2 accuracy? It’s because, numerically, the scores of “top-2 accuracy” and “k=2 with the maximum strategy when tied” are the same under definition [1] of “top k accuracy”. However, numerical equivalence doesn’t mean logical equivalence.

Besides, I don’t think definition [1] makes sense for kNN, and my reason is as follows. In general, “top-2 accuracy”[2] means that if either of the top-2 predicted classes is correct, we score. However, in kNN with k=2, there is a situation where the 2 nearest neighbors point to the same class - we are missing another class candidate if we want to report top-2 accuracy. For any discriminative model, say BERT, there is no situation where the top-2 predicted classes can be the same. For tasks like sentiment analysis with only “positive” and “negative” labels, this means any discriminative model’s top-2 accuracy will be 100%, while for kNN, top-2 accuracy will mostly be less than that. We would be implicitly using different definitions of top-k accuracy for kNN and for discriminative models. To the best of my knowledge, I’m not aware of definition [1] or any other specific definition of top-k accuracy for kNN.

The reason I chose k=2 in the first place is that it’s recommended to use $\sqrt{n}$, and I wanted to pick a k that’s compatible with the 5-shot setting. That being said, I should have reported accuracy under different values of k and different “tie strategies”. I think k=2 with the maximum strategy is the upper-bound accuracy we can get, but this doesn’t make our result invalid or serve as an unfair comparison with other methods, because of the reason I mentioned above. Please let me know your thoughts!
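
A small sketch of the k=3 example above (my own code; the label values match the comment, the variable names are mine):

```python
from collections import Counter

neighbors = ["sports", "sports", "tech"]   # y1, y2, y3
counts = Counter(neighbors)
top = max(counts.values())
tied = [lbl for lbl, c in counts.items() if c == top]   # ["sports"], no tie here

# "maximum when tied" scoring: correct only if the truth is among the tied top labels
print("tech" in tied)        # False -> not counted as correct
# top-3 accuracy under definition [1]: correct if the truth appears anywhere in the k neighbors
print("tech" in neighbors)   # True  -> would be counted as correct
```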

Also, I would love to add the “decrement” strategy to the repo and let people choose the strategy freely, if you’d like.

Overall, I can confirm the updated numbers that @bazingagin posted. For 5-shot only roughly, but that is due to the large deviation over runs (as is also evident from the large 95% confidence intervals in the table).

I do think gzip is quite interesting, and its performance in few-shot settings shows that advanced methods struggle in low-resource scenarios. However, it should also be noted that even simpler methods can provide good results; for anyone interested: https://arxiv.org/abs/2307.15002

Hey @maxmax1992, sure, I do agree it makes the accuracy of gzip inflated. So please check the updated version of the table: [updated results table image]

The table also includes the updated result for the corrected (non-overlapping) DengueFilipino dataset. So, considering only the random strategy, we can see:

  1. In the few-shot settings, gzip(rand) outperforms all methods on Kinyarwanda, Kirundi, Filipino, and SogouNews, but is worse than mBERT on Swahili, on which mBERT has been pretrained.
  2. In the full-dataset setting, gzip(rand) is highest only on SogouNews, is competitive on Kinyarwanda and Swahili, and performs poorly on Kirundi and the updated DengueFilipino.

Thanks @bazingagin !

Btw, actually you don’t need to run the random strategy; the expected accuracy (using a random guesser to break ties) can simply be calculated. For K=2, as @cyrilou242 mentions, it boils down to the average of the upper and lower bounds. For arbitrary K the accuracy will depend on the number of options in each tie, but it can also be calculated.
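
A sketch of that closed-form calculation (my own helper; it assumes you already have the k nearest labels per test sample): with a uniform random tie-break, each sample contributes 1/len(tied) when the true label is among the tied top labels, and 0 otherwise.

```python
from collections import Counter

def expected_accuracy(neighbor_labels_per_sample, true_labels):
    """Expected accuracy of majority voting with a uniform random tie-break,
    computed exactly instead of by averaging repeated random runs."""
    total = 0.0
    for labels, truth in zip(neighbor_labels_per_sample, true_labels):
        counts = Counter(labels)
        top = max(counts.values())
        tied = [lbl for lbl, c in counts.items() if c == top]
        if truth in tied:
            total += 1.0 / len(tied)   # chance the random tie-break picks the truth
    return total / len(true_labels)
```

For k=2 every tie is 1-1, so each tied sample contributes 1/2, and the total is exactly the average of the lower bound (ties count as wrong) and the upper bound (ties count as correct).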

@twjang Yes, this is a reasonable strategy. For k=2 (used in the paper), the only ties are 1-1, so that strategy is equivalent to k=1 (just take the nearest neighbor).

@flipz357 The actual distances used are floats (NCD) https://github.com/bazingagin/npc_gzip/blob/3c219944acfbe4e80d83166eea698ff9da3321af/utils.py#L6 but they’re derived from 3 integers (compressed string lengths), so you’re right, maybe it’s more susceptible to ties, depending on the dataset.
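
For reference, NCD here is the standard normalized compression distance built from three compressed lengths; a minimal sketch with gzip (variable names are mine, and the exact concatenation used in the repo may differ):

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + b" " + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Three integer lengths go in and one float comes out, so many pairs can still land on exactly the same value.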

Hey, thanks for the nice paper. My understanding is that @bazingagin has an information-theory standpoint, for which measuring the amount of information returned by the 2-NN is relevant. Hence the “maximum” accuracy, with accuracy being a proxy for information captured. It seems, as mentioned by @bazingagin, that other methods may be better suited to measuring this “information capture”.

@kts and @eric-xw focus on the end-to-end task and experimental setup, and argue that the accuracy numbers would not be reproducible in a real setup, because the label would not be available. The comparison becomes unfair to the other models in Table 3 that don’t have this label info to exploit.

I will add random results anyway.

I think it’d be more helpful if you’d provide:

  1. the minimum and maximum accuracy (no need for random then; random is, on average, the average of the two) - it’s easier to reproduce
  2. the accuracy with k=1 - again easy to reproduce, and it corresponds to a “practical classifier” (a rough sketch of both follows below)
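
A rough sketch of those numbers (helper name is mine; it assumes each test sample's neighbor labels are already sorted by distance):

```python
from collections import Counter

def bounds_and_k1(sorted_neighbor_labels, true_labels, k=2):
    """Returns (lower, upper, k1) accuracies:
    lower - ties count as wrong, upper - ties count as correct,
    k1    - plain nearest neighbor (no ties possible)."""
    lower = upper = k1 = 0
    for labels, truth in zip(sorted_neighbor_labels, true_labels):
        counts = Counter(labels[:k])
        top = max(counts.values())
        tied = [lbl for lbl, c in counts.items() if c == top]
        hit = truth in tied
        upper += int(hit)
        lower += int(hit and len(tied) == 1)
        k1 += int(labels[0] == truth)
    n = len(true_labels)
    return lower / n, upper / n, k1 / n
```

For k=2 the random-tie-break expectation is then just (lower + upper) / 2, as noted above.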