linguist: tokenizer.rb:17: [BUG] Segmentation fault
Moving over from discussions as this is definitely a bug.
Originally discussed in https://github.com/github/linguist/discussions/5876
<div type='discussions-op-text'>Originally posted by ceshiyixiaba April 25, 2022
Git repository: https://github.com/ceshiyixiaba/languages
Docker environment, error message:
docker run --user=1000 --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist | more
/usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x0000004700000057
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-musl]
-- Control frame information -----------------------------------------------
c:0026 p:---- s:0154 e:000153 CFUNC :extract_tokens
c:0025 p:0007 s:0149 e:000148 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17
c:0024 p:0043 s:0144 e:000143 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:106
c:0023 p:0031 s:0136 e:000135 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:84
c:0022 p:0043 s:0129 e:000128 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:22
c:0021 p:0011 s:0122 e:000121 BLOCK /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:32
c:0020 p:0026 s:0119 e:000118 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0019 p:0026 s:0113 e:000112 BLOCK /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:31 [FINISH]
c:0018 p:---- s:0108 e:000107 CFUNC :each
c:0017 p:0020 s:0104 e:000103 BLOCK /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:29
c:0016 p:0026 s:0099 e:000098 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0015 p:0042 s:0093 e:000092 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:24
c:0014 p:0026 s:0086 e:000085 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:368
c:0013 p:0041 s:0082 e:000081 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/lazy_blob.rb:70
c:0012 p:0030 s:0077 e:000076 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:383
c:0011 p:0004 s:0073 e:000072 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:174
c:0010 p:0135 s:0066 e:000065 BLOCK /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:164 [FINISH]
c:0009 p:---- s:0057 e:000056 CFUNC :each_delta
c:0008 p:0158 s:0053 e:000052 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:149
c:0007 p:0038 s:0044 e:000043 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:116
c:0006 p:0032 s:0040 E:000ea8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:68
c:0005 p:0176 s:0035 E:001bd8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:52
c:0004 p:0112 s:0016 e:000015 TOP /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:142 [FINISH]
c:0003 p:---- s:0013 e:000012 CFUNC :load
c:0002 p:0112 s:0008 E:002000 EVAL /usr/local/bundle/bin/github-linguist:23 [FINISH]
c:0001 p:0000 s:0003 E:000810 (none) [FINISH]
I tested different versions:
- github-linguist 7.12.2: OK
- github-linguist 7.13.0: Error
$ /var/lib/gems/2.7.0/gems/github-linguist-7.12.2/bin/git-linguist stats --commit=`git rev-parse HEAD`
{"OpenEdge ABL":116388}
$ /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats --commit=`git rev-parse HEAD`
free(): invalid next size (normal)
[1] 3231762 abort (core dumped) /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats
$ ruby -v
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-gnu]
$ uname -a
Linux VM-0-10-ubuntu 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
</div>
As pointed out in the discussion, 7.13 was the first release after @smola's changes to the tokenizer in https://github.com/github/linguist/pull/5205, https://github.com/github/linguist/pull/5186, and https://github.com/github/linguist/pull/5061.
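The 7.12.2 vs 7.13.0 comparison above can be scripted so that a crashing version does not take the test harness down with it. The following is a minimal sketch, not anything from Linguist itself: `run_isolated` and `compare_versions` are hypothetical helper names, and the gem path pattern is copied from the report and will differ on other machines. It runs each binary in a child process and checks whether the child was killed by a signal. Both the Ruby `[BUG]` handler and glibc's `free(): invalid next size` typically end in an abort, so the reported signal may be SIGABRT rather than SIGSEGV.

```ruby
require "open3"

# Hypothetical helper: run a command in a child process and report how it ended.
# A crash in the child (segfault, abort) shows up as `crashed: true`
# instead of killing the calling script.
def run_isolated(*cmd)
  stdout, stderr, status = Open3.capture3(*cmd)
  {
    ok: status.success?,        # exited normally with status 0
    crashed: status.signaled?,  # terminated by an uncaught signal
    signal: status.termsig,     # signal number, or nil if not signaled
    stdout: stdout,
    stderr: stderr
  }
end

# Hypothetical driver: try each installed version against one commit.
# The path pattern is an assumption taken from the report above.
def compare_versions(versions, commit)
  versions.each_with_object({}) do |version, results|
    bin = "/var/lib/gems/2.7.0/gems/github-linguist-#{version}/bin/git-linguist"
    next unless File.executable?(bin)
    results[version] = run_isolated(bin, "stats", "--commit=#{commit}")
  end
end

# Example call (run from inside the affected repository):
#   compare_versions(%w[7.12.2 7.13.0], `git rev-parse HEAD`.strip)
```

Passing the command as separate arguments avoids going through a shell, so the commit hash and paths are not re-interpreted.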
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (9 by maintainers)
Commits related to this issue
- Less greedy tokenization, fixes segfault The tokenizer previously matched punctuation and symbols very greedily. To do so, it required trailing contexts to differentiate from some types of comments. ... — committed to smola/linguist by smola 2 years ago
- Less greedy tokenization, fixes segfault (#5969) * Less greedy tokenization, fixes segfault The tokenizer previously matched punctuation and symbols very greedily. To do so, it required trailing ... — committed to github-linguist/linguist by smola 2 years ago
Thanks @lildude! I was not able to spot the issue last time I tried debugging this.
I created https://github.com/github/linguist/pull/5969 with an alternative for punctuation tokenization, removing the use of trailing context.
No crash if I limit len to 16000.
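The comment does not say where `len` was limited. Assuming it refers to the number of bytes handed to the tokenizer, a rough Ruby-level analogue for anyone stuck on an affected release would be to cap the input before tokenizing it. This is only a sketch of that idea: `cap_bytes` is a hypothetical helper, not part of Linguist, and the 16000 default is simply the figure from the comment, not a verified safe bound.

```ruby
# Hypothetical helper, not part of Linguist: cap input at `limit` bytes.
def cap_bytes(data, limit = 16_000)
  return data if data.bytesize <= limit

  # byteslice can cut a multibyte character in half; scrub("") drops any
  # invalid byte sequences, including such a broken tail.
  data.byteslice(0, limit).scrub("")
end
```

Truncating changes what the classifier sees, so this trades accuracy on large files for stability; upgrading to a release that includes the fix from https://github.com/github/linguist/pull/5969 is the proper remedy.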