linguist: tokenizer.rb:17: [BUG] Segmentation fault

Moving over from discussions as this is definitely a bug.

Originally discussed in https://github.com/github/linguist/discussions/5876

Originally posted by ceshiyixiaba on April 25, 2022

git repository: https://github.com/ceshiyixiaba/languages

docker run --user=1000 --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist | more
/usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x0000004700000057
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-musl]

-- Control frame information -----------------------------------------------
c:0026 p:---- s:0154 e:000153 CFUNC  :extract_tokens
c:0025 p:0007 s:0149 e:000148 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17
c:0024 p:0043 s:0144 e:000143 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:106
c:0023 p:0031 s:0136 e:000135 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:84
c:0022 p:0043 s:0129 e:000128 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:22
c:0021 p:0011 s:0122 e:000121 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:32
c:0020 p:0026 s:0119 e:000118 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0019 p:0026 s:0113 e:000112 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:31 [FINISH]
c:0018 p:---- s:0108 e:000107 CFUNC  :each
c:0017 p:0020 s:0104 e:000103 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:29
c:0016 p:0026 s:0099 e:000098 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0015 p:0042 s:0093 e:000092 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:24
c:0014 p:0026 s:0086 e:000085 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:368
c:0013 p:0041 s:0082 e:000081 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/lazy_blob.rb:70
c:0012 p:0030 s:0077 e:000076 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:383
c:0011 p:0004 s:0073 e:000072 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:174
c:0010 p:0135 s:0066 e:000065 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:164 [FINISH]
c:0009 p:---- s:0057 e:000056 CFUNC  :each_delta
c:0008 p:0158 s:0053 e:000052 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:149
c:0007 p:0038 s:0044 e:000043 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:116
c:0006 p:0032 s:0040 E:000ea8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:68
c:0005 p:0176 s:0035 E:001bd8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:52
c:0004 p:0112 s:0016 e:000015 TOP    /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:142 [FINISH]
c:0003 p:---- s:0013 e:000012 CFUNC  :load
c:0002 p:0112 s:0008 E:002000 EVAL   /usr/local/bundle/bin/github-linguist:23 [FINISH]
c:0001 p:0000 s:0003 E:000810 (none) [FINISH]

I tested different versions:

github-linguist 7.12.2: OK
github-linguist 7.13.0: Error

$ /var/lib/gems/2.7.0/gems/github-linguist-7.12.2/bin/git-linguist stats --commit=`git rev-parse HEAD`
{"OpenEdge ABL":116388}
$ /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats --commit=`git rev-parse HEAD`
free(): invalid next size (normal)
[1]    3231762 abort (core dumped)  /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats

$ ruby -v
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-gnu]
$ uname -a
Linux VM-0-10-ubuntu 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
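
The version comparison above (7.12.2 works, 7.13.0 aborts) can be automated as a bisection over an ordered list of releases. The following is only a rough sketch: `first_bad_version` is a hypothetical helper, and the block passed to it is a stand-in for actually running `git-linguist stats` under a given gem version. It assumes that once a version fails, every later version fails too.

```ruby
# Hypothetical helper, not part of Linguist.
# Returns the first version for which the block returns true, assuming all
# earlier versions pass and all later versions fail. Returns nil if the
# newest version passes (nothing to find) or the list is empty.
def first_bad_version(versions)
  return nil if versions.empty?

  lo = 0
  hi = versions.length - 1
  return nil unless yield(versions[hi])

  while lo < hi
    mid = (lo + hi) / 2
    if yield(versions[mid])
      hi = mid       # mid fails, so the first failure is at mid or earlier
    else
      lo = mid + 1   # mid passes, so the first failure is later
    end
  end
  versions[lo]
end

# Only the versions mentioned in this report, oldest first.
versions = %w[7.12.2 7.13.0 7.20.0]

# Stand-in for "run git-linguist stats with this version and check whether it
# crashed". The results are hard-coded from the report above.
reported_crashes = %w[7.13.0 7.20.0]

result = first_bad_version(versions) { |v| reported_crashes.include?(v) }
puts result # prints 7.13.0
```

With a longer list of releases the same helper needs only a logarithmic number of runs, which matters when each run has to analyze a large repository.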


As pointed out in the discussion, 7.13.0 was the first release after @smola changed the tokenizer in https://github.com/github/linguist/pull/5205, https://github.com/github/linguist/pull/5186, and https://github.com/github/linguist/pull/5061.

About this issue

State: closed
Created 2 years ago
Comments: 15 (9 by maintainers)

Commits related to this issue

Less greedy tokenization, fixes segfault The tokenizer previously matched punctuation and symbols very greedily. To do so, it required trailing contexts to differentiate from some types of comments. ... — committed to smola/linguist by smola 2 years ago
Less greedy tokenization, fixes segfault (#5969) * Less greedy tokenization, fixes segfault The tokenizer previously matched punctuation and symbols very greedily. To do so, it required trailing ... — committed to github-linguist/linguist by smola 2 years ago
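
To illustrate what the commit message describes, here is a toy contrast between matching punctuation as a greedy run and matching it one character at a time. This is not the Linguist tokenizer or its grammar: the character class, the comment rule, and the `scan_tokens` helper are all made up for the example. It only shows why a greedy punctuation rule needs extra context to avoid swallowing a comment opener.

```ruby
require "strscan"

# Illustrative patterns only; they are not taken from Linguist.
COMMENT    = %r{/\*.*?\*/}m      # C-style block comment
WORD       = /\w+/
PUNCT_RUN  = %r{[-+*/=<>!&|]+}   # greedy: a whole run of punctuation
PUNCT_CHAR = %r{[-+*/=<>!&|]}    # less greedy: one character per token

# Hypothetical scanner: skips whitespace and comments, collects words and
# punctuation tokens using whichever punctuation pattern is passed in.
def scan_tokens(src, punct)
  s = StringScanner.new(src)
  tokens = []
  until s.eos?
    if s.skip(/\s+/) || s.skip(COMMENT)
      next
    elsif (t = s.scan(WORD) || s.scan(punct))
      tokens << t
    else
      s.getch # drop anything this toy scanner does not recognize
    end
  end
  tokens
end

src = "x=/* note */y"

p scan_tokens(src, PUNCT_RUN)  # ["x", "=/*", "note", "*/", "y"]
p scan_tokens(src, PUNCT_CHAR) # ["x", "=", "y"]
```

With the greedy pattern, the run starting at `=` also consumes `/*`, so the comment is never recognized unless the rule is given some form of lookahead. With one character per token, the comment rule gets its chance at the very next position and no lookahead is needed.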

Most upvoted comments

Thanks @lildude! I was not able to spot the issue last time I tried debugging this.

I created https://github.com/github/linguist/pull/5969 with an alternative for punctuation tokenization, removing the use of trailing context.

smola on Jul 10, 2022

No crash if I limit len to 16000.

small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
small string len: 4918
big string len: 36085
small string len: 12933
small string len: 9457
big string len: 51200
big string len: 41033
90.54%  9496857    Objective-C
7.45%   781186     C
0.92%   96041      C++
0.57%   59335      HTML
0.24%   24697      Objective-C++
0.16%   16374      Shell
0.10%   10395      REXX
0.03%   2823       Perl
0.02%   1692       Ruby

gpflaum on Jun 28, 2022
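
The experiment in the comment above (no crash once the length is limited to 16000) can be sketched as a small wrapper that caps the input before it reaches a tokenizer. Everything here is an assumption made for illustration: `LEN_LIMIT`, `capped`, and `size_label` are hypothetical names, and 16000 is just the value quoted in the comment, not a constant defined by Linguist. It is a way to reproduce the diagnostic, not a fix.

```ruby
# Value quoted in the comment above; an assumption, not a Linguist constant.
LEN_LIMIT = 16_000

# Hypothetical helper: return at most `limit` bytes of `data`.
# byteslice may cut a multibyte character in half, so scrub("") drops any
# invalid trailing bytes left behind by the cut.
def capped(data, limit = LEN_LIMIT)
  return data if data.bytesize <= limit

  data.byteslice(0, limit).scrub("")
end

# Hypothetical helper: label an input the way the debug output above does.
def size_label(data, limit = LEN_LIMIT)
  kind = data.bytesize > limit ? "big" : "small"
  "#{kind} string len: #{data.bytesize}"
end

sample = "a" * 35_813
puts size_label(sample)      # prints: big string len: 35813
puts capped(sample).bytesize # prints: 16000
```

Truncating input changes what the classifier sees, so a cap like this is only useful for narrowing down which inputs trigger the crash.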