linguist: tokenizer.rb:17: [BUG] Segmentation fault

Moving over from discussions as this is definitely a bug.

Originally discussed in https://github.com/github/linguist/discussions/5876

<div type='discussions-op-text'>

Originally posted by ceshiyixiaba April 25, 2022

Git repository: https://github.com/ceshiyixiaba/languages

Docker environment, error message:

docker run --user=1000 --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist | more
/usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17: [BUG] Segmentation fault at 0x0000004700000057
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-musl]

-- Control frame information -----------------------------------------------
c:0026 p:---- s:0154 e:000153 CFUNC  :extract_tokens
c:0025 p:0007 s:0149 e:000148 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/tokenizer.rb:17
c:0024 p:0043 s:0144 e:000143 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:106
c:0023 p:0031 s:0136 e:000135 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:84
c:0022 p:0043 s:0129 e:000128 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/classifier.rb:22
c:0021 p:0011 s:0122 e:000121 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:32
c:0020 p:0026 s:0119 e:000118 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0019 p:0026 s:0113 e:000112 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:31 [FINISH]
c:0018 p:---- s:0108 e:000107 CFUNC  :each
c:0017 p:0020 s:0104 e:000103 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:29
c:0016 p:0026 s:0099 e:000098 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:100
c:0015 p:0042 s:0093 e:000092 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist.rb:24
c:0014 p:0026 s:0086 e:000085 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:368
c:0013 p:0041 s:0082 e:000081 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/lazy_blob.rb:70
c:0012 p:0030 s:0077 e:000076 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/blob_helper.rb:383
c:0011 p:0004 s:0073 e:000072 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:174
c:0010 p:0135 s:0066 e:000065 BLOCK  /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:164 [FINISH]
c:0009 p:---- s:0057 e:000056 CFUNC  :each_delta
c:0008 p:0158 s:0053 e:000052 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:149
c:0007 p:0038 s:0044 e:000043 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:116
c:0006 p:0032 s:0040 E:000ea8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/lib/linguist/repository.rb:68
c:0005 p:0176 s:0035 E:001bd8 METHOD /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:52
c:0004 p:0112 s:0016 e:000015 TOP    /usr/local/bundle/gems/github-linguist-7.20.0/bin/github-linguist:142 [FINISH]
c:0003 p:---- s:0013 e:000012 CFUNC  :load
c:0002 p:0112 s:0008 E:002000 EVAL   /usr/local/bundle/bin/github-linguist:23 [FINISH]
c:0001 p:0000 s:0003 E:000810 (none) [FINISH]

I tested different versions:

  • github-linguist 7.12.2: OK
  • github-linguist 7.13.0: Error
$ /var/lib/gems/2.7.0/gems/github-linguist-7.12.2/bin/git-linguist stats --commit=`git rev-parse HEAD`
{"OpenEdge ABL":116388}
$ /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats --commit=`git rev-parse HEAD`
free(): invalid next size (normal)
[1]    3231762 abort (core dumped)  /var/lib/gems/2.7.0/gems/github-linguist-7.13.0/bin/git-linguist stats

$ ruby -v
ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux-gnu]
$ uname -a
Linux VM-0-10-ubuntu 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
</div>

As pointed out in the discussion, 7.13 was the first release after @smola made changes to the tokenizer in https://github.com/github/linguist/pull/5205, https://github.com/github/linguist/pull/5186, and https://github.com/github/linguist/pull/5061.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Thanks @lildude! I was not able to spot the issue last time I tried debugging this.

I created https://github.com/github/linguist/pull/5969 with an alternative for punctuation tokenization, removing the use of trailing context.

No crash if I limit len to 16000.

small string len: 5857
small string len: 3771
small string len: 1543
small string len: 4500
big string len: 35813
small string len: 4918
big string len: 36085
small string len: 12933
small string len: 9457
big string len: 51200
big string len: 41033
90.54%  9496857    Objective-C
7.45%   781186     C
0.92%   96041      C++
0.57%   59335      HTML
0.24%   24697      Objective-C++
0.16%   16374      Shell
0.10%   10395      REXX
0.03%   2823       Perl
0.02%   1692       Ruby
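
The length-cap experiment mentioned above ("limit len to 16000") can be mimicked from the Ruby side. This is only a sketch under stated assumptions: `cap_bytes` and `DEFAULT_LIMIT` are hypothetical names that are not part of Linguist, the 16000 figure is simply the value reported in the comment, and the original experiment was presumably applied to `len` inside the tokenizer extension rather than in Ruby.

```ruby
# Hypothetical helper (not part of Linguist): truncate input to at most
# `limit` bytes before it would be handed to a tokenizer.
DEFAULT_LIMIT = 16_000 # value reported in the comment above, not a Linguist constant

def cap_bytes(data, limit = DEFAULT_LIMIT)
  # byteslice counts bytes rather than characters, which mirrors a C-side `len`.
  # Note: the cut may fall in the middle of a multi-byte character.
  data.byteslice(0, limit)
end

big   = "a" * 51_200 # same size as the largest "big string" in the log
small = "b" * 5_857  # same size as the first "small string" in the log

puts cap_bytes(big).bytesize   # 16000
puts cap_bytes(small).bytesize # 5857
```

Capping the input like this only hides the crash for large files; it does not address the underlying tokenizer bug.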