linguist: linguist-language=xyz appears to be ignored on GitHub if 'xyz' is not a known language. Proposal: make it appear as unknown language

linguist-language=xyz appears to be ignored on GitHub if xyz is not a known language. At first glance, of course, this seems like very reasonable behavior. I’m not saying it’s a bug, but I suggest it might be worth changing:

Your policy for including a language in Linguist is “we prefer that each new file extension be in use in hundreds of repositories before supporting them in Linguist”, which is very reasonable; otherwise you’d have hundreds of obscure languages that die after a few months piled up in your language files. However, this means that for small new language projects, the statistics of what they are written in might be grossly incorrect without any indication of that in the UI.

Therefore I suggest the following:

“*.myl linguist-language=my-language” in .gitattributes for an unknown my-language should result in a stats entry of “Unknown language” for all .myl files in the repository. This gives any reader of the stats an indication that there is a language involved that github doesn’t know, while avoiding the pitfalls of not having a proper language color, syntax highlighting, language id or all the other things that are not available because the language isn’t actually in lib/linguist/languages.yml.
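The proposed fallback can be sketched roughly as follows. This is only a Python illustration of the idea, not how Linguist (a Ruby library) is implemented: `KNOWN_LANGUAGES`, `parse_overrides`, and `classify` are hypothetical names, the set of known languages is a tiny stand-in for languages.yml, and `fnmatch` on the file name is a simplification of real gitattributes pattern matching.

```python
from fnmatch import fnmatch

# Hypothetical stand-in for the language names defined in languages.yml.
KNOWN_LANGUAGES = {"Ruby", "Python", "Smalltalk", "Text"}
UNKNOWN = "Unknown language"


def parse_overrides(gitattributes_text):
    """Collect (pattern, language) pairs from linguist-language attributes."""
    overrides = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        for attr in attrs:
            if attr.startswith("linguist-language="):
                overrides.append((pattern, attr.split("=", 1)[1]))
    return overrides


def classify(path, overrides):
    """Return the overriding language for path, or None if no override matches.

    Proposed behavior: an override that names an unrecognized language maps
    to UNKNOWN instead of being silently ignored.
    """
    result = None
    file_name = path.rsplit("/", 1)[-1]
    for pattern, language in overrides:  # the last matching line wins
        if fnmatch(file_name, pattern):
            result = language if language in KNOWN_LANGUAGES else UNKNOWN
    return result


overrides = parse_overrides("*.myl linguist-language=my-language\n")
print(classify("src/main.myl", overrides))
```

With this sketch, every .myl file lands in a single “Unknown language” bucket, so the stats can account for it without the name `my-language` ever being displayed.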

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 20
  • Comments: 30 (11 by maintainers)

Most upvoted comments

I commonly use a couple of specialty languages in my research. These languages are not known to Linguist. That’s fine. I just want to be able to configure the override file in my repos so that (for my own repos only) the language bar shows the names of these specialty languages.

Just noting that I changed the repository above to lie to Linguist, claiming my toy language is Smalltalk - because the portion of the unreported files started getting on my nerves. (It’s not, and the syntax highlighting doesn’t work right, but … meh.)

Which spurred me to take another try on this windmill.

One option that should be simple, from what I understand of how Linguist is structured, would be to add a few dummy languages called, e.g., “Custom”, “DSL”, and “Unknown”, all highlighted as text. The repositories suffering from this issue could specify *.xyz linguist-language=Custom, and 99% of the issue would be gone.

  • It would be opt-in, so no one who doesn’t want it gets it.
  • It would be safe, because users cannot control the name, or any of the code or rules associated with it.
  • It should be simple, just one or few entries added to languages.yml.

If this sounds like it would be acceptable to Linguist maintainers, I’m happy to submit a PR providing those dummy languages.

Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

My understanding was that the Unknown Language would not be clickable in the language bar, as is already the case for the Other statistic. Then, the Unknown Language would have tm_scope: none syntax highlighting. Any unrecognized linguist-language override would be assigned the Unknown Language.

Considering this, are there any other changes on GitHub’s side besides supporting a special Unknown language entry that is only shown in the language bar? The update of statistics, the mapping of files to that language, and the selection of syntax highlighting are all on Linguist’s side.

Looks a lot like github is discriminating against little known languages, with a bogus argument that “lildude-is-a-plonker” could be abused this way, but that doesn’t prevent “lildude-is-a-plonker” emerging as a well known language and you’d have the same problem. And if you don’t know how to untaint a user supplied string in a webpage, you shouldn’t be allowed to maintain this project…

It would be nice if the language bar used the name provided, just for display. That way it would be more friendly for users of obscure or application specific languages.

However, if that would be too much work, simply lumping it all together as “unknown language” would be a big step up from where it is now.

I came here to ask for this exact thing. 😄 I have zero expectation of Linguist adding my toy languages to its languages.yml (that would be terrible), but it is also somewhat nasty to have zero accounting of files unknown to Linguist.

At the moment they’re not even accounted for in the “Other” percentage.

Eg: https://github.com/nikodemus/foolang reports Other 1.3%.

Currently 13.25% of that repository is in a language unknown to Linguist.

If it said Other 14.55% that would seem much more reasonable and useful to me.
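The arithmetic behind that figure can be sketched as below. The byte counts are made-up values chosen only to reproduce the percentages quoted above, not measurements of the repository; `language_shares` and the bucket names are hypothetical, and the sketch treats every percentage as a share of all bytes in the repository, as the comment does.

```python
UNRECOGNIZED = "(unrecognized)"


def language_shares(byte_counts, count_unrecognized=False):
    """Return each reported bucket's share of all bytes, as a percentage."""
    total = sum(byte_counts.values())
    reported = {}
    for name, size in byte_counts.items():
        if name == UNRECOGNIZED:
            if not count_unrecognized:
                continue  # behavior described above: not reported at all
            name = "Other"  # suggested behavior: fold into "Other"
        reported[name] = reported.get(name, 0) + size
    return {name: round(100 * size / total, 2) for name, size in reported.items()}


# Illustrative numbers only (assumed, not taken from the real repository).
sample = {"Recognized languages": 8545, "Other": 130, UNRECOGNIZED: 1325}

print(language_shares(sample)["Other"])
print(language_shares(sample, count_unrecognized=True)["Other"])
```

The first call reports only the 1.3% that is visible today; the second adds the 13.25% of unrecognized files to it.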

Being able to locally configure the language name via .gitattributes would be supernice, but just accounting for the files in the first place would be enough – there’s already plenty of mechanisms for telling Linguist to ignore files if this is undesirable in some cases.

Allowing another person to potentially affect the appearance (the language can potentially appear on any repo that Linguist identifies as that language due to the current implementation) of another user’s repo, unchecked, is a whole different story.

That’s the behavior for known languages that are detected by Linguist, but isn’t the suggestion here the ability to override specific files with a potentially unknown language?

I cannot foresee any security risk or other impact to other repositories with that proposal. Merely discoverability when doing searches and that’d be beneficial not detrimental.

Just to reinforce my position: why should it be “unknown”? Let it be whatever language people specify it is. They’re already going through the trouble of specifying it manually. Controlling language dialects appears overzealous to me.

No change. The last part of my last comment still stands:

Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

In short, this won’t be implemented until such time as GitHub’s Product team commission a project for it.

I understand. What about the idea posted above by @nikodemus to add one custom language to Linguist called “Unknown”, “Custom”, “DSL”, or “Other”?

That would allow users to do *.xyz linguist-language=Custom, which I think would satisfy those of us interested in this issue, while not requiring Github to support custom names in the language bar.

I think this is a good, practical idea. @lildude What do you think? Is there anything preventing this on GitHub’s side?

It does indeed sound like a good idea. I don’t know off the top of my head if there’d be a problem on the GitHub side of things; I’d need to whip something up and then experiment to be 100% sure. I don’t have the bandwidth for this at the moment, but will add it to my ToDo list.

Linguist is responsible for language detection. If that language is explicitly specified in .gitattributes, that makes its job even easier and it can simply use that instead of guessing with heuristics.

The final breakdown that the tool produces, and the bar shown on the GitHub website are separate concerns.

If GitHub wants to filter which languages have their “official” stamp of approval, they can do that. Just like it currently shows “Others” if it’s given too many languages with small percentages in a repository.

That’s a display concern.
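To illustrate the separation being argued for, here is a rough sketch in which detection yields a raw breakdown and a separate display step decides what to show. Everything in it is an assumption made for illustration: `display_breakdown`, the `approved` set, and the 1% threshold are hypothetical and do not describe GitHub’s actual rules.

```python
def display_breakdown(shares, approved, min_percent=1.0):
    """Fold unapproved or very small entries into a single 'Other' bucket.

    `shares` is the raw detection output (language name -> percentage);
    this function only decides how that output is presented.
    """
    shown = {}
    other = 0.0
    for name, percent in shares.items():
        if name in approved and percent >= min_percent:
            shown[name] = percent
        else:
            other += percent
    if other:
        shown["Other"] = round(other, 2)
    return shown


# Raw breakdown as a detection tool might report it (made-up numbers).
detected = {"Ruby": 80.0, "foolang": 19.5, "Shell": 0.5}
print(display_breakdown(detected, approved={"Ruby", "Shell"}))
```

In this sketch the detection result still contains `foolang`; only the display layer chooses to fold it into “Other”.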

I think the concerns are being conflated and for some reason, linguist is gate-keeping the issue from escalating to the proper team internally at GitHub. I’m still willing to help if help is needed.

This feature seems very useful. The statistics have confused me before when one language not in the languages.yml was recognized as 2-3 different unrelated languages.

Exactly as above, I’m interested in a manual override with .gitattributes without requiring any changes to the way Linguist performs language detection.

I think it simplifies the scope a lot. Curiously, if the community were to provide such a PR, can we expect it to be received positively or is it a waste of time? @pchaigno @lildude

I may even do so myself, considering I’ve had that same discussion with coworkers at different jobs and I foresee a lot of happy fellow developers if this came to fruition. It’d give Github that small edge over GitLab 😉

Another problem here is that some languages use a file extension that is already “taken”, and overriding it via .gitattributes still does not make them show up on the repo.

A workaround for this is to mark such files as Text, which will fix their highlighting and avoid skewing the repository’s stats with an unrelated language.

It doesn’t solve the broader issue of having no custom language support on GitHub, but it’s better than misclassification.

@nikodemus’s idea does indeed seem to solve the issue easily enough. I just encountered this problem when we migrated our Windev projects to GitHub (and while it would be straightforward enough to add those to the known languages, I doubt they see much usage on the platform, so that is probably not that useful). It just triggers my OCD that the project gets classified as Shell because there are a couple of *.env files in there and Linguist ignores the 2000+ files it doesn’t recognize. Setting a Custom tag would be sufficient IMO.

Curiously, if the community were to provide such a PR, can we expect it to be received positively or is it a waste of time?

This would be nice, if it was that easy, but I’m afraid we wouldn’t be able to accept it.

Outside of all things mentioned before, there is an expected functionality associated with the language bar in that users expect to be able to click the language bar, then the language and get search results for that language. This will fail to find any files as GitHub will not know about the language and thus return zero search results. This in turn will put an unfair burden on us in the Linguist community to keep explaining this (generally only three of us respond to these kinds of issues/questions) and the GitHub support personnel.

Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

Security risk? What exact attack are you thinking of? If it’s just the number of languages, set an upper limit. As for bad words, I don’t see it either: couldn’t you just put them into the project title or README already? Who would gain anything from specifically putting them into the language bar? And the remedy is the same: remove the repo, nothing really changes. It is just one more spot to put bad words, next to 20 other easier spots.

It would be nice if the language bar used the name provided, just for display. That way it would be more friendly for users of obscure or application specific languages.

I can say for sure that that is not likely to happen as it’s a huge security risk, would require additional validation and is open to abuse:

*.md linguist-language=lildude-is-a-plonker

😆🤣

I’m still trying to find the bandwidth to see about implementing an “unknown” option.