docs: inconsistent analysis of etc

While analyzing the expression etc in the Portuguese-Bosque corpus (https://github.com/UniversalDependencies/UD_Portuguese-Bosque/issues/386), we found that it is annotated inconsistently in other UD corpora:

  • English (EWT and GUM): uses UPOS X.

  • German (HDT): splits etc into et and cetera.

  • French (ParTUT, GSD and Sequoia): varies between INTJ (ParTUT), X and ADV (GSD) and ADV (Sequoia).

  • Spanish (AnCora and GSD): varies between PUNCT (AnCora) and ADV (GSD).

  • Italian (ISDT and VIT): varies between ADV (ISDT) and NOUN (VIT).
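
A survey like the one above can be reproduced with a short script. The following is a minimal sketch, not the script used for this issue: it assumes treebanks in the standard 10-column CoNLL-U format, and the function name `tally_etc`, the set `ETC_FORMS`, and the sample sentence are illustrative choices.

```python
from collections import Counter

# Surface forms treated as "etc" (an assumption; extend per language).
ETC_FORMS = {"etc", "etc.", "etcetera"}


def tally_etc(conllu_text):
    """Count (UPOS, DEPREL) pairs for tokens whose form looks like "etc"."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, deprel = cols[1], cols[3], cols[7]
        if form.lower() in ETC_FORMS:
            counts[(upos, deprel)] += 1
    return counts


# Tiny hand-made sample, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = sheets, etc.",
    "1\tsheets\tsheet\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "3\tetc.\tetc.\tX\t_\t_\t1\tconj\t_\t_",
])

print(tally_etc(SAMPLE))
```

Running the same function over each treebank file and comparing the counters would show at a glance which UPOS and deprel combinations each corpus uses.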

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 52 (52 by maintainers)

Most upvoted comments

“etc” is a loan word in English, not a foreign word. X is not a good option. Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as “and so on”. So I think it must be conj.

I think I agree with almost everything @sylvainkahane writes, except that I don’t come down on the side of CCONJ.

One word or two

Yes, “etc.” has a history whereby it comes from two Latin words. But it just doesn’t seem a good synchronic analysis to say that it should be two words. Would we next split up “another” because it comes from two English words? I think most linguists regard it as a mistake to try to preserve diachrony in a synchronic description. Evidence for it being one word synchronically includes:

  • It is frequently pronounced as [ɛksɛtɹʌ] or [ɪksɛtɹə] (perhaps even usually, though dictionaries are slow to realize it).
  • It is frequently written as “ect.” (non-standard, but common, perhaps related to previous item).

Syntax

No one has argued against the current analysis, or against @sylvainkahane’s argument here for conj: “Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as “and so on”. So I think it must be conj.” This does seem to me the best way to treat it in the syntax. Treating it as cc would look very odd and would not capture the idea of there being conjoined things. If you compare the two sentences “I’ll bring sheets, etc.” and “I’ll bring sheets, towels”, then I think we are best off representing both of them with a conj: sheets --conj--> etc. and sheets --conj--> towels.
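
To make the parallel concrete, the sketch below encodes both sentences as simplified token tuples (id, form, head, deprel) and lists their conj edges. The trees are hand-built for illustration and are not taken from a treebank; the helper name `conj_edges` is hypothetical and not part of any UD tool.

```python
# Each token: (id, form, head, deprel). Hand-built trees for illustration only.
WITH_ETC = [
    (1, "I", 3, "nsubj"),
    (2, "'ll", 3, "aux"),
    (3, "bring", 0, "root"),
    (4, "sheets", 3, "obj"),
    (5, ",", 6, "punct"),
    (6, "etc.", 4, "conj"),
]

WITH_TOWELS = [
    (1, "I", 3, "nsubj"),
    (2, "'ll", 3, "aux"),
    (3, "bring", 0, "root"),
    (4, "sheets", 3, "obj"),
    (5, ",", 6, "punct"),
    (6, "towels", 4, "conj"),
]


def conj_edges(tokens):
    """Return (head form, dependent form) for every conj relation."""
    forms = {tok_id: form for tok_id, form, _, _ in tokens}
    return [
        (forms[head], form)
        for _, form, head, deprel in tokens
        if deprel == "conj"
    ]


print(conj_edges(WITH_ETC))     # [('sheets', 'etc.')]
print(conj_edges(WITH_TOWELS))  # [('sheets', 'towels')]
```

Under this analysis the two trees differ only in the form of the last conjunct, which is the point of the comparison.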

Part of speech

Several of the choices are definitely wrong:

  • X: “etc.” has been around since late Middle English. It should be analyzed as a long-incorporated loan word. The UD guidelines say of X: “This usage does not extend to ordinary loan words which should be assigned a normal part-of-speech.” It is only an X in EWT for the reason @amir-zeldes notes: automatic conversion by a default rule from the LDC FW tag.
  • INTJ, PUNCT: It’s not an interjection or punctuation. It just isn’t.
  • ADV: This is the part of speech assigned by Oxford dictionaries of English. It’s hard to understand why. I agree with @sylvainkahane that ““etc” has nothing in common with ADVs that modify a verb or an adjective.”

The two plausible candidates correspond to the two halves of the meaning of “etc.”: CCONJ or NOUN. I think we do have to accept that “etc.” is a weird special word, and anything we do is shoving it into some category or another. @sylvainkahane gives the case for CCONJ. But I think we are better off calling it a NOUN:

This should be addressed in the universal guidelines but it should be made clear there that the UPOS tag is not necessarily the same in all languages (while the conj deprel probably can be used everywhere), especially if they have their own equivalent instead of the Latin loanword. For example, the Czech equivalent is atd., standing for a tak dále “and so further”. It is tagged ADV in the Czech corpora (http://hdl.handle.net/11346/PMLTQ-L8ZB), presumably because both tak and dále are adverbs. On the other hand, I don’t think that this necessarily applies to English and I find NOUN quite acceptable among all the bad options for English etc.

cc:preconj relation for “both X and Y”?

I think that should be a separate issue, both because “etc.” is an issue for many languages mentioned above which may or may not have similar problems with “both” and because I’d like to get to a decision on etc. I don’t think this is too related, because “etc.” is the last member of a coordination chain (i.e. it is one of the coordinates itself) and these premodifiers are something different (not members of the coordination itself).

The more I think about it, the more I agree with @manning. I basically think it is interchangeable with “the rest” (a NOUN) or “others” (in English, due to the s-plural, also a NOUN by virtue of the NNS -> guidelines):

  • Kim, Yun, and the rest
  • Kim, Yun, and other people
  • Kim, Yun, and others
  • Kim, Yun, etc.

For me all of these work the same and argue for NOUN. English UD data has only three lemmas tagged PART: not, infinitive to and the genitive 's. I think putting “etc.” on the same list would be odd, and considering how tricky this has turned out to be, I think there’s nothing too wrong about NOUN (effectively making it be a way of saying “rest” or “others”). It’s a simple solution that doesn’t take too much explaining. If we agree it’s deprel conj then a tag CCONJ is unexpected IMO, since that would mean the POS is determined by an internal dependent (etymological “et”) and not the internal head (“cetera”).

Latvian doesn’t use etc. particularly often, but there are two common abbreviations we would like to annotate in a similar manner:

  1. u.c. from un citi ‘and others’
  2. utt. from un tā tālāk ‘and so on’

There are also a couple of rarer ones: u.t.j.p. (un tā jo projām ‘and so on’), v.tml. (vai tamlīdzīgi ‘or similar’), u.tml. (un tamlīdzīgi ‘and similar’). After much discussion, we simply assigned them a separate tag in our local tagset (yd, that is, abbreviations serving as discourse markers).

For UD purposes we currently convert them to SYM with the relation conj, and we annotate etc. the same way when texts in our corpus use it. The SYM tag was born out of pure desperation and a lack of understanding of how to treat these in UD style. For conj, our thinking was that these small abbreviations usually end some kind of list by indicating that the written list is incomplete and includes only some of the items the writer had in mind. That is, the Latvian thinking was that the abbreviation works as the final element of the list.

Anyway, I am very interested in the final conclusions of this discussion 😃

Is this really so different from indefinite pronouns like some?

In EWT at least we consider some to be a DET, and someone to be a PRON.

Honestly the only thing we all agree on is that there is no good category for “etc.” (in English anyway). It’s sort of functional, and associated mainly with coordination, but doesn’t seem as grammatically “core” as pronouns, and doesn’t exist in a paradigm, which is why I think PRON seemed unintuitive (and PART). Nouns like “other” and “rest” can also have similar meanings. In reality, maybe it lies somewhere in between NOUN and PRON. Somebody should do a distributional corpus study and write a paper on it!

I’m less opposed to X than @manning and @sylvainkahane are. I agree with them in principle that it’s a well-integrated word of English, but given that it doesn’t seem to pattern distributionally like any other word of English, and it’s often spelled as an abbreviation reflecting its origin, X may be a reasonable approach in practice.

That doesn’t address @aryamanarora’s point, though, where the equivalent word is not salient as a borrowing in Punjabi.

Yes, borrowings are more likely to end up in an open class, but if it now patterns distributionally like a closed-class item (or rather, unlike any open class item) I don’t think the etymology should be relevant for choosing between non-X tags.

Maybe “etc.” started out as a scholarly loan—and the way we write it as an abbreviation reminds us of that—but I think ordinary people use it in spoken conversation with no idea of its Latin origins, and it is something of a function word even though we don’t traditionally think of it when making lists of function words.

That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it’s a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.

Agreed that “op. cit.”, “ibid.”, etc. (ha) are not a good fit for PART, and it’s hard to imagine anyone using them without knowing they’re scholarly jargon borrowed from Latin.

Just adding another data point: the Punjabi translational equivalent ਆਦਿ ādi I tagged as PART since it takes no nominal declensions, has no apparent gender, only occurs at the end of coordinations–it doesn’t seem to type well with any other part of speech. It also doesn’t really have the same weirdness of et cetera as a potentially foreign word, since Sanskrit loans are common and fully incorporated into the lexicon in Punjabi.

If we’re not considering it a foreign word or tokenizing it as two words I don’t see how etymology is relevant. “Etc.” to English speakers is probably not quite the same as “et cetera” to Latin speakers.

I would be fine with ADV or possibly CCONJ or PART. I just don’t see how “etc.” fits any of the standard distributional tests for NOUN in English.

In the Swedish treebanks etc. and etcetera are currently consistently coded as ADV/conj. The choice of ADV I think is motivated by the usual argument that ADV is a category for words that don’t fit elsewhere (as also @nschneid said) and it is what the dictionaries say. My proposal is that the treatment of etcetera can be language-specific and based on comparable words/phrases in the language, to the extent that they can be found.

In Swedish it can be compared to och så vidare, commonly abbreviated as osv., which mirrors the German und so weiter and usw, but also to med mera, abbreviated mm., m.m. or mm, and med flera, abbreviated m.fl. or mfl. These, however, are introduced by an ADP (German mit, English with) and, if spelled out, would have a head with the dependency nmod or obl as the case may be. The function is quite similar to etc, however, as it ends or disrupts a listing of phrases. For this reason I would support a sub-dependency such as postconj.

A general argument for the English discussion: in UD, function words usually count less than content words. Thus it is a bit odd that the part-of-speech for the abbreviations should be based on the first part (CCONJ or ADP) rather than what follows (ADV, NOUN or PRON).

If people insist on viewing it as nominal I would think PRON would make more sense than NOUN. It is vaguely similar to “everything-else”—both in meaning, and in that it doesn’t have a plural ending despite referring to multiple items.

But it also can’t do things that nominals normally do, like head NPs (absent coordination), or be the antecedent for anaphora.

Hmm. What about the argument that it can coordinate with non-nominals? “We need to mow the lawn, weed the garden, paint the mailbox, etc.”. “Bees swarmed everywhere—inside the hive, above the tree, etc.”
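
This claim could be checked against a treebank by looking at what “etc.” attaches to. The following is a minimal sketch, assuming CoNLL-U input in which “etc.” is attached as conj to the first conjunct; the function name `head_upos_of_etc`, the helper `make_line`, and the sample annotation are illustrative and are not drawn from EWT or GUM.

```python
from collections import Counter


def head_upos_of_etc(conllu_text, forms=("etc", "etc.")):
    """Tally the UPOS of the token that each conj-attached "etc." depends on."""
    counts = Counter()
    # Sentences are separated by blank lines in CoNLL-U.
    for block in conllu_text.strip().split("\n\n"):
        rows = [
            line.split("\t")
            for line in block.splitlines()
            if line and not line.startswith("#")
        ]
        # Keep plain word lines only (skip ranges like 1-2 and empty nodes like 1.1).
        rows = [r for r in rows if len(r) == 10 and r[0].isdigit()]
        upos_by_id = {r[0]: r[3] for r in rows}
        for r in rows:
            if r[1].lower() in forms and r[7] == "conj":
                counts[upos_by_id.get(r[6], "?")] += 1
    return counts


def make_line(tok_id, form, upos, head, deprel):
    """Build a 10-column CoNLL-U line, leaving unused columns as '_'."""
    return f"{tok_id}\t{form}\t_\t{upos}\t_\t_\t{head}\t{deprel}\t_\t_"


# Hand-annotated fragment: "mow the lawn, weed the garden, etc."
SAMPLE = "\n".join([
    make_line(1, "mow", "VERB", 0, "root"),
    make_line(2, "the", "DET", 3, "det"),
    make_line(3, "lawn", "NOUN", 1, "obj"),
    make_line(4, ",", "PUNCT", 5, "punct"),
    make_line(5, "weed", "VERB", 1, "conj"),
    make_line(6, "the", "DET", 7, "det"),
    make_line(7, "garden", "NOUN", 5, "obj"),
    make_line(8, ",", "PUNCT", 9, "punct"),
    make_line(9, "etc.", "X", 1, "conj"),
])

print(head_upos_of_etc(SAMPLE))
```

A large share of non-nominal heads in real data would support the point that “etc.” closes coordinations of any type, not only nominal ones.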

Also, unlike other nouns, it must be the last element in a coordination.

Non-Latinate paraphrases:

  • @sylvainkahane points out “…and so on” is a valid paraphrase. Where this occurs in EWT it is advmod(on/ADV, so/ADV). GUM also treats both as ADV (although it is inconsistent about which is the head).

  • Another option is “…and more”. Where this occurs we currently tag “more” as ADJ, though I’m not necessarily wedded to that.

It seems to me that no standard POS is a great fit because “etc.” has a very special distribution (last element of a coordination of any type). I could see this being an argument to call it X (or ADV, in systems where that is the garbage category).

Two further points about the CCONJ analysis of “etc”. Semantically, “etc” contains the meaning of “and”: “A, B, etc” is always a (semantic) conjunction (as opposed to the disjunction “A or B”). Syntactically, “etc” excludes other CCONJs: “A and B”, “A, B, etc”, but *“A and B etc”. This mutual exclusion between “etc” and other CCONJs allows us to consider that they belong to the same distributional class, even if “etc” occupies a different position. Of course, “etc” does not share all the properties of CCONJs, but it is the best choice among a list of bad choices. X is a non-choice. ADV does not make sense: “etc” has nothing in common with ADVs that modify a verb or an adjective. NOUN is worse: “etc” cannot occupy nominal positions, and it can close any coordination (“I would like to dance, jump, etc”). PUNCT is used for written symbols that have only a suprasegmental counterpart in spoken language.

Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as “and so on”. So I think it must be conj.

But what upostag to use? That is why I prefer to split “et cetera”.

Just “et cetera”? Are other abbreviations split as well? In the English tokenization we only split off clitics.