daru: Proposal: get rid of missing_values concept completely

Reasons:

  1. Vector#set_missing_positions has to run on every vector update and is slow (the entire Daru.lazy_update trick exists to work around it);
  2. The concept is pretty confusing (and, to be honest, I’m not sure it is always used correctly even inside Daru – there are a ton of simple nil-checks here and there, and only occasional “right” missing-data checks);
  3. Ruby has native nil, which has the exact “there is no value” meaning.

As far as I understand, the concept of “user decides what counts as ‘missing values’” is borrowed from environments where “no value” can be represented by strings like ‘None’, ‘Nothing’, ‘Unknown’, etc. I’m pretty sure that the only place in Daru where such data should be supported is input and output. So, when fetching data from a CSV with "None"s, we can provide an option (like missing: "None"), and those values would be stored internally as nils. And, when saving to CSV, we provide the same option, and the nils are converted back to "None"s or whatever the user wants.
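
To make the idea concrete, here is a minimal sketch of such a boundary using only Ruby's standard CSV library. The helper names `read_csv_with_missing` and `write_csv_with_missing` and the `missing:` keyword are hypothetical illustrations, not existing Daru API.

```ruby
require "csv"

# Hypothetical helpers (not Daru API): the user-defined "missing" marker
# exists only at the I/O boundary; in memory everything missing is nil.

# Parse CSV text, turning every field equal to `missing` into nil.
def read_csv_with_missing(text, missing:)
  to_nil = ->(field) { field == missing ? nil : field }
  CSV.parse(text, headers: true, converters: [to_nil]).map(&:to_h)
end

# Serialize rows (an array of hashes), turning every nil back into `missing`.
def write_csv_with_missing(rows, missing:)
  CSV.generate do |csv|
    csv << rows.first.keys
    rows.each { |row| csv << row.values.map { |v| v.nil? ? missing : v } }
  end
end

input  = "name,age\nalice,30\nbob,None\n"
rows   = read_csv_with_missing(input, missing: "None")
output = write_csv_with_missing(rows, missing: "None")
```

In this sketch, `rows` holds a plain nil for bob's age, and writing with the same option restores the original marker, so the rest of the code only ever has to check for nil.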

This change would greatly simplify the API, the codebase, the docs, and the Daru user’s mental model.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 26 (21 by maintainers)

Most upvoted comments

To be honest, the proposal was aiming at serious simplification of the APIs and concepts. It may be too extreme, but what seems like the “happy path” to me is the following:

  1. Consider only “Ruby native” missing concept (nil)
    • Float::NAN could be transparently converted into nil on assignment…
    • …or left intact, without any special processing (see below about it);
  2. Get rid of all special *_missing_* and *_valid_* methods in favor of generic methods with the same functionality (the replacement methods have “drafty” names):
    • has_missing_values? → include?(*values) (e.g. include?(nil, Float::NAN), but also include?(0) or basically anything);
    • only_valid → reject(*values)
    • missing_positions → indexes(*values)
    • n_valid → count(*values)
    • And fillna (or anything similar, as per #198) could just be replace(from_value, to_value) or even replace([from_values], to_value)
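
A rough sketch of how these generic methods could behave, written against a plain Array wrapper rather than Daru::Vector. The class name `SketchVector` and the exact semantics are assumptions made for illustration: the method names are the “drafty” ones from the list, `count(*values)` is assumed to count occurrences of the given values, and NaN is matched via `nan?` because `Float::NAN == Float::NAN` is false in Ruby.

```ruby
# Hypothetical stand-in for a vector; not Daru::Vector and not Daru API.
class SketchVector
  def initialize(data)
    @data = data
  end

  def to_a
    @data.dup
  end

  # Would replace has_missing_values?
  def include?(*values)
    @data.any? { |x| match?(x, values) }
  end

  # Would replace only_valid
  def reject(*values)
    SketchVector.new(@data.reject { |x| match?(x, values) })
  end

  # Would replace missing_positions
  def indexes(*values)
    @data.each_index.select { |i| match?(@data[i], values) }
  end

  # Counts occurrences of the given values (assumed semantics)
  def count(*values)
    @data.count { |x| match?(x, values) }
  end

  # Would replace fillna; accepts one value or an array of values
  def replace(from_values, to_value)
    from = from_values.is_a?(Array) ? from_values : [from_values]
    SketchVector.new(@data.map { |x| match?(x, from) ? to_value : x })
  end

  private

  # NaN never equals itself, so it needs an explicit nan? comparison.
  def match?(x, values)
    values.any? { |v| nan?(v) ? nan?(x) : v == x }
  end

  def nan?(v)
    v.is_a?(Float) && v.nan?
  end
end

v = SketchVector.new([1, nil, 3.0, Float::NAN, 0])
positions = v.indexes(nil, Float::NAN)
cleaned   = v.reject(nil, Float::NAN).to_a
filled    = v.replace([nil, Float::NAN], 0).to_a
```

Nothing here is specific to missing data: the same calls work for `include?(0)` or `replace(0, nil)`, which is the point of the generic naming.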

And the notion of “missingness” could then be used only for:

  • methods like Ruby’s Array#compact (and maybe some other, thoroughly selected, shortcuts for the generic methods);
  • reading/writing options (as in the original issue description: specifying a missing: 'NONE' option when reading from or writing to CSV).

It so happens that numerical libraries like nmatrix frequently use Float::NAN for representing missing data in a floating point matrix.
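
For context, a small sketch of what “transparently converting NaN to nil on assignment” could look like in plain Ruby. The helper name `normalize_missing` is hypothetical and not part of Daru or nmatrix; only `Float::NAN` and `Float#nan?` are real Ruby API.

```ruby
# Hypothetical helper (not Daru API): map NaN to nil, leave everything else as is.
def normalize_missing(value)
  value.is_a?(Float) && value.nan? ? nil : value
end

raw    = [1.5, Float::NAN, nil, 2]
stored = raw.map { |v| normalize_missing(v) }

# NaN is never equal to itself, so == based checks cannot detect it;
# Float#nan? is the reliable test.
nan_equals_itself = (Float::NAN == Float::NAN)

# After normalization, Ruby's own Array#compact is enough to drop missing values.
valid = stored.compact
```

Under this assumption the NaN convention of numerical libraries stays at the boundary, and the rest of the code only deals with nil.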