daru: Proposal: get rid of missing_values concept completely
Reasons:
- Vector#set_missing_positions is needed on any vector update and is slow (the entire Daru.lazy_update trick exists to fight it);
- The concept is pretty confusing (and, to be honest, I’m not sure it is always used correctly even inside Daru – there are a ton of simple nil-checks here and there, and occasional “right” missing-data checks);
- Ruby has native nil, which has the exact “there is no value” meaning.
As far as I can understand, the concept of “user decides what is ‘missing values’” is borrowed from environments where “no values” can be represented by strings like ‘None’ or ‘Nothing’, or ‘Unknown’ etc. I’m pretty sure that the only place in Daru where such data should be supported is input and output. So, when fetching data from CSV with "None"s, we can provide some option (like missing: "None") – and they would be stored internally as nils. And, when saving to CSV, we provide the same option – and again, nils are converted to "None"s or whatever the user wants.
This change will greatly simplify the API, codebase, docs, and the Daru user’s mental model.
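The import/export idea above can be sketched with Ruby's standard CSV library. This is only an illustration under assumptions: read_rows and write_rows are hypothetical helpers, not Daru methods, and the missing: keyword is the option proposed in the text, not an existing one.

```ruby
require 'csv'

# Hypothetical helpers, not part of Daru. The `missing:` keyword is the
# option proposed above; it is applied only at the I/O boundary.

# Parse CSV text into an array of hashes, storing the marker as nil.
def read_rows(csv_text, missing: nil)
  CSV.parse(csv_text, headers: true).map do |row|
    row.to_h.transform_values { |v| v == missing ? nil : v }
  end
end

# Serialize rows back to CSV text, writing nil as the marker.
def write_rows(rows, missing: nil)
  headers = rows.first.keys
  CSV.generate do |csv|
    csv << headers
    rows.each do |row|
      csv << headers.map { |h| row[h].nil? ? missing : row[h] }
    end
  end
end

text = "name,age\nalice,30\nbob,None\n"
rows = read_rows(text, missing: 'None')        # "None" becomes nil internally
round_trip = write_rows(rows, missing: 'None') # nil becomes "None" again
```

Between the two calls the data contains only nil, so everything in between can rely on plain nil checks.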
About this issue
- State: closed
- Created 8 years ago
- Comments: 26 (21 by maintainers)
To be honest, the proposal was targeting serious simplifications of APIs and concepts. It may be too extreme, but what seems a “happy path” to me is the following:
- nil)
- Float::NAN could be transparently converted into nil on assignment…
- *_missing_* and *_valid_* methods in favor of generic methods with the same functionality (the replacement methods have “drafty” names):
  - has_missing_values? → include?(*values) (e.g. include?(nil, Float::NAN), but also include?(0) or basically anything);
  - only_valid → reject(*values)
  - missing_positions → indexes(*values)
  - n_valid → count(*values)
  - fillna (or anything similar, as per #198) could just be replace(from_value, to_value) or even replace([from_values], to_value)

And the note of “missingness” could then be used only for:

- Array#compact (and maybe some other, thoroughly selected, shortcuts for generic methods);
- missing: 'NONE' option)

It so happens that numerical libraries like nmatrix frequently use Float::NAN for representing missing data in a floating point matrix.
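To make the generic methods listed above more concrete, here is a rough sketch on a plain Ruby wrapper. SketchVector is a hypothetical stand-in, not Daru's Vector, and the method names are the “drafty” ones from the list. It assumes NaN is converted to nil on assignment, which also sidesteps the fact that Float::NAN == Float::NAN is false in Ruby.

```ruby
# Hypothetical stand-in for a vector; not Daru's actual Vector class.
class SketchVector
  attr_reader :data

  def initialize(data)
    # Assumption from the proposal: NaN is converted to nil on assignment.
    @data = data.map { |v| nan?(v) ? nil : v }
  end

  # True if any element equals one of the given values.
  def include?(*values)
    @data.any? { |v| values.include?(v) }
  end

  # New vector without the given values.
  def reject(*values)
    SketchVector.new(@data.reject { |v| values.include?(v) })
  end

  # Positions of elements equal to one of the given values.
  def indexes(*values)
    @data.each_index.select { |i| values.include?(@data[i]) }
  end

  # Number of elements equal to one of the given values.
  def count(*values)
    @data.count { |v| values.include?(v) }
  end

  # New vector with from_values (a single value or an array) replaced.
  def replace(from_values, to_value)
    from = from_values.is_a?(Array) ? from_values : [from_values]
    SketchVector.new(@data.map { |v| from.include?(v) ? to_value : v })
  end

  private

  def nan?(value)
    value.is_a?(Float) && value.nan?
  end
end

vector = SketchVector.new([1, nil, Float::NAN, 4])
vector.include?(nil)     # has_missing_values?
vector.reject(nil).data  # only_valid
vector.indexes(nil)      # missing_positions
vector.replace(nil, 0)   # fillna
```

Because every method takes arbitrary values, the same calls work for include?(0) or replace([nil, 0], -1) without any special notion of “missing”.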