daru: Proposal: get rid of missing_values concept completely
Reasons:
- `Vector#set_missing_positions` is needed on any vector update and is slow (the entire `Daru.lazy_update` trick exists to fight it);
- The concept is pretty confusing (and, to be honest, I'm not sure it is always used correctly even inside Daru: there are tons of simple `nil`-checks here and there, and only occasional "right" missing-data checks);
- Ruby has native `nil`, which has the exact "there is no value" meaning.
As far as I can understand, the concept of "user decides what is 'missing values'" is borrowed from environments where "no values" can be represented by strings like 'None', 'Nothing', 'Unknown', etc. I'm pretty sure that the only place in Daru where such data should be supported is input and output. So, when fetching data from CSV with `"None"`s, we can provide some option (like `missing: "None"`) – and they would be stored internally as `nil`s. And, when saving to CSV, we provide the same option – and again, `nil`s are converted to `"None"`s or whatever the user wants.
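A minimal sketch of that input/output idea, in plain Ruby rather than Daru. The helper names `parse_cells` and `dump_cells` are hypothetical, the `missing:` keyword only mirrors the option suggested above, and naive comma splitting stands in for a real CSV parser:

```ruby
# Hypothetical helpers (not part of Daru): the user-chosen marker exists
# only at the I/O boundary; in memory every missing cell is nil.

# Parse naive comma-separated text; cells equal to `missing` become nil.
def parse_cells(text, missing:)
  text.each_line.map do |line|
    line.chomp.split(",", -1).map { |cell| cell == missing ? nil : cell }
  end
end

# Serialize rows back to text; nil cells are written out as `missing`.
def dump_cells(rows, missing:)
  rows.map { |row| row.map { |cell| cell.nil? ? missing : cell }.join(",") }
      .join("\n")
end

input  = "a,None,c\nNone,e,f"
rows   = parse_cells(input, missing: "None")  # nils inside, no "None" strings
output = dump_cells(rows, missing: "None")    # marker restored on the way out
```

With this shape, everything between reading and writing only ever has to check for `nil`.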
This change will greatly simplify the API, the codebase, the docs, and the Daru user's mental model.
About this issue
- State: closed
- Created 8 years ago
- Comments: 26 (21 by maintainers)
To be honest, the proposal was targeting serious simplifications of APIs and concepts. It may be too extreme, but what seems a "happy path" for me is the following:
- … `nil`)
- `Float::NAN` could be transparently converted into `nil` on assignment…
- Drop the `*_missing_*` and `*_valid_*` methods in favor of generic methods with the same functionality (the replacement methods have "drafty" names):
  - `has_missing_values?` → `include?(*values)` (e.g. `include?(nil, Float::NAN)`, but also `include?(0)` or basically everything);
  - `only_valid` → `reject(*values)`
  - `missing_positions` → `indexes(*values)`
  - `n_valid` → `count(*values)`
- `fillna` (or anything similar, as per #198) could just be `replace(from_value, to_value)` or even `replace([from_values], to_value)`
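To make the list above concrete, here is a rough sketch on a toy class. `SketchVector` is a hypothetical stand-in, not Daru's `Vector`, and the method names are the "drafty" ones from the list, so none of this should be read as existing Daru API:

```ruby
# Hypothetical stand-in for a vector; not Daru's actual class.
class SketchVector
  def initialize(values)
    @data = values.map { |v| normalize(v) }
  end

  # Assignment also normalizes NaN to nil.
  def []=(pos, value)
    @data[pos] = normalize(value)
  end

  def to_a
    @data.dup
  end

  # True if any element equals one of the given values.
  def include?(*values)
    @data.any? { |v| values.include?(v) }
  end

  # New vector without the elements equal to any of the given values.
  def reject(*values)
    SketchVector.new(@data.reject { |v| values.include?(v) })
  end

  # Positions of the elements equal to any of the given values.
  def indexes(*values)
    @data.each_index.select { |i| values.include?(@data[i]) }
  end

  # Number of elements equal to any of the given values.
  def count(*values)
    @data.count { |v| values.include?(v) }
  end

  # Accepts a single value or an array of values to replace.
  def replace(from_values, to_value)
    from = from_values.is_a?(Array) ? from_values : [from_values]
    SketchVector.new(@data.map { |v| from.include?(v) ? to_value : v })
  end

  private

  # Float::NAN is stored as nil, so nil is the only "no value" marker.
  def normalize(value)
    value.is_a?(Float) && value.nan? ? nil : value
  end
end

v = SketchVector.new([1, nil, 3, Float::NAN, 0])
v.include?(nil)         # is anything missing?
v.indexes(nil)          # where are the missing values?
v.reject(nil).to_a      # only the present values
v.replace(nil, 0).to_a  # fillna-style replacement
```

Normalizing NaN on the way in matters here because `Float::NAN == Float::NAN` is false in Ruby, so a plain equality-based `include?(Float::NAN)` would not find it.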
And the note of "missingness" could then be used only for:
- `Array#compact` (and maybe some other, thoroughly selected, shortcuts for generic methods);
- … `missing: 'NONE'` option)

It so happens that numerical libraries like nmatrix frequently use `Float::NAN` for representing missing data in a floating point matrix.