textwrap: Does not work for languages without word separators
For example, fill("កើតមកមានសេរីភាព", 6)
outputs កើតមកម
, ានសេរភ
and ាព
when it should output កើតមក
, មាន
and សេរីភាព
. See also w3’s Approaches to line breaking document which has the correct ways to line break words; implementing support for this would require storing a dictionary and matching words in it.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (12 by maintainers)
Commits related to this issue
- Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago
- Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago
Hi @Kestrer and @tavianator, I’ve put up #313 which uses the unicode-linebreak crate to do find line breaks. We talked about it above and you both suggested that using the Rust ICU bindings might be better. I would like support for rust_icu too — for now I simply went with unicode-linebreak since it’s a pure-Rust crate.
Why does it matter that it’s pure Rust? Well, simply adding
rust_icu = "*"
to the dependencies and runningcargo build
gives me thisThat’s not a super impressive out-of-the-box experience 😄 I’m sure I could install a few development packages to make it work, but I’ll start with the much simpler unicode-linebreak crate.
@SirWindfield, yes, you’re quite right! I’ve been playing with removing the existing
WordSplitter
trait I already have and replace it with an enum — since I cannot find any third-party implementations ofWordSplitter
in the wild, I figured that it would be easier for everybody to just present the three known implementations as enum variants.However, I realize now that this is bad because it prevents us from dealing with my other worry:
Basically, the way to make everything work smoothly with iterators is to only build one big iterator chain. At no point can you make choices in such a chain — so things like this from #313 is forbidden:
In the PR, I had to
collect()
the results of the two iterators into aVec
because of the type mismatch. I’ll now go and rework this to use a trait instead so that there is only choice at runtime — and thus no branches with conflicting types.Hey @SirWindfield, in short, I would like to see the inner logic working in a streaming fashion, i.e., with iterators. That would also be useful for #224.
The reason Textwrap is not doing so today is basically
I used to have an iterator approach and it was a complicated and inflexible mess. Further, there were some subtle bugs and inconsistencies which finally got fixed when I rewrote the whole thing in #221. Moving to a more layered design has helped a lot (I think) and it allowed me to easily do things like
fill_inplace
. This was all part of the 0.13 release late last year.Now, iterators can certainly also be layered and organized in a nice way. However, I do think they add complexity simply by adding code. I removed code last year and I’m reluctant to add it back.
With the addition of the
wrap_optimal_fit
function (which requires O(1) access to the width of all words), the appeal of iterators diminished a lot. You’ll see that I used iterators in thefind_words
andsplit_words
functions, but when I came to the wrapping functions, I simply used the simplest common interface.Yes, one can of course let
wrap_optimal_fit
do acollect()
followed by returning something which implementsIterator
. However, you don’t get any of the expected advantages: you cannot feed in words one-by-one and get nicely wrapped lines out the other end.Consumers of Textwrap, such as Clap, seems to be okay happy with calling
textwrap::fill
and call it a day. So they transformString
toString
and that’s it. Now, this is likely very biased since that’s the only option I left them with after version 0.13.On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisissions down into the iterators, i.e., you cannot do this:
since the two
filter
calls result in different iterators. We see this in the changes in #257 where we have code like this intextwrap::wrap
:So while the inner logic uses iterators all the way up to the wrapping logic, the driver of this logic still converts everything into an iterator at this point. I think we also see the consequence of this in #244 where I believe the top-level function becomes a
textwrap::wrap_greedy
and presumably there will also be atextwrap::wrap_optimal_fit
function later.Maybe I’m missing a trick here, but this makes for a poor API in my opinion. We can then try to push the complexity down the stack and @Kestrer mentioned doing that with an
enum
-based iterator. I’m not sure how nice that will look or feel in the code.Hi @SirWindfield, yeah that’s certainly a possibility!
Right now the way to do this is to use your own
Fragment
implementation: do the splitting any way you want and use the core wrapping functionality to wrap your custom fragments. This way you can wrap things with non-integer widths (see #312) and you can decide precisely where to split the text.