textwrap: Does not work for languages without word separators

For example, fill("កើតមកមានសេរីភាព", 6) outputs កើតមកម, ានសេរភ and ាព when it should output កើតមក, មាន and សេរីភាព. See also w3’s Approaches to line breaking document which has the correct ways to line break words; implementing support for this would require storing a dictionary and matching words in it.

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 15 (12 by maintainers)

Commits related to this issue

Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
Find line breaks according to the Unicode line breaking algorithm This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standa... — committed to mgeisler/textwrap by mgeisler 3 years ago
Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago
Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago
Use Unicode line breaking algorithm to find words This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](ht... — committed to mgeisler/textwrap by mgeisler 3 years ago

Most upvoted comments

Hi @Kestrer and @tavianator, I’ve put up #313 which uses the unicode-linebreak crate to do find line breaks. We talked about it above and you both suggested that using the Rust ICU bindings might be better. I would like support for rust_icu too — for now I simply went with unicode-linebreak since it’s a pure-Rust crate.

Why does it matter that it’s pure Rust? Well, simply adding rust_icu = "*" to the dependencies and running cargo build gives me this

   Compiling rust_icu_sys v0.4.1
error: failed to run custom build command for `rust_icu_sys v0.4.1`

Caused by:
  process didn't exit successfully: `/home/mg/src/textwrap/target/debug/build/rust_icu_sys-e810eefa177b5c20/build-script-build` (exit code: 101)
  --- stdout
  cargo:rustc-cfg=feature="renaming"
  cargo:rustc-cfg=feature="use-bindgen"
  cargo:rustc-cfg=feature="icu_config"

  --- stderr
  thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }', /home/mg/.cargo/registry/src/github.com-1ecc6299db9ec823/rust_icu_sys-0.4.1/build.rs:224:36
  stack backtrace:
     0:     0x55d82481f2f0 - std::backtrace_rs::backtrace::libunwind::trace::hfe3b1cace85e87d8
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
     1:     0x55d82481f2f0 - std::backtrace_rs::backtrace::trace_unsynchronized::h542330af06479043
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
     2:     0x55d82481f2f0 - std::sys_common::backtrace::_print_fmt::h6b88726367858985
                                 at /rustc/acca818928654807ed3bc1ce0e97df118f8716c8/library/std/src/sys_common/backtrace.rs:67:5

That’s not a super impressive out-of-the-box experience 😄 I’m sure I could install a few development packages to make it work, but I’ll start with the much simpler unicode-linebreak crate.

Maybe an approach where splitting is abstracted away behind a trait would be better? That way the current behaviour can be implemented using that trait but if someone wants to use the ICU library they could implement the trait for it. Maybe a function that takes the input str and returns a list of Words.

@SirWindfield, yes, you’re quite right! I’ve been playing with removing the existing WordSplitter trait I already have and replace it with an enum — since I cannot find any third-party implementations of WordSplitter in the wild, I figured that it would be easier for everybody to just present the three known implementations as enum variants.

However, I realize now that this is bad because it prevents us from dealing with my other worry:

On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisions down into the iterators,

Basically, the way to make everything work smoothly with iterators is to only build one big iterator chain. At no point can you make choices in such a chain — so things like this from #313 is forbidden:

    match line_break_algorithm {
        LineBreakAlgorithm::Whitespace => find_ascii_words(line),
        LineBreakAlgorithm::UnicodeLineBreaks => find_unicode_words(line),
    }

In the PR, I had to collect() the results of the two iterators into a Vec because of the type mismatch. I’ll now go and rework this to use a trait instead so that there is only choice at runtime — and thus no branches with conflicting types.

mgeisler on Apr 9, 2021

What are your thoughts?

Hey @SirWindfield, in short, I would like to see the inner logic working in a streaming fashion, i.e., with iterators. That would also be useful for #224.

The reason Textwrap is not doing so today is basically

I used to have an iterator approach and it was a complicated and inflexible mess. Further, there were some subtle bugs and inconsistencies which finally got fixed when I rewrote the whole thing in #221. Moving to a more layered design has helped a lot (I think) and it allowed me to easily do things like fill_inplace. This was all part of the 0.13 release late last year.

Now, iterators can certainly also be layered and organized in a nice way. However, I do think they add complexity simply by adding code. I removed code last year and I’m reluctant to add it back.
With the addition of the wrap_optimal_fit function (which requires O(1) access to the width of all words), the appeal of iterators diminished a lot. You’ll see that I used iterators in the find_words and split_words functions, but when I came to the wrapping functions, I simply used the simplest common interface.

Yes, one can of course let wrap_optimal_fit do a collect() followed by returning something which implements Iterator. However, you don’t get any of the expected advantages: you cannot feed in words one-by-one and get nicely wrapped lines out the other end.
Consumers of Textwrap, such as Clap, seems to be okay happy with calling textwrap::fill and call it a day. So they transform String to String and that’s it. Now, this is likely very biased since that’s the only option I left them with after version 0.13.

On a higher level, I find that iterators combine really poorly in Rust. Basically, you need to push all branching decisissions down into the iterators, i.e., you cannot do this:

let result = if some_condition() {
        it.filter(|&n| n > 10)
    } else {
        it.filter(|&n| n <= 10)
    };

since the two filter calls result in different iterators. We see this in the changes in #257 where we have code like this in textwrap::wrap:

        let wrapped_words = match options.wrap_algorithm {
            core::WrapAlgorithm::OptimalFit => core::wrap_optimal_fit(broken_words, line_lengths)
                .terminate_eol()
                .into_vec(),
            core::WrapAlgorithm::FirstFit => core::wrap_first_fit(broken_words, line_lengths)
                .terminate_eol()
                .collect(),
        };

So while the inner logic uses iterators all the way up to the wrapping logic, the driver of this logic still converts everything into an iterator at this point. I think we also see the consequence of this in #244 where I believe the top-level function becomes a textwrap::wrap_greedy and presumably there will also be a textwrap::wrap_optimal_fit function later.

Maybe I’m missing a trick here, but this makes for a poor API in my opinion. We can then try to push the complexity down the stack and @Kestrer mentioned doing that with an enum-based iterator. I’m not sure how nice that will look or feel in the code.

mgeisler on Apr 6, 2021

Hi @SirWindfield, yeah that’s certainly a possibility!

Right now the way to do this is to use your own Fragment implementation: do the splitting any way you want and use the core wrapping functionality to wrap your custom fragments. This way you can wrap things with non-integer widths (see #312) and you can decide precisely where to split the text.

mgeisler on Apr 5, 2021