newsboat: Deduplicate items based on their URLs

Newsboat version (copy the output of newsboat -v or the first line of git show):

Newsboat r2.24-70-g41f1b - https://newsboat.org/
Copyright © 2006-2015 Andreas Krennmair
Copyright © 2015-2021 Alexander Batischev
Copyright © 2006-2017 Newsbeuter contributors
Copyright © 2017-2021 Newsboat contributors

Newsboat is free software licensed under the MIT License. (Type `./newsboat -vv` to see the full text.) It bundles:

Newsboat r2.24-70-g41f1b
System: Linux 4.19.0-17-amd64 (x86_64)
Compiler: g++ 8.3.0
ncurses: ncurses 6.1.20181013 (compiled with 6.1)
libcurl: libcurl/7.64.0 OpenSSL/1.1.1d zlib/1.2.11 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.5) libssh2/1.8.0 nghttp2/1.36.0 librtmp/2.3 (compiled with 7.64.0)
SQLite: 3.27.2 (compiled with 3.27.2)
libxml2: compiled with 2.9.4

Config file (copy from ~/.newsboat/config or ~/.config/newsboat/config):

article-sort-order title
articlelist-title-format "%U"
auto-reload yes
bind-key J next-feed
bind-key K prev-feed
bind-key j next
bind-key k prev
bind-key o open-in-browser-and-mark-read articlelist
datetime-format "%F %T"
error-log ~/.local/share/newsboat/error.log
feedlist-format "%4i %n %11u %L - %t"
# ignore-article "*" ...
ignore-mode display
keep-articles-days 365
prepopulate-query-feeds yes
reload-threads 32
run-on-startup end; open
show-read-articles no
show-read-feeds no
suppress-first-reload yes
unbind-key C

Steps to reproduce the issue:

  1. I follow a couple of aggregation feeds (Hacker News, Lobsters) and often the author's original feed as well. This means I end up with up to three copies of some articles. I added a "query:Unread Articles:unread = \"yes\"" feed along with article-sort-order title to deduplicate manually (i.e. mark two of the three copies read in the feedlist), but it would be better if Newsboat could do this automatically, probably using the Link attribute.
  2. I also subscribe to some forums where the Link values share a common prefix for a given thread:
https://.../index.php?topic=187328.msg156161638#msg156161638
https://.../index.php?topic=187328.msg156161361#msg156161361

so these are NOT duplicates, but you may want to ignore anything with the prefix “https://…/index.php?topic=187328.msg” just in case. You can do this with ignore-article, but it’s cumbersome to maintain by hand in an editor.
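For the record, point 2 can already be expressed with Newsboat's filter language; a sketch, using the topic ID from the example above (the regex itself is mine, adjust to the forum's real URL):

```
# Hide every item of one forum thread. With the "ignore-mode display"
# setting from the config above, the items stay in the cache but are
# not shown.
ignore-article "*" "link =~ \"topic=187328\\.msg\""
```

The right-hand side of =~ is a regular expression, hence the escaped dot; one such line is needed per thread, which is exactly the manual upkeep complained about here.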

In the query context, when a duplicate article is read or deleted, that action should apply to all of its duplicates. The action should probably be persistent, so you only ever see an article once even if the duplicates don’t appear in the same session (newsreaders of old had that feature to “silence a thread”). I'm not sure what the expected behavior should be in the itemlist/itemview context: keep showing duplicates as now, or apply the same action across all duplicates when operating on any one of them?

Other info you think is relevant:

Low (to medium) priority.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 37 (20 by maintainers)

Most upvoted comments

With the changed timestamp semantics in the <updated> fields, some of these timestamps then reflected some later point in time. Hence, new entries were recreated in Newsboat’s cache, which of course are marked unread.

Thanks @der-lyse, this seems to confirm @Minoru’s explanation.

I typically subscribe to an author’s feed after seeing their article referenced on an aggregating site. This means my use case would be to deduplicate every newly added feed. That puts me at the opposite end from you, in that I would much rather miss an article than be presented with duplicates. This is just a data point, not an argument for or against anything.

We may want to think of duplicates as a non-binary score, perhaps via a k-means clustering of articles? Then think about what to do with a given cluster of similar articles. If they are exact duplicates it may not matter, but if they are merely similar, maybe the user prefers feed A over B. I don’t have any great examples in my feed today, but here is one:

https://www.tomshardware.com/news/lowest-cpu-shipments-in-30-years-amd-intel-q2-2022-cpu-market-share
https://www.techspot.com/news/95611-desktop-cpu-sales-see-biggest-decline-30-years.html

The source of the information is the same (one of the two articles, a third article, a press release, or an embargo lifting on a product release).
In other words: grouping and intra-group ordering, including hiding sufficiently similar articles.
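As a toy illustration of the non-binary score idea, similar headlines can be grouped greedily by a string-similarity threshold; difflib stands in here for a real clustering step such as k-means over text embeddings, and all names and thresholds are my own:

```python
# Greedy single-link grouping of article titles by fuzzy similarity.
# SequenceMatcher.ratio() returns a score in [0, 1]; 1.0 means identical.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between two titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_similar(titles, threshold=0.6):
    """Attach each title to the first group whose representative
    (the group's first title) is similar enough; else start a group."""
    groups = []
    for title in titles:
        for group in groups:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups

titles = [
    "Lowest CPU shipments in 30 years",
    "Lowest CPU shipments in 30 years, says report",
    "Rust 1.62 released",
]
print(group_similar(titles))
```

The two CPU headlines end up in one group and the unrelated title in its own; a real implementation would then pick one representative per group according to the user's feed preference.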

Google News handles this in an interesting way on their website (their feed is useless). They have n references to the same story, elevate one of the sources as the main one, and show the remaining n - 1 as secondary links (they still have duplicates between sections, for instance Headlines and U.S.).

@Friptick Yeah, tags might be useful for this. Alternatively, we could just enumerate feed URLs as we do for the always-download setting, but that then requires updates in two places when the URLs change.

The following examples might help with the script I proposed: https://github.com/newsboat/newsboat/blob/331176fdd9fdbd876f37b33f22a1a7054a13b2f0/contrib/slashdot.rb, https://github.com/newsboat/newsboat/blob/331176fdd9fdbd876f37b33f22a1a7054a13b2f0/contrib/heise.rb, https://github.com/msharov/snownews/blob/de3bd8b28191c4d4bc1be18275786613bcbc0c94/docs/untested/wikiwatch. They all fetch a single feed, change it, and print it out. My suggestion is to fetch multiple feeds, combine them, deduplicate the result, and print it out. I’d reach for Python with its requests and feedparser packages; shell is not really suitable for this IMHO.
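The combine-and-deduplicate script suggested here could be sketched as follows. This version uses only the Python standard library for self-containment (the requests/feedparser variant the maintainer suggests would look similar); the function names and the deduplication key are my own choices:

```python
# Fetch several RSS feeds, merge their items, drop duplicates by <link>,
# and emit a single combined RSS feed.
import urllib.request
import xml.etree.ElementTree as ET

def dedup_items(items, key=lambda item: item.findtext("link")):
    """Keep the first item for each link; drop later duplicates."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result

def combined_feed(urls):
    """Download each feed, collect all <item> elements, deduplicate,
    and return one RSS 2.0 ElementTree."""
    root = ET.Element("rss", version="2.0")
    channel = ET.SubElement(root, "channel")
    ET.SubElement(channel, "title").text = "Combined feed"
    items = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            items.extend(ET.parse(resp).iter("item"))
    for item in dedup_items(items):
        channel.append(item)
    return ET.ElementTree(root)
```

Deduplication is keyed on the item's <link>, matching the "Link attribute" idea from the issue; a guid-based key would be a one-line change. The script would print the result to stdout so Newsboat can consume it.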

urls file format is specific to Newsboat, so I don’t see where the expectation of portability comes from. Moving query feeds out of urls file immediately raises the question of how to establish the ordering for article-sort-order "none"; I didn’t think too deep on what other problems might come up. Please open a separate issue if you want to discuss this further.

@allanwind You seem to imagine a different design, one where the feed is piped through a script whenever it’s opened in Newsboat. This doesn’t exist, and I’m not sure it would be wise to add: feeds can be large, so this would be slow. The exec I’m talking about is invoked much earlier, when the feed is downloaded by Newsboat.
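For context, the exec mechanism referred to here is wired up in the urls file: an exec: line runs the program at reload time and treats its stdout as the feed. A sketch (the script path and title are hypothetical):

```
# in ~/.newsboat/urls — merged-feed.py is a hypothetical script that
# fetches, merges, and deduplicates several feeds, printing RSS to stdout
"exec:~/bin/merged-feed.py" "~Merged feeds"
```

The quoted "~Title" part sets the display name of the resulting feed, so the merged feed appears as a single entry in the feedlist.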