ArchiveBox: Pocket and Pinboard imports causing tags to be split incorrectly into individual characters w/ broken hyphenation
A simple bug due to a .split()
or set()
somewhere on the tags_str instead of the tags list. Should be easy to fix.
We should also add a filter to prevent emptystring / whitespace-only tags:
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (12 by maintainers)
I tried out what I suggested in the previous comment in my test environment and it seems to have fixed the problem. All I did was copy the method to split out the tags into a list from further down in the same file.
One note on the tests: Without my changes, there are 8 failing tests for me. After my changes, there are still the same 8 faililng tests, so my pull request did not introduce any new test failiures.
I did a little digging because I would love to get my pinboard.json file imported as well.
I’m not sure if this will help, but this is what I found: I ran through the debugger when adding a json file, and stepped through each line until I wound up in this method: https://github.com/ArchiveBox/ArchiveBox/blob/84b927e3e5fb8da93fb86a9070291495a7563a35/archivebox/core/models.py#L249-L255
where I noticed it was processing one letter of a tag at a time. The entrypoint to this function looks like it is expecting tags to be a list.
There are two places that call this function:
https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L47
https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L113
It looks like the first one passes in the tag string: https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L36-L38
While the second one sets up a list to pass in: https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L107-L109
I’m not really a python guy, but it looks like the first call also needs to setup the tags as a list before passing them to the
save_tags
method?Feel free to delete this if it doesn’t help. Cheers!
Just submitted PR #911 to introduce the
TAG_SEPARATORS
option. It will default to a comma, which is what tags have already been split on, so if this option is not given, there should be no noticeable changes to existing users.As mentioned in the PR, I can supply the text for the wiki page that describes how to use this option.
And here are my changes to implement the
TAB_WHITESPACE_SPLIT
option.https://github.com/hannah98/ArchiveBox/commit/b2609f0dc9be76f60cb6fde1ab97c6a81ce03376
I’ve tested it with and without the option set. The option defaults to false (to retain backward compatibility). When the option is not set, or set to False, my tags from pinboard end up like
foo bar
. When the option is set to True, my tags from pinboard end up likefoo
andbar
individually.At this point I think the owner, @pirate, would have to chime in. Maybe it’s the case that all tags should be split on spaces and then my first, oneline option would be good. But the second option provides a way for it to be toggled.
I can submit a PR for either one.
I understand the annoyance, I’ve been seeing this too lately with my imports. Will have to track down what changed recently that could affect this.