ArchiveBox: archivebox 0.4.2 init fails parsing old json (ValueError: year 1586476777 is out of range/dateutil.parser._parser.ParserError: year 1586476777 is out of range)
Describe the bug
Running archivebox init against an existing collection fails with the following error:
ValueError: year 1586476777 is out of range
dateutil.parser._parser.ParserError: year 1586476777 is out of range
Steps to reproduce
create virtual environment
mkcd /home/kangus/src/archivebox0.4/
pew new -p /usr/bin/python3.8 -a $(pwd) archivebox0.4
clone
git clone https://github.com/pirate/ArchiveBox
cd ArchiveBox
git branch -a
checkout relevant
git checkout remotes/origin/v0.4.3
install dependencies
pip install -e .
config ENV
eval export $(grep -v '^#' /home/kangus/.ArchiveBox.conf)
migration
/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox init
Screenshots or log output
/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox init
[*] Updating existing ArchiveBox collection in this folder...
/data/Zalohy/archivebox
------------------------------------------------------------------
[*] Verifying archive folder structure...
√ /data/Zalohy/archivebox/sources
√ /data/Zalohy/archivebox/archive
√ /data/Zalohy/archivebox/logs
√ /data/Zalohy/archivebox/ArchiveBox.conf
[*] Verifying main SQL index and running migrations...
√ /data/Zalohy/archivebox/index.sqlite3
Operations to perform:
Apply all migrations: admin, auth, contenttypes, core, sessions
Running migrations:
No migrations to apply.
[*] Collecting links from any existing indexes and archive folders...
√ Loaded 28875 links from existing main index.
Traceback (most recent call last):
File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
naive = default.replace(**repl)
ValueError: year 1586476777 is out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox", line 14, in <module>
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/archivebox.py", line 54, in main
run_subcommand(
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/__init__.py", line 55, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/archivebox_init.py", line 32, in main
init(
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
return func(*args, **kwargs)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/main.py", line 321, in init
fixed, cant_fix = fix_invalid_folder_locations(out_dir=out_dir)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/__init__.py", line 572, in fix_invalid_folder_locations
link = parse_json_link_details(entry.path)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
return func(*args, **kwargs)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/json.py", line 100, in parse_json_link_details
return Link.from_json(link_json)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/schema.py", line 190, in from_json
info['updated'] = parse_date(info.get('updated'))
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
return func(*args, **kwargs)
File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 144, in parse_date
return dateparser.parse(date)
File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 657, in parse
six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: year 1586476777 is out of range: 1586476777.093312
Software versions
- OS: (Linux Mint 19 Tara / Ubuntu 18.04)
- ArchiveBox version: (374dd39)
- Python version: (3.8.2, also tested on 3.7.2)
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (7 by maintainers)
I like that idea of warning. Since you are still recording the raw value, it’s really a superficial problem. On the frontend, if the parsed date fails the 1960-01-01 < date < $CURRENT_YEAR+1 check, it could display a placeholder value in addition to logging a warning to stdout/file.
I understand you are busy. I am waiting for the rewrite to be merged before I start hacking away. 😃
We can also warn the user or bail out if the parsed date is outside of something like this: 1960-01-01 < date < $CURRENT_YEAR+1.
Here’s what I changed in the browser export script: https://github.com/mdhowle/ArchiveBox/commit/414d5e6189e807be3df6ece9bec4e4a8a0f878d6. I haven’t tested it thoroughly, but it does output correctly on my machine.
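The range check suggested in the comments above could look roughly like the following. This is only a sketch and not ArchiveBox code: the names `is_plausible` and `display_date` are made up for illustration, `$CURRENT_YEAR+1` is read as "January 1 of next year", and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timezone
from typing import Optional

# Lower bound taken from the thread: 1960-01-01.
MIN_DATE = datetime(1960, 1, 1, tzinfo=timezone.utc)


def is_plausible(date: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if 1960-01-01 < date < January 1 of next year (UTC)."""
    now = now or datetime.now(timezone.utc)
    max_date = datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    if date.tzinfo is None:
        # Assumption: naive datetimes are treated as UTC.
        date = date.replace(tzinfo=timezone.utc)
    return MIN_DATE < date < max_date


def display_date(date: Optional[datetime], raw: str,
                 placeholder: str = "unknown date") -> str:
    """Format a date for display; warn and show a placeholder if it looks wrong."""
    if date is not None and is_plausible(date):
        return date.strftime("%Y-%m-%d %H:%M")
    # The raw value is still stored elsewhere, so only the display is affected.
    print(f"[!] Warning: implausible or unparseable date {raw!r}")
    return placeholder
```

Because the raw value is kept, a failed check here only changes what is shown and logged; nothing is lost from the index.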
Like you’ve mentioned, it’s difficult to interpret an integer/float as a timestamp confidently. Guessing the datetime format is good for the user’s experience until it is wrong.
One idea is to only accept the common formats you’d know like Unix timestamp and anything dateutil can parse. Otherwise require the user to define the format. Maybe via command line argument
/archive --date-format="%Y-%m-%d %H:%M:%S" https://example.com/rss/feed.xml
or defined in the config like
Thanks, I understand now. I wasn’t thinking of browser histories.
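To illustrate the proposal above of accepting only known formats, here is a minimal sketch that uses an explicit user-supplied format if one is given, otherwise treats numeric strings as Unix timestamps, and only then hands the value to a general parser. The function name `parse_date_strict` and its parameters are hypothetical, and `--date-format` is a suggestion from this thread, not an existing flag. The sketch defaults to `datetime.fromisoformat` so that it runs with the standard library alone; in ArchiveBox the `fallback` would presumably be `dateutil.parser.parse`.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def parse_date_strict(
    value: str,
    date_format: Optional[str] = None,
    fallback: Callable[[str], datetime] = datetime.fromisoformat,
) -> datetime:
    """Parse a date string without guessing between numeric interpretations."""
    if date_format is not None:
        # The user told us the format explicitly, so nothing is guessed.
        return datetime.strptime(value, date_format)
    try:
        seconds = float(value)
    except ValueError:
        # Not a plain number: let the general-purpose parser handle it.
        return fallback(value)
    # A plain number is interpreted as seconds since the Unix epoch (UTC),
    # so it never reaches a parser that might read it as a year.
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

With this ordering, the value from the traceback, `1586476777.093312`, is read as a timestamp on 2020-04-09 (UTC) rather than as a year.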
If archivebox-export-browser-history is exporting browser history, it would know the browser and the epoch it uses internally. Is there any reason why that script couldn’t convert the timestamps to a standard epoch, or pass the browser name/source to archivebox so it knows how to convert it? From a quick look, the internal epochs the browsers use don’t change between OSes.
If I change line 144 in util.py from
to:
it works!
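On the earlier question of converting browser timestamps to a standard epoch in the export script, a conversion could be sketched as below. The epochs and units in the table are the commonly documented ones (Chrome: microseconds since 1601-01-01, Firefox: microseconds since 1970-01-01, Safari: seconds since 2001-01-01) and should be treated as assumptions to verify against the actual history databases. `BROWSER_EPOCHS` and `to_unix_timestamp` are illustrative names, not part of ArchiveBox.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# browser -> (epoch, microseconds per raw unit). Assumed values, see above.
BROWSER_EPOCHS = {
    "chrome": (datetime(1601, 1, 1, tzinfo=timezone.utc), 1),
    "firefox": (UNIX_EPOCH, 1),
    "safari": (datetime(2001, 1, 1, tzinfo=timezone.utc), 1_000_000),
}


def to_unix_timestamp(raw: float, browser: str) -> float:
    """Convert a browser-internal timestamp to seconds since the Unix epoch."""
    epoch, us_per_unit = BROWSER_EPOCHS[browser]
    moment = epoch + timedelta(microseconds=raw * us_per_unit)
    return (moment - UNIX_EPOCH).total_seconds()


# Example: the same instant expressed in each browser's internal format.
print(to_unix_timestamp(13230950377000000, "chrome"))
print(to_unix_timestamp(1586476777000000, "firefox"))
print(to_unix_timestamp(608169577, "safari"))
```

Doing the conversion in the export script keeps the browser-specific knowledge in one place, so the importer only ever sees Unix timestamps.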
To reproduce the error:
throws error: