ArchiveBox: archivebox 0.4.2 init fails parsing old json (ValueError: year 1586476777 is out of range/dateutil.parser._parser.ParserError: year 1586476777 is out of range)

Describe the bug

Running archivebox init fails with the following error:

ValueError: year 1586476777 is out of range
dateutil.parser._parser.ParserError: year 1586476777 is out of range

Steps to reproduce

create virtual environment

mkcd /home/kangus/src/archivebox0.4/
pew new -p /usr/bin/python3.8 -a $(pwd) archivebox0.4

clone

git clone https://github.com/pirate/ArchiveBox
cd ArchiveBox
git branch -a

check out the relevant branch

git checkout remotes/origin/v0.4.3

install dependencies

pip install -e .

config ENV

eval export $(grep -v '^#' /home/kangus/.ArchiveBox.conf)

migration

/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox init

Screenshots or log output

/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox init
[*] Updating existing ArchiveBox collection in this folder...
    /data/Zalohy/archivebox
------------------------------------------------------------------

[*] Verifying archive folder structure...
    √ /data/Zalohy/archivebox/sources
    √ /data/Zalohy/archivebox/archive
    √ /data/Zalohy/archivebox/logs
    √ /data/Zalohy/archivebox/ArchiveBox.conf

[*] Verifying main SQL index and running migrations...
    √ /data/Zalohy/archivebox/index.sqlite3

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    No migrations to apply.

[*] Collecting links from any existing indexes and archive folders...
    √ Loaded 28875 links from existing main index.
Traceback (most recent call last):
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 655, in parse
    ret = self._build_naive(res, default)
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
    naive = default.replace(**repl)
ValueError: year 1586476777 is out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kangus/src/archivebox0.4/ArchiveBox/bin/archivebox", line 14, in <module>
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/archivebox.py", line 54, in main
    run_subcommand(
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/cli/archivebox_init.py", line 32, in main
    init(
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
    return func(*args, **kwargs)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/main.py", line 321, in init
    fixed, cant_fix = fix_invalid_folder_locations(out_dir=out_dir)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/__init__.py", line 572, in fix_invalid_folder_locations
    link = parse_json_link_details(entry.path)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
    return func(*args, **kwargs)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/json.py", line 100, in parse_json_link_details
    return Link.from_json(link_json)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/index/schema.py", line 190, in from_json
    info['updated'] = parse_date(info.get('updated'))
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 105, in typechecked_function
    return func(*args, **kwargs)
  File "/home/kangus/src/archivebox0.4/ArchiveBox/archivebox/util.py", line 144, in parse_date
    return dateparser.parse(date)
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 657, in parse
    six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
  File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: year 1586476777 is out of range: 1586476777.093312

Software versions

  • OS: (Linux Mint 19 Tara / Ubuntu 18.04)
  • ArchiveBox version: (374dd39)
  • Python version: (3.8.2, also tested on 3.7.2)

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I like the idea of a warning. Since you are still recording the raw value, it’s really a superficial problem. On the frontend, if the parsed date fails the 1960-01-01 < date < $CURRENT_YEAR+1 check, it could display a placeholder value and also log a warning to stdout or a file.

I understand you are busy. I am waiting for the rewrite to be merged before I start hacking away. 😃

We can also warn the user or bail out if the parsed date is outside of something like this: 1960-01-01 < date < $CURRENT_YEAR+1.
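A minimal sketch of that range check, in case it helps the discussion. The function name `is_plausible` and the choice to treat naive datetimes as UTC are my own assumptions, not anything that exists in ArchiveBox; the bounds are the ones suggested above, with $CURRENT_YEAR+1 read as January 1st of next year.

```python
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)

# Lower bound from the suggestion above.
MIN_PLAUSIBLE = datetime(1960, 1, 1, tzinfo=timezone.utc)


def is_plausible(date: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if 1960-01-01 < date < January 1st of next year (UTC).

    Hypothetical helper; `now` can be passed in to make the check testable.
    """
    now = now or datetime.now(timezone.utc)
    upper = datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    if date.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC for this check.
        date = date.replace(tzinfo=timezone.utc)
    ok = MIN_PLAUSIBLE < date < upper
    if not ok:
        # Warn instead of raising, so the raw value can still be stored.
        logger.warning(
            "Implausible date %s (expected between %s and %s)",
            date.isoformat(), MIN_PLAUSIBLE.date(), upper.date(),
        )
    return ok
```

Whether the caller then bails out or just shows a placeholder is a separate decision; the helper only reports the result and logs.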

Here’s what I changed in the browser export script https://github.com/mdhowle/ArchiveBox/commit/414d5e6189e807be3df6ece9bec4e4a8a0f878d6. I haven’t tested it thoroughly, but it does output correctly on my machine.

Like you’ve mentioned, it’s difficult to interpret an integer/float as a timestamp confidently. Guessing the datetime format is good for the user’s experience until it is wrong.
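To illustrate the ambiguity with the number from this issue: the same integer gives two very different dates depending on whether it is read as seconds or milliseconds. This is only a small demonstration, not a proposal for how ArchiveBox should guess.

```python
from datetime import datetime, timezone

raw = 1586476777  # integer part of the value from the traceback

# Read as seconds since the Unix epoch.
as_seconds = datetime.fromtimestamp(raw, tz=timezone.utc)
# Read as milliseconds since the Unix epoch.
as_millis = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)

print(as_seconds.isoformat())  # 2020-04-09T23:59:37+00:00
print(as_millis.date())        # 1970-01-19
```

Both results are valid datetimes, so nothing fails; only the first one is plausible for a bookmark.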

One idea is to accept only the common formats you’d know, like a Unix timestamp and anything dateutil can parse, and otherwise require the user to define the format. That could be done via a command line argument, /archive --date-format="%Y-%m-%d %H:%M:%S" https://example.com/rss/feed.xml, or defined in the config like

SOURCE_TIMESTAMP_CONVERSIONS = {
  "https://example.com/rss/feed.xml": "%Y-%m-%d %H:%M:%S"
}
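Roughly how a lookup against that config could work, as a sketch only. `parse_source_date` is a hypothetical name, the fallback here only tries a Unix timestamp (the dateutil attempt would slot in at the marked comment), and attaching UTC to naive results is my assumption.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical config mapping, mirroring the one proposed above.
SOURCE_TIMESTAMP_CONVERSIONS: Dict[str, str] = {
    "https://example.com/rss/feed.xml": "%Y-%m-%d %H:%M:%S",
}


def parse_source_date(raw: str, source: Optional[str] = None) -> datetime:
    """Parse `raw` using the format configured for `source`, if any."""
    fmt = SOURCE_TIMESTAMP_CONVERSIONS.get(source) if source is not None else None
    if fmt is not None:
        parsed = datetime.strptime(raw, fmt)
        if parsed.tzinfo is None:
            # Assumption: configured formats without an offset are UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    try:
        # Known common format: Unix timestamp in seconds.
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        pass
    # A dateutil attempt could go here before giving up.
    raise ValueError(
        f"Cannot parse date {raw!r}; define a format for source {source!r}"
    )
```

For example, `parse_source_date("2020-04-09 23:59:37", "https://example.com/rss/feed.xml")` uses the configured format, while `parse_source_date("1586476777")` is read as a Unix timestamp.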

Thanks, I understand now. I wasn’t thinking of browser histories.

If archivebox-export-browser-history is exporting browser history, it would know the browser and the epoch it uses internally. Is there any reason why that script couldn’t convert the timestamps to a standard epoch, or pass the browser name/source to archivebox so it knows how to convert it? From a quick look, the internal epochs the browsers use don’t change between OSes.
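Something like the following could do the conversion in the export step. The epochs and units in the table are from memory (Chrome: microseconds since 1601-01-01, Firefox: microseconds since 1970-01-01, Safari: seconds since 2001-01-01) and should be checked against each browser's history schema; `browser_time_to_unix` is a made-up name, not part of the existing script.

```python
from datetime import datetime, timedelta, timezone

# (epoch start, microseconds per stored unit) for each browser's history DB.
# Assumption: these values should be verified against the actual schemas.
BROWSER_EPOCHS = {
    "chrome": (datetime(1601, 1, 1, tzinfo=timezone.utc), 1),
    "firefox": (datetime(1970, 1, 1, tzinfo=timezone.utc), 1),
    "safari": (datetime(2001, 1, 1, tzinfo=timezone.utc), 1_000_000),
}


def browser_time_to_unix(value: float, browser: str) -> float:
    """Convert a browser-internal timestamp to Unix epoch seconds."""
    epoch, micros_per_unit = BROWSER_EPOCHS[browser]
    # Integer inputs stay exact because timedelta works in whole microseconds.
    moment = epoch + timedelta(microseconds=value * micros_per_unit)
    return moment.timestamp()
```

With this, the export script would always hand archivebox plain Unix seconds, and the importer would not need to know which browser the data came from.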

If I change line 144 in util.py from

       return dateparser.parse(date)

to:

        # Requires `import re` at the top of util.py.
        # A run of six or more digits, optionally followed by a fractional
        # part, is treated as a Unix epoch timestamp in seconds.
        # Note: compact dates such as 20200409 also match this pattern.
        if re.fullmatch(r"[0-9]{6,}(\.[0-9]+)?", date):
            # float() handles both "1586476777" and "1586476777.093312"
            return datetime.fromtimestamp(float(date))
        # ISO date strings and everything else still go through dateutil.
        return dateparser.parse(date)

it works!

To reproduce the error:

from datetime import datetime
from dateutil import parser as dateparser

date = '1586476777.093312'
dateparser.parse(date)

throws error:

Traceback (most recent call last):
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 655, in parse
    ret = self._build_naive(res, default)
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
    naive = default.replace(**repl)
ValueError: year 1586476777 is out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/kangus/.local/share/virtualenvs/archivebox0.4/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 657, in parse
    six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
  File "<string>", line 3, in raise_from
dateutil.parser._parser.ParserError: year 1586476777 is out of range: 1586476777.093312