ArchiveBox: Bugfix: django branch start_ts error on init

Describe the bug

When running archivebox init with version 0.4.3 in an old archive directory, archivebox fails at the "Collecting links from any existing indexes and archive folders..." step with KeyError: 'start_ts'.

Steps to reproduce

  1. Installed the django branch with git clone and pip install .
  2. Navigated to old archive directory.
  3. Ran archivebox init
  4. archivebox gets through most of the import process, then dies with the error listed below.

Screenshots or log output

Traceback (most recent call last):
  File "/home/USERNAME/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/archivebox_init.py", line 34, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 108, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/main.py", line 316, in init
    for link in load_main_index(out_dir=out_dir, warn=False)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 108, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/__init__.py", line 250, in load_main_index
    all_links = list(parse_json_main_index(out_dir))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 52, in parse_json_main_index
    yield Link.from_json(link_json)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 203, in from_json
    cast_result = ArchiveResult.from_json(json_result)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 62, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'
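
The traceback shows ArchiveResult.from_json indexing info['start_ts'] directly, so any history entry without that key aborts the whole import. The sketch below is one way to locate such entries before running init. It is not part of ArchiveBox: the helper name find_results_missing_key is hypothetical, and the layout it assumes (a top-level "links" list whose items carry a "history" dict mapping extractor names to lists of result dicts) is an assumption about the old index.json format.

```python
from typing import Any, Dict, Iterator, Tuple


def find_results_missing_key(
    index: Dict[str, Any], key: str = "start_ts"
) -> Iterator[Tuple[str, str]]:
    """Yield (link url, extractor name) for every history entry lacking `key`.

    Hypothetical helper; the "links" / "history" layout is an assumption
    about the main index.json, not something confirmed by this report.
    """
    for link in index.get("links", []):
        history = link.get("history") or {}
        for extractor, results in history.items():
            for result in results or []:
                if key not in result:
                    yield link.get("url", "<no url>"), extractor


# Made-up sample data, only to show the shape the helper expects.
sample_index = {
    "links": [
        {
            "url": "https://example.com/old",
            "history": {"wget": [{"status": "succeeded"}]},
        },
        {
            "url": "https://example.com/new",
            "history": {
                "wget": [{"status": "succeeded", "start_ts": "2019-03-26T05:12:01"}]
            },
        },
    ]
}

missing = list(find_results_missing_key(sample_index))
```

To run it against a real archive, load the file with json.load and pass the resulting dict in place of sample_index.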

Software versions

  • OS: Ubuntu 18.04.4 LTS
  • ArchiveBox version: 848977e
  • Python version: Python 3.7.8

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Awesome, that’s a relief to hear. We were worried it was a regression from the latest version. I’m going to close this issue for now but I’ll keep responding to your comments here, don’t worry.

If you post a ZIP of a handful of those swapped folders (or email it to me), I’ll write you a bash script that fixes it.

@drpfenderson one more try please. Also, if you install it with pip install -e . the installed version will always match the code you currently have checked out (i.e. there is no need to re-run pip install after changing branches).

Perfect, thanks for those samples. It confirms our suspicion that you had a few links archived with a very old version before we introduced start_ts. We’ll add a workaround that will handle that older schema and upgrade those files to the new style.
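
As a rough illustration of what such a workaround could look like (not the actual patch), a loader can read the timestamp with dict.get and fall back to None instead of indexing the key directly. The names parse_ts and upgrade_result are hypothetical, and treating end_ts the same way as start_ts is an assumption; only start_ts appears in the traceback.

```python
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 string, or return None when the value is absent."""
    if value is None:
        return None
    return datetime.fromisoformat(value)


def upgrade_result(
    info: Dict[str, Any],
    ts_fields: Tuple[str, ...] = ("start_ts", "end_ts"),  # end_ts is an assumption
) -> Dict[str, Any]:
    """Return a copy of a result dict in which every timestamp field exists.

    Legacy entries written before start_ts was introduced get None for the
    missing fields instead of raising KeyError. The input is not mutated.
    """
    upgraded = dict(info)
    for field in ts_fields:
        upgraded[field] = parse_ts(info.get(field))
    return upgraded


# A legacy-style entry without timestamps no longer raises.
legacy = {"status": "succeeded", "output": "index.html"}
upgraded = upgrade_result(legacy)
```

Returning a copy keeps the original on-disk data untouched until the caller decides to write the upgraded form back out.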

(Also thanks for the sponsorship @drpfenderson!)

Here is a snippet from the beginning of the main index.json file. Here is another snippet from later in the file. Let me know if you would like or need more, or are looking for something in particular.

The index.html says that it was created with version a3a048d4. Here is a gist containing the output of one of the most recent index.json files, with redacted personal info.