ArchiveBox: Bugfix: django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
Describe the bug
Y’all helped me upgrade my very old archive to the django branch before the official 0.4.9 release. I recently upgraded to the newest version so I could start adding links. archivebox said I had to re-init, but running archivebox init gives me the following error and will not let me add new links:
django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
Full log/error below.
Steps to reproduce
- git checkout master (to switch from the django branch).
- git pull origin master (to pull the new release).
- pip install -e . (also tried with pip uninstall archivebox && pip install .)
- Navigate to the archivebox-output directory.
- Run archivebox init.
- Error.
Screenshots or log output
[i] [2020-07-31 17:34:44] ArchiveBox v0.4.9: archivebox init
> /.archivebox-output/archive-working
[*] Updating existing ArchiveBox collection in this folder...
/.archivebox-output/archive-working
------------------------------------------------------------------
[*] Verifying archive folder structure...
√ /.archivebox-output/archive-working/sources
√ /.archivebox-output/archive-working/archive
√ /.archivebox-output/archive-working/logs
√ /.archivebox-output/archive-working/ArchiveBox.conf
[*] Verifying main SQL index and running migrations...
√ /.archivebox-output/archive-working/index.sqlite3
Operations to perform:
Apply all migrations: admin, auth, contenttypes, core, sessions
Running migrations:
Applying core.0005_auto_20200728_0326... OK
[*] Collecting links from any existing indexes and archive folders...
√ Loaded 1376 links from existing main index.
√ Added 347 orphaned links from existing archive directories.
! Skipped adding 239 invalid link data directories.
X /* SNIP A BUNCH OF BROKEN ARCHIVES /*
Hint: For more information about the link data directories that were skipped, run:
archivebox status
archivebox list --status=invalid
[*] [2020-07-31 18:01:50] Writing 1723 links to main index...
Traceback (most recent call last):
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 575, in update_or_create
obj = self.select_for_update().get(**kwargs)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 417, in get
self.model._meta.object_name
core.models.DoesNotExist: Snapshot matching query does not exist.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 396, in execute
return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/USERNAME/.local/bin/archivebox", line 33, in <module>
sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/cli/__init__.py", line 126, in main
pwd=pwd or OUTPUT_DIR,
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/cli/__init__.py", line 62, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/cli/archivebox_init.py", line 35, in main
out_dir=pwd or OUTPUT_DIR,
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/util.py", line 109, in typechecked_function
return func(*args, **kwargs)
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/main.py", line 369, in init
write_main_index(list(all_links.values()), out_dir=out_dir)
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/util.py", line 109, in typechecked_function
return func(*args, **kwargs)
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/index/__init__.py", line 235, in write_main_index
write_sql_main_index(links, out_dir=out_dir)
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/util.py", line 109, in typechecked_function
return func(*args, **kwargs)
File "/home/USERNAME/datahoard/ArchiveBox/archivebox/index/sql.py", line 42, in write_sql_main_index
Snapshot.objects.update_or_create(url=link.url, defaults=info)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 580, in update_or_create
obj, created = self._create_object_from_params(kwargs, params, lock=True)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 604, in _create_object_from_params
raise e
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 596, in _create_object_from_params
obj = self.create(**params)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 433, in create
obj.save(force_insert=True, using=self.db)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/base.py", line 746, in save
force_update=force_update, update_fields=update_fields)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/base.py", line 784, in save_base
force_update, using, update_fields,
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/base.py", line 887, in _save_table
results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/base.py", line 926, in _do_insert
using=using, raw=raw,
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/query.py", line 1204, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1392, in execute_sql
cursor.execute(sql, params)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
File "/home/USERNAME/.local/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 396, in execute
return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
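The traceback shows a lookup by url raising DoesNotExist, followed by an insert that violates the UNIQUE constraint on timestamp. That pattern can be reproduced outside Django with a minimal sketch using Python's built-in sqlite3 module. The two-column table and the update_or_create helper below are simplified, hypothetical stand-ins (the real core_snapshot table has more columns, and a UNIQUE url column is an assumption), so this only illustrates the mechanism; it is not ArchiveBox's code.

```python
import sqlite3

# Hypothetical, simplified stand-in for the core_snapshot table.
# Assumption: both url and timestamp carry UNIQUE constraints.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE core_snapshot ("
    " url TEXT NOT NULL UNIQUE,"
    " timestamp TEXT NOT NULL UNIQUE)"
)


def update_or_create(conn, url, timestamp):
    """Look a row up by url; update it if found, otherwise insert a new row."""
    row = conn.execute(
        "SELECT 1 FROM core_snapshot WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE core_snapshot SET timestamp = ? WHERE url = ?",
            (timestamp, url),
        )
        return "updated"
    # No row has this url, so insert. This fails if another url
    # already holds the same timestamp.
    conn.execute(
        "INSERT INTO core_snapshot (url, timestamp) VALUES (?, ?)",
        (url, timestamp),
    )
    return "created"


# Made-up URLs and timestamp, for illustration only.
first = update_or_create(conn, "https://example.com/a", "1596216884")

try:
    update_or_create(conn, "https://example.com/b", "1596216884")
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

print(first)
print(error)
```

The second call finds no row for its url, falls through to the insert, and hits the timestamp constraint, which matches the shape of the error in the log: the lookup key and the constrained column are different fields.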
Software versions
- OS: Ubuntu 18.04
- ArchiveBox version: 0.4.9 (0ac4e12)
- Python version: Python 3.7.8
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (7 by maintainers)
Very helpful @karlicoss! This is high on our priority list of things to fix.
I’ll check in with an update once we’ve started working on this. I suspect it’s a relatively simple bug in the timestamp deduping code; most of the work will be QA and testing to make sure we don’t introduce any regressions while we fix it.
For context, timestamp deduping has been one of the most brittle parts of ArchiveBox over the past few years, and we already plan to remove the need for it in a refactor for the next major version.
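For readers unfamiliar with the idea, here is a minimal sketch of what timestamp deduping could look like, assuming colliding timestamps are made unique by appending a numeric suffix after a dot. The function name dedupe_timestamps, the suffix scheme, and the sample data are all hypothetical; this is not ArchiveBox's actual implementation.

```python
from typing import Iterable, List, Set, Tuple


def dedupe_timestamps(links: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (url, timestamp) pairs in which every timestamp is unique.

    A timestamp that is already taken gets the lowest free ".N" suffix
    (hypothetical scheme, for illustration only).
    """
    used: Set[str] = set()
    result: List[Tuple[str, str]] = []
    for url, timestamp in links:
        candidate = timestamp
        suffix = 0
        while candidate in used:
            suffix += 1
            candidate = f"{timestamp}.{suffix}"
        used.add(candidate)
        result.append((url, candidate))
    return result


# Made-up links; three of them share one timestamp.
links = [
    ("https://example.com/a", "1597171609"),
    ("https://example.com/b", "1597171609"),
    ("https://example.com/c", "1597171609"),
    ("https://example.com/d", "1597171700"),
]

deduped = dedupe_timestamps(links)
for url, timestamp in deduped:
    print(timestamp, url)
```

Note that the result depends on input order: the same URL can end up with a different suffixed timestamp if the links are processed in a different order, which is one way a scheme like this can become brittle.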
With the changes in the cdvv7788:sql_index branch, reflected in PR #452, my issue is fixed! I was able to run archivebox init on the old index; it updated with some broken directories but ultimately wrote everything to the index. Looks to be intact! I’ll just add the “invalid link data directories” through a .txt file.
For what it is worth, v0.4.21 fixed the issue I was having regarding sqlite3.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp. Thank you!
@pirate Updated to newest. Same error, exactly, as in my last log.
The rest of the log is exactly the same as well, line references and all.
EDIT: I thought maybe I could try nuking it, starting from scratch. No dice, same error. I tried with docker and docker-compose as well, after removing the original package from pip. Same error in both, but with python3.8 instead.
Deployed the latest Docker image and it seems to have fixed the issue. Thanks so much!
Used pip install --upgrade archivebox; it upgraded and installed 2 additional packages. Went to the archive directory to run archivebox init.
Note: I’m not sure if you need the entire traceback each time, since most of it is identical, but figured more is better when hunting down bugs. Apologies if it’s too much.
@apkallum - Using your build, it gets a bit further. It modifies a few entries, and then gives the following error:
EDIT: To be clear, this is using archivebox init in the main archive directory.
EDIT 2: Oops. Realized I had switched to Python 3.8 for another project and forgot to update-alternatives. Running archivebox init with Python 3.7, with apkallum’s branch, gives me essentially the same error.
Happens for me as well. ArchiveBox version: v0.4.13 (image from Docker Hub).
I experimented a bit and managed to reproduce it consistently. I suspect the URLs that have a suffix in the timestamp are causing it.
Create a new (empty) archive directory, put it in the compose file and initialise
docker-compose run --rm archivebox init
Archive a few URLs
input:
First archiving:
Goes well:
Now if you rerun the same command, it works well too
As expected, just says everything is already in the index
Now try running against one of the URLs that has a dot in the timestamp (with a suffix).
Interestingly enough, running against https://beepb00p.xyz/promnesia.html, which has the timestamp 1597171609, works fine and, as expected, just says it’s already in the index.
Now if you try to add a completely different set of links, it works fine again:
And again, if you try to add http://blog.sigfpe.com/2008/02/what-is-topology.html, it works; if you try http://blog.sigfpe.com/2006/11/yoneda-lemma.html, it fails.
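To check the suspicion about suffixed timestamps against a real index, one option is to group snapshots by their base timestamp (the part before the dot) and list the groups with more than one entry. The sketch below assumes the index has already been exported as (url, timestamp) pairs; the function name group_by_base_timestamp and the sample data are hypothetical, not part of ArchiveBox.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_base_timestamp(
    snapshots: Iterable[Tuple[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (timestamp, url) rows by timestamp with any ".N" suffix stripped.

    Only base timestamps shared by more than one snapshot are returned.
    """
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for url, timestamp in snapshots:
        base = timestamp.split(".", 1)[0]
        groups[base].append((timestamp, url))
    return {base: rows for base, rows in groups.items() if len(rows) > 1}


# Made-up sample data; in practice this would come from the real index.
snapshots = [
    ("https://example.com/a", "1597171609"),
    ("https://example.com/b", "1597171609.0"),
    ("https://example.com/c", "1597171609.1"),
    ("https://example.com/d", "1597171700"),
]

shared = group_by_base_timestamp(snapshots)
for base, rows in shared.items():
    print(base, rows)
```

If the URLs that fail on re-add all show up in such groups while the URLs that succeed do not, that would support the hypothesis; it would not by itself identify the faulty code path.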