influxdb: Backup/restore fails with a lot of databases
Bug report
System info: InfluxDB v1.5.3, installed from brew on Mac OS X 10.12.6
Steps to reproduce:
- Start a clean instance of InfluxDB
rm -r .influxdb
influxd
- Create some dummy databases
curl -X POST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE test"
curl -X POST http://localhost:8086/write?db=test --data-binary "a i=1"
curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 500)"
where dummy_data.pl is
use 5.010;
use strict;
use warnings;
for my $i ($ARGV[0]..$ARGV[1]) {
my $db = "test$i";
say "CREATE DATABASE $db WITH DURATION 260w REPLICATION 1 SHARD DURATION 12w NAME rp2;";
say "CREATE RETENTION POLICY rp1 ON $db DURATION 100d REPLICATION 1 SHARD DURATION 2w;";
say "CREATE CONTINUOUS QUERY cq1 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT LAST(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
say "CREATE CONTINUOUS QUERY cq2 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT MAX(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
}
- Backup everything
rm -r ./backup
influxd backup -portable ./backup
- Try to restore the database test
influxd restore -portable -db test -newdb test_bak backup/
Expected behavior: The database test is restored as test_bak
Actual behavior: Restoring the database fails (most of the time…) with the message “error updating meta: DB metadata not changed. database may already exist”, even though test_bak does not exist.
I wasn’t able to make sense of the resulting log line, in which the message type (here RetentionPolicyInfo) isn’t always the same:
failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)
Additional info: This behaviour seems to depend on the amount of metadata. If I add only 100 dummy databases instead of 500 (curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 100)"), everything works well.
Here is me trying to restore a few times; only the 6th attempt worked:
➜ ~ rm -r ./backup
influxd backup -portable ./backup
2018/06/13 10:27:37 backing up metastore to backup/meta.00
2018/06/13 10:27:37 No database, retention policy or shard ID given. Full meta store backed up.
2018/06/13 10:27:37 Backing up all databases in portable format
2018/06/13 10:27:37 backing up db=
2018/06/13 10:27:37 backing up db=test rp=autogen shard=1 to backup/test.autogen.00001.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backing up db=_internal rp=monitor shard=2 to backup/_internal.monitor.00002.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backup complete:
2018/06/13 10:27:37 backup/20180613T082737Z.meta
2018/06/13 10:27:37 backup/20180613T082737Z.s1.tar.gz
2018/06/13 10:27:37 backup/20180613T082737Z.s2.tar.gz
2018/06/13 10:27:37 backup/20180613T082737Z.manifest
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:45 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:52 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:53 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:55 Restoring shard 1 live from backup 20180613T082737Z.s1.tar.gz
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:57 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜ ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:58 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
The corresponding logs:
2018-06-13T08:27:37.023239Z info Cache snapshot (start) {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.026848Z info Snapshot for path written {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/test/autogen/1", "duration": "3.621ms"}
2018-06-13T08:27:37.026885Z info Cache snapshot (end) {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "3.657ms"}
2018-06-13T08:27:37.031269Z info Cache snapshot (start) {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.033460Z info Snapshot for path written {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/_internal/monitor/2", "duration": "2.198ms"}
2018-06-13T08:27:37.033493Z info Cache snapshot (end) {"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "2.230ms"}
2018-06-13T08:27:45.624373Z info failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:52.234943Z info failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:53.457241Z info failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.170693Z info failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.841937Z info failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:55.620080Z info Opened file {"log_id": "08f3wpxW000", "engine": "tsm1", "service": "filestore", "path": "/Users/ang/.influxdb/data/test_bak/autogen/3/000000001-000000001.tsm", "id": 0, "duration": "0.158ms"}
2018-06-13T08:27:57.340738Z info failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:58.570292Z info failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0) {"log_id": "08f3wpxW000", "service": "snapshot"}
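A note on reading these log lines: in the protobuf wire format every field starts with a varint key whose low three bits are the wire type and whose remaining bits are the field number. A zero byte therefore decodes as field 0 with wire type 0, which is not a valid field, and that matches the “illegal tag 0 (wire type 0)” wording. This would be consistent with the decoder running into zero-filled or truncated data at a different offset on each attempt, although the logs alone do not prove it. The sketch below only illustrates the key layout; `decodeKey` is a hypothetical helper and not part of InfluxDB.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeKey splits a protobuf field key into its field number and wire type.
// A key is a varint: the low three bits are the wire type, the remaining
// bits are the field number. (Hypothetical helper, for illustration only.)
func decodeKey(buf []byte) (fieldNum uint64, wireType uint64, ok bool) {
	key, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, 0, false // empty input or malformed varint
	}
	return key >> 3, key & 0x7, true
}

func main() {
	// 0x08 and 0x12 are ordinary keys; 0x00 is what zero padding looks like.
	for _, b := range [][]byte{{0x08}, {0x12}, {0x00}} {
		field, wire, _ := decodeKey(b)
		fmt.Printf("byte 0x%02x: field=%d wire=%d\n", b[0], field, wire)
	}
}
```

Field numbers start at 1, so a decoded field number of 0 can only come from corrupt or missing bytes.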
About this issue
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 48 (3 by maintainers)
The fix from PR #17495 was merged to master-1.x on 31 Mar. If I rebuild influxd from master-1.x, it works.
So the fix was released with neither 1.8.0 (Jun), 1.8.1 (Jul), nor 1.8.2 (Aug) 😐 Why? People can’t restore backups… I have to say, though, that some backups do restore fine, while others fail because of this issue.
This seems pretty critical. Is this prioritized?
I submitted the patch above. Anyone interested, please review so we can get this merged as quickly as possible.
And my solution is: ` func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) { var r Request d := json.NewDecoder(conn)
} `
Yes, it helps.
Same here. We are moving away from Influx, which is far behind its competitors.
My InfluxDB version is 1.7.6. I found the problem in the file “influxdata/influxdb/services/snapshotter/service.go”: `func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) { var r Request d := json.NewDecoder(conn)
}` This function reads the contents of the metadata file from the TCP connection. A large file is sent in several pieces, but the function reads from the connection at most twice. That is to say, when the metadata file is too large, the function does not receive the complete content. It should be changed so that it keeps reading until all the data sent by the client has been received.
I modified and compiled this part of the code in my environment. With the change, the “proto: meta.data: illegal tag 0 (wire type 0)” error no longer occurs and the restore command succeeds.
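To illustrate the point about short reads, here is a minimal, self-contained sketch. It is not the patch from #17495 and not the real `readRequest`; `chunkedReader`, `readOnce` and `readPayload` are hypothetical names, and the chunk size of 1460 bytes is an arbitrary assumption. The sketch simulates a connection that hands over data in small pieces and compares a single `Read` with `io.ReadFull`, which keeps reading until the buffer is full.

```go
package main

import (
	"fmt"
	"io"
)

// chunkedReader is a hypothetical stand-in for a TCP connection: each Read
// returns at most `chunk` bytes, however large the caller's buffer is.
type chunkedReader struct {
	data  []byte
	chunk int
}

func (c *chunkedReader) Read(p []byte) (int, error) {
	if len(c.data) == 0 {
		return 0, io.EOF
	}
	n := c.chunk
	if n > len(p) {
		n = len(p)
	}
	if n > len(c.data) {
		n = len(c.data)
	}
	copy(p, c.data[:n])
	c.data = c.data[n:]
	return n, nil
}

// readOnce shows the failure mode: one Read into a full-size buffer.
// On a short read the tail of the buffer stays zero-filled.
func readOnce(r io.Reader, size int) ([]byte, int, error) {
	buf := make([]byte, size)
	n, err := r.Read(buf)
	return buf, n, err
}

// readPayload keeps reading until exactly size bytes have arrived and
// returns an error if the stream ends before that.
func readPayload(r io.Reader, size int) ([]byte, error) {
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	payload := make([]byte, 100000)
	for i := range payload {
		payload[i] = 1 // non-zero, so unfilled bytes are easy to spot
	}

	_, n, _ := readOnce(&chunkedReader{data: payload, chunk: 1460}, len(payload))
	fmt.Printf("single Read: %d of %d bytes\n", n, len(payload))

	got, err := readPayload(&chunkedReader{data: payload, chunk: 1460}, len(payload))
	fmt.Printf("io.ReadFull: %d bytes, err=%v\n", len(got), err)
}
```

A small payload may fit into a single read by chance, which would fit the report that 100 databases restore fine while 500 fail most of the time.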
This should be fixed by a combination of #21991 (in 1.8.9) and #17495 (in 1.8.10).
I was able to reproduce this with the steps from https://github.com/influxdata/influxdb/issues/9968#issue-331908486 on some attempts with v1.8.0.
I was not able to reproduce it on the latest 1.8, including https://github.com/influxdata/influxdb/pull/22427 (coming in 1.8.10). I ran a script that repeats the repro 20 times. Will close this when the 1.8 backport for #22427 closes.
It seems #17495 will be included in the next release, v1.9.0.
https://github.com/influxdata/influxdb/blob/b26a2f7a0e41349938cec592a2abac4d93c9ab1c/CHANGELOG.md #17495: fix(snapshotter): properly read payload
I found that restore works on a real machine (laptop) but not on any server with a virtual disk. I tried VPSes from Digital Ocean, Azure and several Vietnamese providers (vHost, Vinahost, VCCloud, Vietnix). I also tried a bare-metal server from Scaleway, which comes with a network disk. All of them failed to restore the InfluxDB database (portable mode).
Log from client:
Log from server:
We just ran into this issue. It would be nice if the fix #17495 were backported to 1.8.x.
Again the fix is not included in 1.8.3 (Sep).
While influxdata doesn’t care, here is an all-in-one Dockerfile to build a new release from master-1.x:
docker build -t yourimage:1.8.x .