influxdb: Backup/restore fails with a lot of databases

Bug report

System info: InfluxDB v1.5.3, installed from brew on Mac OS X 10.12.6

Steps to reproduce:

  1. Start a clean instance of InfluxDB
rm -r .influxdb
influxd
  2. Create some dummy databases
curl -X POST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE test"
curl -X POST http://localhost:8086/write?db=test --data-binary "a i=1"
curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 500)"

where dummy_data.pl is

use 5.010;
use strict;
use warnings;
for my $i ($ARGV[0]..$ARGV[1]) {
    my $db = "test$i";
    say "CREATE DATABASE $db WITH DURATION 260w REPLICATION 1 SHARD DURATION 12w NAME rp2;";
    say "CREATE RETENTION POLICY rp1 ON $db DURATION 100d REPLICATION 1 SHARD DURATION 2w;";
    say "CREATE CONTINUOUS QUERY cq1 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT LAST(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
    say "CREATE CONTINUOUS QUERY cq2 ON $db RESAMPLE EVERY 5m FOR 10m BEGIN SELECT MAX(a) AS b, c INTO $db.rp2.m FROM $db.rp1.m GROUP BY time(5m) END;";
}
  3. Back up everything
rm -r ./backup
influxd backup -portable ./backup
  4. Try to restore the database test
influxd restore -portable -db test -newdb test_bak backup/

Expected behavior: The database test is restored as test_bak

Actual behavior: Restoring the database fails (most of the time…) with the message error updating meta: DB metadata not changed. database may already exist, even though test_bak does not exist. I wasn't able to make sense of the resulting server log line, in which the failing type (RetentionPolicyInfo here) isn't always the same:

failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)

Additional info: This behaviour seems to depend on the amount of metadata. If I add only 100 dummy databases instead of 500 (curl -X POST http://localhost:8086/query --data-urlencode "q=$(perl dummy_data.pl 1 100)"), everything works well.

Here is me trying to restore a few times; the 6th attempt worked:

➜  ~ rm -r ./backup
influxd backup -portable ./backup
2018/06/13 10:27:37 backing up metastore to backup/meta.00
2018/06/13 10:27:37 No database, retention policy or shard ID given. Full meta store backed up.
2018/06/13 10:27:37 Backing up all databases in portable format
2018/06/13 10:27:37 backing up db=
2018/06/13 10:27:37 backing up db=test rp=autogen shard=1 to backup/test.autogen.00001.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backing up db=_internal rp=monitor shard=2 to backup/_internal.monitor.00002.00 since 0001-01-01T00:00:00Z
2018/06/13 10:27:37 backup complete:
2018/06/13 10:27:37 	backup/20180613T082737Z.meta
2018/06/13 10:27:37 	backup/20180613T082737Z.s1.tar.gz
2018/06/13 10:27:37 	backup/20180613T082737Z.s2.tar.gz
2018/06/13 10:27:37 	backup/20180613T082737Z.manifest
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:45 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:52 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:53 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:54 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:55 Restoring shard 1 live from backup 20180613T082737Z.s1.tar.gz
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:57 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist
➜  ~ influxd restore -portable -db test -newdb test_bak backup/
2018/06/13 10:27:58 error updating meta: DB metadata not changed. database may already exist
restore: DB metadata not changed. database may already exist

The corresponding logs:

2018-06-13T08:27:37.023239Z	info	Cache snapshot (start)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.026848Z	info	Snapshot for path written	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/test/autogen/1", "duration": "3.621ms"}
2018-06-13T08:27:37.026885Z	info	Cache snapshot (end)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4kl000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "3.657ms"}
2018-06-13T08:27:37.031269Z	info	Cache snapshot (start)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "start"}
2018-06-13T08:27:37.033460Z	info	Snapshot for path written	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "path": "/Users/ang/.influxdb/data/_internal/monitor/2", "duration": "2.198ms"}
2018-06-13T08:27:37.033493Z	info	Cache snapshot (end)	{"log_id": "08f3wpxW000", "engine": "tsm1", "trace_id": "08f3y4ml000", "op_name": "tsm1_cache_snapshot", "op_event": "end", "op_elapsed": "2.230ms"}
2018-06-13T08:27:45.624373Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:52.234943Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:53.457241Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.170693Z	info	failed to decode meta: proto: meta.DatabaseInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:54.841937Z	info	failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:55.620080Z	info	Opened file	{"log_id": "08f3wpxW000", "engine": "tsm1", "service": "filestore", "path": "/Users/ang/.influxdb/data/test_bak/autogen/3/000000001-000000001.tsm", "id": 0, "duration": "0.158ms"}
2018-06-13T08:27:57.340738Z	info	failed to decode meta: proto: meta.Data: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}
2018-06-13T08:27:58.570292Z	info	failed to decode meta: proto: meta.RetentionPolicyInfo: illegal tag 0 (wire type 0)	{"log_id": "08f3wpxW000", "service": "snapshot"}

Most upvoted comments

The fix from PR #17495 was merged to master-1.x on 31 Mar. If I rebuild influxd from master-1.x, it works.

So the fix was released neither with 1.8.0 (Jun), nor with 1.8.1 (Jul), nor with 1.8.2 (Aug) 😐 Why? People can't restore backups… That said, some backups do restore fine, but others fail because of this issue.

This seems pretty critical. Is this prioritized?

I submitted the patch above. Anyone interested, please review so we can get this merged as quickly as possible.

My InfluxDB version is 1.7.6. I found the problem in the file influxdata/influxdb/services/snapshotter/service.go:

func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) {
	var r Request
	d := json.NewDecoder(conn)

	if err := d.Decode(&r); err != nil {
		return r, nil, err
	}

	bits := make([]byte, r.UploadSize+1)

	if r.UploadSize > 0 {
		remainder := d.Buffered()

		n, err := remainder.Read(bits)
		if err != nil && err != io.EOF {
			return r, bits, err
		}

		// it is a bit random but sometimes the Json decoder will consume all the bytes and sometimes
		// it will leave a few behind.
		if err != io.EOF && n < int(r.UploadSize+1) {
			_, err = conn.Read(bits[n:])
		}

		if err != nil && err != io.EOF {
			return r, bits, err
		}
		// the JSON encoder on the client side seems to write an extra byte, so trim that off the front.
		return r, bits[1:], nil
	}

	return r, bits, nil
}

This function reads the contents of the metadata file from the TCP connection. When the file is large it arrives in several chunks, but the function performs at most two reads, so a large metadata file is never received in full. It should be changed to keep reading until all of the data sent by the client has been received.

I have modified and compiled this part of the code in my environment. After the modification there is no more "proto: meta.Data: illegal tag 0 (wire type 0)" error, and the restore command succeeds.

And my solution is:

func (s *Service) readRequest(conn net.Conn) (Request, []byte, error) {
	var r Request
	d := json.NewDecoder(conn)

	if err := d.Decode(&r); err != nil {
		return r, nil, err
	}

	var buffer bytes.Buffer
	if r.UploadSize > 0 {
		bits := make([]byte, r.UploadSize+1)
		remainder := d.Buffered()
		n, err := remainder.Read(bits)
		if err != nil && err != io.EOF {
			return r, bits, err
		}
		fmt.Println("remainder num: ", n)
		buffer.Write(bits[0:n])
		// Set the timeout according to the actual situation
		_ = conn.SetReadDeadline(time.Now().Add(20 * time.Second))
		for {
			//bs := make([]byte, r.UploadSize-int64(n+rn))
			nu, err := conn.Read(bits)
			if err != nil && err != io.EOF {
				return r, buffer.Bytes(), err
			}
			if err != io.EOF && n < int(r.UploadSize+1) {
				buffer.Write(bits[0:nu])
				n += nu
				if n >= int(r.UploadSize) {
					// upstream receiving completed
					break
				}
				continue
			}
		}
		// the JSON encoder on the client side seems to write an extra byte, so trim that off the front.
		return r, buffer.Bytes()[1:], nil
	}
	return r, buffer.Bytes(), nil
}
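
For comparison, here is a minimal sketch of the same idea built on io.ReadFull, which loops internally until the buffer is completely filled. This is not the code from PR #17495 (I haven't checked what that patch actually does); the helper name readPayload and its signature are mine, and it assumes the same Request/UploadSize framing and the extra leading byte described above.

package snapshotter // hypothetical placement, for illustration only

import (
	"encoding/json"
	"io"
	"net"
)

// readPayload (assumed helper, not upstream code) reads exactly uploadSize+1
// bytes of payload: first whatever the JSON decoder has already buffered,
// then the rest directly from the connection.
func readPayload(conn net.Conn, d *json.Decoder, uploadSize int64) ([]byte, error) {
	bits := make([]byte, uploadSize+1)
	// io.MultiReader drains d.Buffered() before falling back to conn, and
	// io.ReadFull keeps reading until bits is full or an error occurs.
	if _, err := io.ReadFull(io.MultiReader(d.Buffered(), conn), bits); err != nil {
		return nil, err
	}
	// As in the original code, trim the extra byte written by the client-side
	// JSON encoder off the front.
	return bits[1:], nil
}

io.ReadFull returns io.ErrUnexpectedEOF if the connection closes before the expected number of bytes has arrived, so a short transfer shows up as an explicit error rather than as a truncated blob that later fails to decode as protobuf.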

Didn't this commit solve the problem?

Yes, it helps.

It seems #17495 will be merged in the next release, v1.9.0.

For us, not being able to restore databases reliably was the final straw that made us decide to move away from Influx. Lots of valuable business data could have been lost because of this.

Same here. We are moving away from Influx, which is far behind its competitors.

This should be fixed by a combination of #21991 (in 1.8.9) and #17495 (in 1.8.10).

I was able to reproduce this with https://github.com/influxdata/influxdb/issues/9968#issue-331908486 on some tries with v1.8.0.

I was not able to reproduce it on the latest 1.8, including https://github.com/influxdata/influxdb/pull/22427 (coming in 1.8.10). Ran a script to run the repro 20x. Will close this when the 1.8 backport for #22427 closes.

The CHANGELOG at https://github.com/influxdata/influxdb/blob/b26a2f7a0e41349938cec592a2abac4d93c9ab1c/CHANGELOG.md lists #17495: fix(snapshotter): properly read payload.

I found that restore works on a real machine (my laptop), but not on any server with a virtual disk. I tried VPSes from DigitalOcean, Azure, and some Vietnamese providers (vHost, Vinahost, VCCloud, Vietnix). I also tried a bare-metal server from Scaleway, which comes with a network disk. All of them failed to restore an InfluxDB database (portable mode).

Log from client:

$ influxd restore -portable -db daothanh /tmp/db/ts_daothanh
2019/03/08 08:12:44 error updating meta: updating metadata on influxd service failed: err=read tcp 127.0.0.1:54174->127.0.0.1:8088: read: connection reset by peer, n=16
restore: updating metadata on influxd service failed: err=read tcp 127.0.0.1:54174->127.0.0.1:8088: read: connection reset by peer, n=16

Log from server:

$ journalctl -u influxdb

...
Mar 08 08:12:44 db-ams influxd[6119]: ts=2019-03-08T08:12:44.044335Z lvl=info msg="failed to decode meta: proto: meta.ShardGroupInfo: illegal ta
...

We just ran into this issue. It would be nice if the fix from #17495 were backported to 1.8.x.

Again, the fix is not included in 1.8.3 (Sep).

While influxdata doesn't care, here is an all-in-one Dockerfile to build a new image from master-1.x:

FROM golang:1.15.5-alpine3.12 as builder

RUN set -ex && \
    apk update && \
    apk add ca-certificates git bash gcc musl-dev && \
    git config --global http.https://gopkg.in.followRedirects true && \
    git clone --depth 1 --branch master-1.x https://github.com/influxdata/influxdb.git /opt/ && \
    cd /opt && \
    go build ./cmd/influxd && \
    chmod +x influxd

RUN git clone --depth 1 --branch master https://github.com/influxdata/influxdata-docker.git /opt2


FROM alpine:3.12

RUN echo 'hosts: files dns' >> /etc/nsswitch.conf
RUN set -ex && \
    apk add --no-cache tzdata bash ca-certificates && \
    update-ca-certificates

COPY --from=builder /opt/influxd /usr/bin/influxd
COPY --from=builder /opt2/influxdb/1.8/alpine/influxdb.conf /etc/influxdb/influxdb.conf

EXPOSE 8086

VOLUME /var/lib/influxdb

COPY --from=builder /opt2/influxdb/1.8/alpine/entrypoint.sh /entrypoint.sh
COPY --from=builder /opt2/influxdb/1.8/alpine/init-influxdb.sh /init-influxdb.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["influxd"]

docker build -t yourimage:1.8.x .