tensorflow: Memory leak in Tensorflow Graph on CPU

I have a face detector that I'm trying to use for inference in Golang via the official TensorFlow bindings. However, I hit a stepwise memory leak that eventually gets the application killed due to OOM.

Screenshot with the leak:

[image: stepwise memory growth over time]

Heap samples from pprof do not show anything interesting in the Go code. It looks like the leak is on the C++ backend side:

go tool pprof http://localhost:8200/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8200/debug/pprof/heap
Saved profile in /pprof.facedetector.alloc_objects.alloc_space.inuse_objects.inuse_space.073.pb.gz
File: facedetector
Build ID: c4dba901e690718468bdd4fa7d1a631daca9e65a
Type: inuse_space
Time: Jan 22, 2020 at 11:44pm (MSK)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
(pprof) top20
Showing nodes accounting for 1861.64kB, 100% of 1861.64kB total
      flat  flat%   sum%        cum   cum%
  930.82kB 50.00% 50.00%   930.82kB 50.00%  bytes.makeSlice
  930.82kB 50.00%   100%   930.82kB 50.00%  main.(*FaceDetector).FindFaces
         0     0%   100%   930.82kB 50.00%  bytes.(*Buffer).Grow
         0     0%   100%   930.82kB 50.00%  bytes.(*Buffer).grow
         0     0%   100%   930.82kB 50.00%  io/ioutil.ReadFile
         0     0%   100%   930.82kB 50.00%  io/ioutil.readAll
         0     0%   100%  1861.64kB   100%  main.main
         0     0%   100%  1861.64kB   100%  runtime.main
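Since pprof only samples Go-side allocations, a quick way to confirm a native (cgo/C++) leak is to compare Go's own heap accounting against the resident set size the kernel reports. Below is a minimal sketch of such a check, assuming Linux (it parses VmRSS from /proc/self/status; the rssKB helper is mine, not part of any library):

package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

// rssKB returns the process resident set size as reported by the kernel
// in /proc/self/status (Linux only).
func rssKB() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// If VmRSS keeps climbing while HeapAlloc stays flat across iterations,
	// the growth is happening outside the Go heap (e.g. in the TensorFlow
	// C++ runtime), which pprof's heap profile cannot see.
	fmt.Printf("Go HeapAlloc: %d kB, OS VmRSS: %s\n", ms.HeapAlloc/1024, rssKB())
}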

Code to reproduce the issue

This code just reads an image from the file system in a for-loop and feeds it to the model.

package main

import (
	"io/ioutil"
	"log"
	"net/http"
	_ "net/http/pprof"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

type FaceDetector struct {
	session         *tf.Session
	graph           *tf.Graph

	inputOp  tf.Output
	bboxesOp tf.Output
	scoresOp tf.Output
}

func NewFaceDetector(frozenGraphPath string) (*FaceDetector, error) {
	fd := &FaceDetector{}
	model, err := ioutil.ReadFile(frozenGraphPath)
	if err != nil {
		return nil, err
	}

	fd.graph = tf.NewGraph()
	if err := fd.graph.Import(model, ""); err != nil {
		return nil, err
	}

	fd.session, err = tf.NewSession(fd.graph, nil)
	if err != nil {
		return nil, err
	}

	fd.inputOp = fd.graph.Operation("input_image").Output(0)
	fd.bboxesOp = fd.graph.Operation("bboxes").Output(0)
	fd.scoresOp = fd.graph.Operation("scores_1/GatherV2").Output(0)

	return fd, nil
}

func (fd *FaceDetector) Close() error {
	return fd.session.Close()
}

func (fd *FaceDetector) FindFaces(image []byte) ([]float32, [][]int32, error) {
	imageTensor, err := tf.NewTensor(string(image))
	if err != nil {
		return nil, nil, err
	}

	output, err := fd.session.Run(
		map[tf.Output]*tf.Tensor{
			fd.inputOp: imageTensor,
		},
		[]tf.Output{
			fd.bboxesOp,
			fd.scoresOp,
		},
		nil)

	if err != nil {
		return nil, nil, err
	}

	bboxes := output[0].Value().([][]int32)
	scores := output[1].Value().([]float32)

	return scores, bboxes, nil
}

func main() {
	log.SetFlags(log.LstdFlags | log.Lshortfile | log.Lmicroseconds)

	go func() {
		log.Println(http.ListenAndServe("0.0.0.0:8200", nil))
	}()

	fd, err := NewFaceDetector("OptimizedGraph.pb")

	if err != nil {
		panic(err)
	}

	defer fd.Close()

	for {
		img, err := ioutil.ReadFile("photo.jpg")
		if err != nil {
			log.Println("Fail to read image:", err)
			continue
		}

		scores, boxes, err := fd.FindFaces(img)
		if err != nil {
			log.Println("Fail to infer:", err)
			continue
		}

		log.Println(scores, boxes)
	}
}

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu-based Docker images
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Not tested
  • TensorFlow installed from (source or binary): Binary/Pre-compiled (https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.15.0.tar.gz)
  • TensorFlow version (use command below): 1.15.0
  • Python version: -
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: -
  • GPU model and memory: -
  • Golang version: 1.13

photo.jpg: https://user-images.githubusercontent.com/2982775/72982648-96393b80-3df0-11ea-807f-c4979b4c3af2.jpg TF Graph: graph.zip

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 16 (1 by maintainers)

Most upvoted comments

Hello everyone!

Since we switched our prediction platform to TensorFlow, we have also experienced a memory leak.

After finding that out, we used go tool pprof to find where the leaks were happening, but it was useless: the leak was impossible to find with pprof. In fact, pprof's inuse_space total was completely different from top's. Everything seemed fine in pprof, but top showed a constant rise in memory usage.

We tried to figure out whether our own code could be the issue, and it is not: there is no leak at all without any call to TF, and still no memory leak shown by pprof.

We abstracted away as much as possible and ended up with this piece of code.

package main

import (
	"fmt"
	"runtime"
	"time"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func printStats(mem runtime.MemStats) {
	runtime.ReadMemStats(&mem)
	fmt.Println("mem.Alloc:", mem.Alloc)
	fmt.Println("mem.TotalAlloc:", mem.TotalAlloc)
	fmt.Println("mem.HeapAlloc:", mem.HeapAlloc)
	fmt.Println("mem.NumGC:", mem.NumGC)
	fmt.Println("-----")
}

func loadModel(modelPath string, tags []string) {
	model, err := tf.LoadSavedModel(modelPath, tags, nil)
	if err != nil {
		fmt.Println("error OPENING model:", err)
		return
	}
	if err := model.Session.Close(); err != nil {
		fmt.Println("error CLOSING model:", err)
	}
}

func main() {
	var mem runtime.MemStats

	printStats(mem)
	tags := []string{"serve"}
	for i := 0; i < 300; i++ {
		loadModel("./path_to_model/", tags)
		fmt.Println("loading model: => ", i)
		printStats(mem)
	}
	fmt.Println("done")
	printStats(mem)
	fmt.Println("call to garbage collector")
	runtime.GC()
	printStats(mem)
	fmt.Println("sleeping 10 minutes")
	time.Sleep(time.Minute * 10)
	fmt.Println("complete done")
}

We are basically loading a model 300 times, closing each one right after, and then calling the garbage collector at the end of the process to force the memory pointed to by unused pointers to be freed. However, whether with TF 1.15.2 or 2.0.1, we are experiencing something weird: the models are, oddly enough, only being freed some of the time. Run this program several times and watch what top says about the process's memory.

We expect it to decrease as soon as the GC is called. On both 1.15.2 and 2.0.1, sometimes it decreases and sometimes it does not; we think this is linked to pointers still held by TF somewhere, preventing the GC from freeing them. Note that in our recent tests this issue seems to happen more often with TF 2.0.1 than with 1.15.2.

In our case, with our model, the program starts at around 100-200 MB, reaches 1.5 GB by the end of the 300 model loads, and then, when the GC is called, the in-use memory shown by top sometimes drops back to 150 MB and sometimes stays at 1.5 GB.

All of this leads us to believe that the issue is in the C++ code OR in the cgo bindings of TensorFlow. We couldn't figure out anything else. An investigation would be appreciated 🚀
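One thing worth ruling out in this experiment: since Go 1.12 the runtime returns freed heap pages to the OS lazily (via MADV_FREE on Linux), so top can keep reporting a high RSS even after a GC has freed everything on the Go side. runtime/debug.FreeOSMemory forces a GC and an eager return of freed memory to the OS. A small tweak to the program above (my suggestion, not part of the original test):

import "runtime/debug"

// ...at the end of main, instead of plain runtime.GC():
debug.FreeOSMemory() // runs a GC and eagerly returns freed pages to the OS
printStats(mem)

If RSS still stays at 1.5 GB after this call, the memory is genuinely held on the C/C++ side rather than merely not yet returned by the Go runtime.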

Execute the following code and a memory leak will occur:

for i := 0; i < 100; i++ {
	// The returned model (and error) are discarded, so nothing is ever closed.
	tf.LoadSavedModel("/model/fm_model_v1", []string{"serve"}, nil)
	time.Sleep(10 * time.Second)
}

I have the same issue, but I use tf.LoadSavedModel. I tried assigning the model to nil, defer session.Close(), and forcing Go garbage collection; none of them works. Both go pprof and top show the same result: LoadSavedModel does not release memory.

In my case, I have to choose one of hundreds of model files and load it for inference. I know there is something stopping my session.Close() or the model release from taking effect, because I'm sending requests frequently.

Is there anything I can do? I chose Golang to try to escape the Python memory-leak nightmare, and now I've put myself into another one, sigh.
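One way to at least bound the damage in the many-models scenario is to load each SavedModel once and reuse it across requests, so the unreleased memory is paid once per model path rather than once per request. A minimal sketch (the modelCache type is a hypothetical helper, not from this thread):

package main

import (
	"sync"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

// modelCache loads each SavedModel at most once and reuses it afterwards.
type modelCache struct {
	mu     sync.Mutex
	models map[string]*tf.SavedModel
}

func newModelCache() *modelCache {
	return &modelCache{models: make(map[string]*tf.SavedModel)}
}

func (c *modelCache) get(path string) (*tf.SavedModel, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if m, ok := c.models[path]; ok {
		return m, nil
	}
	m, err := tf.LoadSavedModel(path, []string{"serve"}, nil)
	if err != nil {
		return nil, err
	}
	c.models[path] = m
	return m, nil
}

With hundreds of large models this trades one problem for another, of course; an LRU that calls Session.Close() on eviction would cap the cache size, though as reported above Close() may not actually give the memory back.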

Hi There,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.

It seems that TensorFlow has no limit on CPU utilization by default; memory usage will increase until OOM if you don't add a limit. Try adding tf_config = tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1) and sess = tf.Session(config=tf_config); it works for me.
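For anyone trying the same from Go: the snippet above is Python, but the Go bindings accept a serialized ConfigProto through tf.SessionOptions. There is no ConfigProto type on the Go side, so the thread-pool fields have to be supplied as already-encoded protobuf bytes. A sketch, assuming config.proto's field numbers (intra_op_parallelism_threads = 2, inter_op_parallelism_threads = 5) and that graph is your imported *tf.Graph:

// Serialized ConfigProto with intra_op_parallelism_threads = 1 and
// inter_op_parallelism_threads = 1, hand-encoded as protobuf varints:
// 0x10 is the tag byte for field 2, 0x28 is the tag byte for field 5.
opts := &tf.SessionOptions{Config: []byte{0x10, 0x01, 0x28, 0x01}}

session, err := tf.NewSession(graph, opts)
// or: model, err := tf.LoadSavedModel(path, []string{"serve"}, opts)

Whether limiting the thread pools actually stops the growth (rather than just slowing it) is worth verifying with top as described above.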

I have narrowed it down to tf.NewTensor also creating a memory leak. It seems the Go garbage collector does not free the entire unused Tensor.
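That matches what a minimal loop shows (my own repro sketch, not from the report above): each *tf.Tensor wraps C-allocated memory that, as far as I can tell, is only released when the Go runtime runs the tensor's finalizer, so the Go heap stays flat in pprof while RSS in top climbs:

package main

import (
	"log"
	"runtime/debug"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
	payload := string(make([]byte, 1<<20)) // 1 MiB string tensor payload
	for i := 0; i < 1000; i++ {
		// The tensor is dropped immediately; its C-side buffer is freed
		// only when a GC cycle runs its finalizer.
		if _, err := tf.NewTensor(payload); err != nil {
			log.Fatal(err)
		}
		if i%100 == 0 {
			// Forcing a GC (and returning freed pages to the OS) tests
			// whether the C memory is reclaimable at all.
			debug.FreeOSMemory()
		}
	}
}

If RSS drops at the FreeOSMemory calls, the tensors are not leaked outright; they are just waiting on finalizers. If it keeps climbing regardless, the leak is on the C side.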