tensorflow: Memory leak in Tensorflow Graph on CPU
I have a Face Detector that I'm trying to use for inference in Go via the official TensorFlow bindings. However, I'm hitting a stepwise memory leak that eventually gets the application killed due to OOM.
Screenshot with the leak (image attachment not preserved here).

Heap samples from pprof do not show anything interesting in the Go code. It looks like the leak is on the C++ backend side:
go tool pprof http://localhost:8200/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8200/debug/pprof/heap
Saved profile in /pprof.facedetector.alloc_objects.alloc_space.inuse_objects.inuse_space.073.pb.gz
File: facedetector
Build ID: c4dba901e690718468bdd4fa7d1a631daca9e65a
Type: inuse_space
Time: Jan 22, 2020 at 11:44pm (MSK)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 1861.64kB, 100% of 1861.64kB total
      flat  flat%   sum%        cum   cum%
  930.82kB 50.00% 50.00%   930.82kB 50.00%  bytes.makeSlice
  930.82kB 50.00%   100%   930.82kB 50.00%  main.(*FaceDetector).FindFaces
         0     0%   100%   930.82kB 50.00%  bytes.(*Buffer).Grow
         0     0%   100%   930.82kB 50.00%  bytes.(*Buffer).grow
         0     0%   100%   930.82kB 50.00%  io/ioutil.ReadFile
         0     0%   100%   930.82kB 50.00%  io/ioutil.readAll
         0     0%   100%  1861.64kB   100%  main.main
         0     0%   100%  1861.64kB   100%  runtime.main
Code to reproduce the issue
This code just reads an image from the file system in a for-loop and feeds it to the model.
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	_ "net/http/pprof"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

type FaceDetector struct {
	session  *tf.Session
	graph    *tf.Graph
	inputOp  tf.Output
	bboxesOp tf.Output
	scoresOp tf.Output
}

func NewFaceDetector(frozenGraphPath string) (*FaceDetector, error) {
	fd := &FaceDetector{}
	model, err := ioutil.ReadFile(frozenGraphPath)
	if err != nil {
		return nil, err
	}
	fd.graph = tf.NewGraph()
	if err := fd.graph.Import(model, ""); err != nil {
		return nil, err
	}
	fd.session, err = tf.NewSession(fd.graph, nil)
	if err != nil {
		return nil, err
	}
	fd.inputOp = fd.graph.Operation("input_image").Output(0)
	fd.bboxesOp = fd.graph.Operation("bboxes").Output(0)
	fd.scoresOp = fd.graph.Operation("scores_1/GatherV2").Output(0)
	return fd, nil
}

func (fd *FaceDetector) Close() error {
	return fd.session.Close()
}

func (fd *FaceDetector) FindFaces(image []byte) ([]float32, [][]int32, error) {
	// The graph expects the encoded JPEG bytes as a string scalar tensor.
	imageTensor, err := tf.NewTensor(string(image))
	if err != nil {
		return nil, nil, err
	}
	output, err := fd.session.Run(
		map[tf.Output]*tf.Tensor{
			fd.inputOp: imageTensor,
		},
		[]tf.Output{
			fd.bboxesOp,
			fd.scoresOp,
		},
		nil)
	if err != nil {
		return nil, nil, err
	}
	bboxes := output[0].Value().([][]int32)
	scores := output[1].Value().([]float32)
	return scores, bboxes, nil
}

func main() {
	log.SetFlags(log.LstdFlags | log.Lshortfile | log.Lmicroseconds)
	// Expose the net/http/pprof endpoints on :8200.
	go func() {
		log.Println(http.ListenAndServe("0.0.0.0:8200", nil))
	}()
	fd, err := NewFaceDetector("OptimizedGraph.pb")
	if err != nil {
		panic(err)
	}
	defer fd.Close()
	for {
		img, err := ioutil.ReadFile("photo.jpg")
		if err != nil {
			log.Println("Fail to read image:", err)
			continue
		}
		scores, boxes, err := fd.FindFaces(img)
		if err != nil {
			log.Println("Fail to infer:", err)
			continue
		}
		log.Println(scores, boxes)
	}
}
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu-based Docker images
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Not tested
- TensorFlow installed from (source or binary): Binary/Pre-compiled (https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.15.0.tar.gz)
- TensorFlow version (use command below): 1.15.0
- Python version: -
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: -
- GPU model and memory: -
- Golang version: 1.13
photo.jpg: https://user-images.githubusercontent.com/2982775/72982648-96393b80-3df0-11ea-807f-c4979b4c3af2.jpg
TF Graph: graph.zip
About this issue
- State: closed
- Created 4 years ago
- Reactions: 10
- Comments: 16 (1 by maintainers)
Hello everyone!
Since we switched our prediction platform to TensorFlow, we have also experienced a memory leak.
After finding that out, we used go tool pprof to locate where the leaks were happening, but it was useless: the leak is impossible to find with pprof. In fact, pprof's inuse_space total was completely different from top's output. Everything seemed fine in pprof, while top showed a constant rise in memory usage.
We tried to figure out whether our own code could be the issue, and it is not: there is no leak at all without any call to TF, and still no memory leak shows up in pprof.
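One way to see that discrepancy directly is to compare the Go runtime's own heap accounting against the resident set size the OS reports, since cgo/C++ allocations only show up in the latter. A minimal helper, assuming Linux and the /proc/self/status layout (not part of the original report):

package main

import (
	"bufio"
	"log"
	"os"
	"runtime"
	"strings"
)

// logMemory prints the Go heap in use next to the process RSS.
// With a leak on the C++ side, HeapInuse stays flat while VmRSS keeps growing.
func logMemory() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	f, err := os.Open("/proc/self/status")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rss := "unknown"
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			rss = strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
			break
		}
	}
	log.Printf("Go HeapInuse: %d kB, process VmRSS: %s", ms.HeapInuse/1024, rss)
}

func main() {
	logMemory()
}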
We tried to abstract as much as possible and ended up with a small piece of code, sketched below.
We basically load a model 300 times, close each one right after, and then call the garbage collector at the end of the process to force the memory held by now-unused pointers to be freed. However, whether with TF 1.15.2 or 2.0.1, we observe something strange: the models are, as weird as it is, only freed some of the time. Run this program several times and watch what top reports for the process's memory.
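The original snippet was not preserved in this thread; the following is a minimal reconstruction of what is described above, assuming the same frozen-graph loading path as the reproduction code (the file name is a placeholder):

package main

import (
	"io/ioutil"
	"log"
	"runtime/debug"
	"time"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
	model, err := ioutil.ReadFile("OptimizedGraph.pb")
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 300; i++ {
		graph := tf.NewGraph()
		if err := graph.Import(model, ""); err != nil {
			log.Fatal(err)
		}
		session, err := tf.NewSession(graph, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Close each session right away; the graph goes out of scope here.
		if err := session.Close(); err != nil {
			log.Fatal(err)
		}
	}
	// Force a GC and ask the runtime to return freed memory to the OS,
	// then sleep so the resident size can be inspected with top.
	debug.FreeOSMemory()
	time.Sleep(10 * time.Minute)
}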
We expect the memory to decrease as soon as the GC is called. On both 1.15.2 and 2.0.1, sometimes it decreases and sometimes it does not, and we think this is linked to stale pointers still held by TF somewhere, preventing the GC from freeing them. Note that in our recent tests this issue seems to happen more often with TF 2.0.1 than with 1.15.2.
In our case and with our model, the program starts at around 100-200 MB, reaches 1.5 GB by the end of the 300 model loads, and then, when the GC is called, the in-use memory shown by top either drops back to about 150 MB or stays at 1.5 GB.
All of this leads us to believe that the issue lies in the C++ code or in the cgo bindings of TensorFlow. We couldn't figure out anything else. An investigation would be appreciated 🚀
Execute the following code and a memory leak will occur. (The attached snippet was not preserved in this thread.)
I have the same issue, but I use tf.LoadSavedModel. I tried assigning the model to nil, deferring session.Close(), and forcing Go garbage collection; none of them works. go pprof and top both show the same result: LoadSavedModel does not release memory.
In my case, I have to pick one of hundreds of model files and load it for inference, and I know something is preventing my session.Close() or model release from working, since I'm sending requests frequently.
Is there anything I can do? I chose Go to try to escape the Python memory-leak nightmare, and now I've put myself into another one. *sigh*
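For reference, a minimal sketch of the load/close pattern described here (the SavedModel directory and the conventional "serve" tag are placeholders, not from the original comment):

package main

import (
	"log"
	"runtime/debug"
	"time"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
	for i := 0; i < 100; i++ {
		model, err := tf.LoadSavedModel("./saved_model", []string{"serve"}, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Closing the session is supposed to release the C-side resources.
		if err := model.Session.Close(); err != nil {
			log.Println(err)
		}
	}
	// Even after a forced GC that returns freed pages to the OS,
	// the RSS reported by top stays high.
	debug.FreeOSMemory()
	time.Sleep(time.Minute)
}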
Hi There,
We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.
This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.
It seems that TensorFlow has no limit on CPU utilization by default; memory usage will increase until OOM if you don't add a limit. Try adding

tf_config = tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)
sess = tf.Session(config=tf_config)

and it works for me. I have also narrowed it down: NewTensor creates a memory leak too. It seems the Go garbage collector does not free the whole unused Tensor.
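In the Go bindings, the equivalent knob is SessionOptions.Config, which takes a serialized ConfigProto. A minimal sketch; the byte literal below is my own encoding of those two settings, and would normally be generated once in Python with tf.ConfigProto(...).SerializeToString() and verified:

package main

import (
	"log"

	tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
	graph := tf.NewGraph()
	// ... import the frozen graph into `graph` as in the reproduction code ...

	// SessionOptions.Config is a serialized ConfigProto. These bytes encode
	// intra_op_parallelism_threads=1 (field 2) and
	// inter_op_parallelism_threads=1 (field 5).
	opts := &tf.SessionOptions{Config: []byte{0x10, 0x01, 0x28, 0x01}}

	session, err := tf.NewSession(graph, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}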