google-api-python-client: Possible memory leak when using Google Sheets API
There seems to be a memory leak when using google-api-python-client with the Google Sheets API.
Environment:
```
$ python --version
Python 3.6.4
$ pip show google-api-python-client
Name: google-api-python-client
Version: 1.7.3
```
Here’s a simple reproducer (without a .client_secret.json):
```python
#!/usr/bin/env python3
import httplib2
import os

from apiclient import discovery
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"


def get_credentials():
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None

    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)

    credential_path = os.path.join(credential_dir,
                                   "sheets.googleapis.com-clientdebug.json")
    store = Storage(credential_path)
    credentials = store.get()

    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)

    return credentials


@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"

    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                              discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    print("Got {} rows".format(len(values)))


if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)
```
For measurements I used the memory_profiler module, with the following results.
First and second iterations:

```
Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  26.5195 MiB  26.5195 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  26.5195 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39
    40  26.5195 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  26.5195 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  29.2891 MiB   2.7695 MiB                                 discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  49.5742 MiB  20.2852 MiB       result = service.spreadsheets().values().get(
    44  49.5742 MiB   0.0000 MiB           spreadsheetId=sheet_id, range="A1:O").execute()
    45  49.5742 MiB   0.0000 MiB       values = result.get("values", [])
    46
    47  49.5742 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))


Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  49.5742 MiB  49.5742 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  49.5742 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39
    40  49.5742 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  49.5742 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  49.5742 MiB   0.0000 MiB                                 discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  67.9922 MiB  18.4180 MiB       result = service.spreadsheets().values().get(
    44  67.9922 MiB   0.0000 MiB           spreadsheetId=sheet_id, range="A1:O").execute()
    45  67.9922 MiB   0.0000 MiB       values = result.get("values", [])
    46
    47  67.9922 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))
```
Last iteration:

```
Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  229.6055 MiB  229.6055 MiB   @profile(precision=4)
    36                               def get_responses(creds):
    37                                   """Fetch spreadsheet data."""
    38  229.6055 MiB    0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39
    40  229.6055 MiB    0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  229.6055 MiB    0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  229.6055 MiB    0.0000 MiB                                 discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  229.6055 MiB    0.0000 MiB       result = service.spreadsheets().values().get(
    44  229.6055 MiB    0.0000 MiB           spreadsheetId=sheet_id, range="A1:O").execute()
    45  229.6055 MiB    0.0000 MiB       values = result.get("values", [])
    46
    47  229.6055 MiB    0.0000 MiB       print("Got {} rows".format(len(values)))
```
There’s clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.
As a temporary workaround for one of my long-running applications I use an explicit garbage collector call, which mitigates this issue, at least for now:
```python
...
import gc
...
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    gc.collect()
...
```
I went a little deeper, and the main culprit seems to be in the createMethod function, when creating the dynamic method batchUpdate:
```
Method 'batchUpdate', approx. __doc__ size: 2886834
<class 'function'>
Filename: /home/fsumsal/venv/googleapiclient/lib64/python3.6/site-packages/googleapiclient/discovery.py

Line #    Mem usage    Increment   Line Contents
================================================
  1064   48.7 MiB   48.7 MiB   @profile
  1065                         def _add_basic_methods(self, resourceDesc, rootDesc, schema):
  1066                             # If this is the root Resource, add a new_batch_http_request() method.
  1067   48.7 MiB    0.0 MiB       if resourceDesc == rootDesc:
  ...
  1086
  1087                             # Add basic methods to Resource
  1088   48.7 MiB    0.0 MiB       if 'methods' in resourceDesc:
  1089   66.8 MiB    0.0 MiB           for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
  1090   56.0 MiB    0.0 MiB               fixedMethodName, method = createMethod(
  1091   66.8 MiB   18.1 MiB                   methodName, methodDesc, rootDesc, schema)
  1092   66.8 MiB    0.0 MiB               print(type(method))
  1093   66.8 MiB    0.0 MiB               self._set_dynamic_attr(fixedMethodName,
  1094   66.8 MiB    0.0 MiB                                      method.__get__(self, self.__class__))
  1095                                 # Add in _media methods. The functionality of the attached method will
  1096                                 # change when it sees that the method name ends in _media.
  1097   66.8 MiB    0.0 MiB           if methodDesc.get('supportsMediaDownload', False):
  1098                                     fixedMethodName, method = createMethod(
  1099                                         methodName + '_media', methodDesc, rootDesc, schema)
  1100                                     self._set_dynamic_attr(fixedMethodName,
  1101                                                            method.__get__(self, self.__class__))
```
(This method has a huge docstring.)
Nevertheless, there is probably a reference loop somewhere, as the gc.collect() call manages to collect all those unreachable objects.
I have a cron job on Google App Engine that reads data in from a Google Sheet. I am noticing the same memory leak (or maybe a different one?). I tried the recommended workarounds: (1) creating the “sheets” object only once, and (2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a Google Sheet to read data from a database table instead, and the memory leak went away.
Circling back to this: the recommendation from https://github.com/googleapis/google-api-python-client/issues/535#issuecomment-404994715 is the best way to avoid this.
> Creating multiple service objects can lead to (1) potential memory problems and (2) extra time spent refreshing credentials. If you’re creating a service object inside a loop, or in a function that’s called more than once, move it outside the loop/function so it can be reused.
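A minimal sketch of that advice applied to the reproducer above (it reuses get_credentials(), DISCOVERY_URL, and the sheet ID from the reproducer; the service and its sub-resources are built once, and only the requests repeat):

```python
#!/usr/bin/env python3
import httplib2
from apiclient import discovery
from time import sleep

# Build the service object once, outside the loop; get_credentials() and
# DISCOVERY_URL are taken from the reproducer above.
creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
                          discoveryServiceUrl=DISCOVERY_URL,
                          cache_discovery=False)
# Reuse the sub-resources too: every spreadsheets()/values() call creates a
# fresh Resource object, which is exactly what piles up in the leaky version.
values_res = service.spreadsheets().values()

for i in range(50):
    result = values_res.get(
        spreadsheetId="1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c",
        range="A1:O").execute()
    print("Got {} rows".format(len(result.get("values", []))))
    sleep(2)
```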
My solution in the end was just to use “pure” HTTP requests.
I never fixed it… in the short term, I used a high-memory App Engine instance so it would take longer to hit the memory threshold, and then, as a long-term solution, I switched to Airtable instead of Google Sheets.
@mcdonc @theacodes If I get a vote, I’d like to see the second option: adding a .close() method. I’ve spent the past week or so tracking down this same memory leak and found my way to this page. As it happens, I specifically looked for close() methods on the Resource objects, because I knew something somewhere wasn’t being released.
Adding a .close() method seems cleaner than having to call gc.collect(). Either way I have to do something to clean up resources, and calling .close() is analogous to what we already do for files and other things.
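A hedged sketch of what that proposed usage could look like; Resource.close() here is hypothetical (it did not exist in the library at the time of this thread), and http/sheet_id are assumed to be set up as in the reproducer:

```python
# Hypothetical: assumes a Resource.close() that recursively clears the
# dynamically bound methods of a resource and its sub-resources,
# breaking the reference cycles.
service = discovery.build("sheets", "v4", http=http, cache_discovery=False)
try:
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range="A1:O").execute()
finally:
    service.close()  # hypothetical cleanup call, analogous to file.close()
```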
In any case, this issue should be mentioned in the documentation and sample code, please!
@mrc0mmand yes, in your particular case, creating “sheets” only once would make it leak so little that you won’t need gc.collect().
Now that we’ve figured out why we perceive there is a leak: (a) is it a problem? and (b) if so, what can we do about it?
Is it a problem?
Yes and no.
Theoretically, no, it’s not a problem. Reference cycles in programs are normal, and Python’s garbage collector will eventually trash the objects involved in cycles. Every time a Resource object is created (by calling methods on other Resource objects), we get some cycles between that Resource and its dynamic methods, and, in theory, this is fine.
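For illustration, a minimal self-contained example of the general mechanism (names here are illustrative, not the library’s): an object caught in a reference cycle is never freed by reference counting alone, but the cyclic garbage collector reclaims it.

```python
import gc

class Resource:
    """Stand-in for a discovery Resource; any class would do."""

def make_cycle():
    r = Resource()
    # The instance's __dict__ now refers back to the instance itself, so its
    # refcount never drops to zero, even once 'r' goes out of scope.
    r.self_ref = r

gc.disable()         # rule out an implicit collection mid-demo
make_cycle()         # the Resource is now unreachable, but still alive
print(gc.collect())  # the cycle detector finds and frees it (prints >= 1)
gc.enable()
```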
Practically, yes, it’s a problem. Repeated Resource creation causes the process RSS to bloat, and, on Linux at least, the memory consumed by these references is not given back to the OS due to memory fragmentation, even after the cycles are broken.
What can we do about it?
I’ve put in some work on a branch (https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535) trying to make the symptoms slightly better.
Try #1 on that branch, which is now a few commits back and isn’t represented by the current state of the branch, was caching the functions that become methods on Resource objects, creating only one function per input instead of one per call. This is not a reasonable fix, however, because refs involved in cycles still grow; every time a Resource is instantiated, it binds itself to some number of methods, and even if the functions representing these methods are not repeatedly created, the act of binding cached methods to each instance still creates cycles.
Try #2, which represents the current state of the branch, dynamically creates and caches one Resource class per set of inputs, instead of just caching the result of dynamic method creation. This stops binding dynamic methods to instances via the descriptor protocol, so the only circrefs are those you’d get if each resource type had its own class in sys.modules['googleapiclient.discovery']. The number of circrefs is dramatically reduced, and RSS growth is bounded after the first call of the replication script (unlike master, where it grows without bound on each call, although every so often gc kicks in and brings it down a little). According to gc.set_debug(gc.DEBUG_LEAK) under Python 3.6, the length of gc.garbage is 2214 after 40 iterations of the reproducer script’s for-loop, instead of master’s gargantuan 45218. And I believe we could bring that down more by fixing an unrelated leak. However, the resulting instances cannot be pickled, which is, I believe, part of their API.
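A rough sketch of that class-caching idea, with my own illustrative names (not the branch’s actual code): one class is generated per distinct set of method descriptions, so methods live on a shared class, are bound per attribute access, and nothing cyclic is stored in instance dicts.

```python
# Illustrative only: the real branch derives classes from discovery documents.
_class_cache = {}

def _make_method(name):
    def method(self, **kwargs):
        # Stand-in for the real request-building logic in createMethod().
        return (name, kwargs)
    method.__name__ = name
    return method

def resource_class_for(method_names):
    """Return a cached, dynamically generated class for this set of methods."""
    key = tuple(sorted(method_names))
    if key not in _class_cache:
        namespace = {name: _make_method(name) for name in key}
        _class_cache[key] = type("DynamicResource", (object,), namespace)
    return _class_cache[key]

# Methods are looked up on the class via the normal descriptor protocol at
# call time, so no bound methods linger in the instance's __dict__.
Sheets = resource_class_for(["get", "batchUpdate"])
print(Sheets().get(spreadsheetId="abc"))
```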
So I think we have these options:

1. Cause pickling to no longer be part of the API and use the code on the https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535 branch. If pickling were absolutely necessary, we could create (and instruct people to use) a custom unpickler that regenerates the dynamic classes for us, by subclassing Unpickler à la https://docs.python.org/3/library/pickle.html#pickle.Unpickler (see the sketch after this list).
2. Cause Resource instances to refer to any subresource they create, and expose a close() method on resources that clears their dynamic methods and the dynamic methods of any subresource recursively, which would break the refcycle. However, this method would need to be explicitly called by library users; there isn’t a natural place for us to call it.
3. Do nothing. We punt and tell people to (a) create as few resources as possible (don’t do more work than is necessary) and (b) call gc.collect() after sections of their code that have a side effect of creating lots of resources.
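For the first option, a minimal sketch of what such a custom unpickler could look like; the module/name check and the class-factory hook below are assumptions for illustration, not the branch’s code.

```python
import io
import pickle

# Hypothetical factory standing in for the branch's dynamic-class generation.
_dynamic_classes = {}

def _regenerate_dynamic_class(name):
    return _dynamic_classes.setdefault(name, type(name, (object,), {}))

class DynamicResourceUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Intercept lookups for the dynamically generated classes and rebuild
        # them on demand instead of importing them from the module.
        if module == "googleapiclient.discovery" and name == "DynamicResource":
            return _regenerate_dynamic_class(name)
        return super().find_class(module, name)

def loads(data):
    """Drop-in replacement for pickle.loads() that knows about dynamic classes."""
    return DynamicResourceUnpickler(io.BytesIO(data)).load()
```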
This is the “leak”, a very simple circref. It cannot be avoided without a redesign.
It is caused by implementing “dynamic methods” using the descriptor protocol, à la https://github.com/google/google-api-python-client/blob/master/googleapiclient/discovery.py#L1088
If there is a way around this circref, I don’t know it.
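Concretely, here is the shape of the cycle in miniature, mirroring the method.__get__(self, self.__class__) call in discovery.py (class and function names are illustrative):

```python
class Resource:
    """Stand-in for googleapiclient's Resource."""

def dynamic_method(self):
    return "dynamic"

r = Resource()
# Bind the function to the instance via the descriptor protocol, then store
# the bound method on the instance -- the same pattern discovery.py uses.
r.dynamic_method = dynamic_method.__get__(r, Resource)

# The cycle: r.__dict__ holds the bound method, and the bound method
# holds r through its __self__ attribute.
assert r.dynamic_method.__self__ is r
```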
Running debugging commentary for future reference:
Not a function of the docstring being large; if I change createMethod to not set the docstring onto the dynamically generated method, it still leaks.
Causing the generated method to be a noop doesn’t fix the leak.
Only creating “sheets” once prevents the leak too (if it’s created more than once, the leak remains).
The objgraph library fails to show growing references after the first iteration. A further frustration: importing objgraph causes the leak to go away, creating a Heisenbug. This turns out to be because objgraph imports graphviz, and something in graphviz… does… something. But even if graphviz doesn’t get imported by objgraph, calling show_growth() also causes a Heisenbug; calling show_growth() prevents the leak.
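For reference, the objgraph check attempted above looks roughly like this (with get_responses() and creds from the reproducer; keep in mind that, as noted, merely importing objgraph hid the leak here):

```python
import objgraph

# Print the object types whose instance counts grew since the previous call;
# a type that keeps growing between iterations usually points at the leak.
for i in range(5):
    get_responses(creds)
    objgraph.show_growth(limit=10)
```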
If the instance has been purged, calling e.g. service.spreadsheets().values() raises the ValueError('Cannot call function; instance was purged') error.
I’ve been able to replicate this with a variant of your script, thanks!
I think it may be possible to cache the result of createMethod so that we don’t create a new closure for every call, which would make the memory leak less dramatic (probably not noticeable) and speed things up. But before I do that, I’ll try to identify the actual cycle.
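A minimal sketch of that caching idea, under my own naming: method descriptions are dicts (unhashable), so the cache keys on a canonical JSON serialization of them.

```python
import json
from functools import lru_cache

def create_method(method_name, method_desc):
    # Stand-in for discovery.createMethod(): builds a fresh closure per call.
    def method(self, **kwargs):
        return (method_name, kwargs)
    method.__doc__ = method_desc.get("description", "")
    return method

@lru_cache(maxsize=None)
def _create_method_cached(method_name, method_desc_json):
    return create_method(method_name, json.loads(method_desc_json))

def get_method(method_name, method_desc):
    """Memoized front end: identical inputs return the same function object."""
    return _create_method_cached(method_name,
                                 json.dumps(method_desc, sort_keys=True))

# Same description twice -> one closure, not two.
desc = {"description": "Updates the spreadsheet."}
assert get_method("batchUpdate", desc) is get_method("batchUpdate", desc)
```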