cysimdjson: Memory leak

I am observing a memory leak

Part of the code
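import cysimdjson
import lz4.frame

# Collected value types per key; structure inferred from the usage below
test_data = {'common': {}, 'player': {}, 'vehicle': {}}
metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()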
with lz4.frame.open(filepath) as file:
  for line in file:
      # Fields are separated by the ASCII unit separator (0x1F)
      idx, metadata, gamedata = line.rstrip(b'\n').split(b'\x1f')
      metadata, gamedata = metadata_parser.parse(metadata), gamedata_parser.parse(gamedata)
      for key, value in gamedata.at_pointer('/0/common').items():
          if key not in test_data['common']:
              test_data['common'][key] = []
          value_type = str(type(value))
          if value_type not in test_data['common'][key]:
              test_data['common'][key].append(value_type)
      for _, player in gamedata.at_pointer('/1').items():
          for key, value in player.items():
              if key not in test_data['player']:
                  test_data['player'][key] = []
              value_type = str(type(value))
              if value_type not in test_data['player'][key]:
                  test_data['player'][key].append(value_type)
      for _, vehicles in gamedata.at_pointer('/0/vehicles').items():
          vehicles = [vehicles] if isinstance(vehicles, dict) else vehicles
          for vehicle in vehicles:
              if isinstance(vehicle, str):
                  continue
              for key, value in vehicle.items():
                  if key not in test_data['vehicle']:
                      test_data['vehicle'][key] = []
                  value_type = str(type(value))
                  if value_type not in test_data['vehicle'][key]:
                      test_data['vehicle'][key].append(value_type)

I am analyzing a large data dump (over 100 GB), and memory leaks are preventing the process from completing. The leak appears to be on the C side of the extension, since profiling the Python part didn’t show anything. I followed the first manual and ran valgrind:

valgrind log

Gist

I can provide more information, just tell me what and how ))

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (15 by maintainers)

Most upvoted comments

To explain what’s happening here: object and PyObject * ultimately represent the same thing, but are handled in different ways. Cython manages reference counts automatically for object, but not for PyObject *. Casting to <object> is generally wrong, and needs close attention.

yield <object> string_view_to_python_string(sv) looks like an immediate red flag to me. So let’s expand it:

cdef object temp
cdef PyObject * v = string_view_to_python_string(sv)
# At this point, v has a ref count of 1
temp = <object> v
# At this point, v has a ref count of 2. Wait, what?
yield temp
# At this point, v has a ref count of 1.

Casting v (a PyObject *) to object created a new reference to it and incremented the ref count. When Python is done with the object returned from keys(), it decrements the count by 1 and... nothing happens, because the ref count is still 1, so the object is never freed.
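The effect described above can be reproduced in plain Python. This is a minimal sketch, assuming CPython: Key, make_key and freed_after_use are hypothetical names, and the Py_IncRef call only simulates the extra reference added by the <object> cast. It is not cysimdjson code.

```python
import ctypes
import gc
import weakref

# Py_IncRef is part of the CPython C API; reaching it through ctypes is CPython-specific.
_py_incref = ctypes.pythonapi.Py_IncRef
_py_incref.argtypes = [ctypes.py_object]
_py_incref.restype = None


class Key:
    """Hypothetical stand-in for the string object handed back to Python."""


def make_key(extra_incref):
    key = Key()  # one reference, held by the local name
    if extra_incref:
        # Simulates the unbalanced reference added by the <object> cast.
        _py_incref(key)
    return key


def freed_after_use(extra_incref):
    key = make_key(extra_incref)
    probe = weakref.ref(key)  # a weak reference does not keep the object alive
    del key                   # the caller is done with the object
    gc.collect()
    return probe() is None    # True if the object was deallocated


print(freed_after_use(False))
print(freed_after_use(True))
```

On CPython, the first call should report that the object was freed and the second that it was not: with one unmatched increment, the count never reaches zero, which is the same pattern as one leaked string per key.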

Ultimately, this is just because the signature of string_view_to_python_string is wrong: the function returns an owned object, not a borrowed one.

The current declaration:

cdef PyObject * string_view_to_python_string(string_view & sv)

tells Cython that the function returns a borrowed reference, whereas:

cdef object string_view_to_python_string(string_view & sv)

tells Cython that it returns an “owned” reference, which Cython then releases automatically.
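One way to check which of the two behaviours an iterator exhibits is to compare the reference counts of the objects it yields against a known-good baseline. The following is a sketch under assumptions, CPython only: fake_keys is a hypothetical generator standing in for keys(), and the leak is simulated with Py_IncRef rather than produced by cysimdjson.

```python
import ctypes
import sys

# Py_IncRef is part of the CPython C API; reaching it through ctypes is CPython-specific.
_py_incref = ctypes.pythonapi.Py_IncRef
_py_incref.argtypes = [ctypes.py_object]
_py_incref.restype = None


class Key:
    """Hypothetical stand-in for a freshly created key string."""


def _new_key(leak):
    key = Key()
    if leak:
        _py_incref(key)  # simulates the <object> cast on an already-owned pointer
    return key


def fake_keys(count, leak):
    """Hypothetical generator standing in for keys()."""
    for _ in range(count):
        yield _new_key(leak)


def yielded_refcounts(iterable):
    """Return sys.getrefcount() for each object as the consumer sees it."""
    counts = []
    for item in iterable:
        counts.append(sys.getrefcount(item))
    return counts


clean = yielded_refcounts(fake_keys(3, leak=False))
leaky = yielded_refcounts(fake_keys(3, leak=True))
print(clean, leaky)
```

The absolute numbers depend on the interpreter version, so only the difference between the two lists is meaningful: each leaked object should report one reference more than its clean counterpart. With real key strings, interned or cached strings may report different counts, so such a check is best run on keys that are unique and longer than a single character.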

I will issue a PR unless you get to it first.

The fix to this should be trivial. Try this @lemire:

Index: cysimdjson/cysimdjson.pyx
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/cysimdjson/cysimdjson.pyx b/cysimdjson/cysimdjson.pyx
--- a/cysimdjson/cysimdjson.pyx	(revision 5f544eeb2a9c1cfa0d48a5bb00a742f2a78d3beb)
+++ b/cysimdjson/cysimdjson.pyx	(date 1681156775080)
@@ -1,11 +1,10 @@
 # cython: language_level=3
 
-from libc.stdint cimport int64_t, uint64_t, uint32_t
+from libc.stdint cimport int64_t, uint64_t
 from libcpp cimport bool
 from libcpp.string cimport string
 
 from cpython.bytes cimport PyBytes_AsStringAndSize
-from cpython.ref cimport PyObject
 
 from cython.operator cimport preincrement
 from cython.operator cimport dereference
@@ -121,7 +120,7 @@
 
 	cdef object element_to_py_string(simdjson_element & value) except + simdjson_error_handler
 
-	cdef PyObject * string_view_to_python_string(string_view & sv)
+	cdef object string_view_to_python_string(string_view & sv)
 	cdef string get_active_implementation()
 
 	cdef const char * PyUnicode_AsUTF8AndSize(object, Py_ssize_t *)
@@ -186,11 +185,10 @@
 
 	def keys(JSONObject self):
 		cdef string_view sv
-
 		cdef simdjson_object.iterator it = self.Object.begin()
 		while it != self.Object.end():
 			sv = it.key()
-			yield <object> string_view_to_python_string(sv)
+			yield string_view_to_python_string(sv)
 			preincrement(it)