PyRFC: Returning large table is slow and consumes lots of memory

We’re calling an RFC that returns a table with 500,000 rows of 120 fields (a KE24 report). This consumes >8 GB of RAM on the server and fails as a result.

In an attempt to get around this I’ve implemented some callbacks so that rows can be processed directly to file. Whilst this does seem to work, it still takes ~12 minutes to extract the 500,000 rows, which isn’t particularly fast. The batch_size in the code below doesn’t seem to have much effect: set to batches of 10,000, each batch takes ~8 s to build and ~3 s to write to a CSV file.

My new code is below, to potentially spark some ideas of how to get a large table of data out quickly. The bottleneck seems to be extracting the data from the result set. Perhaps calling wrapStructure to process the fields of every row of the table is not the most efficient approach?

Any ideas how to speed this up and keep the amount of used memory to a minimum would be greatly appreciated.

Thanks, Pat

    def call_and_save_table(self, func_name, table_name, batch_size, fnOnStartTable, fnOnEndTable, fnOnGetTableRow, **params):
        """ Invokes a remote-enabled function module via RFC.

        :param func_name (str): Name of the function module that will be invoked.
        :param table_name (str): Name of the table to save
        :param batch_size (int): Number of records to pass into fnOnGetTableRow
        :param fnOnStartTable (func): Callback for starting the table
        :param fnOnEndTable (func): Callback for completing the table
        :param fnOnGetTableRow (func): Callback for processing rows
        :param params (kwargs): Parameters of the function module. All non-optional
              IMPORT, CHANGING, and TABLE parameters must be provided.

        :return: Boolean

        :raises: :exc:`~pyrfc.RFCError` or a subclass
                 thereof if the RFC call fails.
        """
        cdef RFC_RC rc
        cdef RFC_ERROR_INFO errorInfo
        cdef unsigned paramCount

        cdef RFC_TABLE_HANDLE table_handle
        cdef RFC_TYPE_DESC_HANDLE table_type_desc
        cdef RFC_PARAMETER_DESC paramDesc
        cdef RFC_FIELD_DESC fieldDesc
        cdef unsigned i, rowCount, fieldCount

        funcName = fillString(func_name)
        if not self.alive:
            self._open()
        cdef RFC_FUNCTION_DESC_HANDLE funcDesc = RfcGetFunctionDesc(self._handle, funcName, &errorInfo)
        free(funcName)
        if not funcDesc:
            self._error(&errorInfo)
        cdef RFC_FUNCTION_HANDLE funcCont = RfcCreateFunction(funcDesc, &errorInfo)
        if not funcCont:
            self._error(&errorInfo)
        try: # now we have a function module
            for name, value in params.iteritems():
                fillFunctionParameter(funcDesc, funcCont, name, value)
            with nogil:
                rc = RfcInvoke(self._handle, funcCont, &errorInfo)
            if rc != RFC_OK:
                self._error(&errorInfo)

            tableName = fillString(table_name)
            try:
                rc = RfcGetParameterDescByName(funcDesc, tableName, &paramDesc, &errorInfo)
                if rc != RFC_OK:
                    self._error(&errorInfo)
                table_type_desc = paramDesc.typeDescHandle
                rc = RfcGetTable(funcCont, tableName, &table_handle, &errorInfo)
                if rc != RFC_OK:
                    self._error(&errorInfo)
                writeTable(table_type_desc, table_handle, self._bconfig, batch_size, fnOnStartTable, fnOnEndTable, fnOnGetTableRow)
                return True
            finally:
                free(tableName)
        finally:
            RfcDestroyFunction(funcCont, NULL)

cdef writeTable(RFC_TYPE_DESC_HANDLE typeDesc, RFC_TABLE_HANDLE table_handle, config, batch_size, fnOnStartTable, fnOnEndTable, fnOnGetTableRow):
    cdef RFC_RC rc
    cdef RFC_ERROR_INFO errorInfo
    cdef unsigned i, fieldCount, rowCount
    # # For debugging in tables (cf. class TableCursor)
    # tc = TableCursor()
    # tc.typeDesc = typeDesc
    # tc.container = table_handle
    # return tc

    ## get the field names once, sorted so they can be passed to a DictWriter
    ## with a predictable column order
    headers_array = []
    cdef RFC_FIELD_DESC fieldDesc
    rc = RfcGetFieldCount(typeDesc, &fieldCount, &errorInfo)
    if rc != RFC_OK:
        raise wrapError(&errorInfo)

    for i in range(fieldCount):
        rc = RfcGetFieldDescByIndex(typeDesc, i, &fieldDesc, &errorInfo)
        if rc != RFC_OK:
            raise wrapError(&errorInfo)
        headers_array.append(wrapString(fieldDesc.name))

    headers_array.sort()

    rc = RfcGetRowCount(table_handle, &rowCount, &errorInfo)
    if rc != RFC_OK:
        raise wrapError(&errorInfo)

    fnOnStartTable(headers_array, rowCount)

    batch = []
    for i in xrange(rowCount):
        rc = RfcMoveTo(table_handle, i, &errorInfo)
        if rc != RFC_OK:
            raise wrapError(&errorInfo)
        batch.append(wrapStructure(typeDesc, table_handle, config))
        if ((i+1) % batch_size == 0) or (i == rowCount-1):
            fnOnGetTableRow(batch)
            batch = []

    fnOnEndTable()

    return []
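For reference, here is a sketch of how the three callbacks might be implemented to stream rows straight to CSV as they arrive. This is an illustration only: `CsvSink` is a hypothetical helper, and `conn` is assumed to be a Connection exposing the `call_and_save_table` method above.

```python
import csv

class CsvSink:
    """Stream batches of row dicts to a CSV file as they arrive."""

    def __init__(self, path):
        self.path = path
        self.file = None
        self.writer = None

    def on_start_table(self, headers, row_count):
        # Called once with the sorted field names and the total row count
        self.file = open(self.path, "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=headers)
        self.writer.writeheader()

    def on_get_table_row(self, batch):
        # Called once per batch of up to batch_size row dicts
        self.writer.writerows(batch)

    def on_end_table(self):
        self.file.close()

# Hypothetical usage against a Connection patched with call_and_save_table:
# sink = CsvSink("ke24.csv")
# conn.call_and_save_table("Z_GET_KE24", "ET_DATA", 10000,
#                          sink.on_start_table, sink.on_end_table,
#                          sink.on_get_table_row)
```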

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 2
  • Comments: 15 (10 by maintainers)

Most upvoted comments

I ran the RFC daily to retrieve a full period (500,000 rows). The timing remained just under 12 minutes.

Is it possible for pyRFC to retrieve the field list once at the start of a table and pass that info along to process each row, rather than retrieving it for every row in the table? That might speed up processing a little.
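The suggestion above can be sketched in plain Python; nothing here is actual pyrfc API. The idea is to resolve per-field metadata once per table, build a converter list from it, and reuse that list for every row instead of re-deriving field information inside the row loop.

```python
def make_row_wrapper(field_descs):
    """Build a row-wrapping function from metadata resolved once per table.

    field_descs: list of (field_name, converter) pairs, assumed to have been
    derived from the table's field descriptors a single time, before the loop.
    """
    def wrap_row(raw_row):
        return {name: conv(raw_row[name]) for name, conv in field_descs}
    return wrap_row

# Illustrative field list (names and converters are made up):
descs = [("KUNNR", str.strip), ("NETWR", float)]
wrap = make_row_wrapper(descs)
rows = [{"KUNNR": "0001 ", "NETWR": "12.50"}, {"KUNNR": "0002 ", "NETWR": "7.25"}]
wrapped = [wrap(r) for r in rows]
```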

As for the memory consumption, I’m not sure why pyRFC uses so much to build the results. Consuming >8 GB for 500,000 records seems far too much; the same data is 305 MB as a CSV file.

Just for reference, the memory issue does not occur when I use the change detailed in my first post to stream the values directly to a file. The memory issue would therefore seem to be related to building the result dicts in Python, which my method avoids. Thanks for the continued investigation!

I changed the title to “Chunking big data”, to sync with related stackoverflow question: https://stackoverflow.com/questions/49691298/sap-rfc-chunking-of-big-data/51862805#51862805

I will also check a possible optimisation: deleting each table row from the C heap (SAP NW RFC SDK) after “wrapping” it in Python. If it works, this would reduce memory consumption for tables by not keeping full C and Python copies of the table in parallel.
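The wrap-then-delete idea can be sketched in plain Python. The deque below stands in for the C-heap table, and popleft() plays the role of a per-row delete in the SDK; this is an assumption for illustration, not the actual SDK call sequence.

```python
from collections import deque

def drain_in_batches(source_rows, batch_size, sink):
    """Wrap each source row, then release it, so that full C-side and
    Python-side copies of the table never coexist.

    'source_rows' stands in for the C-heap table; popleft() plays the
    role of a per-row delete in the SDK.
    """
    source = deque(source_rows)
    batch = []
    while source:
        batch.append(dict(source.popleft()))  # wrap row, then drop the source copy
        if len(batch) == batch_size:
            sink(batch)
            batch = []
    if batch:
        sink(batch)

out = []
drain_in_batches([{"A": 1}, {"A": 2}, {"A": 3}], 2, out.extend)
```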