presto: TPC-DS q04: VeloxRuntimeError: Bad memory usage track state
Only found when running TPC-DS SF-10K:
2023-06-14T20:46:03.577394406Z I0614 20:46:03.577050 8 Announcer.cpp:153] Announcement succeeded: 202
2023-06-14T20:46:08.552160664Z E0614 20:46:08.551824 9 Exceptions.h:68] Line: ../../velox/velox/common/memory/MemoryPool.cpp:420, Function:~MemoryPoolImpl, Expression: (usedReservationBytes_ == 0) && (reservationBytes_ == 0) && (minReservationBytes_ == 0) Bad memory usage track state: Memory Pool[op.39.0.2.TableScan LEAF MMAP track-usage thread-safe]<unlimited capacity used 256B available 1023.75KB reservation [used 256B, reserved 1.00MB, min 0B] counters [allocs 3602, frees 3600, reserves 0, releases 1, collisions 0])>, Source: RUNTIME, ErrorCode: INVALID_STATE
2023-06-14T20:46:08.552219481Z terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
2023-06-14T20:46:08.637251757Z what(): Exception: VeloxRuntimeError
2023-06-14T20:46:08.637293482Z Error Source: RUNTIME
2023-06-14T20:46:08.637310354Z Error Code: INVALID_STATE
2023-06-14T20:46:08.637325993Z Reason: Bad memory usage track state: Memory Pool[op.39.0.2.TableScan LEAF MMAP track-usage thread-safe]<unlimited capacity used 256B available 1023.75KB reservation [used 256B, reserved 1.00MB, min 0B] counters [allocs 3602, frees 3600, reserves 0, releases 1, collisions 0])>
2023-06-14T20:46:08.637342993Z Retriable: False
2023-06-14T20:46:08.637357860Z Expression: (usedReservationBytes_ == 0) && (reservationBytes_ == 0) && (minReservationBytes_ == 0)
2023-06-14T20:46:08.637374096Z Function: ~MemoryPoolImpl
2023-06-14T20:46:08.637388901Z File: ../../velox/velox/common/memory/MemoryPool.cpp
2023-06-14T20:46:08.637404410Z Line: 420
2023-06-14T20:46:08.637419179Z Stack trace:
2023-06-14T20:46:08.637433844Z # 0 _ZN8facebook5velox7process10StackTraceC1Ei
2023-06-14T20:46:08.637448867Z # 1 _ZN8facebook5velox14VeloxExceptionC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2023-06-14T20:46:08.637464126Z # 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
2023-06-14T20:46:08.637479821Z # 3 0x0000000000000000
2023-06-14T20:46:08.637494523Z # 4 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
2023-06-14T20:46:08.637509479Z # 5 _ZN8facebook5velox4exec4TaskD1Ev
2023-06-14T20:46:08.637524316Z # 6 _ZNSt15_Sp_counted_ptrIPN8facebook5velox4exec4TaskELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv
2023-06-14T20:46:08.637556183Z # 7 _ZNSt23_Sp_counted_ptr_inplaceIN8facebook6presto10PrestoTaskESaIS2_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv
2023-06-14T20:46:08.637571225Z # 8 _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
2023-06-14T20:46:08.637585038Z # 9 _ZN8facebook6presto11TaskManager13cleanOldTasksEv
2023-06-14T20:46:08.637598822Z # 10 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv
2023-06-14T20:46:08.637612627Z # 11 _ZN5folly17FunctionScheduler14runOneFunctionERSt11unique_lockISt5mutexENSt6chrono10time_pointINS5_3_V212steady_clockENS5_8durationIlSt5ratioILl1ELl1000000000EEEEEE
2023-06-14T20:46:08.637627262Z # 12 _ZN5folly17FunctionScheduler3runEv
2023-06-14T20:46:08.637640864Z # 13 _ZZN5folly17FunctionScheduler5startEvENKUlvE_clEv
2023-06-14T20:46:08.637655013Z # 14 _ZSt13__invoke_implIvZN5folly17FunctionScheduler5startEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
2023-06-14T20:46:08.637668964Z # 15 _ZSt8__invokeIZN5folly17FunctionScheduler5startEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
2023-06-14T20:46:08.637684170Z # 16 _ZNSt6thread8_InvokerISt5tupleIJZN5folly17FunctionScheduler5startEvEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
2023-06-14T20:46:08.637698661Z # 17 _ZNSt6thread8_InvokerISt5tupleIJZN5folly17FunctionScheduler5startEvEUlvE_EEEclEv
2023-06-14T20:46:08.637712734Z # 18 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly17FunctionScheduler5startEvEUlvE_EEEEE6_M_runEv
2023-06-14T20:46:08.637727006Z # 19 0x0000000000000000
2023-06-14T20:46:08.637740574Z # 20 start_thread
2023-06-14T20:46:08.637754190Z # 21 clone
2023-06-14T20:46:08.637768023Z
2023-06-14T20:46:08.637792910Z *** Aborted at 1686775568 (Unix time, try 'date -d @1686775568') ***
2023-06-14T20:46:08.637807596Z *** Signal 6 (SIGABRT) (0x7) received by PID 7 (pthread TID 0x7b76bb70e700) (linux TID 9) (maybe from PID 7, UID 0) (code: -6), stack trace: ***
2023-06-14T20:46:08.637821910Z (error retrieving stack trace)
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 20 (9 by maintainers)
I looked at the code and the allocation happens when a new row group is enqueued. A new BufferedInput would be created and data would be loaded inside, and the new BufferedInput would be pushed to the
inputs_
map in ParquetReader::ReaderBase. Theinputs_
map and corresponding BufferedInput and their buffers are freed when ParquetReader::ReaderBase runs out of life. So it seems the report in the above stack is false positive. @xiaoxmeng is https://github.com/facebookincubator/velox/pull/5639 supposed to fix this? If yes then @pramodsatya can you rerun this query with this fix and --velox_memory_pool_debug_enabled=true?Nonetheless, there are several places we can improve “Avoid duplicate reading for small parquet files” (https://github.com/facebookincubator/velox/pull/5942) I created https://github.com/facebookincubator/velox/issues/7335 and https://github.com/facebookincubator/velox/issues/7336. We can also make
ReaderBase::scheduleRowGroups()
a bit simpler. I will work on these issues later this week.This is very likely due to an IO thread that performs a split prefetch outliving the task. When the task is destroyed, it reports that the memory allocated by the split prefetch has not been released. This is a false positive (reporting bug) and there is no real memory leak.
Let’s disable prefetch (pass the flag
--split_preload_per_driver=0
to Prestissimo) for now. I will investigate this further. CC: @xiaoxmeng@pramodsatya has the image and he will try to run and reproduce this.