mlx: Unit test Floating Point Exception
Building from source and running on an M1, I am getting a floating point exception from the "Negative slice and ascending bounds" portion of the test_array.TestArray.test_slice_negative_step unit test. All tests pass when the following lines are commented out:
# Negative slice and ascending bounds
b_np = a_np[0:20:-3]
b_mx = a_mx[0:20:-3]
self.assertTrue(np.array_equal(b_np, b_mx))
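For reference, Python's slicing rules make a negative step with ascending bounds select nothing, so both b_np and b_mx are expected to be empty. The sketch below shows this with the standard library only; slice_length is a hypothetical helper name, not an MLX or NumPy function. Whether the empty result is what triggers the crash in the kernel dispatch is an assumption, not something the report confirms.

```python
def slice_length(start, stop, step, size):
    """Count the elements selected by seq[start:stop:step] for a sequence of length `size`."""
    # slice.indices clamps the bounds the same way built-in sequence slicing does.
    lo, hi, st = slice(start, stop, step).indices(size)
    return len(range(lo, hi, st))


data = list(range(30))

# Negative step with ascending bounds: nothing is selected.
empty = data[0:20:-3]
print(empty, slice_length(0, 20, -3, len(data)))

# Negative step with descending bounds: the usual reversed selection.
reversed_part = data[20:0:-3]
print(reversed_part, slice_length(20, 0, -3, len(data)))
```

Any code that sizes a compute dispatch from the slice length would see 0 in the first case, so that path may deserve an early return.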
Excerpted exception details:
Exception Type: EXC_ARITHMETIC (SIGFPE)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Termination Reason: Namespace SIGNAL, Code 8 Floating point exception: 8
Thread 0:: Dispatch queue: com.apple.main-thread
0 ??? 0x7ff89ac8e9a8 ???
1 libsystem_kernel.dylib 0x7ff80b0430fe __psynch_cvwait + 10
2 libsystem_pthread.dylib 0x7ff80b07f758 _pthread_cond_wait + 1242
3 libc++.1.dylib 0x7ff80afbc1e2 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
4 libc++.1.dylib 0x7ff80afbca51 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 45
5 libc++.1.dylib 0x7ff80afbcaba std::__1::__assoc_sub_state::wait() + 46
6 libmlx.dylib 0x10afa5648 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 4520
Thread 11 Crashed:
0 AGXMetal13_3 0x7ffa243eec5b AGX::ComputeService<AGX::G13::Encoders, AGX::G13::Classes, AGX::G13::ObjClasses, AGX::G13::EncoderComputeServiceClasses>::executeKernelWithThreadsPerGridImpl(eAGXDataBufferPools, MTLSize, MTLSize) + 363
1 AGXMetal13_3 0x7ffa243e90d5 -[AGXG13GFamilyComputeContext dispatchThreads:threadsPerThreadgroup:] + 197
2 libmlx.dylib 0x10b7876ae mlx::core::Arange::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1742
3 libmlx.dylib 0x10b7833cf std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 143
4 libmlx.dylib 0x10afa2706 mlx::core::scheduler::StreamThread::thread_fn() + 518
5 libmlx.dylib 0x10afa2953 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 67
6 libsystem_pthread.dylib 0x7ff80b07f1d3 _pthread_start + 125
7 libsystem_pthread.dylib 0x7ff80b07abd3 thread_start + 15
Thread 11 crashed with X86 Thread State (64-bit):
rax: 0x000000007fffffff rbx: 0x0000000000000000 rcx: 0x0000000000000010 rdx: 0x0000000000000000
rdi: 0x0000000000000000 rsi: 0x0000000000000001 rbp: 0x00000003028c6c60 rsp: 0x00000003028c6bf0
r8: 0x0000000000000002 r9: 0x0000600003da15f8 r10: 0x00007ffa4dc030d8 r11: 0x00007ff814d09672
r12: 0x00007fc888350000 r13: 0x0000000000000000 r14: 0x00000003028c6c70 r15: 0x0000000000000000
rip: <unavailable> rfl: 0x0000000000000203
tmp0: 0x0000000000000000 tmp1: 0x000000007fffffff tmp2: 0x0000000000000000
About this issue
- State: closed
- Created 6 months ago
- Comments: 28 (21 by maintainers)
I just restarted my whole system and the cmake command started working without forcing the env variable in the CMakeLists.txt file. I read a lot of material online; my best guess is that there was a stale cached value for the CMAKE_APPLE_SILICON_PROCESSOR variable that left it blank (my cmake --system-information showed it as blank before restarting), which then cascaded the default value of x86_64 to the dependent variables. I suspect that one of my reinstallation attempts overrode the variable.
Thanks for the time spent here @awni
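A quick way to check which architecture a Python process is actually running as, which may help spot an accidental x86_64 build or a Rosetta session on Apple silicon. This is only a sketch: platform.machine() is from the standard library, while is_native_arm64 is a hypothetical helper name introduced here for illustration.

```python
import platform


def is_native_arm64(machine=None):
    """Return True if the given (or current) machine type is an arm64 variant."""
    # platform.machine() reports the architecture the interpreter is running as.
    machine = platform.machine() if machine is None else machine
    return machine.lower() in ("arm64", "aarch64")


print(platform.machine(), is_native_arm64())
```

If this reports x86_64 on an M1, the toolchain or terminal is probably running translated, and extensions built from that environment would likely target x86_64 as well.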
Debugging just the test metal reduce test case, it is reliably hitting a divide-by-zero exception (not floating-point as I saw in the others) in reduce.cpp due to thread_group_size (as well as in_size and mod_in_size) being 0.
This may be a separate issue from the floating-point exception.
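A minimal sketch of the kind of guard that avoids such a division, assuming the threadgroup count is obtained by ceiling division. The function num_threadgroups and its parameter names are illustrative only and do not reflect the actual code in reduce.cpp.

```python
def num_threadgroups(in_size, thread_group_size):
    """Ceiling-divide in_size by thread_group_size, rejecting a zero group size."""
    # A zero (or negative) group size would otherwise cause a division by zero.
    if thread_group_size <= 0:
        raise ValueError(f"invalid thread_group_size: {thread_group_size}")
    # An empty input needs no threadgroups at all.
    if in_size <= 0:
        return 0
    return (in_size + thread_group_size - 1) // thread_group_size


print(num_threadgroups(10, 4))
print(num_threadgroups(0, 4))
```

Note that on x86-64 an integer division by zero is delivered as SIGFPE, which the OS labels "Floating point exception", so the two crashes may be more closely related than their names suggest.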