mlx: Unit test Floating Point Exception

Building from source and running on an M1, I am getting a floating point exception from the "Negative slice and ascending bounds" portion of the test_array.TestArray.test_slice_negative_step unit test. All tests pass when the following lines are commented out:

# Negative slice and ascending bounds
b_np = a_np[0:20:-3]
b_mx = a_mx[0:20:-3]
self.assertTrue(np.array_equal(b_np, b_mx))
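
For context, here is a minimal sketch of what the slice in those lines selects. It uses a plain Python list as a stand-in, because the contents of a_np and a_mx are not shown here; the length of 20 is an assumption. NumPy basic slicing follows the same rule in this case: ascending bounds with a negative step give an empty result. Whether that empty result is what reaches Arange::eval_gpu in the trace below is a guess, not something confirmed in this issue.

```python
# Stand-in data: the real test's array contents are not shown in the issue,
# so a list of 20 integers is assumed here.
a = list(range(20))

# Ascending bounds (0 -> 20) with a negative step select nothing.
b = a[0:20:-3]
print(b)  # []

# slice.indices() normalizes the slice for a given length; the resulting
# range is empty, so the expected output has zero elements.
start, stop, step = slice(0, 20, -3).indices(len(a))
n = len(range(start, stop, step))
print((start, stop, step), n)
```

So the assertion compares against a zero-length array, which is a useful edge case to keep in the test rather than comment out.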

Excerpted exception details:

Exception Type:        EXC_ARITHMETIC (SIGFPE)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Termination Reason:    Namespace SIGNAL, Code 8 Floating point exception: 8


Thread 0::  Dispatch queue: com.apple.main-thread
0   ???                           	    0x7ff89ac8e9a8 ???
1   libsystem_kernel.dylib        	    0x7ff80b0430fe __psynch_cvwait + 10
2   libsystem_pthread.dylib       	    0x7ff80b07f758 _pthread_cond_wait + 1242
3   libc++.1.dylib                	    0x7ff80afbc1e2 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
4   libc++.1.dylib                	    0x7ff80afbca51 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 45
5   libc++.1.dylib                	    0x7ff80afbcaba std::__1::__assoc_sub_state::wait() + 46
6   libmlx.dylib                  	       0x10afa5648 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 4520


Thread 11 Crashed:
0   AGXMetal13_3                  	    0x7ffa243eec5b AGX::ComputeService<AGX::G13::Encoders, AGX::G13::Classes, AGX::G13::ObjClasses, AGX::G13::EncoderComputeServiceClasses>::executeKernelWithThreadsPerGridImpl(eAGXDataBufferPools, MTLSize, MTLSize) + 363
1   AGXMetal13_3                  	    0x7ffa243e90d5 -[AGXG13GFamilyComputeContext dispatchThreads:threadsPerThreadgroup:] + 197
2   libmlx.dylib                  	       0x10b7876ae mlx::core::Arange::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1742
3   libmlx.dylib                  	       0x10b7833cf std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 143
4   libmlx.dylib                  	       0x10afa2706 mlx::core::scheduler::StreamThread::thread_fn() + 518
5   libmlx.dylib                  	       0x10afa2953 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 67
6   libsystem_pthread.dylib       	    0x7ff80b07f1d3 _pthread_start + 125
7   libsystem_pthread.dylib       	    0x7ff80b07abd3 thread_start + 15

Thread 11 crashed with X86 Thread State (64-bit):
  rax: 0x000000007fffffff  rbx: 0x0000000000000000  rcx: 0x0000000000000010  rdx: 0x0000000000000000
  rdi: 0x0000000000000000  rsi: 0x0000000000000001  rbp: 0x00000003028c6c60  rsp: 0x00000003028c6bf0
   r8: 0x0000000000000002   r9: 0x0000600003da15f8  r10: 0x00007ffa4dc030d8  r11: 0x00007ff814d09672
  r12: 0x00007fc888350000  r13: 0x0000000000000000  r14: 0x00000003028c6c70  r15: 0x0000000000000000
  rip: <unavailable>       rfl: 0x0000000000000203
 tmp0: 0x0000000000000000 tmp1: 0x000000007fffffff tmp2: 0x0000000000000000

About this issue

  • State: closed
  • Created 6 months ago
  • Comments: 28 (21 by maintainers)

Most upvoted comments

I just restarted my whole system and the cmake command started working without forcing the env variable in the CMakeLists.txt file. I read a lot of stuff online, and my best guess is that a stale cached value left the CMAKE_APPLE_SILICON_PROCESSOR variable blank (my cmake --system-information showed it as blank before restarting), which then cascaded the default value of x86_64 to the dependent variables. I suspect that one of my reinstallation attempts overrode the variable.

Thanks for the time you spent on this, @awni
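
The crash report above shows an "X86 Thread State (64-bit)" on an M1, which fits the x86_64 default described in this comment. A quick way to check which architecture a Python process is running as is sketched below. The helper name describe_process_arch is hypothetical; platform.machine() and the sysctl.proc_translated key are the only system facilities it relies on, and the sysctl lookup is skipped on anything other than macOS.

```python
import platform
import subprocess


def describe_process_arch():
    """Report the CPU architecture this process sees and, on macOS,
    whether it is being translated by Rosetta (None if unknown)."""
    machine = platform.machine()  # e.g. "arm64" or "x86_64"
    rosetta = None
    if platform.system() == "Darwin":
        try:
            out = subprocess.run(
                ["sysctl", "-n", "sysctl.proc_translated"],
                capture_output=True,
                text=True,
                check=True,
            ).stdout.strip()
            # "1" means the process runs under Rosetta translation
            rosetta = out == "1"
        except (OSError, subprocess.CalledProcessError):
            # The key can be missing (e.g. on Intel Macs); leave as unknown
            rosetta = None
    return {"machine": machine, "rosetta": rosetta}


info = describe_process_arch()
print(info)
```

If this reports x86_64 on Apple silicon, the interpreter (and probably the toolchain that built the library) is running as an x86_64 process, which would be worth ruling out before debugging the kernels themselves.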

Debugging just the test metal reduce test case, I reliably hit a divide-by-zero exception (not the floating-point exception I saw in the others) in reduce.cpp because thread_group_size (as well as in_size and mod_in_size) is 0.

This may be a separate issue from the floating-point exception.
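
One note on terminology that may help when comparing the two failures: the "Code 8" in the termination reason is the signal number for SIGFPE, and on x86-64 an integer divide by zero is also delivered as SIGFPE, despite the "floating point" wording. So the two crashes could be the same kind of fault; that is a possibility, not something established in this thread. A small sketch to confirm the signal mapping on macOS or Linux:

```python
import signal

# Signal number 8 is what the crash report calls
# "Floating point exception: 8".
sig = signal.Signals(8)
print(sig.name)  # SIGFPE on macOS and Linux
```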