prisma: Segmentation fault on ARM64 Linux
Bug description
I’m using Prisma on ARM64 Linux with OpenSSL 3 and get a Segmentation fault (core dumped) error.
% npx prisma version
prisma : 4.10.1
@prisma/client : 4.10.1
Current platform : linux-arm64-openssl-3.0.x
Query Engine (Node-API) : libquery-engine aead147aa326ccb985dcfed5b065b4fdabd44b19 (at node_modules/@prisma/engines/libquery_engine-linux-arm64-openssl-3.0.x.so.node)
Migration Engine : migration-engine-cli aead147aa326ccb985dcfed5b065b4fdabd44b19 (at node_modules/@prisma/engines/migration-engine-linux-arm64-openssl-3.0.x)
Format Wasm : @prisma/prisma-fmt-wasm 4.10.1-1.80b351cc7c06d352abe81be19b8a89e9c6b7c110
Default Engines Hash : aead147aa326ccb985dcfed5b065b4fdabd44b19
Studio : 0.481.0
Preview Features : orderByNulls, filteredRelationCount
% uname -a
Linux ubuntu 5.19.0-1019-oracle #22-Ubuntu SMP Thu Mar 9 02:32:24 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
% lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 22.10
Release: 22.10
Codename: kinetic
I can’t create a minimal reproduction; I ran into the problem while running the https://github.com/civitai/civitai project. Here’s the full log for reference:
npm run dev
> model-share@0.1.0 dev
> next dev
ready - started server on 0.0.0.0:3000, url: http://localhost:3000
info - Loaded env from /home/ubuntu/Projects/GitHub/Civitai/civitai/.env
warn - You have enabled experimental features (largePageDataBytes, modularizeImports) in next.config.mjs.
warn - Experimental features are not covered by semver, and may cause unexpected or broken application behavior. Use at your own risk.
event - compiled client and server successfully in 7.8s (4341 modules)
wait - compiling / (client and server)...
event - compiled client and server successfully in 2.9s (4582 modules)
wait - compiling /src/middleware (client and server)...
event - compiled successfully in 72 ms (41 modules)
wait - compiling /api/auth/[...nextauth] (client and server)...
event - compiled successfully in 306 ms (562 modules)
prisma:query SELECT "public"."Tag"."id", "public"."Tag"."name", "public"."Tag"."isCategory" FROM "public"."Tag" LEFT JOIN "public"."TagRank" AS "orderby_1_TagRank" ON ("public"."Tag"."id" = "orderby_1_TagRank"."tagId") WHERE ("public"."Tag"."target" && $1 AND "public"."Tag"."id" NOT IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71) AND ("public"."Tag"."id") NOT IN (SELECT "t0"."id" FROM "public"."Tag" AS "t0" INNER JOIN "public"."TagsOnTags" AS "j0" ON ("j0"."toTagId") = ("t0"."id") WHERE ("j0"."fromTagId" IN ($72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141) AND "t0"."id" IS NOT NULL)) AND "public"."Tag"."unlisted" = $142) ORDER BY "orderby_1_TagRank"."modelCountAllTimeRank" ASC LIMIT $143 OFFSET $144
prisma:query SELECT COUNT(*) FROM (SELECT "public"."Tag"."id" FROM "public"."Tag" WHERE ("public"."Tag"."target" && $1 AND "public"."Tag"."id" NOT IN ($2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71) AND ("public"."Tag"."id") NOT IN (SELECT "t0"."id" FROM "public"."Tag" AS "t0" INNER JOIN "public"."TagsOnTags" AS "j0" ON ("j0"."toTagId") = ("t0"."id") WHERE ("j0"."fromTagId" IN ($72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141) AND "t0"."id" IS NOT NULL)) AND "public"."Tag"."unlisted" = $142) OFFSET $143) AS "sub"
wait - compiling /api/trpc/[trpc] (client and server)...
event - compiled successfully in 822 ms (562 modules)
prisma:query SELECT 1
prisma:query SELECT "public"."Announcement"."id", "public"."Announcement"."title", "public"."Announcement"."content", "public"."Announcement"."color", "public"."Announcement"."emoji" FROM "public"."Announcement" WHERE ("public"."Announcement"."id" NOT IN ($1) AND ("public"."Announcement"."startsAt" <= $2 OR "public"."Announcement"."startsAt" IS NULL) AND ("public"."Announcement"."endsAt" >= $3 OR "public"."Announcement"."endsAt" IS NULL)) ORDER BY "public"."Announcement"."id" DESC LIMIT $4 OFFSET $5
Segmentation fault (core dumped)
Version 4.12.0-integration-rtld-deepbind.3 does not fix the issue either.
% npx prisma version
prisma : 4.12.0-integration-rtld-deepbind.3
@prisma/client : 4.12.0-integration-rtld-deepbind.3
Current platform : linux-arm64-openssl-3.0.x
Query Engine (Node-API) : libquery-engine 3b9f029aeb9a91829e6648c61146b02f3646d1e7 (at node_modules/@prisma/engines/libquery_engine-linux-arm64-openssl-3.0.x.so.node)
Migration Engine : migration-engine-cli 3b9f029aeb9a91829e6648c61146b02f3646d1e7 (at node_modules/@prisma/engines/migration-engine-linux-arm64-openssl-3.0.x)
Format Wasm : @prisma/prisma-fmt-wasm 4.12.0-22.3b9f029aeb9a91829e6648c61146b02f3646d1e7
Default Engines Hash : 3b9f029aeb9a91829e6648c61146b02f3646d1e7
Studio : 0.483.0
Preview Features : orderByNulls, filteredRelationCount
How to reproduce
I can’t create a minimal reproduction; I ran into the problem while running the https://github.com/civitai/civitai project.
Expected behavior
No response
Prisma information
ref to: https://github.com/civitai/civitai/blob/main/prisma/schema.prisma
ref to: https://github.com/civitai/civitai/blob/main/src/server/db/client.ts
Environment & setup
- OS: Ubuntu 22.10
- Database: PostgreSQL
- Node.js version: v18.15.0
Prisma Version
v4.10.1 and v4.12.0-integration-rtld-deepbind.3
prisma : 4.10.1
@prisma/client : 4.10.1
Current platform : linux-arm64-openssl-3.0.x
Query Engine (Node-API) : libquery-engine aead147aa326ccb985dcfed5b065b4fdabd44b19 (at node_modules/@prisma/engines/libquery_engine-linux-arm64-openssl-3.0.x.so.node)
Migration Engine : migration-engine-cli aead147aa326ccb985dcfed5b065b4fdabd44b19 (at node_modules/@prisma/engines/migration-engine-linux-arm64-openssl-3.0.x)
Format Wasm : @prisma/prisma-fmt-wasm 4.10.1-1.80b351cc7c06d352abe81be19b8a89e9c6b7c110
Default Engines Hash : aead147aa326ccb985dcfed5b065b4fdabd44b19
Studio : 0.481.0
Preview Features : orderByNulls, filteredRelationCount
prisma : 4.12.0-integration-rtld-deepbind.3
@prisma/client : 4.12.0-integration-rtld-deepbind.3
Current platform : linux-arm64-openssl-3.0.x
Query Engine (Node-API) : libquery-engine 3b9f029aeb9a91829e6648c61146b02f3646d1e7 (at node_modules/@prisma/engines/libquery_engine-linux-arm64-openssl-3.0.x.so.node)
Migration Engine : migration-engine-cli 3b9f029aeb9a91829e6648c61146b02f3646d1e7 (at node_modules/@prisma/engines/migration-engine-linux-arm64-openssl-3.0.x)
Format Wasm : @prisma/prisma-fmt-wasm 4.12.0-22.3b9f029aeb9a91829e6648c61146b02f3646d1e7
Default Engines Hash : 3b9f029aeb9a91829e6648c61146b02f3646d1e7
Studio : 0.483.0
Preview Features : orderByNulls, filteredRelationCount
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 32
- Comments: 79 (16 by maintainers)
I’ve identified the problem and am currently working on fixing it. The root cause was related to how we build and link OpenSSL in our cross-compilation images, leading to glibc accidentally being statically linked and causing problems. As other people noted in the comments, the newer the OS (and thus the system libc version), the higher the probability of problems, and it also just so happens to be easier to run into problems with binaries compiled for OpenSSL 3 rather than OpenSSL 1.1, but it affects all our ARM64 binaries except (theoretically) those for glibc-based systems with OpenSSL 1.0.
The fix for glibc distros was fairly simple (I’ll do the CI tricks across the repos to release a dev/integration build based on 5.2.0 so you can test and provide feedback). Alpine requires a bit more work, but you can expect news soon.
Adding to my previous comment, this appears to be caused by a double-free bug that occurs in openssl, in the `quaint` package's postgres connector, in the Prisma query engine. Not sure which part is to blame.
I compiled the prisma/prisma-engines branch 4.16.x (commit 4bc8b6e1b66cb932731fb1bdbbc550d1e010de81) with AddressSanitizer enabled, which caught the bug.
There is of course the possibility that removing RTLD_DEEPBIND as part of my compiling & running steps produced some issue with openssl which caused this as a different segfault; I don't really know what that flag does. Or this could be a false positive, but the Rust book says:
I compiled and ran the custom query-engine .so.node library with the sanitizer like this: https://gist.github.com/cxcorp/4a2b4184b9a51b54dd49ed692e6f05bd
I reproduced it like this:
- `prisma.xxx.findUnique()` invocations to a Postgres 15 instance
- `PrismaClientKnownRequestError: Invalid prisma.xxx.findUnique() invocation: Server has closed the connection.`
- `SUMMARY: AddressSanitizer: double-free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52 in __interceptor_free`
This is it in a nutshell, but really I used this project: https://github.com/cxcorp/saituri-9000, opened the webpage, killed connections from pg, then reloaded the webpage twice.
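For reference, the "killed connections from pg" step can be done with the same statement quoted later in this thread; assuming a psql session with sufficient privileges, it looks like this:

```sql
-- Terminate every Postgres backend except the current session,
-- forcing Prisma's open connections to be closed server-side.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();
```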
Here is the dump caught by AddressSanitizer:
prisma -v:
I was finally able to reproduce the issue (I got an abort rather than a segfault, but it's likely caused by the same problem, given the `free(): unaligned chunk detected in tcache 2` message that I got), so I should be able to continue the investigation further now.
I ran into this and, after a bunch of debugging, can only replicate it using Node 18. Downgrading to 16 or upgrading to 19 (can't test 20 right now for other reasons) makes the issue stop happening.
Hey everyone! Could you test and confirm if updating `prisma` and `@prisma/client` to version `5.3.0-24.integration-arm-openssl-main-92c9273a161ecfa87c1e2b27abd570c922b184a6` fixes the issues for you?
Important: install it as an exact dependency, i.e., you must not have a caret (`^`) in your package.json before the package version. If you update it from the CLI rather than editing package.json manually, use the `-E` flag:
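For example, the exact-version install with npm could look roughly like this (illustrative only; adjust to your package manager and to whether prisma lives in devDependencies):

```sh
# The prisma CLI is typically a dev dependency; -E (--save-exact) pins the exact version
npm install -D -E prisma@5.3.0-24.integration-arm-openssl-main-92c9273a161ecfa87c1e2b27abd570c922b184a6
npm install -E @prisma/client@5.3.0-24.integration-arm-openssl-main-92c9273a161ecfa87c1e2b27abd570c922b184a6
```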
FWIW, this is branched off `main` and not 5.2.0 as I suggested above — doing the latter as an integration release turned out to be a bit messy.
@aqrln
apple m1/node:18 - wooooork
So far I have not yet seen a reliable workaround that works with a current version of Node (v20) and an up-to-date (= without known vulnerabilities) version of OpenSSL. Therefore I would kindly ask whether someone at Prisma (perhaps @aqrln) is currently working on this bug? I also remember that @cxcorp did a great job pinpointing the crashes and identifying dependencies upstream to be a possible root cause for this – has there been any progress on this investigation in the meantime?
I compiled the library with ThreadSanitizer this time. From what I've understood, ThreadSanitizer may produce more false positives, especially as you cannot compile openssl statically and include -fsanitize=thread, but the prisma-query-engine-napi library at least should be instrumented in this stack trace.
Just before the crash, ThreadSanitizer spits out reports about 12 data races of two tokio task executor threads trying to use `openssl::ssl::connector::SslConnector::builder()` at the same time, followed by a double free and a segfault after yet another data race. This would correspond with what AddressSanitizer reported. Maybe `X509_STORE_set_default_paths_ex` or the other openssl initialization functions are not thread safe? Or maybe it's linking to the wrong openssl lock callbacks, if provided? Or it could just be a false positive.
Here is a sample of what it reported (a lot of info); please find the rest of the traces, in chronological order before the crash, here: https://gist.github.com/cxcorp/7515158f2b8bb764bd8bc80ccfdd98d3
Also, with @Jolg42’s help, I've now released `5.3.0-integration-arm-openssl-5-2-0.1`, which is equivalent to 5.2.0 but is built with the fix — feel free to test that instead if you'd prefer (again, make sure you install it as an exact version).
I am experiencing this issue as well on Hetzner arm64 Ampere.
@aqrln
arm64v8/node:18 - also works
Hey guys 👋. We were seeing the same thing with Prisma version 5.1.1 (with Postgres version 15.3) when running the `node:latest-alpine` image with `docker compose`. We are using M1 MacBooks (arm64). Our application uses `nodemon` for development and we were seeing random, intermittent crashes like so:
Running the application without `nodemon` gave us a little bit more info… (exit code 139). After looking around Google, it seems like exit code 139 is related to a segmentation fault.
Thinking it may be an Alpine issue, I decided to try just the regular `node:latest` image (without the `-alpine` suffix). That resulted in the following error code:
These errors would happen for us every so often, whenever our frontend would make queries to the backend which called the Prisma function
findUniqueOrThrow(). Because it’s not consistent, it’s extremely difficult to reproduce, but I’ve found that waiting 10 minutes between requests somehow consistently reproduced the crash.Our solution was indeed to match up the openssl
binaryTargetwith what’s provided in the image. So for example, thenode:18.7-alpineimage has the following openssl package:So we have the following in
generator clientwhen using thenode:18.7-alpineimage:Here is my
prisma --version, in case anyone is interested:TL;DR - switching to the
node:18.7-alpineimage and modifying ourbinaryTargetsto the above fixed our crashes.Hi! I originally posted in https://github.com/prisma/prisma/issues/10649#issuecomment-1656941029 but was asked to move it here, so here goes:
I’ve spent many hours trying to figure out why Prisma is segfaulting in the query engine, have tried changing the engine format to “engine” but without any promising result unfortunately.
I tried playing around with different versions and found that node 18.16 and above has this problem, from what I can tell node version
18.15is the latest version that runs without encountering any segfault. Worth noting is that the faulty versions did still work for smaller queries but they broke when GraphQL concatenated larger queries which I had a test case for.So basically this was the workaround I came up with:
Bumped into this when trying to use Prisma with Postgres 15 on a Raspberry Pi 4 64-bit (arm64) running in musl-based Docker container (
node:18-alpine). It’s using the prebuiltlibquery_engine-linux-musl-arm64-openssl-3.0.x.so.nodelibrary.The issue occurs for me whenever the postgres connection times out. I can reproduce it by terminating the connection from postgres’ end with
pg_terminate_backend, e.g.select pg_terminate_backend(pid) from pg_stat_activity where pid <> pg_backend_pid();.The stack trace points to tokio-runtime, so might be something upstream. Or maybe the stack trace is just being exceptionally unhelpful and it just happens to blow up inside an async section.
lldb/llnode dumps:
Unhelpfully empty stack trace:
Threads:
When using a glibc-based container (
node:18), the error is:malloc(): unaligned tcache chunk detected, again in tokio-runtime:Turn SSL mode off by using
`sslmode=disable`
*This is perfectly fine if your RDS instance is not publicly accessible.
It's really a critical error and needs to be investigated. For now, I have two options: migrate the small amount of data currently stored in Postgres to another MySQL instance (not a straightforward solution, as it will require tweaking the codebase because MySQL doesn't support scalar lists), or deploy the whole application again on a new x86-64 EC2 instance, as the web application is still in the early stages of production.
@edelbalso yes, you need to update to at least 5.3.0, preferably the latest 5.7.1.
Since you’re on a fairly old version, you might unfortunately have to do some work to migrate from Prisma 4.x to Prisma 5.x. Please take a look at the upgrade guide to see if there’s anything that would affect your application, and how to handle that: https://www.prisma.io/docs/orm/more/upgrade-guides/upgrading-versions/upgrading-to-prisma-5
@cxcorp thanks again for your investigation and for steps for reproduction, I finally had some time to take a closer look last week. I couldn’t reproduce the segfault and the double-free with asan (I only got a bunch of leaks in napi-rs glue code and in node/v8 reported) with locally built engines on ARM64 Linux (NixOS). This hints that the issue might be either with the OpenSSL version (I was building locally with 3.0.9) or with cross-compilation. I’ll try reproducing with our stock cross-compiled engines in another VM or in a docker container as I can’t easily use them on NixOS and see what happens.
UPD: actually I just realised I might have built it in debug and not in release profile, so I’ll have to double check that as well
Setting `engineType = "binary"` reduced the number of failures for me (and a single segfault no longer kills the entire test run), but some are still present. Especially tests working with lots of database rows and sorting/filtering are the ones that still fail. Simple `findFirst()` or `findUnique()` calls appear to be unaffected.
I had the same problem and tried a lot. The only thing that helped was the following configuration in the schema for the client:
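(The exact block was not preserved in this export. The configurations reported to help elsewhere in this thread combine a pinned binaryTargets entry with the binary engine type, along the following lines; treat this as a sketch, not the commenter's literal schema.)

```prisma
generator client {
  provider      = "prisma-client-js"
  // Pin an explicit (older) OpenSSL binary target in addition to "native",
  // as suggested further down in this thread; adjust to your distro/image.
  binaryTargets = ["native", "debian-openssl-1.1.x"]
  // Use the standalone binary engine instead of the Node-API library engine.
  engineType    = "binary"
}
```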
An easy way to reproduce is to use a domain that can't be resolved, like `postgres://user:pass@local.domain:5432/postgres`, with the `node:18.16-alpine` image and `binaryTargets = ["native"]` on ARM64. It works fine in any x64 environment.
@aqrln Just tested `5.3.0-integration-arm-openssl-5-2-0.1` on Alpine (image `node:20.5.1-alpine`) – it works without problems in a setup which consistently broke with Prisma 5.2.0. Thank you so much for your effort!
@aqrln
Amazon Linux 2023 - it works
It may also work on alpine?
Some context I find suspiciously correlating:
comes with openssl 1:3.0.8-1.amzn2023.0.3
comes with openssl 1:1.0.2k-24.amzn2.0.7
comes with openssl 3.0.9-1
comes with openssl 1.1.1q-r0
comes with openssl 1:1.0.2k-24.amzn2.0.7
While looking through similar aarch64-related Prisma issues I found that some may be related to https://github.com/openssl/openssl/issues/21541 – can this be the case here as well? A fix for that (https://github.com/openssl/openssl/commit/e7bb35e0c3dbd7ba5e6e4885d893191a3bf70356) is not included in any openssl release yet, but maybe someone can compile and test it locally?
@xlmnxp @ryanccn @marvinruder
I have the solution, at least for me. If you use the binary engine type and also get a connection error with findMany, it is because there are problems with the OpenSSL version of your Linux distribution. In my case I use Ubuntu version 22. If I explicitly go to a lower version (not the one that is given for the respective operating system, but a lower one), it works fine. Which version you have to use can be found here: https://www.prisma.io/docs/reference/api-reference/prisma-schema-reference#binarytargets-options
I don’t think it makes sense not to use ssl.
I also added my config for the ARM64 server:
generator client {
  provider      = "prisma-client-js"
  binaryTargets = ["native", "debian-openssl-1.1.x"]
  engineType    = "binary"
}
Interestingly that did not work for me, I still get Segfaults even without using SSL.
I can reproduce it as well on ARM64(Ampere) Hetzner machine.
Can also reproduce on AWS EC2 aarch64 instance, currently using emulated x86_64, which works but is very slow (to the point of timing out).
This seems to be a very serious bug.
@aqrln any progress on this?
Is there any progress with this issue?
I also see this behaviour in my project, but it’s very strange: Prisma version 5.2.0
AWS EC2 with Amazon Linux 2023 (arm64) - it crashes. AWS EC2 with Amazon Linux 2 (arm64) - it works.
AWS ECS Fargate with "arm64v8/node:18" image - it crashes. AWS ECS Fargate with "arm64v8/node:18.7-alpine" image - it works.
AWS Lambda with “lambda/nodejs:18-arm64” image - it works.
It seems like it’s a problem with new OS versions…
sorry @marvinruder, I have not continued my own investigation as I grew tired of aarch64 issues and bought an x86_64 machine
It’s not related to OpenSSL, because it happens even when you disable TLS/SSL.
Even with node v19.9 and sslmode disabled, the issue still occurs for me on Linux ARM64. Also on Docker, with the images node:16 through 19 on alpine and slim; I have not tested others yet.
@cxcorp thanks so much for the investigation!
OpenSSL is at the moment statically linked in our cross-compiled ARM binaries (and looking at the diff in your dockerfile, that didn’t change), so no worries about it, it’s completely irrelevant here and removing the flag shouldn’t have affected anything 👍 And even if it was dynamically linked like in their x86_64 counterparts, the combination of Node.js 18 and
linux-arm64-openssl-3.0.xwould still be fine without this flag.What this flag does is it ensures that the dynamic linker picks up OpenSSL symbols from
libssl.soand not from the Node.js binary when loading the addon (otherwise things break when system OpenSSL and the one vendored with Node.js are not ABI-compatible).for me, when
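For illustration only (this is not Prisma's actual loader code), this is roughly what loading a Node-API addon with that flag looks like from JavaScript; the addon path below is a placeholder:

```js
// Load a native addon with RTLD_DEEPBIND so the dynamic linker prefers symbols
// from the addon's own shared-library dependencies (e.g. libssl.so) over
// same-named symbols already present in the node executable.
const os = require('os');
const path = require('path');

const mod = { exports: {} };
process.dlopen(
  mod,
  path.resolve('./libquery_engine-linux-arm64-openssl-3.0.x.so.node'), // placeholder path
  os.constants.dlopen.RTLD_NOW | os.constants.dlopen.RTLD_DEEPBIND     // RTLD_DEEPBIND is Linux-only
);
```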
engineType = "binary"I get error message tell me there no findFirst or findUnique functionsand sometimes:
I’m also having issues, because I use Docker I ended up changing the platform to
amd64and adding the engine to the schema, and it worked:Not a real fix but at least a workaround for local development.
Can a prisma dev add segfault-handler to the codebase so we can know where in the code this segfault is coming from?
Experiencing the same, on Hetzner Cloud CAX21 instance (aarch64-based Ampere Altra CPU). I am getting several different errors every time. Some examples:
- malloc(): unaligned fastbin chunk detected
- free(): double free detected in tcache 2
- malloc_consolidate(): invalid chunk size
- malloc(): unaligned tcache chunk detected
Let me know if I need to provide more information on this.
Facing the same issue in an ARM64-based EC2 instance connected to PostgreSQL managed by RDS.
I’m also encountering this via Umami
I believe that this SIGSEGV is also happening in Umami with the configuration below. Not sure if it's useful, but the error is reproducible with the Docker images, containers and CPU architecture.
Reproducible using this Dockerfile and docker-compose.yml
Logs from the Docker container
Same issue on arm64, prisma 4.11, Node.js 16 on Alpine. I am running a KeystoneJS 6 project. The segmentation fault only occurred once the SQL result was large enough.
Running GraphQL:
- types { apps {id} } -> crash
- types(where: {"id": {"lt": 6}}) { apps {id} } -> crash
- types(where: {"id": {"lt": 3}}) { apps {id} } -> no crash
- types(where: {"id": {"lt": 4}}) { apps {id} } -> no crash
And if I gradually lift the where limit, there is still no crash. Maybe it is due to cache or memory allocation?