netcdf-c: NetCDF 4.9.0: segmentation fault after repeatedly opening a NetCDF 4 file, reading a vector and closing the file

The Julia user @sjdaines reported this segmentation fault (https://github.com/Alexander-Barth/NCDatasets.jl/issues/187 ), which occurs when repeatedly opening a NetCDF 4 file, reading a vector, and closing the file. After doing this ~1 000 000 times, the program crashes with a segmentation fault. In the original use case, the error occurs much earlier.

  • the version of the software with which you are encountering an issue

NetCDF 4.9.0 with HDF5 1.12.1 on Linux 5.15.0 with gcc 5.2.0 or gcc 12.1.0.

  • a description of the issue with the steps needed to reproduce it

NetCDF 4.9.0 is compiled with:

export CPPFLAGS="-I/workspace/destdir/include"
export CFLAGS="-std=c99"   
export LDFLAGS="-L/workspace/destdir/lib"    
./configure --prefix=/workspace/destdir --build=x86_64-linux-musl --host=x86_64-linux-gnu --enable-shared --disable-static --disable-dap-remote-tests --disable-plugins

The segmentation fault can also be reproduced with the following C code:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>
#define FILE_NAME "coords.nc"
#define NX 90
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); exit(2);}

int main() {
  int ncid, varid;
  float data_in[NX];
  int retval;
  int niter = 0;

  while (1) {
    if (niter % 1000 == 0) {
      printf("niter: %d\n",niter);
    }
    if ((retval = nc_open(FILE_NAME, NC_NOWRITE, &ncid)))
      ERR(retval);

    if ((retval = nc_inq_varid(ncid, "latitude", &varid)))
      ERR(retval);

    if ((retval = nc_get_var_float(ncid, varid, &data_in[0])))
      ERR(retval);

    if ((retval = nc_close(ncid)))
      ERR(retval);

    niter += 1;
  }
   
  return 0;
}

Compiled with:

gcc -g test_segfault6.c $(nc-config --cflags --libs)

After niter: 944000, the output is Segmentation fault (core dumped). Running the program under gdb, we see the following stack trace:

[...]
niter: 944000
niter: 945000

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7a991fa in __GI___libc_free (mem=0xffffffff80000400) at malloc.c:3255
3255    malloc.c: No such file or directory.
(gdb) 
(gdb) where
#0  0x00007ffff7a991fa in __GI___libc_free (mem=0xffffffff80000400) at malloc.c:3255
#1  0x00007ffff7cef600 in nc4_rec_grp_del () from /workspace/destdir/lib/libnetcdf.so.19
#2  0x00007ffff7cefa2b in nc4_nc4f_list_del () from /workspace/destdir/lib/libnetcdf.so.19
#3  0x00007ffff7c975ee in nc4_close_netcdf4_file () from /workspace/destdir/lib/libnetcdf.so.19
#4  0x00007ffff7c976d5 in nc4_close_hdf5_file () from /workspace/destdir/lib/libnetcdf.so.19
#5  0x00007ffff7c97cd6 in NC4_close () from /workspace/destdir/lib/libnetcdf.so.19
#6  0x00007ffff7c3581a in nc_close () from /workspace/destdir/lib/libnetcdf.so.19
#7  0x0000000000400a42 in main () at test_segfault6.c:28

On a different system with HDF5 1.10.0 this error could not be reproduced (tested up to 5 000 000 iterations).

The NetCDF file is available at: https://github.com/Alexander-Barth/NCDatasets.jl/files/9393436/coords.zip and contains the following data:

$ ncdump -h -s coords.nc 
netcdf coords {
dimensions:
	latitude = 90 ;
	longitude = 144 ;
	bnds = 2 ;
variables:
	float latitude(latitude) ;
		latitude:axis = "Y" ;
		latitude:units = "degrees_north" ;
		latitude:standard_name = "latitude" ;
		latitude:_Storage = "contiguous" ;
		latitude:_Endianness = "little" ;
	float longitude(longitude) ;
		longitude:axis = "X" ;
		longitude:units = "degrees_east" ;
		longitude:standard_name = "longitude" ;
		longitude:_Storage = "contiguous" ;
		longitude:_Endianness = "little" ;

// global attributes:
		:source = "Data from Met Office Unified Model" ;
		:um_version = "11.9" ;
		:Conventions = "CF-1.7" ;
		:_NCProperties = "version=2,netcdf=4.7.4,hdf5=1.12.0," ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 1 ;
		:_Format = "netCDF-4" ;
}

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

I recall now. It turns out that, as you note, a number of functions like strdup are not officially part of C99. However, it is also the case that, at least for gcc, the C library actually contains a strdup implementation, and if you add extern char* strdup(const char*) to a header, the linker actually finds it. If you look at ncconfig.h, you will see a number of declarations that try to deal with this.

OK, this is a very interesting find! So strdup is not part of C99, and indeed the compiler emitted a warning which I did not see.

What is surprising is that, according to config.h, strdup is available (but it is not):

/* Define to 1 if you have the `strdup' function. */
#define HAVE_STRDUP 1

Despite the test using the option -std=c99:

configure:22897: checking for strdup
configure:22897: cc -o conftest -std=c99 -fno-strict-aliasing -I/workspace/destdir/include -L/workspace/destdir/lib conftest.c -lxml2 -lcurl  >&5
configure:22897: $? = 0
configure:22897: result: yes

For reference, there is previous discussion about this: https://github.com/Unidata/netcdf-c/issues/1408

I am closing this issue because the -std=c99 option is no longer necessary in NetCDF 4.9.0. After intensive testing by @sjdaines, all reported failure cases are fixed by dropping -std=c99. Thanks a lot to all for your valuable help!!!

(And I learned a lot too; above all that C is really hard 😃)

@Alexander-Barth This is great, actually; it will help a lot to be able to replicate the environment. I will take a look at this tomorrow; my day-to-day machine is ARM, so I will move over to an x86-64 machine to test this out. I will also try under emulation if it comes down to it. Thanks!