mfem: Shared face information broken on distributed nc tet meshes

Hi,

found a bug with shared faces on nc tet meshes.

Reproduction should be easy:

mpirun -np 2 ex15p -m ../data/escher.mesh -r 2 -tf 0.3 -est 1

which yields

Options used:
   --mesh ../data/escher.mesh
   --problem 0
   --nfeatures 1
   --order 2
   --max-err 0.0001
   --hysteresis 0.25
   --ref-levels 2
   --nc-limit 3
   --t-final 0.3
   --estimator 1
   --visualization
   --no-visit-datafiles

Time 0

Refinement:
Iteration: 1, number of unknowns: 4401, total error: 0.00878944
Iteration: 2, number of unknowns: 5275

Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
 --> 
 ... in function: double &mfem::DenseMatrix::operator()(int, int)
 ... in file: /home/termi/Repos/mfem-master/linalg/densemat.hpp:928

Backtrace:
0) [0x5ad41f]: mfem::mfem_error(char const*)
1) [0x5eaf11]: mfem::DenseMatrix::operator()(int, int)
2) [0x7e1098]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
3) [0x7ac684]: mfem::ParMesh::ExchangeFaceNbrData()
4) [0xbbb87e]: mfem::ParFiniteElementSpace::ExchangeFaceNbrData()
5) [0xbdbac5]: mfem::ParGridFunction::ExchangeFaceNbrData()
6) [0x9f67dc]: mfem::KellyErrorEstimator::ComputeEstimates()
7) [0x9faf0e]: mfem::KellyErrorEstimator::GetLocalErrors()
8) [0x6ef76f]: mfem::ThresholdRefiner::ApplyImpl(mfem::Mesh&)
9) [0x57d3be]: mfem::MeshOperator::Apply(mfem::Mesh&)
10) [0x57a302]: main
11) [0x7fe4f645ff89]: __libc_start_main
12) [0x578839]: _start

Lookup backtrace source lines:
addr2line -C -e ex15p 0x5ad41f 0x5eaf11 0x7e1098 0x7ac684 0xbbb87e 0xbdbac5 0x9f67dc 0x9faf0e 0x6ef76f 0x57d3be 0x57a302
addr2line -C -e /lib64/libc.so.6 0x7fe4f645ff89
addr2line -C -e ex15p 0x578839

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

I can reproduce this against latest release (4.3) and latest master (0843a87).

Edit: Just remember that this was already broken when implementing Kelly (see #1722).

Edit 2: Better MWE.

#include "mfem.hpp"
#include <fstream>
#include <iostream>

using namespace std;
using namespace mfem;

int main(int argc, char *argv[])
{
   // 1. Initialize MPI.
   int num_procs, myid;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   // 2. Parse command-line options.
   const char *mesh_file = "../data/star-hilbert.mesh";

   OptionsParser args(argc, argv);
   args.AddOption(&mesh_file, "-m", "--mesh", "Mesh file to use.");
   args.Parse();
   if (!args.Good())
   {
      if (myid == 0)
      {
         args.PrintUsage(cout);
      }
      MPI_Finalize();
      return 1;
   }

   // 3. Read the (serial) mesh from the given mesh file on all processors.  We
   //    can handle triangular, quadrilateral, tetrahedral, hexahedral, surface
   //    and volume meshes with the same code.
   Mesh *mesh = new Mesh(mesh_file, 1, 1);
   int dim = mesh->Dimension();
   int sdim = mesh->SpaceDimension();

   // 4. Project a NURBS mesh to a piecewise-quadratic curved mesh. Make sure
   //    that the mesh is non-conforming if it has quads or hexes and refine it.
   mesh->EnsureNCMesh(true);
   // Make sure tet-only meshes are marked for local refinement.
   mesh->Finalize(true);

   // 5. Define a parallel mesh by partitioning the serial mesh.  Once the
   //    parallel mesh is defined, the serial mesh can be deleted.
   ParMesh pmesh(MPI_COMM_WORLD, *mesh);
   delete mesh;

   // pmesh.ExchangeFaceNbrData(); // <- works

   // 6. Do any refinement ...
   pmesh.RandomRefinement(0.25);

   // ... to crash this
   pmesh.ExchangeFaceNbrData();

   // 7. Exit
   MPI_Finalize();
   return 0;
}

Works:

mpirun -np 2 nc-tet-mwe -m ../data/amr-hex.mesh

Crashes:

mpirun -np 2 nc-tet-mwe -m ../data/escher.mesh

Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
 --> 
 ... in function: double &mfem::DenseMatrix::operator()(int, int)
 ... in file: /home/termi/Repos/mfem-master/mesh/../linalg/densemat.hpp:928

Backtrace:


Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
 --> 
 ... in function: double &mfem::DenseMatrix::operator()(int, int)
 ... in file: /home/termi/Repos/mfem-master/mesh/../linalg/densemat.hpp:928

Backtrace:
0) [0x582abf]: mfem::mfem_error(char const*)
0) [0x582abf]: mfem::mfem_error(char const*)
1) [0x5fbcf1]: mfem::DenseMatrix::operator()(int, int)
1) [0x5fbcf1]: mfem::DenseMatrix::operator()(int, int)
2) [0x736f18]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
2) [0x736f18]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
3) [0x700644]: mfem::ParMesh::ExchangeFaceNbrData()
3) [0x700644]: mfem::ParMesh::ExchangeFaceNbrData()
4) [0x572f70]: main
4) [0x572f70]: main
5) [0x7fbbf7d0cf89]: __libc_start_main
5) [0x7f093a1ccf89]: __libc_start_main
6) [0x572c19]: _start

[...]

so this is unrelated to my Kelly implementation.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

  1. This might be the case, but I would be pretty skeptical until some high granularity profiling data is available. Moving around in NCMesh though, there is an enormous scope for using the stl rather then home rolled algorithms. For instance, most of the Table structures could be easily handled by an unordered_map, for which we can design perfect hashes given we know so much information about the keys a priori. Right now I am focused on correctness rather than performance however, so I have only changed these over in cases where it improves readability, leaving the serious refactors for once I have feature completeness.
  2. Not sure about this. My view is the user shouldn’t be comparing face/element indices at all from adapt to adapt. After a refinement/coarsen, so many predicates and invariants are going to have been violated, that the mesh should be thought of as conceptually entirely new. The NCMesh internals obviously are doing all the book keeping, but a user should not be going near this stuff imho, as it’s unbelievably easy to think you understand it and be radically incorrect (at least speaking from my personal experience diving into this). UI cleanliness suggests furnishing this information to users is a Bad Idea™.

Your example of the multigrid is interesting, I have tried putting a geometric multigrid on top of the NC structure already, and concluded it wasn’t worth the effort presently. However, it would likely be helpful to have an interface to generate a “coarsened” mesh that is one step of coarsening on all leaf elements (namely pruning the tips). Doing this a stage at a time would then generate a sequence of meshes defining a GMG, without relying on an inter element transfer that spans multiple NC levels. This would not however need an exposure of the full tree (again, I think such an interface is ripe for misuse and misunderstanding) but just a void CoarsenLeaves(int nlevel = 1) const; method on the ParMesh class that mutates the mesh by pruning nlevels of leaves. The GMG would then just be a sequence of copy and coarsen levels, with a very very simple user experience.

mpirun -np 16 ex15p -m ../data/escher-p2.mesh -r 1 -tf 1.3 -est 1 -e 0.001

I have managed to trigger the same error on linear tetrahedra meshes (oops, bad upper bound) elsewhere, after doing the fix for the bad access with operator() , so I don’t think it’s unique to q2 meshes. This was after a few rounds of amr within some application code, so I don’t have the mesh to hand however. The bug also triggers for the above parameters with -np 2 and with -r 0. I added a little extra print to the asserts and it looks like both ranks end up with one more entry than the bound value.

Now looking at making this work for tets, we have to reflect the orientations: https://github.com/mfem/mfem/blob/539f663fe6ad3738b6de9615f6c74dfa44927578/fem/geom.cpp#L909-L913 As you can see, there is no canonical swap of points that would flip the face similarly to Geometry::SQUARE. So the code highlighted by @v-dobrev should probably be put in a function called something like FlipSquareFacePointMatrix: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1122-L1128 And a specialized similar function should be written for Geometry::TRIANGLE.

I see your point, I think it’s a matter of definition, what is the flipped orientation of the face:

3-2
| |
0-1

? I think the answer is in geom.cpp:927-931:

Constants<Geometry::SQUARE>::Orient[8][4] =
{
   {0, 1, 2, 3}, {0, 3, 2, 1}, {1, 2, 3, 0}, {1, 0, 3, 2},
   {2, 3, 0, 1}, {2, 1, 0, 3}, {3, 0, 1, 2}, {3, 2, 1, 0}
};

Even orientations are non-flipped and Odd orientations are the flipped ones, so the flipped orientation of {0, 1, 2, 3} is {0, 3, 2, 1}; and not {2, 1, 0, 3} which is the flipped orientation of {2, 3, 0, 1}.

Regarding your second question, I’m not sure I understand what you are referring to. What I know about master/slave faces is that shared master faces have a special convention. There is usually two conventions that are conflicting in this one case, creating a special case, these two conventions are:

  • The local element is always elem 0 (of the face).
  • The slave element is always elem 0.

For shared master faces these two conventions are conflicting, so the local element is elem 0, but the slave element is elem 1 (special case). This means that the normal is inverted compared to the shared slave face on the other MPI rank. To compensate that, the face orientation on the shared master face is flipped, recovering a consistent state. Is it what was confusing you?

Note: Shared master faces are also called ghost slave faces

Regarding the PointMatrix class, I added some documentation that should be merged soon: https://github.com/mfem/mfem/blob/yohann/pa-dg-ncmesh/mesh/ncmesh.hpp#L824

You’re welcome and thanks for the quick response Veselin!

This is not urgent for me to get fixed, since AMR on full-hex meshes is working nicely already! I just tried to setup some quick tests to estimate how good the performance of my problem on tetrahedral meshes is (because full hexahedral meshing can be really painful on the geometries which I have). However using conforming meshes does not work for me, because I heavily rely on having a functional load balancing and derefinement available, because the features of my problem are highly localized in space and time, see e.g.

cycler-rd_monodomain-amr-niederer-test

If no one else proposes a fix in the meantime, then I will try to take a look at this after I finish the data collection and iterative solver PRs.

Okay, I can reproduce the error with mpirun -np 2 ex15p -m ../data/escher.mesh -r 2 -tf 0.3 -est 1.

The illegal access happens here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1126

I think that this method does not really work for triangle faces, in general, as suggested several lines above the crash line, here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1122-L1128 Basically, we are trying to swap columns 2 and 4 (corresponding the vertices 2 and 4 in a quad) in a 2x3 matrix whose columns are the vertices of a triangle.

Originally ex15p was using conforming refinement for simplices, however that was changed (in #949) to always use non-conforming refinement. You can switch back to conforming refinement for simplices, here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/examples/ex15p.cpp#L160 by calling Mesh::EnsureNCMesh() (same as Mesh::EnsureNCMesh(false)). With this change the example seems to work fine.

In any case, this is a bug in the non-conforming refinement (when handling non-conforming shared triangle faces) that we should fix. Thanks for reporting it Dennis!