mfem: Shared face information broken on distributed nc tet meshes
Hi,
found a bug with shared faces on nc tet meshes.
Reproduction should be easy:
mpirun -np 2 ex15p -m ../data/escher.mesh -r 2 -tf 0.3 -est 1
which yields
Options used:
--mesh ../data/escher.mesh
--problem 0
--nfeatures 1
--order 2
--max-err 0.0001
--hysteresis 0.25
--ref-levels 2
--nc-limit 3
--t-final 0.3
--estimator 1
--visualization
--no-visit-datafiles
Time 0
Refinement:
Iteration: 1, number of unknowns: 4401, total error: 0.00878944
Iteration: 2, number of unknowns: 5275
Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
-->
... in function: double &mfem::DenseMatrix::operator()(int, int)
... in file: /home/termi/Repos/mfem-master/linalg/densemat.hpp:928
Backtrace:
0) [0x5ad41f]: mfem::mfem_error(char const*)
1) [0x5eaf11]: mfem::DenseMatrix::operator()(int, int)
2) [0x7e1098]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
3) [0x7ac684]: mfem::ParMesh::ExchangeFaceNbrData()
4) [0xbbb87e]: mfem::ParFiniteElementSpace::ExchangeFaceNbrData()
5) [0xbdbac5]: mfem::ParGridFunction::ExchangeFaceNbrData()
6) [0x9f67dc]: mfem::KellyErrorEstimator::ComputeEstimates()
7) [0x9faf0e]: mfem::KellyErrorEstimator::GetLocalErrors()
8) [0x6ef76f]: mfem::ThresholdRefiner::ApplyImpl(mfem::Mesh&)
9) [0x57d3be]: mfem::MeshOperator::Apply(mfem::Mesh&)
10) [0x57a302]: main
11) [0x7fe4f645ff89]: __libc_start_main
12) [0x578839]: _start
Lookup backtrace source lines:
addr2line -C -e ex15p 0x5ad41f 0x5eaf11 0x7e1098 0x7ac684 0xbbb87e 0xbdbac5 0x9f67dc 0x9faf0e 0x6ef76f 0x57d3be 0x57a302
addr2line -C -e /lib64/libc.so.6 0x7fe4f645ff89
addr2line -C -e ex15p 0x578839
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
I can reproduce this against latest release (4.3) and latest master (0843a87).
Edit: Just remember that this was already broken when implementing Kelly (see #1722).
Edit 2: Better MWE.
#include "mfem.hpp"
#include <fstream>
#include <iostream>
using namespace std;
using namespace mfem;
int main(int argc, char *argv[])
{
// 1. Initialize MPI.
int num_procs, myid;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
// 2. Parse command-line options.
const char *mesh_file = "../data/star-hilbert.mesh";
OptionsParser args(argc, argv);
args.AddOption(&mesh_file, "-m", "--mesh", "Mesh file to use.");
args.Parse();
if (!args.Good())
{
if (myid == 0)
{
args.PrintUsage(cout);
}
MPI_Finalize();
return 1;
}
// 3. Read the (serial) mesh from the given mesh file on all processors. We
// can handle triangular, quadrilateral, tetrahedral, hexahedral, surface
// and volume meshes with the same code.
Mesh *mesh = new Mesh(mesh_file, 1, 1);
int dim = mesh->Dimension();
int sdim = mesh->SpaceDimension();
// 4. Project a NURBS mesh to a piecewise-quadratic curved mesh. Make sure
// that the mesh is non-conforming if it has quads or hexes and refine it.
mesh->EnsureNCMesh(true);
// Make sure tet-only meshes are marked for local refinement.
mesh->Finalize(true);
// 5. Define a parallel mesh by partitioning the serial mesh. Once the
// parallel mesh is defined, the serial mesh can be deleted.
ParMesh pmesh(MPI_COMM_WORLD, *mesh);
delete mesh;
// pmesh.ExchangeFaceNbrData(); // <- works
// 6. Do any refinement ...
pmesh.RandomRefinement(0.25);
// ... to crash this
pmesh.ExchangeFaceNbrData();
// 7. Exit
MPI_Finalize();
return 0;
}
Works:
mpirun -np 2 nc-tet-mwe -m ../data/amr-hex.mesh
Crashes:
mpirun -np 2 nc-tet-mwe -m ../data/escher.mesh
Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
-->
... in function: double &mfem::DenseMatrix::operator()(int, int)
... in file: /home/termi/Repos/mfem-master/mesh/../linalg/densemat.hpp:928
Backtrace:
Assertion failed: (data && i >= 0 && i < height && j >= 0 && j < width) is false:
-->
... in function: double &mfem::DenseMatrix::operator()(int, int)
... in file: /home/termi/Repos/mfem-master/mesh/../linalg/densemat.hpp:928
Backtrace:
0) [0x582abf]: mfem::mfem_error(char const*)
0) [0x582abf]: mfem::mfem_error(char const*)
1) [0x5fbcf1]: mfem::DenseMatrix::operator()(int, int)
1) [0x5fbcf1]: mfem::DenseMatrix::operator()(int, int)
2) [0x736f18]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
2) [0x736f18]: mfem::ParNCMesh::GetFaceNeighbors(mfem::ParMesh&)
3) [0x700644]: mfem::ParMesh::ExchangeFaceNbrData()
3) [0x700644]: mfem::ParMesh::ExchangeFaceNbrData()
4) [0x572f70]: main
4) [0x572f70]: main
5) [0x7fbbf7d0cf89]: __libc_start_main
5) [0x7f093a1ccf89]: __libc_start_main
6) [0x572c19]: _start
[...]
so this is unrelated to my Kelly implementation.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 22 (22 by maintainers)
Your example of the multigrid is interesting, I have tried putting a geometric multigrid on top of the NC structure already, and concluded it wasn’t worth the effort presently. However, it would likely be helpful to have an interface to generate a “coarsened” mesh that is one step of coarsening on all leaf elements (namely pruning the tips). Doing this a stage at a time would then generate a sequence of meshes defining a GMG, without relying on an inter element transfer that spans multiple NC levels. This would not however need an exposure of the full tree (again, I think such an interface is ripe for misuse and misunderstanding) but just a
void CoarsenLeaves(int nlevel = 1) const;
method on theParMesh
class that mutates the mesh by pruningnlevel
s of leaves. The GMG would then just be a sequence of copy and coarsen levels, with a very very simple user experience.I have managed to trigger the same error on linear tetrahedra meshes (
oops, bad upper bound
) elsewhere, after doing the fix for the bad access withoperator()
, so I don’t think it’s unique to q2 meshes. This was after a few rounds of amr within some application code, so I don’t have the mesh to hand however. The bug also triggers for the above parameters with-np 2
and with-r 0
. I added a little extra print to the asserts and it looks like both ranks end up with one more entry than the bound value.Now looking at making this work for tets, we have to reflect the orientations: https://github.com/mfem/mfem/blob/539f663fe6ad3738b6de9615f6c74dfa44927578/fem/geom.cpp#L909-L913 As you can see, there is no canonical swap of points that would flip the face similarly to
Geometry::SQUARE
. So the code highlighted by @v-dobrev should probably be put in a function called something likeFlipSquareFacePointMatrix
: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1122-L1128 And a specialized similar function should be written forGeometry::TRIANGLE
.I see your point, I think it’s a matter of definition, what is the flipped orientation of the face:
? I think the answer is in
geom.cpp:927-931
:Even orientations are non-flipped and Odd orientations are the flipped ones, so the flipped orientation of
{0, 1, 2, 3}
is{0, 3, 2, 1}
; and not{2, 1, 0, 3}
which is the flipped orientation of{2, 3, 0, 1}
.Regarding your second question, I’m not sure I understand what you are referring to. What I know about master/slave faces is that shared master faces have a special convention. There is usually two conventions that are conflicting in this one case, creating a special case, these two conventions are:
For shared master faces these two conventions are conflicting, so the local element is elem 0, but the slave element is elem 1 (special case). This means that the normal is inverted compared to the shared slave face on the other MPI rank. To compensate that, the face orientation on the shared master face is flipped, recovering a consistent state. Is it what was confusing you?
Note: Shared master faces are also called ghost slave faces
Regarding the
PointMatrix
class, I added some documentation that should be merged soon: https://github.com/mfem/mfem/blob/yohann/pa-dg-ncmesh/mesh/ncmesh.hpp#L824You’re welcome and thanks for the quick response Veselin!
This is not urgent for me to get fixed, since AMR on full-hex meshes is working nicely already! I just tried to setup some quick tests to estimate how good the performance of my problem on tetrahedral meshes is (because full hexahedral meshing can be really painful on the geometries which I have). However using conforming meshes does not work for me, because I heavily rely on having a functional load balancing and derefinement available, because the features of my problem are highly localized in space and time, see e.g.
If no one else proposes a fix in the meantime, then I will try to take a look at this after I finish the data collection and iterative solver PRs.
Okay, I can reproduce the error with
mpirun -np 2 ex15p -m ../data/escher.mesh -r 2 -tf 0.3 -est 1
.The illegal access happens here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1126
I think that this method does not really work for triangle faces, in general, as suggested several lines above the crash line, here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/mesh/pncmesh.cpp#L1122-L1128 Basically, we are trying to swap columns 2 and 4 (corresponding the vertices 2 and 4 in a quad) in a 2x3 matrix whose columns are the vertices of a triangle.
Originally
ex15p
was using conforming refinement for simplices, however that was changed (in #949) to always use non-conforming refinement. You can switch back to conforming refinement for simplices, here: https://github.com/mfem/mfem/blob/0843a87d7953cf23e556dcfd426d27bd9cfb3e21/examples/ex15p.cpp#L160 by callingMesh::EnsureNCMesh()
(same asMesh::EnsureNCMesh(false)
). With this change the example seems to work fine.In any case, this is a bug in the non-conforming refinement (when handling non-conforming shared triangle faces) that we should fix. Thanks for reporting it Dennis!