opencv: Improper or bad rotation estimation with solvePnP in some cases

System information (version)
  • OpenCV => master (ee257ff)
  • Operating System / Platform => Ubuntu 16.04 / 64 bits
  • Compiler => gcc
Detailed description

In some cases, the pose estimated with solvePnP() gives a rotation matrix that looks correct (det(R)=1, R.R^t=Identity) but projects the z-axis of the object frame in the opposite direction.

More information here: ArUco marker z-axis mirrored.

Example image from the linked question: 14939088089358535

My example (less obvious, with aruco::CORNER_REFINE_NONE):

bad_rotation_108

bad_rotation2_108

The issue can appear on one frame and disappear in the next frame.

Steps to reproduce

To get a complete reproducible example with Aruco detection see here.

Minimal reproducible code with some optional debug code/display:
#include "opencv2/opencv.hpp"
#include "opencv2/aruco.hpp"

int main(int argc, char *argv[])
{
    cv::Mat frame = cv::imread(argc > 1 ? argv[1] : "img_fail.png");

    // Intrinsic camera parameters.
    cv::Mat camMatrix = cv::Mat::eye(3, 3, CV_64F);
    cv::Mat distortionCoeffs = cv::Mat::zeros(8, 1, CV_64F);
    camMatrix.at<double>(0, 0) = 2.3396381685789738e+03;
    camMatrix.at<double>(0, 2) = 960.;
    camMatrix.at<double>(1, 1) = 2.3396381685789738e+03;
    camMatrix.at<double>(1, 2) = 540.;
    std::cout << "camMatrix:\n" << camMatrix << std::endl;

//    distortionCoeffs.at<double>(0, 0) = -1.0982746232841779e-01;
//    distortionCoeffs.at<double>(1, 0) = 2.2689585715220828e-01;
//    distortionCoeffs.at<double>(2, 0) = 0.;
//    distortionCoeffs.at<double>(3, 0) = 0.;
//    distortionCoeffs.at<double>(4, 0) = -2.2112148171171589e-01;
    std::cout << "distortionCoeffs: " << distortionCoeffs.t() << std::endl;

    std::vector<cv::Point3d> objectPoints;
    objectPoints.push_back( cv::Point3d(-2.5, 2.5, 0.0) );
    objectPoints.push_back( cv::Point3d(2.5, 2.5, 0.0) );
    objectPoints.push_back( cv::Point3d(2.5, -2.5, 0.0) );
    objectPoints.push_back( cv::Point3d(-2.5, -2.5, 0.0) );

    std::vector<cv::Point2d> imagePoints;
    imagePoints.push_back( cv::Point2d(988, 512) );
    imagePoints.push_back( cv::Point2d(945, 575) );
    imagePoints.push_back( cv::Point2d(849, 544) );
    imagePoints.push_back( cv::Point2d(893, 480) );

    std::vector<int> solvePnP_methods;
    solvePnP_methods.push_back(cv::SOLVEPNP_ITERATIVE);
    solvePnP_methods.push_back(cv::SOLVEPNP_EPNP);
    solvePnP_methods.push_back(cv::SOLVEPNP_P3P);
    solvePnP_methods.push_back(cv::SOLVEPNP_AP3P);

    std::map<int, std::string> solvePnP_method_names;
    solvePnP_method_names[cv::SOLVEPNP_ITERATIVE] = "cv::SOLVEPNP_ITERATIVE";
    solvePnP_method_names[cv::SOLVEPNP_EPNP] = "cv::SOLVEPNP_EPNP";
    solvePnP_method_names[cv::SOLVEPNP_P3P] = "cv::SOLVEPNP_P3P";
    solvePnP_method_names[cv::SOLVEPNP_AP3P] = "cv::SOLVEPNP_AP3P";

    for (size_t idx = 0; idx < solvePnP_methods.size(); idx++)
    {
        cv::Mat debugFrame = frame.clone();
        cv::putText(debugFrame, solvePnP_method_names[solvePnP_methods[idx]], cv::Point(20,20), cv::FONT_HERSHEY_SIMPLEX , 0.5, cv::Scalar(0,0,255));

        cv::line(debugFrame, imagePoints[0], imagePoints[1], cv::Scalar(0,0,255), 2);
        cv::line(debugFrame, imagePoints[1], imagePoints[2], cv::Scalar(0,255,0), 2);
        cv::line(debugFrame, imagePoints[2], imagePoints[3], cv::Scalar(255,0,0), 2);

        cv::Mat rvec, tvec;
        cv::solvePnP(objectPoints, imagePoints, camMatrix, distortionCoeffs, rvec, tvec, false, solvePnP_methods[idx]);

        cv::Mat rotation_matrix;
        cv::Rodrigues(rvec, rotation_matrix);

        std::cout << "\nrotation_matrix:\n" << rotation_matrix << std::endl;
        std::cout << "R.R^t=" << rotation_matrix*rotation_matrix.t() << std::endl;
        std::cout << "det(R)=" << cv::determinant(rotation_matrix) << std::endl << std::endl;

        cv::aruco::drawAxis(debugFrame, camMatrix, distortionCoeffs, rvec, tvec, 5.0);

        std::vector<cv::Point3d> axis_vec;
        axis_vec.push_back( cv::Point3d(0, 0, 0) );
        axis_vec.push_back( cv::Point3d(1, 0, 0) );
        axis_vec.push_back( cv::Point3d(0, 1, 0) );
        axis_vec.push_back( cv::Point3d(0, 0, 1) );

        std::vector<cv::Point2d> axis_vec_proj;
        cv::projectPoints(axis_vec, rvec, tvec, camMatrix, distortionCoeffs, axis_vec_proj);
        for (size_t i = 0; i < axis_vec_proj.size(); i++)
        {
            std::cout << "axis_vec_proj[" << i << "]=" << axis_vec_proj[i].x << " ; " << axis_vec_proj[i].y << std::endl;
        }

        // Project cube.
        float length = 20.0f;
        std::vector<cv::Point3f> testObj3d;
        testObj3d.push_back(cv::Point3f(0, 0, 0));
        testObj3d.push_back(cv::Point3f(length, 0, 0));
        testObj3d.push_back(cv::Point3f(0, length, 0));
        testObj3d.push_back(cv::Point3f(length, length, 0));
        testObj3d.push_back(cv::Point3f(0, 0, length));
        testObj3d.push_back(cv::Point3f(length, 0, length));
        testObj3d.push_back(cv::Point3f(0, length, length));
        testObj3d.push_back(cv::Point3f(length, length, length));
        std::vector<cv::Point2f> testObj2d;
        cv::projectPoints(testObj3d, rvec, tvec, camMatrix, distortionCoeffs, testObj2d);

        cv::line(debugFrame, testObj2d[0], testObj2d[1], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[0], testObj2d[2], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[1], testObj2d[3], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[2], testObj2d[3], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[0], testObj2d[4], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[1], testObj2d[5], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[2], testObj2d[6], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[3], testObj2d[7], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[4], testObj2d[5], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[4], testObj2d[6], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[5], testObj2d[7], cv::Scalar(0,0,0), 1);
        cv::line(debugFrame, testObj2d[6], testObj2d[7], cv::Scalar(0,0,0), 1);

        cv::imshow("debug", debugFrame);
        cv::waitKey(0);
    }
}

The image from which the corners have been extracted is available here.


Note:

  • cv::aruco::drawAxis() should be fine, as manual matrix multiplications lead to the same result
  • I don’t think the issue comes from cv::Rodrigues(), as I get more or less the same rotation matrix inside solvePnP() just before its internal call to cv::Rodrigues() (solvePnP() computes a rotation matrix and converts it to a rotation vector internally). I suppose the correct rotation matrix should have a different sign somewhere, which is not the case here.
  • cv::SOLVEPNP_ITERATIVE, cv::SOLVEPNP_EPNP, cv::SOLVEPNP_P3P and cv::SOLVEPNP_AP3P all seem to fail, although the estimated rotation differs between some of the methods
  • using cv::aruco::CORNER_REFINE_SUBPIX can solve the issue in the original question (in my case, as the markers are much smaller, the refinement is not always correct and the issue remains)
  • while I can understand that some numerical imprecision in the corner locations can lead to a badly estimated pose, I cannot understand why in this case the translation is fine and the rotation is almost correct except for a flipped z-axis (or why a left-handed rotation matrix seems to be returned)


Most upvoted comments

~Cheap workaround: if you assume the marker always faces up relative to the camera, then a bad rotation can be detected if it violates this constraint. The pose’s rotation can be “fixed” by flipping the axes and doing a “fixup” rotation:~ Edit: this works only in special cases and shouldn’t be used!

import math
import cv2
import numpy as np

rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, ...)
T = tvecs[0,0]
R = cv2.Rodrigues(rvecs[0])[0]
# Unrelated -- makes Y the up axis, Z forward
R = R @ np.array([
	[1, 0, 0],
	[0, 0, 1],
	[0,-1, 0],
])
if 0 < R[1,1] < 1:
	# If it gets here, the pose is flipped.

	# Flip the axes. E.g., Y axis becomes [-y0, -y1, y2].
	R *= np.array([
		[ 1, -1,  1],
		[ 1, -1,  1],
		[-1,  1, -1],
	])
	
	# Fixup: rotate along the plane spanned by camera's forward (Z) axis and vector to marker's position
	forward = np.array([0, 0, 1])
	tnorm = T / np.linalg.norm(T)
	axis = np.cross(tnorm, forward)
	angle = -2*math.acos(tnorm @ forward)
	R = cv2.Rodrigues(angle * axis)[0] @ R

BTW, I don’t think filters on their own are enough to fix this (if it happens a lot). Bad pose solutions basically make the distribution multi-modal, so an (E)KF wouldn’t work well. You could use particle filters, cluster the particles (rather than averaging) to find modes, and pick the most likely mode.

@r2d3 Indeed, there could be multiple solutions given the measurement error.


My confusion was about the z-axis “being flipped”, which to me meant that the estimated rotation matrix would be left-handed. The determinant of the rotation matrix was equal to 1, which added to my confusion as I was convinced I was seeing a left-handed frame.

Actually, it was just an optical illusion similar to this: image

  • Correct estimation of the pose (SOLVEPNP_P3P): aruco_pose_axis2

  • Bad estimation of the pose (rotation) (SOLVEPNP_ITERATIVE): the z-axis is pointing toward us instead of going into the table (the x-axis is going more or less into the table) aruco_pose_axis1

This is more obvious by looking at the complete cube instead of just the marker frame.


To recap, the real issue (a badly estimated pose/rotation) comes from a limitation of the PnP algorithm used (which is subject to noise in the intrinsics and the 2D image coordinates) and possibly from the ambiguity of the projection of a planar object; see for instance D. Oberkampf, D. DeMenthon and L. S. Davis, Iterative pose estimation using coplanar feature points:

image

(SOP means Scaled Orthographic Projection.)

Many PnP algorithms exist, which shows that the pose estimation problem is not trivial.

The pose ambiguity problem is a known issue for PnP when the model points are coplanar. It occurs when the planar model is either small or viewed far from the camera.

Thanks to @catree, who integrated IPPE in #14362, it is now possible to get the two pose solutions (rotation and translation).

@hellochenwang

is there a way to detect this kind of optical illusion situation?

Yes, from the two poses, if the reprojection error of both poses is similar, then there is an ambiguity. A feature of #14362 is to give the reprojection error of both poses, so this can be measured.
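For illustration, a rough sketch of such a check using solvePnPGeneric() with SOLVEPNP_IPPE_SQUARE (the marker size and the error-ratio threshold are assumptions):

import cv2
import numpy as np

# Square marker of side length 5 (same unit as the translation), with the corner
# order expected by SOLVEPNP_IPPE_SQUARE: top-left, top-right, bottom-right, bottom-left.
s = 5.0 / 2.0
object_points = np.array([[-s,  s, 0],
                          [ s,  s, 0],
                          [ s, -s, 0],
                          [-s, -s, 0]], dtype=np.float64)

def detect_ambiguity(image_points, cam_matrix, dist_coeffs, ratio_thresh=0.6):
    # image_points: (4, 2) array of the detected corners, same order as object_points.
    n_solutions, rvecs, tvecs, reproj_errors = cv2.solvePnPGeneric(
        object_points, image_points, cam_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_IPPE_SQUARE)
    errors = np.asarray(reproj_errors).flatten()
    # If the best reprojection error is close to the worst one, both poses explain
    # the corners almost equally well: the pose is ambiguous.
    ambiguous = n_solutions == 2 and errors.min() / (errors.max() + 1e-9) > ratio_thresh
    return list(zip(rvecs, tvecs)), ambiguous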

So how to resolve the ambiguity? You can’t do it using pnp, and indeed no pnp algorithm will be able to resolve it, because it is a property of the problem. More information is required to constrain the marker’s pose.

There are four main options:

  1. To use the effect of shading to determine the correct pose. Basically, the brightness of the marker is influenced by its orientation relative to the scene’s illumination. It is possible to determine the correct pose using this information. This is nicely shown in the optical illusion posted above by @catree. However this can be difficult in practice because it requires determining the illumination of the scene.

  2. To use temporal filtering, as suggested by @catree. But in practice this is very hard to use, and it cannot solve the problem reliably. It might reduce random flipping between the two solutions, but we cannot guarantee that the smoothed trajectory of poses is correct.

  3. To use additional non-coplanar markers. If you have one or more additional markers that are not co-planar to the original marker, then you can resolve the ambiguity. This is because the pnp problem no longer involves a planar model, and we can compute pose uniquely.

  4. To use additional coplanar markers. If you have a set of markers that all lie on the same plane, then it is possible to resolve the ambiguity, but there is a condition: the set of markers must span a region of space that is ‘sufficiently large’. You need this condition to prevent an ambiguity of the entire set of markers being flipped. There are two ways to solve this in practice. The first is to solve the pose of each marker independently and then find the correct pose using a consensus. The correct pose should be revealed as a tight cluster within the set of all poses (2 poses per marker). If you don’t know the relative positions of the markers on the plane, you can do a consensus with the estimated normals of each marker (with two normals per marker). Alternatively, if you know the relative positions of the markers on the plane, things are simpler: you can simply detect all the markers in the image, then solve a single PnP problem with 4n point correspondences, where n is the number of detected markers (see the sketch after this list).
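A rough sketch of that last option, assuming the relative marker positions on the plane are known (the marker ids, centers and size below are made-up example values):

import cv2
import numpy as np

marker_size = 5.0  # marker side length (made-up value, same unit as the plane coordinates)
# Assumed known layout: marker id -> center of the marker on the z = 0 plane.
marker_centers = {17: (0.0, 0.0), 23: (20.0, 0.0), 42: (0.0, 20.0)}

def marker_corners_3d(center, size):
    # 3D corners of one marker on the plane: top-left, top-right, bottom-right, bottom-left.
    cx, cy = center
    s = size / 2.0
    return np.array([[cx - s, cy + s, 0],
                     [cx + s, cy + s, 0],
                     [cx + s, cy - s, 0],
                     [cx - s, cy - s, 0]], dtype=np.float64)

def solve_plane_pose(detected, cam_matrix, dist_coeffs):
    # detected: dict {marker id: (4, 2) array of detected image corners}.
    obj, img = [], []
    for marker_id, corners in detected.items():
        if marker_id in marker_centers:
            obj.append(marker_corners_3d(marker_centers[marker_id], marker_size))
            img.append(np.asarray(corners, dtype=np.float64).reshape(4, 2))
    # Stack the 4n correspondences and solve a single planar PnP problem.
    ok, rvec, tvec = cv2.solvePnP(np.vstack(obj), np.vstack(img),
                                  cam_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_IPPE)
    return ok, rvec, tvec

The wider the layout spans the plane, the better this works, per the ‘sufficiently large’ condition above.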

Another reference on the same topic.

From the ArUco documentation:

It is extremely important to remark that the estimation of the pose using only 4 coplanar points is subject to ambiguity. As shown in the next figure, a marker could project at the same pixels on two different camera locations. In general, the ambiguity can be solved, if the camera is near to the marker. However, as the marker becomes small, the errors in the corner estimation grows and ambiguity comes as a problem.

image

Frequently Asked Questions

When I draw the axis of a marker, the z axis appears flipped or inverted on some occasions.
Answer: This is because of the ambiguity problem. It normally happens because the marker is small, or the corners are estimated without precision. Possible ways to deal with the ambiguity are: try using a different corner refinement method, or try using the MarkerPoseTracker.
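For reference, a minimal detection sketch with subpixel corner refinement enabled (the dictionary, the image file name and the DetectorParameters_create() API of that era are assumptions):

import cv2

# Assumed inputs: the test image from the report and an arbitrary dictionary.
gray = cv2.imread("img_fail.png", cv2.IMREAD_GRAYSCALE)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)

# Detector parameters with subpixel corner refinement enabled
# (contrib aruco API at the time of this issue).
params = cv2.aruco.DetectorParameters_create()
params.cornerRefinementMethod = cv2.aruco.CORNER_REFINE_SUBPIX

corners, ids, rejected = cv2.aruco.detectMarkers(gray, dictionary, parameters=params)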


@hellochenwang

is there a way to detect this kind of optical illusion situation?

Indirectly, I think so. Instead of just detecting the marker(s), track the marker(s) and save the corresponding poses. Since the issue should come from inaccuracy in the corner locations (plus uncertainties about the camera intrinsic parameters and distortion) and should be spurious, it should be possible to detect when the orientation suddenly changes from one image frame to the next. Unfortunately, you will not easily be able to get the other solution (planar pose estimation can give two ambiguous poses, and one pose is eliminated, I think, by keeping the one that gives the minimal reprojection error).
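For illustration, a small sketch of such a frame-to-frame jump detector (the 25 degree threshold is an arbitrary assumption):

import math
import cv2

def rotation_angle_deg(rvec_prev, rvec_cur):
    # Geodesic distance between the two orientations, in degrees.
    R_prev, _ = cv2.Rodrigues(rvec_prev)
    R_cur, _ = cv2.Rodrigues(rvec_cur)
    # Rotation vector of Rprev^T . Rcur; its norm is the angle between the two poses.
    dr, _ = cv2.Rodrigues(R_prev.T @ R_cur)
    return math.degrees(float(cv2.norm(dr)))

def is_orientation_jump(rvec_prev, rvec_cur, threshold_deg=25.0):
    # Flag a probable spurious flip when the orientation changes too much in one frame.
    return rotation_angle_deg(rvec_prev, rvec_cur) > threshold_deg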

This should be similar to what the ArUco library proposes:

Tracking methods to alleviate the ambiguity problem in planar pose estimation.

Another solution could be to filter the poses but this is not a trivial problem:

  • a Kalman filter on the Euler angles is not ideal:
    • gimbal lock
    • rotation is not a linear problem AFAIK (think about the wrap-around issues at ±180°)
  • a possible solution would be to use an extended Kalman filter on quaternions, e.g. Quaternion based Extended Kalman Filter (a much simpler smoothing sketch follows this list)
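Not an EKF, but as an illustration of filtering the rotation without Euler angles, here is a minimal exponential-smoothing sketch that blends consecutive rotations in the rotation group (the smoothing factor alpha is arbitrary):

import cv2
import numpy as np

def smooth_rotation(rvec_prev, rvec_new, alpha=0.3):
    # Apply only a fraction of the relative rotation between the previous and the
    # new estimate (a crude stand-in for a proper quaternion filter).
    R_prev, _ = cv2.Rodrigues(np.asarray(rvec_prev, dtype=np.float64))
    R_new, _ = cv2.Rodrigues(np.asarray(rvec_new, dtype=np.float64))
    dr, _ = cv2.Rodrigues(R_prev.T @ R_new)           # relative rotation as a rotation vector
    R_smooth = R_prev @ cv2.Rodrigues(alpha * dr)[0]  # move a fraction of the way toward the new pose
    rvec_smooth, _ = cv2.Rodrigues(R_smooth)
    return rvec_smooth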

Side note.

After a quick look at the ArUco library, it implements pose from homography / “solvePoseOfCentredSquare()” and some other functions that could be interesting to port into the OpenCV ArUco module. That also makes me think that we should have a generic solvePnP() function that returns all the possible solutions to the pose estimation problem (two solutions for coplanar points, up to four solutions for P3P).

See also:

There is a new version of Apriltag3 released.

They use pose from homography with a method to get the two solutions (estimate_tag_pose_orthogonal_iteration) of the planar pose estimation problem.

What could be done to avoid the frame axis flipping is to use temporal pose consistency. Here is a simplistic demo where I use the pose (from the two solutions) whose rotation part is the closest to the last pose. The reprojection error cannot be used since, with calibration errors and corner location errors, there is no guarantee that the correct solution is the one with the lowest reprojection error.

Video

Since the tag is not flat, the pose ambiguity is more prominent. There is still an issue, maybe a singularity in the implemented method, where the frames of all the tags can be observed abruptly changing orientation. More tests are needed, and maybe the IPPE planar pose estimation method should also be tested.

There is possibly another solution. See this paper; the code is here:

P. Jin, P. Matikainen, and S. S. Srinivasa, “Sensor fusion for fiducial tags: Highly robust pose estimation from single frame rgbd,” 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5770–5776, 2017.

This can be a viable solution since low-cost stereo / depth cameras are available on the market (e.g. Intel RealSense, Structure sensors, …).


Some notes from my understanding of the paper.

This figure summarizes the planar pose ambiguity problem for the fiducial marker pose estimation:

image

Fig. 4: The ambiguity effect can be demonstrated with two rendered cubes in the perspective view. The two cubes are rotated such that two faces are interlaced. The red square in 4a is a simulated projection of a square tag. The red circular regions denote the region of potential corner detection in a noisy scene. 4b is a sketch of the potential resulting 2D projection. The pose can converge to either one of the two faces.

So the idea is to fuse pose estimated using 3D and 2D information:

image

Fig. 6: An abstract visualization of the optimization constraints. The blue curve is the initial pose estimation obtained from the depth plane. The red curves are the ambiguous poses from the RGB image. We constrained the region of optimization based on how well we fit the depth plane.

And a figure that recaps the whole method:

image


Some notes from my experiments to implement this method.

  1. Prerequisite: be able to acquire RGB images and depth maps, and to align the depth map with the RGB image (the same [u,v] coordinates in the RGB and depth images represent the same physical 3D point)

  2. Detect the fiducial markers in the images to get the 2D locations of the corners.

  3. Estimate the plane equation for each marker using 3D data:

    • use a point-in-polygon test to get the pixels inside the marker
    • the authors use a Bayesian plane fitting algorithm
    • instead, I used some code I already have, but there are multiple possibilities for plane fitting
    • the reference I used is described in the “Robotics, Vision and Control: Fundamental Algorithms in MATLAB®” book by Peter Corke (SVD decomposition of the inertia matrix)
    • be careful not to use invalid 3D points (depth = 0)
    • use the plane fitting algorithm in a robust scheme to reject the outliers (see M-estimators)
  4. Compute the 3D coordinates of the corners of each marker in the camera frame:

    • from the [u,v] coordinates, transform to the normalized camera frame, see the undistortPoints() function. Without distortion coefficients, it is simply x = (u - c_x) / f_x, y = (v - c_y) / f_y, giving the ray direction [x, y, 1]^T
    • compute the line-plane intersection: using the plane equation described by a normal n and a point x0 on the plane, the point along the ray is obtained with the scale t = (n . x0) / (n . [x, y, 1]^T)
    • any 3D point P on the plane computed previously follows the equation n . (P - x0) = 0
    • I think it is a little bit different from the authors’ code here and here:
	for i in range(dim[0]):
		x = image_pts[i, 0]
		y = image_pts[i, 1]
		depth = depth_image[y, x] / 1000.0 + 0.00001
		if(depth != 0):
			X = (x - px) * depth / fx
			Y = (y - py) * depth / fy
			Z = computeZ(n, d, X, Y)
			all_depth_points = all_depth_points + [[X, Y, Z]]

and

def computeZ (n, d, x, y):
	sum = n[0] * x + n[1] * y
	z = (d - sum) / n[2]
	return z
  • the next step is to estimate the rigid transformation between the four 3D points expressed in the tag frame and the corresponding 3D points expressed in the camera frame. This is similar to the ICP equations.
  • relevant slides are 1, 2, 3 and 4
  • with the 3D information, we have a first estimation of the tag pose and we can refine it using the 3D-2D correspondences, see the solvePnPRefineLM() or solvePnPRefineVVS() functions (the authors used a constrained optimization function)

So, the key idea is to exploit the 3D information to fit a plane and to try to avoid the planar ambiguity problem by constraining the 3D coordinates to be on the plane. The pose is then refined using non-linear optimization from the 3D-2D correspondences.
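A condensed sketch of these steps, assuming the depth points inside the marker polygon have already been collected and leaving out the robust M-estimator part; the plane fit is a plain SVD and the refinement uses solvePnPRefineLM():

import cv2
import numpy as np

def fit_plane(points_3d):
    # Least-squares plane through the 3D points: returns (unit normal n, point x0 on the plane).
    x0 = points_3d.mean(axis=0)
    # The normal is the singular vector associated with the smallest singular value
    # of the centered point cloud.
    _, _, vt = np.linalg.svd(points_3d - x0)
    return vt[-1], x0

def corners_on_plane(corners_2d, cam_matrix, n, x0):
    # Back-project the 2D marker corners onto the fitted plane (distortion ignored here).
    fx, fy = cam_matrix[0, 0], cam_matrix[1, 1]
    cx, cy = cam_matrix[0, 2], cam_matrix[1, 2]
    pts = []
    for u, v in corners_2d:
        ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])  # normalized camera frame
        t = n.dot(x0) / n.dot(ray)                           # line-plane intersection
        pts.append(t * ray)
    return np.array(pts)

def rigid_transform(src, dst):
    # Rigid transformation (R, t) mapping src points to dst points (SVD / Kabsch).
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:       # reflection case, see the note below
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def fuse_pose(tag_corners_3d, corners_2d, marker_depth_points, cam_matrix, dist_coeffs):
    # Initial pose from the depth data (plane fit + rigid transform),
    # then refinement with the 3D-2D correspondences.
    n, x0 = fit_plane(marker_depth_points)
    cam_corners_3d = corners_on_plane(corners_2d, cam_matrix, n, x0)
    R, t = rigid_transform(np.asarray(tag_corners_3d, dtype=np.float64), cam_corners_3d)
    rvec, _ = cv2.Rodrigues(R)
    tvec = t.reshape(3, 1)
    rvec, tvec = cv2.solvePnPRefineLM(
        np.asarray(tag_corners_3d, dtype=np.float64),
        np.asarray(corners_2d, dtype=np.float64),
        cam_matrix, dist_coeffs, rvec, tvec)
    return rvec, tvec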

Some issues I got:

  • when fitting a plane, the normal can point in two opposite directions. When computing the Z from the plane equation, ensure that Z > 0 (just remove the sign).
  • when estimating the rigid transformation from the 3D-3D correspondences, sometimes we get det(R) = -1. In this case, correct R; see this code:
    # special reflection case
    if linalg.det(R) < 0:
       # print "Reflection detected"
       Vt[2,:] *= -1
       R = dot(Vt.T, U.T)

Some video results I got using an Intel RealSense D435:

  • using OpenCV ArUco (click on the image for the video):

Alt text

  • using AprilTag 3 (it is much harder to get the “axis flipping” issue since AprilTag 3 accurately extracts the corner locations)

Alt text

I realize that it is natural and not a bug! In fact, the Z axis always points toward us and is not flipped (you can check the generated axes). What happens is shown in the picture below: because of low resolution or far distance, you see the XY plane as the same and think that the Z axis is flipped in this scenario.

Some recent fiducial marker methods:


librpp for robust planar pose estimation, used in ARToolkitPlus. ARToolkitPlus was proposed in 2007.

@antithing I don’t think so. Anyway, it was something really simple:

  • the current estimation gives you two rotation vectors
  • choose the estimation whose rotation part is the closest to the previous one

Maybe, to avoid sporadic flipping, you can try using a buffer of the N past observations with a voting / median scheme.

To compute the distance between two rotations, have a look at the literature:

Something like:

  • compute the rotation matrix Rdiff = Rprev . Rcur^T
  • convert using the Rodrigues formula and extract the rotation angle theta from the theta.u vector

It should be equivalent to:

theta = arccos((trace(Rprev . Rcur^T) - 1) / 2)
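A minimal sketch of this rotation distance and of the selection of the candidate pose closest to the previous frame (function and variable names are mine):

import math
import cv2
import numpy as np

def rotation_distance(rvec_a, rvec_b):
    # Angle (in radians) of the relative rotation Ra . Rb^T.
    R_a, _ = cv2.Rodrigues(rvec_a)
    R_b, _ = cv2.Rodrigues(rvec_b)
    R_diff = R_a @ R_b.T
    # Same angle as the norm of the theta.u vector returned by Rodrigues(R_diff).
    cos_theta = (np.trace(R_diff) - 1.0) / 2.0
    return math.acos(np.clip(cos_theta, -1.0, 1.0))

def pick_consistent_pose(candidates, rvec_prev):
    # candidates: list of (rvec, tvec) pairs, e.g. the two planar pose solutions.
    # Keep the one whose rotation is closest to the previous frame's pose.
    return min(candidates, key=lambda rt: rotation_distance(rt[0], rvec_prev))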

You’re absolutely right. The ambiguity can also be resolved if you have a stereo rig or an RGB-D camera. This can work when the marker is in the working range of the stereo rig or RGB-D camera, but if the marker is far away, it can still be impossible to resolve.

My response was assuming you only had a monocular (single lens) camera.

Hello @catree

As your 4 points are on a plane and form a square, there seem to be multiple solutions given the measurement error of the corners.

I think adding one or more points inside the marker will help the solution to be more stable. Adding more points will help since you cannot use cv::aruco::CORNER_REFINE_SUBPIX.

If you have multiple markers, you could use cv::aruco::estimatePoseSingleMarkers, which will solve this instability in the solution.

Regards

David

@antithing

(Note: @paland3 pointed out this code doesn’t work when the center of the marker isn’t at the origin. Haven’t found a fix. Further, I found cases where the 0 < R[1,1] < 1 condition doesn’t work, and haven’t found a fix for that either. Basically the “workaround” I posted is crap.)

  1. 0 < R.at<double>(1, 1) < 1 doesn’t do the same thing as in Python; you need 0 < ... && ... < 1
  2. You’re missing the premultiplication to “make Y the up axis, Z forward”; if you want to keep the coordinate system as is, you’ll need to make changes down the line, e.g. forward needs to be (0, 1, 0), R.at(1,1) should be R.at(2,2) (and maybe negative??), and FlipAxes will need to change… somehow, I can’t juggle these coordinates in my head 😃
  3. tnorm should be T / sqrt(...)
  4. Maybe double-check your cross-product calculation, or better yet, use tnorm.cross(forward)

@valtron Thanks for the workaround. As far as I can tell, the “flipping” also messes up the translation (the origin of the coordinate system is predicted closer or further because of the rotation). Can you explain how your code addresses that?

Also, what does the angle * axis part produce?

EDIT: I implemented the workaround in Matlab, with some comments; it might help someone.

function R = FixRotMatrix(R, T)

SwapAxes = [1 0 0;...
            0 0 1;...
            0 -1 0]; %swap axes

R = R*SwapAxes;

FlipAxes = [1 -1 1;... %flip the axes
            1 -1 1;...
            -1 1 -1];


R = R .* FlipAxes;

tnorm = T'/norm(T); %norm the translation vector
forward = [0 0 1]; % Z axis

ax = cross(tnorm, forward); %cross product of the pattern's T vector and Z

angle = -2*acos(forward*tnorm); %get the angle from the dot product

M = makehgtform('axisrotate', ax, angle); %rotate around axis 'ax' with angle 'angle'
RotR = M(1:3, 1:3); %get the rotation matrix part

R = RotR*R; %Rotate the rotation matrix to the new, good orientation
R = R*SwapAxes'; %Swap back the axes
end

As it only rotates the coordinate system, it won’t fix the translation error. Does anyone have a solution for that? Thanks again @valtron, you saved me a lot of time!