photutils: PSFPhotometry classes are really slow

I’m testing the DAOPhotPSFPhotometry class with a single iteration, which as far as I understand should equal a single run of the allstar IRAF task.

Even with fixed coordinates (obtained with a previous run of the star finder) in the IntegratedGaussianPRF psf_model I use, DAOPhotPSFPhotometry takes around 10 min to complete for ~900 stars in a modest laptop. The allstar task on the other hand, processing the same number of stars, takes just a few seconds.

Is this expected or is there something wrong with my parameter settings?

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 26 (22 by maintainers)

Most upvoted comments

Appears I completely forgot to actually reply to this thread when I was assigned this issue to investigate way back, sorry!

I actually did run some profiling – at least for zeroth order information on where the problem might lie – so I’ve dug out my old scripts, updated a few things, and thought I’d just (finally) write up the results.

I’ve compiled a fairly basic test: generate a few hundred IntegratedGaussianPRF sources on a ~1000x1000 pixel array via make_gaussian_prf_sources_image, add some background counts, add Poissonian noise with make_noise_image, and run a very similar pipeline to that in the documentation: find_peaks to generate a “real world” input data table, MMMBackground to get the background counts to set threshold, then call DAOPhotPSFPhotometry (as per the original comment on this issue, although that’s really just IterativelySubtractedPSFPhotometry; I also kept niter=3 by default, instead of limiting it to a single loop, but I’m not sure if it needs all of them).

For no particular reason I’ve run 5 loops of two different versions: one in which both __init__ and __call__ are run inside the loop, and one where the photometry is initialised outside the loop and just called Nloop times; this doesn’t seem to make any difference but I’ve just not bothered changing it from when I tested that before, the runtime is basically entirely within the call.

I’ve profiled the calls with line_profiler, and attach the results at the bottom of the post.

The end result is that DAOPhotPSFPhotometry via IterativelySubtractedPSFPhotometry, with no init_guesses supplied, subtracts a background and then calls self._do_photometry. The background subtraction takes ~1% of the (profiled) runtime, so we go a step deeper into _do_photometry. Here we see two calls to self.finder (in this instance DAOStarFinder), taking 3.5% of the runtime within _do_photometry between them, a percent runtime for self.group_maker (DAOGroup here), and 95% runtime for super().nstar.

Once again, following the overwhelming majority of the runtime, we can profile nstar (this time from BasicPSFPhotometry), and find 2% of THIS runtime (still 94% the total __call__ time) taken by get_grouped_psf_model, 3% from self._model_params2table, 2.5% from result_tab = vstack..., 5% from unc_tab = vstack..., and, the two major contributors: 32% from self.fitter and 53.5% from subtract_psf.

I can’t speak to the fitter, which is the default LevMarLSQFitter, but >half the runtime is from subtract_psf. Here, unfortunately, we hit a slightly less obvious culprit; subtract_psf is split evenly within its call: 2.5% from subbeddata = add_array... 4% relative runtime from psf.copy() 28% indices = np.indices(data.shape) 21.5% subbeddata = data.copy() 21.5% (twice) for y = extract_array... and x = extract_array...

EDIT: with scipy.__version__ of 1.5.2 the astropy LevMarLSQFitter calls spend 81% of their time in scipy.optimize.leastsq, with 7% in _validate_model, 2% in _model_to_fit_params, 3.5% in _fitter_to_model_params, and 5.5% in sum_sqrs = np.sum.... The trail for that ~third of the runtime goes dead for me there because I don’t particularly understand the scipy MINPACK wrappers, and figure the line profiling will break down at this point. I also assume that, being wrappers for some low-level language functions, these calls are fairly well optimised already.

So it seems that half of the runtime is from the subtraction of PSFs to form the next iteration’s residual image, but split four ways across four calls.

It’s printed in the script outputs, but for completeness, the versions of a few important packages are: photutils: 1.1.0.dev127+g2bba1caa astropy: 4.0.2 numpy: 1.19.2; photutils is just the master branch up to 1140 being merged.

Attached are (in .txt because it wont’ support .py apparently) the script I wrote to generate/fit test data, the output of the line profiling, and the print outputs from the script itself to verify the timings and whatnot: psf_photometry_speed.txt speed_results.txt script_outputs.txt

I wrote a new PSFPhotometry (and IterativePSFPhotometry) class from the ground up. It was released in 1.9.0 and is significantly faster (factors up to 200x). I plan to improve this further with multiprocessing in a future release. The old BasicPSFPhotometry and IterativelySubtractedPSFPhotometry classes were deprecated and should no longer be used (they will be removed in a couple releases).

Ah thanks, I forgot those were astropy functions. Running a quick profile on those two as well:

extract_array (21.5% x2 of subtract_psf): 12% isscalar(shape) check 2.5% isscalar(position) check 2% extracted_array = array_large[large_slices] 1.5% if (extracted_array.shape != shape) and (mode == 'partial'): 78.5% overlap_slices (perhaps unsurprisingly)

add_array (2.5% runtime): 7% the various checks on large_shape and small_shape within the if statement 75% overlap_slices 15% array_large[large_slices] += array_small[small_slices]

So those are mostly dominated by the overlap_slices calculation.

All this said, the way the profiler reports the timings there is loss in the system. For example, in the above speed_results.txt, nstar says its 8870 hits of line 422 are 242 seconds, but the total time we’re actually in subtract_psf is reported as 172 seconds. That could be overhead that’s actually in the system (i.e., time to enter/exit the function, return the parameters, pass by value, etc.), or overhead that the profiler is introducing (although it claims it subtracts its own overhead off the reported numbers, iirc).

This didn’t really matter in the above comment, as those runtimes were sufficiently large that the time spent in the function was the largest term (although 172 vs 242 is only 71% of the total time to call image = subtract_psf(image, self.psf_model, param_table, subshape=self.fitshape)…).

However, for the extra profiling I ran just now (for 200 sources, not 900), the profiler reports that subtract_psf spends 16 seconds calling the two extract_array functions, and 1.02 seconds calling add_array, but then reports that it is actually in extract_array for 0.63s, and in add_array for 0.24s! So actually it seems that overlap_slices (which has total of 0.7s runtime) is NOT the important part, but something within the function calls for those two lines. (This profiling can be seen in speed_results2.txt)

In full, those are: subbeddata = add_array(subbeddata, -psf(x, y), (y_0, x_0)) and y = extract_array(indices[0].astype(float), subshape, (y_0, x_0)) x = extract_array(indices[1].astype(float), subshape, (y_0, x_0))

Figuring the only actual code being run in the call is the astype(float) conversion I moved that out to its own line, adding inds_float = indices.astype(float) and then calling extract_array with inds_float[1]. As can be seen in speed_results3.txt, this moves 39% of the run time to the float conversion, which combines with indices = np.indices(data.shape) to 70% of subtract_psf’s runtime – with 22% left to the data.copy() a line below it.

Thus if something can be done to remove any, or all, of

    indices = np.indices(data.shape)
    inds_float = indices.astype(float)
    subbeddata = data.copy()

in subtract_psf the PSF fitting will get a big speed up. It can’t get much more than 3x faster unfortunately without improving the speed of the fitter we’re using by default, but on my 2019 macbook pro 13" I was fitting 900 sources in ~50 seconds so hopefully we’re already well optimised from the 10 minutes reported previously.

nope, I didn’t dig deep enough (neither saved the reports), but with a quick glance the output looked very similar to what I got with 3.2.1. Master run for ~17s while 3.2.1 ~16s.

Many thanks, @Onoddil! This is great.

@pllim That’s an old repo that had only some aperture photometry benchmarks. IIRC, it was running on @astrofrog 's personal mac mini. I’ll add it to the backlog to consider a more comprehensive set of benchmarks (using asv).

@fivestars95 Not yet, but it’s something I hope to have time to look at in the coming months.

fyi: the tbl = phot(data, init_guesses=init_tbl) step from that notebook hasn’t got quicker using post modeling refactor (though the PR linked above hasn’t been merged).

versions used:

In [25]: print(numpy.__version__, astropy.__version__, photutils.__version__)                                                                   
1.18.0.dev0+ea965e4 4.0.dev25533 0.7.dev3836

Thanks, @pllim!