cmssw: HLT crash caused by `SiPixelDigisClustersFromSoA` (run 357271)

In run 357271, one HLT job crashed with the following error message:

[2] Calling method for module SiPixelDigisClustersFromSoA/'hltSiPixelClustersFromSoA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 344794116

(the monitoring tool does not provide the full error message from cmsRun, afaik)

The error is reproducible (see recipe below). Since it originates from the GPU branch of the reconstruction sequence, it can be reproduced only on a machine with a GPU. The input file is currently on lxplus.

FYI: @fwyzard @silviodonato

cmsrel CMSSW_12_4_6
cd CMSSW_12_4_6/src
cmsenv

hltGetConfiguration \
  run:357271 \
  --globaltag 124X_dataRun3_HLT_v4 \
  --process HLT \
  --data \
  --unprescale \
  --output all \
  --input file:/afs/cern.ch/work/m/missirol/public/fog/edm_run357271_ls1351.root \
  > hlt.py

cmsRun hlt.py &> hlt.log

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 57 (56 by maintainers)

Most upvoted comments

Status as of today: https://github.com/cms-sw/cmssw/pull/39711 is merged. We will wait for this evening's IBs, then merge the 12_4 PR once they complete, and cut a 12_4_10_patch1.

Having tested the configuration posted by @missirol above with a dummy fix in which we sort the digis in SiPixelDigisClustersFromSoA by the raw id:

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..d5d2ae8a0c6 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -17,6 +17,8 @@
 #include "Geometry/Records/interface/TrackerTopologyRcd.h"
 #include "Geometry/CommonTopologies/interface/SimplePixelTopology.h"
 
+#include <numeric>
+
 // local include(s)
 #include "PixelClusterizerBase.h"
 #include "SiPixelClusterThresholds.h"
@@ -143,7 +145,13 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
       spc.abort();
   };
 
-  for (uint32_t i = 0; i < nDigis; i++) {
+  std::vector<uint32_t> sortIdxs(nDigis);
+  std::iota(sortIdxs.begin(), sortIdxs.end(), 0);
+  std::sort(
+      sortIdxs.begin(), sortIdxs.end(), [&](int32_t const i, int32_t const j) { return digis.rawIdArr(i) > digis.rawIdArr(j); });
+
+  for (uint32_t id = 0; id < nDigis; id++) {
+    auto i = sortIdxs[id];

everything ran smoothly. I don’t see any obvious drawback (the sorting itself takes $0.71 \pm 0.12 \text{ms}$ on a GPU machine at P5). If this makes sense, I can quickly open a PR.
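The idea behind the dummy fix can be sketched outside CMSSW as follows. This is a minimal illustration under stated assumptions, not the code of the patch: `sortedIndicesByRawId` and `countRuns` are hypothetical helpers, the raw ids are held in a plain `std::vector` instead of the digi SoA, and the sketch uses an ascending `std::stable_sort` (the patch above uses a descending `std::sort`) so that digis keep their relative order within a module.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of CMSSW): returns the permutation of digi
// indices that orders the digis by raw detector id.
std::vector<uint32_t> sortedIndicesByRawId(const std::vector<uint32_t>& rawIds) {
  std::vector<uint32_t> idx(rawIds.size());
  std::iota(idx.begin(), idx.end(), 0u);
  // stable_sort preserves the original relative order of digis with equal raw id
  std::stable_sort(idx.begin(), idx.end(),
                   [&](uint32_t i, uint32_t j) { return rawIds[i] < rawIds[j]; });
  return idx;
}

// Hypothetical helper: counts the contiguous runs of equal raw ids seen when
// the digis are visited in the given order. If every module is contiguous,
// this equals the number of distinct raw ids.
std::size_t countRuns(const std::vector<uint32_t>& rawIds, const std::vector<uint32_t>& order) {
  std::size_t runs = 0;
  for (std::size_t k = 0; k < order.size(); ++k) {
    if (k == 0 || rawIds[order[k]] != rawIds[order[k - 1]])
      ++runs;
  }
  return runs;
}
```

Looping over the returned indices instead of `0..nDigis-1` visits each raw id in a single contiguous block, which is what the cluster-filling step relies on.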

+        printf("wordGPU %d %d %d \n",i, ww, rawId);

Shouldn’t that be

    printf("wordGPU %u %u %u \n", i, ww, rawId);

I’m surprised the compiler didn’t warn; modern compilers are generally pretty good at spotting mismatches between printf format specifiers and argument types.
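For reference, the conversion specifier for an `unsigned int` is `%u`, and for fixed-width types the `PRIu32` macro from `<cinttypes>` avoids guessing the underlying type. The sketch below assumes the three values are `uint32_t`, which the thread does not state; `formatWord` is a hypothetical helper written only for this illustration.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats three unsigned 32-bit values in the layout of
// the debug line quoted above. The uint32_t argument types are an assumption.
std::string formatWord(uint32_t i, uint32_t ww, uint32_t rawId) {
  char buf[64];
  // PRIu32 expands to the correct printf conversion for uint32_t on this platform
  std::snprintf(buf, sizeof(buf), "wordGPU %" PRIu32 " %" PRIu32 " %" PRIu32, i, ww, rawId);
  return std::string(buf);
}
```

With GCC or Clang, `-Wformat` (enabled by `-Wall`) is the option that diagnoses format and argument mismatches in host code.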

I did a bit of investigation and the code is crashing in https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_6/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114, where edmNew::DetSetVector<SiPixelCluster>::FastFiller is constructed. The constructor eventually ends up calling edmNew::DetSetVector<T>::addItem, which in turn calls edmNew::dstvdetails::errorIdExists, where the exception is thrown. The reason for the exception is that edmNew::DetSetVector<SiPixelCluster>::FastFiller assumes that each DetId is filled only once, i.e. that all digis of a given DetId are contiguous in the loop. Instead, what I observe in run: 357271 lumi: 1351 event: 2627443577 are DetIds that are not contiguous in the loop over digis in SiPixelDigisClustersFromSoA. Here is the printout:

>> DetId digi: 344659972 39690
>> DetId digi: 344659972 39691
>> DetId digi: 344659972 39692
>> DetId digi: 344659972 39693
>> DetId: 344659972
>>>> clusId: 0
>> DetId digi: 344794116 39694
>> DetId digi: 344794116 39695
>> DetId digi: 344794116 39697
>> DetId digi: 344794116 39698
>> DetId digi: 344794116 39699
>> DetId digi: 344794116 39700
>> DetId digi: 344794116 39701
>> DetId digi: 344794116 39702
>> DetId digi: 344794116 39703
>> DetId digi: 344794116 39704
>> DetId digi: 344794116 39705
>> DetId digi: 344794116 39706
>> DetId: 344794116
>>>> clusId: 0
>>>> clusId: 1
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
cluster below charge Threshold Layer/DetId/clusId 0/344794116/1 size/charge 1/3741
%MSG
>>>> clusId: 2
>>>> clusId: 3
>>>> clusId: 4
>>>> clusId: 5
>>>> clusId: 6
>>>> clusId: 7
>> DetId digi: 344795140 39707
>> DetId digi: 344795140 39708
>> DetId: 344795140
>>>> clusId: 0
>>>> clusId: 1
>> DetId digi: 344794116 39710
>> DetId digi: 344794116 39711
>> DetId digi: 344794116 39712
>> DetId digi: 344794116 39713
>> DetId digi: 344794116 39714
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
Problem det present twice in input! 344794116
%MSG
>> DetId digi: 344795140 39716
>> DetId digi: 344795140 39717
>> DetId digi: 344795140 39718
>> DetId digi: 344795140 39719
>> DetId digi: 344795140 39720
>> DetId digi: 344795140 39721
>> DetId digi: 344795140 39722
>> DetId digi: 344795140 39723
>> DetId digi: 344795140 39724
>> DetId digi: 344795140 39725
>> DetId digi: 344795140 39726
>> DetId digi: 344795140 39727
>> DetId digi: 344795140 39728
>> DetId digi: 344795140 39729
>> DetId digi: 344795140 39730
>> DetId digi: 344795140 39731
>> DetId digi: 344795140 39732
>> DetId digi: 344795140 39733
>> DetId digi: 344795140 39734
>> DetId digi: 344795140 39735
%MSG-w SiPixelDigisClustersFromSoA:   SiPixelDigisClustersFromSoA:hltSiPixelClustersFromSoA  14-Aug-2022 18:48:14 CEST Run: 357271 Event: 2627443577
Problem det present twice in input! 344795140
%MSG

obtained by adding the following changes to SiPixelDigisClustersFromSoA:

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..c0328d665b0 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -43,6 +43,8 @@ private:
   const bool produceDigis_;
   const bool storeDigis_;
   const bool isPhase2_;
+  const std::string moduleType_;
+  const std::string moduleLabel_;
 };
 
 SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet& iConfig)
@@ -53,7 +55,9 @@ SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet
                          iConfig.getParameter<int>("clusterThreshold_otherLayers")},
       produceDigis_(iConfig.getParameter<bool>("produceDigis")),
       storeDigis_(iConfig.getParameter<bool>("produceDigis") & iConfig.getParameter<bool>("storeDigis")),
-      isPhase2_(iConfig.getParameter<bool>("isPhase2")) {
+      isPhase2_(iConfig.getParameter<bool>("isPhase2")),
+      moduleType_(iConfig.getParameter<std::string>("@module_type")),
+      moduleLabel_(iConfig.getParameter<std::string>("@module_label")) {
   if (produceDigis_)
     digiPutToken_ = produces<edm::DetSetVector<PixelDigi>>();
 }
@@ -111,10 +115,18 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
   auto fillClusters = [&](uint32_t detId) {
     if (nclus < 0)
       return;  // this in reality should never happen
+    if (outputClusters->exists(detId)) {
+      edm::LogWarning("SiPixelDigisClustersFromSoA")
+              << "Problem det present twice in input! " << detId;
+      nclus = -1;
+      return;
+    }
     edmNew::DetSetVector<SiPixelCluster>::FastFiller spc(*outputClusters, detId);
+    std::cout << ">> DetId: " << detId << std::endl;
     auto layer = (DetId(detId).subdetId() == 1) ? ttopo.pxbLayer(detId) : 0;
     auto clusterThreshold = clusterThresholds_.getThresholdForLayerOnCondition(layer == 1);
     for (int32_t ic = 0; ic < nclus + 1; ++ic) {
+      std::cout << ">>>> clusId: " << ic << std::endl;
       auto const& acluster = aclusters[ic];
       // in any case we cannot  go out of sync with gpu...
       if (acluster.charge < clusterThreshold and !isPhase2_)
@@ -143,6 +155,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
       spc.abort();
   };
 
+  std::cout << "===========================================================================" << std::endl;
+  std::cout << moduleType_ << ":" << moduleLabel_ << " START" << std::endl;
+  std::cout << "===========================================================================" << std::endl;
   for (uint32_t i = 0; i < nDigis; i++) {
     // check for uninitialized digis
     if (digis.rawIdArr(i) == 0)
@@ -172,6 +187,7 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
         }
       }
     }
+    std::cout << ">> DetId digi: " << detId << " " << i << std::endl;
     PixelDigi dig(digis.pdigi(i));
     if (storeDigis_)
       (*detDigis).data.emplace_back(dig);
@@ -186,7 +202,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
     SiPixelCluster::PixelPos pix(row, col);
     aclusters[digis.clus(i)].add(pix, digis.adc(i));
   }
-
+  std::cout << "===========================================================================" << std::endl;
+  std::cout << moduleType_ << ":" << moduleLabel_ << " END" << std::endl;
+  std::cout << "===========================================================================" << std::endl;
   // fill final clusters
   if (detId > 0)
     fillClusters(detId);

The root cause of the problem therefore seems to be somewhere else, presumably in either SiPixelDigisSoAFromCUDA or SiPixelRawToClusterCUDA. @VinInn or @AdrianoDee might have a better idea of what the cause could be.
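The failure mode seen in the printout can be reproduced in isolation with a toy model. This is only a sketch under simplifying assumptions: `OncePerIdCollection` is a hypothetical stand-in that rejects an id opened a second time, it is not `edmNew::DetSetVector`, and `firstRepeatedId` opens an entry when a new raw id is first seen rather than when the previous one ends, as the real `produce` loop does.

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a collection that, like the FastFiller described
// above, refuses to open the same id twice. It is NOT the CMSSW class.
class OncePerIdCollection {
public:
  void open(uint32_t id) {
    if (!seen_.insert(id).second)
      throw std::runtime_error("index already in collection: " + std::to_string(id));
  }

private:
  std::set<uint32_t> seen_;
};

// Walks the raw ids in order and opens a new entry each time the id changes.
// Returns the first id that gets opened twice, or 0 if every id is contiguous.
uint32_t firstRepeatedId(const std::vector<uint32_t>& rawIds) {
  OncePerIdCollection out;
  uint32_t current = 0;
  for (uint32_t id : rawIds) {
    if (id == 0)
      continue;  // skip uninitialized digis
    if (id != current) {
      try {
        out.open(id);
      } catch (const std::runtime_error&) {
        return id;  // same module seen again after another one: the crash condition
      }
      current = id;
    }
  }
  return 0;
}
```

Feeding it the module order from the printout (344659972, 344794116, 344795140, then 344794116 again) returns 344794116, the same index value reported in the exception message.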