bookkeeper: Missing ledgers when inducing network packet loss in Bookkeeper 4.15
BUG REPORT
Describe the bug
During our tests of Pravega (https://cncf.pravega.io/), we have detected a regression after upgrading to Bookkeeper 4.15 (https://github.com/pravega/pravega/pull/6676). The regression is consistently reproducible when doing a fault injection test (30% network packet loss). Note that this kind of test consistently pass using Bookkeeper 4.14, for which we believe that is a regression or a change in the behavior of the new Bookkeeper release.
A reproduction of this issue goes as follows:
- Pravega creates a ledger, in this case ledger
3246
:
2022-07-20T14:45:25.706422066+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 127,1 replyHeader:: 127,38654707364,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,#426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,v{s{31,s{'world,'anyone}}},0 response:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246
2022-07-20T14:45:25.706687124+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [ZKC-connect-executor-0-EventThread] INFO o.a.b.client.LedgerCreateOp - Ensemble: [bookkeeper-bookie-2-11580, bookkeeper-bookie-0-28816] for ledger: 3246
2022-07-20T14:45:25.706725535+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [core-13] INFO i.p.s.s.i.bookkeeper.BookKeeperLog - Log[13]: Created Ledger 3246.
2022-07-20T14:45:47.426112317+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 145,4 replyHeader:: 145,38654707423,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,s{38654707364,38654707364,1658328325290,1658328325290,0,0,0,0,172,0,38654707364}
2022-07-20T14:45:47.426207948+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - Registered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246.
- Probably after some network failures induced by the test, we see that this ledger is unregistered by Bookkeeper client (has it been actually removed from Zookeeper?):
2022-07-20T14:45:47.426475848+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 146,4 replyHeader:: 146,38654707423,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,T response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,s{38654707364,38654707364,1658328325290,1658328325290,0,0,0,0,172,0,38654707364}
2022-07-20T14:45:47.426637438+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.client.PendingReadLacOp - Read LAC complete with enough validResponse for ledger: 3246 LAC: -1
2022-07-20T14:45:47.426737033+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [core-13] DEBUG o.a.b.meta.AbstractZkLedgerManager - Unregistered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246.
2022-07-20T14:46:16.612224510+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write Add request to bookie 192.169.1.60/192.169.1.60:3181 for ledger 3246, entry 0
2022-07-20T14:46:16.612271701+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR o.a.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum failed: L3246 E0
2022-07-20T14:46:16.616440402+00:00 stdout F 2022-07-20 14:46:16,616 7094900 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write ReadLAC request to bookie null for ledger 3246, entry -1
- After that, Pravega still assumes that the ledger is there (as Pravega itself has not deleted it), but any further operation on this ledger fails:
2022-07-20T14:46:16.630253410+00:00 stdout F 2022-07-20 14:46:16,629 7094913 [core-30] DEBUG o.a.bookkeeper.client.LedgerHandle - Failed to close ledger 3246 :
2022-07-20T14:46:16.630795088+00:00 stdout F 2022-07-20 14:46:16,630 7094914 [core-30] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[13]: Unable to close LedgerHandle for Ledger 3246.
2022-07-20T14:46:16.630795088+00:00 stdout F io.pravega.segmentstore.storage.DurableDataLogException: Unable to close ledger 3246.
- This makes that, when finding the ledger missing, that Segment Container in Pravega is stuck and cannot recover:
2022-07-20T15:59:48.793802385+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.bookkeeper.client.PendingReadOp - No such entry found on bookie. L3246 E0 bookie: bookkeeper-bookie-0-28816
2022-07-20T15:59:48.793815178+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR o.a.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L3246 E0-E0, Sent to [bookkeeper-bookie-0-28816, bookkeeper-bookie-2-11580], Heard from [] : bitset = {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc = null)
2022-07-20T15:59:48.793906376+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [core-1] INFO i.p.s.s.c.StreamSegmentContainerMetadata - SegmentContainer[13]: Exit RecoveryMode.
...
2022-07-20T15:59:48.794030332+00:00 stdout F 2022-07-20 15:59:48,794 11507078 [core-8] INFO i.p.s.s.r.StreamSegmentReadIndex - ReadIndex[13-123]: Closed.
2022-07-20T15:59:48.794605613+00:00 stdout F 2022-07-20 15:59:48,794 11507078 [core-1] ERROR i.p.s.server.logs.DurableLog - DurableLog[13] Recovery FAILED.
2022-07-20T15:59:48.794605613+00:00 stdout F org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists on Bookies
2022-07-20T15:59:48.794605613+00:00 stdout F at org.apache.bookkeeper.client.BKException.create(BKException.java:74)
2022-07-20T15:59:48.794605613+00:00 stdout F at org.apache.bookkeeper.client.PendingReadOp.submitCallback(PendingReadOp.java:654)
2022-07-20T15:59:48.794605613+00:00 stdout F at org.apache.bookkeeper.client.PendingReadOp$LedgerEntryRequest.fail(PendingReadOp.java:176)
2022-07-20T15:59:48.794605613+00:00 stdout F at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.sendNextRead(PendingReadOp.java:397)
2022-07-20T15:59:48.794605613+00:00 stdout F at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.logErrorAndReattemptRead(PendingReadOp.java:440)
- At the Bookies side, we can see that the fence request is received by one Bookie (bookie-2) at
2022-07-20T14:48:11
:
2-bookie-9c88965c-2022-07-20T14-49-58Z.log.gz:2022-07-20T14:48:11.157496389+00:00 stdout F 14:48:11,157 INFO Ledger fence request received for ledger: 3246 from address: /192.168.0.228:57782
2-bookie-9c88965c-2022-07-20T14-49-58Z.log.gz:2022-07-20T14:48:11.160872045+00:00 stdout F 14:48:11,160 INFO Ledger fence request received for ledger: 3246 from address: /192.168.0.228:57782
- And that ledger seems to be present in the metadata of ledgers, at least for some time (look for
L3246
):
2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.716729331+00:00 stdout F 14:51:48,716 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 36287,8 replyHeader:: 36287,38654709412,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L0377,'L3089,'L3087,'L0372,'L3052,'L3173,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3290,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3297,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L3310,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3328,'L3208,'L3329,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'L3093,'L3091,'L0388,'L0702,'L0704,'L0703,'L2327,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L2452,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L0710,'L3305,'L3115,'L3236,'L3116,'L3237,'L0762,'L3113,'L3234,'L3114,'L3233,'L3230,'L3110,'L3231,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L0771,'L3126,'L1983,'L3127,'L3248,'L3245,'L3125,'L3246,'L3122,'L3243,'L3123,'L3244,'L3120,'L3000,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3104,'L3105,'L3226,'L3102,'L3103,'L3224,'L3101,'L2130,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3107,'L3030,'L3151,'L3031,'L3152,'L3273,'L3150,'L3038,'L3159,'L2984,'L3278,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L2064,'L3032,'L3153,'L3274,'L3033,'L3154,'L3275,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3283,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L2971,'L3147,'L3024,'L3145,'L3142,'L3143,'L2979,'L2735,'L2974,'L3029}
2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.736027964+00:00 stdout F 14:51:48,735 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 36521,4 replyHeader:: 36521,38654709412,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.070507213+00:00 stdout F 14:55:21,070 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 36739,8 replyHeader:: 36739,38654709965,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L0377,'L3089,'L3087,'L0372,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3329,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'L3093,'L3091,'L0388,'L0702,'L0704,'L0703,'L2327,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L2452,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L0710,'L3305,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3031,'L3152,'L3273,'L3150,'L3390,'L3038,'L3159,'L2984,'L3278,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L2064,'L3032,'L3153,'L3274,'L3033,'L3154,'L3275,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3370,'L3137,'L3379,'L3017,'L3014,'L3135,'L3377,'L3015,'L3136,'L3378,'L3012,'L3375,'L3134,'L3010,'L3131,'L3252,'L3373,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3383,'L3380,'L3381,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3387,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.090055084+00:00 stdout F 14:55:21,090 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 36970,4 replyHeader:: 36970,38654709965,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.095978533+00:00 stdout F 14:55:21,095 INFO Following ledgers: [3329, 771, 3076, 3077, 3081, 3338, 2064, 3089, 3346, 3091, 3092, 3348, 1814, 3094, 3350, 2327, 3095, 3351, 3096, 3352, 2073, 3354, 3101, 3102, 3104, 3105, 3107, 3108, 3114, 3115, 3116, 3118, 3119, 2610, 3122, 3123, 3125, 3130, 3131, 3132, 3134, 3135, 3136, 3139, 3140, 3141, 3143, 3147, 3148, 3150, 3151, 3152, 3153, 3155, 3156, 3159, 3160, 3162, 3163, 3164, 3165, 3166, 3167, 3169, 3171, 3172, 3175, 3176, 3177, 3182, 3183, 3184, 372, 3199, 3202, 388, 3205, 390, 3208, 3209, 3211, 2957, 3214, 2452, 3224, 3226, 2971, 3230, 671, 3231, 3233, 3236, 1705, 3243, 2988, 3244, 3245, 3246, 2735, 692, 2485, 694, 2999, 3000, 702, 447, 703, 704, 707, 708, 710, 3014, 712, 713, 3273, 714, 3274, 3277, 3278, 463, 3025, 3281, 3029, 3031, 3032, 3033, 3291, 732, 3037, 734, 3038, 3295, 736, 3040, 737, 3041, 738, 3042, 739, 3043, 740, 741, 3047, 744, 745, 3305, 3050, 3306, 3307, 3052, 3308, 3053, 3054, 3055, 3057, 3058, 756, 3061, 3317, 758, 759, 3063, 3064, 3065, 3066, 763, 3068, 765, 3069, 767, 3071, 3327] of bookie: [bookkeeper-bookie-0-28816] are identified as underreplicated
2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:19.912265378+00:00 stdout F 14:56:19,912 DEBUG Reading reply session id: 0x1016bac9b630005, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 6,8 replyHeader:: 6,38654710371,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L3401,'L0377,'L3400,'L3089,'L3087,'L0372,'L3407,'L3408,'L3405,'L3406,'L3403,'L3404,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'L3093,'L3091,'L3412,'L3413,'L0388,'L3410,'L3411,'L0702,'L0704,'L3418,'L0703,'L3419,'L2327,'L3416,'L3417,'L3414,'L3415,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L3423,'L3424,'L3421,'L3422,'L2452,'L3420,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L3425,'L0710,'L3305,'L3426,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3393,'L3031,'L3152,'L3273,'L3391,'L3150,'L3392,'L3038,'L3159,'L2984,'L3278,'L3399,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L3398,'L2064,'L3032,'L3153,'L3274,'L3395,'L3033,'L3154,'L3275,'L3396,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3378,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:19.932194546+00:00 stdout F 14:56:19,932 DEBUG Reading reply session id: 0x1016bac9b630005, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 12,8 replyHeader:: 12,38654710371,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L3401,'L0377,'L3400,'L3089,'L3087,'L0372,'L3407,'L3408,'L3405,'L3406,'L3403,'L3404,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'L3093,'L3091,'L3412,'L3413,'L0388,'L3410,'L3411,'L0702,'L0704,'L3418,'L0703,'L3419,'L2327,'L3416,'L3417,'L3414,'L3415,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L3423,'L3424,'L3421,'L3422,'L2452,'L3420,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L3425,'L0710,'L3305,'L3426,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3393,'L3031,'L3152,'L3273,'L3391,'L3150,'L3392,'L3038,'L3159,'L2984,'L3278,'L3399,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L3398,'L2064,'L3032,'L3153,'L3274,'L3395,'L3033,'L3154,'L3275,'L3396,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3378,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:32.051382550+00:00 stdout F 14:56:32,051 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 40548,4 replyHeader:: 40548,38654710417,0 request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
- But the ledger does not exist when we attempt to read from it:
0-bookie-adfe5832-2022-07-20T18-23-58Z.log.gz:2022-07-20T15:59:48.625362817+00:00 stdout F 15:59:48,625 INFO No ledger found while reading entry: 0 from ledger: 3246
2-bookie-9c88965c-2022-07-20T17-26-08Z.log.gz:2022-07-20T15:59:48.624934442+00:00 stdout F 15:59:48,624 INFO No ledger found while reading entry: 0 from ledger: 3246
...
2-bookie-9c88965c-2022-07-20T17-26-08Z.log.gz:2022-07-20T16:39:42.753841016+00:00 stdout F 16:39:42,753 INFO No ledger found while reading entry: 0 from ledger: 3246
In my understanding, from the Pravega point of view ledger 3246
is created and not explicitly deleted by Pravega itself, which means that there has been a data loss, most likely induced by some corner case exposed via network packet loss. In normal conditions, where there are no induced packet loss, things work fine. But it does not mean that, in a real cluster, a similar network failure could happen and the same issue could be reproduced. Again, we have been doing fault injection tests with Bookkeeper 4.14 and probably previous versions as well (e.g., Bookkeeper 4.11), and this is the first time we encounter this behavior.
To Reproduce
Steps to reproduce the behavior:
- Write to Bookkeeper using Pravega (https://github.com/pravega/pravega).
- Introduce 30% network packet loss with LitmusChaos (see https://www.youtube.com/watch?v=EU_g8cKa1G8 for a demo on how to reproduce this experiment)
- Consistently, Pravega gets a
No Such Ledger Exception
in some containers using Bookkeeper 4.15 (with Bookkeeper 4.14, the system is able to work under such conditions)
Expected behavior
Bookkeeper 4.15 should offer a similar behavior in how ledgers are managed compared to Bookkeeper 4.14 (and previous versions).
Screenshots
n/a
Additional context
At the moment, there are 2 changes that could be related to this issue (but I cannot confirm that they actually cause the problem):
- Upgrade of Zookeeper library from 3.6.x to 3.8.x.
- This PR in 4.15 and not in 4.14 seems related to how ledgers are managed: https://github.com/apache/bookkeeper/pull/3361
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 22 (22 by maintainers)
I am working on this. I am preparing a set of Pravega docker images to pass to Dell-EMC Quality Eng team who can run the tests.
The. docker images contain a build of Pravega compatible with BK 4.15 (thanks to @RaulGracia ) but with the BK client compiled at different points in the GIT history of 4.15.0.
This way we will see which commit broke the compatibility.
It will take much time because those system tests take at least 24h before failing
@zymap thanks a lot for your analysis. Yes, you are right, we are using 4.14.2 concretely, as the last commit we have from mainline is . And we do not have that PR #2794 when Bookkeeper works well with our fault injection tests. Maybe we could try another test with latest 4.14 commit, to see if the same issue is reproduced.
FYI @zymap @eolivelli @dlg99 we have performed the same test again with Bookkeeper 4.14.3 client as suggested by @zymap, but we have not been able to reproduce the issue this time. So, at the moment, with the evidences we have we cannot say that #2794 is the root cause of the problem.