HDFS-16664. Use correct GenerationStamp when invalidating corrupt block replicas #4568
base: trunk
Conversation
💔 -1 overall
This message was automatically generated.
The other test failure, TestFsck#testFsckCorruptWhenOneReplicaIsCorrupt, is relevant:
💔 -1 overall
This message was automatically generated.
Yetus is failing because no new unit test was added:
Note that this change was unit tested via the existing test "TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock". The only caveat is that this unit test behaves differently when backporting HDFS-16064 to older Hadoop versions, which is what caused this bug to be discovered in the first place. See the section "Why does unit test failure not reproduce in Hadoop trunk?" in HDFS-16664 for additional details on the difference in behavior. In summary:
- I need additional time to investigate this difference in behavior; if possible I will try to make the unit test behavior more consistent.
- I can also add some unit testing for the new code in CorruptReplicasMap.
I have confirmed that the change in unit test behavior that uncovered this bug is related to a change in this condition, which was made as part of https://issues.apache.org/jira/browse/HDFS-15200. In trunk the block invalidation does not get postponed because "dfs.namenode.corrupt.block.delete.immediately.enabled" defaults to true. I have added some additional test coverage for this change.
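To illustrate that condition, here is a minimal sketch using the Hadoop Configuration API; the property name is the one quoted above, while the class name and surrounding scaffolding are just assumptions for the example and not code from this PR. Setting the property to false restores the postponed-invalidation behavior that the backports exhibit.

```java
import org.apache.hadoop.conf.Configuration;

public class CorruptBlockDeleteConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Defaults to true in trunk (HDFS-15200): corrupt replicas are invalidated
    // immediately rather than postponed until the next block report.
    // Setting it to false reproduces the older, postponed behavior seen in backports.
    conf.setBoolean("dfs.namenode.corrupt.block.delete.immediately.enabled", false);
    System.out.println("Immediate corrupt-block deletion enabled: "
        + conf.getBoolean("dfs.namenode.corrupt.block.delete.immediately.enabled", true));
  }
}
```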
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
I am ignoring the 2 checkstyle violations for the following reasons:
This is an existing comment (line 285 in 60433bf). Since I am not modifying this comment (i.e. those lines of code) in any way, I think it's better that I don't touch it.
"testDeleteCorruptReplicaForUnderReplicatedBlockInternal" is the existing method "testDeleteCorruptReplicaForUnderReplicatedBlock", which was renamed. Since the method was already merged, I don't think it's necessary for me to reduce the number of lines in the method.
Unit test failures:
All the unit test failures are caused by:
I strongly suspect this is not related to my change and is instead related to the runtime environment of the tests; the unit tests will perhaps pass if re-run.
🎊 +1 overall
This message was automatically generated.
-1 (sorry for being so direct). This is a completely unnecessary change. It was mentioned in your last JIRA, HDFS-16064, by yanbin.zhang that you found the wrong root cause in that JIRA, and you didn't coordinate with or reply to them. We have never seen this issue in our downstream production clusters of 1000+ nodes over so many years. You should first try to experience this issue in the real world rather than just depending on a UT. This change seems unnecessary. If it were a real production bug, someone in the community would have experienced it. It is not as if you were reading the code one day and identified a bug like this. Block management is very critical, and unnecessary changes always lead to more critical bugs...
Hey Xiaohui, appreciate you taking the time to review this :) Just wanted to clarify a few things:
I did reply to yanbin.zhang in the JIRA; you can see my comment just under theirs. I would also add:
I understand that there is no prior report of this issue from the community. That being said, the behavior is reproducible (as per the testing details in this PR/JIRA), and it results in block invalidation requests (i.e. DNA_INVALIDATE) sent to Datanodes failing due to an incorrect GenerationStamp. I would argue that sending an incorrect GenerationStamp in a DNA_INVALIDATE request is a bug, because it causes the request to fail when it could have succeeded had the correct GenerationStamp been used.
I would ask that you more clearly articulate why this change is unnecessary. Do you believe that sending an incorrect GenerationStamp in a DNA_INVALIDATE request, causing the request to fail, is not a bug? Or do you believe that this reproducible behavior should not be addressed simply because it has not been frequently reported as production-impacting by the community?
@XiaohuiSun1 Do you have any follow-up comments on this PR? Just wanted to give you a chance to reply before I reach out to a Hadoop committer for review.
Concrete Reproduce Steps
More abstractly, the conditions to reproduce the issue are:
In this case the "GenerationStamp not matched" exception will continue to occur for each invalidation request sent to the Datanode with the corrupt replica, up until that Datanode sends its next block report and the block replica is properly invalidated. An invalidation request will be sent for each block report from another Datanode containing that block.
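While reproducing, the repeated failures can be confirmed by searching the Datanode log for the quoted message, e.g. `grep "GenerationStamp not matched" <path-to-datanode-log>` (the log path here is a placeholder).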
Description of PR
Under certain conditions the Namenode can send the incorrect generationStamp to a datanode when invalidating a corrupt block replica.
This results in a "GenerationStamp not matched" exception on the Datanode:
See JIRA for additional details: https://issues.apache.org/jira/browse/HDFS-16664
How was this patch tested?
Validated the fix by leveraging the unit test "TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock"
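For reference, the test can be run on its own with something like `mvn test -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock -pl hadoop-hdfs-project/hadoop-hdfs` (the module path is assumed here; adjust it to your checkout).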
Failed Test - Before this change
Note the inline comments above which illustrate the bug
Successful Test - After this change
Logs:
Using "getCorruptReplicaGenerationStamp" allows the Namenode to get the correct generationStamp for the corrupt block replica
For code changes:
Have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?