service-fabric: Service Fabric Stateful Service in Azure - RAM usage keeps growing. Possible memory leak

I am running a Service Fabric application on a cluster in Azure. The cluster has two scale sets:

  • 4x B2ms nodes where a stateful service type is placed with Placement Constraints (Primary Scale Set)
  • 2x F1 nodes where a stateless service type is placed.

There are two types of services in the application:

  • WebAPI - stateless service used for receiving statuses from a system via HTTP and sending them to the StatusConsumer.
  • StatusConsumer - stateful service which processes the statuses and keeps the last one. A service instance is created for each system. It communicates via RemotingV2 (a minimal sketch of the remoting contract is shown below).
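
For context, this is roughly what the RemotingV2 contract between the two services might look like; the interface name IStatusConsumer, the per-system service URI scheme, and the singleton partition are illustrative assumptions, not taken from the actual code:

// Remoting contract implemented by the stateful StatusConsumer (illustrative sketch).
public interface IStatusConsumer : Microsoft.ServiceFabric.Services.Remoting.IService
{
	Task PostStatus(SystemStatusInfo status);
}

// Caller side, e.g. in the stateless WebAPI (assumes one named service per system with a singleton partition):
IStatusConsumer proxy = ServiceProxy.Create<IStatusConsumer>(
	new Uri("fabric:/StatusApp/StatusConsumer_" + systemId)); // placeholder naming scheme
await proxy.PostStatus(status);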

For my tests I am using Application Insights and Service Fabric Analytics to track performance. I am observing the following parameters:

  • Metrics for the stateful scale set: Percentage CPU, Disk Read/Write operations/sec.
  • Application Insights: Server Response Time, which corresponds to the execution time of the method that receives the statuses in the stateful StatusConsumer.
  • Service Fabric Analytics: Performance Counters collected with the Log Analytics agent on the stateful nodes, used to observe the RAM usage of the nodes.

Each simulated system sends its status every 30 seconds.
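
For reference, a minimal sketch of what one simulated system's send loop might look like, assuming the statuses are posted to the stateless WebAPI over HTTP; the endpoint URL, payload shape, and the use of Newtonsoft.Json are placeholders/assumptions:

// Illustrative sketch: one simulated system posts its status every 30 seconds.
static async Task SimulateSystemAsync(string systemId, HttpClient client, CancellationToken ct)
{
	while (!ct.IsCancellationRequested)
	{
		string json = JsonConvert.SerializeObject(new { SystemId = systemId, Timestamp = DateTime.UtcNow });
		var content = new StringContent(json, Encoding.UTF8, "application/json");
		await client.PostAsync("http://mycluster.westeurope.cloudapp.azure.com:8080/api/status", content); // placeholder URL
		await Task.Delay(TimeSpan.FromSeconds(30), ct);
	}
}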

At the beginning of the test the RAM usage is around 40% on each node, the Avg Server Response Time is around 15 ms, CPU usage is around 10%, and Read/Write operations are under 100/s.

Immediately after the start of the test the RAM usage starts slowly building up, but there is no difference in the other observed metrics.

After about an hour of simulating 1000 systems the RAM usage is around 90-95% and problems start to show in the other metrics - Avg Server response time peaks with values around 5-10 seconds and Disk Read/Write operations reach around 500/sec.

This continues for 1-3 minutes, then the RAM usage drops and everything goes back to normal. On the attached images (RAM usage and server response time graphs) you can see that the RAM peak corresponds to the server response time peak. At the end of the RAM graph the usage is flat, to show the behaviour without any simulated systems.

The number of systems simulated only changes how long it takes for the RAM to reach critical levels - in one of the tests 200 systems were simulated and the RAM usage rose more slowly.

As for the code: in the first tests the code was more complicated, but to find the cause of the problem I started removing functionality. Still there was no improvement. The only time the RAM usage did not rise was when I commented out all the code, so that the body of the method that receives the status contained only a try/catch block and a return. Currently the code in the StatusConsumer is this:

public async Task PostStatus(SystemStatusInfo status)
{
	try
	{
		Stopwatch stopWatch = new Stopwatch();

		// Reliable dictionary that only ever holds the last consumed status under a single key.
		IReliableDictionary<string, SystemStatusInfo> statusDictionary =
			await this.StateManager.GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("status");

		stopWatch.Start();

		using (ITransaction tx = this.StateManager.CreateTransaction())
		{
			// Overwrite the previous status; no new keys are added per request.
			await statusDictionary.AddOrUpdateAsync(tx, "lastConsumedStatus", key => status, (key, oldValue) => status);
			await tx.CommitAsync();
		}

		stopWatch.Stop();

		// Trace only unusually slow writes (longer than 4 seconds).
		if (stopWatch.ElapsedMilliseconds / 1000 > 4)
		{
			Telemetry.TrackTrace($"Queue Status Duration: {stopWatch.ElapsedMilliseconds / 1000} for {status.SystemId}", SeverityLevel.Critical);
		}
	}
	catch (Exception e)
	{
		Telemetry.TrackException(e);
	}
}

After connecting to the nodes with remote desktop, in Task Manager I can see that when the node's RAM usage is around 85%, the “Memory” of the SystemStatusConsumer process, which “holds” the microservice's instances, is not more than 600 MB. It is the highest consumption of any process, but it is still not that high - the node has 8 GB of RAM. However, I don't know whether this is useful information in this case.
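
One way to narrow this down is to log, from inside the service, how much of the process memory is actually managed heap; if the working set or private bytes are far larger than the GC heap, the growth is outside the managed heap (native or kernel memory) rather than a leak in the service code. A minimal sketch, reusing the Telemetry helper from the code above:

// Illustrative sketch: compare process memory to the managed heap size.
var process = System.Diagnostics.Process.GetCurrentProcess();
process.Refresh();
long workingSetMb   = process.WorkingSet64 / (1024 * 1024);
long privateBytesMb = process.PrivateMemorySize64 / (1024 * 1024);
long gcHeapMb       = GC.GetTotalMemory(forceFullCollection: false) / (1024 * 1024);
Telemetry.TrackTrace($"WorkingSet={workingSetMb}MB PrivateBytes={privateBytesMb}MB GCHeap={gcHeapMb}MB", SeverityLevel.Information);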

How can I diagnose and/or fix this?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 25

Most upvoted comments

I was using the default garbage collection mode, which I understand is workstation GC. To be sure, I explicitly disabled server GC (<gcServer enabled="false"/>) in the App.config file of my stateful service. There was no difference in the results. I started observing more performance counters with Service Fabric Analytics:

  • .NET CLR Memory(*)\% Time in GC: when RAM usage is normal the observed values are around 0.5, and when the RAM usage is around 90% they are around 2.5. There are some peak values, but they are less than 9 and last for 2-3 minutes. However, they occur at random times, mostly when RAM is fairly normal, and are not frequent: maybe 3 times per 5 hours. The above-mentioned article says that a “% Time in GC” under 10 should not be considered a problem.
  • .NET CLR Memory(*)\# Total committed Bytes is between 55m and 70m.
  • .NET CLR Memory(*)\# Total reserved Bytes rises in steps to around 750m and stays there until the end of the test no matter how many times the RAM usage rises and drops.
  • .NET CLR Memory(*)\Gen 0 heap size varies between 17m and 25m
  • .NET CLR Memory(*)\Gen 1 heap size is between 0.5m and 1.5m
  • .NET CLR Memory(*)\Gen 2 heap size's peak values are under 30m
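
For completeness, the effective GC mode can also be verified at runtime, regardless of what App.config says; a small sketch (GCSettings is standard System.Runtime, Telemetry is the helper already used above):

// Illustrative sketch: log the effective GC configuration at service startup.
Telemetry.TrackTrace(
	$"ServerGC={System.Runtime.GCSettings.IsServerGC}, LatencyMode={System.Runtime.GCSettings.LatencyMode}",
	SeverityLevel.Information);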

I also tried setting CheckpointThresholdInMB to 1 in the Settings.xml of the stateful service, and there was still no noticeable difference - the RAM usage is still the same as in the screenshot in the first comment. I also changed MaxAccumulatedBackupLogSizeInMB to 1 (default 800) for the stateful service and set KtlLogger/WriteBufferMemoryPoolMinimumInKB and KtlLogger/WriteBufferMemoryPoolMaximumInKB to 16384 in the cluster configuration.

With all these changes there is no improvement.

Actually, for almost every test I create a new cluster and delete it when I'm done. Only one application runs on this cluster. Very rarely do I make changes to the code and publish the application (without upgrading it) to a cluster that already has the application running on it.

In one “test” I create one service for each system after receiving the first request posting its status. After that, the service stays alive until the cluster is deleted, even if there are no statuses for it and it sits idle.
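
The per-system service creation itself is not shown in this thread; for reference, a sketch of how such an instance might be created dynamically (application name, service type name, and replica counts are placeholders):

// Illustrative sketch: create a dedicated stateful service the first time a system posts a status.
var fabricClient = new System.Fabric.FabricClient();
var description = new System.Fabric.Description.StatefulServiceDescription
{
	ApplicationName = new Uri("fabric:/StatusApp"),                        // placeholder
	ServiceName = new Uri($"fabric:/StatusApp/StatusConsumer_{systemId}"), // placeholder naming scheme
	ServiceTypeName = "SystemStatusConsumerType",                          // placeholder type name
	HasPersistedState = true,
	TargetReplicaSetSize = 3,
	MinReplicaSetSize = 3,
	PartitionSchemeDescription = new System.Fabric.Description.SingletonPartitionSchemeDescription()
};
await fabricClient.ServiceManager.CreateServiceAsync(description);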

When I refer to the simulator or simulation, I mean the tool we built to send the requests posting the systems' statuses - sorry if this is misleading. When I stop it, I just stop the requests going to the cluster; I don't change the cluster or the applications/services living in it.

The build-up to 4.3 GB happens about an hour and a half after starting to simulate the systems. When the memory usage is at 95% it sometimes drops to 40% total and sometimes to 60% or more; this happens with no action on my side. At that point I can see that the amount of non-paged pool bytes has dropped significantly. I can also reproduce the drop by stopping the simulator, so that there are no requests posting statuses. On each request I update the current status stored in a reliable dictionary; I don't add new data.
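
The non-paged pool observation can be quantified by sampling the Windows memory counters directly on a node; a small sketch using System.Diagnostics.PerformanceCounter:

// Illustrative sketch: sample kernel pool usage on a node (Windows performance counters).
var nonPagedPool = new System.Diagnostics.PerformanceCounter("Memory", "Pool Nonpaged Bytes");
var pagedPool = new System.Diagnostics.PerformanceCounter("Memory", "Pool Paged Bytes");
Console.WriteLine($"NonPagedPool={nonPagedPool.NextValue() / (1024 * 1024):F0}MB, " +
                  $"PagedPool={pagedPool.NextValue() / (1024 * 1024):F0}MB");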

Also, every time I create a cluster I set the following settings via resources.azure.com:

KtlLogger:

  • SharedLogSizeInMB = 4096
  • WriteBufferMemoryPoolMinimumInKB = 16384
  • WriteBufferMemoryPoolMaximumInKB = 16384

Diagnostics:

  • MaxDiskQuotaInMB = 1024

Settings in the Settings.xml file of the problematic stateful reliable service:

  • CheckpointThresholdInMB = 1
  • MaxAccumulatedBackupLogSizeInMB = 1