model_analyzer: Error starting embedded DCGM engine
I’m trying to run model-analyzer in Kubernetes, but it is failing with the following error:
Unhandled exception. System.TypeInitializationException: The type initializer for 'Triton.MemoryAnalyzer.Metrics.GpuMetrics' threw an exception.
 ---> System.InvalidOperationException: Error starting embedded DCGM engine. DCGM initialization error.
   at Triton.MemoryAnalyzer.Metrics.GpuMetrics..cctor()
   --- End of inner exception stack trace ---
   at Triton.MemoryAnalyzer.Metrics.GpuMetrics..ctor()
   at Triton.MemoryAnalyzer.MetricsCollector..ctor(MetricsCollectorConfig config)
   at Triton.MemoryAnalyzer.Program.<>c__DisplayClass7_0.<Main>b__2(K8sOptions options)
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at Triton.MemoryAnalyzer.Program.Main(String[] args)
stream closed
Has anyone seen this before?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (7 by maintainers)
@fabito Not yet. The Triton team is integrating this tool into the Triton universe, including a complete rewrite in C++. I briefly skimmed the code in that branch. My assumption is that it will support more current versions of Triton and better integrate into the experience you are accustomed to with Triton.
I’m not involved in that project, so I do not know the timeline. @deadeyegoodwin and @dzier would be better contacts for that. In the meantime, hopefully this version of Triton Memory Analyzer can provide you with approximate memory metrics.
We have just pushed out the new rewrite in Python to the main branch. Please try the newer version and see if the issue still persists. Note that we will officially release v1.0.0 of Model Analyzer in the 20.12 release of the Triton SDK, which will be sometime in December.

@fabito Ignore my earlier comment, you’re right. Analyzer surveys the memory usage on the system, not a specific GPU. I suppose it would work fine, but I don’t know why allocating a GPU would make a difference unless there’s an underlying issue with the configuration. Feel free to test it out.
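For reference, allocating a GPU to the analyzer is just a resource limit in the pod spec. This is only a minimal sketch; the pod name, container name, and image are placeholders, not the manifest used in this issue:

```yaml
# Hypothetical pod spec fragment illustrating how a GPU would be allocated
# to the analyzer container via the NVIDIA device plugin. Names and image
# are placeholders, not taken from this issue.
apiVersion: v1
kind: Pod
metadata:
  name: model-analyzer
spec:
  containers:
    - name: analyzer
      image: <your-analyzer-image>
      resources:
        limits:
          nvidia.com/gpu: 1   # requires the NVIDIA device plugin on the node
```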
They should be in the same pod, not the same container. Are you sure none of the modifications you made were critical? For example, securityContext should be set to privileged: true for the analyzer container, and shareProcessNamespace should be set to true for the pod.
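As a rough sketch of the layout that comment describes, both containers would sit in one pod with those two settings. Names and images below are placeholders; only shareProcessNamespace and the privileged securityContext come from the comment:

```yaml
# Sketch of a two-container pod matching the settings described above.
# Names and images are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: triton-with-analyzer
spec:
  shareProcessNamespace: true      # set on the pod, so the analyzer can see Triton's processes
  containers:
    - name: triton
      image: <triton-server-image>
    - name: analyzer
      image: <model-analyzer-image>
      securityContext:
        privileged: true           # privileged analyzer container, as noted above
```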