Open-XML-SDK: Large parts cannot be written on .NET Core due to OutOfMemoryException

Description

I was using ClosedXML to create large Excel files (100k rows with 63 columns) and ran into an OutOfMemoryException. I found some examples of OpenXML using SAX-style writing, so I tried switching to the OpenXML SDK directly, but it didn’t help. I stripped out all of my own code (reading from the DB, etc.) to see whether a minimal version would work, but I still get an OutOfMemoryException.

My code is based on http://polymathprogrammer.com/2012/08/06/how-to-properly-use-openxmlwriter-to-write-large-excel-files/

Information

  • .NET Target: .NET Core 3.1
  • DocumentFormat.OpenXml Version: 2.11.3

EDIT: Same code is working fine in .NET Framework 4.7.2 with same DocumentFormat.OpenXml version.

Repro

This is the simplified code I am using at the moment. I am testing with RAM limited to 500 MB (for testing purposes). I don’t think this code should be able to consume that much RAM.

using (SpreadsheetDocument document = SpreadsheetDocument.Create(filePath, SpreadsheetDocumentType.Workbook))
{
    document.AddWorkbookPart();

    WorksheetPart wsp = document.WorkbookPart.AddNewPart<WorksheetPart>();

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wsp))
    {
        List<OpenXmlAttribute> oxa;
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());

        for (int i = 1; i <= 100000; i++)
        {
            // Row references ("r") are 1-based in SpreadsheetML.
            oxa = new List<OpenXmlAttribute>();
            oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));
            writer.WriteStartElement(new Row(), oxa);

            for (int j = 0; j < 40; j++)
            {
                oxa = new List<OpenXmlAttribute>();
                oxa.Add(new OpenXmlAttribute("t", null, "str"));
                writer.WriteStartElement(new Cell(), oxa);
                writer.WriteElement(new CellValue("test"));
                writer.WriteEndElement();
            }

            writer.WriteEndElement();
        }

        writer.WriteEndElement(); // end of sheetdata
        writer.WriteEndElement(); //end of worksheet
    }

    using (OpenXmlWriter writer = OpenXmlWriter.Create(document.WorkbookPart))
    {
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());

        writer.WriteElement(new Sheet() { Id = document.WorkbookPart.GetIdOfPart(wsp), SheetId = 1, Name = "Test" });

        writer.WriteEndElement();
        writer.WriteEndElement();
    }
}

Observed

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.IO.MemoryStream.set_Capacity(Int32 value)
   at System.IO.MemoryStream.EnsureCapacity(Int32 value)
   at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.Xml.XmlUtf8RawTextWriter.FlushBuffer()
   at System.Xml.XmlUtf8RawTextWriter.RawText(Char* pSrcBegin, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.RawText(String s)
   at System.Xml.XmlUtf8RawTextWriter.WriteEndElement(String prefix, String localName, String ns)
   at System.Xml.XmlWellFormedWriter.WriteEndElement()
   at DocumentFormat.OpenXml.OpenXmlPartWriter.WriteEndElement()

Expected

An Excel file at filePath with 100k rows and 40 columns, with the string “test” in every cell.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 21 (5 by maintainers)

Most upvoted comments

I’d like to get this fixed (or at least a workaround) for v3.0. I’ve created a set of abstractions that allow for more control over things, and I think we could automate some of the workarounds here (at least in an opt-in way). For the abstractions, see: #1295.

My thoughts would be to model what @M4urici0GM did, but in a more transparent way. Of course, it would be better to have this fixed in the underlying package model, but that hasn’t gone anywhere in too many years.

My initial thoughts to implementing this would be:

  1. Provide an abstraction of IPackage that intercepts calls to GetStream and writes them to some temporary location.
  2. On save, first save the package as normal.
  3. Then reopen the package in write-only mode (this should allow replacing parts without the explosion of memory).
  4. Write the streams from the temporary location.
  5. Close the package again and reopen it with the original mode/access.

The abstractions I have should allow building this, except we’d need a way to “Reload” the underlying package. Building off of the abstractions, I’m thinking of enabling the following:

public interface IPackageFeature
{
  IPackage Package { get; }

+ bool CanReload { get; }

+ void Reload();
}

This could automatically be supported for files opened with paths or streams, but if a package is given, then it would not be supported (since we didn’t manage the package) without additional information from a user.
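
To make the proposal concrete, here is a hypothetical usage sketch. None of this API exists in the shipped SDK: CanReload and Reload come from the interface diff above, and Features.Get<IPackageFeature>() is assumed from the #1295 abstractions.

```csharp
// Hypothetical consumer code for the proposed reload support.
// CanReload/Reload mirror the interface sketched above; they do not exist yet.
IPackageFeature feature = document.Features.Get<IPackageFeature>();

if (feature is not null && feature.CanReload)
{
    // Flush everything written so far, then reopen the underlying package
    // so that large parts are not kept buffered in memory.
    document.Save();
    feature.Reload();
}
```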

Here’s more of a working sample for .NET Core that should get people started on the workaround. I just wrote out a 500,000-row dataset and the memory footprint stayed pretty low. There are a few places where some extension methods are used (like SafeLeft); you can remove those and substitute what you need (GetCell in particular isn’t super clean; keep in mind this is a proof of concept). What you’ll really be interested in is the ToFile static method, which follows the outline of the list in my last comment.

The biggest limitation I can see is that you can only write one large sheet to the document (after that, ReadWrite is required; I could never get a second write-only stream to work).

https://gist.github.com/blakepell/8fe938624f1dad8c28ff93a334687d77

@sorensenmatias @clement911 @rsdelapaz

Sorry for the spam. I got something working and thought it might help someone looking for a workaround. I don’t love my solution, but it did work. Since I wasn’t able to write with a second OpenXmlWriter, I did this (order is important):

  1. Create a FileStream (I used File.Create).
  2. Create a Package, pass in the FileStream and use FileMode.Create and FileAccess.Write
  3. Create a SpreadsheetDocument via SpreadsheetDocument.Create
  4. Write your large WorksheetPart via an OpenXmlWriter
  5. Close and Dispose of the writer, the package, the file stream, etc.
  6. Create a FileStream (open this time, File.Open with FileMode.Open, FileAccess.ReadWrite and FileShare.None)
  7. Create a Package, pass in the FileStream and use FileMode.Open and FileAccess.ReadWrite
  8. Create a SpreadsheetDocument via SpreadsheetDocument.Open
  9. Create an OpenXmlWriter for the WorkbookPart and add the elements for Workbook and Sheets, then you’ll associate the Sheet you added on the original create, close and dispose of those objects and done.

What I found was that I was only able to use FileAccess.Write for the first dataset I wrote to the spreadsheet; if I tried to write anything else, it would throw an exception (where ReadWrite did not, so I can only assume that once a second writer starts, the OpenXml library needs to read something). That’s why I had to use two sets of operations (the ReadWrite part at the end is very small, so there are no memory concerns there, as it doesn’t decompress the large sheet into memory).

Outside of the acrobatics, the limitation of this approach seems to be that you can only have one large sheet (but from the examples I’ve seen, many people only write one large sheet per spreadsheet, so this might help someone).
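
The steps above can be sketched as follows. This is a hedged proof of concept, not the exact code from the gist: filePath and the sheet name are placeholders, and the row-writing loop is elided (it matches the repro at the top of the issue).

```csharp
using System.IO;
using System.IO.Packaging;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Pass 1 (steps 1-5): write-only, stream the large worksheet part.
using (var fs = new FileStream(filePath, FileMode.Create, FileAccess.Write))
using (var package = Package.Open(fs, FileMode.Create, FileAccess.Write))
using (var document = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook))
{
    document.AddWorkbookPart();
    WorksheetPart wsp = document.WorkbookPart.AddNewPart<WorksheetPart>();

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wsp))
    {
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());
        // ... stream the rows/cells here, as in the repro above ...
        writer.WriteEndElement(); // SheetData
        writer.WriteEndElement(); // Worksheet
    }
}

// Pass 2 (steps 6-9): reopen read/write (small memory footprint) and
// attach the workbook metadata that references the streamed sheet.
using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None))
using (var package = Package.Open(fs, FileMode.Open, FileAccess.ReadWrite))
using (var document = SpreadsheetDocument.Open(package))
{
    WorkbookPart wbp = document.WorkbookPart;
    WorksheetPart wsp = wbp.WorksheetParts.First();

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wbp))
    {
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());
        writer.WriteElement(new Sheet
        {
            Id = wbp.GetIdOfPart(wsp),
            SheetId = 1,
            Name = "Test"
        });
        writer.WriteEndElement(); // Sheets
        writer.WriteEndElement(); // Workbook
    }
}
```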

This is definitely dotnet/runtime#1544. I am unsure of a good workaround at this time.

Any updates on this?

Our company is currently being hit hard by this issue for WordProcessing, Presentation, and Spreadsheet documents alike. We are seeing big spikes in memory consumption in our production environment when modifying the OpenXml of certain documents. This causes significant problems for our enterprise customers. Any updates or workarounds would be most appreciated.

I made a hacky solution for this. It’s obviously not production-ready, but the idea itself, I think, is good. Instead of creating the Cell objects up front, we create each one on the fly and drop it right after use. I made a proof-of-concept repo for this: https://github.com/pre-alpha-final/openxml-memory-usage-hack The results are promising: I went from 6.1GB of RAM to 2.1GB, of which around 300MB+ is just the data itself.

EDIT: I ran it with @Viktor-36 's parameters from the original post (100k rows, 40 cells each, with the text “test”). The standard save used 2GB of RAM; the hacky save used 321MB.

@blakepell 's solution gave us an error (“Cannot retrieve parts of writeonly container”) until we updated to the latest version of the System.IO.Packaging NuGet package.

We are also running into that exact same issue, even though we call SpreadsheetDocument.Create with a FileStream. Our environment is .NET 5 on an Azure App Service.

I closed the issue by mistake, so I am reopening it. I was trying to find more information about this and found some related issues in System.IO.Packaging. Is there a workaround to fix my problem temporarily? I really need to generate large Excel files in .NET Core.

Related issues: https://github.com/dotnet/runtime/issues/23750 https://github.com/dotnet/runtime/issues/1544