python-docx: Large Table Creation Slow

I saw a closed issue you couldn’t reproduce that sounded similar to what I’m seeing. I was trying to create a table with 7 columns and around 1,000 rows, and it took a very long time. I had been using a fork that had HTML tag support, so I wrote a test script, downloaded the most recent version of python-docx, and ran it on both Ubuntu and Windows machines running Python 2.7.

My test script is at the bottom, but all it does is create a document, add a 7-column table, and then add blank rows. Here is my timing output on an i7 running Python 2.7.9 on 64-bit Windows 7. While it runs, one core is pegged at 100% but memory usage is minimal. You can see that each batch of row additions takes longer as the table gets bigger.

-Thanks

SCRIPT OUTPUT:

docx Version: 0.8.5
    0.45s:     50 lines complete.  (  0.45 seconds for last 50 lines)
    1.54s:    100 lines complete.  (  1.09 seconds for last 50 lines)
    3.31s:    150 lines complete.  (  1.76 seconds for last 50 lines)
    5.74s:    200 lines complete.  (  2.43 seconds for last 50 lines)
    8.81s:    250 lines complete.  (  3.07 seconds for last 50 lines)
   12.56s:    300 lines complete.  (  3.74 seconds for last 50 lines)
   16.94s:    350 lines complete.  (  4.38 seconds for last 50 lines)
   22.01s:    400 lines complete.  (  5.07 seconds for last 50 lines)
   27.75s:    450 lines complete.  (  5.74 seconds for last 50 lines)
   34.16s:    500 lines complete.  (  6.41 seconds for last 50 lines)
   41.23s:    550 lines complete.  (  7.07 seconds for last 50 lines)
   48.92s:    600 lines complete.  (  7.69 seconds for last 50 lines)
   57.28s:    650 lines complete.  (  8.36 seconds for last 50 lines)
   66.38s:    700 lines complete.  (  9.10 seconds for last 50 lines)
   76.13s:    750 lines complete.  (  9.75 seconds for last 50 lines)
   86.52s:    800 lines complete.  ( 10.39 seconds for last 50 lines)
   97.58s:    850 lines complete.  ( 11.06 seconds for last 50 lines)
  109.37s:    900 lines complete.  ( 11.79 seconds for last 50 lines)
  121.71s:    950 lines complete.  ( 12.34 seconds for last 50 lines)
Total Runtime 134.46 seconds
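
As a rough check on the numbers above, the per-batch times can be differenced. This is only a sketch: the list is copied from the output, and the variable names are mine. If each batch of 50 rows costs a roughly constant amount more than the previous one, the total time grows roughly quadratically with the number of rows.

```python
# Seconds per 50-row batch, copied from the script output above.
batch_seconds = [0.45, 1.09, 1.76, 2.43, 3.07, 3.74, 4.38, 5.07, 5.74, 6.41,
                 7.07, 7.69, 8.36, 9.10, 9.75, 10.39, 11.06, 11.79, 12.34]

# Increase in cost from one batch to the next.
increments = [b - a for a, b in zip(batch_seconds, batch_seconds[1:])]
mean_increment = sum(increments) / len(increments)

print("mean increase per batch: %.2f seconds" % mean_increment)
```

The mean increase works out to about 0.66 seconds per batch, which fits a per-row cost that is proportional to the current table size.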

SCRIPT CODE:

import time
import docx

STEP = 50
ROWS = 1000

print "docx Version: %s" % docx.__version__
document = docx.Document()
table = document.add_table(rows=1, cols=7)
tstart = time.time()
t1 = tstart
for i in range(ROWS):
    row_cells = table.add_row().cells
    if i and (i % STEP) == 0:
        t2 = time.time()
        print "%8.2fs:  %5d lines complete.  (%6.2f seconds for last %d lines)" % (t2 - tstart, i, t2-t1, STEP)
        t1 = t2

document.save("table_test.docx")
t2 = time.time()
print "Total Runtime %.2f seconds" % (t2 - tstart)

About this issue

  • State: open
  • Created 9 years ago
  • Reactions: 6
  • Comments: 15

Most upvoted comments

Looking at this a bit more, all of the time is taken up by the table._cells call, which happens every time I fetch row.cells. To retrieve a single row, _cells has to iterate through every cell in the entire table to deal with merged cells, and it has no mechanism to regenerate the list only when the table changes.
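
To put a number on that, here is a simplified cost model, not a measurement of python-docx itself. It assumes each row.cells access walks every cell currently in the table, as described above, and compares that with building the cell list once. Both function names are hypothetical.

```python
def cells_visited_per_row_access(n_rows_added, n_cols, initial_rows=1):
    """Count cell visits if every row.cells access walks the whole table."""
    total = 0
    rows = initial_rows
    for _ in range(n_rows_added):
        rows += 1               # models table.add_row()
        total += rows * n_cols  # models .cells rebuilding the full cell list
    return total


def cells_visited_single_pass(n_rows, n_cols):
    """Count cell visits if the full cell list is built only once."""
    return n_rows * n_cols


repeated = cells_visited_per_row_access(1000, 7)
single = cells_visited_single_pass(1001, 7)
print(repeated, single)
```

Under these assumptions the repeated approach visits cells a few million times for the test script's 1000 added rows, against a few thousand for a single pass.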

As a work-around, since I just need a simple table I’m fetching all the cells once and indexing the rows:

ROWS = 1000
COLUMNS = 7
table = document.add_table(rows=ROWS, cols=COLUMNS)
table_cells = table._cells  # build the flat cell list once
for i in range(ROWS):
    row_cells = table_cells[i*COLUMNS:(i+1)*COLUMNS]
    # Add text to row_cells

This takes around 4 seconds to populate 1000 rows.
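
The index arithmetic in that work-around can be wrapped in a small generator. This is a sketch that works on any flat, row-major list; iter_rows is a hypothetical helper, not part of python-docx, and table._cells is a private attribute that may change between releases.

```python
def iter_rows(flat_cells, n_cols):
    """Yield successive row-sized slices from a flat, row-major cell list."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    for start in range(0, len(flat_cells), n_cols):
        yield flat_cells[start:start + n_cols]


# Stand-in data; with python-docx this would be something like
#   flat_cells = table._cells   (fetched once, outside the loop)
flat_cells = list(range(21))
rows = list(iter_rows(flat_cells, 7))
```

The important part is that the expensive list is fetched once and only sliced inside the loop.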

Another year, and another person this has helped. Just got a 6000+ row table generating in a few minutes, as opposed to hours.

+1 I can also confirm that I have used this technique when READING a large Word document. It is critical to use the table_cells = table._cells technique; otherwise the performance is unacceptable. Thanks again stumpyyy.

It’s been a few years, but I needed this again! I wrote a little recipe to export a pandas DataFrame to a Word table, much as pandas has a to_csv method, using the “fast access” approach. Sharing for future travelers:


import pandas as pd
from docx.document import Document


def df_to_table(doc: Document, df: pd.DataFrame):
    """Quickly generate a word table from a pandas data frame.

    Optimized for speed - see https://github.com/python-openxml/python-docx/issues/174
    """
    n_rows = df.shape[0] + 1
    n_col = df.shape[1]
    tbl = doc.add_table(n_rows, n_col)
    cells = tbl._cells
    data = df.to_dict("tight", index=False)
    for i, header in enumerate(data["columns"]):
        cell = cells[i]
        cell.paragraphs[0].text = str(header)
    for i, row in enumerate(data["data"]):
        for j, value in enumerate(row):
            cell = cells[(i + 1) * n_col + j]
            cell.paragraphs[0].text = str(value)

An API that uses a contextmanager could be a good fit here.

with table.cells as cells:
    # do stuff with cells
    pass

# changes are written back.

It could give you back some cached cells, and then apply the changes at the end of the block.
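
A rough sketch of what such a context manager might look like follows. It is not an existing python-docx API: cached_cells and FakeTable are hypothetical names, and FakeTable only stands in for a real table so that the number of _cells builds can be counted. The sketch also assumes the cell objects are live views, so no explicit write-back is needed.

```python
from contextlib import contextmanager


@contextmanager
def cached_cells(table):
    """Yield the table's flat cell list, computed once for the whole block."""
    cells = table._cells  # assumption: one expensive call to a private attribute
    yield cells
    # No write-back step: edits made to the cell objects already apply.


class FakeTable:
    """Stand-in for a table that counts how often _cells is built."""

    def __init__(self, n):
        self.builds = 0
        self._data = [[""] for _ in range(n)]  # each "cell" is a one-item list

    @property
    def _cells(self):
        self.builds += 1
        return list(self._data)


table = FakeTable(14)
with cached_cells(table) as cells:
    for cell in cells:
        cell[0] = "x"
```

A real implementation would also have to decide what happens if rows are added or cells are merged inside the block, since the cached list would then be stale.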

I did this:

import docx
from docx.table import Table


class CachedTable(Table):
    def __init__(self, tbl, parent):
        # Table.__init__ stores tbl as self._tbl and self._element
        super(CachedTable, self).__init__(tbl, parent)
        self._cached_cells = None

    @property
    def _cells(self):
        if self._cached_cells is None:
            self._cached_cells = super(CachedTable, self)._cells
        return self._cached_cells

    @staticmethod
    def transform(table):
        cached_table = CachedTable(table._tbl, table._parent)
        return cached_table


ROWS = 1000
COLUMNS = 7
document = docx.Document()
table = CachedTable.transform(document.add_table(rows=ROWS, cols=COLUMNS))
for i in range(ROWS):
    for j in range(COLUMNS):
        cell = table.cell(i, j)
        # Add text to cell

This “transform” method should be called after all cells have been created, because the cached cell list is never refreshed afterwards.

Hope one day we can add a mechanism to regenerate _cells only on change.

All I needed to do was add text to the cells once I created them, I didn’t have to do any merging.

row_cells holds references to the table's own cell objects, so modifying a cell through it modifies the table as well (just like a = [1]; b = a; b[0] = 5 would make a == [5]).
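
One nuance is worth spelling out, sketched here with a stand-in Cell class (hypothetical, not python-docx's own). A slice such as table_cells[i*COLUMNS:(i+1)*COLUMNS] is a new list, but its elements are the same objects as in the original list. Changing a cell through the slice is therefore visible in the table, while replacing a slot in the slice is not.

```python
class Cell:
    """Stand-in for a table cell; it only holds text."""

    def __init__(self):
        self.text = ""


table_cells = [Cell() for _ in range(14)]  # flat list, 2 rows x 7 columns
row_cells = table_cells[7:14]              # slice: new list, same Cell objects

row_cells[0].text = "hello"                # mutates the shared Cell
row_cells[1] = Cell()                      # rebinding a slot affects only the slice
```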

If you need to merge cells as you’re going along it might be a bit more difficult to work around.  You can see it gets slower and slower as row_cells grows, so it might be possible to build a bunch of small tables and then append their row_cells together to make one big table when you’re done.  I’ve not run into that need yet so I haven’t tried anything.