python-docx: Large Table Creation Slow
I saw a closed issue you couldn’t reproduce that sounded similar to what I’m seeing. I was trying to create a table with 7 columns and around 1,000 rows and it took a very long time. I was using a fork that had HTML tags, so I wrote a test script, downloaded the most recent version of python-docx, and tried this on both Ubuntu and Windows machines running Python 2.7.
My test script is at the bottom, but all I did was create a document, add a 7-column table, and then add blank rows. Here is my timing output on an i7 running Python 2.7.9 on 64-bit Windows 7. One core is pegged at 100% while this is running, with minimal memory usage. You can see the row additions take longer and longer as the table gets bigger.
-Thanks
SCRIPT OUTPUT:
docx Version: 0.8.5
0.45s: 50 lines complete. ( 0.45 seconds for last 50 lines)
1.54s: 100 lines complete. ( 1.09 seconds for last 50 lines)
3.31s: 150 lines complete. ( 1.76 seconds for last 50 lines)
5.74s: 200 lines complete. ( 2.43 seconds for last 50 lines)
8.81s: 250 lines complete. ( 3.07 seconds for last 50 lines)
12.56s: 300 lines complete. ( 3.74 seconds for last 50 lines)
16.94s: 350 lines complete. ( 4.38 seconds for last 50 lines)
22.01s: 400 lines complete. ( 5.07 seconds for last 50 lines)
27.75s: 450 lines complete. ( 5.74 seconds for last 50 lines)
34.16s: 500 lines complete. ( 6.41 seconds for last 50 lines)
41.23s: 550 lines complete. ( 7.07 seconds for last 50 lines)
48.92s: 600 lines complete. ( 7.69 seconds for last 50 lines)
57.28s: 650 lines complete. ( 8.36 seconds for last 50 lines)
66.38s: 700 lines complete. ( 9.10 seconds for last 50 lines)
76.13s: 750 lines complete. ( 9.75 seconds for last 50 lines)
86.52s: 800 lines complete. ( 10.39 seconds for last 50 lines)
97.58s: 850 lines complete. ( 11.06 seconds for last 50 lines)
109.37s: 900 lines complete. ( 11.79 seconds for last 50 lines)
121.71s: 950 lines complete. ( 12.34 seconds for last 50 lines)
Total Runtime 134.46 seconds
SCRIPT CODE:
import time
import docx

STEP = 50    # report timing every STEP rows
ROWS = 1000  # number of blank rows to add

# print() with a single formatted string works on both Python 2.7 and 3.
print("docx Version: %s" % docx.__version__)
document = docx.Document()
table = document.add_table(rows=1, cols=7)
tstart = time.time()
t1 = tstart
for i in range(ROWS):
    # Add a blank row and fetch its cells.
    row_cells = table.add_row().cells
    if i and (i % STEP) == 0:
        t2 = time.time()
        print("%8.2fs: %5d lines complete. (%6.2f seconds for last %d lines)" % (t2 - tstart, i, t2 - t1, STEP))
        t1 = t2
document.save("table_test.docx")
t2 = time.time()
print("Total Runtime %.2f seconds" % (t2 - tstart))
About this issue
- State: open
- Created 9 years ago
- Reactions: 6
- Comments: 15
In looking at this a bit more, all of the time is taken up by the table._cells call which happens every time I fetch a row.cells. To retrieve a row, _cells has to iterate through every cell in the entire table to deal with merged cells, and doesn’t have a mechanism to regenerate only on change.
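A rough way to see why the timings above grow the way they do (each batch of 50 rows takes roughly 0.65 seconds longer than the previous one): if every row.cells access scans the whole table, the total work is quadratic in the number of rows. The sketch below is only a simplified cost model, not a measurement of python-docx itself. The function names are made up for illustration, and it assumes exactly one full-table scan per row access.

```python
def cells_scanned_uncached(rows_added, cols, initial_rows=1):
    """Model: each add_row() is followed by one scan of every cell in the table."""
    total = 0
    for added in range(1, rows_added + 1):
        # After adding this row the table holds (initial_rows + added) rows.
        total += (initial_rows + added) * cols
    return total


def cells_scanned_cached(rows_added, cols, initial_rows=1):
    """Model: the full cell list is built a single time for the finished table."""
    return (initial_rows + rows_added) * cols


uncached = cells_scanned_uncached(1000, 7)
cached = cells_scanned_cached(1000, 7)
print("uncached: %d cell visits" % uncached)
print("cached:   %d cell visits" % cached)
```

Under this model the 1,000-row, 7-column test does about 3.5 million cell visits, against about 7,000 if the cell list is built once.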
As a workaround, since I just need a simple table, I’m fetching all the cells once and indexing the rows myself.
This takes around 4 seconds to populate 1000 rows.
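For anyone landing here, the workaround described above can be sketched roughly as follows. This is a minimal sketch, not the original poster's script: iter_rows_fast and fill_table are hypothetical helper names, it assumes the table was created with all of its rows up front and has no merged cells, and it relies on table._cells, which is a private attribute of python-docx and could change between versions.

```python
def iter_rows_fast(table):
    """Yield each row as a list of cells, reading table._cells only once.

    Assumes no merged cells, so the flat list is row-major with
    len(table.columns) entries per row.
    """
    cells = table._cells              # private attribute, fetched a single time
    n_cols = len(table.columns)
    for start in range(0, len(cells), n_cols):
        yield cells[start:start + n_cols]


def fill_table(table, data):
    """Write data (an iterable of rows of values) into an existing table."""
    for row_cells, values in zip(iter_rows_fast(table), data):
        for cell, value in zip(row_cells, values):
            cell.text = str(value)
```

With python-docx installed, usage would look something like table = document.add_table(rows=1000, cols=7) followed by fill_table(table, data), where data holds 1000 rows of 7 values.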
Another year, and another person this has helped. Just got a 6000+ row table generating in a few minutes, as opposed to hours.
+1 I can also confirm that I have used this technique when READING a large word document. It is critical that the table_cells = table._cells technique is used. Otherwise the performance is unacceptable. Thanks again stumpyyy.
It’s been a few years, but I needed this again! I wrote a little recipe to export a pandas data frame to Word, kind of like pandas has a to_csv method, using the “fast access” approach. Sharing for future travelers.
An API that uses a context manager could be a good fit here.
It could give you back some cached cells, and then apply the changes at the end of the block.
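The two ideas above (a pandas export using the fast access approach, and a context manager that hands back cached cells and applies changes at the end of the block) can be sketched together. Nothing below is part of python-docx or pandas: cached_cells and df_to_table are hypothetical names, the sketch assumes a table with no merged cells, and it uses the private table._cells attribute.

```python
from contextlib import contextmanager


@contextmanager
def cached_cells(table):
    """Hypothetical helper: collect writes in a dict, apply them on exit."""
    cells = table._cells              # private attribute, fetched a single time
    n_cols = len(table.columns)
    pending = {}                      # maps (row, col) -> text
    yield pending
    # Reached only if the with-block finished without raising;
    # on an exception the pending writes are simply discarded.
    for (row, col), text in pending.items():
        cells[row * n_cols + col].text = text


def df_to_table(document, df):
    """Add a table to document holding df's column names and values."""
    n_rows, n_cols = df.shape
    table = document.add_table(rows=n_rows + 1, cols=n_cols)
    with cached_cells(table) as pending:
        for col, name in enumerate(df.columns):
            pending[(0, col)] = str(name)          # header row
        for row, record in enumerate(df.itertuples(index=False), start=1):
            for col, value in enumerate(record):
                pending[(row, col)] = str(value)
    return table
```

With both libraries installed, a call would look like df_to_table(docx.Document(), some_dataframe). Deferring the writes to the end of the block is a design choice for the sketch; writing straight into the cached cells inside the block would work just as well for a simple table.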
I did this:
This “transform” method should be called after all cells have been created.
Hope one day we can add a mechanism to regenerate _cells only on change.
All I needed to do was add text to the cells once I created them; I didn’t have to do any merging.
row_cells is just a reference to the cells in the table, so modifying it modifies the table as well (just like a = [1]; b = a; b[0] = 5 would make a == [5]).
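A small illustration of that aliasing point, using a stand-in Cell class rather than python-docx (the class and variable names here are made up for the example). One detail worth noting: slicing a list creates a new list, but the elements in it are the same objects, so setting .text on a cell from the slice is visible through the original list.

```python
class Cell(object):
    """Stand-in for a table cell; only the text attribute matters here."""
    def __init__(self, text=""):
        self.text = text


# Plain aliasing: b is another name for the same list object as a.
a = [1]
b = a
b[0] = 5
print(a)  # [5]

# Slicing: row_cells is a new list, but it holds the same Cell objects.
all_cells = [Cell() for _ in range(6)]
row_cells = all_cells[2:4]
row_cells[0].text = "hello"
print(all_cells[2].text)  # hello

# Rebinding a slot in the slice does NOT touch the original list.
row_cells[1] = Cell("replacement")
print(all_cells[3].text == "replacement")  # False
```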
If you need to merge cells as you’re going along it might be a bit more difficult to work around. You can see it gets slower and slower as row_cells grows, so it might be possible to build a bunch of small tables and then append their row_cells together to make one big table when you’re done. I’ve not run into that need yet so I haven’t tried anything.