wagtail: Optimise representation of deeply-nested StreamField blocks in migrations

Because a StreamField’s underlying database column is just a plain LONGTEXT populated with JSON-serialized block contents, the database never needs to know about changes made to the Block structure within the StreamField.
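
To make that concrete, here is a purely illustrative sketch (block names invented) of the kind of value such a column holds; the Python-side block definitions can change freely without touching this stored text.

import json

# Illustrative only: the shape of the text a StreamField column stores.
# The block names are made up; real ids are UUID strings.
stored_column_text = """
[
    {"type": "heading", "value": "About us", "id": "11111111-1111-1111-1111-111111111111"},
    {"type": "paragraph", "value": "<p>Hello world</p>", "id": "22222222-2222-2222-2222-222222222222"}
]
"""

# The database only ever sees this text; the block structure lives in Python.
blocks_in_db = json.loads(stored_column_text)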

However, because Wagtail currently does provide that information to the migrations framework, complex StreamField definitions generate gargantuan migrations whenever even the tiniest change is made to a single Block used by the field. Here’s an example: this one model’s migration code is nearly 500,000 characters.

And another half megabyte of text will be repeated in each subsequent migration if even a single field in a single Block has a single attribute changed. This is highly undesirable, and as far as I can tell, completely pointless. The auto-generated migration files don’t need to care about the internal structure of a StreamField; only manually written migrations that migrate the data from an old format to a new one need to care, and those don’t need to be half a megabyte of code.
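
To illustrate the shape of the problem, here is a heavily trimmed, hypothetical sketch of the kind of AlterField operation makemigrations emits today (the app, model, field and block names are invented, and module paths vary by Wagtail version): the entire nested block tree is re-serialized into the migration every time.

from django.db import migrations
import wagtail.core.blocks
import wagtail.core.fields


class Migration(migrations.Migration):

    dependencies = [('home', '0002_previous_migration')]

    operations = [
        migrations.AlterField(
            model_name='homepage',
            name='body',
            field=wagtail.core.fields.StreamField([
                ('heading', wagtail.core.blocks.CharBlock()),
                ('two_column', wagtail.core.blocks.StructBlock([
                    ('left', wagtail.core.blocks.StreamBlock([
                        ('paragraph', wagtail.core.blocks.RichTextBlock()),
                        # ...dozens more child blocks, spelled out in full...
                    ])),
                    ('right', wagtail.core.blocks.StreamBlock([
                        ('paragraph', wagtail.core.blocks.RichTextBlock()),
                        # ...and the same dozens repeated again here...
                    ])),
                ])),
                # ...and so on for every other top-level block...
            ]),
        ),
    ]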

I’m not yet entirely sure how best to remedy this, but I think it will have something to do with StreamField.deconstruct():

def deconstruct(self):
    name, path, _, kwargs = super(StreamField, self).deconstruct()
    block_types = self.stream_block.child_blocks.items()
    args = [block_types]
    return name, path, args, kwargs

I don’t think there’s any good reason for it to care about its child blocks. While I do have a lot of experience with dealing with, and hacking around, StreamFields, I’m not an expert, so maybe I’m missing something?

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 23 (17 by maintainers)

Most upvoted comments

This was discussed in the core team meeting on 25/06/20.

In general, it was agreed that including the streamfield block definitions in migrations was the correct default behaviour. This is because at any point in a model’s migration history, you should be able to access both the structured content/data AND the full streamblock definition, in case they are both needed for use in a data migration. As unlikely as this may seem, until there are more established solutions for migrating streamfield content, we feel it’s important to preserve this as an option.
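
As a hedged illustration of that rationale (this example is mine rather than from the discussion, and the app, model, field and block names are invented), a data migration can lean on the frozen block definitions to walk the stream by block type instead of manipulating raw JSON:

from django.db import migrations


def set_default_captions(apps, schema_editor):
    # Historical model: its StreamField still carries the block definitions
    # frozen into earlier migrations, so the stream can be interpreted.
    BlogPage = apps.get_model('blog', 'BlogPage')
    for page in BlogPage.objects.all():
        changed = False
        for block in page.body:
            # With definitions available, child values are fully deserialised,
            # e.g. a StructBlock value behaves like a dict here.
            if block.block_type == 'captioned_image' and not block.value.get('caption'):
                block.value['caption'] = 'Untitled'
                changed = True
        if changed:
            page.save()


class Migration(migrations.Migration):

    dependencies = [('blog', '0007_previous_migration')]

    operations = [
        migrations.RunPython(set_default_captions, migrations.RunPython.noop),
    ]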

However, we also recognise the issues that many (even core team members themselves) have reported in relation to this, and would like to offer a way for developers to opt out of this behaviour (i.e. including streamfield definitions in migrations) completely on a per-project basis, provided they are happy to accept the consequences (as outlined above).

EDIT: If there are other ways to optimise the representation (as outlined above), then they are still well worth exploring, as simpler migrations by default would be an obvious win.

@gasman following up on this:

your example code shows a level of configurability that goes way beyond anything I anticipated, or would recommend

This problem (large migrations due to highly configurable StreamFields) is something that we’ve run into as well, and is a major pain point around maintenance of our large Wagtail site.

I’m curious if you have any guidance on how or where to draw the line between having many Page models that differ slightly versus having fewer Pages but making them more configurable. Consider a use case where you have a few core page types that share certain common design elements (headers, footers, sidebars), but their main content may differ in numerous subtle ways. Now compare these few pages:

Each of these pages has the same basic structure, but the main content area is very different (different images, callouts, links, expandables). In our case we accomplish this through the use of at least one highly configurable StreamField, which can contain various blocks (some of them StructBlocks that hold, e.g., lists of other blocks).
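
A minimal sketch of that kind of setup (my own illustration with invented block and page names, not the commenter’s actual models): one shared, highly configurable block list reused across a handful of page types.

from wagtail.core import blocks
from wagtail.core.fields import StreamField
from wagtail.core.models import Page

# One shared, configurable block list used by several page types.
COMMON_BODY_BLOCKS = [
    ('heading', blocks.CharBlock()),
    ('paragraph', blocks.RichTextBlock()),
    ('callout', blocks.StructBlock([
        ('title', blocks.CharBlock()),
        ('links', blocks.ListBlock(blocks.URLBlock())),
    ])),
]


class LandingPage(Page):
    body = StreamField(COMMON_BODY_BLOCKS, blank=True)


class SpeakerListingPage(Page):
    # Same shared blocks, plus one page-specific block.
    body = StreamField(COMMON_BODY_BLOCKS + [
        ('speaker_bio', blocks.StructBlock([
            ('name', blocks.CharBlock()),
            ('bio', blocks.RichTextBlock()),
        ])),
    ], blank=True)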

This usage results in migration issues similar to those reported by @coredumperror. We’ve discussed various other ways to set this up, but it’s not clear how best to do this with a “many page types” approach. You’d need something like PageWithImageAndFormAndLinks and then PageWithImageAndLinksButNoForm - the combinations would quickly get out of hand. Then you’d also have a more difficult editor experience, where a page creator would have to choose from a large list of page types that differ in subtle ways. The inability to convert a page from one type to another also makes using page types less attractive than configurable StreamFields.

Any insight/suggestions on how best to set up pages like this in “the Wagtail way”?

@loicteixeira thanks for your comments. Much of what you said rings true with our experience. We have adopted an atomic design philosophy and so try to encourage content editors to pick and choose the blocks that make the most sense for a particular page.

Regarding migrations, I hear you about squashing. I’d like to better understand what functionality is gained by being able to access a frozen version of StreamFields, though, beyond simply following Django convention. What bad things would happen if the block details of the StreamField weren’t stored in the migration?

@gasman I agree in part about the wasted effort. One issue we’ve faced is that on a large site with so many pages and types of content, the final design of a page might go through many iterations before its final publication. So instead of being able to standardize the design of, say, a SpeakerListingPage, creating the model, and publishing it to content editors, we seek to work iteratively, allowing content editors to use the Wagtail admin to experiment with page designs before finalizing them.

A common use case is one where editors use an existing page as a template but alter it slightly. So we might have a SpeakerListingPage and then decide we want to add expanded bios for the speakers on a different page. Using multiple page models for this would require considerably more engineering time versus giving the content editors the ability to experiment.

We ran into the same issue(s) on a couple of websites too. Past a certain point, it is simply not possible to create separate page types for each combination, especially when converting an existing website with very bespoke pages to Wagtail (even after cutting the requirements in half by reconciling patterns together).

While I can agree with the general idea of keeping the right hat on (i.e. separating content and design), I think the page type isn’t the only valid point of separation between design and content. Given that a block has specific fields and renders consistently via a template, that separation is present there too (as opposed to putting everything in a rich text field). So by offering multiple blocks, we’re mostly allowing the content editor to re-organise the blocks and omit some others. This type of pattern reminds me of Atomic Design, where the blocks would be molecules and nested blocks atoms, used in a page/template (skipping organisms, because that’s probably pushing it too far).

However, to be fair, the more flexible the StreamField is, the lazier the developer can be about organising the content into separate logical units, which can arguably lead to inconsistent page structures (but consistent design).

Back on the migration issue, I don’t know what to think of it, really. I like the idea of following Django’s convention and being able to access a frozen version of the model, but given that it’s still rather difficult and fragile to modify StreamField data in migrations, and that the data isn’t converted under the hood when migrating (not sure how feasible that would be anyway), it feels a bit useless.

That being said, when we encountered this issue, one of the solutions we looked at was overriding the deconstruct method of a custom StreamBlock that would serve as the base for all the StreamBlocks in all our StreamFields. We decided against it because, as annoying as the current situation can be, it would have removed something quite fundamental from how Django migrations work and could be more painful down the line. Instead, we squash migrations more often than usual so we only have a few big migrations, and we created a custom runserver_fast command (used in development only) which disables migration checks (because with many big migration files, the server would take up to 30 seconds to restart on a file change).
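
For reference, a minimal sketch of what such a runserver_fast command can look like (an assumption on my part, not the commenter’s actual code): subclass Django’s runserver command and turn the system and unapplied-migrations checks into no-ops, since building the migration graph imports every (potentially huge) migration file on each reload.

# myproject/core/management/commands/runserver_fast.py (hypothetical path)
from django.contrib.staticfiles.management.commands.runserver import (
    Command as RunserverCommand,
)


class Command(RunserverCommand):
    """runserver without system checks or the unapplied-migrations check."""

    def check(self, *args, **kwargs):
        # Skip Django's system checks entirely (development convenience only).
        self.stdout.write('Skipping system checks.\n')

    def check_migrations(self, *args, **kwargs):
        # Skip the check that loads the full migration graph on every reload.
        pass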

Two years later, and I still stand by the solution above.

It’s part of our default project setup, has saved hours of collective CI time, and many more developer hours that would otherwise have gone into alternative solutions like regular squashing or manual removal of block definitions (or doing nothing, and having to deal with migration tree conflicts in multiple environments 😜).

Personally speaking, I have yet to encounter a situation where I couldn’t do something in a data migration that I could before. However, it should be noted that losing access to historical block definitions makes it impossible to use the StreamField migration helpers that have since been added to Wagtail.

My team solved this issue by simply cutting all StreamField details out of our migrations entirely, since they don’t actually write anything to the DB, anyway. We did so by adding the following code to our project:

import wagtail.core.fields


################################################################################################################
# Remove the database field definition override that Wagtail adds to StreamFields. It creates unnecessary churn
# in our migration files that ends up being really annoying.
################################################################################################################
def deconstruct_without_block_definition(self):
    name, path, _, kwargs = super(wagtail.core.fields.StreamField, self).deconstruct()
    block_types = list()
    args = [block_types]
    return name, path, args, kwargs
wagtail.core.fields.StreamField.deconstruct = deconstruct_without_block_definition
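
One detail the snippet above doesn’t show (so this is an assumption on my part, not part of the original comment): the patch has to be imported before makemigrations or migrate runs, for example from an AppConfig.ready() hook of an app listed in INSTALLED_APPS.

from django.apps import AppConfig


class CoreConfig(AppConfig):
    name = 'myproject.core'  # hypothetical app

    def ready(self):
        # Importing the module applies the StreamField.deconstruct patch above
        # before any management command inspects model fields.
        from myproject.core import monkeypatches  # noqa: F401  (hypothetical module)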

For those looking for an interim solution… we’re currently using this modified version of StreamField internally, which still allows access to the raw values in migrations:

import json

from wagtail.fields import StreamField as WagtailStreamfield


class StreamField(WagtailStreamfield):
    def __init__(self, *args, **kwargs):
        """
        Overrides StreamField.__init__() to account for `block_types` no longer
        being received as an arg when migrating (because there is no longer a
        `block_types` value in the migration to provide).
        """
        if args:
            block_types = args[0] or []
            args = args[1:]
        else:
            block_types = kwargs.pop("block_types", [])
        super().__init__(block_types, *args, **kwargs)

    def deconstruct(self):
        """
        Overrides StreamField.deconstruct() to remove `block_types` and
        `verbose_name` values so that migrations remain smaller in size,
        and changes to those attributes do not require a new migration.
        """
        name, path, args, kwargs = super().deconstruct()
        if args:
            args = args[1:]
        else:
            kwargs.pop("block_types", None)
        kwargs.pop("verbose_name", None)
        return name, path, args, kwargs

    def to_python(self, value):
        """
        Overrides StreamField.to_python() to make the return value
        (a `StreamValue`) more useful when migrating. When migrating, block
        definitions are unavailable to the field's underlying StreamBlock,
        causing self.stream_block.to_python() to not recognise any of the
        blocks in the stored value.
        """
        stream_value = super().to_python(value)

        # There is no way to be absolutely sure this is a migration,
        # but the combination of factors below is a pretty decent indicator
        if not self.stream_block.child_blocks and value and not stream_value._raw_data:
            stream_data = None
            if isinstance(value, list):
                stream_data = value
            elif isinstance(value, str):
                try:
                    stream_data = json.loads(value)
                except ValueError:
                    stream_value.raw_text = value

            if stream_data:
                return type(stream_value)(self.stream_block, stream_data, is_lazy=True)

        return stream_value
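
A usage sketch for the field above (my own illustration; the module, model and block names are invented): models import this subclass instead of Wagtail’s StreamField, so generated migrations reference the subclass path and its forgiving __init__ is the one used when they are applied.

from wagtail import blocks
from wagtail.models import Page

from myproject.fields import StreamField  # hypothetical home of the subclass above


class ArticlePage(Page):
    body = StreamField([
        ('paragraph', blocks.RichTextBlock()),
    ], blank=True)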

I would greatly appreciate such an opt-out mechanism. My team does not care about keeping the history of our streamfields’ structure, and this change would save us several megabytes of useless new migration code being shoved into our repo every week.

One potential optimisation (which might well also link into #3062) would be to recognise when the same sub-block is re-used in multiple places in a streamfield definition, and assign that block an ID that we can point back to, instead of repeating the definition in full each time.

So, instead of:

body = StreamField([
    ('one_column', StreamBlock([big_long_list_of_sub_blocks])),
    ('two_column', StructBlock([
        ('left', StreamBlock([big_long_list_of_sub_blocks])),
        ('right', StreamBlock([big_long_list_of_sub_blocks])),
    ])),
])

we could have something like:

body = StreamField([
    ('one_column', StreamBlock([big_long_list_of_sub_blocks], id=1)),
    ('two_column', StructBlock([
        ('left', BlockReference(id=1)),
        ('right', BlockReference(id=1)),
    ])),
])

or perhaps

body = StreamField([
    ('one_column', BlockReference(id=1)),
    ('two_column', StructBlock([
        ('left', BlockReference(id=1)),
        ('right', BlockReference(id=1)),
    ])),
], block_refs={1: StreamBlock([big_long_list_of_sub_blocks])})

(Bear in mind that the graph of references might be arbitrarily complex: a block used as a reference might itself contain block references as sub-blocks. For extra bonus points, support circular references to achieve arbitrary-depth nesting. Yes, that’s something people have actually requested…)

@ssyberg our approach of excluding all streamfield details from our migrations is still working fine.

I wonder if this issue is related to a question I had on my mind (please tell me if this is too off topic): is StructBlock functionally duplicating what django.Model already handles? Both have a similar objective, which is to store a collection of fields.

What if - instead of inserting a StructBlock - I could insert a new django.Model instance into a StreamField? Perhaps this would add a few requirements on the particular django.Model (e.g. the existence of a few extra member functions). This instance would have an id, and it could be stored in its dedicated database table. If I change the django.Model’s definition, it would not directly affect any StreamField.

PS: this is equivalent to the SnippetChooserField, except that the django.Model instance is not chosen from the database but created in the editor.
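
For comparison (a sketch of my own, with invented model and block names, not from the comment): the closest existing block-level pattern is SnippetChooserBlock, where a registered Django model instance lives in its own table and is referenced by id from the stream, but is chosen from the database rather than created inline.

from django.db import models
from wagtail.fields import StreamField
from wagtail.models import Page
from wagtail.snippets.blocks import SnippetChooserBlock
from wagtail.snippets.models import register_snippet


@register_snippet
class Advert(models.Model):
    text = models.CharField(max_length=255)

    def __str__(self):
        return self.text


class HomePage(Page):
    body = StreamField([
        # Stores only the chosen Advert's primary key in the stream's JSON.
        ('advert', SnippetChooserBlock('home.Advert')),
    ], blank=True)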