Sometimes we make mistakes. And sometimes we make mistakes in production.

I believe we should not be judged for the mistakes we make, but for how we handle them. Very often, we see teams diving straight into the production database to correct their mistakes.

Just kidding, of course! We've all been there. A wise man once said: "He that is without sin among you, let him first cast a stone." That doesn't mean we can’t improve, though. Right?

What are the limits of migrations?

For a long time, I have been writing small scripts when I need to execute commands on a production server. These scripts have some virtues.

They are (or should be) testable; they are repeatable (for repeated mistakes. No judgment!); they can be reviewed. They can even serve as a basis for a new data-correcting functionality in your admin backend.
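To make that concrete, here is a minimal sketch of what such a correction script might look like as a Symfony console command. The command name is made up, the SQL reuses the status typo from the manipulation example later in this post, and it assumes the Doctrine DBAL connection is autowired:

// src/Command/CorrectStatusTypoCommand.php (hypothetical example)

namespace App\Command;

use Doctrine\DBAL\Connection;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;

#[AsCommand(name: 'app:correct-status-typo', description: 'One-off correction of a mistyped status value')]
final class CorrectStatusTypoCommand extends Command
{
    public function __construct(private readonly Connection $connection)
    {
        parent::__construct();
    }

    protected function configure(): void
    {
        $this->addOption('dry-run', null, InputOption::VALUE_NONE, 'Report what would change without writing');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Report the impact first, so the script can be tested and reviewed safely.
        $count = $this->connection->fetchOne("SELECT COUNT(*) FROM items WHERE status = 'actife'");
        $output->writeln(sprintf('%d rows would be corrected.', $count));

        if (!$input->getOption('dry-run')) {
            $this->connection->executeStatement("UPDATE items SET status = 'active' WHERE status = 'actife'");
        }

        return Command::SUCCESS;
    }
}

Because it is an ordinary console command, it can be covered by a test, reviewed in a pull request, and run again if the same mistake happens twice.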

When arriving at my latest project, I was happy to see this practice was already in use, just in a slightly different form than what I was used to. The team used database migrations to execute their correction scripts.

I didn't think much of it until I started creating a proper reproducible development environment. One of the key components for this is running database migrations.

However, the data corrections that were happening were breaking my migrations. Down migrations failed. Certain data was assumed to be present. It took forever to fix all the issues.

Are migrations the right tool for data manipulation?

In an attempt to fix this, I turned to what I already knew. I challenged the use of database migrations for data correction, only to discover the team had actually chosen this approach on purpose.

To understand why this approach makes sense, it's important to understand their development workflow. Usually, we strive to deploy continuously. For this particular project, however, deployments are release-based.

This means that feature goals for a particular release are set, and the code is deployed whenever those features are completed and tested. I have opinions on this, but we're not at a point where we can start deploying and releasing continuously. It is what it is.

This approach implies that there is usually a delay between development and deployment. In this time span, the responsible developer has already started something new and would often simply forget to run a necessary script when it comes time to deploy.

If only there were a system where we could run scripts that manipulate the database automatically when a deployment occurs. Hey, we have something like this: migrations!

Enter manipulations

So at that point, I understood why migrations were being used. And to be fair, it makes sense. Unfortunately, understanding this didn't solve my problem. What it did, however, was put me on the right track towards a proper fix.

In the end, I came up with a solution based on two different migration tracks. One, I called data migrations; the other, I called data manipulations.

Migrations and manipulations are similar. For instance, they are both run immediately after deployment. The biggest difference is that manipulations don't need to be run in other environments, such as your dev environment or a testing environment.

It involved some unconventional configuration in Symfony, but, in the end, I made it work. The first thing I did was create a configuration so Doctrine knew what to do:

# config/data_manipulations/config.yaml

transactional: false
migrations_paths:
    'Data\Manipulations': data/Manipulations
table_storage:
    table_name: 'doctrine_data_manipulation_versions'

Hardening manipulations for error cases

I added an extra constraint to the manipulations: I don't want them to be reverted by migrating down. Allowing that seemed rather dangerous, since once data is in the database, it should be considered in use.

Removing this data by reverting the migrations should not be the default. Instead, I prefer to fail forward and write a follow-up migration to correct the data. This approach gives you the opportunity to think about the best way to re-correct your data.
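To illustrate failing forward, a follow-up manipulation that re-corrects data written by an earlier one might look like this sketch (the version class name, status values, and SQL are invented for the example):

// data/Manipulations/Version20230804101500.php (hypothetical follow-up)

namespace Data\Manipulations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

final class Version20230804101500 extends AbstractMigration
{
    public function getDescription(): string
    {
        return '#43 Follow-up: the previous correction used the wrong status value';
    }

    public function up(Schema $schema): void
    {
        // Correct the earlier correction instead of rolling it back.
        $this->addSql("UPDATE items SET status = 'inactive' WHERE status = 'in_active'");
    }

    public function down(Schema $schema): void
    {
        // Intentionally empty: manipulations are never rolled back.
    }
}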

Blocking rollbacks can be implemented by writing a Doctrine event subscriber. I first check whether we are working with a manipulation by looking at its namespace. This technique may seem a bit crude, but it works perfectly for us.

# config/services.yaml

services:
    Data\BlockRollbackSubscriber:
        class: Data\BlockRollbackSubscriber
        tags:
            - { name: 'doctrine.event_subscriber', connection: 'default' }
// data/BlockRollbackSubscriber.php

namespace Data;

use Doctrine\Common\EventSubscriber;
use Doctrine\Migrations\Event\MigrationsEventArgs;
use Doctrine\Migrations\Events;

class BlockRollbackSubscriber implements EventSubscriber
{
    public function getSubscribedEvents()
    {
        return [
            Events::onMigrationsMigrating,
        ];
    }

    public function onMigrationsMigrating(MigrationsEventArgs $event): void
    {
        // Determine whether the migration about to run lives in the manipulations namespace.
        $migrationClass = $event->getPlan()->getFirst()->getMigration();
        $reflection = new \ReflectionClass($migrationClass);
        $namespace = $reflection->getNamespaceName();

        $isDataManipulation = ('Data\Manipulations' === $namespace);
        $direction = $event->getPlan()->getDirection();

        // Manipulations may only migrate up; block any attempt to roll them back.
        if ($isDataManipulation && 'down' === $direction) {
            throw new \Exception('Data manipulations should not be rolled back');
        }
    }
}

What is a data manipulation?

I tried to come up with some rules for when something should be a migration or a manipulation since this may not always be obvious. The most general rule I could come up with is whether the changes are heavily coupled to changes in your code.

More concretely: if rolling back code without rolling back data will cause the code to fail, the data change should be a migration.

Another heuristic could be whether a change should happen in all environments. If it should happen in all environments (e.g. a new configuration value is added), perhaps a migration is better suited.
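For contrast, here is a sketch of what such a change might look like as a regular migration, assuming the default DoctrineMigrations namespace; the configuration table and value are invented for the example:

// migrations/Version20230803120000.php (hypothetical regular migration)

namespace DoctrineMigrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

final class Version20230803120000 extends AbstractMigration
{
    public function getDescription(): string
    {
        return 'Seed the new maintenance_mode configuration value';
    }

    public function up(Schema $schema): void
    {
        $this->addSql("INSERT INTO configuration (name, value) VALUES ('maintenance_mode', '0')");
    }

    public function down(Schema $schema): void
    {
        // Regular migrations can be rolled back, so a down() is provided.
        $this->addSql("DELETE FROM configuration WHERE name = 'maintenance_mode'");
    }
}

Because a change like this has to happen in every environment and can be rolled back together with the code, it belongs in the regular migrations track.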

Here’s an example of a data manipulation class for the sake of completeness. Make sure the class is located in the correct namespace and directory.

// data/Manipulations/Version20230803161038.php

namespace Data\Manipulations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

final class Version20230803161038 extends AbstractMigration
{
    public function getDescription(): string
    {
        return '#42 Correct typo in status field';
    }

    public function up(Schema $schema): void
    {
        $this->addSql("UPDATE items SET status='active' WHERE status='actife'");
    }

    public function down(Schema $schema): void
    {
        // Intentionally left empty: the subscriber blocks rolling manipulations back.
    }
}

Managing best practices

Sometimes you arrive in a codebase where practices are different from what you’re used to. Sometimes those practices even go against what you consider best practices.

Very often, there is a reason why things are the way they are. It’s very valuable to research those reasons. This will help you in shaping a solution that improves the quality of the codebase, while still solving the problem the team faced.

Do I think data manipulations are the way forward for every project? No. In an ideal project, I would still prefer to run manipulation scripts manually and not tie them so closely to a data migration library. That gives me more control and more options for fast feedback.

However, in the context of this team and this project, I think they’re a very good solution. They allowed me to set up a local environment the way I wanted, while still allowing the team to deploy without the risk of forgetting to run the manipulation scripts. And no one has to go to jail.