Django - growing up from SQLite to PostgreSQL

Django comes with SQLite as its default database. It’s a great choice for initial development and for testing, but not if you want to run multiple app servers, e.g. in a SaaS project. At some point it’s time to move to a database like MySQL or PostgreSQL that is more suitable for that purpose.

I’ve been working on my open source side project Invertimo for a while and at some point I decided to change the db engine from SQLite to PostgreSQL. Well… it wasn’t an entirely smooth transition.

Here are three stories of my database migration struggles, with some valuable lessons learned along the way.

Configuration glitches

I don’t set up new databases like this every day, so I looked up some tutorials and followed this one. There was just one problem: my tests would not run due to permission issues!

Thank goodness, Django has pretty good documentation for its settings; you can find the relevant sections here and here.

Aside from using a separate database, the test runner will otherwise use all of the same database settings you have in your settings file: ENGINE, USER, HOST, etc. The test database is created by the user specified by USER, so you’ll need to make sure that the given user account has sufficient privileges to create a new database on the system.

The key takeaways are:

  • when using PostgreSQL, the database user needs permission to create databases (CREATEDB)
  • the test db uses basically the same settings as the ‘normal’ db; you can only make very minor overrides, like changing the NAME (see the sketch below)
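
For illustration, here is a minimal sketch of an override Django does honour: only the test database NAME is changed, everything else (including ENGINE) is reused from the ‘default’ entry. The names are placeholders.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'USER': 'mydatabaseuser',
        'NAME': 'mydatabase',
        'TEST': {
            # Only small overrides like NAME belong here; ENGINE, USER, etc.
            # are taken from the settings one level up.
            'NAME': 'mytestdatabase',
        },
    },
}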

To solve the permission problem I needed to give my PostgreSQL user permissions to create new databases:

ALTER USER myuser CREATEDB;
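
If you are not sure where to run that statement, one way (assuming a typical local install with the default postgres superuser) is via psql:

sudo -u postgres psql -c "ALTER USER myuser CREATEDB;"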

As a side note, I had tried a configuration like the one below to make my tests run with SQLite:

** BAD **

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'USER': 'mydatabaseuser',
        'NAME': 'mydatabase',
        'TEST': {
            # Will not work, since Django will take the ENGINE from
            # the value one level up.
            # The bad thing is that it will not show you any warning,
            # that the configuration is invalid.
            'ENGINE': 'django.db.backends.sqlite3',
            'NAME': 'mytestdatabase',
        },
    },
}

But it didn’t work and I would generally not recommend that approach. I will explain the issues with that later in the article.

If you really needed to do that though, here is a configuration that would work:

import sys
if 'test' in sys.argv:
    DATABASES['default'] = {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'mydatabase'
    }

Instead of setting overrides within the ‘TEST’ dictionary, you could override the settings for your ‘default’ db directly when the program is run in the context of tests. More about this solution can be found here.

Problems with fixtures

I’m not a huge fan of database test fixtures, but they have their good use cases and Django supports them pretty well. They can be a pain to maintain (hello, database changes!) and they can make your tests slow if you load too much data.

In my Invertimo project I’ve been using one fixture for things such as Stock Exchanges (otherwise setting them up in tests would require quite a lot of boilerplate).
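
For context, this is roughly how such a fixture gets used in a test. A sketch, assuming the fixture file is called exchanges.json and a hypothetical Exchange model:

from django.test import TestCase

from finance.models import Exchange  # hypothetical model, for illustration only


class ExchangeFixtureTest(TestCase):
    # Django loads these fixtures into the test database before each test.
    fixtures = ["exchanges.json"]

    def test_exchanges_are_available(self):
        self.assertTrue(Exchange.objects.exists())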

Natural keys stumble

It was unclear to me why, but my old fixture that worked with SQLite wouldn’t work with PostgreSQL anymore. Obviously, I couldn’t let my tests keep failing, and I decided it would be easier to recreate the data and create a new fixture.

Django provides two very convenient management commands for dealing with fixtures: loaddata and dumpdata.
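
A couple of typical invocations, for reference (the app name is the one from this project; the exact paths and scope are illustrative):

# Serialize the finance app's data into a fixture file.
python manage.py dumpdata finance --indent 2 > finance/fixtures/exchanges.json

# Load a fixture back into the database.
python manage.py loaddata exchanges.json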

A common source of problems is Django’s contenttypes, which are autogenerated and ‘shouldn’t be serialized’. Well, by default they are, unless you exclude them or… do a natural key trick.

Like I did when I created my application:

python manage.py dumpdata --natural-primary > finance/fixtures/exchanges.json

Well, you shouldn’t just use a flag because someone on the internet suggested it. It can lead to other types of problems.

Do better than me: read about natural keys and serialization in Django before thinking about using them.
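
For reference, here is roughly what the natural key machinery looks like, sketched with a hypothetical Exchange model keyed by its name (see the Django serialization docs for the real details):

from django.db import models


class ExchangeManager(models.Manager):
    def get_by_natural_key(self, name):
        # Used during deserialization (loaddata) to look up an existing row
        # by its natural key instead of its primary key.
        return self.get(name=name)


class Exchange(models.Model):
    name = models.CharField(max_length=200, unique=True)

    objects = ExchangeManager()

    def natural_key(self):
        # Used by dumpdata with --natural-foreign to serialize references
        # to this object without relying on its primary key.
        return (self.name,)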

So, I created my new fixtures and they loaded fine with PostgreSQL. I found another bug related to DB differences, which I will talk about later, and all was fine for a while.

But then I added some more features and more tests, and I suddenly started experiencing weird integrity errors across tests. I don’t know exactly how Django tests interact with the db and I didn’t feel like reverse engineering the code (it’s a weekend side project after all and I have other things to do too…), but believe me, it was very annoying.

My Stack Overflow searches were not yielding many results, but I couldn’t give up, of course! The test with the fixture was rather important and I didn’t feel like redoing the db setup from scratch in the test. I ended up diving into my fixtures and analyzing the errors I had been getting. All of them were related to loading the serialized User and a failing uniqueness constraint on primary keys.

I found something a bit odd - my serialized user didn’t have a ‘pk’ field.

{"model": "auth.user", "fields":
 {"password": "pbkd1234zE9lKo=", "last_login": "2021-05-23T12:28:17.174Z",
 "is_superuser": true, "username": "leethacker", "first_name": "", "last_name": "", "email": "leethacker@gmail.com", "is_staff": true, "is_active": true }

Odd, right? Well, I made it this way by using the --natural-primary flag. Adding a fake "pk": 122 manually to my fixture solved my problem. Oof, believe me that was a pretty frantic debugging session.

Note for my future self: ditch the contenttypes instead of using --natural-primary for this specific use case:

python manage.py dumpdata --exclude=contenttypes > my_data.json

Another option would be not using fixtures at all and saving myself all the associated complexity of serializing and deserializing data.
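
If I went that route, the data would be created directly in the test code instead. A rough sketch, again with a hypothetical Exchange model:

from django.test import TestCase

from finance.models import Exchange  # hypothetical model, for illustration only


class ImportTest(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Created once per test class inside a transaction, so there is no JSON
        # file to keep in sync with schema changes.
        cls.exchange = Exchange.objects.create(name="XETRA")

    def test_exchange_exists(self):
        self.assertEqual(self.exchange.name, "XETRA")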

Is PostgreSQL just so much slower than SQLite?

When I made my initial jump from SQLite to PostgreSQL (and re-dumped my fixture), my tests got significantly slower. I blamed it on PostgreSQL being ‘heavy-weight’ compared to SQLite. Don’t people like to switch their db to SQLite to speed up their tests?

Well, my tests were awfully slow with PostgreSQL. I figured out why while debugging the fixture integrity errors. My fixture was awfully, awfully long. Long enough that GitHub refused to show me the diff - that could have raised some red flags, but it didn’t.

Turns out I accidentally dumped the multi-year daily price history of multiple stocks into a JSON text file. No surprise that my tests took 10+ seconds parsing that data. I cut it out and sped my tests up about 10x.

A tip for my future self:

Evaluate the autogenerated fixtures critically for unnecessary data.
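
In practice that can mean dumping only the models the tests actually need, or explicitly excluding the bulky ones. A hedged example (finance.PriceHistory is a made-up name standing in for whatever model holds the price data):

# Dump only the reference data the tests need (a single model).
python manage.py dumpdata finance.Exchange --indent 2 > finance/fixtures/exchanges.json

# Or dump everything except the bulky and autogenerated models.
python manage.py dumpdata --exclude=contenttypes --exclude=finance.PriceHistory > my_data.json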

DB representation differences can cause application bugs

SQL databases share some common features, but they also have their differences. Some obvious ones are query syntax, data types, and performance. Here is a good comparison between MySQL and PostgreSQL. The more raw SQL or custom DB features your Django app uses, the harder a DB switch is going to be.

My app seemed very simple in how it uses the SQL database: no custom SQL, no fancy DB features. But I managed to hit a complex issue anyway!

Invertimo has a feature where users can import their stock transactions via a CSV file. I use pandas to load and manipulate that data.

Here are some code snippets that contained a very specific bug:

import pandas as pd

transactions_data = pd.read_csv(filename)
# ... (data cleaning steps omitted) ...
transaction_record = transactions_data_clean.iloc[x]

I get my float values and convert them to decimals, because this is how they are stored in the database:


quantity = transaction_record["Quantity"]
...

def to_decimal(pd_f):
    # Convert a pandas float value to decimal.Decimal (the buggy version).
    return decimal.Decimal(pd_f.astype(decimal.Decimal))

accounts.AccountRepository().add_transaction(
    account,
    isin=isin,
    exchange=exchange,
    executed_at=executed_at,
    quantity=to_decimal(quantity),
    price=to_decimal(price),
    transaction_costs=transaction_costs,
    local_value=to_decimal(local_value),
    value_in_account_currency=to_decimal(value_in_account_currency),
    total_in_account_currency=to_decimal(total_in_account_currency),
    order_id=order_id,
)

I avoid adding duplicate transactions by using Django’s get_or_create, which should not add a new instance of the model if an instance with exactly the same values already exists:

transaction, created = models.Transaction.objects.get_or_create(
    executed_at=executed_at,
    position=position,
    quantity=quantity,
    price=price,
    transaction_costs=transaction_costs,
    local_value=local_value,
    value_in_account_currency=value_in_account_currency,
    total_in_account_currency=total_in_account_currency,
    order_id=order_id,
)

Luckily, I had tests for this functionality; unluckily, the tests started failing after I switched the DB from SQLite to PostgreSQL. According to my tests, my duplicate prevention logic had stopped working. In other words, I could now end up with duplicate transactions.

I narrowed down the issue to a problem with get_or_create. The only thing that had changed was the DB, so what could be wrong?

It turned out that when I tried to import the same transaction twice, the price already stored in the database and the new one passed to get_or_create were no longer identical, so another transaction was created, resulting in a duplicate.

Why would a transaction not be deduplicated by get_or_create if the initializing values were identical? My best educated guess was that the decimal values did not match. Could it be that the pandas astype conversion was problematic?

def to_decimal(pd_f):
    return decimal.Decimal(pd_f.astype(decimal.Decimal))

This code worked perfectly fine with get_or_create while using SQLite, but with PostgreSQL it was causing transaction duplicates.

The fix was to convert the value to a string before constructing the Decimal:

def to_decimal(pd_f):
    return decimal.Decimal(pd_f.astype(str))

Most likely it was some kind of precision issue caused by the conversion from floats, similar to this:

In [1]: price = 4.35

In [2]: decimal.Decimal(decimal.Decimal(price))
Out[2]: Decimal('4.3499999999999996447286321199499070644378662109375')

In [3]: decimal.Decimal(str(price))
Out[3]: Decimal('4.35')

Converting the value to a string first got rid of the ‘imprecision’ of the float before the Decimal was built and stored in the db. Different databases don’t store values like Decimal the same way; the implementation differences were big enough to flip my code from working to not working.

The lesson here is to write good tests and use the same database engine for tests and production. This bug was tricky and non-intuitive. Tools like mypy that help with types in Python could not save me here. I was lucky my tests caught it, otherwise I would have happily shipped the bug to production.

Conclusion

A database switch can be quite a painful and time-consuming operation, but it can be done. It might be easier to start your app with the target database in mind and save yourself the future trouble.

Otherwise, make sure you buffer some time to fix the issues that might come up in the process.
