Materialized View Refresh Strategies

Introduction

In database-heavy applications with complex business requirements database queries will become complex and query execution could be very slow. This problem is usually solved with the help of Materialized Views (MV or MVS for plural). In this article, we will learn what Materialized Views are and how to implement them. We will then see what their main problem is and describe common strategies to solve this problem, but also explain a new strategy (and a practical system) which solves this problem in a more elegant way.

What is a Materialized View?

Materialized View is a data structure composed of data from multiple tables (and/or views). MV is similar to a database table because it is stored physically on the disk. Because of that, fetching data from MV is very fast. For example, in PostgreSQL we can create it with the following syntax:

CREATE MATERIALIZED VIEW mview_name

AS

mv_query

WITH [NO] DATA;

Once created, our MV can be accessed like a normal view or table, eg:

SELECT * FROM mview_name;

But, if we insert/update/delete some of the data in the tables that the MV is using, this query will show us the old data, and we must find a way to ‘refresh’ (reload) it somehow.

Loading the MV with ‘fresh’ data

The data in MV can be updated or ‘refreshed’. This is done with the REFRESH command, for example in PostgreSQL we have this command:

REFRESH MATERIALIZED VIEW view_name;

This will execute the mv_query, and put all the ‘fresh’ data from the mv_query (ie. underlying tables/views) into the materialized view.

MV refresh lock problem

REFRESH MV operation takes some time (‘t’) and during that time the MV is “locked”. What does this “lock” really mean, what happens with the application? Don’t worry, the application will not crash. We can still query the MV, but the MV data won’t be shown until the REFRESH operation is finished. In the app, the end users will have to wait longer than usual. To improve this, we could develop a few strategies, ie. we can look for the appropriate situations in which we can efficiently trigger a REFRESH command.

Materialized view refresh execution strategies

We can execute MV REFRESH operation in multiple ways:

A) manually, via database GUI application like dbeaver – this is usually done only for development, debugging or testing

B) in code, on demand, after the underlying data has been created/updated/deleted – this is the most common method, but also complex because the developers must add this REFRESH operation after each insert/update/delete is executed

C) periodically, by cron e.g. each minute, hour, day, week etc – This is a simple method but must be carefully done (more about that below – T3)

D) both B and C

When deciding about the ‘right’ MV refresh strategy, we should take multiple factors into consideration:

Average MV refresh time (‘t_avg’) – you can find out this with the method A

MV data importance from the business perspective

Are there multiple MVS that are using the same tables? – if yes, we get increased refresh times and longer loading times which is not good for user experience

Number of concurrent application users – 1, 5, 10, 100, 1000 or more?

Users behavior – what is the typical user behavior, how are they using the application, and how often?

But, to simplify things, we can start by only looking at the first factor, ‘t_avg’:

T1: If the ‘t_avg’ is very small, eg 0 – 8s, we can execute it after every table create/update/delete operation (B)

T2: If the ‘t_avg’ lasts from 8 – 30s, we can execute it periodically by cron, eg. every 2 minutes (C)

T3: If the ‘t_avg’ lasts from 30s – 5min or more, we can:

– execute it periodically (eg. every 2min to 1 hour), but that could lead to stale (old, wrong) data and unwanted/unexpected long loading times.

– or if we have rare inserts, we could trigger it only in that case, and once an hour/day/similar with cron to keep the update/delete records up-to-date.

And simply, if the data is rarely changed, or it isn’t that important to the business (end users), just update the MVS once a day, preferably by nightly cron.

These guidelines may be enough for some simpler cases. Truth be told, if this is acceptable to the client, look no further. But more often the data is very important to the business (they need it to be available and up-to-date at all times), the MVS are using same tables, and there are many users that are concurrently using the application in different ways. Also, we see that the third case (T3) has its problems, so we must find a smarter way to refresh our MV.

Behavioural MV refresh strategy

This MV refresh strategy is based on the users’ behaviour ie. frequency of users’ actions in the application. If we analyze exactly what the users do in the application, we have 3 types of users:

users group A – loads data, uses filters etc (sql select)

users group B – loads, filters, and rarely changes the data (sql select, update, insert, delete)

users group C – loads, filters, and change the data often, manually or via import (sql select, update, insert, delete)

Looking and filtering the data uses only the MV without the need for MV refresh. Changing the data initiates MV refresh, and ‘disables’ MV for viewing. Knowing all of this, we could make a system that “listens” for users’ actions and tries to make the MV refresh when the users are no longer using insert/update/delete for some time that we set (threshold time ‘Z’).

The implementation

Practically, this system consists of a frequently called (flock-ed) cron script (eg. every 15s or less) that executes MV refresh when needed, and an additional database table with the following structure:

create table public.mvs (

id int primary key,

mv_name text,

status text,

updated_at timestamp,

last_successful_update timestamp

)

This table keeps track of all MVS in the system, and their statuses:

updated

– outdated

updating

updating-outdated

Starting mvs.status is ‘updated’. When an MV related table has any of the insert/update/delete queries, in the code we simply change the mvs.status to ‘outdated’. After no more than 15s, the cron script will pick up this record but only when its mvs.updated_at is older than 2 (threshold ‘Z’) minutes. The script will then start to execute the appropriate MV refresh command, and change the mvs.status to ‘updating’. When the MV refresh is finally finished, the cron script will set the mvs.status back to ‘updated’. Threshold time ‘Z’ can be tweaked, from 2 to 5 minutes, depending on the feedback from the users: start with 2 minutes, and increase if needed.

Graphically this process looks like this:

There are two possible flows:

1) 1,2,3,4,5,6,1. This represents an ideal situation where a single user makes a change, and after a threshold time is passed, the cron runs and finishes a MV refresh.

2) 1,2,3,4,5,7,8,9,3,4,5,6,1. Here, a single user makes a change, and after a threshold time is passed, the cron runs a MV refresh, BUT during the MV refresh, the other user also makes a change, and the status is changed to updating-outdated. After the cron finishes the MV refresh, it will again check the status and see that is changed to updating-outdated and change the status to outdated and again, after a threshold is reached, trigger the MV refresh again.

It is important to note that during (3), the other users can change the data, and the mvs.updated_at is simply updated to current time. Therefore, the MV refresh is pushed more into the future. This is the main feature of the system, where we “stop” the unneeded MV refresh, and wait just a little longer, until the users stop making (frequent) changes to the tables.

Conclusion

Materialized views provide a way to optimize a database, in a way where the slowness of SELECT queries is shifted to the data manipulation side (INSERT, UPDATE, DELETE). The main drawback is that we get the MV locking problem while the REFRESH MV is in progress. This article describes a couple of standard strategies for solving this drawback, and a new automatic cron-based system that is based on users’ behaviour. This system was developed while implementing a materialized views for a client, and while it introduces some additional complexity, it has been proven to be a good and reliable way of improving the overall application speed, stability and user experience.

Related Posts

Leave a Reply

Contact Us