Aggregating indeterminate JSON structures in Postgres
Postgres JSON types are great, but what do you do when your users fill them with garbage and you want to do maths?
The problem I was presented with was to find a way to aggregate user-defined (and structured) data that was stored in a serialized Ruby hash in MySQL, and could come in pretty much any shape or form.
I initially thought about expanding our serialized hash store into a key value table, however, this raised a number of problems:
- We'd need to use a TEXT/BLOB type for the values; they could be anything – strings, integers, floats, booleans, who knows
- Non-numeric values would still be mixed up amongst numeric values, meaning we'd have to cast and/or filter in order to use MySQL's aggregation functions, support for which appeared less than satisfactory
- Indexing restrictions apply to BLOB (and therefore TEXT) data types in MySQL anyway
- We would still be unable to aggregate multidimensional/nested data
I toyed with the idea of denormalizing the data structure and having columns like value_to_s, value_to_i and value_to_f as per the Ruby vernacular, and casting string values into those columns if they match a relevant regex for the type.
However, this would still only allow us to aggregate top-level values, it would make writes pretty complicated, synchronising updates would be an issue, and engineering it all seemed like a headache. In short, I was largely dissatisfied with this as a solution and didn't want to build it. Fuck that.
I quickly came to the conclusion that:
- I wanted to store this data as documents, JSON, ideally
- This seemed like a difficult problem to solve in MySQL
- I didn't really want to bolt another database onto the side of the application, yet another thing to keep in sync
- I didn't have much faith that the JSON data type added in MySQL 5.7.8 would make the situation much better
- MariaDB didn't make me any more confident either
- Postgres' JSON types, type casts, and regex support appeared to solve all of my problems
After more research I concluded that migrating the database from MySQL to Postgres, and fixing the application issues that raised, would be a less painful experience than trying to implement this feature in MySQL – and on top of that, we'd get a truckload of benefits outside of this problem as well.
So, what did those queries look like once I'd done that?
Usage examples
Selecting properties
Selecting from JSON types is pretty easy once you've got your head around the syntax: it's basically column_name->'propname' to select a property and retrieve it as a JSON type (allowing further traversal) – and if you want to retrieve its text representation, use ->> instead.
It's quite common to end up using -> for everything up until the last property, i.e. traverse the structure then grab the value, e.g. reviews->'author'->>'name'.
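For example, here's a minimal sketch against the entries table used throughout the examples below – the only difference between the two operators is what they hand back:
-- -> keeps the JSON type (so you can traverse further), ->> unwraps to TEXT
SELECT custom->'rating'  AS rating_json,
       custom->>'rating' AS rating_text
FROM entries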
More examples can be found in the functions and operators documentation.
Aggregating known types
The most basic aggregations, where you know for sure what the data looks like (e.g. all integers), are pretty simple. ->> always hands back text, so a cast is still needed, but with clean data it can't fail:
id, title, custom
1, "Entry 1", {"rating": 0}
2, "Entry 2", {"rating": 100}
SELECT AVG((custom->>'rating')::NUMERIC) FROM entries
=> 50
Aggregating strings via casting
The dataset I had was user supplied, often from form submissions, so integers and floats were usually stored as strings, which isn't ideal – but the same cast to NUMERIC handles those just as happily:
1, "Entry 1", {"rating": "0"} 2, "Entry 2", {"rating": "100"}
SELECT AVG((custom->>'rating')::NUMERIC) FROM entries
=> 50
Note: an extra pair of parentheses is required to wrap the property selection, otherwise Postgres will try to cast the string 'rating' rather than your property value.
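To see why, here's a quick sketch of the precedence problem against the same entries table – :: binds tighter than ->>, so without the parentheses Postgres attempts to cast the key name itself:
-- parsed as custom ->> ('rating'::NUMERIC), so casting the literal 'rating' fails
SELECT AVG(custom->>'rating'::NUMERIC) FROM entries
-- wrapping the extraction forces it to happen before the cast
SELECT AVG((custom->>'rating')::NUMERIC) FROM entries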
Aggregating strings that might be numeric
As this data was structured by end users, properties were often reassigned or repurposed, meaning the result set often contained a variety of data types for the same key:
1, "Entry 1", {"rating": 0} 2, "Entry 2", {"rating": "100"} 3, "Entry 3", {"rating": ":("}
If you try and query data that looks like this with aggregation functions, you'll get all sorts of casting errors.
Not a massive deal though: Postgres has pretty robust regex support. The following example will match strings which Postgres can cast to numeric, which includes integers and floats with optional leading/trailing dots and zeros – i.e. 1, 0.1, .1 and 1. will all match.
SELECT AVG((custom->>'rating')::NUMERIC) FROM entries WHERE (custom->>'rating')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 50
What this means is that the non-numeric rows are simply ignored and don't form part of the aggregation, which is exactly what I wanted: incorrectly entered data is not aggregated.
If you want to include those rows and count them as zeros, you might want to move this check into a CASE expression instead, and aggregate that.
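A minimal sketch of that CASE approach, reusing the same regex – anything that doesn't look numeric (including rows where the property is missing) gets counted as zero:
SELECT AVG(
  CASE
    WHEN custom->>'rating' ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
    THEN (custom->>'rating')::NUMERIC
    ELSE 0  -- non-numeric, missing or NULL values count as zero
  END
) FROM entries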
In addition to handling uncastable values, this also filters out rows that omit the property or have mismatching JSON types, e.g. booleans – which means your data can look like this (and the same query still works):
1, "Entry 1", {"rating": 0} 2, "Entry 2", {"rating": "100"} 3, "Entry 3", {"rating": ":("} 4, "Entry 4", {"rating": true} 5, "Entry 5", {"other": false} 6, "Entry 6", NULL
=> 50
Any rows (e.g. 4-6) that don't have numeric values are simply ignored for the purposes of aggregation, which is ideal.
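If you want to sanity-check how many rows actually contributed to an aggregate, a FILTER clause (available from Postgres 9.4) against the same regex gives a quick breakdown – something like:
SELECT
  COUNT(*) FILTER (WHERE custom->>'rating' ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$') AS aggregated_rows,
  COUNT(*) AS total_rows
FROM entries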
Aggregating nested properties
Since Postgres has a dedicated syntax for JSON types, nested properties are no big deal at all – just get the operators right:
5, "Entry 5", {"ratings": {"score": 0}} 6, "Entry 6", {"ratings": {"score": "100"}} 7, "Entry 7", {"ratings": {"score": ":("}} 8, "Entry 8", {"ratings": {"score": true}}
SELECT AVG((custom->'ratings'->>'score')::NUMERIC) FROM entries WHERE (custom->'ratings'->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 50
Aggregating values within arrays
Arrays are supported too:
5, "Entry 5", {"ratings": [0, 1000]} 6, "Entry 6", {"ratings": ["100", 2000]} 7, "Entry 7", {"ratings": [":("]} 8, "Entry 8", {"ratings": [true]}
SELECT AVG((custom->'ratings'->>0)::NUMERIC) FROM entries WHERE (custom->'ratings'->>0)::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 50
SELECT AVG((custom->'ratings'->>1)::NUMERIC) FROM entries WHERE (custom->'ratings'->>1)::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 1500
Aggregating values within objects within arrays
5, "Entry 5", {"ratings": [{"score": 0}, {"score": 1000}]} 6, "Entry 6", {"ratings": [{"score": "100"), {"score": 2000}}]} 7, "Entry 7", {"ratings": [{"score": ":("}]} 8, "Entry 8", {"ratings": [{"score": true}]}
SELECT AVG((custom->'ratings'->0->>'score')::NUMERIC) FROM entries WHERE (custom->'ratings'->0->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 50
SELECT AVG((custom->'ratings'->1->>'score')::NUMERIC) FROM entries WHERE (custom->'ratings'->1->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
=> 1500
Listing the keys in use
As part of the work I was doing, I needed to be able to identify which keys had been used, so that users could select the properties they'd stored via a UI in order to build out aggregations.
Selecting a list of top-level keys is pretty trivial – there's a basic function for that:
SELECT JSON_OBJECT_KEYS(custom) FROM entries
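In practice you'll probably want to dedupe the keys and skip any rows whose custom value isn't an object at all, since JSON_OBJECT_KEYS raises an error on scalars and arrays – something like:
SELECT DISTINCT JSON_OBJECT_KEYS(custom) AS key
FROM entries
WHERE JSON_TYPEOF(custom) = 'object'
ORDER BY key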
However, doing this recursively to get nested keys is significantly more troublesome:
WITH RECURSIVE doc_key_and_value_recursive(key, value) AS (
  SELECT t.key, t.value
  FROM entries, JSON_EACH(entries.custom) AS t
  -- WHERE something = 123 -- apply filtering here
  UNION ALL
  SELECT CONCAT(doc_key_and_value_recursive.key, '.', t.key), t.value
  FROM doc_key_and_value_recursive,
       JSON_EACH(
         CASE
           WHEN JSON_TYPEOF(doc_key_and_value_recursive.value) <> 'object' THEN '{}'::JSON
           ELSE doc_key_and_value_recursive.value
         END
       ) AS t
)
SELECT DISTINCT key
FROM doc_key_and_value_recursive
WHERE JSON_TYPEOF(doc_key_and_value_recursive.value) NOT IN ('object')
ORDER BY key
– http://stackoverflow.com/a/30133181/160821
What this returns is a list of dot delimited paths for retrieving a property, e.g. custom.ratings.score – similar to how you'd access it if the JSON was loaded into JavaScript.
Surprisingly, this executes in a relatively reasonable timeframe. When testing against half a million rows I found it took around 20s – but given I only ever needed to query a fraction of that dataset (a few hundred or thousand rows), the query time was negligible.
Array support (i.e. keys of objects within arrays) is feasible with enough tinkering too. Note that this version calls JSON_ARRAY_ELEMENTS inside a CASE expression, which Postgres 10 and later no longer allow, so it will need restructuring on newer versions:
WITH RECURSIVE doc_key_and_value_recursive(key, value) AS (
  SELECT
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN CONCAT(t.key, '[]') ELSE t.key END AS key,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_ELEMENTS(t.value) ELSE t.value END AS value,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_LENGTH(t.value) ELSE NULL END AS i
  FROM entries, JSON_EACH(entries.custom) AS t
  -- WHERE something = 123 -- apply filtering here
  UNION ALL
  SELECT
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN CONCAT(doc_key_and_value_recursive.key, '.', t.key, '[]') ELSE CONCAT(doc_key_and_value_recursive.key, '.', t.key) END AS key,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_ELEMENTS(t.value) ELSE t.value END AS value,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_LENGTH(t.value) ELSE NULL END AS i
  FROM doc_key_and_value_recursive,
       JSON_EACH(
         CASE
           WHEN JSON_TYPEOF(doc_key_and_value_recursive.value) <> 'object' THEN '{}'::JSON
           ELSE doc_key_and_value_recursive.value
         END
       ) AS t
)
SELECT key, MAX(i)
FROM doc_key_and_value_recursive
GROUP BY key
ORDER BY key
What this returns is a little (or a lot) weird: it's a similar dot delimited notation describing paths to fields, with [] slotted in wherever an array was found, i.e. custom.ratings[].score, along with a max column that describes the largest number of items found in that array for any record.
Whilst this is definitely not what I wanted, I could process it into what I did want, i.e. I could infer that a result of custom.ratings[].score, 2 meant that both custom.ratings.0.score and custom.ratings.1.score were possible, and use that to build out a list of keys for the user to select. Crazy, I know.
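For what it's worth, once you know the maximum array length you can sketch that expansion in SQL too – e.g. assuming custom.ratings[].score came back with a max of 2, GENERATE_SERIES can cover indexes 0 and 1 against the same entries table:
SELECT idx,
       AVG((custom->'ratings'->idx->>'score')::NUMERIC) AS avg_score
FROM entries, GENERATE_SERIES(0, 1) AS idx
WHERE (custom->'ratings'->idx->>'score') ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'
GROUP BY idx
ORDER BY idx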
Hopefully this will help you get started aggregating your JSON data in Postgres. Whilst some (most) of these queries are pretty lengthy, Postgres does a pretty good job of handling them on small-to-medium-sized datasets, and I expect they'll scale pretty far into larger ones too.
Best of luck!