Our journey from Graph Databases to PostgreSQL

Chroma, a project we have been actively developing for over two years, is a platform for collectors. It allows detailed cataloguing and categorization of what we call "figures": photographic records of any collectible, with visual and textual indexing for search and retrieval.

Each category in Chroma represents the whole child tree underneath itself, and category cover photos are calculated dynamically based on your current search and browsing context. If you are browsing the categories of a specific user or place, we show the latest photo of that category for that user or place. On top of this, there are additional requirements such as custom collections and private figures or users. All of this requires a tremendously complicated dynamic calculation system that needs to operate at sub-300-millisecond speeds.

How did we accomplish this?

Due to the hierarchical and document-oriented structure of the platform, we initially architected everything on MongoDB for a few reasons:

  • Easy to use, implement and adapt to your project
  • Document-oriented and easy to scale
  • Its JSON-like data structure was a natural fit for the highly detailed data of categories and photos

However, since we used MongoDB as our persistence store, recursive calculations such as the contextual photo count or the contextual last photo of each category became really hard to scale as our figure count grew from hundreds to tens of thousands. After some detailed research, we decided to give graph databases a try.

Let's use Neo4j

Neo4j is one of the most popular graph database options. It's very easy to learn, implement, and use. It also provides a beautiful admin panel and a query language called Cypher.

MongoDB was doing its job quite well for simple querying and storing figure data as documents, but the aggregation and recursive queries were spiraling out of control.

So we continued to use MongoDB as our core database and started synchronizing some of the data with Neo4j externally. We ran graph-related queries through Neo4j, got the identifiers back as results, and serialized the actual records by querying MongoDB.

It was fascinating; we were truly impressed by the ease of querying and the usability. Some of the queries looked like this:

Give me all the subcategories under 'Food and Drink'
MATCH (category:Category{id:"5660701dfd89786353e42a37"})
MATCH (category)-[:CHILD_OF*0..50]->(subcategories:Category)
WITH DISTINCT subcategories
RETURN subcategories.id

It was really easy to traverse the branches of a category tree connected by the same relationship type just by using [:CHILD_OF*0..50].

As you can see, Cypher querying is highly declarative:

  • Get the category
  • Get the recursive children of that category
  • Distinct
  • Return ids

Last photo of Beverages
MATCH (category:Category{id:"5660701dfd89786353e42a37"})
MATCH (category)-[:CHILD_OF*0..50]->(subcategories:Category)
WITH DISTINCT subcategories
MATCH (subcategories)<-[:PHOTO_OF]-(photo)
WITH DISTINCT photo
ORDER BY photo.created_at
RETURN last(collect(photo.id))

  • Match category Beverages
  • Get subcategories
  • Distinct
  • Get photos of all the recursive subcategories
  • Distinct photos
  • Order by creation date
  • Return the last photo id

This is quite impressive when you compare it with other query languages.

Count of photos of user 'Andre Poremski' under Continent 'North America'
MATCH (user:User{id: "51e52cfc389a5e35fa403436"})
MATCH (place:Place{id: "5571b6a691201b05075468f8"})
MATCH (place)-[:CHILD_OF*0..50]->(subplaces:Place)<-[:TAKEN_AT]-(photos:Photo)
WITH DISTINCT photos, user
MATCH (user)<-[:OWNER]-(photos)
RETURN count(photos)

Here be dragons!

This was working quite well in the beginning... until it wasn't.

  • Application logic was getting more complicated every day. Neo4j is great, but it gets slower with complicated queries and increased data scale
  • Maintaining two databases sucks; we kept running into sync issues between Neo4j and MongoDB :(
  • New query requirements, such as private users and private photos that only owners and superusers are able to see, started breaking the system; we couldn't cache anything because everything was calculated dynamically
  • Using different databases for graph queries and serialization queries increased our API response times substantially
  • As our database grew, Neo4j started getting slower and slower with no clear path for scaling or optimization

After some soul searching and a few sleepless nights, we decided to go conventional.

Let's use PostgreSQL

Going back to a relational database took some getting used to. First we tried implementing "on the fly" queries like the ones we had in Neo4j, but this was not as easy as writing Cypher.

Recursive WITH queries
CREATE OR REPLACE FUNCTION recursive_cats(text[])
RETURNS TABLE(id bigint, mongo_id text, name text, path text) AS $$
WITH RECURSIVE first_level_elements AS (
  -- Anchor: the categories whose mongo_ids were passed in
  SELECT c.id, c.mongo_id, c.name, c.mongo_id AS path
  FROM categories c
  WHERE c.mongo_id = ANY($1)
  UNION
  -- Recursive step: pull in the children of everything found so far,
  -- extending the ancestry path with each child's mongo_id
  SELECT e.id, e.mongo_id, e.name, fle.path || '|' || e.mongo_id
  FROM categories e
  JOIN first_level_elements fle ON e.mongo_parent_id = fle.mongo_id
)
SELECT * FROM first_level_elements
$$ LANGUAGE SQL;
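
Calling the function with a category's Mongo identifier returns its whole subtree. A quick sketch, reusing the id from the Cypher examples above:

SELECT * FROM recursive_cats(ARRAY['5660701dfd89786353e42a37']);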

Denormalization to the rescue

We were spending too much time writing complicated queries, and they were hard to maintain and improve as business demands changed. PostgreSQL is great for writing aggregation functions and complex queries, but running these recursive queries on the fly had real performance costs.

Category hierarchies did not change too often, and our data was unidirectional: from parent categories to child categories to figures. So we decided to denormalize and store each category's ancestry as a concatenated string identifier such as parent1|parent2|parent3|.
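
As a minimal sketch of what this looks like (the table and column names here are our illustration, not the exact production schema), each category row carries the full chain of its ancestors, and every photo inherits the path of its category:

-- Each category and each photo stores its precalculated ancestry path
ALTER TABLE categories ADD COLUMN precalcpath text;
ALTER TABLE photos ADD COLUMN precalcpath text;

-- For a tree root > drinks > beverages, the paths look like:
--   root        root|
--   drinks      root|drinks|
--   beverages   root|drinks|beverages|

-- Photos inherit the path of their category
-- (category_mongo_id is a hypothetical foreign key for this sketch)
UPDATE photos p
SET precalcpath = c.precalcpath
FROM categories c
WHERE p.category_mongo_id = c.mongo_id;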

Using this method, the count query becomes:

Recursive photo count under Beverages
SELECT count(*) AS figure_count
FROM photos p
WHERE p.precalcpath LIKE '<beverages id>%'

This approach allowed us to answer recursive questions with simple queries, and thanks to the denormalized, precalculated paths we no longer needed Neo4j, MongoDB, or any other database besides PostgreSQL. Since PostgreSQL can use indexes for left-anchored LIKE queries, this works quite fast compared with the old MongoDB and Neo4j stack.
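
One detail worth noting: a plain B-tree index only serves left-anchored LIKE patterns under the C collation, so in a typical setup the index would be created with the text_pattern_ops operator class. A sketch (the index name is just for illustration):

-- Lets the planner use the index for precalcpath LIKE 'prefix%' queries
CREATE INDEX photos_precalcpath_idx ON photos (precalcpath text_pattern_ops);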

Conclusion

I still think it's easier and faster to write queries using Neo4j and Cypher. Neo4j is quite promising, and I am a huge fan. It's a rock-solid graph database, and they are noticeably improving its performance with every release. If they keep improving its data-storage features and complex-query performance, it may become a primary database for lots of projects.

That said, the lesson we learned was never to underestimate the power of conventional databases, specifically PostgreSQL. For aggregation, querying, and performance, it has proven to be the most robust solution for us.