Types of Databases to store your unstructured data

I believe most popular databases (SQL or NoSQL) are capable of handling data for almost any type of application, but are they efficient enough for the use case? It is very important to find the appropriate database for the system, because the use cases are often vague or unknown at the start. If you find your system struggling later, migrating to another database or adding one on top of it (such as Redis) is time-consuming, costly and risky.

What is the use case?

Being part of the SuperBot team, I can say it is designed in such a way that every module can have its own technology stack if necessary, and it was time to add another module: the Databank.

What is the Databank?

The databank is a collection of millions of contact records, and the structure of a contact is completely dynamic. A contact can have one or more pieces of information, such as a phone number, name, location, etc.

On successfully depositing the data in the databank, the user can create one or more datasets from it by applying the desired filters on contacts. A dataset is a subset of the databank which can later be used to start a SuperBot campaign.

Requirements?

As is evident from the use case, we need a database system that can store a fully dynamic databank and its datasets with good read, write and search (including wildcard search) performance.

We had to choose between two types of database systems: relational databases and document (NoSQL) databases.

  1. Relational Databases:-
    Most relational databases enforce strict schema rules, and creating columns dynamically is not possible. The solution left, then, is to store the contact data in a JSON column. But if your database does not provide enough functions to read and write JSON column data, you may have to fetch the whole value, manipulate it in the application, and write it back to the database. This is not an efficient solution if your read/write requests are frequent.

    1. MySQL-
      MySQL supports the JSON data type and internally converts JSON column data into a binary format, which allows quick read access to sub-objects without parsing the whole value.
      Updating JSON data is possible through various MySQL JSON functions such as JSON_SET(), JSON_INSERT() and JSON_REPLACE().
      MySQL does not support direct indexing of JSON columns, so searching is very slow compared to databases that do support JSON indexing. We can, however, index a JSON path expression by creating a virtual (generated) column mapped to that path and indexing it. This makes searching fast, but in my opinion the approach does not scale to arbitrary JSON paths.
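For illustration, the virtual-column workaround described above might look like this. This is only a sketch; the table and column names (contacts, data, phone) are hypothetical, not SuperBot's actual schema.

```python
# Illustrative only: a hypothetical "contacts" table with a dynamic JSON column.
# The generated (virtual) column extracts one JSON path so it can be indexed.
mysql_ddl = """
CREATE TABLE contacts (
    id    BIGINT AUTO_INCREMENT PRIMARY KEY,
    data  JSON,
    phone VARCHAR(20) GENERATED ALWAYS AS (data->>'$.phone') VIRTUAL,
    INDEX idx_phone (phone)
);
"""

# Partial in-place update with JSON_SET(): no need to fetch and
# rewrite the whole JSON value in the application.
mysql_update = (
    "UPDATE contacts "
    "SET data = JSON_SET(data, '$.location', 'Jaipur') "
    "WHERE phone = '9999999999';"
)
```

Note that every JSON path you want to search needs its own generated column and index, which is exactly why this does not scale to arbitrary, unknown paths.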
    2. PostgreSQL-
      Who says you can’t teach an old dog new tricks? PostgreSQL 9.4 (2014) added support for the JSONB data type, where JSONB stands for JSON binary. PostgreSQL supports GIN (Generalized Inverted Index) indexes on JSONB columns, which makes searching and updating partial data faster. PostgreSQL also provides various JSON and JSONB operators to select and update JSON data.
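A minimal sketch of the same idea in PostgreSQL, assuming the same hypothetical contacts table: a single GIN index covers the whole JSONB document, so arbitrary keys can be searched without per-path columns.

```python
# Illustrative only: hypothetical "contacts" table with a JSONB column
# and a GIN index over the whole document.
pg_ddl = """
CREATE TABLE contacts (
    id   BIGSERIAL PRIMARY KEY,
    data JSONB
);
CREATE INDEX idx_contacts_data ON contacts USING GIN (data);
"""

# The @> containment operator can be served by the GIN index.
pg_query = """SELECT id FROM contacts WHERE data @> '{"location": "Jaipur"}';"""

# jsonb_set() updates one path without rewriting the rest of the document.
pg_update = """UPDATE contacts SET data = jsonb_set(data, '{location}', '"Delhi"') WHERE id = 1;"""
```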
  2. NoSQL Databases-
    NoSQL databases are schema-free and also support indexing, so we don’t need to use a JSON data type here. At SuperBot, we use these 3 NoSQL databases-

    1. MongoDB
    2. ElasticSearch
    3. Redis

    Redis is an in-memory database, so it is not a good fit for storing millions of contact records and we are not going to use it in this case.

    1. MongoDB-
      MongoDB is a document-oriented database system. It natively stores data in a JSON-like structure (BSON) and provides a flexible schema. The major benefit of using MongoDB here is wildcard indexing, introduced in MongoDB 4.2. With this feature we can index the dynamic nested fields of a document without knowing their names in advance.
      Programmatically creating a wildcard index is pretty easy. For example, assume a collection has a user_data object field whose nested key-value pairs are unknown in advance:
      db.collection.createIndex({"user_data.$**": 1})
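In a driver such as pymongo, the same wildcard index and a query it can serve might look like this. The collection and field values here are hypothetical examples, not SuperBot's production code.

```python
# Index specification equivalent to the shell command above.
wildcard_index = [("user_data.$**", 1)]

# Any filter on a nested field under user_data can be served by the index,
# even though this field name was unknown when the index was created.
query_filter = {"user_data.location": "Jaipur"}

# With pymongo (not imported here), creation and querying would be:
#   db["contacts"].create_index(wildcard_index)
#   db["contacts"].find(query_filter)
```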
    2. ElasticSearch
      ElasticSearch is a RESTful search and analytics engine, initially released in 2010. It is also a NoSQL store, runs on the JVM, and is mostly used to power search engines.
      Benchmarking-
      All tests were done with 1 million rows and 3 levels of nested data; the results are in seconds.
      a) Searching text on second level of nested field

      MySQL MongoDB ElasticSearch
      56.919 30.025 36.286
      53.548 28.237 36.168
      36.752 26.0296 38.547

      b) Searching integer on second level of nested field

      MySQL MongoDB ElasticSearch
      1742.802 361.322 14.661
      688.346 350.108 11.364
      657.467 376.343 12.193

      c) Searching word on second level of nested field

      MySQL MongoDB ElasticSearch
      38.576 29.442 34.646
      41.664 31.106 41.039
      47.129 29.967 47.637

      As we can clearly see from the tables above, MySQL takes much longer than MongoDB and ElasticSearch. ElasticSearch is close to MongoDB in two cases and wins in one, but on average it is slower than MongoDB. It also does not fit our use case, because ElasticSearch is a text search engine rather than a general-purpose database system.

PostgreSQL VS MongoDB

All tests were done with 5 million rows and 2 levels of nested data; the results are in seconds. We used the GIN index in PostgreSQL and wildcard indexing in MongoDB.

a) Searching text on the second level of nested field

PostgreSQL MongoDB
2.304 2.924
1.540 2.434
1.563 2.368

b) Searching integer on second level of nested field

PostgreSQL MongoDB
0.044 0.079
0.043 0.065
0.047 0.067

c) Searching word on second level of nested field

PostgreSQL MongoDB
1.520 1.704
1.581 1.695
1.622 1.755

As the tables show, PostgreSQL is slightly faster than MongoDB at querying over JSON data, but we wanted to store every databank in its own collection, and creating collections dynamically in MongoDB is much easier than creating tables dynamically in any SQL database. MongoDB combines efficiency with ease of use and speed, so we decided to go with MongoDB.
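A sketch of the per-databank collection idea. The naming scheme and the pymongo calls shown in comments are assumptions for illustration, not the production code.

```python
def databank_collection_name(databank_id: int) -> str:
    # Hypothetical naming scheme: one MongoDB collection per databank.
    return f"databank_{databank_id}"

# With pymongo, a new databank's collection springs into existence on first
# insert; we would only need to create its wildcard index explicitly:
#   coll = db[databank_collection_name(42)]
#   coll.create_index([("user_data.$**", 1)])
```

In a SQL database, the equivalent would be issuing DDL (CREATE TABLE plus indexes) at runtime for every new databank, which is far more ceremony for the same result.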



By: Surendra Kumar

Software Development Executive
PinnacleWorks Infotech (P) Ltd

 
