The Big Question of Big Data Schema

Every successive generation of technology promises the same thing: a better version of what came before. Schema-on-read is one such development; it emerged when schema-on-write could no longer cope with the speed and variety at which big data arrives.

But are all new things better?

First, let’s take a brief look at what each strategy offers.

Schema on Read

Schema-on-read is the data analysis strategy in which a schema, or structure, is applied to the data as it is pulled out of storage. Tools such as Hadoop use this strategy. Increasingly the go-to data storage platform for organizations around the world, Hadoop relies on schema-on-read, which means you only need to worry about preparing a schema when it is time to read the data. Structured or unstructured, the data can sit untouched until you actually need to use it. You can even derive more than one view of the same data through schema-on-read.
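To make the idea concrete, here is a minimal sketch in plain Python (the log format, field names, and `read_with_schema` function are illustrative assumptions, not part of any particular Hadoop tool): the raw lines are stored untouched, and structure is imposed only at read time.

```python
# Schema-on-read sketch: raw lines are stored exactly as they arrived;
# a schema is applied only when the data is read back.
RAW_LOG = [
    "2024-01-15T09:30:00|alice|call|182",
    "2024-01-15T09:31:10|bob|sms|0",
]

def read_with_schema(lines):
    """Split each raw line into named, typed fields at read time.
    The storage layer itself knows nothing about this structure."""
    for line in lines:
        ts, user, kind, duration = line.split("|")
        yield {"timestamp": ts, "user": user,
               "type": kind, "duration_secs": int(duration)}

for record in read_with_schema(RAW_LOG):
    print(record)
```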

Schema on Write

Schema-on-write is the traditional cornerstone of relational databases and has been in practice for decades. You first define a schema, then write the data into it; the output conforms to the structure you defined. Because the structure is so well defined and data consistency is easy to maintain, this approach has not fallen out of favour even though its newer cousin, schema-on-read, has proved immensely beneficial for new-generation technology.
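By contrast, a relational store demands the structure up front. A minimal sketch using Python's built-in sqlite3 module (the table and column names are illustrative): the schema is declared before any data is written, and rows that violate it are rejected at write time.

```python
import sqlite3

# Schema-on-write sketch: the structure exists before any data
# arrives, and the database enforces it on every insert.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        call_id   INTEGER PRIMARY KEY,
        user      TEXT    NOT NULL,
        duration  INTEGER NOT NULL
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO calls (user, duration) VALUES (?, ?)",
             ("alice", 182))

# A row that breaks the declared schema is rejected at write time.
try:
    conn.execute("INSERT INTO calls (user, duration) VALUES (?, ?)",
                 (None, 95))
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # NOT NULL constraint failed: calls.user
```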

Schema-on-read allows better utilization of data lakes. You can accumulate as much data as you want, leave it unorganized, and extract the valuable parts by preparing a schema later on. Schema-on-write does not allow such freedom: unorganized data may have to be discarded, and there is a firm restriction on what can be stored.

Though schema-on-read is an attractive option, it is equally important to bear in mind that a structure must ultimately be imposed if any value is to be derived from a heap of unorganized data. Moreover, with schema-on-write, the chances of errors are minimal if the ETL and validation processes have done their jobs well.

There are mission-critical applications where time is of the essence, and where knowing the structure of the data, and how and where it accelerates analysis and reporting, is crucial. There is also high-latency data that stays largely the same every time it is used. In such cases, only schema-on-write will do. Many business decisions depend on such a strong structural framework.

On the other hand, several niche players thrive on the latest, greatest operational data store: Hadoop. Consider a call centre log that is being queried through SQL by several users at once. Each user works on the same log for a different purpose: one searches the rows for a specific call, another runs a columnar analysis, and so on. This flexibility maximizes each user's individual query performance, as the sketch below illustrates.
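A hedged sketch of that flexibility, continuing the plain-Python log format assumed earlier: two consumers read the same raw lines, one filtering rows for a specific caller and the other running a column-style aggregate, each applying only the schema it needs.

```python
# Two read-time views over the same raw log (format assumed above):
# neither view required the data to be restructured in storage.
RAW_LOG = [
    "2024-01-15T09:30:00|alice|call|182",
    "2024-01-15T09:31:10|bob|sms|0",
    "2024-01-15T09:32:45|alice|call|64",
]

# View 1: row-oriented search for a specific caller.
alice_calls = [line for line in RAW_LOG if line.split("|")[1] == "alice"]

# View 2: column-oriented analysis -- total call duration, ignoring
# every field except the one this query cares about.
total_duration = sum(int(line.split("|")[3]) for line in RAW_LOG)

print(alice_calls)
print("total seconds:", total_duration)
```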

In this discussion, one thing is for sure: newer technology paves the way for varied operations, while older technology offers a stable, tried-and-tested methodology for getting things done.

What is your take on this? Share your opinions in the comments section.
