Everybody in technology loves talking about Big Data because it is one of the newest tech terms. Innovative companies want to make sure they use the latest technologies. Other companies have use cases that would fit perfectly but might not understand Big Data well enough to know whether it applies. Big Data shouldn't be intimidating; it's just the next step in a progression of technologies we know and love. Typically, Big Data refers to the storage and analysis of large amounts of data. It also refers to the movement and real-time analysis of data. In this blog post we will focus on the analysis and storage capabilities.
We are generating more data that companies can leverage than ever before. As storage becomes cheaper and easier to access, companies can store more information about a consumer. Not all of the data collected will be structured; uploaded documents and call transcripts are examples of unstructured data. The trick now is how to turn all that data into meaningful information. Traditional solutions, such as querying a relational database with SQL, are no longer efficient at this scale. A traditional SQL database is typically not partitioned and is installed on one server, which limits it to that server's finite resources. As the data grows, that constraint slows queries down. And while SQL is great for querying structured data, it is a poor fit for unstructured data.
Big Data solutions build on these traditional solutions by first allowing them to scale horizontally and not just vertically. Traditional solutions require companies to upgrade or replace hardware to get more RAM or computing power in order to process more data. Big Data solutions work on clusters of computers, which allows companies to keep adding servers to the cluster. Adding resources is typically cheaper than upgrading them. A SQL database is an example of a traditional datastore. If the SQL database is running out of storage, more can be added by replacing the server's hard drives with larger ones. In the Big Data world, data is distributed across multiple servers, so storage capacity grows by adding a new server. I think of this like cooking more items than there is room for in an oven. Traditional solutions are comparable to getting more cooking space by adding more racks or replacing the oven with a larger one. Big Data solutions are like keeping the same oven and getting a second oven to use as well, so there is more space for cooking.
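To make the "add another oven" idea concrete, here is a minimal sketch of how records might be spread across servers. The server names, the customer keys, and the hash-and-modulo placement rule are all assumptions made up for illustration; real Big Data systems use their own placement schemes.

```python
import hashlib

def pick_server(key, servers):
    """Pick a server for a key using a stable hash (a hypothetical placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def distribute(keys, servers):
    """Group keys by the server they would be stored on."""
    placement = {server: [] for server in servers}
    for key in keys:
        placement[pick_server(key, servers)].append(key)
    return placement

# Hypothetical record keys and server names.
records = [f"customer-{i}" for i in range(1000)]
two = distribute(records, ["server-1", "server-2"])
three = distribute(records, ["server-1", "server-2", "server-3"])

# Each server holds fewer records once a third server joins the cluster.
for label, placement in (("two servers", two), ("three servers", three)):
    print(label, {server: len(keys) for server, keys in placement.items()})
```

With a third server in the list, each server ends up holding a smaller share of the same 1000 records. Note that this simple modulo rule moves many existing keys when the server list changes, so treat it only as a picture of how capacity is shared, not as a recommended design.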
Big Data solutions also build on traditional solutions by changing how the data gets accessed. As a traditional data store grows, accessing the data typically becomes slower because a query has to scan through all the records. There are ways to beef up machines to process data quicker, but Big Data solves this problem differently. Traditional data stores are row based, which means data is returned on a record-by-record basis. This is a great way of accessing data when the use case involves all the information about a record. However, in some cases you want to examine one particular piece of information across all records. To get that one piece of information, every record has to be read in full, which takes a long time when the data set is large. Big Data solutions approach this by providing data storage options that are column based. This allows users to query all records for one piece of information much quicker, because only the relevant field needs to be accessed rather than every field in every row.
For example, if you want to store information about purchases, each purchase is typically a row and the purchase details are columns. If the data will only be accessed by purchase, then row storage is optimal because all the information about a purchase can easily be returned. However, an accountant does not care about all the details of a purchase, typically only the sale price. For that type of use case column storage is optimal, because all the sale prices can be returned quickly without having to go through every row.
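The purchase example can be sketched in a few lines of code. This is only an in-memory illustration of the two layouts; the field names and values are invented, and a real column store does far more than keep Python lists.

```python
# Row-based layout: each purchase is stored as one complete record.
purchases_by_row = [
    {"purchase_id": 1, "customer": "Ana", "item": "laptop", "sale_price": 899.00},
    {"purchase_id": 2, "customer": "Ben", "item": "mouse", "sale_price": 25.50},
    {"purchase_id": 3, "customer": "Cho", "item": "monitor", "sale_price": 180.25},
]

# Column-based layout: the same data, with each field stored together.
purchases_by_column = {
    "purchase_id": [1, 2, 3],
    "customer": ["Ana", "Ben", "Cho"],
    "item": ["laptop", "mouse", "monitor"],
    "sale_price": [899.00, 25.50, 180.25],
}

def get_purchase(rows, purchase_id):
    """Row storage shines here: one lookup returns every detail of a purchase."""
    for row in rows:
        if row["purchase_id"] == purchase_id:
            return row
    return None

def total_sales_from_rows(rows):
    """Every full record has to be visited just to read one field."""
    return sum(row["sale_price"] for row in rows)

def total_sales_from_columns(columns):
    """Only the sale_price column is touched; the other fields are never read."""
    return sum(columns["sale_price"])

print(get_purchase(purchases_by_row, 2))
print(total_sales_from_rows(purchases_by_row))
print(total_sales_from_columns(purchases_by_column))
```

Both totals come out the same. The difference is in how much data has to be read to get there, which is what matters once there are millions of purchases instead of three.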
Big Data allows the analytics and processes that provide insight into the data to run quicker. Traditional solutions load all the data from the database onto a separate processing server. This results in slow processing because of the delay in moving data across servers. It can also strain resources if all the processing is done on one machine with too much data. Big Data updates the current technologies to allow data to be processed on multiple servers in parallel. Only the final results then need to be consolidated on one server. Many technologies no longer pull all the data to the server that is running the process. Instead, they use the data storage servers themselves to process the data. The cooking analogy from my oven example applies here too.
For a potluck gathering, traditional solutions would dictate that everyone brings their ingredients to one place and cooks everything together. The cooking cannot start until all the ingredients have arrived, and the one kitchen is strained because everyone needs to share its appliances. A Big Data solution is comparable to letting everyone cook in their own kitchen, so the ingredients do not have to travel. This allows things to be cooked in parallel. If multiple people need to combine what they made, the hosting kitchen can still be used. Since less needs to be combined, the host's resources should be enough to handle the smaller amount of cooking.
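Here is a small sketch of the potluck idea in code: each "kitchen" summarizes its own data, and only the small summaries travel to the host to be combined. The server names and sale prices are made up, and a thread pool stands in for separate machines, so this is an illustration of the pattern rather than a real distributed job.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each list stands for the sale prices held on one storage server.
data_by_server = {
    "server-1": [10.0, 20.0, 30.0],
    "server-2": [5.0, 15.0],
    "server-3": [40.0],
}

def summarize_locally(prices):
    """Runs where the data lives and returns only a small partial result."""
    return {"count": len(prices), "total": sum(prices)}

def combine(partials):
    """Runs on the 'host': merges the small partial results into a final answer."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"count": count, "total": total, "average": total / count}

# Threads stand in for separate servers working in parallel.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(summarize_locally, data_by_server.values()))

result = combine(partial_results)
print(result)
```

The host never sees the individual sale prices, only one count and one total per server. That is why the combining step stays cheap even when each server holds a lot of data.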
I hope this helps put Big Data into some perspective. It is not a complete divergence from technologies that have been used for years. It is simply a new concept to update technologies to deal with the changing data landscape.