Big Data Essentials
Stepping into a new field of technology can be painstaking, especially when there are new additions every single day! To make your life easier, I am going to help you get familiar with the current situation in the Big Data world. These are some of the concepts that set the stage for other areas like Predictive Analytics, Machine Learning and Deep Learning.
What is Big Data?
After reading a dozen definitions, it is clear that Big Data is a nebulous concept. For better understanding, I would break it down this way: Big Data is a large structured or unstructured dataset (produced by disparate devices, hence the differing formats) that needs to be captured and stored, broken down by complex processes, and analysed so that it can be used to make future decisions that add value to the business.
Where is Big Data used?
Ever wondered how you watched The Imitation Game and Netflix recommended you watch Sherlock? How Sephora sends you the latest products from your favorite beauty brand?
How credit card companies catch a fraudulent transaction among the thousands of transactions made every minute? Or how your activity tracker commends you on burning those 300 calories? These are merely a few examples of Big Data we experience in our lives.
I am sure once you go over the applications of Big Data, you will begin to take note of how useful it is in all arenas of life 🙂
The definition above already touches on three areas that are hot right now. Investigating each of them will give you a head start.
#1 Self Service Analytics
The Business Intelligence market has grown steadily over the last couple of years. With more business users trying to get their hands on it, data visualization tools are being simplified to give business leaders more power to query data and generate their own reports.
Why, you might wonder?
With industries changing every minute, the need to have answers at your fingertips is greater than ever. Business users don’t have the time to wait for IT teams to flesh out a new report for each immediate requirement.
Some of the common features of today’s self-service analytics tools are ease of use (like dragging and dropping attributes to build a chart), connections to both on-premise and cloud sources, rapid computation, and mobile-friendly design.
According to Gartner’s 2017 report, the leaders in self-service tools are Tableau, Microsoft and Qlik.
Tableau has been one of the best visual analytics tools for the last few years, with a clean UI and easy-to-create dashboards. The flexibility of playing with the UI is its biggest plus, and this is what executives get excited about seeing too. Its latest additions include a subscription feature that lets you share dashboards with your team in just a click.
Power BI has a slick-looking UI, and with its extensive customization you can create great-looking dashboards. It also integrates easily with other tools in the Microsoft stack, and it is noted for the vast number of source systems it can connect to.
Tip: Power BI is free for general use, so it can be a great tool to analyse social media insights for your blog!
QlikView may not be well known for its visuals, but it has some very strong capabilities: it stands out with its in-memory engine, which analyzes data much faster. It also has extensive API integrations that are hard to find in other tools.
#2 Real-Time Processing
Typically, data is processed in batches: high volumes of data are collected over a period of time and fed into data warehousing tools for analysis. In contrast, when data is processed at short intervals, every few minutes or even seconds, it is referred to as real-time data processing.
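To make the difference concrete, here is a minimal sketch in plain Python, not tied to any particular tool. The event data, the function names and the 60-second window are all assumptions made up for illustration. The batch version waits for the full dataset before aggregating, while the streaming version emits a result at the end of every short window.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_seconds, account_id, amount)
EVENTS = [
    (0, "acct-1", 20.0),
    (15, "acct-2", 5.0),
    (70, "acct-1", 30.0),
    (95, "acct-2", 12.5),
    (130, "acct-1", 7.5),
]


def batch_totals(events):
    """Batch style: wait for the full dataset, then aggregate once."""
    totals = defaultdict(float)
    for _, account, amount in events:
        totals[account] += amount
    return dict(totals)


def streaming_totals(events, window_seconds=60):
    """Stream style: yield (window_start, totals) as each short window closes.

    Assumes events arrive in timestamp order.
    """
    window_start = None
    totals = defaultdict(float)
    for timestamp, account, amount in events:
        if window_start is None:
            # Align the first window to a multiple of window_seconds
            window_start = timestamp - timestamp % window_seconds
        # Close every window that ended before this event arrived
        while timestamp >= window_start + window_seconds:
            yield window_start, dict(totals)
            totals = defaultdict(float)
            window_start += window_seconds
        totals[account] += amount
    if window_start is not None:
        # Flush the last, still-open window
        yield window_start, dict(totals)


if __name__ == "__main__":
    print("batch:", batch_totals(EVENTS))
    for start, totals in streaming_totals(EVENTS):
        print("window starting at", start, "->", totals)
```

With the batch function you get one answer after all the data is in. With the streaming function you get an answer per window, which is what lets a business react within a minute instead of waiting for the nightly load.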
Log monitoring, fraud detection and inventory management are a few instances where decisions need to be made immediately, and this is where real-time data processing is used most.
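As a rough illustration of the fraud detection case, the sketch below flags a card that makes too many transactions within a short trailing window. Real fraud systems are far more sophisticated; the function name, the threshold of 3 and the 60-second window are hypothetical values chosen only for this example.

```python
from collections import defaultdict, deque


def flag_bursts(transactions, max_count=3, window_seconds=60):
    """Yield (timestamp, card) when a card exceeds max_count transactions
    within the trailing window_seconds.

    Assumes transactions are (timestamp_seconds, card_id) pairs that
    arrive in timestamp order.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card in transactions:
        times = recent[card]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the trailing window
        while times[0] <= timestamp - window_seconds:
            times.popleft()
        if len(times) > max_count:
            yield timestamp, card


if __name__ == "__main__":
    sample = [(0, "A"), (10, "A"), (20, "B"), (25, "A"), (30, "A"), (200, "A")]
    for timestamp, card in flag_bursts(sample):
        print("suspicious burst on card", card, "at second", timestamp)
```

The point of the design is that each transaction is judged the moment it arrives, using only a small amount of recent state per card, rather than after a batch job has run.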
Such systems need low-latency servers so that data can move through quickly, a highly scalable cluster to handle growth in the volume of streaming data and, most of the time, the ability to provide real-time analytics.
To compute on data that arrives as streams, we need frameworks and processing engines. The most significant one is Apache Spark, which is a batch processing framework with streaming capabilities. You can also take a look at Apache Flink, Apache Samza and Apache Storm to understand how stream processing works.
#3 Data Lake
Traditionally, when data from multiple channels such as computers, mobile devices and television arrives in different formats and needs to be analyzed, it first has to be transformed into a standard format. Let’s take a look at how a Data Lake makes the whole process easier and more economical.
A data lake is basically a storage unit that holds data in its native format, structured or unstructured, on a platform that supports Hadoop. With this solution, organizations that have large volumes of unstructured data don’t have to worry about building a data warehouse at the beginning.
Storing data in a data lake can bring down costs substantially. Since all the data is stored as-is in a central repository, on-demand analyses and queries can be run against it when required (which means the schema is defined when the data is read). This saves the cost of continuously processing data and storing it in a large data warehouse.
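The idea of defining the schema only when the data is read can be sketched in a few lines of Python. This is an illustration under assumptions, not how any specific data lake product works: the raw records, the field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw records kept exactly as the devices produced them. Note the
# inconsistent types and the missing or extra fields.
RAW_LINES = [
    '{"device": "mobile", "user": "u1", "watch_minutes": "42"}',
    '{"device": "tv", "user": "u2", "watch_minutes": 17, "channel": "news"}',
    '{"device": "computer", "user": "u3"}',
]

# Hypothetical schema chosen by the analyst at query time:
# field name -> (type to cast to, default when the field is missing)
VIEWING_SCHEMA = {"user": (str, ""), "watch_minutes": (int, 0)}


def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored data is never rewritten."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record.get(field, default))
            for field, (cast, default) in schema.items()
        }


if __name__ == "__main__":
    rows = list(read_with_schema(RAW_LINES, VIEWING_SCHEMA))
    print(rows)
    print("total minutes watched:", sum(row["watch_minutes"] for row in rows))
```

A different question tomorrow can use a different schema over the same raw lines, which is why nothing has to be transformed up front.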
Ad hoc predictive and real-time analysis can be run against the data to quickly discover new insights without waiting, providing quicker time to value.
Amazon S3 and Microsoft’s Azure Data Lake are widely used today. Getting your hands on these services will give you a better understanding of their uses and capabilities.
An important recent shift is that services are moving from on-premise to cloud platforms. With big players like Amazon Web Services, Microsoft Azure and Google Cloud offering infrastructure that is cost-efficient, safe, reliable, and easy to build and maintain, the use of on-premise services is slowly fading.
As a closing note, I would like to say that Big Data is vast and consists of many components. The traditional methods (on-premise data warehouses, data analysis and reporting tools) are being restructured to match today’s demands. The areas mentioned above are only a few of those currently in the limelight.