Author: Nicholas Varner, Big Data Senior Solution Architect
As I mentioned in the “Kicking Off a BI Project – Who is involved?” post, the key personnel of a BI project include a Data Architect, ETL engineers, Business Analysts, Report Developers, Application Developers, Data Governance, and a Project Manager, who all share the common goal of providing stakeholders with the right information at the right time. The goal of this blog post is to get you thinking about the strategies that can provide the best Data Solutions for your company at the best price. The Data Solutions strategies we will detail include Data Integration, Data Storage, Web Analytics, Reporting, Database as a Service (DaaS) solutions (also referred to as Cloud Solutions), and Hadoop.
Many Data Integration tools are available on the market. As you can imagine, some are better than others, and they all have their particular strengths and weaknesses. You can visit Gartner’s website to get an analyst’s opinion of many of the available tools. Where do you get the best bang for your buck? Data Integration solutions can cost hundreds of thousands of dollars each year in licensing and support, so it’s a question that you will most likely have to answer when justifying the data pipeline portion of the project. What do you receive when you purchase a proprietary Data Integration product like Informatica, DataStage, or Ab Initio? You receive a product license and support. However, you will still need to find a Data Architect and ETL engineers who have the experience to design and execute a data model and perform ETL jobs.
Open source products, like Pentaho PDI or Talend, carry no licensing fees, or potentially limited fees depending on usage. Deploying an open source Data Integration solution can save you a lot of money in up-front licensing fees. However, how do you quantify the support that a vendor can offer your business? Support can help you along in the Data Integration process, because problems will arise. Normally, however, your business will require professional services regardless of the tool you select. Support will usually help your Data Solutions team work through various “gotchas”, but it will not architect a custom solution for your business’ needs.
As the open source community continues to evolve, the differences between proprietary products and open source products are becoming less pronounced. The core functionality of the proprietary tools mentioned above is close to that of their open source counterparts. One could argue that if your business is starting from scratch in developing data solutions, it makes little sense to spend any of your budget on proprietary solutions when comparable core functionality is available without licensing fees and support contracts. You will need a Data Architect in either case to design the solution, so you may as well save the money on the licensing up front.
To illustrate, you can buy a suit from Kenneth Cole, Zegna, or Armani. They are all suits of varying styles, fabric, and quality, but none of them is going to fit your body exactly. A good tailor can measure your body, design the perfect fit for you, and then execute the design.
Just throwing a big name at a problem may sound good on the golf course, but it does not guarantee results. In fact, big name vendors may be less inclined to customize a solution for your business. After all, margins are better if you simply sell an out-of-the-box solution.
What if you are currently using a proprietary Data Integration solution and are thinking to yourself, “why am I paying all this money in licensing fees and support when we could be doing this for free?” Costs exist any time you migrate solutions, but you have to decide what is best for your company in the long term. Solutions are always evolving, but businesses will always need to move source data into friendlier data stores that they can report from. Questions for any business considering an alternative to its existing Data Integration solution include: Are the long-term savings from moving from a proprietary solution to an open source one worth the short-term aggravation? Are we prepared to be married to our current Data Integration vendor and their associated costs indefinitely?
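One rough way to frame the savings question is a break-even calculation: how many years of reduced annual costs does it take to recover the one-time cost of migrating? The sketch below is only an illustration. The function name and all three inputs are hypothetical, and the figures you plug in must come from your own quotes and estimates.

```python
def break_even_years(migration_cost, annual_proprietary_cost, annual_open_source_cost):
    """Years needed for annual savings to repay a one-time migration cost.

    All inputs are estimates supplied by the reader (hypothetical names).
    Returns None when the move produces no annual savings at all.
    """
    annual_savings = annual_proprietary_cost - annual_open_source_cost
    if annual_savings <= 0:
        return None  # migration never pays for itself
    return migration_cost / annual_savings


# Placeholder figures for illustration only, not real vendor pricing.
years = break_even_years(
    migration_cost=300_000,
    annual_proprietary_cost=200_000,
    annual_open_source_cost=50_000,
)
print(years)
```

If the result is longer than the horizon you plan for, the short-term aggravation may not be worth it. A result of `None` means the alternative is not actually cheaper to run.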
Along with a Data Integration solution, your business also needs an architecture that allows you to store your current data and future data as your business grows. Historically, data storage has been handled onsite by local relational databases like DB2, Oracle 11g, and so on, using a star schema or snowflake design. For many businesses, these solutions, when designed properly, meet their reporting needs effectively. Of course, Big Data has become the topic of much debate. While some companies have made real use of Big Data, for many others it remains a nice concept in the abstract. Some think that Big Data merely refers to the volume of data, when in fact the variety and velocity of data are often more important to the Big Data equation. For example, you could have a two-column Transactions table with millions of rows, and you would still be able to report on that data effectively using a traditional relational database. However, what if your business wants to start reporting on things like Twitter feeds and other social media sources, Web Analytics, music, videos, images, and so on? A relational database is simply not designed to store and report on unstructured data effectively. Enter NoSQL databases and DaaS.
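To make the contrast between volume and variety concrete, here is a minimal sketch using Python's built-in `sqlite3` and `json` modules. The table name, column names, and sample records are all made up for illustration. It shows that a fixed two-column table is easy to report on, while records whose fields differ from one to the next do not map onto a single set of columns.

```python
import json
import sqlite3

# Hypothetical two-column Transactions table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)

# Fixed columns make aggregate reporting a single query, whatever the row count.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# Hypothetical social media records: each one carries different fields,
# so no single set of columns fits them all without many empty values.
posts = [
    json.loads('{"user": "a", "text": "great article", "hashtags": ["bi"]}'),
    json.loads('{"user": "b", "video_url": "http://example.com/v", "length_s": 42}'),
]
all_fields = sorted({key for post in posts for key in post})
shared_fields = sorted(set(posts[0]) & set(posts[1]))

print(totals)
print(all_fields)
print(shared_fields)
```

Only one field is common to both records, which is the kind of variety that schema-flexible stores are meant to absorb.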
NoSQL is a rebellion against the assertion that relational data stores are the end of history for databases and an exploration of what can be achieved by rejecting some of the core principles of how relational data stores are architected.
NoSQL and DaaS fall into the category of Big Data. Options include MongoDB (open source), Cassandra (open source), Redshift (AWS), and IBM’s Netezza, and each has its own strengths and best practices for storing and finding data. Businesses must weigh the costs and benefits of a proprietary solution with its licensing and support costs. However, as with the Data Integration tools discussed previously, plenty of great open source NoSQL options exist that provide the same ability to store unstructured data without the licensing costs. If your business exceeds certain storage limits, certain storage fees apply. The great news is that these solutions offer no single point of failure, and you can simply add more nodes to your cluster as your application continues to grow. This has huge cost and time saving advantages over constantly needing to move data from one data store to another as your application expands. It doesn’t make a lot of sense for most businesses to build their own cloud when they can leverage one of the great infrastructure solutions available to use right away at a reasonable price. And then there is Hadoop.
What’s so great about Hadoop? Why is it important to your business either in the present or in the future? What’s the difference between Hadoop and a NoSQL data store?
Hadoop is a platform for storing and processing large amounts of data across clusters of machines, with the processing aspect centered on the concept of MapReduce. NoSQL broadly refers to a set of data storage technologies that sacrifice some of the things typically associated with relational databases in order to achieve higher scalability, availability, and flexibility. Hadoop by itself could be considered a NoSQL storage solution, and tools such as HBase, which are implemented on top of the Hadoop platform, are also NoSQL data stores.
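The MapReduce idea itself is small enough to sketch in a few lines of plain Python. The example below is a single-machine illustration of the three conceptual steps (map, shuffle, reduce) using the classic word count. It does not use Hadoop, and the function names are my own. On a real cluster the framework distributes these steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over as many machines as the data requires.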
You can run MapReduce jobs using Pig, Hive, SQL, Java MapReduce, or Streaming. Many businesses elect to use Pig Latin, a procedural language, to construct Data Pipelines in Hadoop. Think of the relationship like this: Pig Latin is to HDFS as SQL is to a relational DB. Analytics are not performed against Hadoop directly. Instead, you can use a product like Hive, or Impala from Cloudera, to extract data from Hadoop and load it into an environment where you can transform it, write queries, and perform analytics. The process takes unstructured data and makes it structured so it’s possible to understand, report on, and visualize.
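Of the options listed, Streaming is the one that lets you write the map and reduce steps in a scripting language: the mapper and reducer exchange plain text lines, conventionally as a key and a value separated by a tab, and the reducer receives its input sorted by key. The sketch below imitates that contract locally with ordinary Python generators. The input format ("user page" per line) and the function names are assumptions for illustration. In a real Streaming job each function would read from standard input and print to standard output in its own script.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'page<TAB>1' for each hypothetical 'user page' input line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local stand-in for the sort that happens between the map and reduce steps.
raw = ["u1 /home", "u2 /news", "u3 /home"]
page_counts = list(reducer(sorted(mapper(raw))))
print(page_counts)
```

The `sorted` call stands in for the framework's shuffle-and-sort stage. The reducer relies on that ordering, which is why it can use `groupby` without holding every key in memory.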
Cloudera solutions provide polish on top of open source solutions and help architect data solutions using Hadoop, without the collateral damage of being locked into a vendor and left with an aftermath of proprietary garbage.
Hadoop is important for processing large data sets. Remember that relational DBs, NoSQL, and Hadoop are not mutually exclusive. They all have their respective strengths and complement one another. The trick is knowing when to use them and how to use them in concert. For now, let’s take a look at a situation where a business could use Hadoop.
Let’s say you are a publisher and you want to know how users are interacting on your web platform. What articles are they reading? How long are they staying on the page? Who are they? Where do they come from? How often do they come? What factors would cause them to stop coming to your website? How can we target them for marketing promotions? How can we tell a better story to advertisers to increase our CPM? Whether you use a Web Analytics tool like Webtrends or Omniture, or your own homegrown Java application to track events and users on your web platform, a lot of data is generated. Sometimes cookie values are stored and sometimes they are not, some users block cookies and others do not, and unique users visit from multiple devices. Add referrers, pathing analyses, user data, and the list goes on and on. Businesses, and of course governments, do go on to collect and mine this data to create insights and advantages using Big Data solutions. Simple in theory, but it’s a mess to start with.
How can you make sense of any of it? Hadoop can collect unstructured data and process it, and you can then extract the data using a tool like Hive into your desired reporting tool. That’s a relatively quick victory that Hadoop-based solutions can offer your business: real insights on your customers that can increase revenue, shape future offerings, and potentially drive a redesign of your web platform. With this kind of solution in place, one of the Data Solutions team’s biggest problems may well be figuring out how to tell marketing to leave us alone, but that is, of course, a good problem to have. Hadoop-based solutions offer your business the opportunity to base decisions on data that you previously never would have had access to. Stop leaving decision-making up to the intuitions of those with the most clout and act based on what your customers are experiencing on your web platform.
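As a small-scale illustration of turning messy event data into something reportable, the sketch below parses raw lines into structured records and answers two of the publisher's questions: which articles are being read, and how many identifiable users there are. The log format, the field names, and the use of "-" for visitors who block cookies are all assumptions made up for this example. A Hadoop-based pipeline would apply the same kind of logic across far more data.

```python
import re
from collections import Counter

# Hypothetical raw event format: "<timestamp> user=<id or -> path=<url path>"
EVENT = re.compile(r"^(?P<ts>\S+) user=(?P<user>\S+) path=(?P<path>\S+)$")


def parse_events(raw_lines):
    """Turn raw lines into dicts, skipping lines that do not match the format."""
    for line in raw_lines:
        match = EVENT.match(line.strip())
        if match:
            yield match.groupdict()


def summarize(raw_lines):
    """Return page views per path and the number of identified users."""
    views = Counter()
    users = set()
    for event in parse_events(raw_lines):
        views[event["path"]] += 1
        if event["user"] != "-":  # "-" marks a visitor with cookies blocked
            users.add(event["user"])
    return views, len(users)


# Made-up sample events, including one malformed line.
raw = [
    "2014-01-01T10:00 user=u1 path=/articles/a",
    "corrupted line with no structure",
    "2014-01-01T10:01 user=- path=/articles/a",
    "2014-01-01T10:02 user=u2 path=/articles/b",
]
views, identified_users = summarize(raw)
print(views.most_common(), identified_users)
```

Note that the anonymous visit still counts as a page view but not as a user, which is one of the many judgment calls hiding inside a "simple" metric.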
Common reasons organizations use Hadoop include the following broad categories:
- Log and/or clickstream analysis
- Marketing analytics
- Machine learning
- Data mining
- Image processing
- Processing XML messages
- Web crawling and/or text processing
- General archiving, usually for compliance purposes