The Microsoft Clip Art schematic shown to the left illustrates some general trends we continue to experience in big data management, visualization, statistical analysis, and predictive model development. Alongside these processes are others that are paramount to making the most of the data at hand, ensuring that expectations are adequately met and that we have the right data and appropriate tools for visualization, analysis, and reporting. The statistical and predictive outcomes of measured variables are only as good as the quality and quantity of the data we have, as well as the expertise and technologies used to make sense of them. As much as data quality matters, it is equally important to have the right tools and expertise to make optimal sense of the available data.
Today, we are in a sense blessed with all those cool open source data analysis tools, online and offline, as well as highly sophisticated data management, visualization, analysis, and reporting tools such as SAS products, R, SPSS, JMP Pro (my favorite), advanced MS Excel, StatPlus 5, and many more cool tools out there. These are not cheap, but they can be affordable if you are a student. The bottom line is that having the data is one thing; processing it to make ‘optimal’ sense of it is another, and that requires the right tool.
Humanity today, more than ever before, is experiencing what some would call the ‘era of data explosion.’ This era has, in a sense, taken hold of our humanity, freedom, love, personality, discussions, marriages, associations, etc. Everything now has to be judged by “do you have the data/evidence to prove that?”, and common sense no longer applies in some instances, because our logic and facts rest purely on whether or not the evidence suggests otherwise. Remember, common sense will always be the last resort for any difficult decision we have to make for which the data are not supportive. This has become the norm out there.
Even though having big data is not a bad thing, what matters is how we utilize the available information with the appropriate tools to make sound decisions that would positively transform this world.
In today’s world, we are bombarded on every side by data. Data are generated from everything we do, both online and offline. Data are not only important for organizational decision-making, but can also inform personal decisions. Each dataset we generate has a unique story that can be told to support informed decisions. But what exactly is big data? Is “big data” a phrase reserved exclusively for multinational and transnational corporations, business organizations, governments, academic, and research institutions? Can small businesses, community-based organizations, and religious institutions utilize big data in any way to make informed decisions? These and many more questions continue to be heard down the streets of analytics and machine learning about the role of big data analytics and visualization in improving organizational goals, irrespective of an organization’s size. The size of an organization does not necessarily matter when it comes to using big data to make decisions that impact short- and long-term goals. What matters is investing in the relevant technical and human expertise, specific to your organizational needs, that would transform how decisions are made and communicated to both technical and non-technical stakeholders.
Thus, big data can be utilized by a wide range of organizations for multiple purposes: from understanding customers’ purchasing power through historical trend data that can inform marketing campaigns, to understanding how demographic characteristics and economic trends for a particular population segment inform the demand for and supply of goods and services.
Big data can also be used to explore whether a causal relationship (distinguishing causation from mere correlation) exists between disease outbreaks and the social and environmental epidemiological characteristics of certain geographical areas that make them more vulnerable to outbreaks. A classic example of how big data could be utilized in such a case is the current outbreak of the Ebola Virus Disease (EVD) in Western Africa. With what we already know about the potential cause and subsequent spread of EVD, and with data from those areas (if readily available and accessible), advanced statistical analysis and predictive modelling could be conducted to understand the main cause(s) of the outbreak and the processes that facilitate its spread. Used appropriately, big data can help us not only understand the main cause(s) of disease outbreaks, but also apply the right methods to stop the spread and eventually eliminate the disease.
The idea of big data is as old as human existence; however, the concept of utilizing big data was recently reinforced by advancements in technology, as well as by analytical applications and software built to manage, store, and process millions of datasets, either on cloud-based servers or on local workstations customized to handle the workload.
Furthermore, the development of data warehouses to capture, host, and store data as they are retrieved, either in real time or cumulatively, has increased over recent decades. Most organizations, especially in retail and other industries, are able to track customers’ activities online and offline, a process that generates huge volumes of data that can be segmented for multiple purposes, including understanding the online and offline preferences and ‘behaviors’ of potential customers. We may never be 100 percent accurate in predicting online behaviors, as that is complex, but with a lot more relevant data about what is being predicted, we can make intelligent guesses or estimates closer to reality and to the anticipated outcomes.
Big data is the accumulation of data from several sources over time, such that storage becomes an integral part of an organization’s processes to extract, maintain, store, and retrieve data for analysis and reporting. If you are considering utilizing big data for your organization, it is crucial to first think about the data forms, types, and sources. If you already have the data available, along with the expertise to create the architecture and infrastructure needed for effective and efficient management, storage, and retrieval, your troubles are half over. In that case, your next worry would be to put in place data governance processes that effectively account for inter-departmental or inter-agency data management, storage, retrieval, and processing.
In the next few paragraphs, I will propose a few big data management steps that you can use to effectively manage and work with big data to ensure timely processing, analysis, and reporting.
The Data Triangle
One important aspect of big data management, analytics, and visualization is understanding the sources, forms, and types of data you store in your data warehouse or databases. Together, the data sources, forms, and types constitute what is known as the Data Triangle. Why is formulating a data triangle crucial in big data management and analytics? Your data triangle serves as a visual framework of the sources your data are extracted or retrieved from, the forms in which your data are stored and retrievable, and the types of data that are stored.
Generally, a data triangle is a good visual tool that any analyst can create before doing any analytical work. It allows you to visually conceptualize your data, to see how to integrate each sourced dataset relative to the form and type in which it was extracted, and to plan how to transform your data during manipulation, mining, and analysis. Graphically, the analyst lists the data sources, forms, and types at the three points of the triangle.
For example, data sources could be internal or external databases, online databases, business intelligence databases, web-based analytics data, demographic data, census data, etc. Data forms describe the general content of a data file: climate data, soil moisture measurements (soil temperature, pH, etc.), responses from a marketing campaign across diverse channels, order history, demographic and socioeconomic data, survey data, questionnaire responses, and so on. Data types, however, are either structured or unstructured.
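To make the idea concrete, here is a minimal sketch in Python of how an analyst might jot down the three vertices of a data triangle before starting any analytical work. All of the entries are hypothetical examples drawn from the categories above, not a real inventory:

```python
# The three vertices of a data triangle: sources, forms, and types.
# Every entry here is a hypothetical example for illustration only.
data_triangle = {
    "sources": ["internal CRM database", "census data", "web-based analytics"],
    "forms": ["order history", "survey responses", "soil moisture measurements"],
    "types": ["structured", "unstructured"],
}

def summarize(triangle):
    """Count how many items were recorded at each vertex of the triangle."""
    return {vertex: len(items) for vertex, items in triangle.items()}

print(summarize(data_triangle))
```

Even a plain dictionary like this forces you to answer the three questions the triangle poses (where does the data come from, what does each file contain, and how is it organized) before you touch any analysis software.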
Structured data are already organized and ready to be analyzed, provided they meet the format required by the specific software being used in the analysis. Unstructured data need a significant amount of work to be put into a form that can be analyzed. Unstructured data could be anything from feedback on evaluations administered with both open- and closed-ended questions, to success stories generated after the completion of a project or program. Any data that do not have a well-defined structure can be classified as unstructured.
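The work of "structuring" unstructured data can be as simple as extracting a few consistent fields from free text. The sketch below, with made-up feedback responses, turns open-ended evaluation comments into structured rows (respondent ID, word count, a keyword flag) that any statistics package could ingest:

```python
# Hypothetical open-ended evaluation feedback (unstructured text).
feedback = {
    101: "The workshop was great, but the venue was too small.",
    102: "Great materials; I would attend again.",
}

def structure_feedback(responses):
    """Convert free-text responses into structured rows ready for analysis."""
    rows = []
    for respondent_id, text in responses.items():
        rows.append({
            "respondent_id": respondent_id,
            "word_count": len(text.split()),
            "mentions_great": "great" in text.lower(),
        })
    return rows

for row in structure_feedback(feedback):
    print(row)
```

Real unstructured data usually demands far more than keyword flags, but the principle is the same: decide on the fields you need, then extract them consistently from every record.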
Note that data types can be further segmented into numeric, character/categorical, row state, and expression, and sub-classified by modeling type into continuous, ordinal, and nominal. Bear in mind that this terminology changes depending on the particular software being used to analyze the data, and it is always recommended to use the classifications appropriate to that system. In this post, I am using SAS JMP Pro nomenclature for data and modeling types, respectively. Understanding your data sources, forms, and types allows you to decide which analysis to conduct and demonstrates that you know your data at your fingertips.