Critical thoughts about big data analysis

Completed on 02-Jul-2016 (57 days)

Accuracy unconcerned

An immediate consequence of the aforementioned data hoarding is that adequately understanding the collected information becomes a somewhat secondary concern. Note that a model which is accurate only under restricted conditions is not the same as one which properly describes the given situation. For example: only adding 2 and 3 vs. a fully-functional addition system; the first, more restricted alternative might be acceptable in certain cases, but only when the associated limitations are fully understood.
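The 2-and-3 example can be made concrete with a minimal sketch. The function names below are hypothetical and only meant to illustrate the contrast: the restricted version is perfectly accurate for the one case it knows, and it is acceptable only because it states its limitation explicitly instead of returning an answer for inputs it was never built for.

```python
def restricted_add(a, b):
    """Hypothetical 'model' which is only valid for the single case 2 + 3."""
    known_cases = {(2, 3): 5}
    if (a, b) not in known_cases:
        # The limitation is made explicit rather than silently ignored.
        raise ValueError(f"({a}, {b}) is outside the conditions this model covers")
    return known_cases[(a, b)]


def general_add(a, b):
    """Fully-functional addition: describes the operation itself."""
    return a + b


print(restricted_add(2, 3))  # accurate, but only under its restricted conditions
print(general_add(10, 7))    # valid for any pair of numbers

try:
    restricted_add(3, 2)
except ValueError as error:
    print("Restricted model refused:", error)
```

Both functions agree on 2 + 3; judging them on that case alone says nothing about which one actually captures addition.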

Despite the ideas in the previous paragraph, accounting for more or less information shouldn't, in theory, affect the accuracy of the model. In most situations, however, such a data-size-indifferent approach would imply an unacceptable increase in complexity and would consequently be impossible in practice. There is even a third alternative, in which all the additional information that cannot be properly accounted for is simply ignored. This last implementation is much easier, but it still requires some additional work.

Another issue to bear in mind is that information quality cannot be understood in an absolute way. The basic features which have to be present in any high-quality data source (e.g., correctness, completeness, descriptiveness, etc.) might be seen as mere prerequisites; on top of that, each data set has to meet certain specific quality targets. For example: when analysing the health of a person, the amount and quality of information related to physical features is very relevant, which is certainly not the case when analysing that person's taxes. Thus, an overly careless or generic approach to data collection also has a notable impact on the accuracy of the conclusions, regardless of the size of the data set or the data-analysis technique used.

An adequate understanding of the corresponding context also has a major influence on accuracy. At first sight, this might seem evident, but the peculiarities of automated understanding processes make it easy to overlook. Even under highly restricted conditions, automated context understanding can be very difficult or impossible; even worse: the model might be inadvertently ignoring such a relevant aspect.

Another issue to bear in mind is the common misconception that data models can be seen as absolute-answer deliverers working forever with minimal support. Even the most complex and comprehensive model is very short-sighted and unadaptable in comparison with what human understanding can deliver. For example: a person, after having understood a given phenomenon, can output reliable conclusions on account of such knowledge for a wide variety of equivalent situations (logically, within the limits defined by the phenomenon and the person's capacity for understanding). On the other hand, data models are always meant to be used under very restricted conditions and have a limited adaptability.
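The limited adaptability of a model outside the conditions it was built for can be illustrated with a small sketch. Everything in it is an assumption chosen for illustration: the "phenomenon" is simply y = x², the model is a straight line fitted by ordinary least squares, and the training data only cover the range from 0 to 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def phenomenon(x):
    """Hypothetical 'true' behaviour which the model tries to describe."""
    return x ** 2


# The model only ever sees data collected under restricted conditions.
train_xs = [0.0, 0.25, 0.5, 0.75, 1.0]
train_ys = [phenomenon(x) for x in train_xs]
slope, intercept = fit_line(train_xs, train_ys)


def model(x):
    return slope * x + intercept


# Largest error inside the range the model was built for.
in_range_error = max(abs(model(x) - phenomenon(x)) for x in train_xs)
# Error for a situation far outside that range.
out_of_range_error = abs(model(10.0) - phenomenon(10.0))

print("Largest error inside the fitted range:", in_range_error)
print("Error at x = 10:", out_of_range_error)
```

The fitted line looks reasonable within the restricted range and fails badly outside it, whereas somebody who has understood that the phenomenon is quadratic can answer correctly for any value of x.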

In summary, building an approach which accurately analyses a certain (big) data set in an automated way is quite a difficult task. A large number of different aspects have to be taken into account, something which isn't a primary concern for many users who lack data-analysis knowledge. Furthermore, adequately assessing the suitability of a certain model to describe a given phenomenon is difficult and misinterpretation-prone.