Big data peculiarities
Critical thoughts about big data analysis
Completed on 02-Jul-2016 (57 days)
As already explained, I have recently been working on various big-data-related developments (the appendix of this project includes my detailed impressions about one of them). These experiences have given me relevant insights into big data forecasting, which differs notably from what I was used to in my background with more restricted models.
The most relevant differences which I observed when facing the aforementioned big-data problems are summarised in the following points:
Building comprehensive models (i.e., ones adequately accounting for virtually any sub-situation) is very difficult; in most cases, such an approach isn't even advisable. The next point helps to understand this issue better. More specifically, big-data expectations and/or assessment methodologies tend to favour outputs that are not too bad in most cases, which penalises more insightful approaches that mispredict slightly.
Generic assessment methodologies. A descriptive example to illustrate this point: by taking an average-based methodology and assuming that the modelled behaviour is defined by (input=>output) 1=>2, 2=>3 and 3=>1, predictions of the form 1=>2, 2=>2 and 3=>2 would be considered perfect, because the predicted and the actual outputs share the same average (2). Such an approach heavily penalises attempts aiming for high accuracy: in the very unlikely scenario of delivering an actually perfect answer, such an attempt would get the same score as the aforementioned simplistic average-value result; in any other case, it would score worse, regardless of how well it really understands the underlying behaviour.
As a consequence of the two previous points, adapting to the peculiarities of this format seems unavoidable. Even seemingly easy and intuitive ideas (e.g., keeping it as simple as possible) cannot be applied right away, especially when coming from a different background. The big-data character (i.e., huge training data sets, together with the conditions and expectations usually associated with these problems) certainly has a big influence on the way in which a given model is developed.
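The average-based example from the second point can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions: "average-based methodology" is interpreted here as comparing the mean of the predictions against the mean of the actual outputs, and the names average_based_error, pointwise_error, constant_guess and near_miss are hypothetical, introduced purely for illustration. The near_miss predictions (two of the three mappings exactly right) are also an added assumption, meant to represent a more insightful attempt that mispredicts slightly.

```python
# Hypothetical illustration: the scoring functions below are assumptions,
# not the methodology of any specific competition or tool.

actual = {1: 2, 2: 3, 3: 1}          # modelled behaviour (input => output)
constant_guess = {1: 2, 2: 2, 3: 2}  # simplistic average-value predictions
near_miss = {1: 2, 2: 3, 3: 2}       # two of the three mappings exactly right


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def average_based_error(predicted, actual):
    """Absolute gap between the mean prediction and the mean actual output."""
    return abs(mean(predicted.values()) - mean(actual.values()))


def pointwise_error(predicted, actual):
    """Mean absolute error computed input by input, for comparison."""
    return mean(abs(predicted[k] - actual[k]) for k in actual)


candidates = [
    ("constant", constant_guess),
    ("near miss", near_miss),
    ("perfect", actual),
]
for name, predicted in candidates:
    print(
        name,
        "average-based:", average_based_error(predicted, actual),
        "pointwise:", pointwise_error(predicted, actual),
    )
```

Under these assumptions, the constant guess ties with the perfect answer on the average-based score, while the near miss scores worse than both despite having a lower input-by-input error than the constant guess.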