Critical thoughts about big data analysis

Completed on 02-Jul-2016 (57 days)

Big data modelling

As previously explained, most of my numerical modelling expertise has focused on specialised models dealing with well-delimited situations and relatively small datasets. Sharing my impressions about the differences between that experience and standard big-data conditions is precisely one of the main goals of this project. Below, I include some generic ideas about big data modelling (and about the transition to it), extending what I wrote in the Big data peculiarities section. Additionally, the appendix of this project contains a detailed analysis of a big-data challenge in which I participated.

Main ideas to bear in mind when facing the development of big-data models:
  • Relying on complex and comprehensive approaches from the very beginning is certainly a bad way to proceed. On the other hand, models built by accumulating simple implementations, each affecting a relevant number of cases, are likely to deliver good performance.
  • There is a huge amount of free resources whose use is almost a must in many cases. Nevertheless, these free resources are usually a double-edged sword: very useful, but easy to misunderstand. The problem is not just the false certainty which unknowledgeable people (or even people not that experienced in big-data peculiarities) might get from them, but also their usually unfriendly character; which, on the other hand, is quite logical given their complexity and their poorly supported open-source essence.
  • The big-data essence should never be forgotten. This point seems intuitively evident, and that is precisely what it tries to prevent: intuitively evident but wrong actions. The most distinctive feature of big-data modelling is dealing with huge amounts of information, usually well beyond our intuitive grasp; this can easily be overlooked when working on one of these models. Hence the following rule of thumb, to be applied no matter what: every single intermediate action should always be automated in a way that accounts for the highest possible number of cases; no intuition, suppositions, my-experience-tells-me-whatever or similar.
  • Algorithm optimisation and/or the availability of powerful hardware always have to be seen as top priorities.
  • Overfitting is particularly difficult to detect and correct. Nice-looking results which really say nothing are the most common variant of this problem. As discussed in the next section, the notably high number of unknowledgeable attitudes around big data makes this problem even more relevant.
  • Coming up with methodologies which adequately assess the performance of these models is also quite difficult. The aforementioned lack of knowledge has a notable impact here too.
In a nutshell, there are two main issues to bear in mind when facing most big-data problems. Firstly, the easily forgotten unintuitive essence of most of what is related to such huge amounts of data. Secondly, the cluelessness commonly associated with a relevant proportion of big-data situations.
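
The overfitting point in the list above (nice-looking results which say nothing) can be illustrated with a minimal Python sketch. It is not taken from any real project: the dataset is synthetic noise generated on the spot, and all the names (make_noise_dataset, MemorisingModel, train_vs_holdout) are hypothetical. The labels are random and unrelated to the features, so there is nothing to learn; a model which merely memorises its training rows still scores perfectly on them, and only a held-out split reveals that it knows nothing.

```python
import random


def make_noise_dataset(n_rows, n_features, seed):
    """Synthetic rows whose labels are independent of their features."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        features = tuple(rng.random() for _ in range(n_features))
        label = rng.randint(0, 1)  # pure noise: nothing to learn here
        rows.append((features, label))
    return rows


class MemorisingModel:
    """Deliberately overfitted model: a lookup table of the training rows."""

    def fit(self, rows):
        self.table = {features: label for features, label in rows}
        # Fallback for unseen rows: the most frequent training label.
        ones = sum(label for _, label in rows)
        self.default = 1 if 2 * ones >= len(rows) else 0
        return self

    def predict(self, features):
        return self.table.get(features, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label is predicted correctly."""
    hits = sum(1 for features, label in rows if model.predict(features) == label)
    return hits / len(rows)


def train_vs_holdout(rows, holdout_fraction=0.3, seed=0):
    """Fit on one part of the data and score on both parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    train, holdout = shuffled[:cut], shuffled[cut:]
    model = MemorisingModel().fit(train)
    return accuracy(model, train), accuracy(model, holdout)


if __name__ == "__main__":
    train_acc, holdout_acc = train_vs_holdout(make_noise_dataset(2000, 5, seed=1))
    print(f"training accuracy: {train_acc:.2f}")
    print(f"held-out accuracy: {holdout_acc:.2f}")
```

The training score alone would look excellent here; the held-out score stays around what coin flipping would give. The split is done automatically over all the rows, in line with the rule of thumb about not relying on intuition.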