Critical thoughts about big data analysis
Completed on 02-Jul-2016 (57 days)
Many business opportunities have grown around (big) data analysis, a reality that has attracted quite a few people who are not too skilled on this front. They typically place blind trust in easily obtaining immediate benefits and hold disproportionate, or plainly delusional, expectations. Such attitudes tend to sit close to funding and decision-making spheres, which converts their unreliable opinions into actually influential trends.
Big data challenges are an excellent place to get an accurate idea of this kind of attitude; in fact, the appendix of this project includes a detailed description of my participation in one of these challenges. Roughly speaking, complex (here meaning interesting) problems are proposed to a heterogeneous group of skilled, competitive and motivated online data modellers. That is why, regardless of any other factor, these contests definitively provide a good reference for what some big-data-concerned (but not necessarily knowledgeable) companies consider difficult, relevant and even the future.
After participating in a few of these challenges and analysing a fair number of additional ones, I have a very clear idea of the typical challenge proposer's expectations (at least for the unknowledgeable sub-type). I see two main problems here:
Uninformative and easily manipulated assessment methodologies. In most of these contests, the goals and the way solutions are assessed tend to be very simplistic and adapted to specific methodologies (i.e., the problem is expected to be solved in a certain way and is defined with this fact in mind). Some people might argue that this is required by everything a competition entails. In my opinion, this issue is caused exclusively by not having properly analysed the problem and the goal; another manifestation of the quick-easy-results and not-knowing-but-deciding attitudes underlying this whole critique. Most of these challenges expect very specific answers to highly restricted problems, but they rarely deliver the ideal outcome: good insight into the broader set of problems of which the given challenge should be only a descriptive sample.
Plainly useless goals. I have seen quite a few cases where the pursued goal was plainly useless for the proposer. Example: building a model to recognise which road, out of 5, a certain stretch belongs to. This is a clearly overfitting-prone problem whose conclusions will never have general applicability (i.e., being able to recognise any road from a given stretch).
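The pitfall above can be made concrete with a small sketch. The actual challenge data is not available here, so the features and values below are entirely invented for illustration: a nearest-neighbour "model" memorises stretches from 5 known roads, scores perfectly on its own training data, and yet, when shown a stretch from a brand-new road, can only ever answer with one of the 5 memorised labels. Its conclusions cannot generalise by construction.

```python
# Hypothetical illustration of the overfitting-prone road-recognition goal.
# All feature vectors (e.g., curvature/width statistics of a stretch) are
# invented; nothing here comes from the actual challenge.
import math

# Toy stretches for 5 roads, as (features, road_label) pairs.
train = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0),   # road 0
    ((1.0, 1.1), 1), ((1.1, 0.9), 1),   # road 1
    ((2.0, 2.2), 2), ((2.1, 1.9), 2),   # road 2
    ((3.0, 3.1), 3), ((3.2, 2.9), 3),   # road 3
    ((4.0, 4.2), 4), ((4.1, 3.8), 4),   # road 4
]

def predict(x):
    """1-nearest-neighbour: return the label of the closest memorised stretch."""
    return min(train, key=lambda s: math.dist(x, s[0]))[1]

# Perfect score on the training stretches (each point is its own neighbour).
train_acc = sum(predict(x) == y for x, y in train) / len(train)
print(train_acc)  # 1.0

# A stretch from an unseen 6th road: the model is forced to pick a label
# from its closed set 0-4, so it is necessarily wrong outside those 5 roads.
print(predict((10.0, 10.0)))  # 4 (nearest memorised stretch, not the truth)
```

The point is not that 1-NN is a bad algorithm, but that the question itself caps the model's usefulness: however well it scores inside the contest, it answers nothing about roads in general.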
In summary, wrongly applied data analysis/maths can prove virtually anything, which is the same as proving nothing. In addition to building a proper model, the right questions have to be asked and the delivered outputs have to be adequately understood.