Critical thoughts about big data analysis
Completed on 02-Jul-2016 (57 days)
Undoubtedly, this has been a very valuable experience from which I have learned various worthy lessons. In fact, my original expectations of adapting my specific modelling expertise to big-data conditions have been greatly exceeded.
Kaggle is quite important within the (big-)data-challenge community, so getting relevant insights into the best way to deal with its peculiarities is also valuable. Without going into too much detail about the exact essence of its public code (the scripts created by its staff and, eventually, improved by some contestants), I am certain about one thing: it helps to get into the problem quickly and efficiently. Note that there is a very important difference between creating (mainly from scratch) and improving; even completely redoing a reasonably good approach is notably easier than working without any guidance.
There is also a worthy conclusion from my unplanned first contact with Python: it reconfirmed that adapting to a different programming language isn't too difficult, particularly for an experienced programmer like me. Note that, despite my relevant programming expertise and having worked with many different frameworks, I had never used Python and its defining whitespace-matters feature (a non-issue in all the languages I had used before). I started running/analysing the given script almost immediately; but after realising its memory problems, my first thought was to move the algorithm to C#. I changed my mind after realising that the code was more intricate than it seemed at first sight. That is: the day after having used the language for the first time (and just hours before the final deadline), I started rewriting a not-that-simple algorithm so that it could run on my (just!) 12 GB RAM computer. Unfortunately, this story didn't end well and my improved version wasn't able to deal with a notably bigger amount of data in one of my last tests. Nevertheless, this problem wasn't exactly caused by my limited Python knowledge, because it came from an unexpected and difficult-to-fix issue.
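As a minimal illustration of the kind of memory-conscious rewriting mentioned above (this is not the code from the challenge; the function name, the column names and the sample data are all hypothetical), a Python script can process a big file one row at a time instead of loading it completely:

```python
import csv
import io
from collections import Counter


def count_by_key(lines, key_column):
    """Count rows per value of key_column, reading one row at a time."""
    counts = Counter()
    reader = csv.DictReader(lines)
    for row in reader:  # only the current row is held in memory
        counts[row[key_column]] += 1
    return counts


# Tiny in-memory stand-in for a large file (hypothetical columns);
# with real data, an open file object would be passed instead.
sample = io.StringIO("user,item\na,1\nb,2\na,3\n")
result = count_by_key(sample, "user")
```

Note that this kind of approach only bounds the memory used by the raw rows; the counter itself still grows with the number of distinct keys, so a notably bigger data set can still exceed the available RAM.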
Regarding facing big-data problems, I learned the following:
Keeping things as simple as possible on all fronts. Note that I applied as-simple-as-possible ideas from the very first moment. Even so, my intentions inadvertently changed over time, mostly by wanting to build an approach as comprehensive as possible, able to deal with as many variables as possible.
Kaggle's public code helped me see that a low number of (combinations of) variables might also deliver good performance. Logically, this fact doesn't say anything about what an approach accounting for many more variables (the target of my model) can do. On the other hand, hardware/time constraints can certainly turn such a theoretical no-problem into a practical no-way.
Hence, the keep-it-simple ideas have to be applied as systematically as possible when dealing with big-data problems. This doesn't just refer to the complexity of the different methodologies forming the model, but also to the number of variables and to the best way to face the whole situation (i.e., properly analysing a few rather than accounting for many).
At least in open challenges or similar situations where the proposer's background is unclear, the quality of the input information shouldn't be assumed to be perfect. The quality of the inputs (understood in its widest sense: training data, description of the problem, expected problems, etc.) has a tremendous impact on the reliability of the resulting model; even the slightest issue here should be fixed immediately.
In a challenge where the aforementioned fixing isn't possible, such a reality should be accepted and the model developed accordingly.
The assessing methodologies used when dealing with objective-correctness-prone scenarios and contest-like conditions are also very different. In the former, the model should definitely be assessed so that its performance is properly understood (i.e., how likely its predictions are to be right, paying special attention to avoiding correct-in-appearance misinterpretations). Further issues, like scalability and adaptability, should ideally also be brought into the picture when determining the adequacy of a given numerical model.
Under contest-like conditions, the previous paragraph has no real applicability. All that matters is scoring as high as possible under the corresponding assessing methodology. In these cases, any other consideration (including developing an approach that performs objectively better) would cause an unnecessary waste of time.
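The correct-in-appearance misinterpretations mentioned above for objective assessments can be illustrated with a small sketch. Everything in it is an assumption made for illustration purposes (synthetic data, a deliberately naive "memorising" model, plain accuracy as the score); it has nothing to do with the actual challenge or its scoring methodology.

```python
import random


def make_data(n, rng):
    """Hypothetical data: label is 1 when x > 0.5, with 20% of labels flipped."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise
            label = 1 - label
        rows.append((x, label))
    return rows


def accuracy(predict, rows):
    """Fraction of rows whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
train = make_data(500, rng)
test = make_data(500, rng)

# "Memorising" model: returns the stored training label, 0 for unseen inputs.
memory = dict(train)


def memoriser(x):
    return memory.get(x, 0)


# Simple model relying on the one relevant variable.
def threshold(x):
    return int(x > 0.5)


train_acc = accuracy(memoriser, train)      # looks perfect on the data already seen
test_acc = accuracy(memoriser, test)        # much worse on data not seen before
simple_test_acc = accuracy(threshold, test)
```

The memorising model seems flawless when assessed against the data used to build it, but the simple threshold does clearly better on unseen data; only the second kind of assessment says something about how likely the predictions are to be right.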