Critical thoughts about big data analysis
Completed on 02-Jul-2016 (57 days)

Final model

After confirming that there wasn't enough time left to get the most out of my approach, I took a quick look at some of the public (Python) scripts. Note that this was one of my first Kaggle challenges and I wasn't too sure about the exact nature of these public contributions. Apparently, they were created by Kaggle's staff to provide some help to participants.

This public code (all the versions I saw were slight modifications of the same algorithm) was performing notably better than my best attempt so far. Its basic structure was quite similar to that of my approach; it was even relying on the combinations of variables which my tests had found to be the best. On the other hand, it also had the following important differences with respect to my model:
  • It was accounting for the date/time variables in quite a complex way. During my tests, I made some (much more simplistic) attempts to bring this information into the picture, but none of them produced a relevant improvement.
  • It was filtering the cases on the basis of the is_booking variable. On the one hand, the problem description clearly stated that this variable was considered in all the test cases; on the other hand, all my tests and submissions on this front led to the conclusion that accounting for it wasn't beneficial. In fact, my last-minute tests, described further below in this section, seemed to support that conclusion.
  • It was giving some relevance to a certain variable (distance) which my approach was ignoring. As per most of my tests, this variable had quite a low influence.
  • There were various seemingly arbitrary filters. I am not sure about their exact motivation; perhaps a mistake, or perhaps a new, quite complex but surprisingly well-performing bit.
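To make the is_booking difference more tangible, here is a minimal sketch of a frequency-based model where that filter can be switched on and off. It is not the public code nor my model: the function top_targets, the columns dest and target and the toy rows are hypothetical names invented for this illustration; only is_booking comes from the challenge.

```python
from collections import Counter, defaultdict


def top_targets(rows, key_fields, bookings_only, k=5):
    """Return the k most frequent targets for each key combination.

    Hypothetical helper: 'target' stands for whatever is being predicted.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # Optional filter: keep only the rows flagged as bookings.
        if bookings_only and row["is_booking"] != 1:
            continue
        key = tuple(row[field] for field in key_fields)
        counts[key][row["target"]] += 1
    return {key: [t for t, _ in c.most_common(k)] for key, c in counts.items()}


# Toy rows; "dest" and "target" are made-up column names.
rows = [
    {"dest": 1, "is_booking": 0, "target": "A"},
    {"dest": 1, "is_booking": 0, "target": "A"},
    {"dest": 1, "is_booking": 1, "target": "B"},
    {"dest": 2, "is_booking": 0, "target": "C"},
]

with_filter = top_targets(rows, ["dest"], bookings_only=True)
without_filter = top_targets(rows, ["dest"], bookings_only=False)
```

With the filter on, the non-booking rows contribute nothing, so a key like dest 2 gets no prediction at all; with the filter off, every row counts. Which variant scores higher depends on the data, which is exactly what would have to be tested.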
There wasn't much time remaining, this was my first ever contact with Python (and its whitespace peculiarities) and that specific code was quite memory inefficient (at least, inefficient enough not to run on my computer). Despite all these problems, I was able to put together a reasonably good benchmark which seemed to work perfectly; at least, until the very last moment (2 hours before the deadline), when the memory problems came back. My tests indicated that removing the reliance on is_booking would have allowed this model to score notably higher; but unfortunately, the last-minute memory problem didn't let me confirm that assumption. It was definitely a curious end for a curious challenge.
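As a side note on the memory problems, one common way to keep memory usage bounded in Python is to read the input row by row and store only aggregated counts. The following is a rough sketch under that assumption, using only the standard library; it is not what the public code did, and the function count_column and the sample contents are invented for this illustration.

```python
import csv
import io
from collections import Counter


def count_column(fileobj, column):
    """Count the values of one column while reading a CSV row by row."""
    counts = Counter()
    # csv.DictReader yields one row at a time, so the whole file
    # never has to sit in memory.
    for row in csv.DictReader(fileobj):
        counts[row[column]] += 1
    return counts


# In-memory stand-in for a large file; a real run would pass
# something like open(path, newline="") instead.
sample = io.StringIO("is_booking,distance\n1,10.5\n0,\n0,3.2\n")
booking_counts = count_column(sample, "is_booking")
```

Here the memory needed grows with the number of distinct values being counted rather than with the number of rows. Note that the csv module returns every value as a string, so the keys are "0" and "1" rather than integers.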