Critical thoughts about big data analysis
Completed on 02-Jul-2016 (57 days)

Model evolution

The main algorithm didn't change much from the start. It sequentially analysed how likely (combinations of) variables were to predict the given output (hotel_cluster), as described in the following points.
  • All the training cases where the given variable had the same value were grouped together. Most of these groups were associated with multiple hotel_cluster values, which is why the total number of occurrences was also stored.
  • At the beginning, this algorithm was run (and validated via submissions) with the variables that appeared relevant. Gradually, the focus shifted from single variables to mostly combinations of different variables.
  • Multiple (combinations of) variables were coordinated by taking as many predictions as possible (up to the maximum of 5) from the current combination before moving to the next one. The order in which the different combinations were analysed was problematic from the start, but it mostly relied on a mixture of proven good performance, number of variables (the more variables in the combination, the better) and average likelihood of all the given hotel_cluster values being right (the fewer different hotel_cluster values associated with the combination, the better).
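The points above can be roughly sketched as follows. This is only a minimal illustration under assumptions, not the actual implementation: the variable names var_a and var_b, the toy data and the helper names build_table and predict are all hypothetical, and the order of the combinations is simply given as input rather than derived from the criteria mentioned above.

```python
from collections import Counter, defaultdict

MAX_PREDICTIONS = 5  # maximum number of predictions per case, as stated in the text


def build_table(rows, combo):
    """Group training cases by the values of `combo` and count hotel_cluster occurrences."""
    table = defaultdict(Counter)
    for row in rows:
        key = tuple(row[var] for var in combo)
        table[key][row["hotel_cluster"]] += 1
    return table


def predict(row, combos, tables):
    """Take as many predictions as possible from each combination, in order,
    before moving to the next one, stopping at MAX_PREDICTIONS."""
    predictions = []
    for combo in combos:
        key = tuple(row[var] for var in combo)
        counts = tables[combo].get(key)  # None when this value tuple was never seen
        if counts is None:
            continue
        for cluster, _ in counts.most_common():  # most frequent cluster first
            if cluster not in predictions:
                predictions.append(cluster)
                if len(predictions) == MAX_PREDICTIONS:
                    return predictions
    return predictions


# Toy training data; var_a and var_b are hypothetical variable names.
train = [
    {"var_a": 1, "var_b": 10, "hotel_cluster": 3},
    {"var_a": 1, "var_b": 10, "hotel_cluster": 3},
    {"var_a": 1, "var_b": 10, "hotel_cluster": 7},
    {"var_a": 2, "var_b": 10, "hotel_cluster": 5},
    {"var_a": 2, "var_b": 10, "hotel_cluster": 5},
    {"var_a": 2, "var_b": 10, "hotel_cluster": 5},
    {"var_a": 2, "var_b": 20, "hotel_cluster": 9},
]

# Assumed order: the combination with more variables goes first.
combos = [("var_a", "var_b"), ("var_b",)]
tables = {combo: build_table(train, combo) for combo in combos}

print(predict({"var_a": 1, "var_b": 10}, combos, tables))
```

In this sketch, a case whose two-variable combination was never seen in training falls through to the single-variable table, which is the fallback behaviour that makes the order of the combinations so influential.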
This basic structure went through the following relevant modifications:
  • Gradual increase of the number of combinations and variables per combination, until reaching a point where the model stopped being easily scalable.
  • A better way to include further combinations and variables, designed with my expected future usage in mind (i.e., making it as generic and adaptable as possible).
  • After the first problems with hitting memory limits, a new approach able to deal with any number of variables. It was a multi-step, multi-application methodology, where intermediate information was generated and used at different points.
  • Although accounting for as many combinations as required was already possible, the number of potential configurations was too high (15! to 25!) and the process too slow. That's why finding the best combinations became the main concern.
  • Searching for the best combinations proved more tiring and unrewarding than planned; additionally, there wasn't much time left (with more time, this approach might have produced good results). One last attempt at keeping it as simple as possible, by ignoring all the stored combinations and restarting the process.
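To give a sense of the scale behind the 15! to 25! figure, the short sketch below assumes that it refers to the number of possible orders in which 15 to 25 combinations can be analysed (one reading, not stated explicitly in the text). The rate of one million evaluated orderings per second is a purely hypothetical number chosen for illustration.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def orderings(n):
    """Number of distinct orders in which n combinations can be analysed (n!)."""
    return math.factorial(n)


def years_to_enumerate(n, evaluations_per_second):
    """Time needed to try every ordering at the given (hypothetical) rate, in years."""
    return orderings(n) / evaluations_per_second / SECONDS_PER_YEAR


for n in (15, 25):
    years = years_to_enumerate(n, 1_000_000)  # hypothetical rate
    print(f"{n} combinations: {orderings(n)} orderings, about {years:.3g} years")
```

Even under such an optimistic rate, trying every ordering is out of reach for the larger counts, which is consistent with finding the best combinations becoming the main concern.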
There wasn't enough time to get the most out of my approach and giving up seemed the best option. Before that, I took a look at the publicly available code which, as explained in the next section, helped me realise various issues.