Numerical models are, essentially, a way to extend human understanding to situations initially beyond our reach (e.g., information sources too big or complex to process manually, or automated decision-making). That is why all data models, regardless of their configuration, pursue the same goal: reliably understanding a certain reality, ideally as well as a person would. Despite this shared essence, not all data-understanding problems can be approached in the same way, an idea which underlies this whole project.
Roughly speaking, any modelling process can be divided into the following constituent parts:
Training data. The past meaningful information used by the model to draw its predictions. The human-understanding equivalent is straightforward: all the information a person takes into account to understand a situation and decide accordingly.
Model itself. The set of algorithms in charge of learning (i.e., adequately understanding all the training information) and predicting (i.e., outputting the most likely results for the given inputs). This part emulates the human capabilities of learning, understanding and deciding.
Resulting predictions. The conclusions delivered by the model for a given set of inputs. Eventually, the default model predictions might be corrected or complemented to ensure the highest accuracy, for example by relying on an auto-learning subsystem. This part emulates the final outputs of a person's understanding process (e.g., a decision, guess or supposition), seen as a complex reality which might also involve interactions with other individuals.
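The three constituent parts above can be sketched in code. The following is a deliberately naive, illustrative example (all names and the toy "learning" rule are hypothetical, not any real library's API): the model learns a single proportionality factor from the training data and uses it to predict, with a simple correction applied to the raw output.

```python
# 1. Training data: past meaningful information (input -> observed output).
training_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


class ToyModel:
    """2. The model itself: learns from the data and delivers predictions."""

    def fit(self, data):
        # "Learning": estimate one proportionality factor (total output
        # divided by total input) from the training information.
        self.factor = sum(y for _, y in data) / sum(x for x, _ in data)
        return self

    def predict(self, x):
        # "Predicting": output the most likely result for the given input.
        return self.factor * x


model = ToyModel().fit(training_data)

# 3. Resulting predictions, possibly corrected afterwards (here, clipped to
# a plausible range, emulating a complementary correction subsystem).
raw_prediction = model.predict(5.0)
corrected_prediction = max(0.0, raw_prediction)
print(corrected_prediction)  # → 10.0
```

A real model would obviously be far more elaborate, but the division of responsibilities (data, learning/predicting algorithms, and post-processed predictions) stays the same.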
Logically, the input conditions and the expectations have a major impact on the development of a model. Continuing with the human-understanding analogy, not everyone can analyse certain situations under certain conditions; compare, for example, the abstract impressions of a layperson against the detailed answers of an expert in the given field. The effect of the quality of the training information is also quite evident (e.g., the worse the information, the more insightful the person has to be to understand it adequately). On the other hand, the size of the training information might seem somewhat secondary on this front, but it is certainly not.
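The size effect can be shown with a deliberately simple, hypothetical setup: the "reality" to understand is a single constant observed through noisy measurements, and the "model" is just the sample mean. Larger training sets yield systematically smaller errors.

```python
import random
import statistics

# Hypothetical setup: a constant reality observed through noisy samples.
TRUE_VALUE = 10.0
NOISE_STD = 2.0


def estimate(n_samples, rng):
    """'Model': estimate the constant as the mean of n noisy observations."""
    samples = [TRUE_VALUE + rng.gauss(0.0, NOISE_STD) for _ in range(n_samples)]
    return statistics.mean(samples)


def mean_abs_error(n_samples, trials=200, seed=0):
    """Average error of the estimate over many independent training sets."""
    rng = random.Random(seed)
    errors = [abs(estimate(n_samples, rng) - TRUE_VALUE) for _ in range(trials)]
    return statistics.mean(errors)


small_data_error = mean_abs_error(10)      # small training set
big_data_error = mean_abs_error(10_000)    # big training set
print(small_data_error, big_data_error)    # the second is much smaller
```

Quality matters in exactly the same way: increasing `NOISE_STD` (worse information) degrades the estimate just as shrinking the sample does, which is why neither aspect can be dismissed as secondary.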
As explained in the corresponding section, most of my numerical modelling expertise is focused on dealing with small, high-quality training information. One of the goals of this project is to share my impressions about the transition from such a background to big-data conditions. A second goal is to critically analyse the big-data aspects which might be better approached differently.