Page 54 - Flytxt
entail. Also, multiple models may work on multiple data representations (like different granularities of aggregations); hence the ability to have many models being evaluated in parallel is necessary. An evolving conceptual tool for such parallel evaluation is ensemble modelling: choosing a combination of outputs from multiple models. We anticipate more focus in the data science community on building efficient, automated ensemble models to deal with infinite data sets.

The data scientist must also prepare for adapting to many variances in data representation and data set properties over the lifetime of the model. Thus the need for data cleansing and data imputation must be sensed and addressed with time and quality guarantees, as should the need for revising approximations in response to property changes. For hand-crafted (non-AI) models, supervision (semi- or full) for training is inevitable, but preparation for the aforementioned variances makes its automation critical. AutoML-engineered machine learning pipelines will progress to solve these challenges. Data scientists must keep track of developments in this nascent field and adopt state-of-the-art algorithms for automated feature extraction, construction and transformation, automated handling of skewed and missing data, and automated post-processing and calibration of target values.

While handling the expected heterogeneity and non-stationarity in the distributions in the data set and the changes in the data representation, the data scientist should also be able to draw new conclusions based on new correlations. In our parlance this means that evolving approximation functions map new information in the target variables. This variation in the target variable mapping is called concept drift. The data scientist would need mechanisms (automated, as above) to detect and correct for concept drift. While this field is evolving (even a precise definition of concept drift eludes the machine learning community today), the data scientist must closely follow developments, like methods that trigger detection of drift, synthetic drift data generators for model testing, and new classifications of drift handling and correction methods.

Conclusions
In this article we have aimed to raise the awareness of the knowledge worker about the motivations for the evolution of data sciences. We have discussed the handling of infinite data sets and the different challenges that infinite data sets pose, like variations over time in statistical properties of homogeneity and stationarity. We advise the data scientist to prepare to apply techniques from key areas that will direct the evolution of data sciences for infinite data sets, such as ensemble models, AutoML and concept drift.
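
As an illustration of what an automated mechanism to detect concept drift might look like, here is a minimal sketch in Python. It is not a method described in this article: the class `WindowDriftDetector`, the function `simulate`, and the `window` and `margin` parameters are all hypothetical names chosen for this example. The sketch assumes that the model's prediction errors can be observed as a stream of 0/1 outcomes. It freezes the first `window` outcomes as a baseline and flags drift when the error rate over the most recent `window` outcomes exceeds the baseline rate by more than `margin`. The `simulate` function plays the role of a very simple synthetic drift data generator for testing.

```python
from collections import deque
import random


class WindowDriftDetector:
    """Flags drift when the recent error rate exceeds a frozen baseline by a margin.

    Hypothetical helper for illustration only; thresholds are assumptions.
    """

    def __init__(self, window=200, margin=0.15):
        self.window = window
        self.margin = margin
        self.reference = []                  # first `window` outcomes, kept as the baseline
        self.recent = deque(maxlen=window)   # sliding window over the latest outcomes

    def update(self, error):
        """Record one outcome (1 = model was wrong, 0 = right); return True if drift is flagged."""
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        ref_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - ref_rate > self.margin


def simulate(n_before=1000, n_after=1000, seed=0):
    """Synthetic stream: the model errs about 10% of the time, then about 40% after the concept changes.

    Returns the index of the first observation at which drift is flagged, or None.
    """
    rng = random.Random(seed)
    detector = WindowDriftDetector()
    for t in range(n_before + n_after):
        error_prob = 0.10 if t < n_before else 0.40
        error = 1 if rng.random() < error_prob else 0
        if detector.update(error):
            return t
    return None


if __name__ == "__main__":
    print("drift first flagged at observation:", simulate())
```

A fixed baseline and a fixed margin keep the sketch short, but they are crude choices: a production detector would typically use a statistical test with adaptive windows, and a drift flag would trigger a correction step such as retraining or reweighting the models in an ensemble.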
54 INSIGHTZ - VOLUME 03, 2018

