Page 54 - Flytxt

entail. Also, multiple models may work on multiple data representations (such as different granularities of aggregation); hence the ability to evaluate many models in parallel is necessary. An evolving conceptual tool for such parallel evaluation is ensemble modelling: choosing a combination of outputs from multiple models. We anticipate more focus in the data science community on building efficient, automated ensemble models to deal with infinite data sets.

The data scientist must also prepare to adapt to many variations in data representation and data set properties over the lifetime of the model. Thus the need for data cleansing and data imputation must be sensed and addressed with time and quality guarantees, as should the need for revising approximations in response to property changes. For hand-crafted (non-AI) models, supervision (semi- or full) for training is inevitable, but preparing for the aforementioned variations makes its automation critical. AutoML-engineered machine learning pipelines will progress to solve these challenges. Data scientists must keep track of developments in this nascent field and adopt state-of-the-art algorithms for automated feature extraction, construction and transformation, automated handling of skewed and missing data, and automated post-processing and calibration of target values.

While handling the expected heterogeneity and non-stationarity in the distributions in the data set and the changes in the data representation, the data scientist should also be able to draw new conclusions on the basis of new correlations. In our parlance this means that evolving approximation functions map new information in the target variables. This variation in the target variable mapping is called concept drift. The data scientist would need mechanisms (automated, as above) to detect and correct for concept drift. While this field is evolving (even a precise definition of concept drift eludes the machine learning community today), the data scientist must closely follow developments such as methods that trigger detection of drift, synthetic drift data generators for model testing, and new classifications of drift handling and correction methods.

Conclusions
In this article we raise the knowledge worker's awareness of the motivations for the evolution of data sciences. We have discussed the handling of infinite data sets and the different challenges that infinite data sets pose, like variations over time in statistical properties of homogeneity and stationarity. We advise the data scientist to prepare to apply techniques from key areas that will direct the evolution of data sciences for infinite data sets, such as ensemble models, AutoML and concept drift.
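
As an illustration of the ensemble modelling idea discussed in this article (choosing a combination of outputs from multiple models), the following is a minimal sketch in Python, not a method prescribed by the article. The three toy models, their names and the weights are hypothetical stand-ins for models fitted on different granularities of aggregation, and a weighted average is only one of many possible ways to combine outputs.

```python
from statistics import mean


def ensemble_predict(models, x, weights=None):
    """Combine the outputs of several models into a single prediction.

    With no weights the plain average is used; otherwise a weighted average.
    """
    outputs = [model(x) for model in models]
    if weights is None:
        return mean(outputs)
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)


# Hypothetical models, e.g. fitted on hourly, daily and weekly aggregates.
def hourly_model(x):
    return 2.0 * x


def daily_model(x):
    return 2.0 * x + 1.0


def weekly_model(x):
    return 2.0 * x - 1.0


models = [hourly_model, daily_model, weekly_model]

# Unweighted combination of the three outputs (6.0, 7.0 and 5.0 at x = 3.0).
print(ensemble_predict(models, 3.0))

# Weighted combination that trusts the daily model twice as much.
print(ensemble_predict(models, 3.0, weights=[1, 2, 1]))
```

In an automated setting the weights would typically be revised as new data arrive, for example from each model's recent error, rather than fixed by hand as they are here.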

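The article's call for automated mechanisms to detect concept drift can be made concrete with a small sketch. What follows is a deliberately simple heuristic, assumed here for illustration only: it compares the error rate of a model in a fixed reference window against the error rate in a sliding window of recent predictions, and flags drift when the gap exceeds a threshold. The class name, the window size and the threshold are hypothetical, and the synthetic error stream plays the role of a very basic drift data generator.

```python
from collections import deque


class WindowDriftDetector:
    """Flags drift when the recent error rate exceeds the baseline error rate."""

    def __init__(self, window=50, threshold=0.2):
        self.window = window
        self.threshold = threshold
        self.reference = []                 # first `window` errors: the baseline
        self.recent = deque(maxlen=window)  # sliding window of the latest errors

    def update(self, error):
        """Feed one 0/1 prediction error; return True if drift is flagged."""
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        reference_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - reference_rate > self.threshold


def first_drift_index(stream, detector):
    """Return the index of the first flagged observation, or None."""
    for index, error in enumerate(stream):
        if detector.update(error):
            return index
    return None


# Synthetic error stream: 10% errors for 200 steps, then 50% errors.
before_drift = [1 if i % 10 == 0 else 0 for i in range(200)]
after_drift = [1 if i % 2 == 0 else 0 for i in range(200)]

flagged_at = first_drift_index(before_drift + after_drift, WindowDriftDetector())
print("drift flagged at index:", flagged_at)
```

A detector like this only signals that the mapping to the target variable may have changed; correcting for the drift, for example by retraining on recent data, is a separate step.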

       54                                                                              INSIGHTZ - VOLUME 03, 2018