“…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
-Pedro Domingos, author of The Master Algorithm
The COVID-19 pandemic rages on unrelentingly as we head into the winter months of 2020. The glimmer of hope is the group of innovative vaccines that have demonstrated high efficacy in trials and, hopefully, will show high real-world effectiveness as well. Innovation is direly needed during this health crisis, and this mindset is exemplified by DeepMind of Google, where a team devised an impressive machine learning methodology called AlphaFold that is capable of predicting a protein's three-dimensional structure from its amino acid sequence in a matter of minutes to hours (instead of years). This breakthrough by DeepMind in understanding protein structure is expected to lead to improvements in drug design and, in turn, better patient outcomes.
This week, we also celebrated the 100th manuscript submitted to the nascent Intelligence-Based Medicine journal, a journal dedicated to clinicians and data scientists who collaborate on clinical projects. The journal published its inaugural issue last month. While some of the manuscripts are excellent, others lack true multidisciplinary collaboration and thus fall short, particularly on the feature engineering strategy for the project (read on for an explanation). In past weeks, we have discussed a series of small topics in data science (parameters vs. hyperparameters, the imbalanced dataset problem, the different datasets used in splitting, etc.), so the topic for this week is feature engineering and other important concepts that contain the word “feature,” as there is some understandable confusion about this terminology.
An attribute (or a variable) is defined as a column in the data table (e.g. patient age, systolic blood pressure, heart failure class, etc.), and this is coupled with rows of observations (or instances) (e.g. 62 years of age, 126 mmHg, class III, etc.). A feature is an attribute that is useful for the question at hand (so not all attributes are features); in addition, a feature can have its importance scored via correlation coefficients and other univariate methods.
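To make the idea of a univariate feature score concrete, here is a minimal sketch that ranks attributes by the absolute value of their Pearson correlation with an outcome. The data are synthetic, and the column names (`age`, `systolic_bp`, `shoe_size`, `risk_score`) and effect sizes are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data only: column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(60, 10, n)
systolic_bp = rng.normal(130, 15, n)
shoe_size = rng.normal(42, 2, n)  # an attribute unrelated to the outcome

# Hypothetical outcome driven by age and blood pressure plus random noise.
risk_score = 0.5 * age + 0.3 * systolic_bp + rng.normal(0, 5, n)

df = pd.DataFrame({
    "age": age,
    "systolic_bp": systolic_bp,
    "shoe_size": shoe_size,
    "risk_score": risk_score,
})

# Score each attribute by its absolute Pearson correlation with the outcome.
scores = (
    df.drop(columns="risk_score")
      .corrwith(df["risk_score"])
      .abs()
      .sort_values(ascending=False)
)
print(scores)
```

An attribute with a score near zero, such as the unrelated column here, would be a weak candidate for a feature. A univariate score like this ignores interactions between attributes, so it is a starting point rather than a final verdict.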
Clarifying additional terms that contain the word “feature” can improve both the understanding of the data science workflow and the strategies used to improve model performance. Data or feature transformation is a process that modifies the data so that it is in a form the algorithm can use more readily. It encompasses steps such as imputation (replacing missing data with substitute values), binning (grouping continuous values into a smaller number of groups), data scaling (standardization and normalization), outlier detection, etc. This tedious process, which is often estimated to take up 80% or more of the project time, is also called data wrangling (or munging) and data cleaning.
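The transformation steps above can be sketched in a few lines with pandas. This is only an illustration on a toy table: the column names, the choice of median imputation, and the age cut-offs are assumptions, not recommendations for any particular clinical dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "age": [34, 51, 62, 78, 45],
    "systolic_bp": [118.0, np.nan, 126.0, 160.0, 134.0],
})

# Imputation: replace the missing blood pressure with the column median.
df["systolic_bp"] = df["systolic_bp"].fillna(df["systolic_bp"].median())

# Binning: group continuous ages into a few ordered categories.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 40, 65, 120], labels=["<=40", "41-65", ">65"]
)

# Standardization: rescale to zero mean and unit variance (z-score).
bp = df["systolic_bp"]
df["bp_standardized"] = (bp - bp.mean()) / bp.std(ddof=0)

# Normalization: rescale to the [0, 1] range (min-max scaling).
df["bp_normalized"] = (bp - bp.min()) / (bp.max() - bp.min())

print(df)
```

Note that none of these steps adds new information to the table; they only change how the existing values are represented.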
Feature engineering describes the creative and intellectual process of selecting and/or creating features with the contribution of domain knowledge (especially from clinicians in healthcare data science projects) so that an algorithm performs better. Some include the aforementioned data and feature transformation as part of the feature engineering process, although transformation by itself does not create new features. Feature engineering is very much underleveraged even though it is one of the most important aspects of machine learning. Jason Brownlee, author of Clever Algorithms, elegantly defines feature engineering as “the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.” Ryan Baker, who teaches feature engineering, describes it more succinctly as “the art of creating predictor variables”. I think of the feature engineering process as almost akin to a composer writing a musical composition using musical notes (the data), with the modeling process being the orchestral performance.
Feature engineering consists of: 1) feature selection, 2) feature extraction, and 3) feature generation (although some equate any one of these, singly or in combination, with feature engineering):
- Feature selection (or ranking). This process measures the impact of each feature on the model so that the best features can be included (and the less important features removed). By reducing the number of features, this process can decrease model overfitting and complexity while improving accuracy and training speed. Regularization methods like LASSO and ridge regression (a future topic) have an “embedded” form of feature selection: during the modeling process, LASSO can remove features by shrinking their coefficients to exactly zero, whereas ridge regression only lowers the importance of certain features.
- Feature extraction. The related process of creating a new, smaller set of features to reduce the amount of data is termed feature extraction (also called dimensionality reduction). One methodology used for feature extraction with tabular data is principal component analysis (PCA), which reduces the number of features in the data frame by combining the original features into a smaller set of components. This process can be automated.
- Feature generation (or construction). This is the process of constructing new features from the raw data; it requires domain knowledge to combine or split existing features into ones that are more favorable for the model. An example is the pressure-rate product (mean blood pressure × heart rate) as a surrogate index of myocardial oxygen consumption.
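The three components above can be sketched together on synthetic data. This is an illustrative sketch, not a recommended pipeline: the column names, units, and effect sizes are assumptions, the outcome `y` is invented, and scikit-learn's `Lasso`, `PCA`, and `StandardScaler` are simply one convenient way to demonstrate embedded selection and extraction.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; names, units, and effects are assumptions.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "mean_bp": rng.normal(90, 10, n),     # mmHg
    "heart_rate": rng.normal(75, 12, n),  # beats per minute
    "age": rng.normal(60, 10, n),         # years
    "noise": rng.normal(0, 1, n),         # irrelevant attribute
})

# Feature generation: combine two raw attributes using domain knowledge.
df["pressure_rate_product"] = df["mean_bp"] * df["heart_rate"]

# Hypothetical outcome that depends on the generated feature and on age.
y = (
    0.002 * df["pressure_rate_product"]
    + 0.1 * df["age"]
    + rng.normal(0, 0.5, n)
)

# Put all columns on a comparable scale before LASSO and PCA.
X = StandardScaler().fit_transform(df)

# Feature selection (embedded): LASSO can shrink coefficients to exactly zero,
# so features with a zero coefficient are effectively dropped.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = [name for name, coef in zip(df.columns, lasso.coef_) if coef != 0]
print("Selected by LASSO:", selected)

# Feature extraction: PCA replaces the five columns with two derived components.
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("Shape after PCA:", components.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The contrast is worth noting: selection keeps a subset of the original, interpretable columns, extraction produces new columns that are mixtures of the originals, and generation adds a column whose meaning comes from clinical knowledge rather than from the algorithm.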
Lastly, feature learning (or representation learning) is an automated strategy, either supervised or unsupervised, for constructing or extracting features directly from raw data. It is achieving some success in deep learning methodologies, although these require more data and are less explainable.