The Lead and Copper Rule Revisions (LCRR) mandate the identification and replacement of lead service lines in water systems across the United States. To aid in this process, some state authorities have issued guidelines that allow predictive modeling, such as machine learning (ML), to be used to identify the presence of lead in service lines whose lead status is unknown.
To achieve this, water service line inventories are constructed from various resources, including historical records, census tract data, tax assessor information, and more. This digital inventory typically results in a tabular dataset containing valuable features, such as the service line installation year, diameter, the year a property was built, and neighborhood demographics, to name a few.
Once the service line inventory dataset is established, it can be used to predict the presence of lead in service lines. This problem is commonly formulated as a binary classification task, where the classes are lead (LD) and non-lead (NLD). The labels for each class come from service lines whose lead status is known and verified. Data scientists then work to build robust and reliable predictive models that enable utilities to estimate the probability of lead presence in service lines. A common question concerns the contribution of each feature to these predictions: “Which features have the most effect on such predictions?” In other words, which features does the model rely on to correctly identify lead service lines? To answer this type of question, we dive into the concept of feature importance.
What is Feature Importance?
Feature importance refers to techniques that assign scores to input features based on their significance in predicting the target variable. These scores help identify which features contribute the most to the predictions of a machine learning model. Understanding feature importance is valuable for model interpretation, feature selection, and improving model performance. Interpretation is particularly important in regulated industries, where transparency is required. In addition, feature importance can provide valuable insights into the underlying patterns in the data. For instance, for a water utility, knowing that installation year and diameter are significant predictors of lead presence can guide decisions on where to prioritize lead service line replacements.
How to Calculate Feature Importance?
There are several methods to calculate feature importance. Here, we focus on permutation feature importance, a simple yet powerful technique. The method randomly shuffles the values of one feature and measures the change in the already-trained model’s performance (e.g., accuracy). If the model’s performance drops significantly, the feature is considered important. Because the method does not require retraining the model, it is computationally efficient. It is also model-agnostic and can be applied to any machine learning model. For model-specific alternatives, you can check out the impurity-based importances of tree-based models, which rely on Gini impurity or entropy.
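As a rough illustration of the two families of methods mentioned above, the sketch below compares permutation importance with the impurity-based importance of a random forest using scikit-learn. The dataset is synthetic, and the relationship between install year and lead status is an assumption made up for this example, not a finding from a real inventory. The choice of model, its hyperparameters, and the variable names are likewise illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
feature_names = ["diameter", "population", "install_year"]

# Synthetic inventory (values invented for illustration only).
X = np.column_stack([
    rng.choice([0.5, 0.75, 1.0, 2.0], size=n),   # diameter in inches
    rng.integers(500, 5000, size=n),             # census tract population
    rng.integers(1900, 2020, size=n),            # install year
])

# Assumed relationship: older lines are much more likely to be lead.
p_lead = np.where(X[:, 2] < 1960, 0.8, 0.05)
y = (rng.random(n) < p_lead).astype(int)         # 1 = lead (LD), 0 = non-lead (NLD)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: mean drop in validation accuracy when a column is shuffled.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=20, random_state=0
)

# Impurity-based importance: computed from the training data during tree building.
for name, perm, impurity in zip(
    feature_names, result.importances_mean, model.feature_importances_
):
    print(f"{name:>12}  permutation={perm:+.3f}  impurity={impurity:.3f}")
```

Note that the permutation scores are computed on held-out data with the already-fitted model, while the impurity-based scores are a by-product of training and only exist for tree-based models.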
To illustrate permutation feature importance, let’s consider a synthetic dataset with three features (diameter, population, and install year) and a target, the lead status. Once a model is trained, we establish its baseline performance, such as accuracy, on a validation dataset. In the first step, we randomly shuffle one feature column in the validation dataset while leaving the target and all other columns intact. Next, we make predictions with the same trained model and evaluate its performance on the shuffled data. The performance deterioration is the difference between this new performance and the baseline. We then repeat these two steps many times (e.g., 1000) and average the resulting values to obtain the importance of that feature. Finally, we repeat the whole process for each feature to obtain the importance of each feature individually.
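The steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the data are synthetic, the labelling rule with 10% label noise is invented, and `predict` is a hypothetical hand-written stand-in for an already-trained model. In practice any fitted model's prediction function could be passed in instead. Importance is reported here as baseline accuracy minus shuffled accuracy, so larger positive values mean a more important feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
feature_names = ["diameter", "population", "install_year"]

# Synthetic validation set (values invented for illustration only).
diameter = rng.choice([0.5, 0.75, 1.0, 2.0], size=n)
population = rng.integers(500, 5000, size=n).astype(float)
install_year = rng.integers(1900, 2020, size=n).astype(float)
X_val = np.column_stack([diameter, population, install_year])

# Assumed ground truth: old, small-diameter lines are lead, with 10% label noise.
clean = ((install_year < 1960) & (diameter <= 1.0)).astype(int)
flip = rng.random(n) < 0.1
y_val = np.where(flip, 1 - clean, clean)         # 1 = lead, 0 = non-lead


def predict(X):
    """Hypothetical stand-in for a trained model; it ignores population."""
    return ((X[:, 2] < 1960) & (X[:, 0] <= 1.0)).astype(int)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def permutation_importance_manual(predict_fn, X, y, n_repeats=100, rng=None):
    """Average drop in accuracy when each column is shuffled in turn."""
    if rng is None:
        rng = np.random.default_rng()
    baseline = accuracy(y, predict_fn(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = np.empty(n_repeats)
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j; the target and other columns stay intact.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[r] = baseline - accuracy(y, predict_fn(X_perm))
        importances[j] = drops.mean()
    return baseline, importances


baseline, importances = permutation_importance_manual(
    predict, X_val, y_val, n_repeats=100, rng=rng
)
print(f"baseline accuracy: {baseline:.3f}")
for name, score in zip(feature_names, importances):
    print(f"{name:>12}: {score:+.3f}")
```

Because the stand-in model never reads the population column, shuffling it leaves every prediction unchanged and its importance is exactly zero, while shuffling install year or diameter lowers accuracy.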
What Feature Importance is Not
In this blog post, we discussed why feature importance matters for both model improvement and downstream decision making. However, it is crucial to know that feature importance does not imply causation. Just because a feature is important in making predictions does not mean it causes the target outcome. Additionally, feature importance scores can vary significantly across different models or even across different training runs of the same model. Finally, some methods of calculating feature importance, especially simpler ones, do not account for interactions between features: a feature might appear unimportant on its own but could be crucial when combined with other features.