Contents¶

  • Feature selection
    • Variance threshold
    • Filter-based methods
      • Pearson correlation
      • Chi-square Test
    • Information criteria
      • AIC
      • BIC
    • Wrapper-based methods
      • Recursive feature elimination
      • Forward feature selection
      • Backward Elimination
    • Embedded methods
      • Ridge Regression
      • Tree-based
      • Lasso
    • Combined results
  • Predictive analysis

Feature Selection ¶

Different feature selection methods can be used to determine whether audio and lyrics features can predict popularity, measured by number of streams, and to identify the best features for predicting song popularity. If any of the features prove useful, a predictive model can be built from them. Performing feature selection to reduce the number of features in a model helps avoid overfitting (fitting the training data well but failing to generalize to new samples), simplifies the model, and excludes non-informative features.

In [1]:
# IMPORT DEPENDENCIES 
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import plotly.graph_objects as go
from plotly.subplots import make_subplots
In [2]:
# READ DATA 
song_data = pd.read_csv('../Data/merged_finaltop100_revised.csv') 
# REMOVE ROWS WITH NULL VALUES 
song_data = song_data.dropna()
song_data.head()
Out[2]:
Unnamed: 0 track_id artist_names track_name source rank weeks_on_chart streams country danceability ... duration_ms time_signature album_release_date lyrics lyrics_trans continent iso_alpha3 len_words_orig len_words_trans lyrics_clean
0 0 0yLdNVWF3Srea0uzk55zFn Miley Cyrus Flowers Columbia 1 5 124198 United Arab Emirates 0.707 ... 200455.0 4.0 2023-01-13 We were good, we were gold\nKinda dream that c... we were good we were gold kinda dream that can... Asia ARE 334 334 good gold dream sell right til build home watc...
1 1 1Qrg8KqiBpW07V7PNxwwwL SZA Kill Bill Top Dawg Entertainment/RCA Records 2 10 106927 United Arab Emirates 0.644 ... 153947.0 4.0 2022-12-08 I'm still a fan even though I was salty\nHate ... im still a fan even though i was salty hate to... Asia ARE 362 362 fan even though salty hate see broad know happ...
2 2 6AQbmUe0Qwf5PZnt4HmTXv PinkPantheress, Ice Spice Boy's a liar Pt. 2 Warner Records 3 2 83627 United Arab Emirates 0.696 ... 131013.0 4.0 2023-02-03 Take a look inside your heart\nIs there any ro... take a look inside your heart is there any roo... Asia ARE 372 372 take look inside heart room room would hold br...
3 3 0WtM2NBVQNNJLh6scP13H8 Rema, Selena Gomez Calm Down (with Selena Gomez) Mavin Records / Jonzing World 4 25 79714 United Arab Emirates 0.801 ... 239318.0 4.0 2022-08-25 Vibez\nOh, no\nAnother banger\nBaby, calm down... vibez oh no another banger baby calm down calm... Asia ARE 495 495 another banger baby calm calm girl body put he...
4 4 2dHHgzDwk4BJdRwy9uXhTO Metro Boomin, The Weeknd, 21 Savage Creepin' (with The Weeknd & 21 Savage) Republic Records 5 11 79488 United Arab Emirates 0.715 ... 221520.0 4.0 2022-12-02 Ooh, ooh-ooh\nOoh-ooh-ooh, ooh, ooh-ooh (Just ... ooh oohooh oohoohooh ooh oohooh just cant beli... Asia ARE 458 456 believe man want somebody say saw person kiss ...

5 rows × 30 columns

In [86]:
# CHANGE KEY TO CATEGORICAL VARIABLE 
song_data['key'] = song_data.key.astype('category')

I. Variance Threshold ¶

Variance threshold is a widely used technique for feature selection, which involves eliminating features with low variance. Such features are characterized by having nearly constant values or showing minimal variation across the data points. By utilizing the variance of the features, we can identify and exclude the features that do not meet the specified threshold. Using this method, we assume that features with higher variance provide more useful information. It is important to note that this method does not take into account the relationship between the target and features.

In [87]:
# IMPORT DEPENDENCIES 
from sklearn.feature_selection import VarianceThreshold
In [88]:
# CHOOSE QUANTITATIVE VARIABLES 
X = song_data[['valence', 'mode', 'loudness', 
                'acousticness', 'tempo', 'energy', 'liveness', 
                'key', 'duration_ms', 'instrumentalness', 'danceability', 
                'speechiness', 'len_words_orig', 'len_words_trans']]
In [89]:
# SELECT FEATURES USING 0 VARIANCE THRESHOLD 
var_thresh = VarianceThreshold(threshold =0.0) #set threshold to 0.0 
var_thresh.fit(X) 
print(var_thresh.get_support())
print(var_thresh.fit_transform(X).var(axis=0)) #print variances 
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True]
[4.83936367e-02 2.49423408e-01 5.76396379e+00 5.73799561e-02
 8.04333374e+02 2.28960102e-02 1.38679966e-02 1.30240889e+01
 2.15771595e+09 8.49994820e-03 1.81675201e-02 9.17713769e-03
 2.88829014e+04 2.79634791e+04]

This result shows that all features have non-zero variance (True = retained), so none of them are removed at this threshold.
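As a minimal sketch (not part of the original analysis), a stricter cutoff would drop near-constant features; the 0.01 threshold below is a hypothetical value chosen purely for illustration and reuses the X defined above.

from sklearn.feature_selection import VarianceThreshold

var_thresh_01 = VarianceThreshold(threshold=0.01)       # hypothetical, stricter cutoff
X_reduced = var_thresh_01.fit_transform(X)              # X as defined above
kept = X.columns[var_thresh_01.get_support()].tolist()  # features whose variance is at least 0.01
print(f"{X_reduced.shape[1]} of {X.shape[1]} features kept: {kept}")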

II. Filter-based methods ¶

Unlike wrapper methods, filter-based methods do not use machine learning models to determine whether a feature is useful. They instead rely on simple statistical measures to rank the importance of features.

1. Pearson Correlation ¶

In feature selection, Pearson's correlation coefficient can be used to identify the features that are most strongly correlated with the target variable. The absolute values of Pearson's correlations between the target (streams) and the quantitative features (audio & lyrics features) can be used to choose the top n features. In this case, the top 6 features are chosen, with absolute correlations ranging from approximately 0.03 to 0.05.
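For reference, Pearson's correlation between a feature $x$ and the target $y$ over $n$ songs is

$r_{xy} = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

It ranges from -1 to 1; its absolute value is used below so that strong negative correlations rank alongside strong positive ones.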

In [90]:
corr = [] #list of correlations 
feature_names = ['valence', 'mode', 'loudness', 
                'acousticness', 'tempo', 'energy', 'liveness', 
                'key', 'duration_ms', 'instrumentalness', 'danceability', 
                'speechiness', 'len_words_orig', 'len_words_trans'] #list of features 
In [91]:
# COMPUTE CORRELATION BETW EACH FEATURE AND TARGET VARIABLE  
corr = [np.corrcoef(song_data[i], song_data['streams'])[0,1] for i in feature_names] 
corr = np.abs(corr) #absolute value of Pearson's correlation 
corr_feature = X.iloc[:,np.argsort(np.abs(corr))[-6:]].columns.tolist() #chosen top 6 features 
corr_support = [True if i in corr_feature else False for i in feature_names]
print(corr_support)
[True, False, False, True, False, False, False, False, False, False, True, True, True, True]
In [92]:
# DISPLAY RESULTS AS DATAFRAME 
corr_df = pd.DataFrame({'feature_names': feature_names, 
                        'abs_pearsoncorr': corr, 'support':corr_support })
corr_df.sort_values('abs_pearsoncorr', ascending=False)
Out[92]:
feature_names abs_pearsoncorr support
13 len_words_trans 0.049183 True
11 speechiness 0.047060 True
12 len_words_orig 0.038529 True
3 acousticness 0.036375 True
0 valence 0.029379 True
10 danceability 0.028526 True
6 liveness 0.020586 False
1 mode 0.020552 False
4 tempo 0.019645 False
2 loudness 0.019227 False
9 instrumentalness 0.018003 False
7 key 0.013108 False
8 duration_ms 0.008091 False
5 energy 0.005559 False

Generally, it appears that audio and lyrics features are weakly correlated with the number of streams.

2. Chi-square Test ¶

We can also conduct a test of significance called the Chi-Square Test to determine whether the association between the target and each of the quantitative features is statistically significant. After calculating the chi-square metric between the target (streams) and the quantitative features (audio & lyrics features), we can choose the features with the largest chi-square values. The following explains what small and large chi-square values mean.

  • Observed counts are close to expected counts when the two variables are independent $\rightarrow$ smaller chi-square value
  • Observed counts are far from expected counts when the two variables are dependent $\rightarrow$ higher chi-square value

Higher chi-square values mean a feature is more dependent on the target, making it a candidate for model training. For this test, we use the following hypotheses.

  • Null hypothesis: $H_0$: target and a feature are independent (i.e. no significant association)

  • Alternative hypothesis: $H_1$: target and a feature are not independent
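For reference, the chi-square statistic compares observed counts $O_i$ with the counts $E_i$ expected under independence:

$\chi^2 = \sum_{i} \dfrac{(O_i - E_i)^2}{E_i}$

The larger the discrepancy between observed and expected counts, the larger the statistic and the stronger the evidence against $H_0$.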

In [93]:
# IMPORT DEPENDENCIES 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
In [94]:
# GET VARIABLES 
X = song_data[feature_names] #independent variables 
y = song_data['streams'] #target variable
X_norm = MinMaxScaler().fit_transform(X) #scale features (models can be sensitive to scale of input)
In [95]:
# GET CHI-SQUARE VALUES FOR ALL FEATURES 
chi_scores = chi2(X_norm,y) #compute chi-sq values 
scores_df = pd.DataFrame({'feature_names': feature_names, 
                          'chi-square_values': chi_scores[0], 
                          'p_values': chi_scores[1]})
print(scores_df.sort_values('chi-square_values', ascending= False))
       feature_names  chi-square_values  p_values
9   instrumentalness        3368.369840       1.0
1               mode        3207.603840       1.0
3       acousticness        1540.538880       1.0
7                key        1520.387532       1.0
11       speechiness         902.343689       1.0
0            valence         717.389487       1.0
6           liveness         662.960253       1.0
4              tempo         493.924591       1.0
10      danceability         333.264595       1.0
5             energy         261.738070       1.0
8        duration_ms         191.133601       1.0
12    len_words_orig         175.036027       1.0
13   len_words_trans         157.826374       1.0
2           loudness         125.937553       1.0
In [18]:
# GET TOP 6 FEATURES 
chi_selector = SelectKBest(chi2, k=6) #get top 6 features 
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(f"{len(chi_feature)} selected features: {chi_feature}")
6 selected features: ['valence', 'mode', 'acousticness', 'key', 'instrumentalness', 'speechiness']

Based on the p-values obtained above, which equal 1.0 for all features, none of the features has a statistically significant association with the target variable (streams). This is consistent with the weak correlations obtained in the previous section.

Information criteria ¶

Information criteria are commonly used in feature selection to evaluate how much information a candidate set of features provides about the target variable. They are rooted in information theory, where entropy measures the uncertainty in a random variable. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are two such measures that are often used in model selection; both reward goodness of fit while penalizing model complexity.
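Concretely, for a model with $k$ estimated parameters, $n$ observations, and maximized likelihood $\hat{L}$,

$\mathrm{AIC} = 2k - 2\ln\hat{L} \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}$

Lower values are better for both; BIC penalizes additional parameters more heavily than AIC whenever $\ln(n) > 2$, so it tends to select smaller models.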

AIC ¶

In [96]:
# IMPORT DEPENDENCIES 
import pandas as pd
import statsmodels.api as sm
import itertools
In [97]:
# CONVERT INDEPENDENT FEATURES TO DATAFRAME 
xnorm_df = pd.DataFrame(X_norm, columns= X.columns)
xnorm_df
Out[97]:
valence mode loudness acousticness tempo energy liveness key duration_ms instrumentalness danceability speechiness len_words_orig len_words_trans
0 0.648231 1.0 0.754958 0.064286 0.329883 0.672251 0.015283 0.000000 0.250501 0.000005 0.644617 0.051979 0.107597 0.108410
1 0.406654 1.0 0.689064 0.052994 0.164828 0.728436 0.151985 0.727273 0.167554 0.150470 0.562905 0.018956 0.116726 0.117608
2 0.871795 1.0 0.572892 0.256352 0.414990 0.805431 0.244322 0.454545 0.126652 0.000134 0.630350 0.031950 0.119987 0.120894
3 0.813520 1.0 0.714133 0.388602 0.267317 0.802310 0.102101 1.000000 0.319813 0.000699 0.766537 0.017763 0.160091 0.161301
4 0.146006 0.0 0.677108 0.424207 0.215847 0.608782 0.068351 0.090909 0.288071 0.000000 0.654994 0.030043 0.148027 0.148489
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6804 0.762662 1.0 0.732994 0.646996 0.795130 0.587972 0.093611 0.727273 0.278226 0.000005 0.670558 0.258464 0.191392 0.192838
6805 0.776436 1.0 0.588462 0.026849 0.454947 0.458953 0.072914 0.000000 0.252187 0.000000 0.817121 0.047330 0.160743 0.161958
6806 0.422547 0.0 0.460936 0.014336 0.295818 0.436063 0.006686 0.727273 0.634991 0.000860 0.810636 0.047210 0.082165 0.134691
6807 0.296461 0.0 0.312882 0.016880 0.301495 0.451670 0.076098 0.090909 0.696106 0.142111 0.767834 0.032308 0.132377 0.151774
6808 0.595253 0.0 0.372382 0.011895 0.295824 0.698262 0.240076 0.909091 0.568898 0.001766 0.810636 0.025632 0.133681 0.169185

6809 rows × 14 columns

In [98]:
# CREATE ALL POSSIBLE SUBSETS OF PREDICTORS 
subsets = []
for k in range(1, X.shape[1] + 1):
    subsets.extend(itertools.combinations(range(X.shape[1]), k))

# INITIALIZE VARIABLES FOR STORING RESULTS 
best_model = None
best_aic = np.inf
best_bic = np.inf
best_r2 = -np.inf
best_adjr2 = -np.inf
In [99]:
# LOOP THROUGH ALL SUBSETS AND FIT LINEAR REGRESSION MODELS 
for subset in subsets:
    model = sm.OLS(y, sm.add_constant(X_norm[:, list(subset)])).fit()
    aic = model.aic
    bic = model.bic
    r2 = model.rsquared
    adjr2 = model.rsquared_adj
    #Check if current model is the best so far
    if aic < best_aic: #lower aic = best model 
        best_model = model
        best_aic = aic
        best_bic = bic
        best_r2 = r2
        best_adjr2 = adjr2
In [100]:
# BEST FEATURES USING AIC 
variables = list(best_model.params.index)
variables.remove('const')
results = pd.DataFrame({'AIC': [best_aic], 'BIC': [best_bic],  'R2': [best_r2], 'AdjR2': [best_adjr2]}, 
                       index=[str(variables)])
results
Out[100]:
AIC BIC R2 AdjR2
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'] 207483.680486 207545.114491 0.008191 0.007024
In [101]:
# CONVERT TO DATAFRAME 
aic_df = pd.DataFrame({'feature_names': X.columns.tolist(),  'support': True})
aic_df.loc[8:, 'support'] = False 
print(aic_df)
       feature_names  support
0            valence     True
1               mode     True
2           loudness     True
3       acousticness     True
4              tempo     True
5             energy     True
6           liveness     True
7                key     True
8        duration_ms    False
9   instrumentalness    False
10      danceability    False
11       speechiness    False
12    len_words_orig    False
13   len_words_trans    False

According to AIC, the best subset contains 8 features (x1 through x8, i.e., the first eight columns, valence through key). However, the extremely low R-squared (0.008191) and adjusted R-squared values suggest that the model explains very little of the variability in the data. In other words, only about 0.82% of the variation in the dependent variable is explained by the independent variables included in the model. This implies that the selected independent variables do not have a strong relationship with the dependent variable (streams), which is consistent with the previous results.

BIC ¶

In [102]:
# LOOP THROUGH ALL SUBSETS AND FIT LINEAR REGRESSION MODELS 
best_model, best_aic, best_bic = None, np.inf, np.inf #re-initialize so the BIC search is independent of the AIC run
for subset in subsets:
    model = sm.OLS(y, sm.add_constant(X_norm[:, list(subset)])).fit()
    aic = model.aic
    bic = model.bic
    r2 = model.rsquared
    adjr2 = model.rsquared_adj
    #check if current model is the best so far
    if bic < best_bic:  #lower bic = best model 
        best_model = model
        best_aic = aic
        best_bic = bic
        best_r2 = r2
        best_adjr2 = adjr2
In [103]:
# BEST FEATURES USING BIC 
variables = list(best_model.params.index)
variables.remove('const')
results = pd.DataFrame({'BIC': [best_bic], 'AIC': [best_aic],  'R2': [best_r2], 'AdjR2': [best_adjr2]}, 
                       index=[str(variables)])
results
Out[103]:
BIC AIC R2 AdjR2
['x1'] 207522.844669 207509.192668 0.002419 0.002272
In [104]:
# CONVERT TO DATAFRAME 
bic_df = pd.DataFrame({'feature_names': X.columns.tolist(),  'support': False})
bic_df.loc[0, 'support'] = True  
print(bic_df)
       feature_names  support
0            valence     True
1               mode    False
2           loudness    False
3       acousticness    False
4              tempo    False
5             energy    False
6           liveness    False
7                key    False
8        duration_ms    False
9   instrumentalness    False
10      danceability    False
11       speechiness    False
12    len_words_orig    False
13   len_words_trans    False

Using BIC, only 1 feature is selected. However, the R-squared and adjusted R-squared values are even lower than those of the AIC-selected model. Hence, the chosen independent variable does not have a strong relationship with the dependent variable (streams).

Wrapper-based methods ¶

Wrapper-based methods are useful in feature selection because they use a machine learning algorithm to evaluate the performance of each subset of features. They involve training and evaluating a model with different subsets of features, and the selection of features is based on the performance of the model on a validation set. Wrapper methods can provide a more accurate estimate of the performance of each subset of features compared to filter-based methods, which use simple statistical measures to rank the importance of features. By using a machine learning algorithm, wrapper methods can capture complex interactions between features and identify non-linear relationships between features and the outcome variable.

Additionally, wrapper methods allow for the selection of subsets of features that are specific to the machine learning algorithm being used, rather than relying on a single criterion, such as variance or correlation, as in filter-based methods. This can lead to more efficient and accurate models that are tailored to the specific problem being solved. However, wrapper-based methods can be computationally expensive and time-consuming, as they require training and evaluating a model for each subset of features. In addition, wrapper methods can be prone to overfitting, as they can select features that are specific to the training data and may not generalize well to new data.

1. Recursive Feature Elimination ¶

RFE (recursive feature elimination) reduces model complexity by removing features one by one until the desired number of features remains, making it a form of backward elimination. We can use RFE to select features by recursively considering smaller and smaller sets of features. RFE can be used with different estimators, such as logistic regression or linear regression. Since the target variable in this case is continuous, RFE is used with a linear regression estimator.

RFE - Linear Regression

In [105]:
# IMPORT DEPENDENCIES 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
In [106]:
# PERFORM RFE AND SELECT TOP 6 FEATURES 
estimator = LinearRegression() #define estimator 
rfe_lr = RFE(estimator, n_features_to_select=6, step=1)
rfe_lr.fit(X_norm, y)
rfe_lr_support = rfe_lr.get_support()
rfe_lr_feature = X.loc[:,rfe_lr_support].columns.tolist()
#print(f"Initial features: {X.columns.tolist()}")
#print(f"Feature Ranking: {rfe_lr.ranking_}")
In [107]:
# CONVERT TO DATAFRAME 
df = pd.DataFrame({'feature_names': X.columns.tolist(), 
                          'feature_ranking': rfe_lr.ranking_, 'support': rfe_lr_support})
print(df.sort_values('feature_ranking'))
print(f"{len(rfe_lr_feature)} selected features: {rfe_lr_feature} ")
       feature_names  feature_ranking  support
0            valence                1     True
2           loudness                1     True
3       acousticness                1     True
11       speechiness                1     True
12    len_words_orig                1     True
13   len_words_trans                1     True
10      danceability                2    False
6           liveness                3    False
5             energy                4    False
9   instrumentalness                5    False
8        duration_ms                6    False
4              tempo                7    False
7                key                8    False
1               mode                9    False
6 selected features: ['valence', 'loudness', 'acousticness', 'speechiness', 'len_words_orig', 'len_words_trans'] 
In [108]:
# EVALUATE PERFORMANCE 
X_selected = rfe_lr.transform(X_norm)
#split transformed dataset & target variable into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
estimator.fit(X_train, y_train) #fit a regression model on training set 
y_pred = estimator.predict(X_test) #predict on test set
mse = mean_squared_error(y_test, y_pred) #compute mse
print(f'MSE: {mse}')
MSE: 1046198083685.7289

RFECV - DecisionTreeRegressor

RFECV is a feature selection method that combines the recursive feature elimination (RFE) algorithm with cross-validation. It recursively eliminates less important features based on their coefficients or importance scores and then evaluates the performance of the resulting model using cross-validation. The RFE algorithm recursively eliminates features from the dataset until the desired number of features is reached, while RFECV uses cross-validation to estimate the performance of the model with different numbers of features. RFECV is advantageous for the following reasons:

  • RFECV avoids overfitting by using cross-validation to estimate the model's performance.
  • RFECV helps to identify the optimal number of features to use in the model.
  • RFECV reduces the risk of selecting a suboptimal subset of features by evaluating multiple subsets using cross-validation.
  • RFECV can handle noisy or correlated features by considering the joint effect of features in the model.
In [109]:
# IMPORT DEPENDENCIES 
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeRegressor
In [110]:
# PERFORM RFECV
estimator = DecisionTreeRegressor() #use decision tree estimator 
rfecv_dtr = RFECV(estimator=estimator,step=1, cv=5)
rfecv_dtr.fit(X_norm, y)
#print(f"Feature Ranking: {rfecv_dtr.ranking_}")
#print(f"Initial features: {X.columns.tolist()}")
#print(rfecv_dtr.support_)
Out[110]:
RFECV(cv=5, estimator=DecisionTreeRegressor())
In [111]:
# CONVERT TO DATAFRAME 
df = pd.DataFrame({'feature_names': X.columns.tolist(), 
                          'feature_ranking': rfecv_dtr.ranking_, 'support': rfecv_dtr.support_})
print(df.sort_values('feature_ranking'))
print(f"Number of features selected with CV: {rfecv_dtr.n_features_}")
       feature_names  feature_ranking  support
4              tempo                1     True
0            valence                2    False
8        duration_ms                3    False
2           loudness                4    False
13   len_words_trans                5    False
3       acousticness                6    False
6           liveness                7    False
12    len_words_orig                8    False
11       speechiness                9    False
10      danceability               10    False
5             energy               11    False
9   instrumentalness               12    False
7                key               13    False
1               mode               14    False
Number of features selected with CV: 1
In [112]:
# EVALUATE PERFORMANCE
X_selected = rfecv_dtr.transform(X_norm)
#split transformed dataset & target variable into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
estimator.fit(X_train, y_train) #fit a regression model on training set 
y_pred = estimator.predict(X_test) #predict on test set
mse = mean_squared_error(y_test, y_pred) #compute mse
print(f'MSE: {mse}')
MSE: 1683888684985.9526
In [113]:
# PERFORM RFECV WITH SPECIFIED SCORING AND DIFFERENT STEP AND CV
rfecv_dtr = RFECV(estimator=estimator,scoring="neg_mean_squared_error", step=2, cv=2)
rfecv_dtr.fit(X_norm, y)
df = pd.DataFrame({'feature_names': X.columns.tolist(), 
                          'feature_ranking': rfecv_dtr.ranking_, 'support': rfecv_dtr.support_})
print(df.sort_values('feature_ranking'))
print(f"Number of features selected with CV: {rfecv_dtr.n_features_}")
       feature_names  feature_ranking  support
4              tempo                1     True
3       acousticness                2    False
0            valence                3    False
6           liveness                3    False
8        duration_ms                4    False
13   len_words_trans                4    False
2           loudness                5    False
5             energy                5    False
11       speechiness                6    False
12    len_words_orig                6    False
9   instrumentalness                7    False
10      danceability                7    False
1               mode                8    False
7                key                8    False
Number of features selected with CV: 1
In [114]:
# EVALUATE PERFORMANCE
X_selected = rfecv_dtr.transform(X_norm)
#split transformed dataset & target variable into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
estimator.fit(X_train, y_train) #fit a regression model on training set 
y_pred = estimator.predict(X_test) #predict on test set
mse = mean_squared_error(y_test, y_pred) #compute mse
print(f'MSE: {mse}')
MSE: 1683888684985.9526

The MSE values computed for RFE and RFECV with different estimators are extremely high, which suggests that none of the independent variables is useful for predicting song popularity measured by number of streams. This also supports the results obtained with the previous methods.

2. Forward Feature Selection ¶

Forward selection is a feature selection method that works by iteratively adding one feature at a time to the model until a stopping criterion is met. The algorithm starts with an empty set of features and evaluates the performance of the model with each additional feature, selecting the feature that improves the performance the most. Forward selection is helpful in feature selection because it can improve the efficiency and effectiveness of the modeling process in several ways:

  • Faster convergence
  • Reduce the dimensionality of the dataset by selecting only the most relevant features
  • Improve the interpretability of the model by selecting a subset of features that are most relevant to the outcome of interest

Disadvantages of forward selection:

  • Can overfit the data if the stopping criterion is not carefully chosen
  • Ignores important interactions between independent variables
In [115]:
# IMPORT DEPENDENCIES
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
In [116]:
# PERFORM FORWARD SELECTION AND AUTO SELECT NUMBER OF FEATURES 
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto', direction='forward')
sfs.fit(X_norm, y)
sfs_support = sfs.get_support()

# CONVERT TO DATAFRAME 
df = pd.DataFrame({'feature_names': X.columns.tolist(), 'support': sfs_support})
print(df)
sfs_feature = X.loc[:,sfs_support].columns.tolist()
print(f"Optimal subset size using forward feature selection: {len(sfs_feature)}")
       feature_names  support
0            valence    False
1               mode     True
2           loudness    False
3       acousticness    False
4              tempo     True
5             energy    False
6           liveness    False
7                key     True
8        duration_ms    False
9   instrumentalness     True
10      danceability    False
11       speechiness     True
12    len_words_orig     True
13   len_words_trans     True
Optimal subset size using forward feature selection: 7

3. Backward Elimination ¶

Backward elimination works similarly to RFE, with some differences:

  • RFE: Fit a model with all features $\rightarrow$ Compute variable coefficients and their importance $\rightarrow$ Eliminate the lowest-ranking variable in each iteration

  • BE: Fit a model with all features $\rightarrow$ Remove the feature whose p-value exceeds the chosen significance level $\rightarrow$ Repeat until all remaining p-values are below the significance level, i.e., removing any remaining variable would hurt the model (see the sketch below)
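The sklearn SequentialFeatureSelector used below scores candidate subsets by cross-validated model performance rather than by p-values. A minimal sketch of the p-value-based variant described above, assuming the xnorm_df and y defined earlier and a hypothetical significance level of 0.05, could look like this:

import statsmodels.api as sm

y_aligned = y.reset_index(drop=True)                     # align index with xnorm_df
features = xnorm_df.columns.tolist()                     # start with all features
while features:
    model = sm.OLS(y_aligned, sm.add_constant(xnorm_df[features])).fit()
    pvals = model.pvalues.drop('const')                  # p-value of each remaining feature
    worst = pvals.idxmax()                               # least significant feature
    if pvals[worst] < 0.05:                              # all remaining features are significant
        break
    features.remove(worst)                               # otherwise drop it and refit
print(f"Features retained by p-value-based backward elimination: {features}")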

In [117]:
# PERFORM BACKWARD SELECTION AND AUTO SELECT NUMBER OF FEATURES 
sfs2 = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto', direction='backward')
sfs2.fit(X_norm, y)
sfs2_support = sfs2.get_support()

# CONVERT TO DATAFRAME 
df = pd.DataFrame({'feature_names': X.columns.tolist(), 'support': sfs2_support})
print(df)
sfs2_feature = X.loc[:,sfs2_support].columns.tolist()
print(f"Optimal subset size using forward feature selection: {len(sfs2_feature)}")
       feature_names  support
0            valence    False
1               mode     True
2           loudness     True
3       acousticness    False
4              tempo     True
5             energy     True
6           liveness    False
7                key     True
8        duration_ms    False
9   instrumentalness    False
10      danceability    False
11       speechiness     True
12    len_words_orig    False
13   len_words_trans     True
Optimal subset size using backward elimination: 7

Forward selection and backward elimination each selected 7 features, but the chosen sets are not identical.

Embedded methods ¶

Embedded methods are highly useful in feature selection because they incorporate feature selection directly into the model building process. In embedded methods, feature selection is performed simultaneously with model fitting, rather than as a separate step. Embedded methods work by adding a penalty term to the objective function of the model, which encourages the model to select only the most relevant features. This penalty term is usually controlled by a hyperparameter that needs to be tuned to achieve optimal performance.

In [118]:
# IMPORT DEPENDENCIES
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

1. Tree-based - SelectFromModel ¶

RandomForestRegressor is an ensemble learning method that combines multiple decision trees to make predictions for a continuous dependent variable (streams). A higher n_estimators generally improves the performance of a random forest by reducing variance and increasing prediction stability, at the cost of more computation. One of its main advantages is its ability to handle complex, high-dimensional data with many predictor variables. It can also capture both linear and nonlinear relationships between the predictors and the dependent variable.

In [119]:
#set hyperparam n_estimators=50 
embedded_rf = SelectFromModel(RandomForestRegressor(n_estimators=50), max_features=6)
embedded_rf.fit(X_norm, y)
embedded_rf_support = embedded_rf.get_support()
embedded_rf_feature = X.loc[:,embedded_rf_support].columns.tolist()

# CONVERT TO DATAFRAME 
df = pd.DataFrame({'feature_names': X.columns.tolist(), 'support': embedded_rf_support})
print(df)
print(f"{len(embedded_rf_feature)} selected features: {embedded_rf_feature} ")
       feature_names  support
0            valence     True
1               mode    False
2           loudness    False
3       acousticness    False
4              tempo     True
5             energy     True
6           liveness     True
7                key    False
8        duration_ms     True
9   instrumentalness    False
10      danceability    False
11       speechiness    False
12    len_words_orig    False
13   len_words_trans     True
6 selected features: ['valence', 'tempo', 'energy', 'liveness', 'duration_ms', 'len_words_trans'] 
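To see which importances drove this selection, one could inspect the fitted random forest inside SelectFromModel; a minimal sketch, assuming embedded_rf has been fitted as above:

importances = pd.Series(embedded_rf.estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))          # importance of each feature
print(f"Selection threshold (mean importance by default): {embedded_rf.threshold_}")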

2. Lasso Regularization ¶

Lasso regularization is a method used in linear regression to prevent overfitting and improve the generalization performance of the model. It is a form of regularization that adds a penalty term to the objective function of the linear regression problem. The penalty term in Lasso regression is the sum of the absolute values of the coefficients (i.e., the L1 norm of the coefficients) multiplied by a regularization parameter (lambda or alpha). The addition of this penalty term shrinks the coefficients towards zero, leading to a sparse set of non-zero coefficients that correspond to the selected features.
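In scikit-learn's parameterization, the objective minimized by Lasso is

$\min_{\beta} \; \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$

where $\alpha$ controls the strength of the penalty; larger values of $\alpha$ drive more coefficients exactly to zero. LassoCV, used below, selects $\alpha$ by cross-validation.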

In [120]:
# SELECT FEATURES USING LASSO REG 
reg = LassoCV()
reg.fit(X_norm, y)
df = pd.DataFrame({'feature_names': feature_names, 'coefficients': reg.coef_})
df
Out[120]:
feature_names coefficients
0 valence 0.000000e+00
1 mode 8.774039e-12
2 loudness 0.000000e+00
3 acousticness 0.000000e+00
4 tempo 0.000000e+00
5 energy -0.000000e+00
6 liveness 0.000000e+00
7 key 0.000000e+00
8 duration_ms -0.000000e+00
9 instrumentalness -0.000000e+00
10 danceability -0.000000e+00
11 speechiness -0.000000e+00
12 len_words_orig -0.000000e+00
13 len_words_trans -0.000000e+00

If a feature is irrelevant, lasso penalizes its coefficient and shrinks it to 0. The result above shows that essentially all coefficients are driven to zero, so none of the features is considered relevant.

Combined results ¶

To make it easy to compare the results of the different methods that take the relationship between each feature and the target into account, they are consolidated into the dataframe presented below.

In [121]:
featureselection_df = pd.DataFrame({'Feature':feature_names,
                                    'Pearson':corr_support, 
                                    'Chi-2':chi_support, 
                                    'AIC': aic_df['support'], 
                                    'BIC': bic_df['support'], 
                                    'RFE_LinearRegression':rfe_lr_support , 
                                    'RFECV_DecisionTreeRegressor': rfecv_dtr.support_ , 
                                    'FFS_LinearRegression':sfs_support, 
                                    'BE_LinearRegression':sfs2_support,
                                    'Random Forest':embedded_rf_support})
featureselection_df['Total'] = np.sum(featureselection_df.iloc[:, 1:], axis=1) #sum Trues 
featureselection_df.sort_values('Total', ascending=False)
Out[121]:
Feature Pearson Chi-2 AIC BIC RFE_LinearRegression RFECV_DecisionTreeRegressor FFS_LinearRegression BE_LinearRegression Random Forest Total
0 valence True True True True True False False False True 6
4 tempo False False True False False True True True True 5
11 speechiness True True False False True False True True False 5
13 len_words_trans True False False False True False True True True 5
1 mode False True True False False False True True False 4
3 acousticness True True True False True False False False False 4
7 key False True True False False False True True False 4
2 loudness False False True False True False False True False 3
5 energy False False True False False False False True True 3
12 len_words_orig True False False False True False True False False 3
6 liveness False False True False False False False False True 2
9 instrumentalness False True False False False False True False False 2
8 duration_ms False False False False False False False False True 1
10 danceability True False False False False False False False False 1

Total = number of methods that selected the feature (count of True values)

Predictive Analysis ¶

In [122]:
# IMPORT DEPENDENCIES 
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
In [123]:
# SPLIT INTO TRAINING AND TEST SETS 
X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=1)
In [124]:
def loss(yhat,y):
    """sqr error loss"""
    return (yhat - y)**2

def fit(X,Y):
    """fit the OLS from training w/ intercept"""
    lin1 = LinearRegression(fit_intercept=True) # OLS from sklearn
    lin1.fit(X,Y) # fit OLS
    return np.append(lin1.intercept_,lin1.coef_) # return betahat

def predict(x, betahat):
    """predict for point x"""
    return betahat[0] + x @ betahat[1:]
In [125]:
# COMPUTE AVERAGE TRAINING AND TEST LOSS 
n,p = X_norm.shape #get dimensions 

# COMPUTE LOSSES ON TEST SET 
betahat = fit(X_train,y_train)
Y_hat_te = [predict(x,betahat) for x in X_test]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,y_test)]

# COMPUTE LOSSES ON TRAIN SET 
Y_hat_tr = [predict(x,betahat) for x in X_train]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,y_train)]
In [126]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))
print("n, p :",n,p)
train avg loss: 985083046760.4008
test avg loss: 1034618292635.4753
n, p : 6809 14

Generally, we expect the training loss to be somewhat smaller than the test loss, since the model's parameters are optimized on the training data. What matters for generalization is the size of the gap: if the training loss is much smaller than the test loss, the model is overfitting the training data and will not generalize well to new samples. Ideally, the training loss is small enough that the model fits the training data well, but not so small that it overfits and the test loss increases. In this case, the training loss is only slightly smaller than the test loss, so there is no indication of overfitting and the train/test split appears reasonable.

Scores of each feature using univariate linear regression (F-statistics from f_regression)

In [127]:
# PERFORM SELECTION TO GET SCORES 
fs = SelectKBest(score_func=f_regression, k='all')
fs.fit(X_train, y_train)
# transform train input data
X_train_fs = fs.transform(X_train)
# transform test input data
X_test_fs = fs.transform(X_test)

# PRINT SCORES FOR EACH FEATURE 
for i in range(len(fs.scores_)):
    print(f"{feature_names[i]}: { fs.scores_[i]} ")
valence: 4.866863426784755 
mode: 4.781261625930871 
loudness: 1.9683197099587608 
acousticness: 4.793979095391341 
tempo: 2.821715977917632 
energy: 0.09602678653190241 
liveness: 2.333521250752356 
key: 0.5454139312819927 
duration_ms: 0.0673084007003831 
instrumentalness: 1.6787315102411215 
danceability: 2.7773774576764736 
speechiness: 9.974707855423258 
len_words_orig: 4.840817447008443 
len_words_trans: 8.656145139661833 

Evaluate a model that uses all input features

In [128]:
# FIT LINEAR REGRESSION MODEL 
model = LinearRegression()
model.fit(X_train, y_train)

# EVALUATE MODEL 
predicted_y = model.predict(X_test) #get predicted values 
mae = mean_absolute_error(y_test, predicted_y) #mean absolute error 
mse = ((predicted_y - y_test)**2).mean() #or use sklearn mean_squared_error func
print('MAE: %.3f' % mae)
print('MSE: %.3f' % mse)
MAE: 539245.869
MSE: 1034618292635.474
In [129]:
# OBTAIN SUMMARY USING OLS() 
x = sm.add_constant(X_train) #design matrix (allow computation of intercept)
model = sm.OLS(y_train, x).fit()
print(model.summary())
print('MSE:', model.mse_model) # print MSE
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                streams   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     3.033
Date:                Thu, 09 Mar 2023   Prob (F-statistic):           0.000110
Time:                        16:18:22   Log-Likelihood:                -72572.
No. Observations:                4766   AIC:                         1.452e+05
Df Residuals:                    4751   BIC:                         1.453e+05
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.534e+05    1.4e+05      1.805      0.071   -2.18e+04    5.29e+05
x1          1.697e+05   7.32e+04      2.318      0.021    2.62e+04    3.13e+05
x2          5.004e+04   2.94e+04      1.703      0.089   -7562.742    1.08e+05
x3          2.667e+05   1.85e+05      1.441      0.150   -9.61e+04     6.3e+05
x4          1.277e+05   7.01e+04      1.822      0.068   -9677.372    2.65e+05
x5          1.112e+05   9.37e+04      1.186      0.236   -7.26e+04    2.95e+05
x6         -1.162e+05   1.43e+05     -0.815      0.415   -3.96e+05    1.63e+05
x7          1.771e+05   1.17e+05      1.511      0.131   -5.26e+04    4.07e+05
x8          3.471e+04   4.45e+04      0.780      0.436   -5.26e+04    1.22e+05
x9          1.378e+05   1.99e+05      0.692      0.489   -2.53e+05    5.28e+05
x10        -1.472e+05   1.59e+05     -0.923      0.356    -4.6e+05    1.65e+05
x11        -7.938e+04      1e+05     -0.792      0.428   -2.76e+05    1.17e+05
x12         -2.77e+05   1.41e+05     -1.965      0.049   -5.53e+05    -620.112
x13         1.538e+06   7.84e+05      1.963      0.050    2250.959    3.07e+06
x14        -2.018e+06    8.2e+05     -2.461      0.014   -3.63e+06    -4.1e+05
==============================================================================
Omnibus:                     5349.165   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           553712.657
Skew:                           5.739   Prob(JB):                         0.00
Kurtosis:                      54.542   Cond. No.                         142.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE: 2997150372521.643

Evaluate a model that includes the top 7 input features from the combined results table (those selected by at least 4 methods)

In [138]:
X = song_data[['valence',  'tempo', 'speechiness', 'len_words_trans', 'mode', 'acousticness', 'key']] #inputs 
y = song_data['streams']
X_norm = MinMaxScaler().fit_transform(X) #scale features (ML can be sensitive to scale of input)
In [139]:
X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=1)
In [140]:
# FIT MODEL 
model = LinearRegression()
model.fit(X_train, y_train)

# EVALUATE MODEL 
predicted_y = model.predict(X_test)
mae = mean_absolute_error(y_test, predicted_y)
mse = ((predicted_y - y_test)**2).mean() #or use sklearn mean_squared_error func
print('MAE: %.3f' % mae)
print('MSE: %.3f' % mse)
MAE: 539545.956
MSE: 1037037766985.242
In [141]:
# OBTAIN SUMMARY USING OLS() 
x = sm.add_constant(X_train) #design matrix (allow computation of intercept)
model = sm.OLS(y_train, x).fit()
print(model.summary())
print('MSE:', model.mse_model) # print MSE
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                streams   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     4.412
Date:                Thu, 09 Mar 2023   Prob (F-statistic):           6.78e-05
Time:                        16:21:26   Log-Likelihood:                -72577.
No. Observations:                4766   AIC:                         1.452e+05
Df Residuals:                    4758   BIC:                         1.452e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.556e+05   6.61e+04      5.377      0.000    2.26e+05    4.85e+05
x1          1.521e+05   6.24e+04      2.437      0.015    2.98e+04    2.74e+05
x2          1.296e+05   9.09e+04      1.427      0.154   -4.85e+04    3.08e+05
x3         -3.285e+05   1.34e+05     -2.446      0.014   -5.92e+05   -6.52e+04
x4         -4.753e+05   2.76e+05     -1.721      0.085   -1.02e+06    6.61e+04
x5          5.548e+04   2.92e+04      1.900      0.058   -1779.946    1.13e+05
x6           1.22e+05   6.02e+04      2.029      0.043    4097.666     2.4e+05
x7           3.24e+04   4.44e+04      0.730      0.466   -5.47e+04    1.19e+05
==============================================================================
Omnibus:                     5352.588   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           552917.268
Skew:                           5.747   Prob(JB):                         0.00
Kurtosis:                      54.500   Cond. No.                         28.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MSE: 4363608708764.5713

Based on the results obtained above using different methods, it is evident that audio and lyrics features are not very useful for predicting popularity measured by number of streams. Thus, we can conclude that the chosen features are not effective in explaining stream counts on their own. Several factors likely limit the explanatory power of this model. Since all genres are included when measuring popularity, and different genres do not share the same popular attributes, there is likely considerable noise in the hit prediction model, leading to a lower R2 value and weaker correlations.

References:

https://scikit-learn.org/stable/modules/feature_selection.html