Music plays an important role in our lives. It influences our emotions, physical well-being, and social connections. It has the power to evoke emotions and moods in us, which can help us feel more connected to ourselves and to others. Listening to music can provide us with a sense of comfort, joy, or even sadness, depending on the type of music and our own personal experiences. Additionally, music has been shown to have physical and psychological benefits. It can reduce stress and anxiety, lower blood pressure, and even improve cognitive functioning. Whether we are actively engaged in making or performing music, or simply listening to it, music has the power to enrich our lives in countless ways.
This study focuses on popular music analysis to explore the songs that are widely consumed and enjoyed by people around the world. It involves a variety of methods that seek to understand the different elements that make up popular music in the present time. Popular music analysis has become an important field of study over the past few decades, as the role of popular songs in our society has grown and changed. By analyzing popular music, scholars and researchers can gain insights into the cultural, social, and political forces that shape the music we listen to. Furthermore, by examining the musical styles, themes, and lyrics of popular music in different countries, we can gain a better understanding of the values and beliefs of those cultures. This study aims to address the following questions of interest.
The analysis of popular music in different countries has become an increasingly important area of study in recent years, as global communication and technology have made it easier than ever before to access and share music from different cultures around the world. This has led to a growing interest among scholars and researchers in exploring the diverse musical traditions and practices of different countries, and understanding the cultural, social, and political contexts in which popular music is created and consumed.
Studies of popular music in different countries often focus on the unique musical styles, themes, and lyrics that reflect the cultural traditions and values of those societies. They may also explore the ways in which music is used to express social and political messages, and to promote cultural and national identities. In addition, analyses of popular music in different countries may examine the ways in which global musical trends and influences are adapted and transformed within local contexts, and the ways in which music is used to resist dominant cultural or political forces. Overall, the analysis of popular music in different countries is a rich and diverse field of study that offers insights into the unique musical cultures and practices of different regions, and the ways in which music is used to express and shape social, cultural, and political identities.
The process of acquiring data for this analysis involved multiple sources, namely Spotify Charts, Spotify API, and Genius Lyrics API. The Spotify Charts website was accessed to obtain the top 100 songs from 73 countries during the week of February 16, 2023, while the Spotify API was used to collect audio features of the selected songs. Furthermore, the lyrics for the most popular songs were obtained using the Genius API. These sources are recognized for their reliability and suitability in providing relevant data for answering the research questions posed by this analysis. The combination of these data sources is expected to provide a comprehensive understanding of the sentiments expressed in popular songs across different countries. The following list describes the attributes of the final merged dataset used for this study.
Preprocessing text data is a crucial step in preparing data for analysis. Raw text data, such as lyrics, often contains irrelevant information, inconsistencies, and noise that can negatively impact the accuracy and reliability of analysis results. To improve the quality of text data, various preprocessing techniques such as text cleaning, tokenization, stop-word removal, and lemmatization should be performed. These techniques can enhance the efficiency of machine learning algorithms and other text mining techniques. Moreover, preprocessing can help standardize text data across different sources and improve the interpretability of results. Effective preprocessing is, therefore, essential for accurate and insightful analysis of text data.
In this project, preprocessing begins by adding a column to the dataframe that counts the number of words in each song's lyrics. Continent and country codes are also added as new features for mapping purposes. To ensure that all stopwords are caught during stopword removal, contractions such as "don't" and "wasn't" are first expanded into their full forms. Following the removal of stopwords, lemmatization is performed to reduce each word to its base form. Expressions such as "oh" that hold no significant meaning and commonly appear in song lyrics are removed to further simplify the lyrics data. Foreign words for which the Googletrans library returns no translation are also removed. Lastly, words are converted to the present tense, and remaining characters such as diacritics are removed. These preprocessing steps are expected to improve the quality of the data, making it suitable for analysis and aiding in the extraction of meaningful insights.
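The cleaning steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual pipeline: the contraction map, stopword set, and filler list are small hypothetical stand-ins, and lemmatization, translation, and tense normalization are omitted because they depend on external libraries and resources.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables; the study's real lists are not given.
CONTRACTIONS = {"don't": "do not", "wasn't": "was not", "can't": "cannot", "i'm": "i am"}
STOPWORDS = {"i", "am", "do", "not", "was", "the", "a", "you", "me", "my", "to", "and", "it"}
FILLERS = {"oh", "ooh", "yeah", "la", "na"}

# One pattern that matches any known contraction as a whole word.
_CONTRACTION_RE = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")


def word_count(lyrics: str) -> int:
    """Number of whitespace-separated words in the raw lyrics."""
    return len(lyrics.split())


def strip_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess_lyrics(lyrics: str) -> list[str]:
    """Lowercase, expand contractions, strip diacritics, then drop stopwords and fillers."""
    text = lyrics.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = strip_diacritics(text)
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and t not in FILLERS]
```

Expanding contractions before filtering matters here: "don't" is not in the stopword set, but "do" and "not" are, so the expanded form is removed cleanly.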
Descriptive analysis is an important initial step in comprehending and exploring data. In this project, it is used to outline the attributes of song features and lyric data across the 73 countries featured in the study. This section explores the correlations between different attributes in the data and song popularity, as measured by the number of streams. It also investigates the most commonly used words in the lyrics of present-day top songs, how the averages of multiple attributes differ across countries, and which words are most common in popular songs across different continents.
To answer the question of whether song features can explain song popularity, feature selection and predictive analysis are employed. Feature selection is an essential step to identify the most relevant features that impact song popularity, measured by stream count. The quantitative features of each song, including number of words in lyrics, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo, are considered. After the relevant features are identified, regression analysis is applied to model the relationship between the quantitative song features and stream count. This process is intended to identify the features that contribute most to song popularity and to build a predictive model that estimates stream count from these features. Ultimately, this analysis aims to provide insights into the factors that contribute to song popularity and to help music industry professionals make data-driven decisions about music production and promotion.
Various feature selection techniques are employed to identify the most informative predictors for stream count prediction. These methods include Variance Threshold; Filter-Based Methods such as Pearson Correlation and the Chi-square Test; Information Criteria such as AIC and BIC; Wrapper-Based Methods including Recursive Feature Elimination, Forward Feature Selection, and Backward Elimination; and Embedded Methods. After performing these techniques, a summary table of the outcomes is generated to choose the optimal predictors for constructing a predictive model.
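As an illustration of the two simplest techniques in this list, the sketch below combines a variance threshold with a Pearson correlation filter using only the standard library. The function names and the default thresholds (`min_variance`, `min_abs_corr`) are hypothetical choices for this example, not values taken from the study, and the wrapper-based, embedded, and information-criterion methods are not covered.

```python
from statistics import mean, pvariance


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_features(
    features: dict[str, list[float]],
    target: list[float],
    min_variance: float = 0.0,
    min_abs_corr: float = 0.1,
) -> dict[str, float]:
    """Return {feature name: correlation with target} for features that pass both filters."""
    kept = {}
    for name, values in features.items():
        # Variance threshold: a (near-)constant column cannot explain the target.
        if pvariance(values) <= min_variance:
            continue
        # Filter step: keep only features with a non-trivial linear association.
        r = pearson(values, target)
        if abs(r) >= min_abs_corr:
            kept[name] = r
    return kept
```

The variance check runs first on purpose: a constant column would otherwise cause a division by zero in the correlation.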
Sentiment analysis is the process of extracting and analyzing the emotions and attitudes expressed in text data. In recent years, sentiment analysis has gained significant attention in the field of natural language processing, as it provides valuable insights into the subjective opinions and sentiments of individuals or groups towards a particular topic or product. One area where sentiment analysis can be applied is in the analysis of lyrics data, which can reveal the underlying emotions and themes expressed in songs across various countries. By applying sentiment analysis techniques to lyrics data, researchers and industry professionals can gain a deeper understanding of the emotional impact and cultural significance of music, as well as the social and political contexts in which it is created and consumed. In this context, this project aims to conduct sentiment analysis on a large dataset of lyrics data, in order to explore the emotional content and sentiment patterns in popular music. Unsupervised sentiment analysis is performed to specifically answer the following questions of interest.
The song lyrics are analyzed using two well-known topic modeling methods, BERTopic and Latent Dirichlet Allocation (LDA), to determine the underlying topics or themes in the lyrics of chart-topping songs. Although they share similar objectives, their processes are distinct. Both methods are topic models, which means they scan a collection of documents to identify word and phrase patterns and cluster similar expressions that best represent the set of documents. However, their similarities end there. LDA works somewhat like principal component analysis (PCA) in that it decomposes the corpus document-word matrix into document-topic and topic-word matrices. In contrast, BERTopic is transformer-based: it groups document embeddings into dense clusters and then applies c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) to describe each cluster, yielding easily interpretable topics while preserving significant words in topic descriptions.
In addition to these two methods, Top2Vec is another topic modeling method utilized for examining the themes present in the lyrics of popular songs. Top2Vec is an unsupervised machine learning algorithm used for topic modeling and document clustering. It combines the strengths of document embeddings, word embeddings, and clustering algorithms to produce high-quality clusters of related documents. Top2Vec creates a map of topics that can be explored visually, allowing for easy interpretation of results. It is particularly useful when dealing with large datasets, as it can quickly and efficiently identify relevant topics and cluster documents accordingly, without requiring labeled training data.
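To make the c-TF-IDF weighting used by BERTopic more concrete, the following sketch scores words per cluster with the formula given in the BERTopic documentation: term frequency within the class multiplied by log(1 + A / f), where A is the average number of words per class and f is the word's frequency across all classes. It assumes the documents have already been clustered and tokenizes on whitespace only; the function name and input layout are hypothetical, and this is not BERTopic's own implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each word in each class; higher scores mark words that characterize the class."""
    # Treat all documents in a cluster as one concatenated "class document".
    class_counts = {c: Counter(" ".join(docs).split()) for c, docs in class_docs.items()}

    # Word frequencies over all classes, and the average word count per class.
    total_counts: Counter = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            w: (n / n_words) * math.log(1 + avg_words / total_counts[w])
            for w, n in counts.items()
        }
    return scores
```

Words that are frequent inside one cluster but rare elsewhere receive the highest scores, which is why the top-scoring words can serve as a readable description of a topic.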
Popular songs are characterized by an average length of 3.3 minutes and a tempo of 122 beats per minute. They are typically highly danceable, energetic, and positive, while displaying low speechiness, loudness, acousticness, and instrumentalness. Figure 1.1 illustrates that Ukraine has the highest average instrumentalness, while Brazil has the highest average valence and significantly higher liveness compared to other countries. South African popular songs have the longest average duration, whereas Japan has the highest average energy, and Japan and Indonesia have the highest average acousticness. Furthermore, according to Figure 1.2, the number-one songs of Brazil, Mexico, the USA, and Italy have the highest number of streams. There is also a weak positive correlation between certain audio and lyrics features and song popularity, with acousticness having the highest correlation. The most frequently used words in song lyrics are similar across the six continents, with simple and commonly used words such as "like", "know", and "love" being the most frequent (Figure 1.8). Modern song lyrics typically incorporate words that are associated with contemporary themes, such as "party" and "love", as shown in Figure 1.6.
In the predictive analysis section of the appendix, a summary table displays the combined results indicating the top 7 useful features for predicting song popularity. These features are valence, tempo, speechiness, length of translated lyrics, mode, acousticness, and key. Based on this, only these variables were chosen for the regression analysis. Even after the most useful features are selected through various selection methods, the question remains whether combining these independent variables can produce a dependable regression model for predicting song popularity. Removing insignificant variables and using only the "best" features chosen in the previous step did not enhance the explanatory power of the reduced model. Furthermore, the reduced model has a higher mean absolute error (MAE) than the model fitted with all the independent features. The explanatory power of the model is likely limited for several reasons, including the absence of a genre attribute in the data. Because streaming popularity is measured across all genres, the model probably includes noise, since different genres do not share the same popular attributes. This may lead to lower correlations and a lower R-squared value in the prediction model.
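For reference, the two metrics used in this comparison can be computed as in the short sketch below. The function names are illustrative, and the numbers in the usage note are made up for demonstration; they are not results from the study.

```python
def mean_absolute_error(y_true: list[float], y_pred: list[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot
```

With observed values `[1.0, 2.0, 3.0, 4.0]` and predictions `[1.5, 2.0, 2.5, 4.0]`, the MAE is 0.25 and R-squared is 0.9. A lower MAE and a higher R-squared both indicate a better fit, which is the sense in which the reduced model performs worse than the full one.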
There is a generally moderate to high degree of similarity between the lyrics from 6 different continents using cosine similarity (Figure 2.1). This suggests that there are some commonalities in the themes and topics covered in these lyrics. By performing sentiment analysis on the data using various methods such as Vader, TextBlob, NRCLex, and Afinn, the most dominant sentiment in the lyrics of present-day popular songs in top charts is discovered. In Figure 2.4, it is shown that the average compound score computed using Vader appears to differ among various countries, with Japan, South Korea, Hong Kong, Vietnam, and Germany having the highest average compound scores. This suggests that people in these countries prefer to listen to songs with predominantly positive sentiments. Conversely, Turkey and Ukraine have the lowest average compound score. While the low average sentiment score in Ukraine may be attributed to the ongoing war in their country, there may be other political and social issues in Turkey that lead its people to favor songs with less positive themes. Similar to the average compound score plot in Figure 2.4, the average polarity scores for different countries also vary. In Figure 2.3, Pakistan's polarity bar is not visible due to its exceptionally low average polarity score of 0.000327. Other countries with low average polarity (sentiment) scores computed using TextBlob include Turkey, India, Dominican Republic, and Greece. Conversely, Indonesia, Malaysia, Thailand, Vietnam, and UAE have the highest average polarity scores using this method.
As for subjectivity, Portugal, Chile, Brazil, Thailand, and Panama have the highest average subjectivity scores among the countries analyzed (Figure 2.6). This means that these countries tend to listen to songs that likely contain subjective language and expressions of personal feelings or beliefs. In contrast, Philippines, Saudi Arabia, Greece, Nigeria, and India have the lowest subjectivity scores which means that they listen to songs with lyrics that are more objective and fact-based. As shown in Figure 2.7, it is apparent that popular songs predominantly express positive emotions and a relatively small number of song lyrics convey feelings of anger. This is useful because it may enable us to draw inferences about the collective emotions of music listeners and identify any global social or political issues that may be contributing to this trend.
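The cosine similarity comparison between continents reported at the start of this section can be illustrated with a small sketch. It assumes each continent's lyrics are represented as a bag of word counts; the study does not state which vectorization was actually used (TF-IDF weights or embeddings are equally plausible), so the representation here is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Cosine of the angle between two word-count vectors (0 = no overlap, 1 = same direction)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because the measure depends only on the direction of the vectors, two continents with very different numbers of songs can still score as highly similar if their relative word usage is alike.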
The most prevalent topic in popular song lyrics is 'love', as determined by LDA, BERTopic, and Top2Vec methods. The intertopic distance map generated using LDA in Figure 3.3 indicates that the most common topics are 'nightlife' and 'love', with a greater number of documents falling under the 'love' topic. However, coherence scores (c_v = 0.3 and u_mass = -0.97) suggest that unsupervised topic modelling with LDA is not very effective with song lyrics data. This could be due to the small size of the corpus, which may not provide enough information to establish coherent topics. Increasing the size of the corpus could potentially enhance the coherence score.
Based on BERTopic, the primary topic for popular song lyrics is 'love', which is indicated by its high score in topic 0, along with its dominant words. This finding is consistent with the fact that the data set includes songs from the week of February 16, 2023, which falls during the Valentine's Day season. The highest-scoring words for each topic are shown in Figure 3.5. Topic 6 is related to mental health, with words such as mature, drug, and therapist being predominant, while topic 7 is associated with dance, based on its high-scoring words. Compared to LDA, BERTopic has higher coherence scores (c_v = 0.6 and u_mass = -0.22), indicating more distinct and understandable topics. BERTopic's intertopic distance plot reveals that similar topics are more closely clustered together than in LDA (Figure 3.4). As noted above, the small size of the document corpus may explain why LDA did not generate coherent topics, and increasing the corpus size could potentially improve its coherence scores.
When Top2Vec is used for unsupervised topic modeling on the lyrics attribute in the dataset, it becomes apparent that most of the lyrics belong to topics 0 and 1. These topics represent 'sexual activity' and 'emotion', respectively, based on the dominant words within each topic. Some of the most dominant words and slang terms in topic 0 include 'jiggy', 'kiss', 'deceive', and 'poppin', while topic 1 includes words such as 'learn', 'ballad', 'cry', 'beg', 'fall', and 'emotion'. Searching for topics by keyword reveals lyrics with topics about love, hate, race, law, family, gun violence, transgenderism, social injustice, abortion rights, and education. These topics are commonly discussed on social media and other communication platforms.
This study emphasizes the significance of analyzing popular music to comprehend the societal, cultural, and political influences that shape the music we listen to. Researchers can gain valuable insights into the values and beliefs of various cultures by analyzing popular songs from different countries and examining their musical styles, themes, and lyrics. This analysis aims to investigate the connection between song features and popularity, prevailing sentiments in different countries, emotional differences in song lyrics, and the underlying themes in popular songs of February 2023. These findings can contribute to a better understanding of the role of music in shaping society and can have implications for the music industry and global cultural exchange.
This study finds that there are significant differences in the audio features of popular songs across different countries, and each country exhibits unique musical styles and preferences. Moreover, the study shows that certain audio and lyrics features have only a weak correlation with song popularity, suggesting that other factors, such as marketing and social media, may play a crucial role in driving song popularity. The challenges of predicting song popularity using regression analysis and the importance of feature selection in the predictive modeling process are also highlighted in this project. Based on the results, valence, tempo, speechiness, length of translated lyrics, mode, acousticness, and key can potentially be used for predicting song popularity. However, despite the careful selection of these variables, the explanatory power of the model is limited, as evidenced by the higher mean absolute error and lower R-squared value compared to the model fitted with all the independent features. The absence of a genre attribute in the data is likely a significant factor contributing to the model's limitations, as different genres may not share the same popular attributes. Thus, further research is needed to refine the methodology and improve the accuracy of the predictions.
Additionally, this study reveals interesting insights about the themes, sentiments, and emotions prevalent in popular songs across different countries. The analysis shows that people in different countries tend to prefer songs with predominantly positive sentiments. There are several reasons why people listen to positive songs. Positive songs can improve a person's mood and mental state by boosting energy levels, improving overall mood, and reducing feelings of stress or anxiety. Positive songs can also inspire and motivate people, as they often contain empowering lyrics that can boost confidence and encourage pursuing goals. Additionally, positive songs can offer a form of escape from negative emotions or difficult situations, providing comfort and helping people focus on positive aspects of their life or the world around them. Lastly, positive songs can be enjoyable to listen to, with their upbeat rhythms, catchy melodies, and feel-good lyrics often providing a source of pleasure and entertainment for listeners. In summary, people listen to positive songs for their ability to improve mood, provide inspiration, offer a form of escapism, and for their sheer enjoyment.
This study employs various topic modeling methods to analyze popular song lyrics, confirming that 'love' is the most prevalent topic. Applying various techniques, it is evident that popular songs typically revolve around themes of nightlife or parties, mental health, sexuality, relationships, and emotions because these are common experiences that many people can relate to. Many people enjoy going out and partying with friends, so it makes sense that these themes would be popular in music. Mental health is another common theme in music because it is something that affects many people. Music can be a powerful tool for expressing emotions, and many artists use their music to talk about their own struggles with mental health or to bring attention to the issue. Sexuality is another theme that is frequently explored in music because it is a universal aspect of human experience. Songs about sex can be fun, romantic, or provocative, and they can help people connect with their own desires and fantasies. Relationships and emotions are also common themes in music because they are central to the human experience. Love, heartbreak, joy, and sadness are emotions that everyone has felt at some point in their lives, and music can be a powerful way to express and connect with those feelings. Overall, songs revolve around these themes because they are relatable to many people and can help us connect with each other through shared experiences and emotions.
In summary, the findings of this analysis demonstrate the significant impact of political, social, and cultural factors on people's music preferences in today's society. Generally, music with positive themes is preferred. However, in some countries, individuals appear to be influenced by political and social concerns. While music serves as a means of escape for some, in other regions, individuals utilize music to connect with current social and political challenges facing their respective countries.
Spotify. Spotify Charts. Retrieved from Spotify Charts
Spotify. Web API Reference. Spotify Developer. Retrieved from Spotify API
Genius. Genius API Documentation. Genius. Retrieved from Genius API
Miller, J. LyricsGenius: a Python client for the Genius.com API. GitHub. Retrieved from LyricsGenius
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830. Feature selection
Mohammad, S. NRC Emotion Lexicon. Retrieved from NRCLex
Hutto, C.J. (2014). vaderSentiment. [Computer software]. Retrieved from Vader
TextBlob. TextBlob documentation. TextBlob. TextBlob
Grootendorst, M. (2021). BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics. GitHub. Retrieved March 14, 2023, from BERTopic
Mabey, B. (n.d.). PyLDAvis: Python library for interactive topic model visualization. GitHub. LDA
Angelov, D. (2021). Top2Vec: Distributed semantic search and clustering. GitHub. Retrieved from Top2Vec