Machine learning approach to auto-tagging online content for content marketing efficiency: A comparative analysis between methods and content type
Introduction
Turning online content into structured data is important for content marketers, as structuring the content supports users' information consumption and sharing and therefore, from a commercial perspective, firm performance (Balducci & Marinova, 2018). For marketers and decision-makers, especially in firms dealing with online content (e.g., social media managers, editors, content producers), a higher order understanding of content performance is crucial for competitive success, given the rising demand among users for personalised offerings (Kumar, 2018). Yet, making sense of online content performance to derive business value can be a daunting task, as the data involved are complex in terms of volume and dynamics, fragmented across many channels, and associated with many different metrics (Chun, 2018; Clarke & Jansen, 2017). Content classification (e.g., dividing the content into topics) is therefore a necessity, such that individual units of content are thematically aggregated to increase interpretability for decision-making in relation to content marketing activities such as content creation, dissemination, and management. Nonetheless, beyond the obvious impracticalities of time and effort involved, manually tagging online content for keywords is problematic for two main reasons: a) the tagging process is fallible owing to human error; and b) classification taxonomies can change over time as new topics emerge, especially given the vast quantity of online data generated daily. Consequently, online content often remains largely unstructured, with tags absent or incorrectly allocated (Kutlu, Elsayed, & Lease, 2018). Machine learning approaches have emerged as a potential solution to this problem and are increasingly applied in a variety of fields to uncover hidden insights by automating the classification process (Antons & Breidbach, 2018).
Even so, the application of machine learning approaches in marketing is still at a developmental stage, in need of refinement and insight (Balducci & Marinova, 2018; Sterne, 2018). In this research, we contribute to the marketing literature by: 1) Comparing three relevant approaches to automatically classify news articles based on web content from a major worldwide news and media organisation; 2) Developing and illustrating a neural network algorithm to address the multilabel classification issue in automatically classifying webpages containing news articles; and 3) Applying the same algorithm, without channel-specific training, on the same organisation's YouTube channel to test the generalisability of the approach. The latter evaluation is important for several reasons. Most notably, evaluation of the cross-channel applicability of automatic classification approaches is often not conducted in the research dealing with auto-tagging online content, which means that the generalisability of the models over time and in different channels is not properly addressed. Rather, researchers applying machine learning methods to this problem tend to utilise the test data from the same overall sample to evaluate their models' performance. Even though this practice is typical for evaluating a model's performance (i.e., machine learning models are tested such that training and test data are kept separate, so that the model does not “see” the test data prior to predicting it), the cross-sectional nature of data collection (i.e., the training and testing data belong to the same overall sample) makes it difficult to evaluate the model's true generalisability over time and in different channels. Therefore, by evaluating the cross-channel applicability of our model, we address the broader question: Are machine learning models developed for online content classification generalisable beyond the dataset they were trained and tested on?
To address this question, we conduct a repeated test of the model on an independently collected dataset of the organisation's content, i.e., the titles and descriptions of the videos in the organisation's YouTube channel.
In addition to addressing a research gap within the automatic classification of online content, cross-channel applicability of tagging online content is highly important for organisations practically engaged in content marketing, as such organisations typically publish their content in multiple channels, including websites and social media such as Facebook, Twitter, YouTube, and LinkedIn. Thus, a classifier developed to tag the content published in different channels needs to perform well in the multichannel environment that makes up the modern content marketer's marketing mix. With increasingly large, complex, and dynamic data becoming the basis of marketing decisions, it is ever more important to develop better methods of converting unstructured ‘big’ data into actionable information and insights (Syam & Sharma, 2018). Though the vast amount of available data is useful for training machine learning algorithms to make accurate predictions or classifications, developing the right approach can be challenging, not least because of the level of noise in the datasets and the diverse range of problems in relation to available technologies (Flake, Frasconi, Giles, & Maggini, 2004). On the whole, higher level description of online content is important for machine-readability, model development, and statistically correlating topics to various key performance metrics of content marketing such as visitor statistics, development of content coverage over time, or the range of topics covered by various websites. Our aim is to address the gap in the extant marketing literature for more advanced and innovative methods (Hofacker, 2012; Kumar, 2018) by comparing machine learning approaches to dealing with the multilabel classification problem when classifying news articles and examining a high-performing machine learning model's cross-channel applicability for a different type of content.
By using data from a worldwide news organisation, we show that our approach yields an overall F1 Score of 70%, even with a large set of topics. We further visualise the development of news articles over time; provided the taxonomy is updated with at least some examples, our classification is robust to topic changes and new topics emerging over time. In addition, we evaluate cross-platform applicability by classifying the same organisation's YouTube videos and then manually reviewing the results via three human coders.
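As an illustration of how an overall F1 score can be computed for multilabel predictions, the following is a minimal sketch that assumes micro-averaging (pooling true positives, false positives, and false negatives across all articles and keywords). The text reports an overall F1 score without specifying the averaging scheme here, so the averaging choice, the function name micro_f1, and the example keyword sets are all assumptions made for illustration.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over a sequence of per-article keyword sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)  # keywords correctly predicted
        fp += len(pred - truth)  # keywords predicted but not assigned
        fn += len(truth - pred)  # keywords assigned but missed
    if tp == 0:
        return 0.0  # no correct predictions, so precision and recall are zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical keyword sets for three articles (not taken from the study's data).
truth = [{"syria", "war"}, {"economy"}, {"football", "qatar"}]
pred = [{"syria"}, {"economy", "oil"}, {"football", "qatar"}]

score = micro_f1(truth, pred)
print(f"micro-F1 = {score:.2f}")
```

Micro-averaging weights frequent keywords more heavily than rare ones; a macro-average (mean of per-keyword F1 scores) would treat every keyword equally and could give a different picture with a large set of topics.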
The remainder of the paper is organised as follows. First, we present an overview of the literature on machine learning applications in marketing, followed by a summary of the proposed solution strategy. Next, we explain the data exploration and preparation procedure. We then evaluate three classifiers: Random Forests, K-Nearest Neighbors, and Neural Network (NN); followed by a more detailed application of NN whereby data collected from one year (2017) is used for training and data collected from another year (2018) is used for testing. Based on this, keywords are generated for unclassified news articles using the developed approach. Subsequently, we evaluate the cross-channel applicability by classifying YouTube videos of the news organisation. Finally, we discuss implications and avenues for further research.
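The year-based evaluation described above (training on 2017 data, testing on 2018 data) can be sketched as a simple temporal split. This is a minimal sketch under assumed data structures: the field names "title", "published", and "keywords", the helper split_by_year, and the sample records are hypothetical and are not taken from the study's dataset.

```python
from datetime import date


def split_by_year(articles, train_year, test_year):
    """Split articles into training and test sets by publication year."""
    train = [a for a in articles if a["published"].year == train_year]
    test = [a for a in articles if a["published"].year == test_year]
    return train, test


# Hypothetical records; the real dataset's schema is not specified here.
articles = [
    {"title": "Article A", "published": date(2017, 3, 1), "keywords": ["politics"]},
    {"title": "Article B", "published": date(2017, 11, 20), "keywords": ["economy"]},
    {"title": "Article C", "published": date(2018, 2, 14), "keywords": ["politics", "economy"]},
    {"title": "Article D", "published": date(2016, 6, 5), "keywords": ["sport"]},
]

train, test = split_by_year(articles, train_year=2017, test_year=2018)
print(len(train), "training articles;", len(test), "test articles")
```

Unlike a random split of a single sample, a split by year means the test articles postdate every training article, which is what allows the evaluation to probe robustness to topic changes over time.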
Machine learning in marketing and content classification
Machine learning is an umbrella term used to describe a variety of computer-based techniques for data mining to uncover complex patterns, particularly in large and complex datasets (Pereira, Plastino, Zadrozny, & Merschmann, 2018), with a view to deriving insights for prediction, classification, and decision-making purposes (Cui, Wong, & Lui, 2006). Particularly, in the context of a multiplicity of social media and user-generated content (UGC) platforms, the diversity of data, in both type and
Algorithm selection and data cleaning
Many algorithms are not well-optimised for dealing with the problem at hand, since they do not possess the inbuilt capability of handling multilabel classifications. There are alternative methods to train multilabel classifiers, such as training one model for each label. However, since we are predicting news keywords, which are numerous and diverse, this approach is not technically feasible. As such, we have opted to evaluate three algorithms that have inbuilt multilabel classification
Data collection and exploration
Al Jazeera is a global news and media organisation, headquartered in Doha, Qatar. The main website (aljazeera.com) attracts traffic from nearly 200 countries and regions and has had on average over 15 million visits in 2018, of which roughly 42% comes from search and another 44% is direct (SimilarWeb, 2018). We collected the data by scraping the content of Al Jazeera's main website that distributes news stories. The resulting dataset contains information about the article's content, its title,
Classifier models and evaluation
As mentioned previously, the models we can use are limited to those that support multilabel classification efficiently; that is, to avoid using multiple One-vs-Rest classifiers to create the model. Using multiple One-vs-Rest classifiers is computationally inefficient, because this entails creating one model per keyword, then using all models during prediction time (Read, Pfahringer, Holmes, & Frank, 2011). This means training a large number of models, which will only increase in number when the
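To make the cost argument concrete, the sketch below encodes multilabel targets as a binary indicator matrix (one row per article, one column per keyword). A classifier with inbuilt multilabel support can be trained once against the whole matrix, whereas a One-vs-Rest setup needs one binary model per column. The helper build_indicator_matrix and the example keywords are hypothetical and serve only as an illustration.

```python
def build_indicator_matrix(keyword_lists):
    """Encode per-article keyword lists as a binary indicator matrix."""
    vocabulary = sorted({kw for kws in keyword_lists for kw in kws})
    column = {kw: i for i, kw in enumerate(vocabulary)}
    matrix = []
    for kws in keyword_lists:
        row = [0] * len(vocabulary)
        for kw in kws:
            row[column[kw]] = 1  # mark this keyword as present for the article
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical keyword assignments for three articles.
keyword_lists = [["syria", "war"], ["economy"], ["qatar", "economy"]]
vocabulary, Y = build_indicator_matrix(keyword_lists)

# One-vs-Rest trains one binary model per column of Y (i.e., per keyword);
# a natively multilabel model is fitted once against all columns together.
n_one_vs_rest_models = len(vocabulary)
print(vocabulary, Y, n_one_vs_rest_models)
```

With only four keywords the difference is negligible, but the number of One-vs-Rest models grows linearly with the keyword vocabulary, and all of them must be evaluated at prediction time.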
Predicting keywords for news articles
As the first step in the process of predicting keywords, a total of 8160 articles missing their keywords were identified and converted into a TF-IDF matrix. Next, we use our trained model to predict which keyword(s) belong to each article. Since an article may have more than one keyword, the Neural Network computes a probability for each label to be present in an article; for selecting a label for an article, its probability must be ≥0.48. A specimen article, following keyword prediction, is
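The thresholding step described above can be sketched as follows. The 0.48 cut-off is the value reported in the text; the function name select_keywords, the dictionary representation of the per-label probabilities, and the example values are assumptions made for illustration and do not reproduce the authors' implementation.

```python
THRESHOLD = 0.48  # acceptance threshold reported in the text


def select_keywords(label_probabilities, threshold=THRESHOLD):
    """Return the keywords whose predicted probability meets the threshold."""
    return sorted(
        kw for kw, p in label_probabilities.items() if p >= threshold
    )


# Hypothetical per-label probabilities for two articles.
article_1 = {"syria": 0.91, "war": 0.48, "economy": 0.12}
article_2 = {"syria": 0.20, "war": 0.31, "economy": 0.47}

print(select_keywords(article_1))  # keywords at or above the threshold
print(select_keywords(article_2))  # empty list: the article stays unclassified
```

Because each label is accepted or rejected independently, an article can receive several keywords or none at all; lowering the threshold reduces the number of unclassified articles at the cost of admitting less certain keywords.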
Discussion and implications
There has been an increasing shift in the field of marketing from conventional forms of content analysis to more advanced computational forms corresponding to the vastly increasing availability, complexity, and importance of data (Balducci & Marinova, 2018; Kumar, 2018). Meanwhile, a parallel development in relation to research methodology in marketing has been called for (Hofacker, 2012), so that innovative approaches may also contribute to greater advancements in marketing theory, especially
Limitations and suggestions for further research
One improvement to our study would be to obtain more data, more keywords, and more articles, to further expand and improve the capabilities of the model. A small number of articles remained unclassified (0.453% overall); to remedy this, we may either include more keywords during training, or decrease the probability threshold for accepting predicted keywords. However, both approaches have their disadvantages, including an increase in false positives due to lowering the threshold for
Concluding remarks
Leveraging the benefits of machine learning applications in marketing and addressing the important need for such application for marketing research methods, this paper contributes to the literature by comparing three state-of-the-art algorithms for tagging online website content and establishing cross-platform applicability. We find that the Neural Network performs the best for multilabel classification, and the developed model was able to cope with changes in topics over time, which is salient
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Joni Salminen, PhD holds a PhD in Marketing from Turku School of Economics and is currently working as a postdoctoral researcher at Qatar Computing Research Institute. His expertise lies in the area of digital marketing and using (big) data for marketing applications, such as automatic profiling of user segments and gauging brand reputations using social media data.
References (61)
- et al. Does marketing research suffer from methods myopia? Journal of Business Research (2013)
- et al. Feeling a destination through the “right” photos: A machine learning model for DMOs' photo selection. Tourism Management (2018)
- et al. Response models based on bagging neural networks. Journal of Interactive Marketing (2005)
- On research methods in interactive marketing. Journal of Interactive Marketing (2012)
- et al. Battle of the brand fans: Impact of brand attack and defense on social media. Journal of Interactive Marketing (2018)
- et al. Finding the right words: The influence of keyword characteristics on performance of paid search campaigns. Journal of Interactive Marketing (2014)
- et al. Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging. Information Processing & Management (2018)
- et al. Semi-supervised response modeling. Journal of Interactive Marketing (2010)
- et al. Exploring coherent topics by topic modeling with term weighting. Information Processing & Management (2018)
- et al. Marketing intelligent systems for consumer behaviour modelling by a descriptive induction approach based on genetic fuzzy systems. Industrial Marketing Management (2009)
- Prospects for personalization on the internet. Journal of Interactive Marketing
- A soft-computing-based method for the automatic discovery of fuzzy rules in databases: Uses for academic research and management support in marketing. Journal of Business Research
- Local word vectors guiding keyphrase extraction. Information Processing & Management
- Correlation analysis of performance measures for multi-label classification. Information Processing & Management
- Term-weighting approaches in automatic text retrieval. Information Processing & Management
- Predicting the “helpfulness” of online consumer reviews. Journal of Business Research
- Waiting for a sales renaissance in the fourth industrial revolution: Machine learning and artificial intelligence in sales research and practice. Industrial Marketing Management
- Classifying residents' roles as online place-ambassadors. Tourism Management
- A classification of user-generated content into consumer decision journey stages. Neural Networks
- Mining user interests over active topics on social networks. Information Processing & Management
- The impact of metadata implementation on webpage visibility in search engine results (part II). Information Processing & Management
- Ranking themes on co-word networks: Exploring the relationships among different metrics. Information Processing & Management
- Twitter mining for ontology-based domain discovery incorporating machine learning. Journal of Knowledge Management
- Big data, big insights? Advancing service innovation and design with machine learning. Journal of Service Research
- Unstructured data in marketing. Journal of the Academy of Marketing Science
- The dynamics of search engine marketing for tourist destinations. Journal of Travel Research
- Latent dirichlet allocation. Journal of Machine Learning Research
- A machine learning approach to research curation for investment process. Journal of Investment Management
- Machine learning techniques and statistical methods for business applications: Implications on big data gold rush. Advanced Science Letters
- Conversion potential: A metric for evaluating search engine advertising performance. Journal of Research in Interactive Marketing
Vignesh Yoganathan, PhD is a Senior Lecturer (Associate Professor) in Marketing at University of Bradford, whose research focuses on digital and responsible marketing/branding, particularly using experiments and multivariate statistics or modelling. He has worked with several commercial and non-profit organisations to improve customer insights and market strategies in the technological context.
Juan Corporan, BSc is the Lead Data Scientist at Banco Santa Cruz in the Dominican Republic and specialises in building predictive models for decision-making and developing data quality for data-driven business decisions. He regularly contributes to various expert forums, addressing questions of cutting-edge data science developments.
Bernard J. Jansen, PhD is a Principal Scientist in the social computing group of the Qatar Computing Research Institute, and Professor at the College of Science and Engineering, Hamad bin Khalifa University. He is the Editor-in-Chief of Information Processing & Management (Elsevier), and the former Editor-in-Chief of Internet Research (Emerald). He is also an adjunct professor with the College of Information Sciences and Technology at The Pennsylvania State University.
Soon-Gyo Jung, MSc is a Research Associate at the Qatar Computing Research Institute working in the area of computational social science. He has a background in web applications and software development and holds a master's degree in Electrical and Computer Engineering from Sungkyunkwan University in South Korea. He has published several articles in areas including information dissemination and audience segmentation.