Machine learning approach to auto-tagging online content for content marketing efficiency: A comparative analysis between methods and content type

doi:10.1016/j.jbusres.2019.04.018

Journal of Business Research

Volume 101, August 2019, Pages 203-217

https://doi.org/10.1016/j.jbusres.2019.04.018

Highlights

• Unstructured content across online platforms is a challenge for content marketers.
• Manual tagging is impractical, fallible, and unfeasible for evolving topics.
• Out of 3 multilabel classifications, Neural Network performs best (70% accuracy).
• Cross-channel validity is shown by tagging YouTube videos of the same news firm.
• This helps content-marketers gauge performance and create customer value-in-use.

Abstract

As complex data becomes the norm, content marketers need a greater understanding of machine learning (ML) applications. Unstructured data, scattered across platforms in multiple forms, impedes performance and user experience. Automated classification offers a solution. We compare three state-of-the-art ML techniques for multilabel classification - Random Forest, K-Nearest Neighbor, and Neural Network - to automatically tag and classify online news articles. Neural Network performs the best, yielding an F1 Score of 70% and providing satisfactory cross-platform applicability on the same organisation's YouTube content. The developed model can automatically label 99.6% of the unlabelled website content and 96.1% of the unlabelled YouTube content. Thus, we contribute to marketing literature via comparative evaluation of ML models for multilabel content classification, and cross-channel validation for a different type of content. Results suggest that organisations may optimise ML to auto-tag content across various platforms, opening avenues for aggregated analyses of content performance.

Introduction

Turning online content into structured data is important for content marketers, as structuring the content supports users' information consumption and sharing and therefore, from a commercial perspective, firm performance (Balducci & Marinova, 2018). For marketers and decision-makers, especially in firms dealing with online content (e.g., social media managers, editors, content producers), a higher-order understanding of content performance is crucial for competitive success, given the rising demand among users for personalised offerings (Kumar, 2018). Yet, making sense of online content performance to derive business value can be a daunting task: the data involved is complex in terms of volume and dynamics, it is fragmented across many channels, and it can be associated with many different metrics (Chun, 2018; Clarke & Jansen, 2017). Content classification (e.g., dividing the content into topics) is therefore a necessity, such that individual units of content are thematically aggregated to increase interpretability for decision-making in relation to content marketing activities such as content creation, dissemination, and management. Nonetheless, beyond the obvious impracticalities of time and effort involved, manually tagging online content with keywords is problematic for two main reasons: a) the tagging process is fallible owing to human error; and b) classification taxonomies can change over time as new topics emerge, especially given the vast quantity of online data generated daily. Consequently, online content often remains largely unstructured, with tags absent or incorrectly allocated (Kutlu, Elsayed, & Lease, 2018). Machine learning approaches have emerged as a potential solution to this problem and are increasingly applied in a variety of fields to uncover hidden insights by automating the classification process (Antons & Breidbach, 2018).

Even so, the application of machine learning approaches in marketing is still at a developmental stage, in need of refinement and insight (Balducci & Marinova, 2018; Sterne, 2018). In this research, we contribute to the marketing literature by: 1) Comparing three relevant approaches to automatically classify news articles based on web content from a major worldwide news and media organisation; 2) Developing and illustrating a neural network algorithm to address the multilabel classification issue in automatically classifying webpages containing news articles; and 3) Applying the same algorithm, without channel-specific training, on the same organisation's YouTube channel to test the generalisability of the approach. The latter evaluation is important for several reasons. Most notably, evaluation of the cross-channel applicability of automatic classification approaches is often not conducted in the research dealing with auto-tagging online content, which means that the generalisability of the models over time and in different channels is not properly addressed. Rather, researchers applying machine learning methods to this problem tend to utilise the test data from the same overall sample to evaluate their models' performance. Even though this practice is typical for evaluating a model's performance (i.e., machine learning models are tested such that training and test data are kept separate, so that the model does not “see” the test data prior to predicting it), the cross-sectional nature of data collection (i.e., the training and testing data belong to the same overall sample) makes it difficult to evaluate the model's true generalisability over time and in different channels. Therefore, by evaluating the cross-channel applicability of our model, we address the broader question: Are machine learning models developed for online content classification generalisable beyond the dataset they were trained and tested on? To address this question, we conduct a repeated test of the model on an independently collected dataset of the organisation's content, i.e., the titles and descriptions of the videos in the organisation's YouTube channel.

In addition to addressing a research gap within the automatic classification of online content, cross-channel applicability of tagging online content is highly important for organisations practically engaged in content marketing, as such organisations typically publish their content in multiple channels, including websites and social media such as Facebook, Twitter, YouTube, and LinkedIn. Thus, a classifier developed to tag content published in different channels needs to perform well in the multichannel environment that makes up the modern content marketer's marketing mix. With increasingly large, complex, and dynamic data becoming the basis of marketing decisions, it is ever more important to develop better methods of converting unstructured ‘big’ data into actionable information and insights (Syam & Sharma, 2018). Though the vast amount of available data is useful for training machine learning algorithms to make accurate predictions or classifications, developing the right approach can be challenging, not least because of the level of noise in the datasets and the diverse range of problems in relation to available technologies (Flake, Frasconi, Giles, & Maggini, 2004). On the whole, higher-level description of online content is important for machine-readability, model development, and statistically correlating topics to various key performance metrics of content marketing such as visitor statistics, development of content coverage over time, or the range of topics covered by various websites. Our aim is to address the gap in the extant marketing literature for more advanced and innovative methods (Hofacker, 2012; Kumar, 2018) by comparing machine learning approaches to the multilabel classification problem when classifying news articles, and by examining a high-performing machine learning model's cross-channel applicability for a different type of content.

By using data from a worldwide news organisation, we show that our approach yields an overall F1 Score of 70%, even with a large set of topics. We further visualise the development of news articles over time; provided the taxonomy is updated with at least some examples, our classification is robust to topic changes and new topics emerging over time. In addition, we evaluate cross-platform applicability by classifying the same organisation's YouTube videos and then manually reviewing the results via three human coders.
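For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of how it could be computed in a multilabel setting, assuming micro-averaging over all (article, keyword) pairs. The averaging scheme is an assumption, as this excerpt does not state which one was used, and the function name and keyword sets are hypothetical.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over per-article keyword sets (assumed averaging scheme)."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)  # keywords correctly predicted
        fp += len(pred - true)  # predicted keywords absent from the gold tags
        fn += len(true - pred)  # gold tags the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted keywords for three articles.
gold = [{"politics", "europe"}, {"sport"}, {"economy", "oil"}]
pred = [{"politics"}, {"sport", "football"}, {"economy", "oil"}]
score = micro_f1(gold, pred)
```

In this toy example four keywords are predicted correctly, one is spurious, and one is missed, so precision and recall are both 0.8.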

The remainder of the paper is organised as follows. First, we present an overview of the literature on machine learning applications in marketing, followed by a summary of the proposed solution strategy. Next, we explain the data exploration and preparation procedure. We then evaluate three classifiers: Random Forests, K-Nearest Neighbors, and Neural Network (NN); followed by a more detailed application of NN whereby data collected from one year (2017) is used for training and data collected from another year (2018) is used for testing. Based on this, keywords are generated for unclassified news articles using the developed approach. Subsequently, we evaluate the cross-channel applicability by classifying YouTube videos of the news organisation. Finally, we discuss implications and avenues for further research.

Section snippets

Machine learning in marketing and content classification

Machine learning is an umbrella term used to describe a variety of computer-based techniques for data mining to uncover complex patterns, particularly in large and complex datasets (Pereira, Plastino, Zadrozny, & Merschmann, 2018), with a view to deriving insights for prediction, classification, and decision-making purposes (Cui, Wong, & Lui, 2006). Particularly, in the context of a multiplicity of social media and user-generated content (UGC) platforms, the diversity of data, in both type and

Algorithm selection and data cleaning

Many algorithms are not well-optimised for dealing with the problem at hand, since they do not possess the inbuilt capability of handling multilabel classifications. There are alternative methods to train multilabel classifiers, such as training one model for each label. However, since we are predicting news keywords, which are numerous and diverse, this approach is not technically feasible. As such, we have opted to evaluate three algorithms that have inbuilt multilabel classification
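As an illustration of what inbuilt multilabel support typically expects as input, the sketch below encodes each article's keywords as one row of a binary indicator (multi-hot) matrix, so that a single model can be trained against all labels at once. The encoding the authors used is not stated in this excerpt; the function name and the sample keywords are hypothetical.

```python
def build_label_matrix(keyword_lists):
    """Return (sorted label vocabulary, multi-hot rows), one row per article."""
    labels = sorted({kw for kws in keyword_lists for kw in kws})
    index = {label: i for i, label in enumerate(labels)}
    matrix = []
    for kws in keyword_lists:
        row = [0] * len(labels)
        for kw in kws:
            row[index[kw]] = 1  # mark every keyword attached to this article
        matrix.append(row)
    return labels, matrix


# Hypothetical keywords for three articles.
articles_keywords = [["politics", "europe"], ["sport"], ["europe", "economy"]]
labels, Y = build_label_matrix(articles_keywords)
```

Each column of `Y` corresponds to one keyword, and an article can have any number of columns set to 1.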

Data collection and exploration

Al Jazeera is a global news and media organisation, headquartered in Doha, Qatar. The main website (aljazeera.com) attracts traffic from nearly 200 countries and regions and has had on average over 15 million visits in 2018, of which roughly 42% comes from search and another 44% is direct (SimilarWeb, 2018). We collected the data by scraping the content of Al Jazeera's main website that distributes news stories. The resulting dataset contains information about the article's content, its title,

Classifier models and evaluation

As mentioned previously, the models we can use are limited to those that support multilabel classification efficiently; that is, to avoid using multiple One-vs-Rest classifiers to create the model. Using multiple One-vs-Rest classifiers is computationally inefficient, because this entails creating one model per keyword, then using all models during prediction time (Read, Pfahringer, Holmes, & Frank, 2011). This means training a large number of models, which will only increase in number when the
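To make the cost of the One-vs-Rest approach concrete, the sketch below decomposes a multilabel problem into one binary target vector per keyword. Each vector would need its own model, so the number of models equals the number of distinct keywords. This is an illustration only; the function name and sample data are hypothetical.

```python
def one_vs_rest_targets(keyword_lists):
    """Map each keyword to its own binary target vector (one model per keyword)."""
    labels = sorted({kw for kws in keyword_lists for kw in kws})
    return {
        label: [1 if label in kws else 0 for kws in keyword_lists]
        for label in labels
    }


# Hypothetical keywords for three articles.
articles_keywords = [["politics", "europe"], ["sport"], ["europe", "economy"]]
targets = one_vs_rest_targets(articles_keywords)
n_models = len(targets)  # one binary classifier would be trained per entry
```

Even three short articles already require four separate binary models here, and every new keyword in the taxonomy adds another.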

Predicting keywords for news articles

As the first step in the process of predicting keywords, a total of 8160 articles missing their keywords were identified and converted into a TF-IDF matrix. Next, we use our trained model to predict which keyword(s) belong to each article. Since an article may have more than one keyword, the Neural Network computes a probability for each label to be present in an article; for selecting a label for an article, its probability must be ≥0.48. A specimen article, following keyword prediction, is
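The thresholding step described above can be sketched as follows, assuming the model outputs one independent probability per keyword. Only the 0.48 cut-off comes from the text; the function name and the probabilities are hypothetical.

```python
THRESHOLD = 0.48  # cut-off stated in the text


def select_keywords(label_probs, threshold=THRESHOLD):
    """Keep every keyword whose predicted probability is at least the threshold."""
    return sorted(kw for kw, p in label_probs.items() if p >= threshold)


# Hypothetical per-keyword probabilities for a single article.
probs = {"politics": 0.91, "europe": 0.48, "sport": 0.05, "economy": 0.47}
tags = select_keywords(probs)
```

Because each keyword is tested independently, an article may receive several tags, or none at all if no probability reaches the threshold.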

Discussion and implications

There has been an increasing shift in the field of marketing from conventional forms of content analysis to more advanced computational forms corresponding to the vastly increasing availability, complexity, and importance of data (Balducci & Marinova, 2018; Kumar, 2018). Meanwhile, a parallel development in relation to research methodology in marketing has been called for (Hofacker, 2012), so that innovative approaches may also contribute to greater advancements in marketing theory, especially

Limitations and suggestions for further research

One improvement to our study would be to obtain more data, more keywords, and more articles, to further expand and improve the capabilities of the model. A small number of articles remained unclassified (0.453% overall); to remedy this, we may either include more keywords during training or decrease the probability threshold for accepting predicted keywords. However, both approaches have their disadvantages, including an increase in false positives due to lowering the threshold for
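The trade-off around the probability threshold can be explored with a simple sweep. The sketch below measures the share of articles left without any keyword at a given threshold. The probabilities are invented for illustration and the function name is hypothetical.

```python
def unclassified_share(prob_rows, threshold):
    """Fraction of articles with no keyword probability at or above the threshold."""
    empty = sum(1 for row in prob_rows if max(row) < threshold)
    return empty / len(prob_rows)


# Hypothetical predicted probabilities: 4 articles x 3 keywords.
prob_rows = [
    [0.90, 0.10, 0.05],
    [0.45, 0.30, 0.20],
    [0.60, 0.55, 0.02],
    [0.20, 0.35, 0.40],
]
share_at_048 = unclassified_share(prob_rows, 0.48)
share_at_040 = unclassified_share(prob_rows, 0.40)
```

In this toy data, lowering the threshold from 0.48 to 0.40 leaves no article untagged, but the tags gained are exactly the low-confidence ones most likely to be false positives.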

Concluding remarks

Leveraging the benefits of machine learning applications in marketing and addressing the important need for such application for marketing research methods, this paper contributes to the literature by comparing three state-of-the-art algorithms for tagging online website content and establishing cross-platform applicability. We find that the Neural Network performs the best for multilabel classification, and the developed model was able to cope with changes in topics over time, which is salient

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Joni Salminen, PhD holds a PhD in Marketing from Turku School of Economics and is currently working as a postdoctoral researcher at Qatar Computing Research Institute. His expertise lies in the area of digital marketing and using (big) data for marketing applications, such as automatic profiling of user segments and gauging brand reputations using social media data.

References (61)

D.F. Davis et al. Does marketing research suffer from methods myopia? Journal of Business Research (2013)
N. Deng et al. Feeling a destination through the “right” photos: A machine learning model for DMOs' photo selection. Tourism Management (2018)
K. Ha et al. Response models based on bagging neural networks. Journal of Interactive Marketing (2005)
C.F. Hofacker. On research methods in interactive marketing. Journal of Interactive Marketing (2012)
B.E. Ilhan et al. Battle of the brand fans: Impact of brand attack and defense on social media. Journal of Interactive Marketing (2018)
S. Klapdor et al. Finding the right words: The influence of keyword characteristics on performance of paid search campaigns. Journal of Interactive Marketing (2014)
M. Kutlu et al. Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging. Information Processing & Management (2018)
H.-j. Lee et al. Semi-supervised response modeling. Journal of Interactive Marketing (2010)
X. Li et al. Exploring coherent topics by topic modeling with term weighting. Information Processing & Management (2018)
F.J. Martínez-López et al. Marketing intelligent systems for consumer behaviour modelling by a descriptive induction approach based on genetic fuzzy systems. Industrial Marketing Management (2009)
A.L. Montgomery et al. Prospects for personalization on the internet. Journal of Interactive Marketing (2009)
A. Orriols-Puig et al. A soft-computing-based method for the automatic discovery of fuzzy rules in databases: Uses for academic research and management support in marketing. Journal of Business Research (2013)
E. Papagiannopoulou et al. Local word vectors guiding keyphrase extraction. Information Processing & Management (2018)
R.B. Pereira et al. Correlation analysis of performance measures for multi-label classification. Information Processing & Management (2018)
G. Salton et al. Term-weighting approaches in automatic text retrieval. Information Processing & Management (1988)
J.P. Singh et al. Predicting the “helpfulness” of online consumer reviews. Journal of Business Research (2017)
N. Syam et al. Waiting for a sales renaissance in the fourth industrial revolution: Machine learning and artificial intelligence in sales research and practice. Industrial Marketing Management (2018)
S. Uchinaka et al. Classifying residents' roles as online place-ambassadors. Tourism Management (2019)
S. Vázquez et al. A classification of user-generated content into consumer decision journey stages. Neural Networks (2014)
F. Zarrinkalam et al. Mining user interests over active topics on social networks. Information Processing & Management (2018)
J. Zhang et al. The impact of metadata implementation on webpage visibility in search engine results (part II). Information Processing & Management (2005)
W. Zhao et al. Ranking themes on co-word networks: Exploring the relationships among different metrics. Information Processing & Management (2018)
B. Abu-Salih et al. Twitter mining for ontology-based domain discovery incorporating machine learning. Journal of Knowledge Management (2018)
D. Antons et al. Big data, big insights? Advancing service innovation and design with machine learning. Journal of Service Research (2018)
B. Balducci et al. Unstructured data in marketing. Journal of the Academy of Marketing Science (2018)
P. Bing et al. The dynamics of search engine marketing for tourist destinations. Journal of Travel Research (2010)
D.M. Blei et al. Latent dirichlet allocation. Journal of Machine Learning Research (2003)
S. Cates et al. A machine learning approach to research curation for investment process. Journal of Investment Management (2017)
S.-H. Chun. Machine learning techniques and statistical methods for business applications: Implications on big data gold rush. Advanced Science Letters (2018)
T.B. Clarke et al. Conversion potential: A metric for evaluating search engine advertising performance. Journal of Research in Interactive Marketing (2017)

Vignesh Yoganathan, PhD is a Senior Lecturer (Associate Professor) in Marketing at University of Bradford, whose research focuses on digital and responsible marketing/branding, particularly using experiments and multivariate statistics or modelling. He has worked with several commercial and non-profit organisations to improve customer insights and market strategies in the technological context.

Juan Corporan, BSc is the Lead Data Scientist at Banco Santa Cruz in the Dominican Republic and specialises in building predictive models for decision-making and developing data quality for data-driven business decisions. He regularly contributes to various expert forums, addressing questions of cutting-edge data science developments.

Bernard J. Jansen, PhD is a Principal Scientist in the social computing group of the Qatar Computing Research Institute, and Professor at the College of Science and Engineering, Hamad bin Khalifa University. He is the Editor-in-Chief of Information Processing & Management (Elsevier), and the former Editor-in-Chief of Internet Research (Emerald). He is also an adjunct professor with the College of Information Sciences and Technology at The Pennsylvania State University.

Soon-Gyo Jung, MSc is a Research Associate at the Qatar Computing Research Institute working in the area of computational social science. He has a background in web applications and software development and holds a master's degree in Electrical and Computer Engineering from Sungkyunkwan University in South Korea. He has published several articles in areas including information dissemination and audience segmentation.
