Elsevier

Journal of Business Research

Volume 101, August 2019, Pages 203-217
Machine learning approach to auto-tagging online content for content marketing efficiency: A comparative analysis between methods and content type

https://doi.org/10.1016/j.jbusres.2019.04.018

Highlights

  • Unstructured content across online platforms is a challenge for content marketers.
  • Manual tagging is impractical, fallible, and unfeasible for evolving topics.
  • Out of 3 multilabel classifications, Neural Network performs best (70% accuracy).
  • Cross-channel validity is shown by tagging YouTube videos of the same news firm.
  • This helps content-marketers gauge performance and create customer value-in-use.

Abstract

As complex data becomes the norm, content marketers need a greater understanding of machine learning (ML) applications. Unstructured data, scattered across platforms in multiple forms, impedes performance and user experience. Automated classification offers a solution to this. We compare three state-of-the-art ML techniques for multilabel classification - Random Forest, K-Nearest Neighbor, and Neural Network - to automatically tag and classify online news articles. Neural Network performs the best, yielding an F1 Score of 70% and providing satisfactory cross-platform applicability on the same organisation's YouTube content. The developed model can automatically label 99.6% of the unlabelled website and 96.1% of the unlabelled YouTube content. Thus, we contribute to marketing literature via comparative evaluation of ML models for multilabel content classification, and cross-channel validation for a different type of content. Results suggest that organisations may optimise ML to auto-tag content across various platforms, opening avenues for aggregated analyses of content performance.

Introduction

Turning online content into structured data is important for content marketers, because structured content supports users' information consumption and sharing and, from a commercial perspective, firm performance (Balducci & Marinova, 2018). For marketers and decision-makers, especially in firms dealing with online content (e.g., social media managers, editors, content producers), a higher order understanding of content performance is crucial for competitive success, given the rising demand among users for personalised offerings (Kumar, 2018). Yet, making sense of online content performance to derive business value can be a daunting task: the data involved is complex in terms of volume and dynamics, it is fragmented across many channels, and it can be associated with many different metrics (Chun, 2018; Clarke & Jansen, 2017). Content classification (e.g., dividing the content into topics) is therefore a necessity, such that individual units of content are thematically aggregated to increase interpretability for decision-making in relation to content marketing activities such as content creation, dissemination, and management. Nonetheless, beyond the obvious impracticalities of time and effort involved, manually tagging online content for keywords is problematic for two main reasons: a) the tagging process is fallible owing to human error; and b) classification taxonomies can change over time as new topics emerge, especially given the vast quantity of online data generated daily. Consequently, online content often remains largely unstructured, with tags either absent or incorrectly allocated (Kutlu, Elsayed, & Lease, 2018). Machine learning approaches have emerged as a potential solution to this problem and are increasingly applied in a variety of fields to uncover hidden insights by automating the classification process (Antons & Breidbach, 2018).

Even so, the application of machine learning approaches in marketing is still at a developmental stage, in need of refinement and insight (Balducci & Marinova, 2018; Sterne, 2018). In this research, we contribute to the marketing literature by: 1) Comparing three relevant approaches to automatically classify news articles based on web content from a major worldwide news and media organisation; 2) Developing and illustrating a neural network algorithm to address the multilabel classification issue in automatically classifying webpages containing news articles; and 3) Applying the same algorithm, without channel-specific training, on the same organisation's YouTube channel to test the generalisability of the approach. The latter evaluation is important for several reasons. Most notably, evaluation of the cross-channel applicability of automatic classification approaches is often not conducted in the research dealing with auto-tagging online content, which means that the generalisability of the models over time and in different channels is not properly addressed. Rather, researchers applying machine learning methods to this problem tend to utilise the test data from the same overall sample to evaluate their models' performance. Even though this practice is typical for evaluating a model's performance (i.e., machine learning models are tested such that training and test data are kept separate, so that the model does not “see” the test data prior to predicting it), the cross-sectional nature of data collection (i.e., the training and testing data belong to the same overall sample) makes it difficult to evaluate the model's true generalisability over time and in different channels. Therefore, by evaluating the cross-channel applicability of our model, we address the broader question: Are machine learning models developed for online content classification generalisable beyond the dataset they were trained and tested on? To address this question, we conduct a repeated test of the model on an independently collected dataset of the organisation's content, i.e., the titles and descriptions of the videos in the organisation's YouTube channel.

In addition to addressing a research gap within the automatic classification of online content, cross-channel applicability of tagging online content is highly important for organisations practically engaged in content marketing, as such organisations typically publish their content in multiple channels, including their websites and social media such as Facebook, Twitter, YouTube, and LinkedIn. Thus, a classifier developed to tag content published in different channels needs to perform well in the multichannel environment that makes up the modern content marketer's marketing mix. With increasingly large, complex, and dynamic data becoming the basis of marketing decisions, it is ever more important to develop better methods of converting unstructured ‘big’ data into actionable information and insights (Syam & Sharma, 2018). Though the vast amount of available data is useful for training machine learning algorithms to make accurate predictions or classifications, developing the right approach can be challenging, not least because of the level of noise in the datasets and the diverse range of problems in relation to available technologies (Flake, Frasconi, Giles, & Maggini, 2004). On the whole, higher level description of online content is important for machine-readability, model development, and statistically correlating topics to various key performance metrics of content marketing such as visitor statistics, development of content coverage over time, or the range of topics covered by various websites. Our aim is to address the gap in the extant marketing literature for more advanced and innovative methods (Hofacker, 2012; Kumar, 2018) by comparing machine learning approaches to dealing with the multilabel classification problem when classifying news articles and examining a high-performing machine learning model's cross-channel applicability for a different type of content.

By using data from a worldwide news organisation, we show that our approach yields an overall F1 Score of 70%, even with a large set of topics. We further visualise the development of news articles over time; provided the taxonomy is updated with at least some examples, our classification is robust to topic changes and new topics emerging over time. In addition, we evaluate cross-platform applicability by classifying the same organisation's YouTube videos and then manually reviewing the results via three human coders.

The remainder of the paper is organised as follows. First, we present an overview of the literature on machine learning applications in marketing, followed by a summary of the proposed solution strategy. Next, we explain the data exploration and preparation procedure. We then evaluate three classifiers: Random Forests, K-Nearest Neighbors, and Neural Network (NN); followed by a more detailed application of NN whereby data collected from one year (2017) is used for training and data collected from another year (2018) is used for testing. Based on this, keywords are generated for unclassified news articles using the developed approach. Subsequently, we evaluate the cross-channel applicability by classifying YouTube videos of the news organisation. Finally, we discuss implications and avenues for further research.
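The year-based evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (`year`, `keywords`), the function names, and the choice of micro-averaged F1 are assumptions made for illustration, since the averaging scheme behind the reported F1 Score is not stated in this section.

```python
def split_by_year(articles, train_year=2017, test_year=2018):
    """Temporal split: train on one year, test on another (hypothetical record layout)."""
    train = [a for a in articles if a["year"] == train_year]
    test = [a for a in articles if a["year"] == test_year]
    return train, test


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over all (article, keyword) decisions."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correctly predicted keywords
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))  # true but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data, invented for illustration only.
articles = [
    {"year": 2017, "keywords": {"politics"}},
    {"year": 2017, "keywords": {"sport"}},
    {"year": 2018, "keywords": {"politics", "economy"}},
    {"year": 2018, "keywords": {"sport"}},
]
train, test = split_by_year(articles)
true_sets = [a["keywords"] for a in test]
pred_sets = [{"politics"}, {"sport", "economy"}]  # stand-in for model predictions
score = micro_f1(true_sets, pred_sets)
```

Splitting by year rather than at random means the test articles come from a later period than anything the model saw, which is what allows topic drift to show up in the score.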

Section snippets

Machine learning in marketing and content classification

Machine learning is an umbrella term used to describe a variety of computer-based techniques for data mining to uncover complex patterns, particularly in large and complex datasets (Pereira, Plastino, Zadrozny, & Merschmann, 2018), with a view to deriving insights for prediction, classification, and decision-making purposes (Cui, Wong, & Lui, 2006). Particularly, in the context of a multiplicity of social media and user-generated content (UGC) platforms, the diversity of data, in both type and

Algorithm selection and data cleaning

Many algorithms are not well-optimised for dealing with the problem at hand, since they do not possess the inbuilt capability of handling multilabel classifications. There are alternative methods to train multilabel classifiers, such as training one model for each label. However, since we are predicting news keywords, which are numerous and diverse, this approach is not technically feasible. As such, we have opted to evaluate three algorithms that have inbuilt multilabel classification

Data collection and exploration

Al Jazeera is a global news and media organisation, headquartered in Doha, Qatar. The main website (aljazeera.com) attracts traffic from nearly 200 countries and regions and has had on average over 15 million visits in 2018, of which roughly 42% comes from search and another 44% is direct (SimilarWeb, 2018). We collected the data by scraping the content of Al Jazeera's main website that distributes news stories. The resulting dataset contains information about the article's content, its title,

Classifier models and evaluation

As mentioned previously, the models we can use are limited to those that support multilabel classification efficiently; that is, to avoid using multiple One-vs-Rest classifiers to create the model. Using multiple One-vs-Rest classifiers is computationally inefficient, because this entails creating one model per keyword, then using all models during prediction time (Read, Pfahringer, Holmes, & Frank, 2011). This means training a large number of models, which will only increase in number when the
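To make the cost argument concrete, the following Python sketch contrasts the two target layouts: a single label-indicator matrix, as consumed by a natively multilabel model, and the per-keyword binary targets that a One-vs-Rest decomposition would need. The function names and toy keywords are hypothetical and only illustrate the bookkeeping, not the study's pipeline.

```python
def label_indicator_matrix(label_sets):
    """One row per article, one 0/1 column per keyword (single multilabel target)."""
    labels = sorted(set().union(*label_sets))
    matrix = [[1 if lab in s else 0 for lab in labels] for s in label_sets]
    return labels, matrix


def one_vs_rest_targets(label_sets):
    """One separate binary target vector per keyword, i.e. one model per keyword."""
    labels, matrix = label_indicator_matrix(label_sets)
    return {lab: [row[j] for row in matrix] for j, lab in enumerate(labels)}


# Toy keyword sets, invented for illustration only.
article_keywords = [{"politics", "economy"}, {"sport"}, {"politics"}]
labels, matrix = label_indicator_matrix(article_keywords)
ovr = one_vs_rest_targets(article_keywords)
n_models_needed = len(ovr)  # grows with every new keyword in the taxonomy
```

With a natively multilabel model the whole `matrix` is fitted once, whereas the One-vs-Rest route needs `n_models_needed` separate fits and the same number of predictions per article.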

Predicting keywords for news articles

As the first step in the process of predicting keywords, a total of 8160 articles missing their keywords were identified and converted into a TF-IDF matrix. Next, we use our trained model to predict which keyword(s) belong to each article. Since an article may have more than one keyword, the Neural Network computes a probability for each label to be present in an article; for selecting a label for an article, its probability must be ≥0.48. A specimen article, following keyword prediction, is
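As a minimal sketch of the two steps described here (not the authors' implementation), the Python code below builds a TF-IDF matrix using one common weighting variant and then applies the 0.48 probability cut-off. The function names, the toy documents, and the probability values are illustrative assumptions; in the study the probabilities come from the trained Neural Network.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """TF-IDF with tf = count / doc length and idf = ln(N / df); other variants exist."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        rows.append([(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def select_labels(label_probs, threshold=0.48):
    """Keep every keyword whose predicted probability is at least the threshold."""
    return sorted(label for label, p in label_probs.items() if p >= threshold)


# Toy documents and probabilities, invented for illustration only.
vocab, rows = tfidf_matrix(["election results announced", "football results announced"])
tags = select_labels({"politics": 0.91, "sport": 0.48, "economy": 0.12})
untagged = select_labels({"politics": 0.30, "sport": 0.10})
```

An article for which no keyword reaches the threshold, like `untagged` here, ends up with an empty list; this is how an article can remain unclassified under a fixed cut-off.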

Discussion and implications

There has been an increasing shift in the field of marketing from conventional forms of content analysis to more advanced computational forms corresponding to the vastly increasing availability, complexity, and importance of data (Balducci & Marinova, 2018; Kumar, 2018). Meanwhile, a parallel development in relation to research methodology in marketing has been called for (Hofacker, 2012), so that innovative approaches may also contribute to greater advancements in marketing theory, especially

Limitations and suggestions for further research

One improvement to our study would be to obtain more data, more keywords, and more articles, to further expand and improve the capabilities of the model. A small number of articles remained unclassified (0.453% overall); to remedy this, we may either include more keywords during training, or decrease the probability threshold for accepting predicted keywords. However, both approaches have their disadvantages, including an increase in false positives due to lowering the threshold for

Concluding remarks

Leveraging the benefits of machine learning applications in marketing and addressing the important need for such application for marketing research methods, this paper contributes to the literature by comparing three state-of-the-art algorithms for tagging online website content and establishing cross-platform applicability. We find that the Neural Network performs the best for multilabel classification, and the developed model was able to cope with changes in topics over time, which is salient

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (61)

  • A.L. Montgomery et al. Prospects for personalization on the internet. Journal of Interactive Marketing (2009)
  • A. Orriols-Puig et al. A soft-computing-based method for the automatic discovery of fuzzy rules in databases: Uses for academic research and management support in marketing. Journal of Business Research (2013)
  • E. Papagiannopoulou et al. Local word vectors guiding keyphrase extraction. Information Processing & Management (2018)
  • R.B. Pereira et al. Correlation analysis of performance measures for multi-label classification. Information Processing & Management (2018)
  • G. Salton et al. Term-weighting approaches in automatic text retrieval. Information Processing & Management (1988)
  • J.P. Singh et al. Predicting the “helpfulness” of online consumer reviews. Journal of Business Research (2017)
  • N. Syam et al. Waiting for a sales renaissance in the fourth industrial revolution: Machine learning and artificial intelligence in sales research and practice. Industrial Marketing Management (2018)
  • S. Uchinaka et al. Classifying residents' roles as online place-ambassadors. Tourism Management (2019)
  • S. Vázquez et al. A classification of user-generated content into consumer decision journey stages. Neural Networks (2014)
  • F. Zarrinkalam et al. Mining user interests over active topics on social networks. Information Processing & Management (2018)
  • J. Zhang et al. The impact of metadata implementation on webpage visibility in search engine results (part II). Information Processing & Management (2005)
  • W. Zhao et al. Ranking themes on co-word networks: Exploring the relationships among different metrics. Information Processing & Management (2018)
  • B. Abu-Salih et al. Twitter mining for ontology-based domain discovery incorporating machine learning. Journal of Knowledge Management (2018)
  • D. Antons et al. Big data, big insights? Advancing service innovation and design with machine learning. Journal of Service Research (2018)
  • B. Balducci et al. Unstructured data in marketing. Journal of the Academy of Marketing Science (2018)
  • P. Bing et al. The dynamics of search engine marketing for tourist destinations. Journal of Travel Research (2010)
  • D.M. Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
  • S. Cates et al. A machine learning approach to research curation for investment process. Journal of Investment Management (2017)
  • S.-H. Chun. Machine learning techniques and statistical methods for business applications: Implications on big data gold rush. Advanced Science Letters (2018)
  • T.B. Clarke et al. Conversion potential: A metric for evaluating search engine advertising performance. Journal of Research in Interactive Marketing (2017)

    Joni Salminen, PhD holds a PhD in Marketing from Turku School of Economics and is currently working as a postdoctoral researcher at Qatar Computing Research Institute. His expertise lies in the area of digital marketing and using (big) data for marketing applications, such as automatic profiling of user segments and gauging brand reputations using social media data.

    Vignesh Yoganathan, PhD is a Senior Lecturer (Associate Professor) in Marketing at University of Bradford, whose research focuses on digital and responsible marketing/branding, particularly using experiments and multivariate statistics or modelling. He has worked with several commercial and non-profit organisations to improve customer insights and market strategies in the technological context.

    Juan Corporan, BSc is the Lead Data Scientist at Banco Santa Cruz in the Dominican Republic and specialises in building predictive models for decision-making and developing data quality for data-driven business decisions. He regularly contributes to various expert forums, addressing questions of cutting-edge data science developments.

    Bernard J. Jansen, PhD is a Principal Scientist in the social computing group of the Qatar Computing Research Institute, and Professor at the College of Science and Engineering, Hamad bin Khalifa University. He is the Editor-in-Chief of the Information Processing & Management (Elsevier), and the former Editor-in-Chief of Internet Research (Emerald). He is also an adjunct professor with the College of Information Sciences and Technology at The Pennsylvania State University.

    Soon-Gyo Jung, MSc is a Research Associate at the Qatar Computing Research Institute working in the area of computational social science. He has a background in web applications and software development and holds a master's degree in Electrical and Computer Engineering from Sungkyunkwan University in South Korea. He has published several articles in areas including information dissemination and audience segmentation.