Jonas Hyllseth Ryen
Predicting Customer Churn with Data Analytics
How to use data analytics to indicate customer churn for a small online-based company.
Master’s thesis in Engineering and ICT
Supervisor: Bjørn Haugen
June 2020
Norwegian University of Science and Technology Faculty of Engineering
Department of Mechanical and Industrial Engineering
Summary
Effective use of data is increasingly becoming a competitive advantage for businesses.
Although a wide range of studies has been conducted to understand the benefits of applying data-driven knowledge, few of them provide practical approaches. While large enterprises have the resources to develop a practical approach themselves, many small businesses struggle to find the time and resources to solve problems with data analytics.
This thesis presents a concrete approach to solve problems through data analytics for small online-based companies. The online tutoring agency Learnlink was chosen as the subject.
Customer, student and tutor data was analyzed in order to solve one of Learnlink’s hardest problems: how to indicate premature cancellations for individual customers, also referred to as customer churn.
The methods that were applied to achieve the main objective can be summarized as collect, analyze and evaluate. Relevant data was collected from the company’s database, payment solution and customer relationship system, assembled and cleaned, and made available for analysis in Google Data Studio visualizations. With a near real-time analytics tool at hand, an exploratory analysis was conducted in order to investigate patterns and correlations in the data. Apparent relations were tested for significance before being integrated into four different prediction models. By classifying customers based on indications of premature cancellation and comparing the predicted classes with actual churn, the results from the analysis could be evaluated.
Throughout the analysis, it became clear that demographic, behavioural and tutor data could all provide indications of whether a customer was likely to prematurely cancel their subscription. The analytics tool developed in this thesis should be sufficient to serve as a foundation for solving further business problems with data, as well as to help mitigate customer churn for the subject company. Furthermore, other small companies should be able to implement the same system in order to solve problems with data. For future work, the results can be improved by introducing more data sources and experimenting with a wider variety of prediction models.
Sammendrag
More and more companies gain a competitive advantage by using data effectively. Although many studies have been conducted to understand the benefits of applying data-driven knowledge, few of them describe practical approaches for doing so. Large companies often have enough resources to develop such practical approaches on their own, but many small businesses struggle to set aside enough time and resources to facilitate problem solving through data analysis. This thesis presents a concrete approach that small businesses delivering online services can use to solve problems through data analysis. The online tutoring company Learnlink was chosen as the basis for the thesis. Data from customer relationships, students and tutors was analyzed to solve one of the hardest challenges Learnlink faces: how to obtain indications of which customers will end the customer relationship too early, usually called churn.
The methods used in this thesis can be summarized as collect, analyze and evaluate. Relevant data was collected from the company’s database, payment solution and customer follow-up system, merged and cleaned, and made available for analysis through visualizations in Google Data Studio. With an analytics tool that showed data in near real time, an exploratory analysis could be carried out to find patterns and correlations in the data. Apparent relations were tested for significance before being integrated into four different prediction models. By classifying customers based on the indications of early cancellation and comparing these classes with actual churn, the results from the exploratory analysis could be evaluated.
The analysis showed that demographic data, behaviour and data related to the tutor could all indicate whether a customer had a high probability of ending the customer relationship too early.
The analytics tool developed through the work on this thesis should be good enough to serve as a foundation for data-driven problem solving at Learnlink, and to contribute to the work of reducing churn. Other small companies will also be able to build the same system to solve problems with data. For future work, the results can be improved by adding more data sources and experimenting with a broader range of prediction models.
Preface
This thesis was developed at the Department of Mechanical and Industrial Engineering at the Norwegian University of Science and Technology (NTNU) during the spring of 2020. Some preliminary research was completed during the fall of 2019.
I thank my supervisor, Bjørn Haugen, for providing helpful advice and guidance throughout the writing process, and for his willingness to engage in a subject that is unusual for a master’s thesis in this department. I am also grateful to NTNU for providing the resources to complete the thesis. Finally, I would like to thank the employees of Learnlink AS for contributing background knowledge, help and feedback throughout the development process, and especially Chief Technology Officer Johannes Berggren for facilitating the technical assistance that was necessary to complete the thesis.
Jonas Hyllseth Ryen, author
Table of Contents
Summary
Sammendrag
Preface
Table of Contents
List of Figures
Nomenclature
1 Introduction
1.1 Problem description
1.2 Objectives
1.2.1 Main objective
1.2.2 Secondary objectives
1.3 Limitations
1.4 Privacy
2 Background
2.1 Empirical focus
2.1.1 Gap in literature
2.1.2 Subject company and problem
2.1.3 Learnlink AS
2.1.4 Interesting attributes regarding a tutoring company
2.2 Methods
2.2.1 Interviews
2.2.2 Practical implementation of data analytics dashboards
2.2.3 Exploratory analysis
2.2.4 Evaluating the analysis through prediction
2.3.2 Firestore SQL export
2.3.3 Learnlink platform architecture
2.3.4 Data flow
2.4 Data
2.4.1 Data collection
2.4.2 Tables
2.4.3 Excluded data
2.5 Defining churn
2.5.1 Adjusting our definition to obtain prediction results
2.6 Prediction model
2.6.1 Prediction model criteria
2.6.2 Classification
2.6.3 Regression models
3 Results
3.1 Data analytics dashboards for customer churn and lifetime
3.1.1 Phase 1 - Research and planning
3.1.2 Phase 2 - Implementation
3.2 Exploratory analysis
3.2.1 Natural churn
3.2.2 Trends by season
3.2.3 Trends by demographics and subjects
3.2.4 Trends throughout the customer lifespan
3.2.5 Tutor attributes effect on lifetime
3.3 Prediction results
3.3.1 Decision tree classification
3.3.2 Decision tree algorithm performance
3.3.3 Classification model with logistic regression in BigQuery
4 Discussion
4.1 Building data analytics dashboards for an early-stage company
4.1.1 Prerequisites for building data infrastructure
4.1.2 Choosing tools
4.1.3 Structuring data in BigQuery
4.1.4 Errors and changes
4.1.5 Generalization of data dashboard development
4.2 Exploratory analysis
4.2.1 Natural churn
4.2.2 Trends by season
4.2.3 Trends by demographics
4.2.4 Trends relating to events during the customer lifespan
4.2.5 Tutor attributes and behaviour as an indication of lifetime
4.2.6 Overall observations from the exploratory analysis
4.3 Prediction
4.3.1 Natural churn decision tree
4.3.2 Churn decision tree
4.3.3 Logistic regression
4.3.4 Overall prediction results
5 Conclusion
5.0.1 Errors, biases and improvements
5.0.2 Applications
5.1 Further work
5.1.1 Further develop the prediction models
5.1.2 Analysis of virtual classroom data
Bibliography
Appendix
A BigQuery table queries
A.1 Churn
A.2 Lessons and reports
A.3 Tutors
B Prediction
B.1 Decision trees
B.2 Logistic regression in BigQuery
List of Figures
2.1 An overview of the methods used for writing this thesis.
2.2 Overview showing how parts of the Learnlink software platform send data to the other parts.
2.3 The data flow from Learnlink’s data sources to data dashboards.
2.4 A standard decision tree.
2.5 The characteristic S-curve of the sigmoid function, used in logistic regression [1].
3.1 A sample page from each of the three dashboards used in the exploratory analysis.
3.2 Students grouped by problem. In this analysis, only retakingExam is relevant, showing the ratio of students who have already finished school and are retaking exams to improve grades.
3.3 Median lifetime, in days, shown by which month the last lesson occurred.
3.4 Median lifetime, in lessons, shown by which month the first lesson occurred.
3.5 Acquisition channel as indicator of lifetime.
3.6 Decision maker as indicator of lifetime.
3.7 Lifetime shown by level.
3.8 Lifetime by subject, for elementary school students.
3.9 Lifetime by subject, for middle school students. Subjects with little relevance are not labeled.
3.10 Lifetime by subject, for high school students. Labels and values are only displayed for subjects with significant results.
3.11 Motivation shown by number of months after first lesson.
3.12 Motivation shown by number of months before last lesson (churn).
3.13 Lifetime by motivation, grouped.
3.14 Average number of sessions shown by number of months after first lesson.
3.15 Average number of sessions shown by number of months before last lesson (churn).
3.16 [...] indicates that the homework was unfinished.
3.17 Average homework completion shown by number of months before last lesson (churn), where 1 indicates that all homework has been completed while 0 indicates that the homework was unfinished.
3.18 Lifetime in relation to homework assignment and completion.
3.19 Lifetime based on median difficulty level in lesson.
3.20 Average difficulty level shown by number of months after first lesson.
3.21 Tutor experience (in number of students) and indication on lifetime.
3.22 Tutor experience (in number of students) and indication on lifetime.
3.23 Natural churn decision tree.
3.24 High risk decision tree.
3.25 Low risk decision tree.
3.26 Confusion matrix from the logistic regression evaluation set.
3.27 Performance metrics from the logistic regression evaluation set.
Nomenclature
Churn = When a customer cancels the customer relationship.
Churn rate = Percentage of total customer base lost during a certain time period.
Fast churn = A premature cancellation of the customer relationship.
Data analytics = Visualisation and interpretation of data.
Prediction = Forecasting values based on previous events.
Classification = Labeling items based on predicted values.
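The churn rate definition above can be written out as a small calculation. The following is a minimal sketch and not code from the thesis: the function name `churn_rate` and the example numbers are hypothetical, and it assumes that the rate is measured against the customer base at the start of the period.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the percentage of the starting customer base lost in a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical example: 14 of 200 customers cancelled during the period.
print(churn_rate(200, 14))  # 7.0
```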
Chapter 1
Introduction
1.1 Problem description
Effective use of data is increasingly becoming a competitive advantage for businesses [2][3]. Data provides knowledge and lets companies make quick decisions based on facts. Access to frequently updated and detailed data enables companies to make more informed decisions faster than their competitors. Many companies utilize this to build a company-wide culture that is data-driven, where the goal is that all employees are empowered by data in their day-to-day work [4].
Companies providing services through online channels have the opportunity to gather vast amounts of data from their users. This data ranges from visitor data (non-registered users) and data delivered by third parties like Google or Facebook to usage data from registered users and data gathered through offline communication like phone calls. In the age of big data [5], even small companies generate data that far exceeds what their computing power, available storage space and human resources are ready to handle. In order to make this data useful, companies need to establish tactics for retrieving, collecting, cleaning, analyzing and storing data.
Data is, however, most useful when applied to solve problems or to cover knowledge gaps. Recent developments have made the power of data analytics available not only to large enterprises with departments dedicated to data processing, but also to small companies with tight budgets [6]. Cloud-based databases updated in real time, advanced plug-and-play analytics software offered online and pre-trained machine learning algorithms are all available today, starting at a few dollars per month. Today’s challenge for a small company is rather limited data, shortage of time and integration with business strategy [7]. In order to justify spending time on data analytics, one has to decide: what problems are most important to solve, and how do we proceed to solve them?
One of the most common, yet hard-to-solve problems of consumer-product businesses is customer churn, or the premature cancellation of a customer relationship. Churn is when a customer stops buying a product from the company, like when a Netflix subscription is cancelled or you switch from your local grocery store to the new supermarket. The word retention is also often used when addressing this problem, and refers to retaining customers; in other words, retention is the opposite of churn. Businesses often rely on recurring payments from the same customers, and keeping customers from leaving or switching to a competitor can be the key to a more sustainable business model. Customer churn has a direct impact on profits and is an indicator of customer satisfaction, and it is often more lucrative to retain customers than to acquire new ones. Consequently, finding ways to understand, predict and mitigate customer churn is an important activity for most companies.
During the phase of early growth, companies can sustain high churn rates and still grow fast. However, as the number of customers grows, newly acquired customers make up an ever smaller part of the total. Sales and marketing activities become less important for sustaining overall revenue than stopping existing customers from leaving. Hence, retaining customers becomes a priority for mature companies; it is more profitable to keep existing customers than to lose them and reach new ones through marketing. Because lowering the churn rate becomes more important as the customer base grows, large gains can be achieved by lowering it while the company is still small, before the problem escalates.
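The growth argument can be illustrated with a simple simulation. This is a sketch under stated assumptions rather than a model from the thesis: it assumes a constant number of new customers per month and a constant monthly churn rate, and the function name, field names and numbers are all hypothetical.

```python
def simulate_customer_base(months, new_per_month, monthly_churn, start_customers=0.0):
    """Simulate a customer base with constant acquisition and constant churn.

    Each month, a fraction `monthly_churn` of the existing customers leaves
    and `new_per_month` new customers are added.
    """
    if new_per_month <= 0:
        raise ValueError("new_per_month must be positive")
    if not 0 <= monthly_churn <= 1:
        raise ValueError("monthly_churn must be between 0 and 1")
    rows = []
    customers = float(start_customers)
    for month in range(1, months + 1):
        lost = customers * monthly_churn
        customers = customers - lost + new_per_month
        rows.append({
            "month": month,
            "customers": customers,
            "lost": lost,
            # Share of the customer base that was acquired this month.
            "new_share": new_per_month / customers,
        })
    return rows


# Hypothetical numbers: 100 new customers per month, 10 % monthly churn.
for row in simulate_customer_base(24, 100, 0.10)[::6]:
    print(row["month"], round(row["customers"]), round(row["lost"], 1),
          round(row["new_share"], 2))
```

Under these assumptions, the share of newly acquired customers falls month by month while the number of customers lost each month approaches the number acquired, which is the point at which growth stalls unless churn is reduced.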
Churned customers share one common trait across any business: they have all been customers at some point. This means that they have interacted with the company as customers and left information about their behaviour, their purchases, their interactions with customer support and often personal information. As customers who churn are by definition the ones a company has collected the most data about, finding out why customers churn is a sensible problem to solve with data analytics.
The purpose of this thesis is to contribute to solving one of the most difficult problems for a small online-based company through data analytics. Through the writing of this thesis, the goal was to not only help with the churn problem, but to also establish a foundation for data-driven problem solving for the subject company. Learnlink, an online tutoring company founded in 2016, was chosen as the subject. We investigated customer and tutor data, and used this to establish a framework for classifying the risk for individual customers to churn. The goal was that by the completion of the thesis, the Learnlink team could use this knowledge to impose measures for high-risk customers before churn, and thus improve customer loyalty, reduce churn rates and build a more sustainable business.
1.2 Objectives
1.2.1 Main objective
Build a data analytics tool for a small company to detect whether a customer relationship is on track or is likely to be prematurely cancelled by the customer.
1.2.2 Secondary objectives
In order to accomplish our main objective, the technical infrastructure must be in place. We will need to be able to navigate data from the database and other data sources easily and to make visualisations to observe patterns and trends. A prerequisite for drawing the right conclusions is that irregularities and errors are removed - data sets need to be clean. After cleaning and structuring data from all data sources, we need to determine exactly what to extract from our data: a clear, unambiguous definition of our key term churn must be established. After completing this preliminary work, we will be ready to embark on the actual analysis; we will investigate the data through visualizations, looking for patterns and correlations. Relations will be checked for significance. In order to assess the practical value of the knowledge we derive from the analysis, we will build a prediction model based on these assumptions and test the model against actual events. After completing this step, we will know whether our main objective is achieved.
1. Choose data sources and establish connections between the necessary software for collecting and cleaning the relevant data.
2. Establish a robust definition of churn and a measure that can indicate whether the churn is premature.
3. Perform an exploratory analysis of the data set with focus on differences between fast-churn and slow-churn customers.
4. Choose variables that best describe the indication for a premature cancellation.
5. Set up prediction models based on the findings in the exploratory analysis.
6. Evaluate predictions by comparing to actual events.
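The last objective, comparing predictions with actual events, can be sketched as a confusion-matrix calculation. This is a minimal illustration and not the evaluation code used in the thesis: the function names and the example labels are hypothetical, and `True` is assumed to mean that a customer churned (or was predicted to churn).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, where True means churn."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1  # predicted churn, customer churned
        elif p:
            counts["fp"] += 1  # predicted churn, customer stayed
        elif a:
            counts["fn"] += 1  # predicted stay, customer churned
        else:
            counts["tn"] += 1  # predicted stay, customer stayed
    return counts


def precision_recall(counts):
    """Return (precision, recall) for the churn class."""
    flagged = counts["tp"] + counts["fp"]
    churned = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / churned if churned else 0.0
    return precision, recall


# Hypothetical labels for six customers.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
print(counts, precision_recall(counts))
```

Precision answers how many of the flagged customers actually churned, while recall answers how many of the churned customers were flagged; which of the two matters more depends on the cost of contacting a customer unnecessarily versus missing one who leaves.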
1.3 Limitations
The thesis should be regarded as preliminary work that establishes a method, routines and a framework for gathering data and analyzing customer churn. Results will be limited by the amount of data that is available. The end result will be a set of indicators, not predictors, and the system that will be built will require maintenance and continuous improvement in order to reach its full potential.
This thesis is reliant on access to customer data. The actual gathering of data is performed by the technical team in the company and is not a subject of this thesis. Thus, the thesis is constrained by the tools and architectural choices made by the Learnlink team.
The most significant constraint is assumed to be the choice of database. The author of this thesis is not entitled to request significant changes to the technical stack for the Learnlink platform. Similarly, the technical team in Learnlink has made some choices regarding the structure of the data flow. Segment.io has been chosen as the platform for routing data to the desired destination, as it is cost-efficient and easy to use. Apart from this, most services regarding data are chosen from Google’s BI platform in order to make integration smoother and keep costs low.
There are many constraints on the data that is collected for this thesis. As described in more detail in the Background section, the nature of the business has changed dramatically over the past years. The company has completely changed its business model two times during the analysis period, and has experienced rapid growth that results in a customer base with an unusually high ratio of recently acquired customers to the total customer base. The GDPR privacy regulations also impose limitations on our analysis, as some data that might have been interesting has to be excluded from the set, including geographical data. Performing this analysis is both challenging and very interesting due to these constraints, but there is little doubt that modifications to the resulting prediction models will be necessary after the thesis is completed. This thesis and the corresponding data models should be seen as foundational work that can be continued by the subject company after completion. Our thesis is also limited by the observation period when it comes to comparing actual events with predictions. The observation period is set to approximately 1.5 months, from mid-April to the end of May 2020.
Given the wide variety of prediction models and algorithms, applying them to predicting churn is worthy of a thesis in itself; prediction modelling from data is a broad and developed field. Even though the last two objectives of this thesis are concerned with establishing predictions, they should be viewed merely as a form of validation of the findings from the analysis. Data analysis is the main focus in this thesis. The priority is neither to build advanced prediction models nor to evaluate, compare or provide an overview of the prediction models available for solving similar problems. The purpose of prediction in this thesis is not to advance the field but to utilize it for validation. Consequently, our investigation of prediction modelling will be limited.
1.4 Privacy
All data used in this project was anonymised before the analysis. Learnlink AS is General Data Protection Regulation (GDPR) compliant and anonymises data on request from customers and after inactivity. Historical data used in this thesis has been stripped of all data points that could be used to directly or indirectly identify individuals. No personal information for individual customers is included in this thesis, and it will not be possible to trace any information back to individual students or tutors based on the information collected here.
Chapter 2
Background
2.1 Empirical focus
2.1.1 Gap in literature
A wide range of literature regarding data analytics, and especially big data, has become available over the past years, but there is little research describing how to implement the available solutions in businesses [8]. Most research and literature is focused on the underlying concepts and not on the implementation and the challenges of using data analytics in practice. Large enterprises struggle to integrate data analytics into the overall strategy [9]. Startups struggle to integrate data analytics into their development, even though they are aware of the opportunities and regard them as valuable. It seems that many companies regard analytics as something that should be postponed and dealt with in the future [7]. It is difficult to find literature suggesting that data is not useful or that it should not be important for small companies. The gap in recommendations and step-by-step methods for using data analytics in practice is worthy of attention.
2.1.2 Subject company and problem
We will now elaborate on the choice of Learnlink AS as the subject. Young companies face a significant challenge when aiming to solve problems through data analytics. They are often in a constant process of change as the company structure is developing, and the management tools have to be developed accordingly. They have tight budgets and limited time. A large and established company is able to use external consultants to set up data infrastructure, while early-stage companies have neither the financial resources nor human resources to make large investments in data infrastructure. They are in need of finding a cost-efficient, flexible and easy way to visualize their data.
As a company that delivers service through online channels, Learnlink is a good fit for a data analytics project. The company has already been gathering and storing data for years, has basic data infrastructure in place and has come a long way in creating a data-driven culture among employees. The technical team knows the challenges at hand, but like many other early-stage companies, they have strict development deadlines and limited resources. Thus they can assist in the project with experience, advice and knowledge, but are prevented from carrying out the project themselves. Nonetheless, the importance of forecasting their income stream and getting the problem under control ensures that the project will receive the necessary attention and assistance.
The topic of churn rates was chosen partly based on inquiries from Learnlink management and partly due to the application a solution might have for other similar companies. As mentioned in the problem statement, the churn rate is a core part of the business model. Other problems that were considered were demand prediction, evaluating tutoring lesson quality and analyzing variations in the effectiveness of sales calls. Below follow comments from Learnlink management on the problem chosen for this thesis.
Understanding and predicting churn rates is crucial to our financial planning. A few percentage points increase in the churn rate can be the difference between profitability and bankruptcy in the long run. It determines how much we are able to spend for marketing and the number of hires we can make. Nonetheless, churn rates have shown to be hard to predict, especially with the rapid changes we are making to our service and to the business model. Industry standards do not suffice as we have an offering that is significantly different from our competitors.
Decreasing the churn rate is one of the main challenges for the Learnlink team at the moment. Most of the measures we have taken in the past have not had any effect. Decisions on how to mitigate churn are too often based on gut feelings rather than hard facts, and we struggle to separate the effective measures from the insignificant ones.
We hope that this thesis will uncover patterns that can be used to detect whether a customer has a high risk for churning and obtain better predictability for our company.
- Product manager, Learnlink AS[10]
Utilizing the power of data analytics has always been an ambition, but never the highest priority for our team. The development team is constantly entangled in fixing issues and improving business-critical features, and with only two full-time developers we are not able to allocate the time that is required to dig as deep as we want into building data infrastructure. Nevertheless, we have acknowledged that leveraging data will be crucial for us in the future and are currently collecting all the data we can. A few projects in the past have been successful with preliminary work to establish data flows and automatically updated dashboards. Our most common key performance indicators are updated continuously in a Google Data Studio dashboard, and we use these for guiding our priorities and making decisions. Another obstacle has been using statistics to separate irrelevant data from real patterns and to single out the most important questions to query. Data science is not really an engineering problem but a business problem, although the technical team has to be involved. I hope that this project will establish a foundation for harvesting knowledge from our data in the future and that we can continue the work on both the churn analysis and other components after completion.
- Chief technology officer, Learnlink AS [10]
2.1.3 Learnlink AS
Learnlink was founded in 2016 by three university students who experienced the tutoring market in Norway as inefficient and underdeveloped. Tutoring was only available close to universities, and companies operating in the market had not utilized the developments in online education tools. By 2019, Learnlink had grown to be the fourth largest tutoring provider in the country and the top-grossing provider of online tutoring services for students in elementary through high school. According to the company’s management, Learnlink had revenue of 5 MNOK from 11 000 tutoring lessons in 2019, employing 150 university students as tutors.
As of May 2020, Learnlink had six full-time and two part-time employees. Their business is organized through Learnlink.no, a two-sided web platform connecting students with qualified tutors. Students are often represented by parents, who pay the bills and often decide how frequently lessons take place. This way, tutoring through Learnlink involves four different stakeholders: the students, who aspire to advance in school; parents, who want their children to advance; tutors, who help students and get paid; and the Learnlink team, which works as a quality-assuring intermediary.
Tutors
Learnlink tutors are university students with strong academic results and a desire to maintain a steady income from a flexible and relevant part-time job. As the tutoring is done online, there are no geographical limitations, and even though most tutors live in Oslo or Trondheim, there are Spanish tutors living in Barcelona and French tutors in Paris. Aspiring tutors register at Learnlink’s website and upload their high school diploma, police records and other relevant documentation before their applications are reviewed. Strong applicants are called into online interviews and ultimately qualified as tutors. All tutors are bound by confidentiality agreements covering their students’ activities and personal information. Most tutors teach 3-4 students, although all lessons are one-on-one and kept separate.
Tutors are matched with students based on their academic and personal profiles, as well as preferences from parents or students. Tutors are free to structure the tutoring as they see fit, but have access to a wide range of resources through the Learnlink web platform.
Students and parents
The students range from children in the lowest levels of elementary school to middle school and up to high school. Some students have previously graduated or failed to graduate from high school and are retaking exams, but most are still in school. The vast majority of students are lagging behind the rest of the class and are taking tutoring lessons to catch up. The goal of tutoring lessons is often to increase motivation and feeling of accomplishment through more personalized learning than the student receives in a classroom with thirty other students. As the tutoring lessons are online, students from all over Norway use Learnlink. In most cases, parents are the ones to initiate tutoring, to manage lesson schedules and communicate with tutors as well as paying the bills. Presumably, they would also be active in the decision about ending the tutoring and churning. Parents receive reports from the tutor after every lesson and are this way able to follow the progress over time.
Tutoring
Most students have 2-4 lessons per week, divided into 1-2 sessions. The student and the tutor meet at the agreed time in a virtual classroom, where they can write, share screen and see and hear each other. Contents of the sessions vary, often split between going through assignments the students have completed since last time and learning new subjects. After the session, the tutor sends the tutoring report to the student’s parents.
2.1.4 Interesting attributes regarding a tutoring company
Uncovering patterns for customers buying educational services can be interesting for public educational institutions as well. Even though the mechanisms for “losing” students are different in the public sector, factors like student motivation, their willingness to engage in learning activities and progress will be interesting to investigate.
Tutoring companies are subject to unusually powerful seasonal variations. Two months of summer holidays, with low-activity periods at the beginning and the end, completely halt the market, while exams create short periods of very high demand. These effects are likely to emerge in the churn data. Moreover, as Learnlink only offers tutoring for students in elementary through high school, all customers naturally grow out of the service. While other industries can retain customers for decades, all tutoring customers will inevitably stop using the service when they finish school.
2.2 Methods
The methods used in this thesis are closely connected, and each step relies on results from the previous one. The activities were carried out in the order they are presented here.
2.2.1 Interviews
In order to pick the right problem to solve, attain a better understanding of the chosen problem and gain insight into the underlying structure of the platform, interviews with employees were carried out. The interview with Learnlink’s Chief Technology Officer contributed to the description of the technology stack used by the company and the architecture of the Learnlink database, as well as the choice of software for the data pipeline. Other employees participated by describing their challenges with the current access to data and by sharing information about earlier attempts to understand and solve problems regarding churn. All interviews were done one-on-one and supplemented by a continuous dialogue during the implementation period.
2.2.2 Practical implementation of data analytics dashboards
The most important method used in this thesis, and a prerequisite for carrying out the analysis, is the implementation of data analytics dashboards. The implementation involves connecting data sources to online software in order to access and visualize data in near real-time. The approach is practical because it is carried out similarly to how Learnlink and other companies would have done it on their own. An alternative to this approach would be to export historical data and perform the analysis locally. This would be sufficient for covering the analysis and prediction needs of this thesis, but would not be useful for Learnlink in the future. Yet another approach would have been to evaluate ways of connecting data sources and performing analysis without actually doing this in practice. Even though this would have provided a more thorough evaluation of different analytics systems, it might fail to uncover some problems that could arise during the actual installation.
2.2.3 Exploratory analysis
The exploratory analysis involved visualizing different parts of the data in order to investigate patterns and trends. After the implementation of the data dashboards, data from the different sources, such as the database, the payment system and the customer relationship management system, could be collected and displayed together. The basis for the exploratory analysis was a list of roughly 40 parameters derived from these data sources. Every parameter was explored in Google Data Studio through pie charts, column charts and bubble charts for interesting patterns or correlations. When patterns were found in the visualizations, they were tested for statistical significance, and significant correlations were included in the prediction models.
Significance
Significance was calculated using a Student's t-test in the following manner.
Let Z be the part of the dataset that our hypothesis concerns. The average or median for the whole dataset (the assumed actual value over time) is denoted by µ. The number of observations in Z, usually individual customers, is given by n, and x̄ is the average or median for the subset that is tested for a significant change. The standard deviation is denoted by σ. The test statistic used to judge whether the measured difference in Z is due to chance is then given by
$$ t = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \tag{2.1} $$
The significance level is then given by taking the TDIST function of 1 − t. The TDIST function is the Excel version of the Student's t-distribution function. For all calculations in this thesis, we have used the t-distribution with n − 1 degrees of freedom and one tail.
Hypotheses with significance levels above 95% will be regarded as significant enough to be used in the prediction model as standalone parameters. Results with significance levels above 90% will be used in combination with other parameters. In the Results section, the significance level is displayed as an error, so a value closer to zero means higher significance.
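The calculation above can be sketched in a few lines of Python. This is a minimal illustration and not the spreadsheet used in the thesis: it assumes that the significance level is one minus the one-tailed tail probability that Excel's TDIST returns, it approximates that probability by numerically integrating the t-density, and the sample numbers in the test are hypothetical.

```python
import math


def t_statistic(sample_mean, population_mean, std_dev, n):
    """Equation 2.1: t = (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - population_mean) / (std_dev / math.sqrt(n))


def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff) * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_probability(t, df, steps=2000):
    """P(T > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 0.5
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3


def significance_level(sample_mean, population_mean, std_dev, n):
    """Assumed interpretation: 1 minus the one-tailed probability, n - 1 d.o.f."""
    t = t_statistic(sample_mean, population_mean, std_dev, n)
    return 1 - one_tailed_probability(t, n - 1)
```

With this interpretation, a hypothesis would pass the 95% threshold when `significance_level(...)` returns a value above 0.95.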
Causation and correlation
We cannot necessarily assume causation when we find significant correlations in the data. When the results from the exploratory analysis are ready, the logic of the hypotheses will be discussed and evaluated. There will always be patterns in a data set, and checking whether they make logical sense will help distinguish mere random patterns from actual relations.
2.2.4 Evaluating the analysis through prediction
Testing churn predictions that are derived from the exploratory analysis against actual events that are yet to happen will show how useful the analysis can be in practice. Even though we find patterns in the data, the patterns can be arbitrary, or for some reason only relevant to the historical data and not applicable to future events. The usefulness of the data analysis will be drastically lower if it cannot be used to provide indications of churn in the future. In order to be able to test predictions, we have to develop a simple prediction model. The criteria for the model were that it should be based on the patterns discovered through the exploratory analysis and that it should be easy to understand why some customers are assigned a high churn probability while others are not. This will make it easier to improve the model in the future.
Figure 2.1: An overview of the methods used for writing this thesis.
2.3 The Learnlink platform architecture
This section will provide the necessary background information about Learnlink’s technology stack and architecture.
2.3.1 Firestore and document-oriented databases
Learnlink is using Firestore as the database for their web app, and this is where most of the data will be retrieved from.
About document-oriented databases
Firestore is a NoSQL (“Not only SQL”) database that is a part of the Firebase development platform, first developed by Firebase Inc. and later acquired by Google. The database is real-time and offered as back-end-as-a-service, as it is entirely hosted in the cloud.
Firestore belongs to a subcategory of NoSQL databases known as document-oriented databases [11]. Contrary to traditional SQL databases, document-oriented databases like Firestore store data as documents made up of key-value pairs rather than in strictly defined tables. In traditional relational databases, objects can be divided across different tables. In document-oriented databases, all information about an object is stored in the same place, which removes the need for object-relational mapping when loading data from the database [12].
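The difference can be illustrated with a small Python sketch. The field names and values below are hypothetical and do not reflect Learnlink's actual schema; the point is only that the relational form must be reassembled from several tables, while the document form already holds the whole object.

```python
# Relational style: one object is spread across two tables linked by user_id.
users_table = [{"user_id": 1, "name": "Student A"}]
lessons_table = [
    {"lesson_id": 10, "user_id": 1, "subject": "Math", "minutes": 60},
    {"lesson_id": 11, "user_id": 1, "subject": "Physics", "minutes": 90},
]


def load_user_relational(user_id):
    """Reassemble one user object by joining the two tables on user_id."""
    user = next(u for u in users_table if u["user_id"] == user_id)
    lessons = [
        {key: value for key, value in lesson.items() if key != "user_id"}
        for lesson in lessons_table
        if lesson["user_id"] == user_id
    ]
    return {**user, "lessons": lessons}


# Document style: the same object stored in one place as nested key-value pairs.
user_document = {
    "user_id": 1,
    "name": "Student A",
    "lessons": [
        {"lesson_id": 10, "subject": "Math", "minutes": 60},
        {"lesson_id": 11, "subject": "Physics", "minutes": 90},
    ],
}
```

Here `load_user_relational(1)` produces the same structure as `user_document`, but only after the mapping step that a document-oriented database avoids.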
The CTO of Learnlink mentions that one of the main reasons for choosing this database was that the key-value pairs in Firestore are so similar to JavaScript objects that they can be exported directly. As the Learnlink front-end is written in JavaScript, this allowed the team to develop the entire platform, including front-end and back-end, in one programming language. A common alternative is to use a language like PHP on the server, JavaScript in the front-end and SQL in the database, which would require the developers to master several programming languages. Using one language also allows for closer integration between the front-end web apps and the database. Firestore is convenient in that it updates the web apps continuously and in real time without the need for refreshing the web page. In summary, the Firestore database gives the Learnlink team access to a series of advanced features without the need for building them themselves [13].
Document-oriented databases and data analytics
SQL
Relational databases are considered to support a wider variety of queries than document-oriented databases. The most common language for data analytics is SQL, which is not supported in a document-oriented database. Consequently, most plug-and-play data analytics or business intelligence software is made for relational databases and does not provide the same support for NoSQL databases [14]. However, NoSQL databases can be preferable in big data projects because they do not have the same strict schema as relational databases, do not have to satisfy the ACID properties (Atomicity, Consistency, Isolation and Durability) and can be used to store unstructured data [15]. The unstructured data can be preprocessed and structured in the application that will use the data rather than in the database itself.
Storing in multiple locations
Document-oriented databases support storing the same data in different places. In relational databases, this is considered bad practice, but when it comes to storing and accessing big data efficiently, many paths to the same data source can reduce query time [14].
Flexibility
Relational databases have strict rules for changing the structure of the data in the same table (schema). This can give rise to trouble when working with large amounts of unexplored data, as different rows in a data set may have different structures and attributes. Document-oriented databases have more flexible schemas and are thus handier when working with large, unstructured data sets [14].
Speed
Document-oriented databases provide faster indexing and thus faster query response. Instances have shown more than 100x the query speed compared to relational databases [14].
Scalability
Sharding means partitioning large data sets into smaller and more manageable parts, referred to as shards. A database that supports sharding can be scaled almost without limits, which is useful when working with large data sets or for small businesses that plan and build for fast growth. Document-oriented databases support sharding, while relational databases do not [14].
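As an illustration of the idea, the sketch below assigns records to shards by hashing a record key. Hash-based partitioning is only one common scheme, chosen here for simplicity; it is not a description of how Firestore distributes data internally, and the record layout (`id` field) is hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index in a stable, evenly spread way."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(records, num_shards):
    """Split a list of records into num_shards smaller lists (shards)."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(record["id"], num_shards)].append(record)
    return shards
```

Because the shard index depends only on the key, a lookup for a single record needs to visit one shard rather than the whole data set.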
2.3.2 Firestore SQL export
When the work on this thesis started, there was no pre-made integration that could easily be set up to export data from Firestore to BigQuery, the tool used for structuring the data before querying. To solve this problem, Learnlink’s CTO Johannes had written an export function and made it publicly available on GitHub. As Google did not have a permanent solution that could automatically refresh the data at frequent intervals, the function has been used by other companies around the world, which probably face the same problem as Learnlink [11].
2.3.3 Learnlink platform architecture
The Learnlink platform architecture is shown in figure 2.2. There are several front-end applications for different uses. App.learnlink.no is used by customers and tutors for administration of their customer relationship: access to payment information, lesson schedule, learning resources and reports from completed tutoring lessons. The admin platform is used by the Learnlink team when following up customers and tutors. Online.learnlink.no is a virtual classroom where the actual tutoring happens. The virtual classroom is a video chat application with support for drawing and writing as well as recording footage from the tutoring lessons to be viewed later.
Figure 2.2: Overview showing how parts of the Learnlink software platform are sending data to the other parts.
All of these front-end applications are connected to the database, in addition to some third party applications as shown [13].
Most out-of-the-box analytics tools are event-based: instead of reading data from the database, they receive information about user events. These events are activated in the front-end applications when users push buttons, send messages and so forth. A widely used event-based analytics service is Google Analytics. The challenge with event-based analytics arises when you depend on tracking changes that arise from something other than user activity in the applications. As Learnlink is a two-sided platform, every user’s data is connected to and updated because of other users’ actions. So when a tutor registers a lesson for one of their students, the student’s profile is updated accordingly, and an email with a summary of the lesson is sent out. Another challenge surfaces when tutoring is completed on a video conversation tool other than Learnlink’s own: there is no “evidence” on the platform that the actual tutoring took place, yet the completion of the lesson should be reflected on both the tutor’s and the student’s accounts. Our final example is payment, which is handled through the third-party payment provider Stripe and is automatically collected when a lesson is completed. Due to the more complicated circumstances around a two-sided platform, event-based analytics alone was never an option for giving the Learnlink team a complete picture. A visualization tool needed to be able to read both event-based analytics and database data and show them in the same charts.
Figure 2.3: The data flow from Learnlink’s data sources to data dashboards.
2.3.4 Data flow
Figure 2.3 describes the data flow in the Learnlink application. Customer data is saved in the Firestore database. Information about payments and subscriptions comes from the payment provider Stripe. Information about customer correspondence comes from the Customer Relationship Management (CRM) system Intercom. This data is sent to Segment, which is data pipeline software: it gathers data from different sources and sends it to the desired destinations. If the Learnlink team wants to attach more data sources to their analytics tool in the future, these can be connected to Segment and sent together with the rest of the data. Segment also handles event tracking from Learnlink.no: when users perform certain actions, messages that the events have been triggered are sent to Segment. Segment data is forwarded to Google BigQuery, where the data is rewritten to SQL. Data in BigQuery is structured in tables. Tables can then be used as data sources in analytics tools like Google Data Studio, which is the tool used to display dashboards and perform the exploratory analysis. Dashboards made in Google Data Studio can then be shared with others [13].
2.4 Data
This section will describe the data in the form it is retrieved in BigQuery, and how it is preprocessed before being sent to Google Data Studio for the exploratory analysis.
2.4.1 Data collection
Most of the data that will be used in this thesis is collected from data points that are derived from customer, student and tutor activity on the Learnlink web platform. Information about the student is collected at the beginning of the customer relationship in order to find the right tutor. Subject, school level and reason for requesting tutoring are typical data points collected here. Tutor profiles are displayed to customers before their first lesson and are designed to give an impression of the tutor’s strengths and preferred learning strategies. Consequently, tutor profiles contain information about the tutor and can be used in the analysis. During the lifespan of the customer relationship, data is gathered through lesson reports. Motivation, homework assignment and completion, and difficulty levels are gathered from reports. Payment and subscription information is collected from the payment provider Stripe. Information about customer support activity, such as complaints, satisfaction survey responses and number of conversations, is collected through the CRM system Intercom.
2.4.2 Tables
Table                     Rows     Variables   Description
Balances                  5705     6           Lesson balances
Categories                89       9           List of tutoring subjects (math etc.)
Lessons                   16726    24          Tutoring lessons
Projects                  3445     45          Customer/tutor relationships
Reports                   6015     42          Lesson progress reports
Users                     5774     46          Users (tutors and students)
Intercom conversations    64945    18          Customer support conversation activity
Stripe                    3364     16          Payment information from Stripe
Table 2.1: Tables from the database, Stripe and Intercom in BigQuery.
Balances
Every customer has a lesson balance that records how many lessons the customer has available for tutoring. Customers who buy more lessons than they complete have positive balances, and customers who complete more lessons than they pay for have negative balances until they pay their debt. This table contains balances for every user and metadata such as the last transaction update and how negative balances are handled. Balances can be of relevance for churn because they indicate whether a customer uses more or fewer lessons than they initially planned for. Balances were introduced with the subscription business model in August 2019, and thus there is no balance data prior to this.
Categories
Every student receives tutoring in one or more subjects. Subjects are referred to as categories in the database. They are divided by grade level, for example “Math 8th to 10th grade” and “Math 1st to 7th grade”. At higher levels, there can be several courses within one subject, like different math classes for students attending the first year of high school or various foreign languages. Consequently, there are more categories per year in the higher levels than in the lower levels. All subjects also have a level which indicates whether they are elementary school, middle school or high school subjects.
Lessons
The lessons table contains information about all the tutoring lessons. Examples are time, date, duration and whether the lesson has been paid or cancelled.
Projects
Every relationship between a tutor and a student is called a project. A tutor usually has several projects, one with each student, and a student might have more than one project if they have different tutors in different subjects. The projects table contains information about the number of lessons, tutor wage, relevant milestones like the initiation of the project and pricing information.
Reports
After every tutoring lesson, the tutors fill out a lesson evaluation report. Reports are sent to parents, who can review progress, follow student motivation and check whether the student is completing the assigned homework. The most recent reports also include all official subject goals (“læreplanmål”) and how the student progresses on each of the goals over time. This feature was, however, launched only recently, and there is not enough data to include subject goals in this analysis. Information about motivation and overall goal achievement has been present in evaluation reports since 2017 and is used in the exploratory analysis.
Users
All customers and tutors have user profiles with person-specific information. Data points include personal information, contact info, activity and information that is specific to the user type (whether it is a tutor or a customer). The table is included as demographic information might be relevant when looking at churn rates.
Stripe
The Stripe table contains information about payments and subscriptions. Examples of relevant data points are whether invoices are paid on time, the number of payment reminders and total revenue per customer.
Intercom conversations
The most common way for customers and tutors to get in touch with Learnlink employees is through the Intercom chat or email. All conversations are stored in this table, together with user-specific information such as customer satisfaction and the number of email notifications received.
2.4.3 Excluded data
The Learnlink platform gathers more data than what is included in the tables listed above.
The reason for excluding some of the data we have available is to increase the chances of deriving useful conclusions from our dashboards. Examples of data that are excluded are visitor event data from website landing pages and data from the online tutoring virtual classroom. The website has thousands of visitors per month, clicking buttons and scrolling pages, but we assume this to be of little importance compared to other events during the customer lifetime. Virtual classroom video material would be interesting to analyze, but presents a great challenge due to the immense size of the data. Terabytes of data every day and thousands of lessons every month would require us to apply more sophisticated big data techniques for analysis, which is beyond the scope of this thesis.
2.5 Defining churn
To be able to compare data and to perform queries, we will need to establish a quantitative measure of churn.
Churn was not unambiguously defined when the work with this thesis was initiated.
Due to seasonal variations, distinguishing between an inactive customer and a lost one is not straightforward. Moreover, Learnlink has changed its business model twice over the past years. In order to compare data collected during different time periods, we would need to have a definition that is insensitive to changes in the pricing structure. Before proceeding to possible definitions, we will go through the different business models.
Business Models (BMs)
BM1: 2017 - June 2018: Pay-as-you-go
Customers paid a flat fee per lesson and there was no volume discount. There was no commitment to buy a certain number of lessons, and the customer could quit at any time without any delay.
BM2: July 2018 - July 2019: Packages
Customers chose a package with a certain number of lessons, ranging from 10 to 80 lessons, with volume discounts. Lessons were paid for right after completion and not in advance, and a cancellation fee applied if the customer quit before all lessons in the package were used.
BM3: August 2019 - today: Subscriptions
The current business model is based on subscriptions with recurring payments. Customers choose between 3 different subscriptions with different numbers of lessons included per month, and the subscriptions with more lessons have a lower price per lesson. Payments are made in advance and are recurring. In order to stop the recurring payments, a customer must proactively cancel their subscription by contacting a customer support representative. Subscriptions can be paused for one or two months. Customers can downgrade their subscriptions by changing to one with fewer lessons per month. Widespread downgrading can lead to significant revenue loss for the company.
Irregularities
Some customers take breaks from tutoring and then reengage a few months later. Others make payments but do not complete lessons, while yet others pause their subscriptions and complete lessons saved up from previous months, and thus have lessons without making payments. Most customers take breaks during the summer holidays, but the duration of those breaks varies from 1 to 4 months. Some customers take breaks for 1-2 months in December and January.
Criteria for the definition
The data set is already sparse, so our aim should be to find a definition that is compatible with data from all three business models. Summer holidays cannot count as churn, as this would make the churn rate every summer 100 per cent and corrupt the data. Ideally, no pauses should count. Downgrades should be partial churns.
The definition we choose should be the one that will be most effective for completing the objectives. As the definition is derived from the underlying data and not the other way around, objective one is unaffected. In the exploratory analysis, we want to compare customers based on how fast they churn, so the definition should be time-sensitive (“when did the customer churn” and not “did the customer churn”). Asking how probable it is that a customer eventually will churn does not make any sense, as all customers are bound to churn at some point. Instead, finding out how likely customers are to churn within a given time frame or finding the expected time to churn will be more useful.
In other words, we need to translate the following questions to a quantitative language that can be used to perform queries:
“Has this customer churned (or is it active)?”
“How fast did this customer churn?”
Possible churn definitions
Payment stop
Payment one month, then no payment the month after.
Payment stop covers all business models, but counts pauses as churn and one-time orders or prepayment of lessons as immediate churn. The definition can be somewhat inaccurate, as customers can keep having lessons even though they are not making payments. The frequency and volume of lessons are not taken into account, so downgrades are not included in this definition.
Subscription cancelled
A customer cancels their subscription.
Subscription cancelled leaves out all customers without a subscription. As BM1 and BM2 did not have explicit cancellations, this definition rules out large parts of the data set. The definition does not count downgrading a subscription as churn. Due to the incompatibility with older business models, this definition can be useful in the prediction model, but not in the analysis.
Lesson stop
A customer has not attended any lessons for the previous n months, but did attend lessons during month n-1 and the months prior to this.
Lesson stop is compatible with all business models, as lessons have been the core product across the whole time period. Pauses and holidays do not count as churn as long as lessons are resumed after a break. Downgrading a subscription is not counted as churn.
One might argue that a customer is active as long as lessons are completed, regardless of payments, which supports this definition.
After comparing the three definitions above, we can see that the lesson stop definition covers most use cases and is compatible with all business models. Note that downgrades are not counted as churn, so reductions in activity without a complete stop are not covered in our data.
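To show how the lesson stop definition could be turned into a query-like rule, the following Python sketch flags a customer as churned when at least n full calendar months have passed without lessons. This is one possible reading of the definition and an assumption on our part; the function names are hypothetical, and the special handling of summer holidays is not modelled beyond allowing a larger n to be passed in.

```python
from datetime import date


def full_months_without_lessons(lesson_dates, as_of):
    """Count whole calendar months between the last lesson and the as_of month."""
    last = max(lesson_dates)
    months_apart = (as_of.year - last.year) * 12 + (as_of.month - last.month)
    return max(months_apart - 1, 0)


def has_churned(lesson_dates, as_of, n_months=1):
    """Lesson stop: previously active, then no lessons for n_months full months."""
    past_lessons = [d for d in lesson_dates if d <= as_of]
    if not past_lessons:
        return False  # never had a lesson, so there is nothing to churn from
    return full_months_without_lessons(past_lessons, as_of) >= n_months
```

A customer whose last lesson was in March would, under this reading, be counted as churned in May with `n_months=1`, but not with a longer threshold such as `n_months=3`.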
Measuring the time until churn
A measure of how fast the customer has churned will be of importance when determining premature cancellations.
Lifetime
The lifetime of a customer is the time between their first lesson and their last lesson.
Even though lifetime is not an explicit churn definition, it covers the most important aspect: the time period during which the customer is generating revenue for the company.
Whether a customer has churned or not is binary, but lifetime is continuous and is more useful for comparisons.
Lifetime is compatible with all three business models. Pauses are insignificant, and one-time orders can be filtered out by only looking at lifetimes longer than 1 month. Summer holidays do not count as churn. However, this definition does not take downgrading into consideration.
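A minimal sketch of the lifetime measure is given below. It assumes that lifetime is measured in days between the first and the last lesson and that "over 1 month" is approximated as more than 30 days; both the threshold and the function names are illustrative assumptions, and the customer data in the usage note is made up.

```python
from datetime import date


def lifetime_days(lesson_dates):
    """Lifetime: time between a customer's first and last lesson, in days."""
    return (max(lesson_dates) - min(lesson_dates)).days


def lifetimes_without_one_time_orders(customers, min_days=30):
    """Return lifetimes per customer, dropping lifetimes of min_days or less."""
    result = {}
    for customer_id, lesson_dates in customers.items():
        if lesson_dates and lifetime_days(lesson_dates) > min_days:
            result[customer_id] = lifetime_days(lesson_dates)
    return result
```

For example, a customer with lessons on 1 September 2019 and 1 March 2020 has a lifetime of 182 days, while a customer with a single lesson has a lifetime of zero and is filtered out.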
Lifetime has a neat relationship with churn rate, which makes it possible to use them interchangeably. Using this formula, it is also possible to calculate the average lifetime based on the percentage churn rate of the whole customer body, even though all customers have not churned yet.
Let the average lifetime be L_a, and the churn rate c.
The average lifetime in months will then be
$$ L_a = 1 + (1 - c) + (1 - c)^2 + \dots = \sum_{n=0}^{\infty} (1 - c)^n \tag{2.2} $$
This is a geometric series with first term a_0 = 1 and common ratio r = 1 − c, so using the formula for the sum of an infinite geometric series, we obtain
$$ L_a = \frac{a_0}{1 - r} = \frac{1}{1 - (1 - c)} = \frac{1}{c} \tag{2.3} $$
Using the above equation, lifetime can be used as a metric for measuring the speed of the churn rate.
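The relationship can be checked numerically with a short Python sketch. The function names are our own, and the 25% monthly churn rate in the usage note is a made-up example value, not a figure from the Learnlink data.

```python
def average_lifetime(churn_rate):
    """Equation 2.3: average lifetime in months, L_a = 1 / c."""
    if not 0 < churn_rate <= 1:
        raise ValueError("churn_rate must be in the interval (0, 1]")
    return 1 / churn_rate


def partial_lifetime_sum(churn_rate, months):
    """Equation 2.2 truncated after a finite number of terms."""
    return sum((1 - churn_rate) ** n for n in range(months))
```

With a monthly churn rate of 0.25, `average_lifetime(0.25)` gives 4 months, and `partial_lifetime_sum(0.25, 200)` approaches the same value as more terms of the series are included.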
We have now established a precise definition of churn and a measure that can differentiate fast and slow churns.
2.5.1 Adjusting our definition to obtain prediction results
A challenge with the chosen definition of churn is that it takes one whole month to know whether a customer has churned or not. Additionally, as the definition carries a special condition during summer holidays, it takes even longer to determine whether a customer has churned in May and June. In order to obtain results faster for evaluating the prediction model, we can choose another definition for the prediction model. A definition that cannot be applied to past data because of the changes in business model is subscription cancelled. At the time of writing, however, all customers who wish to stop having tutoring lessons must cancel their subscription. A subscription cancellation is recorded immediately. In order to obtain the best possible comparison of prediction results versus actual events, we will use the subscription cancelled definition for evaluating the prediction model.
2.6 Prediction model
Objectives five and six deal with validating the results from the exploratory analysis by forming simple predictions that are tested against actual events. This thesis is written during the spring of 2020, and the analysis is carried out in March and April. The aim is to form predictions that can provide indications of which customers are more likely to churn during May 2020.
The purpose of the prediction model is to provide indications of whether a customer relationship is “on-track” or whether there is a high risk of a premature cancellation. A foundation for the prediction model will be established from the correlations in the exploratory analysis. In order for the model to be practical for the Learnlink team, the model should identify individual churn risk as opposed to an aggregated risk. A model that predicts a correct aggregated churn percentage every month would be useful for financial planning purposes, but will not enable the team to solve the problem by imposing measures affecting individual customers.
Many employ artificial intelligence and machine learning to improve predictions over time. As we are facing a large number of parameters that can be relevant to the lifetime of a customer, our model must be multivariate. Collins (2014)[16] conducted a review of reports from multivariate prediction models and concluded that the majority of the models lacked external validation, meaning that the models were not tested on other data than the data that was the foundation of the model. We will work around this challenge by testing on data from events that are yet to occur when the model is built.
One representation of the major classes of prediction algorithms group them as follows[17]:
1. Decision trees: A tree-structured algorithm that groups data based on a series of functions.
2. Neural networks: Acyclic networks inspired by the human brain. This is a form of unsupervised prediction, which means that labelling data before the prediction is unnecessary.
3. Instance-based learning: The whole training set is used to build a function that classifies data.
It should be noted that this is not an exhaustive list of models or algorithms. From here on, we will only discuss the prediction algorithms that are used in this thesis.
2.6.1 Prediction model criteria
Accuracy
The accuracy of our model is the number of correctly predicted values compared to the total number of values predicted. Abbott (2014) [18] proposes Percent Correct Classification (PCC) as the main metric to assess accuracy. With PCC, all errors are handled equally and the score is determined based on whether errors exist rather than how they occur.
$$ \mathrm{PCC} = \frac{\text{Correct predictions}}{\text{Number of predicted values}} \tag{2.4} $$
Precision and recall
When predicting churn, errors where the model fails to identify high churn risk customers (false negatives) have potentially more severe consequences than errors where low-risk customers are identified as high risk (false positives). A false negative might result in losing a customer, while a false positive may result in unnecessary measures imposed to prevent a churn. We assume that anti-churn measures are somewhat effective and less expensive than losing a customer, and will therefore prefer false positives to false negatives. We measure precision as the ratio of correct positive predictions to the total number of positive predictions [19]. In other words,
$$P = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{2.5}$$
Recall, also called sensitivity, measures how many of the actual positives we have correctly predicted.
$$R = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{2.6}$$
Simplicity
As stated in the Introduction section, we aim to establish a model that can be matured by the Learnlink team in the future and used when more data becomes available. A model that is easy to implement and can be frequently adjusted will be easier to maintain and improve. Third-party algorithms that have already been trained and shown to work are preferable to writing code from scratch for the algorithm.
Overall performance
A supplementary measure to the performance metrics described above is the confusion matrix, which provides an overview of what kind of errors the model makes [19].
                       Prediction outcome
                       p                n                total
Actual value   p'      True Positive    False Negative   P'
               n'      False Positive   True Negative    N'
               total   P                N
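The criteria above can be illustrated with a short sketch that counts the four cells of the confusion matrix and derives PCC (2.4), precision (2.5) and recall (2.6) from them. The label lists are made-up example values, with `True` meaning churn; the function names are chosen for this illustration only.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for boolean labels (True = churn)."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(True, True)],    # churn predicted, churn occurred
        "fn": counts[(True, False)],   # churn missed by the model
        "fp": counts[(False, True)],   # churn predicted, customer stayed
        "tn": counts[(False, False)],  # stay predicted, customer stayed
    }


def pcc(c):
    """Percent Correct Classification, equation (2.4)."""
    total = c["tp"] + c["fn"] + c["fp"] + c["tn"]
    return (c["tp"] + c["tn"]) / total


def precision(c):
    """Equation (2.5); assumes at least one positive prediction."""
    return c["tp"] / (c["tp"] + c["fp"])


def recall(c):
    """Equation (2.6); assumes at least one actual positive."""
    return c["tp"] / (c["tp"] + c["fn"])


# Made-up example labels (assumption, not thesis data).
actual = [True, True, True, False, False, False, False, True, False, False]
predicted = [True, True, False, True, False, False, False, True, True, False]

cells = confusion_counts(actual, predicted)
```

In this example the model makes two false positive errors and one false negative error, so recall is higher than precision, which matches the stated preference for false positives over false negatives.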
2.6.2 Classification
Classification is the process of predicting values and assigning labels or classes based on these values. Regression is often used when predicting numerical values, while classification is used when predicting categorical data. We often distinguish between binary classification models, here called monoclass models, which use solely yes/no labels, and multiclass models, which have more than two classes.
When predicting churn, a monoclass model can be used to answer the question “Will this customer churn?”. Multiclass algorithms can differentiate between “low risk”, “medium risk” and “high risk”.
In contrast to clustering, another frequently used technique, classification groups data based on predefined labels. Classification is a type of supervised learning, which means that a training data set with both input and output values is used, while clustering only uses input values. For the purpose of this thesis, we will use classification, as we have predefined labels we want to use.
Decision tree algorithms
Decision trees are a form of supervised machine learning algorithm used for both regression and classification. For classification, the algorithms use simple if/then rules to classify samples. The name is derived from a way of visualizing the algorithm as a tree, where the first decision is made in the root, and the algorithm traverses down the branches to reach a decision in one of the leaf nodes.
Decision trees can also be constructed manually based on prior knowledge, without machine learning. The advantages of this approach are that you can leverage prior knowledge and gain insight into how changes in the parameters of the model affect outcomes, and thus understand more of the underlying effects. Machine learning algorithms, by contrast, can lack transparency, and it can be hard to understand how the algorithm reaches certain conclusions. The disadvantage of a manual tree is that the model will not learn by itself and has to be updated when new patterns are found.
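A manually constructed decision tree can be sketched as a chain of if/then rules, as below. The feature names (`payment_failed`, `lessons_last_month`) and the thresholds are hypothetical and chosen only for illustration; they are not the variables or rules found in the exploratory analysis.

```python
def classify_churn_risk(customer):
    """Hand-written decision tree; features and thresholds are hypothetical."""
    # Root node: a failed payment leads directly to a high-risk leaf.
    if customer["payment_failed"]:
        return "high risk"
    # Second level: no recent activity is also treated as high risk.
    if customer["lessons_last_month"] == 0:
        return "high risk"
    # Third level: low activity gives a medium-risk leaf.
    if customer["lessons_last_month"] < 3:
        return "medium risk"
    # Remaining customers end in the low-risk leaf.
    return "low risk"


# Hypothetical example customers (assumption, not real data).
examples = [
    {"payment_failed": True, "lessons_last_month": 5},
    {"payment_failed": False, "lessons_last_month": 0},
    {"payment_failed": False, "lessons_last_month": 2},
    {"payment_failed": False, "lessons_last_month": 6},
]
labels = [classify_churn_risk(c) for c in examples]
```

Every rule is visible and can be adjusted by hand, which reflects the transparency advantage described above, but the thresholds do not update themselves when new patterns appear.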
Logistic regression algorithm
Logistic regression is a supervised classification algorithm. The algorithm is usually based on machine learning and uses the sigmoid function instead of a linear function. This is useful for dealing with outliers in the data set, since a linear regression model will give too much weight to extreme values. Logistic regression can be used for binary classification or for multilinear functions [20]. Many regression algorithms handle non-numeric input values.
$$\sigma(t) = \frac{1}{1 + e^{-t}} \tag{2.7}$$
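The sketch below evaluates the sigmoid function (2.7) and uses it to turn a weighted sum of input features into a churn probability. The weights, bias and feature values are invented for illustration; in practice they would be estimated from training data. Because the output is always between 0 and 1, extreme input values cannot push a prediction outside that range.

```python
import math


def sigmoid(t):
    """Sigmoid function, equation (2.7), in a numerically stable form."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, rewrite the expression to avoid overflow in exp(-t).
    z = math.exp(t)
    return z / (1.0 + z)


def churn_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a weighted feature sum."""
    t = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(t)


# Hypothetical weights and feature values (assumptions, not fitted values).
weights = [0.8, -0.5]
bias = 0.2
probability = churn_probability([1.0, 2.0], weights, bias)

# A threshold turns the probability into a yes/no label.
will_churn = probability >= 0.5
```

The threshold of 0.5 is a common default, not a value from the thesis; lowering it would trade precision for recall.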
Figure 2.4: A standard decision tree.
2.6.3 Regression models
Based on the above consideration of prediction techniques, we proceed with two different prediction models and use both for evaluating the classification results. First, we will utilize correlations from the exploratory analysis to construct decision trees manually. Secondly, we will feed the variables that have been shown to correlate significantly with the churn rate to a pre-built logistic regression algorithm in BigQuery. Both of the resulting algorithms are fairly simple and will be easy for the Learnlink team to develop further.
Classification model I - Manual decision tree
After retrieving a set of variables that correlate with customer churn or lifetime, we will write an algorithm based on if/then statements in BigQuery. The model will assign labels based on the answers to these statements, using the same labels as model II. As the model is built on the exploratory analysis, more details and figures are presented in the Results chapter.
Classification model II - Logistic regression in BigQuery
The second classification model is made using logistic regression in BigQuery. BigQuery’s premade machine learning algorithms are developed by Google and are available through the BigQuery ML library. The ML library lets the user build models with standard