Farzan Majeed Noori
Multimodal Deep Learning
Approaches for Human Activity Recognition
Thesis submitted for the degree of Philosophiae Doctor
Department of Informatics
Faculty of Mathematics and Natural Sciences
2023
© Farzan Majeed Noori, 2023
Series of dissertations submitted to the
Faculty of Mathematics and Natural Sciences, University of Oslo No. 2587
ISSN 1501-7710
All rights reserved. No part of this publication may be
reproduced or transmitted, in any form or by any means, without permission.
Print production: Graphics Center, University of Oslo.
To my Loving Family, Teachers, and Friends Who Have Always been Source of Inspiration, Motivation and Learning Throughout My Life
Abstract
Smart homes may be beneficial for people of all ages, but this is especially true for those with care needs, such as the elderly. To assist, monitor for emergencies, and provide companionship for the elderly, a substantial amount of research on human activity recognition systems has been conducted. Several algorithms for activity recognition and prediction of future events have been reported in the scientific literature. However, the majority of published research does not address privacy concerns or employ a variety of ambient sensors.
The objective of this thesis is to contribute to the progress in research relevant to activity recognition systems that use sensors that collect less privacy-related information. The following tasks are included in the work: assessment of sensors while keeping privacy concerns in mind, selection of cutting-edge classification methods, and how to fuse the data from multiple sensors. This thesis contributes to making progress on systems for analyzing human activity and state—or vital signs—for application in a mobile robot.
This dissertation examines two topics. First, it examines the privacy concerns associated with having a robot in the home. On a robot, an ultra-wideband (UWB) radar-based sensor and an RGB camera (for ground truth) were installed.
An actigraphy device was also worn by the users for heart rate monitoring. The UWB sensor was selected to maintain privacy while monitoring human activities.
The second topic is how data from a single sensor can be represented in different ways and how these multiple representations can be combined. For this purpose, we investigate various representations of a single sensor’s data and analyze them using cutting-edge deep learning algorithms.
The contributions provide considerations for equipping a mobile home robot with activity recognition abilities while reducing the amount of privacy-sensitive sensor data. The work also concerns examining the potential privacy restrictions that must be established for the analyzing systems. The thesis contains new methods for combining data from multiple information sources. To achieve our objective, convolutional neural networks and recurrent neural networks were applied and validated using conventional methods.
The conclusion of the thesis is that we can achieve good accuracy with limited sensors while maintaining privacy. This accuracy is likely adequate for assisting healthcare personnel and caregivers in their work by indicating current activity status, measuring activity levels, and providing alerts about abnormal activities.
The results can hopefully contribute to older people being able to live alone in their homes with a larger chance of any unwanted events being quickly detected and notified to the caregivers and providers.
Preface
This thesis is submitted in partial fulfillment of the requirements for the degree of Philosophiae Doctor at the University of Oslo.
The research presented here was conducted at the Robotics and Intelligent Systems group at the Department of Informatics during the period 2018–2022, under the supervision of Dr. Jim Tørresen, Dr. Md Zia Uddin, and Dr. Michael Alexander Riegler. This work was partially supported by the Research Council of Norway (RCN) as a part of the Multimodal Elderly Care Systems (MECS) project under Grant Agreement No. 247697, Predictive and Intuitive Robot Companion (PIRC) under Grant Agreement No. 312333, and through its Centres of Excellence scheme, Project No. 262762. In 2022, the author had a four-month stay at the Information Systems, Security and Forensics Lab at the University of Michigan-Dearborn, United States, supervised by Dr. Hafiz Malik.
Acknowledgements
There are several people I would like to thank for their help and support during the period I have been working on my PhD.
First of all, I would like to thank my supervisors: Jim Tørresen, Md Zia Uddin, and Michael A. Riegler. Without their support and motivation, this work would not have been possible.
In particular, I would like to thank Jim for serving as my mentor and keeping me focused on the research while allowing me to pursue everything needed for research; Zia for teaching me so much in research and beyond; and Michael A. Riegler for guiding me to start my research and sometimes working with me at 7 am.
I would also like to give many thanks to all my current and former colleagues in the Robotics and Intelligent Systems group for a great working environment. I would like to thank Enrique G-Ceja for his challenging and motivating discussions and feedback in the early days of my PhD.
I would like to thank Dr. Noman Naseer, who motivated me toward research during my masters and is still helping me out. I would like to thank Dr. Rayyan Azam Khan for being both a very good friend and a fellow researcher, and for always having an open ear for me. I would also like to thank all of my co-authors and collaborators.
Further, I would like to thank my friends from Pakistan and Norway. You know who you are!
I would like to thank my parents – Abdul Majeed, SI and Abida Parveen – for their support and good thoughts during my research. A huge thanks to my family members from Pakistan and Denmark.
Finally, I would like to thank my wife – Aaima, for supporting me through this challenging but rewarding journey and my beloved son – Yusuf, for not being born before my submission deadline. If I forgot to mention anyone, please accept my apologies and sincere gratitude.
Farzan Majeed Noori Oslo, December 2022
List of Publications
Paper I: A Robust Human Activity Recognition Approach Using OpenPose, Motion Features, and Deep Recurrent Neural Network.
F. M. Noori, B. Walace, M. Z. Uddin, J. Tørresen.
2019 Scandinavian Conference on Image Analysis.
DOI: 10.1007/978-3-030-20205-7_25. Citations1 : 52
Paper II: Robot-Care for the Older People: Ethically Justified or Not?
F. M. Noori, M. Z. Uddin, J. Tørresen.
2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob).
DOI: 10.1109/DEVLRN.2019.8850706. Citations : 6
Paper III: Fusion of Multiple Representations Extracted from a Single Sensor’s Data for Activity Recognition Using CNNs.
F. M. Noori, E. Garcia-Ceja, M. Z. Uddin, M. Riegler, J. Tørresen.
2019 IEEE International Joint Conference on Neural Networks (IJCNN).
DOI: https://doi.org/10.1109/IJCNN.2019.8851898.
Paper IV: Human Activity Recognition from Multiple Sensors Data Using Multi-Fusion Representations and CNNs.
F. M. Noori, M. Riegler, M. Z. Uddin, J. Tørresen.
2020 ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2.
DOI: https://doi.org/10.1145/3377882. Citations : 19
Paper V: In-Home Emergency Detection Using an Ambient Ultra-Wideband Radar Sensor and Deep Learning.
M. Z. Uddin, F. M. Noori, J. Tørresen.
2020 IEEE 10th International Conference on Ultrawideband and Ultrashort Impulse Signals (UWBUSIS).
DOI: 10.1109/UkrMW49653.2020.9252708. Citations : 5
Paper VI: Ultra-Wideband Radar-Based Activity Recognition Using Deep Learning.
F. M. Noori, M. Z. Uddin, J. Tørresen.
2021 IEEE Access, Volume 9.
DOI: 10.1109/ACCESS.2021.3117667. Citations : 9
1 All citation numbers are without self-citations.
Papers written during the PhD, but not included in the thesis:
Heart rate prediction from head movement during virtual reality treatment for social anxiety
F.M. Noori, S. Kahlon, P. Lindner, T. Nordgreen, J. Torresen, M. Riegler.
2019 International Conference on Content-Based Multimedia Indexing (CBMI) DOI: 10.1109/CBMI.2019.8877454.
Semantic Temporal Object Search System Based on Heat Maps
M. Mantelli, F.M. Noori, D. Pittol, R. Maffei, J. Torresen, M. Kolberg.
2022 Journal of Intelligent and Robotic Systems DOI: 10.1007/s10846-022-01760-8.
One-dimensional convolutional neural networks on motor activity measurements in detection of depression
J.I. Frogner, F.M. Noori, P. Halvorsen, S.A. Hicks, E. Garcia-Ceja, J. Torresen, M. Riegler.
2019 Proceedings of the 4th International Workshop on Multimedia for Personal Health & Health Care
DOI: https://doi.org/10.1145/3347444.3356238.
Emotion Recognition using Speech Data with Convolutional Neural Network
M.H. Pham, F.M. Noori, J. Torresen.
2021 IEEE 2nd International Conference on Signal, Control and Communication (SCC)
DOI: 10.1109/SCC53769.2021.9768372.
Monitoring In-Home Emergency Situation and Preserve Privacy using Multi-modal Sensing and Deep Learning
D.A. Bordvik, J. Hou, F.M. Noori, M.Z. Uddin, J. Torresen.
2022 International Conference on Electronics, Information, and Communication (ICEIC)
DOI: 10.1109/ICEIC54506.2022.9748829.
Challenges and possible solutions in cross-disciplinary and cross-sectorial research teams within the domain of e-mental health
T. Nordgreen, F. Rabbi, J. Torresen, ... F. M. Noori, Y. Lamo.
2021 Journal of Enabling Technologies
DOI: https://doi.org/10.1108/JET-03-2021-0013.
Towards Adaptive Technology in Routine Mental Healthcare
Y. Lamo, S.K. Mukhiya, F. Rabbi, J. Torresen, ... F. M. Noori, ... T. Nordgreen.
2022 Digital Health Journal DOI: 10.1177/20552076221128678.
Mood Recognition From Daily Phone Calls of Patients With Bipolar Disorder Using Attention Network
P. Minh H., F. M. Noori, J. Petter, O. Ketil, T. Nordgreen, J. Torresen.
Journal paper under review
Contents
Abstract
Preface
List of Publications
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation and Aim
1.2 Interdisciplinarity
1.3 Research Question
1.4 Research Methods
1.5 Contributions
1.6 Summary of Papers
1.7 Thesis Outline
2 Background
2.1 Sensing for Elderly Care and Robots
2.2 Fusion of Data from Sensors
2.3 Combining Multiple Data Representations
2.4 Preprocessing Techniques
2.5 Machine Learning
3 Sensors and User-Generated Data
3.1 Datasets
3.2 Sensors Used in MECS Data Collection
4 Summary of Papers and Author Contributions
4.1 Overview
4.2 Papers
5 Discussion
5.1 Research Questions
5.2 Limitations and Future Work
6 Conclusion
Bibliography
Papers
I A Robust Human Activity Recognition Approach Using OpenPose, Motion Features, and Deep Recurrent Neural Network
II Robot-Care for the Older People: Ethically Justified or Not?
III Fusion of Multiple Representations Extracted from a Single Sensor’s Data for Activity Recognition Using CNNs
IV Human Activity Recognition from Multiple Sensors Data Using Multi-Fusion Representations and CNNs
V In-Home Emergency Detection Using an Ambient Ultra-Wideband Radar Sensor and Deep Learning
VI Ultra-Wideband Radar-Based Activity Recognition Using Deep Learning
List of Figures
1.1 Percentage of population aged 60 years or over by region, from 1980 to 2050 [24].
1.2 Percentage of people aged 65 or older who reside alone, by country or region, 2006–2015 [25].
2.1 Training steps of the HWC model [12].
2.2 Human activity detection framework by [100].
2.3 A smart apartment schematic design based on several ambient sensors for the care of older adults [128].
2.4 Pipeline for data-level fusion. The data are first combined into one large vector before a decision is made based on the fused data.
2.5 Pipeline for feature-level fusion. The features are first extracted and combined into one large vector, then a decision is made based on fused features.
2.6 Pipeline for decision-level fusion. Each sensor/representation is trained first and then combined.
2.7 Basic architecture of a CNN [112].
2.8 Basic architecture of an RNN.
3.1 Setup of cameras from Berkeley MHAD acquisition system [93].
3.2 Schematic setup of the five indoor activities during the MECS experiment (A1 – lying, A2 – sitting with the legs on the bed, A3 – sitting with the legs on the floor, A4 – standing, and A5 – walking).
3.3 Ultra-wideband XeThru X4 sensor [7].
3.4 Polar A370 watch to collect heart rate.
3.5 Asus Xtion Pro Live 3D Sensor [116].
3.6 Glimpse of the collected dataset using Asus Xtion.
4.1 Papers included in the thesis, grouped by the research questions they address. For the full research questions, see Section 1.3.
4.2 Extraction of body joints while performing jumping jacks [93].
4.3 Data representation using early fusion [94].
4.4 Flowcharts of all methods: (a) single-sensor data/baseline; (b) Data-level fusion; (c) Feature-level fusion; and (d) Decision-level fusion in Paper IV [96].
4.5 A person sitting on the sofa beside a robot with a mounted UWB radar-based sensor [129].
4.6 Schematic setup of the five indoor activities used in Paper VI [90].
List of Tables
2.1 Summary of research works that use ambient and wearable sensor technology.
3.1 Datasets used for HAR method evaluation.
Chapter 1
Introduction
This chapter introduces the motivation for and foundation of the research contained in the thesis. The research question, sub-questions, and thesis outline are presented in this chapter.
1.1 Motivation and Aim
Figure 1.1: Percentage of population aged 60 years or over by region, from 1980 to 2050 [24].
The elderly cohort is growing faster than other age groups, according to a 2017 report by the United Nations Department of Economic and Social Affairs [24], as shown in Figure 1.1. In 2015, individuals over the age of 60 accounted for one out of every eight people. According to projections, the number of elderly persons will reach about 2.1 billion by 2050. One of the most difficult challenges in dealing with an elderly population is ensuring the efficient delivery of healthcare services [117]. The healthcare of older persons is also a major concern for their families and friends. This is especially important when older individuals are living at home alone, because they are at greater risk of being impacted by unplanned situations, such as falls. There are many public institutions that provide care services, but people may not feel comfortable with them because they sometimes do not respect the human rights of elderly patients [117]. There have been instances where individuals have experienced feelings of reduced dignity.
Living independently, either alone or with a partner, is likely to provide greater privacy and control over household decisions but less companionship and task sharing. However, the experience of living independently may vary between older people in developed and less developed nations. There is a higher percentage of older adults living alone in more developed nations [25], as shown in Figure 1.2. People in their old age who have the means to support themselves, whether via pensions, personal assets, or access to government-funded healthcare, are more likely to live by themselves as long as their health permits. Although many elderly people live alone, they often still depend on their children for support and communication. They may also depend on other family members, such as siblings and cousins, or non-related individuals, such as friends and neighbors. In the last few years, assisting seniors in maintaining their health and autonomy has become a topic of great research interest [9].
Figure 1.2: Percentage of people aged 65 or older who reside alone, by country or region, 2006–2015 [25].
As people get older, their healthcare costs can rise significantly. However, with fewer financial resources available to assist the elderly, many will struggle to manage this increased burden. Caregiving that is both efficient and accurate is therefore critical to ensuring the health and financial security of the elderly [114]. In order to offer timely and appropriate services, it is essential to collect accurate and up-to-date data via automated health monitoring systems [18]. In addition to cost-effectiveness, these systems would allow the elderly to retain their autonomy, enabling them to remain in their homes. Activities of daily living (ADLs) are a combination of everyday activities, such as eating, walking,
and sleeping, that are often used in healthcare to measure cognitive and physical well-being. Sensing technology advancements in the last several years have made it feasible to integrate sensors into older people’s houses, effectively enabling a permanent monitoring system [110]. However, the difficulty of automatically recognizing ADLs from sensor data remains unresolved [61].
Many studies on smart home features have focused on making life safer and more independent for the elderly [125]. Examples of such features include prompting via reminders, diagnostic tools, and the ability to anticipate and prevent hazardous situations. The reliability of activity recognition and prediction algorithms determines the performance of these functions. Several high-performing algorithms for activity recognition and prediction can be found in the relevant scientific literature [53, 107].
When a single sensor is unable to measure all critical aspects, or when perception is unclear, the sensor is vulnerable [31]. A single sensor modality cannot reliably detect human actions, because having only one source of data can lead to uncertainty. Multiple sensors in a health or activity monitoring system may be more effective for identifying complex activities. Integrating data from multiple modalities improves confidence and reliability, which reduces uncertainty. In this thesis, we aim to investigate the performance of cutting-edge algorithms in real-world settings such as the home. For smart home functions to be implemented and useful to older adults, algorithms must perform at a sufficient level of accuracy. This work is intended to investigate how far away we are from achieving smart home functions that can help older adults live safely and independently in their own homes.
1.2 Interdisciplinarity
This dissertation is a part of the Multimodal Elderly Care System (MECS) research project1. The MECS project investigates how robot sensing technologies can support older adults in being independent at home when their age and health would otherwise make them more dependent on support from other people.
The project’s goal is to make progress in welfare technology that can assist persons with health-related concerns who are living alone [115]. It focuses on robots and sensors that can assist in monitoring an elderly person who is living alone at home, predicting potential problems, and contacting other individuals when needed. Sensors like cameras that can be attached to robot companions instead of being permanently installed in an individual’s home can improve both performance and privacy. These would be utilized to detect falls and other abnormal activities. Using new sensor technology, it is possible to remotely monitor medical conditions via vital signs, such as heart rate and respiration.
Instead of relying on the elderly to activate their security alarms in an emergency, alarms in such a system are intended to be automatically activated by these sensors. A robot may notice an issue early, such as when the individual requires assistance or is in danger of falling, and may even intervene. Robots may also be a better system for older adults in terms of data collection and privacy protection than a house with multiple built-in sensors. The MECS project was undertaken through collaboration between the Robotics and Intelligent Systems (ROBIN)2 group and the Design of Information Systems (DESIGN)3 group at the Department of Informatics, University of Oslo.
1https://www.mn.uio.no/ifi/english/research/projects/mecs/index.html
1.3 Research Question
The primary objective of this thesis is to apply state-of-the-art sensors and activity recognition algorithms in order to determine their performance, applicability, and limitations. Our aim was to investigate this by extracting useful information from the data coming from multiple sensor sources. A number of activity recognition methods were tested on data from lab settings and showed promising performance. The hypothesis of this thesis can be formulated as follows:
Hypothesis: Multimodal deep learning approaches can efficiently identify human activities by combining multiple data sources in the context of robotic support in elderly care.
To test the main hypothesis, we have defined the following research questions (RQs).
• RQ-1: What are the privacy implications of using personal robot assistants in the home?
What privacy issues and trade-offs must we be aware of when having a robot in a home environment? Robots can aid seniors by assisting them with daily tasks, monitoring their health, and providing companionship. Despite the benefits of robots, elderly-robot interaction raises several ethical concerns. For example, seniors may become lonely as a result of reduced human interaction.
• RQ-2: How can a single sensor’s data be converted into multiple representations?
We can enhance the performance of our models if we present the data in multiple representations. Furthermore, relying solely on a single sensor modality to detect human actions is not very reliable due to uncertainties in the system.
• RQ-3: How many ways can multiple data representations be combined?
Multiple sensors or multiple representations in a health or activity monitoring system can help distinguish between different types of activities. Confidence and reliability are enhanced and uncertainty is reduced by integrating data from multiple sensors.
2https://www.mn.uio.no/ifi/english/research/groups/robin/
3https://www.mn.uio.no/ifi/english/research/groups/design/
• RQ-4: How can non-wearable sensors be used for and by elderly people to maintain privacy while also detecting some levels of emergency?
This research question focuses on how to recognize human activities while keeping users’ privacy.
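As a rough illustration of the idea behind RQ-2, the sketch below derives three representations from one window of single-sensor data. The chosen representations (raw samples, magnitude spectrum, first difference) and all function names are assumptions made for this example; they are not necessarily the representations used in the papers.

```python
import cmath
import math


def raw_window(signal):
    """Representation 1: the time-domain samples themselves."""
    return list(signal)


def magnitude_spectrum(signal):
    """Representation 2: magnitudes of a naive O(n^2) discrete Fourier transform.

    Only the non-redundant half (bins 0 .. n//2) is returned.
    """
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff))
    return spectrum


def first_difference(signal):
    """Representation 3: sample-to-sample change."""
    return [b - a for a, b in zip(signal, signal[1:])]


# Toy window (hypothetical data): a sine with 2 cycles over 16 samples.
window = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]

representations = {
    "time": raw_window(window),
    "frequency": magnitude_spectrum(window),
    "difference": first_difference(window),
}
```

Each entry in `representations` could then be fed to its own model or fused with the others, which is the subject of RQ-3.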
1.4 Research Methods
As stated by Dodig-Crnkovic [26], traditional research methodology is more difficult to use in computer science due to its interdisciplinary nature. A 1989 report by an ACM Education Board task force on the foundations of computer science defines the structure of how computing research should be conducted [23]. This report describes the essence of computer science as the convergence of numerous fundamental processes, with the key processes being applied mathematics, science, and engineering. These key processes are fundamentally reflected in (i) theory, (ii) abstraction, and (iii) design paradigms.
In this thesis, the theoretical paradigm involves establishing an ethical problem regarding robot care at home and collecting data that can be used to solve the problem. The abstraction paradigm entails creating the algorithms conceived by the earlier paradigm. Finally, the design paradigm refers to developing the prototype of the system by establishing its requirements and outlining its specifications. This thesis primarily touches on the elements of the first two processes. The following is a summary of how the thesis fits into each process:
• Theory: According to the report, the theoretical paradigm is rooted in mathematics and relates to the development of a coherent and valid theory. This phase consists of four steps, as follows:
1. characterize the objects of study (definition)
2. hypothesize the possible relationships among them (theorem)
3. determine whether the relationships are true (proof)
4. interpret the results
We have introduced our main objective and four research questions to investigate machine learning models for human activity recognition using multiple modalities while maintaining privacy. We hypothesized that we could generate multiple representations of data using a single sensor. To design the system’s algorithmic basis, we developed deep learning-based classification algorithms.
• Abstraction (modeling): The abstraction paradigm is rooted in the experimental scientific method. This phase consists of four steps, mentioned in the ACM report, as follows:
1. form a hypothesis
2. construct a model and make a prediction
3. design an experiment and collect data
4. analyze the results
In this research, we performed several experiments using publicly available datasets in addition to a new dataset that was generated to support the hypothesis. We explored multiple sensor fusion techniques for classification.
We also improved the models using different dimensionality reduction methods.
• Design: The design paradigm is closely related to engineering systems. The report describes this phase as a process consisting of four steps, as follows:
1. state requirements
2. state specifications
3. design and implement the system
4. test the system
This paradigm is intended to be implemented using a real-time AI system that employs the theories and abstractions from the two previous paradigms.
Such a system’s development is beyond the scope of this thesis.
1.5 Contributions
In order to meet the goal and fulfil the objectives of the thesis, we researched several challenges and addressed various issues in the domain of human activity recognition. The main contributions of this research are as follows:
• This study highlights the major ethical concerns regarding companion, assistive, and pet robots. Additionally, some preliminary suggestions for addressing these ethical issues through the use of appropriate guidelines and discussions with the elderly have been provided.
• We generated multiple data representations from only one sensor for human activity recognition. The idea is to use a minimum number of sensors and extract the maximum amount of useful information.
• We proposed three different levels to merge the data extracted from multiple representations or multiple sensors.
• We employed an ultra-wideband radar-based deep-learning approach to classifying human activities while maintaining privacy. A UWB sensor collects multiple data points, which can be challenging to handle; as such, dimensionality reduction techniques were introduced.
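The three fusion levels mentioned in the contributions above differ mainly in where the sources are combined. The following minimal sketch shows that difference. The feature extractor and the threshold rule are hypothetical placeholders that stand in for the CNN-based models used in the thesis, and the sensor names and values are invented.

```python
from collections import Counter


def extract_features(samples):
    """Hypothetical feature extractor: mean and range of one source."""
    return [sum(samples) / len(samples), max(samples) - min(samples)]


def classify(vector, threshold=1.0):
    """Placeholder classifier: 'active' if the mean absolute value exceeds a threshold."""
    score = sum(abs(v) for v in vector) / len(vector)
    return "active" if score > threshold else "rest"


def data_level_fusion(sources):
    """Concatenate the raw data into one vector, then decide once."""
    fused = [x for source in sources for x in source]
    return classify(fused)


def feature_level_fusion(sources):
    """Extract features per source, concatenate the features, then decide once."""
    fused = [f for source in sources for f in extract_features(source)]
    return classify(fused)


def decision_level_fusion(sources):
    """Decide per source, then combine the decisions by majority vote."""
    votes = [classify(extract_features(source)) for source in sources]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical windows from three sources (or three representations).
sources = [
    [2.0, -2.0, 2.5, -2.5],
    [0.1, -0.1, 0.2, -0.2],
    [1.5, -1.5, 1.2, -1.2],
]

results = {
    "data": data_level_fusion(sources),
    "feature": feature_level_fusion(sources),
    "decision": decision_level_fusion(sources),
}
```

In the decision-level case the second source votes "rest" on its own but is outvoted by the other two, which shows how a late combination can absorb one weak or noisy source.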
1.5.1 Other Contributions
In addition to the main contributions, this study also contributes to other topics related to our research. We looked into how to detect depression using a dataset that included motor activity recordings from a condition group (individuals with unipolar depression and bipolar disorder) and a control group [35]. Using speech data and convolutional neural networks, we were able to recognize emotions [106]. Following that, we were able to recognize the moods of patients with bipolar disorder from their daily phone calls. This research is still in progress. For this study, we collected data for nearly two years as part of the INTROMAT project4 [49]. We also tried to predict future heart rate using head movements. Each participant wore a wireless wristband that detected their heartbeat in real time, while the VR headset captured data on head movement [95].
We collaborated with Haukeland University Hospital, Bergen, on a project that sought to introduce mental health disorder treatment technologies [71]. We focused on enhancing knowledge and skills in interdisciplinary and cross-sectorial research, which is frequently mentioned as a critical tool for finding sustainable solutions to significant societal concerns [98].
In summary, by researching human activities and health-related challenges, we were able to explore a promising and essential path for society. We were also able to form collaborations with Brazil [77] and the United States [91]. Thus, this study establishes a solid basis for future collaboration and research in healthcare and human activity recognition.
1.6 Summary of Papers
The list of publications is given in the List of Publications section. This thesis is a collection of six research papers that comprise the overall research contribution.
Paper I demonstrates a robust human activity recognition approach using the OpenPose library, motion features, and recurrent neural network (RQ-1).
Paper II focuses on the ethical justification of the robot care for elderly people (RQ-1).
Paper III demonstrates how multiple representations can be produced using a single sensor’s data and how they can be fused (RQ-2 & 3).
Paper IV contains work with multiple sensors, each with multiple representations. Later, the representations are fused using data-level fusion, feature-level fusion, and decision-level fusion (RQ-2 & 3).
Paper V shows in-home emergency detection (normal vs abnormal situations) using an ambient ultra-wideband radar-based sensor (RQ-4).
Paper VI focuses on ultra-wideband radar-based activity recognition using deep learning (RQ-4).
4https://intromat.no/workpackages/wp1/
1.7 Thesis Outline
The current chapter provided an introduction to the thesis and the aim of the work. Chapter 2 presents the background for the thesis. Chapter 3 discusses data collection and the sensors used in our research. Chapter 4 presents an overview of the contributions of the research papers and individual summaries for each paper. Chapter 5 reviews the research findings and presents conclusions and suggestions for future work. The thesis is concluded in Chapter 6. All the papers are available at the end of this thesis.
Chapter 2
Background
This chapter opens with a discussion of human activity recognition (HAR) systems and their associated benefits and challenges. It then presents multiple sensors and concerns regarding privacy in HAR. Finally, machine learning and deep learning, as well as preprocessing techniques, are reviewed.
2.1 Sensing for Elderly Care and Robots
When it comes to using robots to monitor people, there are a number of sensor devices that play an essential role in obtaining data and passing them to the robot to be processed [87, 128]. Most research on assistive technology has focused on wearable sensors, which are often combined with ambient sensors to help older people live independently [94]. Wearable sensors are frequently more reliable for data collection than ambient sensors [93]. However, wearable sensor restrictions, such as their needing to be constantly carried or frequently charged, may discourage older people from actively using them. For instance, using on-body sensors may be uncomfortable for certain older people, and some people may experience unpleasant sensations from wearable sensors due to long-term direct contact with the body [115]. Employing external or ambient sensors is a solution that might be more popular with potential consumers due to their unobtrusive nature, but there is a chance that these sensors will produce inaccurate data. In order to strengthen the autonomy of people who are cared for by robots, researchers have been actively working on sensor-based technologies to find workable solutions for a variety of sensor combinations [64].
2.1.1 Privacy and Data Protection
Personal rights and the right to privacy are closely tied to individuality. Each individual has the right to personal autonomy and identity, which includes deciding when and how personal information is shared with others [111, 113].
In the last decade, significant technological breakthroughs have put this right to privacy at risk. The technological capabilities of continuous observation have made significant strides, especially technologies that provide continuous surveillance of human behavior [63, 132]. This ability to monitor people using data collection via multiple sensors raises the question of whether humans can live autonomously and freely under these conditions. When employing technology such as robots to monitor the health and behavior of the elderly, it is critical to protect their privacy [113].
However, there may still be issues with data holders’ fundamental rights to data protection, which means that data should only be retained when doing so is necessary for a specific purpose, such as treating or preventing an exceptional incident like a fall. First and foremost, data protection should entail ensuring that the data owner is satisfied with how securely the data are stored. This means that the privacy and data protection of the elderly must include accessible services and ensure legal safeguards for their fundamental rights [114].
2.1.2 Wearable Sensors
Multimodal data, such as human acceleration and heart rate monitoring, has the potential to be applied to detect human activity using wearable sensors [131, 139].
Guo et al. [46] developed multimodal activity recognition using MARCEL, a classifier ensemble system¹ that can work with labeled and unlabeled data. The neural network’s error function incorporates the classifier ensemble’s diversity.
For multi-classifier fusion algorithms, majority voting has been widely used.
In majority voting, all classifiers carry equal weight, so weak classifiers can easily degrade the performance of the fusion algorithm. In [47], the authors proposed an entropy-weighted, multi-sensor, multi-classifier, hierarchical fusion technique for HAR using wearable inertial sensors. The fusion was two-fold, comprising classifier fusion and sensor fusion, and the weights were estimated using the entropy weight method. The proposed method was tested with five classifiers on accelerometer and gyroscope data. Actigraphy devices or wearable sensors are subject to faults and failures. In [12], the authors introduced a hierarchical weighted classifier (HWC) that could cope with technological sensor anomalies.
Standard activity recognition approaches with single sensors demonstrate low tolerance against sensor anomalies. Figure 2.1 shows the training steps of the HWC model.
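The difference between plain majority voting and weighted classifier fusion can be illustrated with a minimal sketch. The weights below are hypothetical reliabilities chosen for illustration; they are not computed with the entropy weight method of [47], and the function name `weighted_vote` is an assumption of this example.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Fuse one predicted label per classifier into a single decision.

    predictions: list of class labels, one per classifier.
    weights: optional list of non-negative floats (hypothetical reliabilities);
             None gives plain majority voting with equal weights.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # On a tie, the label that appeared first in `predictions` wins.
    return max(scores, key=scores.get)

votes = ["walking", "sitting", "sitting"]
majority = weighted_vote(votes)                              # equal weights
reweighted = weighted_vote(votes, weights=[0.9, 0.3, 0.3])   # one strong classifier
```

With equal weights, the two weaker classifiers outvote the first one; with the assumed weights, the single more reliable classifier determines the fused decision.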
Garcia et al. [37] introduced multi-view stacking with accelerometer and sound data for activity recognition. Initially, a model was trained separately for each sensor, and the outputs were then fused using stacked generalization². The approach was validated on three publicly available datasets: the Berkeley MHAD dataset, the UTD-MHAD dataset, and the Opportunity dataset.
A multimodal, multi-stream deep learning approach was introduced in [120] for egocentric³ activity recognition. In the proposed methodology, egocentric video and sensor data are used to train multi-stream convolutional neural networks (CNNs) and multi-stream long short-term memory (LSTM) architectures to learn discriminative spatial and temporal features. Two-stream CNNs were upgraded to three-stream CNNs by adding stabilized optical flow, which was able to capture the foreground motion information. The results showed very encouraging performance (slightly worse results than handcrafted features), despite the fact that they had limited egocentric samples for training. In HAR, most works rely on handcrafted features, which are sometimes not discriminative enough to classify activities accurately. Recently, CNNs have shown significant improvement in activity recognition. In [140], the authors built a CNN-based architecture to analyze multi-channel time series data. Before classification, several channels were combined into one layer. The multi-channel time-series architecture built on the CNN was task-dependent and had greater discriminative power for categorizing human activities.
Figure 2.1: Training steps of the HWC model [12].
¹ Ensemble learning is a method of building multiple base classifiers, from which a new classifier is created that outperforms any constituent classifier.
² A type of ensemble approach, also known as stacking, with the idea of combining several learners. The technique entails using the initial training data to train a group of learners (referred to as the first-level learners). Then, a second-level learner, known as the meta-learner, is trained using the first-level learners’ outputs.
³ The study of cameras that can be worn on the body, often on the head or chest. They capture approximately the wearer’s field of vision, which is known as egocentric vision.
2.1.3 Hybrid Sensors: Wearable and Ambient
As wearable sensors can provide more precise information on the health status of the elderly, such as heartbeat, muscle movements, respiration, and blood flow, they have been the subject of numerous studies on healthcare for this population [8]. However, seniors may not be willing to adopt body-worn sensors easily. Nonetheless, a number of research projects utilize both wearable and ambient sensors [51, 104], as shown in Table 2.1.
Authors | Purpose | Sensors
Tolkiehn et al. [123] | Direction-sensitive fall detection | Accelerometer, barometric pressure sensor
Sim et al. [118] | Daily life activity recognition | Accelerometer, RFID, pressure sensor, PIR motion sensor
Nyan et al. [101] | Fall detection using gyroscopes | Gyroscope, video camera
Hein et al. [54] | Recognize different daily life activities | Accelerometer, door sensors, PIR motion sensors, video camera
Medjahed et al. [79] | A fuzzy-logic based approach to recognize daily life activities | PIR motion sensors, sound sensors, physiological sensors
Bang et al. [11] | An environmental sensor- and accelerometer-based approach was proposed to recognize activities of daily life | PIR motion sensors, environmental sensors, accelerometer
Kerdjidj et al. [66] | Fall detection and HAR using wearable sensors and compressed sensing | Accelerometer, gyroscope
Martínez-Villaseñor et al. [78] | UP-fall detection dataset for HAR and fall detection | Infrared sensors, accelerometer, web cameras, gyroscope, electroencephalograph
Ferda et al. [102] | Multimodal human action database for action recognition and pose estimation | Accelerometers, microphones, motion capture system, four multi-view cameras, one Kinect system
Ozcan and Velipasalar [103] | Wearable camera- and accelerometer-based fall detection on portable devices | Accelerometer, camera (smartphone)
Table 2.1: Summary of research works that use ambient and wearable sensor technology.
In [105], the authors proposed an algorithm to fuse data from environmental sensors and wearable motion sensors via particle filtering to localize humans.
The location of the person was detected by passive infrared sensors. Due to privacy concerns, no image was captured to localize the person. Thus, the body movement was extracted by an inertial measurement unit (IMU) sensor, which was attached to the human body, and the correlation was calculated to improve accuracy. In another study, a multi-source fusion framework driven by user-defined knowledge for egocentric activity recognition was proposed [141].
The information was collected from three sources: the wearer, a wearable camera, and sensor data from IMU and GPS sensors. In [29], the authors proposed feature-level fusion for HAR. Data from wearable sensors and an RGB-D dataset were fused at the feature level. Time-domain features were extracted from the wearable sensors, and histogram of oriented gradients (HOG) descriptors were extracted from the RGB-D videos. The features were fed into conventional classifiers, such as k-nearest neighbors (KNN). The authors claimed better performance than CNNs. Eitel et al. [30] introduced a two-stream multimodal deep learning approach for robust RGB-D object recognition. This method was able to fuse the RGB and depth information automatically before classification. A novel depth-data augmentation improved the recognition in noisy real-world environments.
A multi-sensor fusion approach based on multiple classifier systems for HAR was introduced by [100]. Researchers have fused multiple sensors at the feature level in the past, but mostly a single classifier was used at the end. In this approach, heterogeneous sensors were paired with various classifiers, and the predictions were then stacked and evaluated by a meta-classifier, with more than one classifier used at the meta level. The final result was the average of the predictions of the two classification algorithms. Their proposed framework is shown in Figure 2.2.
Ambient sensors for senior care can be installed in a smart home’s various locations to track seniors’ actions or physical conditions [128]. Figure 2.3 shows a schematic representation of a smart home for tracking an elderly person’s behavior using a variety of ambient sensors. There are several frequently used sample sensors displayed throughout the apartment. It is also possible to install additional sensors, such as ambient sensors that measure the temperature and humidity.
2.2 Fusion of Data from Sensors
A sensor is vulnerable to, for instance, occlusions or missing features when it cannot measure all relevant attributes of an object or when the perception is uncertain [137]. Due to these uncertainties, it is impractical to rely solely on a single sensor modality to identify human behaviors. Health or activity monitoring systems with multiple sensors may be more effective at differentiating complicated behaviors [51]. Combining data from multiple sensors reduces ambiguity and uncertainty while enhancing dependability and confidence. Several researchers have presented strategies for fusing various datasets or channels utilizing early
Figure 2.2: Human activity detection framework by [100].
or late fusion techniques [30, 37, 48]. Data fusion, feature fusion, and decision fusion are the three main types of fusion that are generally used [5, 46]. A detailed overview of these techniques is presented in Section 2.3.
Wearable sensing technologies, such as actigraphy devices and smartphones, have made it possible to gather behavioral data and monitor mental states [35]. Machine learning tools are needed to analyze these kinds of data. In 2018, a survey was conducted regarding mental health monitoring with multimodal sensing and machine learning [38]. The survey focused on mental health monitoring systems (MHMSs) for mental conditions such as anxiety disorders, bipolar disorder, depression, epilepsy, and stress. LeCun et al. [72] reviewed the vast deep learning improvements in object detection, healthcare, and many other domains. The authors provided an overview of supervised learning, back-propagation to train multilayer architectures, CNNs, and RNNs. They also argued that unsupervised learning will become more important in the long term, since human and animal learning is largely unsupervised.
Figure 2.3: A smart apartment schematic design based on several ambient sensors for the care of older adults [128].
Since single classifier systems are not capable of reducing the uncertainty and ambiguity in HAR, multiple classifier systems are introduced by fusing outputs of different classifiers. Though we have several resources, computationally efficient deep learning development for mobile and wearable devices is still lacking. This is needed for multimodal data fusion for context-aware detection, data security, and many other challenges [143]. A survey conducted by Aguileta et al. [4] about multi-sensor fusion for activity recognition explored the benefits of multiple sensors to determine whether other sensors can compensate for the information missed by one sensor. They showed the significance of using a single fusion method and using two fusion methods. The survey explained that the combination of heterogeneous sensors performed better than homogeneous sensors. Similarly, automatically extracted features were compared with manually extracted features.
The literature is also lacking in providing reasons for choosing specific methods for specific datasets.
A 2019 study reviewed human activities and health monitoring systems based on data fusion and multiple classifier systems [99]. The authors reviewed the three basic fusion approaches: decision, feature, and data fusion. Multimodal sensors and inertial sensors were utilized for health monitoring and distinguishing activities of similar data patterns. Deep feature fusion performed significantly well with respect to handcrafted features. In [67], the authors critically reviewed data fusion with state-of-the-art techniques. The data-related fusion problems were categorized into four challenging issues: imperfection, correlation, inconsistency,
and disparateness.
Shortcomings in the data are the most challenging problem in data fusion.
In addition, algorithms might need prior knowledge of the cross-covariance of the data to produce consistent results. Lastly, multiple sensors might be used in the fusion system, and the fusion of disparate data⁴ is another challenging task. In [16], the authors explored deep temporal multimodal fusion, which is similar to early fusion. This is end-to-end deep learning, which means there is no need for handcrafted features. The correlations between features within each modality were minimized by implementing a hierarchical model and fusing temporally afterward.
Multimodal deep learning approaches using audio and video modalities were introduced by [89]. Features were learned over both modalities and trained and tested according to the given settings. Three learning settings were introduced: multimodal fusion, shared-representation learning, and cross-modality learning. In the multimodal fusion setting, all the data were available in supervised training and testing. Shared-representation learning correlated across both modalities. For instance, after feature learning, a model was trained by supervised learning on the audio data and tested on the videos, and vice versa. Cross-modality learning used all data in feature learning, but only a single modality was used in training and testing. Recently, actigraphy devices and smartphones have been used for activity recognition. In multi-sensor datasets, researchers extract the features from each sensor and aggregate the features for final classification, which is not optimal, since each sensor would have different statistical properties [36].
2.3 Combining Multiple Data Representations
A detailed overview of fusing different data from sensors was provided earlier in the chapter. It is worth noting that most researchers have worked on combining multiple sensors, rather than combining multiple views of a single sensor. Suitable combinations of the available features can produce more accurate and precise results. The combination of different sensors or features, encompassing feature spaces with varying dimensions, is a non-trivial problem that has been the subject of study for many years.
A collection or combination of features can assist a system in producing more precise classification and search results. However, there are pitfalls associated with the application of data combination that must be taken into consideration.
If data combination is not conducted properly, it can result in performance loss.
Combining characteristics can occur in three distinct ways. Data-level fusion fuses all the data at the very first stage, prior to applying a decision-making algorithm. Feature-level fusion combines the values of many features into a single representation prior to the application of the decision-making process. Decision-level fusion occurs after a step of decision-making, when the individual decisions are merged; this is late fusion. The following sections discuss these techniques in more detail.
⁴ Disparate data is heterogeneous data that is collected from any number of sources.
2.3.1 Early Fusion
The concept of early fusion is to combine unimodal data or features after preprocessing or feature extraction into a multimodal representation, which entails combining the feature values into one large vector for representation [59, 138]. These large representation vectors can be used for further search or classification tasks with supervised or unsupervised learning. Supervised learning requires labeled training data, whereas unsupervised learning uses no labeled training examples. Early fusion has two subcategories: data-level fusion and feature-level fusion.
2.3.1.1 Data-Level Fusion
Figure 2.4: Pipeline for data-level fusion. The data are first combined into one large vector before a decision is made based on the fused data.
A detailed overview of the pipeline of data-level fusion is shown in Figure 2.4. During data-level fusion, all the data are concatenated into one long vector. The user can combine text, audio, and visual elements, as well as their variants. Sometimes the combined vector contains a large number of artifacts that can be suppressed by employing dimensionality reduction techniques, such as independent component analysis (ICA), principal component analysis (PCA), and linear discriminant analysis (LDA). Using one of these techniques can reduce the length of the vectors.
2.3.1.2 Feature-Level Fusion
In feature-level fusion, features are extracted from each modality before the modalities are combined [76]. The combination of features is entirely dependent on the user or task solver. For instance, the user can combine textual, audio, and video modalities. A challenge posed by feature-level fusion is that combining multiple features prior to classification may increase data noise. Additionally, combining the characteristics of distinct modalities can be difficult. Typically, these sorts of issues arise when modalities have distinct dimensions and value ranges for their dimensions, for example, when textual features with approximately one hundred dimensions are combined with audio features with one thousand dimensions. To handle such data, preprocessing by selecting or reducing features and scaling or normalizing the data prior to fusion is recommended [45, 119]. A detailed overview of the pipeline for feature-level fusion can be found in Figure 2.5.
Figure 2.5: Pipeline for feature-level fusion. The features are first extracted and combined into one large vector, then a decision is made based on fused features.
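A minimal sketch of feature-level fusion is given below, assuming that features have already been extracted per modality. The synthetic arrays, their dimensions (100 and 1000, mirroring the example in the text), and the helper names `zscore` and `fuse_features` are assumptions made for illustration; each modality is scaled separately so that differing value ranges do not dominate the fused vector.

```python
import numpy as np

def zscore(features):
    """Standardize each feature column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return (features - mean) / std

def fuse_features(*modalities):
    """Scale each modality separately, then concatenate along the feature axis."""
    n_samples = modalities[0].shape[0]
    if any(m.shape[0] != n_samples for m in modalities):
        raise ValueError("all modalities must describe the same samples")
    return np.hstack([zscore(m) for m in modalities])

rng = np.random.default_rng(0)
text_features = rng.normal(0.0, 1.0, size=(8, 100))      # about 100 dimensions
audio_features = rng.normal(50.0, 20.0, size=(8, 1000))  # about 1000 dimensions
fused = fuse_features(text_features, audio_features)     # one vector per sample
```

The fused matrix has one row per sample and 1100 columns; a dimensionality reduction step could follow before classification.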
2.3.1.3 Late Fusion
In late fusion, each modality representation is classified separately. Late fusion is also referred to as decision-level fusion [138]. The final result is obtained by combining the decisions of the individual classifiers. Figure 2.6 depicts a summary of the late fusion pipeline. Because a separate classifier is needed for each representation, late fusion is costly during training. Since the features themselves are not combined, some useful information may be lost [119].
Combining the results of all classifiers is a crucial step that can be accomplished in a variety of ways. The optimal method depends on the datasets, the features used in the datasets, and the metrics used to calculate the distances between various features. A smart and well-selected combination method can enhance classification outcomes [41]. Some datasets are better suited for late fusion based on rank score, whereas others are more suited for fusion based on weighted rank [27, 33, 124]. The experiments showed that when features with the same metrics for their distance scores are combined, the outcomes are better.
Alternatively, if they use different distance metrics for the scores, a combination by rank may produce more accurate results [33]. Escalante et al. [32] concluded that late fusion works better for multimedia retrieval techniques. To achieve late fusion, they used ranked lists generated by their system’s search queries.
In the following section, we will discuss data preprocessing techniques.
Figure 2.6: Pipeline for decision-level fusion. A classifier is first trained for each sensor/representation, and the resulting decisions are then combined.
2.4 Preprocessing Techniques
Preprocessing, the first step in the deep learning pipeline, entails putting the raw input into a form that a network can understand. It relies on the data dimensions or type. For instance, we may remove noise or normalize the input data. The input picture can also be resized to match the dimensions of an input image layer. Preprocessing the data might improve desired features, which eventually helps to avoid biasing the network. To make sure the deep model performs effectively, we must employ specific preprocessing methods prior to feeding data into the model. Segmentation, scaling, one-hot encoding, handling missing data, and transformation are the basic preprocessing methods.
2.4.1 Segmentation
The data may need to be segmented before being fed to a deep model, depending on the sensor data used. In an image-based activity recognition system, differentiating actions like gestures and walking from a single image is possible.
However, we are unable to determine activities from a single magnetometer or accelerometer sample. A single data sample (except an image) cannot represent activity characteristics. As a result, we must segment these data into sequences using a fixed time window (for instance, 5 s), from which we can develop HAR models. Aligning the windows in time is very important when merging data from several sensors because the sampling frequency of each sensor may differ. In recent years, time-aware models have been developed to handle data with irregular time intervals [13].
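Fixed-window segmentation can be sketched as below. The 50 Hz sampling rate, the 50% overlap between consecutive windows, and the function name `segment` are assumptions for illustration; only the fixed window length (for instance, 5 s) comes from the text.

```python
import numpy as np

def segment(signal, sampling_rate_hz, window_s=5.0, overlap=0.5):
    """Split a (n_samples, n_channels) recording into fixed-length windows.

    Returns an array of shape (n_windows, window_len, n_channels).
    Trailing samples that do not fill a whole window are dropped.
    """
    window_len = int(round(window_s * sampling_rate_hz))
    step = max(1, int(round(window_len * (1.0 - overlap))))
    if len(signal) < window_len:
        return np.empty((0, window_len, signal.shape[1]))
    starts = range(0, len(signal) - window_len + 1, step)
    return np.stack([signal[s:s + window_len] for s in starts])

# 60 s of 3-axis accelerometer data sampled at 50 Hz (synthetic placeholder values).
accelerometer = np.zeros((60 * 50, 3))
windows = segment(accelerometer, sampling_rate_hz=50)
```

When several sensors with different sampling rates are fused, the same `window_s` can be used for each sensor so that window boundaries correspond to the same time spans, possibly after resampling.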
2.4.2 Scaling
To enable deep models to attain optimal performance, it is typically necessary to rescale the raw data to a specific range, as deep models prefer to work with small input data (e.g., between 0 and 1). If a model’s input values are extremely large, it tends to learn large weight values, which raises the computing cost [42].
Normalization and standardization are two typical scaling approaches.
Normalization rescales data from the original range to the interval between 0 and 1. Let $w$ represent a sensor input vector, which typically corresponds to one column of the input matrix. The mathematical expression of the normalizing procedure is as follows:
\[ w' = \frac{w - \min(w)}{\max(w) - \min(w)} \tag{2.1} \]
where $\min$ and $\max$ are functions that calculate the minimum and maximum values of the given vector. The resultant vector $w'$ has a range between 0 and 1.
In cases where the vector’s minimum or maximum value is unavailable, or when there are outliers, normalization may not be effective [39].
Standardization is a common scaling method that is less affected by outliers.
It is described mathematically as:
\[ w' = \frac{w - \mu_w}{\sigma_w} \tag{2.2} \]
where $\mu_w$ and $\sigma_w$ are the mean and standard deviation of the input vector $w$, respectively. After standardization, the result has a mean of 0 and a standard deviation of 1.
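Both scaling approaches (Equations 2.1 and 2.2) can be written in a few lines. The sketch below is illustrative; the handling of constant vectors (returning zeros to avoid division by zero) is an assumed convention, and the sample values are invented.

```python
import numpy as np

def normalize(w):
    """Min-max normalization (Equation 2.1): maps w onto the interval [0, 1]."""
    w = np.asarray(w, dtype=float)
    span = w.max() - w.min()
    if span == 0:
        return np.zeros_like(w)  # constant vector: assumed convention, avoids 0/0
    return (w - w.min()) / span

def standardize(w):
    """Standardization (Equation 2.2): zero mean and unit standard deviation."""
    w = np.asarray(w, dtype=float)
    sigma = w.std()
    if sigma == 0:
        return np.zeros_like(w)  # same assumed convention for constant vectors
    return (w - w.mean()) / sigma

w = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # the last value is an outlier
w_norm = normalize(w)
w_std = standardize(w)
```

In this example, the single outlier compresses the remaining normalized values into a narrow band near 0, which illustrates why min-max normalization may be ineffective when outliers are present.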
2.4.3 Label Encoding
In general, activity labels like walking and sitting are categorical. Nevertheless, deep models cannot operate effectively with categorical labels and require that all input data be numeric. We can easily achieve this by encoding each label as an integer. However, integer encoding may not perform well if the model seeks to identify a natural ordering link across categories. In order to avoid this, the label is encoded with one-hot encoding [86]. One-hot encoding employs an identity matrix with the same dimensions as the number of activity kinds.
Each row contains a single element with the value 1, denoting the activity it represents.
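One-hot encoding via an identity matrix can be sketched as follows. The label names and the alphabetical class ordering are assumptions of this example.

```python
import numpy as np

def one_hot(labels):
    """Return (encoded, classes): rows of an identity matrix selected by label."""
    classes = sorted(set(labels))                    # fixed, reproducible ordering
    index = {name: i for i, name in enumerate(classes)}
    identity = np.eye(len(classes), dtype=int)       # one row per activity class
    encoded = identity[[index[name] for name in labels]]
    return encoded, classes

labels = ["walking", "sitting", "walking", "standing"]
encoded, classes = one_hot(labels)
```

Each encoded row contains a single 1 in the column of its class, so no artificial ordering between activities is implied.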
2.4.4 Feature Selection
Different features possess distinct properties, depending on the context in which they are used. To make the classifier fast and reliable, we must first determine which features will be used for particular use cases. This preliminary stage is necessary, as randomly selected features and combinations of those features may
lead to undesirable outcomes or poor performance. Moreover, the presence of random features may increase the system’s noise and slow down the predictions.
It is important to explore the specific features required for specific tasks. A significant amount of research has been performed in the field of feature selection, and many machine learning techniques have been applied [82, 88, 97]. Using a variety of techniques, we can also eliminate unnecessary characteristics and reduce dimensionality. For instance, PCA is used to reduce a larger set of features to a smaller one while preserving approximately the same amount of information. Another example is information gain attribute evaluation, which measures the information gain of a feature relative to the classification problem in order to determine which feature provides the most information. In some studies, LDA was used for dimensionality reduction [122].
2.4.5 Transformation
Before using the input data to train a deep model, it is frequently desirable to apply specific transformations. Transformations are typically applied to input data to reduce correlations. ICA, PCA, and LDA are popular transformation or dimensionality reduction techniques.
2.4.5.1 Principal Component Analysis
PCA generates a new feature space and focuses on the direction of maximum covariance. Normally, PCA is used to reduce the dimension of the data by focusing on the essential variations. The covariance matrix of the data is defined as:
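\[ O = \frac{1}{N} \sum_{i=1}^{N} \nabla(M_i)\,\nabla(M_i)^{T}, \tag{2.3} \]
\[ \nabla(M_i) = \gamma(M_i) - \bar{\gamma}, \tag{2.4} \]
\[ \bar{\gamma} = \frac{1}{N} \sum_{i=1}^{N} \gamma(M_i), \tag{2.5} \]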
where $\gamma$ represents the Gaussian kernel, $\bar{\gamma}$ is its mean over all events (Equation 2.5), and $N$ represents the number of events in the activity period. Eigenvalue decomposition can be applied as:
\[ O = E^{T} \alpha E, \tag{2.6} \]
where $E$ represents the principal components and $\alpha$ is the diagonal matrix of the eigenvalues. The features for an event can then be represented by projection onto the principal components as:
\[ P = M E_m^{T}. \tag{2.7} \]
The size of the matrix $E$ becomes $t \times m$, where $t$ is the dimension of each vector, $m$ is the number of principal components to be considered, and $\alpha$ is an $m \times m$ diagonal matrix. Moreover, $E$ maps the original coordinate system onto the eigenvectors. The eigenvector corresponding to the largest eigenvalue indicates the axis of largest variance, the eigenvector corresponding to the next largest eigenvalue indicates the orthogonal axis of second largest variance, and so on. In general, directions whose eigenvalues are close to zero carry little variance and can thus be excluded. Hence, the $m$ eigenvectors corresponding to the largest eigenvalues can be used to define the subspace.
PCA is a second-order statistics-based analysis approach that represents global information independently [127]. Applying PCA to human activities produces global features that represent frequently moving parts of the human body engaged in various activities [62, 126].
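The PCA procedure can be sketched with an eigendecomposition of the covariance matrix. This is a simplified illustration under stated assumptions: the kernel mapping is omitted (the data are used directly), the data are synthetic, and the function name `pca` and its return values are choices of this example rather than part of the cited work.

```python
import numpy as np

def pca(data, m):
    """Project (N, t) data onto its m leading principal components.

    Returns (projected, components, eigenvalues); components has shape (m, t).
    """
    centered = data - data.mean(axis=0)                 # subtract the mean
    covariance = centered.T @ centered / len(data)      # 1/N normalization
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:m]           # keep the m largest
    components = eigenvectors[:, order].T
    return centered @ components.T, components, eigenvalues[order]

rng = np.random.default_rng(0)
# Synthetic 3-D data whose variance is concentrated along the first axis.
data = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.1])
projected, components, eigenvalues = pca(data, m=2)
```

The direction with the near-zero eigenvalue (the third axis here) is discarded, and the variance of each projected coordinate equals the corresponding eigenvalue.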
2.4.5.2 Linear Discriminant Analysis
LDA, which is common in supervised classification methods, creates hyperplanes to divide the various classes. These hyperplanes maximize the separation between classes and minimize the intraclass variance. LDA projects the input data into a lower-dimensional space, which extracts discriminative features and reduces the dimensionality of the data [14]. The between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ are defined as:
\[ S_B = \sum_{i=1}^{c} J_i (n_i - n)(n_i - n)^{T} \tag{2.8} \]
\[ S_W = \sum_{i=1}^{c} \sum_{m_k \in C_i} (m_k - n_i)(m_k - n_i)^{T}, \tag{2.9} \]
where $J_i$ is the number of vectors in the $i$th class $C_i$, $c$ is the number of classes (number of activities), $n$ is the mean of all vectors, $n_i$ is the mean of class $C_i$, and $m_k$ is a vector belonging to class $C_i$.
The optimal discrimination matrix is selected by maximizing the ratio of the determinant of the between- and within-class scatter matrices as:
\[ D_{opt} = \arg\max_{D} \frac{|D^{T} S_B D|}{|D^{T} S_W D|}, \tag{2.10} \]
where $D_{opt}$ is the set of discriminant vectors of $S_W$ and $S_B$ corresponding to the $(c-1)$ largest generalized eigenvalues $\lambda$ and can be obtained by solving Equation (2.11):
\[ S_B d_i = \lambda_i S_W d_i, \quad i = 1, 2, \ldots, (c-1), \tag{2.11} \]
where the rank of $S_B$ is $(c-1)$ or less; hence, the upper bound value of $t$ is $(c-1)$.
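The scatter matrices and the generalized eigenproblem can be sketched as follows, using the standard Fisher definitions of between-class and within-class scatter. The two-class synthetic data, the use of a pseudo-inverse to solve the eigenproblem, and the function name `lda` are assumptions made for illustration.

```python
import numpy as np

def lda(data, labels, n_components=None):
    """Fisher LDA: returns discriminant vectors as columns, shape (t, n_components)."""
    classes = np.unique(labels)
    overall_mean = data.mean(axis=0)
    t = data.shape[1]
    s_b = np.zeros((t, t))
    s_w = np.zeros((t, t))
    for c in classes:
        members = data[labels == c]
        class_mean = members.mean(axis=0)
        diff = (class_mean - overall_mean)[:, None]
        s_b += len(members) * (diff @ diff.T)      # between-class scatter
        centered = members - class_mean
        s_w += centered.T @ centered               # within-class scatter
    # Generalized eigenproblem S_B d = lambda S_W d, solved via pinv(S_W) @ S_B.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.pinv(s_w) @ s_b)
    order = np.argsort(eigenvalues.real)[::-1]
    if n_components is None:
        n_components = len(classes) - 1            # rank of S_B is at most c - 1
    return eigenvectors[:, order[:n_components]].real

rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
class_b = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(100, 2))
data = np.vstack([class_a, class_b])
labels = np.array([0] * 100 + [1] * 100)
direction = lda(data, labels)       # a single discriminant vector for two classes
projected = data @ direction
```

With two classes, at most one discriminant vector exists; in this toy example it points mainly along the axis on which the class means differ.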
2.5 Machine Learning
Machine learning is a type of artificial intelligence (AI) that allows software applications to make more accurate predictions. Machine learning algorithms predict output values using historical data as input. In general, machine learning algorithms can be classified as supervised or unsupervised. Supervised algorithms are trained on labeled data, and the labels of the testing data are sometimes known as well. In unsupervised algorithms, the final class or label is unknown; hence, these algorithms are used to comprehend and explore unlabeled data [52].
When conventional HAR systems are designed with shallow learning⁵, the performance of the machine learning techniques is highly dependent on the data representations [15]. The most frequently used features are time-domain features [20], frequency-domain features, and other transformations (wavelet transforms) [58]. The time-domain features include time sequences, variance, and mean, while frequency-domain features include Fourier transform coefficients and entropy.
In contrast, deep-learning features are learned hierarchically from raw data via nonlinear transformations. The deep learning network type is determined by the nonlinear transformation. In recent years, deep learning has gained popularity, and various activity recognition projects have used deep learning [17, 21, 43]. Convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) RNNs are common deep-learning techniques in activity recognition. In this thesis, we applied various classification algorithms to the dataset for comparative and performance analyses. These will be briefly introduced in the following sections.
2.5.1 Conventional Machine Learning Classifiers
• Support Vector Machines: Cortes and Vapnik [22] introduced support vector machines (SVMs), which employ support vectors. SVMs have been widely used in HAR systems due to their superior classification performance [1, 3]. An SVM generates hyperplanes in order to maximize the gap between classes. The optimal solution can be obtained by minimizing the cost function, which means maximizing the distance between the hyperplane and the nearest training point. Kernels such as the sigmoid and RBF (Gaussian) kernels are used depending on the linearity or nonlinearity of the data.
• AdaBoost: Adaptive boosting, or AdaBoost [34], is primarily used for ensemble learning and meta-learning. It employs an iterative method to improve the performance of classifiers by learning from their mistakes.
AdaBoost is widely used by HAR researchers [121].
• Quadratic Discriminant Analysis: Quadratic discriminant analysis (QDA) is closely related to linear discriminant analysis, but it does not assume that the covariance of each class is identical [40]. It assumes that the distribution of each class is Gaussian.
⁵ Shallow learning refers to everything other than deep learning, such as traditional machine learning models like support vector machines.
• K-nearest neighbors: K-nearest neighbors (KNN) is one of the most basic classification approaches used in machine learning. The KNN algorithm identifies the training data points closest to a new observation and uses their classes to predict the class of that observation [68].
• Decision Trees: A decision tree is a decision support tool that uses a tree-like graph to model decisions and their potential outcomes, including their usefulness and likelihood. It is a well-known classifier in machine learning. Each internal node represents a test on an attribute, such as whether a coin flip comes up heads or tails. Each branch represents a potential test result, and every leaf node represents a class label. A decision is made after all relevant attributes along a path have been evaluated, so the classification rules correspond to the paths from the root to the leaves [6].
• Random Forests: The random forest (RF) method is applied to classification and regression problems [55]. It generates multiple decision trees based on a random selection of variables and data and combines the outputs of these trees to predict the dependent variable. RF has been widely used to identify various human activities [93, 109].
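As an illustration of the simplest classifier in the list above, the k-nearest-neighbor rule can be sketched in a few lines. The Euclidean distance, the choice k = 3, and the toy feature values are assumptions of this example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(train_x - query, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D features (hypothetical values) for two activities.
train_x = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
                    [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]])
train_y = ["sitting", "sitting", "sitting", "walking", "walking", "walking"]
prediction = knn_predict(train_x, train_y, np.array([1.8, 1.8]))
```

No model is fitted in advance; all computation happens at prediction time, which is why KNN is often used as a baseline.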
2.5.2 Deep Learning
Deep learning algorithms are based on neural networks and rely on recently developed training techniques. They essentially learn abstract representations of individual data points. The networks are deep because data are represented at a high abstraction level and are processed using numerous layers.
The different layers can learn about the data at different levels of abstraction by using the information from the layers below them until they reach the last layer, which makes the final decision for the class. The rise of GPU computing, which permits network training in a reasonable time [72], has largely enabled the development of new training strategies for deep learning. Multiclass classification can be greatly aided by deep learning. Disadvantages include a lengthy training period, difficult-to-explain classification boundaries (why a certain data point was placed in a given class), and a strong reliance on data [83]. In this thesis, we worked primarily with CNNs and RNNs. A more detailed discussion about deep learning in the context of HAR can be found in Section 2.5.2.3.
2.5.2.1 Convolutional Neural Networks
CNNs, also called ConvNets, are extensively used for their ability to learn features from raw sensor data. CNNs have been used most commonly for image analysis, but they can also be applied to other data analysis and classification problems. LeCun et al. [73] first employed CNNs for handwritten number