
Figure 3.5: Experiment GUI: The remaining messages of the conversation are displayed to the user after submission.

In order to use the data from the datasets, some preparation and preprocessing was necessary, as several data sources had to be combined to derive meaning from the data.

A Python script was written using the libraries lxml and Pandas: lxml provided powerful tools for parsing the XML files, while Pandas was used to store, keep track of, and work with all the data.

Each of the 32063 XML files in the dataset from Bours and Kulsrud was first parsed to find the file number, conversation ID and author IDs of the conversation. Each file was stored as a new row in a Pandas DataFrame, with the different values stored in their own columns. When a file was parsed, the first author ID of the conversation, meaning the first person to send a message, was checked against the ground truth file. If the ID was found, the author ID was stored in the DataFrame, the subresult column of the DataFrame was labeled left for the file in question, the conversation was labeled predatory, and the script proceeded to the next file. If the first ID of the conversation was not found, the second one was checked against the ground truth file.

If this second ID was found, it was labeled right, the conversation was labeled predatory, and the script proceeded to the next file. If neither of the two author IDs of the conversation was found, the file was labeled non-predatory and the script proceeded to the next file. The left and right labels refer to the visual representation of the conversations, matching how the conversations are displayed in the experiment GUI: the left-hand side is always the first party to send a message, and the right-hand side is always the other party of the conversation. The labeling of predatory (left and right) and non-predatory also makes it easier to process and analyse the data at a later point in time.
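The parsing and labeling step could look roughly like the sketch below. This is not the original script: the XML layout (each file holding a single conversation element with an id attribute and author tags per message), the paths ground_truth.txt and conversations/, and the ground truth format (one predatory author ID per line) are all assumptions.

```python
# Sketch of the parsing and labeling step; element names, paths and the
# ground truth format are assumptions about the PAN-style XML layout.
from pathlib import Path

import pandas as pd
from lxml import etree

# Assumed format: one known predatory author ID per line.
ground_truth = set(Path("ground_truth.txt").read_text().split())

rows = []
for file_number, path in enumerate(sorted(Path("conversations").glob("*.xml"))):
    # Assumed: the root element of each file is a single <conversation>.
    conversation = etree.parse(str(path)).getroot()
    conv_id = conversation.get("id")

    # Collect author IDs in order of first appearance; index 0 is the
    # first person to send a message.
    authors = []
    for author in conversation.iter("author"):
        aid = (author.text or "").strip()
        if aid and aid not in authors:
            authors.append(aid)

    label, subresult = "non-predatory", None
    if authors and authors[0] in ground_truth:
        label, subresult = "predatory", "left"   # first sender is the predator
    elif len(authors) > 1 and authors[1] in ground_truth:
        label, subresult = "predatory", "right"  # second party is the predator

    rows.append({"file": path.name, "file_number": file_number,
                 "conversation_id": conv_id, "label": label,
                 "subresult": subresult})

df = pd.DataFrame(rows)
```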

After processing the 32063 files, the dataset of 4084 files was processed in the same way and the data stored in a new DataFrame. To find which files in the 32063-file dataset the 4084 manually evaluated files correspond to, the conversation IDs stored for each conversation in the two DataFrames were compared. For each match, the filename and file number from the 32063-file dataset were added to a new column for the file in question in the 4084-file dataset.
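Assuming both datasets were parsed into DataFrames as sketched above (here called df_full for the 32063 files and df_manual for the 4084 files, both hypothetical names), this ID matching could be done with a single Pandas merge on the conversation ID:

```python
# Attach the filename and file number from the full dataset to the rows of
# the manually evaluated dataset whose conversation IDs match. Renaming the
# incoming columns keeps them apart from df_manual's own file columns.
matched = df_manual.merge(
    df_full[["conversation_id", "file", "file_number"]].rename(
        columns={"file": "file_full", "file_number": "file_number_full"}
    ),
    on="conversation_id",
    how="left",
)
```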

Next, the processed data was combined with the output data from the experiment in order to see how the different evaluations compared to the ground truth. Initially, the data from the experiment lacked ground truth, holding only the information given by the evaluations. To be able to analyze the data, it was therefore necessary to link it to the ground truth. This was achieved by extending the Python script to import the evaluation data into another DataFrame and match the file names and numbers of the evaluation data against the DataFrame for the 4084-file dataset holding the ground truth information. In addition to a column with the ground truth, another column was added comparing each evaluation to the ground truth, to ease the process of finding evaluations deviating from it.
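A sketch of this linking step, under the same assumptions about column names and with a hypothetical evaluations.csv holding the experiment output:

```python
# Read the exported evaluation data (hypothetical filename and columns).
evaluations = pd.read_csv("evaluations.csv")

# Attach the ground truth label by matching on the filename, then add a
# column flagging evaluations that deviate from the ground truth.
linked = evaluations.merge(
    matched[["file", "label"]].rename(columns={"label": "ground_truth"}),
    on="file",
    how="left",
)
linked["matches_ground_truth"] = linked["evaluation"] == linked["ground_truth"]

# Export the combined data for analysis (cf. Figure 3.6).
linked.to_csv("evaluations_with_ground_truth.csv", index=False)
```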

Finally, the new DataFrame holding the evaluation data combined with the ground truth and the comparison was exported to a CSV file, an excerpt of which is shown in Figure 3.6.

The CSV file was now ready for analysis and all preprocessing was completed.


Figure 3.6: Excerpt from the CSV file generated by the Python script.

The analysis will be described in the next chapter.

Chapter 4

Analysis and Results

This chapter explains the analysis performed on the data collected from the experiment and from the summer intern, and the results obtained from it.

The analysis of the evaluations and the corresponding conversations was performed to get a better understanding of which features make conversations predatory or non-predatory. A machine learning model has a limited ability to understand conversations, as it only recognizes certain types of repeating patterns.

By analysing evaluations and conversations manually, it is possible to discover patterns a machine learning model is not able to find. It is also possible to identify individual features that humans find descriptive and useful for detecting potentially predatory conversations.

When discussing the conversations and evaluations from the different datasets, we will from now on refer to the evaluations and conversations performed by the summer intern as PAN, and those gathered from the data collection experiment as Hybrid.

The prepared CSV file containing the data from the experiment and the ground truths, together with the conversations themselves, formed the basis for the analysis. As a comment was made for each evaluation, the analysis aimed to derive meaningful trends and patterns from the large number of evaluations across many conversations. Because the existing system is designed to evaluate the risk of each message in isolation, it lacks the ability to address more complex features of conversations that could be used to detect potentially predatory conversations. The analysis aims to discover such features.
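As one possible starting point, and assuming the CSV layout sketched in the previous chapter (the filename and column names are assumptions, not the actual analysis script), the evaluations could be grouped per conversation to surface those deviating from the ground truth:

```python
import pandas as pd

# Load the combined evaluations and ground truth (hypothetical filename).
df = pd.read_csv("evaluations_with_ground_truth.csv")

# Aggregate per conversation: number of evaluations and the share that
# agree with the ground truth.
per_conversation = df.groupby("file").agg(
    n_evaluations=("evaluation", "size"),
    agreement_rate=("matches_ground_truth", "mean"),
)

# Conversations where most evaluations deviate from the ground truth are
# natural candidates for closer manual inspection.
deviating = per_conversation[per_conversation["agreement_rate"] < 0.5]
print(deviating.sort_values("agreement_rate").head())
```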