Natural Language Processing with machine learning for anomaly detection on system call logs
- Authors: Goosen, Christo
- Date: 2023-10-13
- Subjects: Natural language processing (Computer science) , Machine learning , Information security , Anomaly detection (Computer security) , Host-based intrusion detection system
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/424699 , vital:72176
- Description: Host intrusion detection systems and machine learning have been studied for many years, especially on datasets like KDD99. Current research and systems are focused on low training and processing complex problems such as system call returns, which lack the system call arguments and potential traces of exploits run against a system. With respect to malware and vulnerabilities, signatures are relied upon, and the potential for natural language processing of the resulting logs and system call traces needs further experimentation. This research looks at unstructured raw system call traces from x86_64 GNU/Linux operating systems with natural language processing and supervised and unsupervised machine learning techniques to identify current and unseen threats. The research explores whether these tools are within the skill set of information security professionals, or require data science professionals. The research makes use of an academic and modern system call dataset from Leipzig University and applies two machine learning models based on decision trees. Random Forest as the supervised algorithm is compared to the unsupervised Isolation Forest algorithm for this research, with each experiment repeated after hyper-parameter tuning. The research finds conclusive evidence that the Isolation Forest Tree algorithm is effective, when paired with a Principal Component Analysis, in identifying anomalies in the modern Leipzig Intrusion Detection Data Set (LID-DS) dataset combined with samples of executed malware from the Virus Total Academic dataset. The base or default model parameters produce sub-optimal results, whereas using a hyper-parameter tuning technique increases the accuracy to within promising levels for anomaly and potential zero-day detection. , Thesis (MSc) -- Faculty of Science, Computer Science, 2023
- Full Text:
- Date Issued: 2023-10-13
Property price prediction: a model utilising sentiment analysis
- Authors: Botes, Rhys Cameron
- Date: 2019
- Subjects: Natural language processing (Computer science) , Computational linguistics , Text processing (Computer science) , Social networks
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10948/37117 , vital:34119
- Description: The increase in the use of social media has led many researchers and companies to investigate the potential uses of the data that is generated by these social media platforms. This research study investigates how the use of sentiment variables, obtained from the social media platform Twitter, can be used to augment housing transfer data in order to develop a predictive model. The Design Science Research (DSR) methodology was followed, guided by a Social Media Framework. Experimentation was required within the Design Cycle of the DSR methodology, which led to the adoption of the Experimental Research methodology within this cycle. An initial literature review identified regression models for property price prediction. Through experimentation, Gradient Boosting regression was identified as an optimal regression model for this purpose. Thereafter a review of sentiment analysis models was conducted, which resulted in the proposal of a CNN-LSTM model for the classification of Tweets. Initial experimentation conducted with this proposed model resulted in an accuracy comparable to the top performing sentiment analysis models identified. A dataset obtained through SemEval, a series of evaluations of computational semantic analysis systems, was used for this phase. For the final experimentation, the CNN-LSTM model was used to obtain sentiment variables from Tweets that were collected from the Western Cape Province in 2017. This property dataset was augmented with the sentiment variables, after which experimentation was conducted by applying Gradient Boosting regression. The augmentation was done in two ways, either based on suburb pertaining to the property, or to the month in which the property was transferred. The results indicate that a model for Property Price Prediction Utilising Sentiment Analysis demonstrates a small improvement when suburb-based sentiment, obtained from Tweets with a minimum threshold per suburb, is utilised. An important finding was the fact that, when geo-coordinates are removed from the dataset, the sentiment variables replace them in the regression results, producing the same level of accuracy as when the coordinates are included.
- Full Text:
- Date Issued: 2019