Challenges and Opportunities for Generative Methods in the Cyber Domain
USMA Research Unit Affiliation
Army Cyber Institute
Large, high quality data sets are essential for training machine learning models to perform their tasks accurately. The lack of such training data has constrained machine learning research in the cyber domain. This work explores how Markov Chain Monte Carlo (MCMC) methods can be used for realistic synthetic data generation and compares it to several existing generative machine learning techniques. The performance of MCMC is compared to generative adversarial network (GAN) and variational autoencoder (VAE) methods to estimate the joint probability distribution of network intrusion detection system data. A statistical analysis of the synthetically generated cyber data determines the goodness of fit, aiming to improve cyber threat detection. The experimental results suggest that the data generated from MCMC fits the true distribution approximately as well as the data generated from GAN and VAE; however, the MCMC requires a significantly longer training period and is unproven for higher dimensional cyber data.
Record links to items hosted by external providers may require fee for full-text.