New technological developments in the field of multimodal interaction show great promise for the improvement and assessment of public speaking skills. However, it is unclear how the experience of non-native speakers interacting with such technologies differs from that of native speakers. In particular, non-native speakers might benefit less from training with multimodal systems than native speakers do. Additionally, machine learning models for the automatic assessment of public speaking ability that are trained on data from native speakers might not perform well when assessing the performance of non-native speakers. In this paper, we investigate two aspects of the performance and evaluation of multimodal interaction technologies designed for the improvement and assessment of public speaking, comparing a population of native English speakers with a population of non-native English speakers. First, we compare the experiences and training outcomes of these two populations interacting with a virtual audience system designed for training public speaking ability, collecting a dataset of public speaking presentations in the process. Second, using this dataset, we build regression models for predicting public speaking performance on both populations and evaluate these models both on the population they were trained on and on how well they generalize to the other population.
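As a purely illustrative sketch of the second step, the following Python snippet shows one way a regression model trained on one population could be evaluated both in-population and across populations. The paper does not specify the model family or feature set; the ridge regressor, the feature matrices `X_native` and `X_nonnative`, and the rating vectors `y_native` and `y_nonnative` are hypothetical stand-ins, assuming scikit-learn is available.

```python
# Minimal sketch (assumptions noted above): train a regression model on
# one population's presentation features and evaluate it on both.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_native = rng.normal(size=(80, 10))    # placeholder multimodal features
y_native = rng.normal(size=80)          # placeholder performance ratings
X_nonnative = rng.normal(size=(80, 10))
y_nonnative = rng.normal(size=80)

# Hold out part of the native-speaker data for in-population evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_native, y_native, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# In-population performance (trained and tested on native speakers).
print("native -> native R^2:", r2_score(y_test, model.predict(X_test)))

# Cross-population generalization (trained on native speakers,
# tested on non-native speakers).
print("native -> non-native R^2:",
      r2_score(y_nonnative, model.predict(X_nonnative)))
```

The same procedure, with the populations swapped, would yield the complementary non-native-to-native comparison described above.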