Project Developer: Ashish D. Tiwari
Mobile Apps Cloud C#.NET ASP.NET Data Mining BE-Engineering(CO/IT) ME-Engineering(CO/IT) BCS MCS BCA MCA MCM BSC Computer/IT MSC Computer/IT Diploma (CO/IT) IEEE-2016

A Performance Evaluation of Machine Learning-Based Streaming Spam Tweets Detection



The popularity of Twitter attracts more and more spammers. Spammers send unwanted tweets to Twitter users to promote websites or services, which is harmful to normal users. To stop spammers, researchers have proposed a number of mechanisms. Recent work focuses on applying machine learning techniques to Twitter spam detection. However, tweets are retrieved in a streaming way, and Twitter provides the Streaming API for developers and researchers to access public tweets in real time. A performance evaluation of existing machine learning-based streaming spam detection methods has been lacking. In this paper, we bridge the gap by carrying out a performance evaluation from three different aspects: data, feature, and model. A large ground-truth dataset of over 600 million public tweets was created by using a commercial URL-based security tool. For real-time spam detection, we further extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary classification problem in the feature space, which can be solved by conventional machine learning algorithms. We evaluated the impact of different factors on spam detection performance, including the spam to non-spam ratio, feature discretization, training data size, data sampling, time-related data, and machine learning algorithms. The results show that streaming spam tweet detection is still a big challenge and that a robust detection technique should take into account the three aspects of data, feature, and model.
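The effect of the spam to non-spam ratio on a binary classifier can be illustrated with a small sketch. It is not the paper's evaluation code: the helper names `sample_with_ratio` and `detection_metrics`, the 1/0 labeling, and the choice of precision, recall, and F-measure as metrics are assumptions made here for illustration.

```python
import random


def sample_with_ratio(spam, non_spam, non_spam_per_spam, n_spam, seed=0):
    """Build a labeled sample with a fixed spam to non-spam ratio.

    Returns a shuffled list of (item, label) pairs, where label 1 means
    spam and label 0 means non-spam. Hypothetical helper, for illustration.
    """
    rng = random.Random(seed)
    chosen_spam = rng.sample(spam, n_spam)
    chosen_non_spam = rng.sample(non_spam, n_spam * non_spam_per_spam)
    data = [(x, 1) for x in chosen_spam] + [(x, 0) for x in chosen_non_spam]
    rng.shuffle(data)
    return data


def detection_metrics(y_true, y_pred):
    """Precision, recall, and F-measure for the spam class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}
```

Evaluating the same classifier on a 1:1 sample and on a heavily imbalanced sample (for example `non_spam_per_spam=19`) makes the bias from imbalanced data visible: with a fixed false positive rate, the number of false positives grows with the non-spam count, so precision on the spam class drops.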

Proposed System 

Consequently, the research community, as well as Twitter itself, has proposed spam detection schemes to make Twitter a spam-free platform. For instance, Twitter applies “Twitter rules” to suspend accounts that behave abnormally. Accounts that frequently request to be friends with others, send duplicate content, mention other users, or post URL-only content will be suspended by Twitter. Twitter users can also report a spammer to the official @spam account. To detect spam automatically, researchers have applied machine learning algorithms, treating spam detection as a classification problem. Most of these works classify a user as a spammer or not by relying on features that need historical information about the user or the existing social graph. For example, the feature “the fraction of tweets of the user containing URL” must be retrieved from the user’s tweet list; features such as “average neighbors’ tweets” and “distance” cannot be extracted without the built social graph. However, Twitter data take the form of a stream, and tweets arrive at very high speed. Although these methods are effective in detecting Twitter spam, they are not applicable to detecting streaming spam tweets, as each streaming tweet does not contain the historical information or social graph needed for detection.
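To make the contrast concrete, the sketch below computes features from the text of a single tweet, with no tweet history and no social graph. The text above does not list the 12 lightweight features, so the five features, their names, and the regular expressions here are illustrative assumptions rather than the paper's feature set.

```python
import re

# Simple patterns for illustration; real tweets may need stricter parsing.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def tweet_text_features(text):
    """Return per-tweet features that need no history or social graph."""
    return {
        "n_chars": len(text),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_urls": len(URL_RE.findall(text)),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
    }


features = tweet_text_features(
    "Win 2 free phones @bob #deal http://example.com/x1"
)
```

Because every value is derived from the arriving tweet alone, features of this kind can be computed at stream speed, which is the property that history-based and graph-based features lack.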


In this paper, we provide a fundamental evaluation of ML algorithms on the detection of streaming spam tweets. First, we found that classifiers’ ability to detect Twitter spam is reduced in a near real-world scenario, since the imbalanced data brings bias. We also identified that feature discretization is an important preprocessing step for ML-based spam detection. Second, increasing the training data alone brings no further benefit to Twitter spam detection beyond a certain number of training samples. We should try to bring in more discriminative features or a better model to further improve the spam detection rate. Third, classifiers can detect more spam tweets when the tweets were [...]. From the third point, we thoroughly analyzed, from three points of view, why classifiers’ performance is reduced when the training and testing data are from different days. We conclude that the performance decreases because the distribution of features changes in later days’ datasets, whereas the distribution of the training dataset stays the same. This problem will exist in streaming spam tweet detection, as new tweets arrive as a stream but the training dataset is not updated.
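One way to quantify such a change in feature distribution is to discretize a feature into bins and compare the binned distributions of the training day and a later day. The sketch below uses equal-width binning and the Kullback-Leibler divergence; both choices, as well as the function names and the add-one smoothing, are assumptions made here, and the text above does not state which discretization or distance measure the evaluation used.

```python
import math


def equal_width_bins(values, n_bins, lo, hi):
    """Count values per equal-width bin over the range [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # clamp so v == hi lands in the last bin
        counts[idx] += 1
    return counts


def to_distribution(counts, smoothing=1.0):
    """Turn bin counts into probabilities; smoothing avoids zero bins."""
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def feature_drift(train_values, later_values, n_bins=10):
    """Divergence of a later day's feature distribution from the training day's."""
    lo = min(min(train_values), min(later_values))
    hi = max(max(train_values), max(later_values))
    if hi == lo:
        return 0.0  # every value is identical, so there is nothing to compare
    later = to_distribution(equal_width_bins(later_values, n_bins, lo, hi))
    train = to_distribution(equal_width_bins(train_values, n_bins, lo, hi))
    return kl_divergence(later, train)
```

A drift value that keeps growing for later days would be a signal that a classifier trained on the first day is being applied to data it no longer represents, and that the training dataset needs updating.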
