Huge traffic from a botnet looking for datasets

Today, I received an e-mail from my Web hosting company indicating that my website had exceeded the content delivery network (CDN) bandwidth limit of my package. I was quite surprised, so I checked the control panel and saw a huge increase in bandwidth usage over the last three days, as shown below (in GB).

Looking at the logs, I saw that bots from thousands of different IP addresses were trying to access datasets from the SPMF website using malformed URLs containing the word “datasets” multiple times. Here is an excerpt from the logs:

These URLs do not exist. However, due to the configuration of the server, they were redirected to the real dataset URLs, thus consuming a huge amount of bandwidth.

Since the requests came from different IPs in dozens of different countries, it would not be realistic to ban all the IP addresses.

Thus, I looked into how to fix the configuration. Finally, I modified the .htaccess file of the server to block malformed requests and to deactivate the server's default fuzzy URL matching, which maps paths that don't exist to real paths on the server. This may have caused some slight issues on the website during the last few hours. But now, I think that the problem is fixed and the website will be faster!
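To illustrate what "malformed" means here, below is a minimal Python sketch of the kind of check that such a blocking rule performs. It is not the actual .htaccess rule. It assumes that a legitimate path contains the segment "datasets" at most once, and the function names and the example paths are hypothetical.

```python
from urllib.parse import urlsplit


def count_segment(url: str, segment: str = "datasets") -> int:
    """Count the path segments of `url` that are exactly equal to `segment`."""
    path = urlsplit(url).path  # ignores any query string or fragment
    return sum(1 for part in path.split("/") if part == segment)


def is_malformed(url: str, segment: str = "datasets", max_allowed: int = 1) -> bool:
    """Assumption: a valid URL contains `segment` at most `max_allowed` times."""
    return count_segment(url, segment) > max_allowed


# Hypothetical request paths, for illustration only.
requests = [
    "/spmf/datasets/husp/BIBLE_sequence_utility.txt",
    "/spmf/datasets/datasets/husp/datasets/husp/BIBLE_sequence_utility.txt",
]
for request in requests:
    print("BLOCK" if is_malformed(request) else "ALLOW", request)
```

A check like this looks only at the shape of the URL and not at who sent the request, so it keeps working even when the requests come from thousands of different IP addresses.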

So why was my website flooded by requests for datasets? I think the most likely reason is that some people have launched a web-scraping botnet to collect data, and that the bot is buggy, such that it recursively adds /datasets/ to the same path dozens of times, as in this URL:

… /spmf/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/costtrans/datasets/onshelf/datasets/husp/datasets/husp/datasets/husp/BIBLE_sequence_utility.txt

Then the botnet would not realize that it is actually downloading the same files over and over again from similar URLs…
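One way such URLs could arise (this is only a guess, as I do not know how the bot actually works) is a crawler that keeps resolving the same relative link against whatever page it last received. If the server answers a nonexistent path with the content of a real page, each step appends one more segment. Here is a small Python sketch of that mechanism, where the host example.org, the relative link "datasets/" and the function name are all hypothetical:

```python
from urllib.parse import urljoin


def simulate_crawl(start_url: str, relative_link: str, steps: int) -> list[str]:
    """Follow the same relative link `steps` times, as a naive crawler might
    if every fetched page contained that link again."""
    urls = [start_url]
    for _ in range(steps):
        # urljoin resolves the relative link against the previous URL
        urls.append(urljoin(urls[-1], relative_link))
    return urls


# Hypothetical starting point and link, for illustration only.
for url in simulate_crawl("http://example.org/spmf/datasets/", "datasets/", 3):
    print(url)
```

Each printed URL contains one more /datasets/ than the previous one, which resembles the pattern seen in the logs.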

Update a few hours later: I see that my new rules in .htaccess are working, as all invalid requests are now blocked:
