PHISHSIM: Phishing Website Detection

Authors

  • Nivyashree R, Asst. Professor, Department of Computer Science & Engineering, Malnad College of Engineering
  • Bhoomika S R, Department of Computer Science & Engineering, Malnad College of Engineering
  • Chiranth H M, Department of Computer Science & Engineering, Malnad College of Engineering
  • Bhavan N Gowda, Department of Computer Science & Engineering, Malnad College of Engineering
  • Chakram Janya H U, Department of Computer Science & Engineering, Malnad College of Engineering

DOI:

https://doi.org/10.47392/IRJAEM.2025.0509

Keywords:

HTML, Webpage, Prototype Extraction, Website Similarity, Compression, dthreshold (Distance Threshold), Quality of Clustering (QC) metric

Abstract

In this paper, we introduce a feature-free approach to detecting phishing websites. Our method, PhishSim, uses the Normalized Compression Distance (NCD), a technique that requires no specialized parameters. NCD estimates how similar two websites are by compressing them, which removes the time and effort usually spent on feature extraction. We classify a suspicious page by comparing its HTML content against a database of known phishing sites. To keep the database small, we use the Furthest Point First (FPF) algorithm to extract "prototypes", which are representative examples of clusters of phishing webpages. We also integrate an incremental learning algorithm so that the system adapts continuously and detection remains effective as attack methods evolve (concept drift). Tested on a large, realistic dataset, PhishSim significantly outperforms previous methods, achieving an AUC of 98.68%, a true positive rate (TPR) of about 90%, and a false positive rate (FPR) of 0.58%. Because it stores prototypes rather than large amounts of historical data, the system is practical for real-world deployment, with a processing time of approximately 0.3 seconds.
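
The pipeline described in the abstract (NCD similarity, FPF prototype extraction, comparison against prototypes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The NCD formula is the standard one, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Everything else is an assumption made for illustration: the LZMA compressor, the function names, the rule "flag a page if it lies within the distance threshold of any prototype", the threshold value 0.5, and the synthetic pages used as data. Incremental learning is not sketched.

```python
import hashlib
import lzma
from typing import List, Sequence


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed input (compressor choice is an assumption)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def fpf_prototypes(pages: Sequence[bytes], d_threshold: float) -> List[int]:
    """Greedy Furthest Point First: return indices of the pages chosen as prototypes.

    Stops once every page lies within d_threshold of some prototype.
    """
    if not pages:
        return []
    chosen = [0]  # arbitrary starting prototype
    nearest = [ncd(pages[0], p) for p in pages]  # distance to closest prototype
    nearest[0] = 0.0
    while True:
        far = max(range(len(pages)), key=nearest.__getitem__)
        if nearest[far] <= d_threshold:
            return chosen
        chosen.append(far)
        for i, p in enumerate(pages):
            nearest[i] = min(nearest[i], ncd(pages[far], p))
        nearest[far] = 0.0  # a prototype is never selected twice


def looks_like_phishing(page: bytes, prototypes: Sequence[bytes], d_threshold: float) -> bool:
    """Hypothetical decision rule: flag the page if any prototype is close enough."""
    return any(ncd(page, proto) <= d_threshold for proto in prototypes)


def synthetic_page(seed: str, n_tokens: int = 100) -> bytes:
    """Deterministic stand-in for an HTML page; not real phishing data."""
    body = "".join(
        hashlib.sha256(f"{seed}-{i}".encode()).hexdigest() for i in range(n_tokens)
    )
    return f"<html><body>{body}</body></html>".encode()


D_THRESHOLD = 0.5  # illustrative value, not taken from the paper

kit_a = synthetic_page("kit-a")
kit_a_variant = kit_a.replace(b"<body>", b"<body><p>hello</p>")  # near-duplicate of kit_a
kit_b = synthetic_page("kit-b")
benign = synthetic_page("benign")

known_phishing = [kit_a, kit_a_variant, kit_b]
proto_idx = fpf_prototypes(known_phishing, D_THRESHOLD)
prototypes = [known_phishing[i] for i in proto_idx]
```

In this toy setup the near-duplicate page is absorbed by the prototype of its own cluster, so only one representative per cluster needs to be stored, which is the storage saving the abstract attributes to prototypes. A new page is then checked with `looks_like_phishing(page, prototypes, D_THRESHOLD)`.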

Published

2025-11-27