PHISHSIM: Phishing Website Detection
DOI:
https://doi.org/10.47392/IRJAEM.2025.0509
Keywords:
HTML, Webpage, Prototype Extraction, Website Similarity, Compression, dthreshold (Distance Threshold), Quality of Clustering (QC) metric
Abstract
In this paper, we introduce PhishSim, a feature-free approach to detecting phishing websites. PhishSim is built on the Normalized Compression Distance (NCD), a parameter-free similarity measure that compares two websites by compressing them, which removes the time and effort normally spent on feature extraction. We classify a suspicious page by comparing its HTML content against a database of known phishing sites. To keep this database small, we use the Furthest Point First (FPF) algorithm to extract "prototypes", representative examples of clusters of phishing webpages. We also integrate an incremental learning algorithm so that the system keeps adapting as attack methods evolve (concept drift). Tested on a large, realistic dataset, PhishSim significantly outperforms previous methods, achieving an AUC score of 98.68%, a true positive rate (TPR) of about 90%, and a false positive rate (FPR) of 0.58%. Because only prototypes are stored, the system avoids retaining large amounts of historical data, and with a processing time of approximately 0.3 seconds it is practical for real-world deployment.
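The pipeline described in the abstract can be illustrated with a minimal Python sketch. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Several details here are assumptions, not taken from the paper: zlib stands in for whatever compressor the authors use, the function names (ncd, fpf_prototypes, is_phishing) are hypothetical, and the FPF stopping rule (stop once every page is within d_threshold of some prototype) is one plausible reading of the dthreshold keyword. Incremental learning is not shown.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # zlib is an assumed stand-in compressor; the paper may use a different one.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def fpf_prototypes(pages: list[bytes], d_threshold: float) -> list[bytes]:
    """Furthest Point First selection of prototypes (hypothetical helper).

    Starts from the first page, then repeatedly promotes the page that is
    furthest from its nearest prototype. Assumed stopping rule: stop when
    every page lies within d_threshold of some prototype.
    """
    if not pages:
        return []
    prototypes = [pages[0]]
    # Distance from each page to its nearest prototype so far.
    nearest = [ncd(p, pages[0]) for p in pages]
    nearest[0] = 0.0  # a prototype is at distance 0 from itself
    while True:
        i = max(range(len(pages)), key=nearest.__getitem__)
        if nearest[i] <= d_threshold:
            break
        prototypes.append(pages[i])
        nearest = [min(d, ncd(p, pages[i])) for p, d in zip(pages, nearest)]
        nearest[i] = 0.0
    return prototypes


def is_phishing(page: bytes, prototypes: list[bytes], d_threshold: float) -> bool:
    """Flag a page whose nearest phishing prototype is within d_threshold."""
    return min(ncd(page, q) for q in prototypes) <= d_threshold
```

In use, fpf_prototypes would be run once over the HTML of known phishing pages, and is_phishing would then be called on each suspicious page against the resulting prototypes. The value of d_threshold has to be tuned on labeled data; no particular value is implied here.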
License
Copyright (c) 2025 International Research Journal on Advanced Engineering and Management (IRJAEM)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.