Summary This paper describes web cloaking as a set of techniques that a web server uses to identify incoming clients and customize the returned content accordingly. The paper also describes the various cloaking techniques used in black-hat search engine optimization to hide malicious sites. These techniques hinder security web crawlers, preventing them from exposing the harmful content that is served to Internet users [1]. The authors use their findings to develop an anti-cloaking system that detects split-view content returned to two or more distinct browsing profiles, applying the system to sets of search and advertisement URLs drawn from high-risk terms.
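The split-view idea above can be sketched in a few lines. The profiles, similarity metric, and threshold below are illustrative assumptions, not the authors' exact pipeline: the point is only that the same URL fetched under two browsing profiles should yield similar content, and a large divergence is a cloaking signal.

```python
# Minimal sketch of split-view comparison: compare snapshots of the same
# URL fetched under two hypothetical browsing profiles and flag a large
# divergence. The Jaccard metric and 0.5 threshold are illustrative.

def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity over the sets of whitespace-separated tokens."""
    tokens_a, tokens_b = set(doc_a.split()), set(doc_b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def looks_cloaked(view_crawler: str, view_user: str,
                  threshold: float = 0.5) -> bool:
    """Flag a URL when the two views diverge more than the threshold allows."""
    return jaccard_similarity(view_crawler, view_user) < threshold

# Toy example: the "crawler" sees a benign page, the "user" a storefront.
crawler_view = "welcome to our harmless recipe blog with many recipes"
user_view = "cheap replica watches buy now limited offer discount store"
print(looks_cloaked(crawler_view, user_view))  # → True
```

A real pipeline would of course compare page structure and redirect graphs as well as raw text, as the paper describes.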


Problem Solved The authors explain which factors must be taken into account to make the system more accurate. These include identifying the sources of error, i.e., listing the false positives and false negatives; identifying the salient features, i.e., examining the overall accuracy of the system when it is trained on only a single class of similarity metrics; and finding the minimum profile set, i.e., quantifying the trade-off between the anti-cloaking pipeline's performance and its efficiency.

Approach To evaluate their system, the authors train and test a decision tree classifier using 10-fold cross-validation over an imbalanced dataset dominated by non-cloaked URLs. They achieve this by taking a large set of candidate URLs to scan, fetching each of them through multiple browser profiles and network vantage points in order to trigger any cloaking logic, and comparing the content, structure, and redirect graphs associated with each fetch. They then feed these features into a classifier to detect the presence of black-hat cloaking.

Related Work In the paper, the authors refer to earlier work by Wang et al. [2], which relied on a cross-view comparison between search results fetched by a browser configured like a crawler and a second fetch through a browser that imitates a real user [2, 3]. The same approach dominates subsequent cloaking-detection strategies, which fetch pages via multiple browser profiles to examine divergent redirects (including JavaScript, 30X, and meta-refresh redirects) [4, 5, 6], inspect the redirection chains that lead to poisoned search results, isolate variations in content between two fetches (e.g., topics, URLs, page structure) [6, 7], or apply cloaking detection to alternative domains such as spammed forum URLs [8].
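One of the signals surveyed above, divergent redirects across profiles, can be sketched simply. The chains and domains below are hand-made illustrations, not data from any of the cited systems: the idea is that two fetches of the same URL should normally traverse the same hops and land on the same domain.

```python
# Hypothetical sketch of a related-work signal: compare the redirect
# chains observed for the same seed URL under two profiles. Divergent
# hops or landing domains suggest profile-dependent (cloaked) redirection.
from urllib.parse import urlparse

def landing_domain(chain: list[str]) -> str:
    """Hostname of the final URL in a redirect chain."""
    return urlparse(chain[-1]).hostname

def chains_diverge(chain_a: list[str], chain_b: list[str]) -> bool:
    """Flag fetches that traverse different hops or end on different domains."""
    return chain_a != chain_b or landing_domain(chain_a) != landing_domain(chain_b)

crawler_chain = ["http://seed.example/", "http://seed.example/home"]
user_chain = ["http://seed.example/",
              "http://redir.example/r?x=1",
              "http://store.example/deal"]
print(chains_diverge(crawler_chain, user_chain))  # → True
```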

Methodology To detect cloaking reliably, the authors perform multiple steps, beginning with the selection of candidate URLs. They implement their de-cloaking crawler on Google Compute Engine, distributing fetching and featurization across 20 Ubuntu machines and performing classification on a single instance. The authors evaluate their crawler system and explore the overall performance of their classifiers, training and testing decision tree classifiers to measure the accuracy and coverage of the crawl.

Conclusions In this paper, the authors address the cloaking arms race between security crawlers and miscreants seeking to profit from search engines and ad networks via counterfeit storefronts and malicious advertisements. The authors address this gap by developing an anti-cloaking system that covers a spectrum of browser, network, and contextual black-hat targeting techniques, which they use to determine the minimum crawling capabilities required to contend with cloaking today.