Web robots (Joe A. Dellinger)

Newsgroups: comp.risks
X-issue: 17.70
Date: Sun, 4 Feb 96 20:50:03 CST
From: jdellinger@amoco.com (Joe A. Dellinger)
Subject: Risks of web robots

	Here are three risks of "web robots" I've run across recently that
I think Risks readers might find interesting.

1)	The first is probably already well known to Risks readers: password
files accidentally being exported to the world. Web servers are just one
more way of making that mistake.

	Here is a post that has already had wide circulation (and may have
already appeared in Risks... I'm unable to scan back issues to check right now
because of heavy network load):

>Subject: BoS: Misconfigured Web Servers
>
>     A friend of mine showed me a nasty little "trick" over the weekend. He
>     went to a Web Search server (http://www.altavista.digital.com/) and
>     did a search on the following keywords -
>
>             root: 0:0 sync: bin: daemon:
>
>     You get the idea. He copied out several encrypted root passwords from
>     passwd files, launched CrackerJack and a 1/2 MB word file and had a
>     root password in under 30 minutes. All without accessing the site's
>     server, just the index on a web search server!
>
      ....
>
>     The guy that showed me this found it funny, but I find it disturbing.
>     Are there that many sites that are that poorly configured?
>
>     Mark_W_Loveless@smtp.bnr.com

	I just verified that indeed this search does work, although to my
relief the majority of the "hits" found are legitimate documents discussing
UNIX security. The risks are fairly obvious.
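
	A site administrator can run the same kind of check locally before
a robot does it for them. What follows is only a minimal sketch, under my own
assumptions: the function names, the three-line threshold, and the idea of
walking a "document root" directory are illustrative, not anything taken from
the post quoted above. It flags files whose lines have the seven
colon-separated fields of a passwd entry.

```python
import os
import re

# A passwd-style line has seven colon-separated fields:
# name:password:UID:GID:comment:home:shell
PASSWD_LINE = re.compile(r"^[A-Za-z_][\w.-]*:[^:]*:\d+:\d+:[^:]*:[^:]*:[^:]*$")


def looks_like_passwd(text, min_lines=3):
    """Return True if at least min_lines lines of text look like passwd entries."""
    hits = sum(1 for line in text.splitlines() if PASSWD_LINE.match(line.strip()))
    return hits >= min_lines


def find_exposed_passwd_files(doc_root, max_bytes=1_000_000):
    """Walk doc_root (a hypothetical web document root) and list suspicious files."""
    suspicious = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as handle:
                    text = handle.read(max_bytes)
            except OSError:
                continue  # unreadable file; skip it
            if looks_like_passwd(text):
                suspicious.append(path)
    return sorted(suspicious)
```

	Requiring several matching lines is a deliberate choice: a document
that merely discusses UNIX security may quote one entry, while a real passwd
file contains many.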

1')	Here is a variation on the above risk that I HAVEN'T seen discussed
before, however. See what happens if you search AltaVista for THESE keywords:

"unpublished proprietary source code actual intended reserved copyright notice"

	The results of this search are even more frightening, at least to me.

	The general risk is not just that you can conveniently find password
files, but ANY kind of document that shouldn't be widely distributed:
material useful for breaking into your system, copyrighted material, illegal
material, libelous material, incriminating or embarrassing material, etc...
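
	To see why a handful of keywords is enough, here is a rough sketch of
all-words matching. It is my own simplification, not a description of how
AltaVista actually ranks or matches documents, and the HEADER text is invented
for illustration. The point is that boilerplate confidentiality notices use
such a predictable vocabulary that the boilerplate itself becomes the search
key.

```python
import re


def tokenize(text):
    """Lowercase text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_all_keywords(document, query):
    """Return True if every word in query appears somewhere in document."""
    return tokenize(query) <= tokenize(document)


QUERY = "unpublished proprietary source code actual intended reserved copyright notice"

# Hypothetical header text, invented for illustration only.
HEADER = """/*
 * Copyright (c) Example Corp. All rights reserved.
 * This is unpublished proprietary source code of Example Corp.
 * The copyright notice above does not evidence any actual or
 * intended publication of such source code.
 */"""

ARTICLE = "Copyright notice: this article discusses source code licensing."

if __name__ == "__main__":
    print(matches_all_keywords(HEADER, QUERY))
    print(matches_all_keywords(ARTICLE, QUERY))
```

	The same function, pointed at your own exported files with your own
organization's standard notice as the query, is a cheap way to find documents
that were never meant to leave the building.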

2) 	The second risk works the other way: fooling stupid web robots so
as to lure people to your web site.

	A month ago I tried searching for "eisner reciprocity paradox" on
WebCrawler, hoping to find that it had indexed a paper of mine that I had
reprinted electronically under my home page. Nope, it hadn't (or at least
I was unable to find it using any of the likely keywords I could think of!).
Instead the single match was on a URL intriguingly entitled
"The information source".

	Gee, this "information source" must have an article in it about
Eisner's Reciprocity Paradox, one that I hadn't known of before! So I followed
the link, and ended up at something unexpected: "http://www.graviton.com/red/",
"The Red Herring Home Page"! (It comes complete with gifs of red fish!)

	A little experimentation revealed that almost ANY obscure search would
match "The information source", often as the only matching document found.
As near as I could figure out, his site recognized probes by web robots and
then threw a dictionary at them! (His point made, he has since stopped,
although the Red Herring page is still there for your perusal.)

	I contacted the author, Tom White, and asked for more details. He
didn't want to give his secrets away, but did reply:

> I will say that I spent no more than an hour on the whole thing, including
> writing the page, and it was effective far beyond what I thought a silly
> trick like that would muster.  I think that by virtue of not hiding what
> I am trying to do, people who write web indexers may see the page and think
> of ways to subvert feeble attempts like mine - which is a good thing since
> the page could have as easily been any propaganda I wanted to push on people.

	The risk? It can be frustratingly difficult (or impossible) to get a
web robot's attention for a legitimate page you WANT indexed, or to find a
page you know is there amid all the distractions of "false hits". Part of
the clutter may be wildly off-topic pages engineered to fool web robots into
thinking that almost anything matches them. (Or simply long rambling pages
containing lots of poems and such... documents that "fool" the robots more
by accident than design.)
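
	One crude way an indexer might notice such pages is to ask how much
of a reference word list a single page covers. This is a sketch under my own
assumptions: the function names and the 0.5 threshold are hypothetical, and I
have no evidence that any web robot works this way. A page that matches almost
any query will cover a suspiciously large share of the dictionary, whether by
design or by accident.

```python
import re


def vocabulary_coverage(page_text, vocabulary):
    """Fraction of vocabulary words that occur at least once in page_text."""
    vocab = {word.lower() for word in vocabulary}
    if not vocab:
        return 0.0
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return len(words & vocab) / len(vocab)


def looks_stuffed(page_text, vocabulary, threshold=0.5):
    """Flag a page that covers at least threshold of the vocabulary (assumed cutoff)."""
    return vocabulary_coverage(page_text, vocabulary) >= threshold
```

	With a realistic dictionary the threshold would need to be far lower
than 0.5, and a long page of poems would still score high, so this can only
rank pages for a second look rather than decide anything by itself.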

3)	Finally, the act of being searched can cause problems for certain kinds
of sites: ones that carry hundreds of thousands of distinct URLs, often
generated only on demand, and that don't expect any one site to ever have
reason to download ALL of them, whether all at once or a few at a time.

	See for example "http://xxx.lanl.gov/RobotsBeware.html". The authors
state there: "This www server has been under all-too-frequent attack from
`intelligent agents' (a.k.a. `robots') that mindlessly download every link
encountered, ultimately trying to access the entire database through the
listings links. In most cases, these processes are run by well-intentioned but
thoughtless neophytes, ignorant of common sense guidelines."

	They have been forced to take a "proactive" stance to protect
themselves: "We are not willing to play sitting duck to this nonsensical
method of `indexing' information." The rather UNIQUE hot link that follows,
"(Click here to initiate automated `seek-and-destroy' against your site.)",
doesn't actually do anything but pause for 30 seconds, I'm told...
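
	One guideline of this kind is the robots exclusion convention, in
which a site publishes a /robots.txt file naming the paths robots should stay
out of. The sketch below shows a robot consulting such a file before fetching
anything, using Python's standard urllib.robotparser module. The robots.txt
contents, the paths, the host name, and the agent name are all hypothetical,
and "Crawl-delay" is a later, non-standard extension that not every site or
robot honors.

```python
from urllib import robotparser

# Hypothetical robots.txt contents; a real robot would fetch this from the
# site's /robots.txt before requesting anything else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /list/
Disallow: /find/
Crawl-delay: 30
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser, agent, urls, default_delay=10.0):
    """Return (allowed_urls, delay_seconds) for a polite crawl."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay  # assumed fallback when the site states no delay
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, float(delay)
```

	A robot would then sleep for the returned delay between requests and
never touch the disallowed listing links, which is exactly the traffic a site
with hundreds of thousands of on-demand URLs cannot afford.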

	I'll let readers examine the page and draw their own Risks!