~ The amazing flying wizards ~

Version November 2002

The amazing flying wizards 
 (How a bunch of leet seekers enters databases)
by VVAA (Various Authors)

(edited by fravia+)

Great stuff for seekers, truncation galore, read and take note of the many tricks...

Everything began with a simple and interesting question...
site specific searches/filenames ?

the-scientist.com publishes some articles that are "hot":

Data derived from the Science Watch/ Hot Papers database and the Web of Science (ISI, Philadelphia) [these are paid services] show that Hot Papers are cited 50 to 100 times more often than the average paper of the same type and age.

The URL for one of them is

How could we search for a list of "hot" articles in that site (because we / cannot / will not / should not / pay for the original service from Science Watch)?

I tried in Google:
site:www.the-scientist.com inurl:hot
but it fails, I think because hot is not in the domain but in the filename

digital gaucho
You will admit that this is an interesting question for a seeker. A typical case of 'closed database', where information is hoarded instead of being freely spread. But the fundamental law of the web is on the side of the spreaders, and very strong web-winds blow against the commercial hoarders...
Re: site specific searches/filenames ?

I think your problem is truncation... I only know two main search engines that perform this kind of search: altavista and hotbot

Here are the queries you can use:

(host:the-scientist.com OR host:www.the-scientist.com) AND url:hot_** on altavista

+"the-scientist.com" +domain:com +hot_* on hotbot

the second query brings the following url: http://www.the-scientist.com/hotpapersarchive.htm, maybe that's what you want.


Truncation! How important! Moreover, those among you that have visited the first link given above by Nemo will have noticed some URLs: www.the-scientist.com/yr2000/jan/hot_000110.html, www.the-scientist.com/yr2000/feb/hot_000221.html, www.the-scientist.com/yr2002/mar/hot_020318.html, etc... mmmm :-)
Re: Re: Re: site specific searches/filenames ?

well... here's a couple of my tricks (guess that's what you'd call them :)

archive.org of course will give you lots of info ... however even with many url names (including nemo's above) you still get delivered TO the registration page

(I am unclear as to the problem though --- it seems to indicate it is a free reg (?) and then u can proceed to read free of charge ?) that's what i read

I didn't register so am not clear or sure about this (butt you indicate that you have to pay)?

butt to a trick that works sometimes ... sometimes...

lets take Your posted url: http://www.the-scientist.com/yr2002/oct/hot_021028.html

now lets feed it to google ...

now lets click on googles cache of (your url... or any url at the site that you have brought up in the search engine that produces the doc that you want to read)

google's cache whisks you off to their Registration page ...
because somewhere it is reading an if and or else statement telling it you are a bad guy
could it be in googles cache?
yup ...

so, back space now back to the google page
and click again on the cache link
now VERY QUICKLY before it can transfer you to the registration page HIT YOUR STOP BUTTON (if you are not quick enough u may have to try several times...)

you will be staring at a blank googles cache page...
butt lets not stop there ... lets look at view source... [i tried pasting the source here but it didn't want to work correctly] ... make a copy of your 'source' and put it in an editor and view the html page :) :) works fine

not only is the code for transferring you to the bad guy page there (i guess --- im not a coder)... butt WhalaH --- also THERE in the source you will see the article that you wanted to read :) :)

what i find even more interesting in this little project is the-scientist's robots.txt
User-agent: *    # applies to all robots
Disallow: /surveys
Disallow: /webreports
Disallow: /eugene_garfield
now why in the world would they block out a specific name???
google shows some rather nice returns for that name :) [although i don't have time to figure out if he is listing and giving the articles essays away for free at his sites --- the pdf files seem to work --- or why his name is disallowed]
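
A minimal sketch of the robots.txt check, for seekers who prefer to let a script read the rules. It uses Python's standard urllib.robotparser on the three rules quoted above and nothing else; the helper name blocked_paths is made up for this illustration, and the real file may of course have contained more lines.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the post; nothing else about the real file is assumed.
ROBOTS_TXT = """\
User-agent: *    # applies to all robots
Disallow: /surveys
Disallow: /webreports
Disallow: /eugene_garfield
"""


def blocked_paths(robots_txt, urls, agent="*"):
    """Return the URLs that a rule-abiding robot would not be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://www.the-scientist.com/eugene_garfield/",
        "http://www.the-scientist.com/yr2002/oct/hot_021028.html",
    ]
    for url in blocked_paths(ROBOTS_TXT, candidates):
        print("hidden from robots:", url)
```

A Disallow line only asks polite robots to stay away; it says nothing about what a human with a browser can reach, which is exactly why such lines are worth reading.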

??? ohoooooooooo well geeeeeeeeeeesh -- you know the clocks were rolled BACK yesterday and its only 11:20 here --- but really it should be 12:20 and by 12:20 I have had at least one beer so i guess my overactive mind isn't slowed down enough because of the lack of beeeeeeeeeeeeeeeeeeeeeeeeer ... ok all it means is this i guess --- http://www.the-scientist.com/eugene_garfield/

so lets re-evaluate
if you try nemo's page with your url on it ... and click it it brings you to the registration or login page ... they want an email address to let you in --- yes?

don't u suppose that someone who works there has an email address that lets him in???

lets ask google:

well sonofabeehive ... google lists a number of members
lets snatch the very first one's email address and try it

lets disguise this email address a little here, so that the harvesting bots and other nasties won't index it:
at the-scientist
dot com

now let's paste that above (corrected of course, with @ and everything) into the email addy field on the login page ... annnnnnnnnnnd

what was that quote?? oh yes: "That's funny ..." (Isaac Asimov)

sunofabee there's your article with pictures ...

(DO NOT --- mess with the guys account info should you try this --- thanks)



He did it! Among much other useful information in the snippet above (like checking the robots.txt), jeff shows you an incredibly powerful access shortcut for whenever someone dares to stop seekers by asking for a registered email address: they want an email address? Let's give them proper email addresses a-plenty! :-)
Re: Re: Re: Re: Re: on second thought

Haha, jeff, you rock! (again and as always)

My 2c: since the redirection is handled by javascript, you can just disable js in opera or bypass it with a proxomitron filter

The voice of reason! The redirect takes for granted that visitors have a javascript-enabled browser! Indeed nowadays we all too often take for granted that everyone uses a javascript-enabled (or flash-enabled) browser by default. Try browsing the wide web with lynx (see the tools page) and be prepared for some surprises! :-)
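
A minimal sketch of why this works, assuming (as mor says) that the redirect lives in a script block: anything that reads the page without running scripts simply finds the article sitting in the HTML. The SAVED_SOURCE string below is a hypothetical stand-in for a saved "view source" copy, not the real page, and the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def readable_text(html_source):
    """Return the text of a page as a script-less browser would show it."""
    parser = TextWithoutScripts()
    parser.feed(html_source)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical saved source: a script redirect followed by the article itself.
SAVED_SOURCE = """<html><head>
<script>window.location = "/registration.html";</script>
</head><body><h1>Hot Papers</h1><p>The article text is still here.</p></body></html>"""

if __name__ == "__main__":
    print(readable_text(SAVED_SOURCE))
```

The parser never executes the script, so the redirect is just more text to throw away. Lynx, a javascript-disabled browser and a proxomitron filter all come to the same thing.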
Re: Re: Re: Re: Re: Re: on second thought and third thought

hiya mor!!!
(I knew one of you js guys would know what to do!! :) :)

on third thought, as i was driving to the store and re-thinking, i feel i did something wrong in this thread

digital was trying very hard to understand google and proceed at abcdefg

I jumped him all the way to google - xyz

i should not have done that ... because his specific efforts and google-questions were not really answered by my tricks

i apologize digital ... please proceed with your questions

i just finally figured out what you meant
turn my javascript, in netscape, OFF
and then click on googles cache
oh yeeees ... so much easier :) thanks!
you rock! :)

It is true that he went to google "xyz", even if we did learn a lot from his digression :-)
Re: Re: Re: Re: Re: Re: Re: on second thought and third thought

jeff, your post was excellent, as was Nemo's, which showed how to master booleans on the engines capable of them (not google unfortunately).

strangely, your findings were due to a misunderstanding: the URL is freely accessible directly, I wonder why google's cache redirects to a sign-in page! will have to check that. I use proxomitron so there are no automatic redirects :-)

but sure I will apply your steps in the future, thanks
digital gaucho

The registration policy has probably changed over time, and as Nemo pointed out: "it's not so much a question of booleans, because since late 2000 you can do that on google as well: Boolean Searching on Google. You can even use the operator OR inside phrases: "advertising OR advertisement statistics". It's a problem of truncation. You can use * on google, but it doesn't work in the same way... on google it replaces an entire word, for example: "ad augusta * angusta".
On altavista you can use * or **, here is an explanation of how they work: truncation on altavista.
You can read some more information about * on google here".
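
The difference between the two kinds of * can be sketched with regular expressions. This is only an illustration of the two ideas Nemo describes (a * that stands for one whole word inside a phrase, and a * that truncates a word), not a model of how any engine actually implements them; the function names are made up, and any engine-specific limits on how many characters a truncation may cover are not modelled.

```python
import re


def word_wildcard_regex(phrase):
    """'*' stands for exactly one whole word inside a phrase."""
    parts = [r"\S+" if token == "*" else re.escape(token) for token in phrase.split()]
    # (?<!\S) and (?!\S) make the phrase start and end on word boundaries.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)", re.IGNORECASE)


def truncation_regex(term):
    """'*' (or '**') stands for any run of non-space characters."""
    pieces = [re.escape(piece) for piece in re.split(r"\*+", term)]
    return re.compile(r"\S*".join(pieces), re.IGNORECASE)


if __name__ == "__main__":
    phrase = word_wildcard_regex("ad augusta * angusta")
    print(bool(phrase.search("ad augusta per angusta")))   # one word fills the gap
    print(bool(phrase.search("ad augusta angusta")))       # no word in the gap

    stem = truncation_regex("advertis*")
    print(bool(stem.fullmatch("advertising")), bool(stem.fullmatch("advertisement")))

    name = truncation_regex("hot_**")
    print(bool(name.search("www.the-scientist.com/yr2002/oct/hot_021028.html")))
```

With whole-word replacement you can leave a hole in a phrase, but you cannot ask for "everything that begins with hot_", which is what digital gaucho's question needed.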

Re: Re: Re: site specific searches/filenames ?

>I wonder why Google does not allow wildcards, it is true that most often you
>get the results, but the * is very useful sometimes.

Yes, truncation is very useful sometimes... the best search engine for truncation was northernlight where you could use queries like this one: +"*lempicka*.jpg" hehehe... here is the url:

>Lets see if I understand your logic:
>(host:the-scientist.com OR host:www.the-scientist.com) AND url:hot_** on altavista
>is the OR here just to get all the domains if the URL changes?
>won't host:the-scientist.com catch all?

You're right host:www.the-scientist.com is contained in host:the-scientist.com, but I joined the two just in case...

>+"the-scientist.com" +domain:com +hot_* on hotbot
>why the +domain:com here?

Because you said that the pages are in a commercial site... and you showed a working page...

the long way

looked at the example above and ran
url: www.the-scientist.com/yr2002/jan/hot**
at AV and shifted the months to get the following
seems to stop there so that may be the last time AV spidered it.
did a little induction (deduction? guessing?) and came up with
checked a few, seem to be good. except for the ones not out yet.


Once again, the power of guessing...hehe :-)
Let's just imagine that there exist on the web sites similar to the-scientist that are not as open and freely accessible as this one is, and that require a fee in order to access information (yes, it happens, alas). Let's imagine that they too have a subdirectory structure similar to the one explained above. Well, you could do the same, and 'guess' them. See, for an example, the bottom part of my flange page.
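
The guessing can be sketched in a few lines. The pattern yrYYYY/mon/hot_YYMMDD.html is read off the four URLs quoted in this thread; everything else here is an assumption: that the other months use the same three-letter lowercase abbreviations, and that a weekly step is a sensible grid (the four quoted dates all fall on Mondays, but the real publishing rhythm may well be sparser). The function names are made up, and the script only prints candidates, it fetches nothing.

```python
from datetime import date, timedelta

BASE = "http://www.the-scientist.com"  # host taken from the URLs quoted in the thread

# Assumption: only jan, feb, mar and oct appear in the thread; the rest are guessed.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def hot_url(day):
    """Build a URL following the yrYYYY/mon/hot_YYMMDD.html pattern."""
    return f"{BASE}/yr{day.year}/{MONTHS[day.month - 1]}/hot_{day:%y%m%d}.html"


def candidate_urls(first, last, step_days=7):
    """Yield one candidate URL per step, from first to last inclusive."""
    day = first
    while day <= last:
        yield hot_url(day)
        day += timedelta(days=step_days)


if __name__ == "__main__":
    # Walk the Mondays between two dates that are known to exist.
    for url in candidate_urls(date(2002, 3, 18), date(2002, 10, 28)):
        print(url)
```

Most of the guesses will miss, and that is fine: checking a short list by hand (or feeding it to your favourite engine) is still far quicker than waiting for a spider to come back.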


(c) III Millennium: [fravia+], all rights reserved