How to search the web, by fravia+ (Paris, Ecole polytechnique)

fravia @ Paris
Ecole polytechnique, 6 February 2001

The main search engines: Altavista quirks

How search engines work and how the commercial clowns spam them

Since the invention of the "Altavista-type" of text indexing, search techniques have become COMPLETELY different from what they were before. As you probably already know, there are many "main" search engines. In this workshop I'll limit my (relatively) "in-depth" coverage to Altavista, but many other main engines use the same approach: Alltheweb/Fast, for instance, a search engine that recently claimed to have the maximum 'coverage' of the web, whatever that is supposed to mean coming from people who indexed whole databases of millions of -say- "galactic images" just to be the first to pass the 'one billion' target in their race with Google...

But index à la Altavista they do, indeed. You must understand correctly what this kind of word-for-word indexing meant (and still means).
The COMPLETE text of hundreds of millions of pages, including the positions of commas, has been indexed. This means that if you input in -say- Alltheweb/Fast the relatively long phrase "These are pages that the Nucleus points to, and that may (or may not) point back to the Nucleus." you'll immediately find the page where I began speaking about such matters (it was during a workshop in Milan last October)... and, even more interesting, using the same search, you'll soon find the minutes of this very workshop of mine, here in Paris today, PROVIDED they are published with EXACTLY the same sequence of terms.

As you can and will see, the longer and more exact the phrase you give as input, the more precise the 'magic' that pulls exactly the page you wanted out of the huge dark hat containing millions of indexed sites. As you can imagine, this makes it on the one hand extremely easy to search for copycatted texts ('plagiarisms') around the web. On the other hand, knowing how to search effectively could easily allow any one of you to find an almost ready-made text to -ahem- "solve brilliantly" whatever assignment you are bound to hand in tomorrow. So even plagiarism -like everything in life- may have two faces, eh :-)
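A minimal sketch of why longer exact phrases narrow the results so quickly. It is only a toy: the page names and texts are made up for illustration, and punctuation is ignored here, whereas the real engines are said above to index even the commas.

```python
import re

def tokens(text):
    """Split text into lowercase word tokens (punctuation is dropped in this toy version)."""
    return re.findall(r"\w+", text.lower())

def pages_with_phrase(pages, phrase):
    """Return the names of the pages that contain the phrase as a contiguous word sequence."""
    wanted = tokens(phrase)
    n = len(wanted)
    hits = []
    for name, text in pages.items():
        words = tokens(text)
        # Slide a window of n words over the page and compare it with the phrase.
        if any(words[i:i + n] == wanted for i in range(len(words) - n + 1)):
            hits.append(name)
    return hits

# Hypothetical pages, invented for this example.
pages = {
    "milan.htm": "These are pages that the Nucleus points to, and that may (or may not) point back to the Nucleus.",
    "other.htm": "These are pages about the nucleus of a comet.",
    "copy.htm": "As someone wrote: these are pages that the Nucleus points to, and that may (or may not) point back to the Nucleus.",
}

print(pages_with_phrase(pages, "these are pages"))
print(pages_with_phrase(pages, "these are pages that the Nucleus points to"))
```

The short phrase matches all three toy pages; the longer one leaves only the original and its copy, which is exactly the plagiarism-hunting effect described above.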

The (brilliant) idea behind Altavista was to create a full-text search of the entire internet. This was an incredible step. The main search engines today are Google (with Topclick), Fast/Alltheweb, Altavista (with Raging) and Hotbot.
Google went even further: they have made a cached copy of every page they have indexed. You can imagine how important this can be when a page you're stalking has disappeared or been pulled down.

That said, if I were you, I wouldn't limit myself to one or just a few engines: Northernlight with its folders and Webtop with its new algos are, for instance, very useful TOOLS that you'll have to use during some delicate queries...

Let's have a quick look at the sancta sanctorum of a search engine. For any search engine, ergo for Altavista as well, it is critically important to assign location values properly when indexing sites. Inside Altavista the assigning of locations is fully automatic in the simplest case... where a function called avs_addword does all the work.
In this case the words of the document are laid out end to end and are numbered sequentially starting with the value returned by other functions (avs_newdoc or avs_startdoc). The same is true for field boundaries and for values (indexed quantities like dates that can be range-searched). The following diagram shows how two very short documents would be stored inside altavista's index database.

document 1
word      here  you  have  a  short  page
location  1     2    3     4  5      6

document 2
word      Thisnotwithstanding  here  you  have  another  short  page
          thisnotwithstanding
location  7                    8     9     10    11       12     13

As the figure illustrates, each word is actually stored as a word-location pair. The index also contains information about the starting and ending locations of each document. Document1 starts at location 1, and Document2 starts at location 7. In Document2, the first word contains an uppercase letter, so the word is indexed twice: once with case preserved and once in all lowercase. Both versions of the word are at the same location, so that the word would be found appropriately regardless of whether a query is case sensitive or case-insensitive.
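The layout just described can be sketched as a toy positional index. This is not the Altavista code: the class and method names are hypothetical and only mimic what the text says (words laid end to end, locations numbered sequentially across documents, a capitalised word stored twice at the same location).

```python
from collections import defaultdict

class ToyIndex:
    """Toy word-location index; names and structure are assumptions for illustration."""

    def __init__(self):
        self.postings = defaultdict(list)  # word -> list of locations
        self.doc_starts = []               # (document name, first location)
        self.next_location = 1

    def add_document(self, name, text):
        # Remember where this document begins in the global numbering.
        self.doc_starts.append((name, self.next_location))
        for word in text.split():
            self.postings[word].append(self.next_location)
            lowered = word.lower()
            if lowered != word:
                # A word with uppercase letters is also stored in lowercase,
                # at the SAME location.
                self.postings[lowered].append(self.next_location)
            self.next_location += 1

index = ToyIndex()
index.add_document("document 1", "here you have a short page")
index.add_document("document 2", "Thisnotwithstanding here you have another short page")

print(index.doc_starts)
print(index.postings["here"])
print(index.postings["Thisnotwithstanding"], index.postings["thisnotwithstanding"])
```

With the two short documents of the figure, "here" ends up at locations 1 and 8, and both spellings of the first word of document 2 sit at location 7.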

The words are added sequentially; the actual update to the index is made, using avs_makestable, every so many documents, or when the last document of a linked bunch has been processed.

Thus Altavista keeps separate indexes for case-insensitive and case-sensitive queries, to allow both types. A search for Paris would therefore give you at the moment more or less 3,310,210 pages as "Paris" but more or less 3,598,040 pages as "paris", which covers all possible uppercase/lowercase occurrences. And yes, before you ask: a search for "PAris" would all the same give you more or less 614 pages... 'mistake' searches are btw at times quite useful for stalking purposes... but that is another story.
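The matching rule behind those numbers can be sketched as follows. The function name and the tiny word list are assumptions for illustration; the rule itself (an all-lowercase query matches any case, a query with an uppercase letter matches only that exact spelling) is the one described in this workshop.

```python
def matches(query, word):
    """An all-lowercase query ignores case; any uppercase letter forces an exact match."""
    if query == query.lower():
        return word.lower() == query
    return word == query

# Hypothetical occurrences found on some pages.
words = ["Paris", "paris", "PARIS", "PAris", "Paris"]

def count(query):
    """Count how many occurrences a query would retrieve."""
    return sum(matches(query, w) for w in words)

print(count("paris"), count("Paris"), count("PAris"))
```

The lowercase query always retrieves at least as much as any capitalised variant, which is why "paris" gives the biggest count.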

Why did I say "more or less"? Because the same searches on Altavista will give you DIFFERENT results depending on a series of parameters: which of the many Altavista servers you have locked onto, and the time of day (when the United States' populace is awake and cheerfully browsing, everything is slower on all backbones and the servers of the main search engines prefer to 'cut' quite brutally the number of answers reported to anyone querying). Note also that even if your query has MORE than 1000 results, Altavista will only give you the first 1000 hits.

Finally take account of the fact that many 'first positions' on any search-query are being SOLD for money, or are being spammed by the many clowns that spend their time trying to outsmart the anti-spamming algorithms of all main search engines.
In Altavista words are indexed by their location AND their placement on the page (it makes a helluva difference, for ranking purposes, whether the key words are in the title, in the abstract, in the description or in the text).
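A toy illustration of placement-dependent ranking. The field weights below are pure assumptions, chosen only to show the principle; the real weights are exactly what the engines keep secret and the spammers try to reverse.

```python
# Hypothetical weights: a keyword in the title counts more than one in the body text.
FIELD_WEIGHTS = {"title": 5.0, "description": 3.0, "text": 1.0}

def score(page, keyword):
    """Sum the weighted occurrences of the keyword over the fields of a page."""
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        total += weight * words.count(keyword)
    return total

# Two made-up pages: one occurrence in the title versus two in the body text.
page_a = {"title": "Searching the web", "text": "some lore"}
page_b = {"title": "Home", "text": "notes on searching and more searching"}

print(score(page_a, "searching"), score(page_b, "searching"))
```

Under these assumed weights a single hit in the title outranks two hits in the body.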

Altavista does not index non-standard text formats like Adobe's PDF files (Google now does, I bet Alta and the others will HAVE TO follow soon, eheh). Alta does not list excessively big files (5 megabytes), password-based access pages (obviously), and dynamic information, such as pages generated by a CGI script.
Scooter, Altavista's spider, also registers the date of the document and, which was quite a clever invention, the distance between different keywords, assuming that they are reciprocally related if they are close to each other. Other search engines do not do that, hence the differences when using the NEAR operator... maybe the most important boolean operator for searching purposes.
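Once word locations are stored, a proximity test is cheap. Here is a minimal sketch of a NEAR check; the function names are hypothetical and the default window of 10 words is an assumption, not a documented constant of any engine.

```python
def positions(words, term):
    """Return the positions at which a term occurs (case-insensitive)."""
    return [i for i, w in enumerate(words) if w.lower() == term.lower()]

def near(text, a, b, window=10):
    """True if some occurrence of a lies within `window` words of some occurrence of b."""
    words = text.split()
    return any(
        abs(i - j) <= window
        for i in positions(words, a)
        for j in positions(words, b)
    )

text = "search engines rank pages"
print(near(text, "search", "pages", window=3))
print(near(text, "search", "pages", window=2))
```

An engine that stores no distances can only answer AND (both words somewhere on the page), which is a much blunter tool.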

The frequency of words on a page is quite a complicated parameter, because spammers (especially in the old times) used it as a quick method to get their pages into the first positions. Most of the people into 'search engine algo cracking' are either spammers or commercial operators intent on 'pushing' their clients' sites into the first ranking positions. Since the average zombie user would never dream of going beyond the first 20 ranked results (if that: most lusers won't even look at the second page of their results!), high rankings on the first result page(s) are - as you may expect - of paramount relevance (and commercial value).

Each search engine has a set of specific algorithms, of course. This 'pool' of algos is responsible for the different results you'll get when querying. Spammers are thus compelled to (try to) reverse ALL these algos and then build a SET OF DIFFERENT PAGES each targeted through its specific keywords and structure for one single main search engine.
If you always presented the same page to the different search engines' spiders, you would fare well on one engine and fail miserably on the others.
For instance, while one search engine will give high relevancy to keyword occurrence in one position, others will ignore it.

Leeching from the spammers
Hence it pays to perform specific searches for a given target on as many search engines as possible and compare the generated pages of THE SAME SITE. You'll quickly discover that they are differently structured, and you'll be able to 'leech' some useful knowledge from the bastard commercial spammers that spend their working nights and days doing just that, eheh :-)

Additionally, some spammers are now running "stealth keywords", hoping that others won't be able to tell whether they are viewing true keywords or bogus keywords. You can get around this somewhat if you use a browser or a browser-firewall combination that allows you to set both the USER_AGENT and the referring URL. By setting the USER_AGENT to one of the major search spiders (look in your own site's logs to see how they declare themselves :-) and by setting the referring URL to null, you can sometimes trick spammers' web pages into thinking you are the spider of one of the major search engines, and so get to see the juicy real set of keywords. It's amazing how often this works. I can't understand why the search engines themselves don't rotate their spiders' names.
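The same trick can be sketched with a few lines of Python's standard library instead of a browser-firewall combination. The URL and the spider name below are placeholders (assumptions): take the real User-Agent strings from your own logs. urllib sends no Referer header unless you add one, so the "null referrer" comes for free.

```python
import urllib.request

def build_spider_request(url, user_agent):
    """Prepare a GET request that declares the given User-Agent and sends no Referer."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(request, timeout=10):
    """Download the page body; call this yourself, nothing is fetched at import time."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()

# Placeholder values, invented for this example.
request = build_spider_request("http://www.example.com/", "ExampleSpider/1.0")

print(request.get_header("User-agent"))
print(request.has_header("Referer"))
```

Fetch the same page twice, once with a browser-like User-Agent and once with a spider-like one, and compare the two bodies: any difference is the stealth layer showing.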

Indeed there's a lot of money currently raining into this specific field... run around the search engines' algos with your hat upside down and you'll surely get a part of it. Yet be warned: I have noticed that searchers and algo-reversers overly interested in money don't understand much of what's really going on under the hood... they always have too 'contingent' an attitude, and their 'money' aims inevitably cover -sooner or later- their original 'knowledge' aims... A nice "legge del contrappasso", if you understand what I mean.

Let's summarize some important points you should remember when searching with Altavista. I'll just list them here to give an idea of the depth of understanding you need to search effectively with the main search engines:

Basic Alta's recall

!! ALWAYS use the advanced search with boolean operators and result ranking options.
!! Phrase searching - through heavy use of "" - is the sine qua non when cutting through the web.
!! The rule is: case sensitive name searching. Lower case retrieves either lower or upper case. Upper case kills all other case occurrences. Note that Altavista IS accent and character sensitive.
!! Boolean logic (AND, OR, AND NOT, NEAR) should be used only in advanced search: do not use it in Altavista's simple search form, nor in the advanced search 'sort' box. Always remember that the so-called simple search always defaults to OR (you can use it nevertheless for a 'quick and dirty' search session).
!! Subsearching inside sort box
Result ranking through sortbox... you MUST understand this! You should use the results ranking criteria "Sort by" box whenever possible or you'll get a less useful unsorted list. For instance, if you're looking for all "searching and seeking" related pages but have a particular interest in filtering, type filtering in the "Results Ranking Criteria" space in order to bring those useful listings to the top. Incredibly useful addition!
!! Field limiting: title: ~ url: ~ link: ~ host: ~ domain: ~ anchor: ~ text: ~ image: ~ applet:
You will realize how important this "field limiting" approach is only once you have tried it, therefore just try it: as the Germans say, trying beats studying
!! Truncation: * ~ ** ~ ?
Either you use truncation or you will be truncated!
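What the sort box does for you can be imitated on any list of results. A minimal sketch, with a made-up result list and a hypothetical function name: results mentioning the ranking term most often float to the top, the others keep their original order.

```python
def rerank(results, ranking_terms):
    """Bring results that mention the ranking terms most often to the top (stable sort)."""
    terms = [t.lower() for t in ranking_terms]

    def hits(result):
        text = result.lower()
        return sum(text.count(t) for t in terms)

    # sorted() is stable, so results with equal hit counts keep their relative order.
    return sorted(results, key=hits, reverse=True)

# Invented result snippets for a "searching and seeking" query.
results = [
    "searching and seeking basics",
    "filtering while searching and seeking",
    "seeking lore",
    "searching, seeking and filtering your filtering",
]

for line in rerank(results, ["filtering"]):
    print(line)
```

The query itself is unchanged: the sort term only decides what you see first, which matters a lot when only the first 1000 hits are shown.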

Write it down and remember it: for queries at Altavista the best method is ALWAYS advanced search with boolean logic (and field limiting) and then heavy use of the sort box.
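To make the habit concrete, here is a small sketch that assembles an advanced query string from the pieces listed above (phrases, boolean operators, field limiting, truncation). The helper names are hypothetical, and whether a given engine accepts exactly this nesting of parentheses is an assumption: check the engine's own help pages.

```python
def phrase(text):
    """Wrap text in double quotes for exact phrase searching."""
    return f'"{text}"'

def field(name, value):
    """Field limiting, e.g. title:searching or host:example.com."""
    return f"{name}:{value}"

def all_of(*parts):
    """Combine sub-queries with AND."""
    return "(" + " AND ".join(parts) + ")"

def any_of(*parts):
    """Combine sub-queries with OR."""
    return "(" + " OR ".join(parts) + ")"

def near(a, b):
    """Proximity between two terms."""
    return f"({a} NEAR {b})"

def without(query, excluded):
    """Exclude a sub-query with AND NOT."""
    return f"{query} AND NOT {excluded}"

query = without(
    all_of(phrase("search engines"), near("spider", "algo*"), field("title", "searching")),
    field("host", "example.com"),
)
print(query)
```

The resulting string goes into the advanced search box; the sort box then gets its own separate ranking terms.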

Anyway, even knowing how the main search engines operate and reversing their algos is not enough for a good searcher: in order to search effectively you must also know that there are important seeking resources elsewhere. But first you must understand that you smear your data around every time you're on the web!