~ Worms, angles and search engines quirks ~
Part of the [Lab 1]
by Humphrey P. February 2001

Once more Humphrey (who dumped the text below on my [~S~ Seekers' msgboard] yesterday under the pseudo ~M) delivers stimulating observations, clever findings, new paths and altogether 'nuff material to work on for the next couple of years :-)
As usual his "zombie-killing" style may prove somewhat hard to follow... at times you'll have to move your brain, not only your eyes, to read his words...

What do we learn from ~Veliti~'s success with:
(2) [Text] A very easy introduction to "referenced" text; searching & search engines "oddnesses;" Main search engines quickfinding:
"Ingenium quondam fuerat pretiosius auro,
At nunc barbaria est grandis habere nihil"

1) There's a problem sometimes with quoting the whole text in a search window.
In other words, a search engine is a literal thingk, and may not recognize the whole hymn, but might recognize the first and last verse, or maybe only the first few words.

a) The Idea was to be quite specific, by entering a long string of words which will be found in that order in only one literary vehicle. Sort of like a brand. Like, "Stevenson AND Modestine" can only mean: Travels with a Donkey in the Cévennes. So too does "et vous marchez comme ça!"

b) but the Idea fails when there are differences in textual formatting on different web pages.

For instance:

"Ingenium quondam
fuerat pretiosius auro,
At nunc barbaria est grandis
habere nihil"

This formatting may not be the same to a search engine as our quote above, because there are carriage_return line_feeds in our "formatted text" where there were none in the original.

Likewise,

"Ingenium quondam fuerat pretiosius auro, at nunc barbaria est grandis habere nihil"

is not the same as the original quote. It lacks a carriage_return line_feed; "at" is not capitalized.
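A minimal sketch of why the three versions of the couplet differ to a literal matcher, and what a little normalization would change. The `normalize` function is only an assumption about what an engine could do; nothing here describes how any real engine actually stores text.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse every run of whitespace (CR, LF, spaces) to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()

# The quote as first given, the re-wrapped version, and the one-line version.
original = ("Ingenium quondam fuerat pretiosius auro,\n"
            "At nunc barbaria est grandis habere nihil")
reflowed = ("Ingenium quondam\r\nfuerat pretiosius auro,\r\n"
            "At nunc barbaria est grandis\r\nhabere nihil")
one_line = ("Ingenium quondam fuerat pretiosius auro, "
            "at nunc barbaria est grandis habere nihil")

variants = [original, reflowed, one_line]

# Compared byte for byte, all three are different strings.
literal_distinct = len(set(variants))

# After normalization they collapse into one and the same string.
normalized_distinct = len({normalize(v) for v in variants})

print(literal_distinct, normalized_distinct)
```

Whether a given engine normalizes like this, on the page side, the query side, or both, is exactly what you cannot know from the outside.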

(A line break won't be a carriage_return line_feed in HTML. It could be a whole formatting function. Does the search engine store and index the HTML formatted text?)

(Throwing away the HTML formatting will also put strange things together, like Queequeg and Ishmael, Franklin and Adams.)
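To see how discarding the markup can put strange neighbours together, here is a small sketch using Python's standard `html.parser`. The sample page and the `strip_tags` helper are made up for illustration; whether a real engine glues the pieces or puts a space between them is an assumption either way.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self, separator: str):
        super().__init__()
        self.separator = separator
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str, separator: str) -> str:
    parser = TextOnly(separator)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.separator.join(parser.parts)

# Hypothetical page: two table cells, then two names on separate lines.
page = "<td>Ishmael</td><td>Queequeg</td><br>Franklin<br>Adams"

glued = strip_tags(page, "")    # tags vanish, words run together
spaced = strip_tags(page, " ")  # tags become spaces, words become neighbours

print(glued)
print(spaced)
```

In the spaced version a phrase search for "Queequeg Franklin" would succeed, although no human reader of the page ever saw those two words side by side.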

So, it pays to break up the worm, and put only one piece on your hook. It's still the appropriate worm for the appropriate fish.

If you find you are attracting millions of fish, then put a bigger piece of the worm on the hook, to attract only the few bigger fish, not the many merely quoting fish.

And try not to include the 'angle' when breaking apart your angle worm. Use "pretiosius auro" rather than "auro, At"
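Breaking up the worm can be sketched as follows: cut the quote wherever there is punctuation or a line break (the "angles"), then offer only runs of words taken from inside one piece. The function name `worm_pieces` and the default piece size of two words are illustrative choices, not anything prescribed above.

```python
import re

def worm_pieces(quote: str, size: int = 2) -> list:
    """Return phrases of `size` consecutive words that never cross
    a punctuation mark or a line break."""
    pieces = []
    # Split on any character that is neither a word character nor whitespace,
    # and also on line breaks.
    for segment in re.split(r"[^\w\s]|[\r\n]+", quote):
        words = segment.split()
        for i in range(len(words) - size + 1):
            pieces.append(" ".join(words[i:i + size]))
    return pieces

quote = ("Ingenium quondam fuerat pretiosius auro,\n"
         "At nunc barbaria est grandis habere nihil")

pieces = worm_pieces(quote)
for piece in pieces:
    print(piece)
```

"pretiosius auro" is among the pieces, "auro At" is not. Note that the pieces are only safe relative to the copy you start from: another edition with different punctuation will have its angles in other places.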

(Both
http://pot-pourri.fltr.ucl.ac.be/files/AClassFTP/TEXTES/Ovide/ovid_amor3.txt
and
http://cyrill2.newmail.ru/ovidiy_amores_3.html
have
"ingenium quondam fuerat pretiosius auro;
at nunc barbaria est grandis, habere nihil."
Hmmm, there's a semicolon where there used to be a comma, and a comma where there used to be a space. Hmmm. That seems to be the preferred punctuation at Kulichki's also.)

c) The Idea fails also when there is a difference in spelling. svd has shown us this with authors' names, and a lot of other common words. Sometimes the "odd" spelingk is just the rite won: "Randonner avec un âne en Cévenne."

2) No two search engines are the same.

You'd expect the grown up search engine, AltaVista, to give you the best results.
Well, it's not always so. Not only because any one search engine has only seen a small part of the web, but also because of the way they collect their data, what data they store to index, and how they find thingks in their index.

~Veliti~ assumes that [http://www.grenadines.net/carriacou/johnsmithhomepage.htm] did have our quote within it, at one time, which is why you'd want to try Google's cache to see the original web page with the quote in it.

Hmmm. No quote. Is there a way to tell how old the cached page is? A way to tell when Raging indexed the page?
On Google's cached page:
http://www.google.com/search?q=cache:www.grenadines.net/carriacou/johnsmithhomepage.htm+&hl=en
the author, johnsmith, is helping us out: "This page last updated on 8 May 2000." The current version is: "... 19 January 2001"

Now, when did Raging index it? We'll have to go to "Customize" [http://doc.altavista.com/raging-custom/results.html] and turn on [Last modified date].
(I thought Google used to print the date it had cached a page. Where do you turn that on again?)

Sometimes, when you put something like "183-02-01.jpg" into a metasearch engine, like MetaEureka.com, then you get all possible misinterpretations of what you are asking for. Some of the engines think you are asking for ".jpg" Some for "01.jpg" Some for 183 OR 02 OR 01 OR jpg.
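The different readings of "183-02-01.jpg" can be imitated with three toy tokenizers. These are hypothetical stand-ins, not the actual rules of any engine behind a metasearch; they only show how one query string can turn into quite different questions.

```python
import re

query = "183-02-01.jpg"

def as_exact_token(q: str) -> list:
    """Treat the whole string as one indivisible term."""
    return [q]

def split_on_hyphen(q: str) -> list:
    """Treat only the hyphen as a separator."""
    return q.split("-")

def split_on_punctuation(q: str) -> list:
    """Treat every non-alphanumeric character as a separator."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", q) if t]

readings = {
    "exact": as_exact_token(query),
    "hyphen": split_on_hyphen(query),
    "punctuation": split_on_punctuation(query),
}

for name, terms in readings.items():
    print(name, terms)
```

An engine using the last reading and joining the terms with OR will happily return every page that mentions "jpg".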

Another form of this 'simplification unto inanity' is to throw away the common words "a, an, the, ..." and index the deboned flesh and not the blood and carcass.
In other words, it is not only what the thingk is called, but also how the search engine has broken it apart and stored it in its index, and then how it retrieves it. Hopefully, it's retrieved the same way it was stored. "Ribs, in the meat section. Ah, yes, Flossy."
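A toy illustration of "retrieved the same way it was stored": if common words are dropped when a page is indexed, the query has to be deboned in the same way or the phrase will not match. The stopword list and both helper functions are assumptions made up for this sketch.

```python
# Assumed stopword list; real engines use their own, usually longer, lists.
STOPWORDS = {"a", "an", "the", "in"}

def index_terms(text: str, drop_stopwords: bool) -> list:
    """Lowercase, split on whitespace, optionally drop the common words."""
    words = text.lower().split()
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

def contains_phrase(doc_terms: list, query_terms: list) -> bool:
    """True if query_terms occur consecutively somewhere in doc_terms."""
    n = len(query_terms)
    return n > 0 and any(doc_terms[i:i + n] == query_terms
                         for i in range(len(doc_terms) - n + 1))

title = "Travels with a Donkey in the Cévennes"
stored = index_terms(title, drop_stopwords=True)

query = "a donkey in the cévennes"
# Query treated the same way as the stored text: the phrase is found.
same_way = contains_phrase(stored, index_terms(query, drop_stopwords=True))
# Query kept whole while the index was deboned: the phrase is lost.
different_way = contains_phrase(stored, index_terms(query, drop_stopwords=False))

print(stored, same_way, different_way)
```

The price of the deboned index is that "donkey cévennes" now counts as an exact phrase, although nobody ever wrote it that way.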

3) Where the heck is "fuerat pretiosius" at Chertovy Kulichki's [http://www.kulichki.com/] ? If you come in the front door, you are greeted in Russian! Oh, sure, once you know that it's from Ovid, (from Google's search: http://www.gmu.edu/departments/fld/CLASSICS/ovid.amor3.html)
then you can peek around and discover an Ovid link.
I suppose it's just as plain as day, in Russian.

A search engine cuts through all that. (It knows there's a card game going on in the back room.) Or so you would hope.

(Would AllTheWeb/Fast with Domain: Only include: [kulichki.com] and Search for: [the exact phrase] [fuerat pretiosius] find it quick?
Hmmm. Seems to be blind in one eye. It does find some pages at kulichki-???.rambler.ru, but apparently not all of them.)

Which brings up a number of questions.
For one, kulichki.com comes in five flavors: KOI WIN LAT MAC ISO.
At least LAT is using a different port [http://www.kulichki.com:8105/] to display its flavored page.

(Umph. I shouldn't be so lazy. Each of the flavors is using a different port. Here are the equivalents, port to flavor: :8100 KOI, :8101 WIN, :8102 ALT, :8103 ISO, :8104 MAC, :8105 LAT)

Now, should Raging Search index all of these flavors on one page: kulichki.com? Or, include the port in the indexed address [kulichki.com:8105]?

http://www.kulichki.com:8104/~risunok/literatu/antique/ovid/ovid_amor3.html

We already know, that when you ask AllTheWeb/Fast to consider only one "domain" it doesn't care whether you include the "www." or not. Does it care about the port? Should it care about the port? Should it know that kulichki.com:8104... is the same as kulichki-mac.rambler.ru... ?
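The www-or-not and port-or-not questions come down to what key an engine files a page under. Here is a sketch with the standard `urllib.parse`; the function `site_key` and its two switches are hypothetical, chosen only to show the alternatives.

```python
from urllib.parse import urlsplit

def site_key(url: str, ignore_www: bool = True, ignore_port: bool = False) -> str:
    """Build the 'domain' key a page would be filed under."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if ignore_www and host.startswith("www."):
        host = host[len("www."):]
    if ignore_port or parts.port is None:
        return host
    return f"{host}:{parts.port}"

urls = [
    "http://www.kulichki.com/",
    "http://www.kulichki.com:8105/",
    "http://www.kulichki.com:8104/~risunok/literatu/antique/ovid/ovid_amor3.html",
    "http://kulichki-mac.rambler.ru/~risunok/literatu/antique/ovid/ovid_amor3.html",
]

with_ports = {site_key(u) for u in urls}
without_ports = {site_key(u, ignore_port=True) for u in urls}

print(sorted(with_ports))
print(sorted(without_ports))
```

Ignoring the port folds the flavors into one key, but no amount of string trimming will tell you that kulichki.com:8104 and kulichki-mac.rambler.ru are the same thing. For that, somebody has to compare the pages themselves.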

Secondly, the different flavors (of Russian calligraphy) are not "spelled" the same. Much like printing Chinese in simplified characters, or traditional characters, or in PinYin. Here is the simplest form of "translation," yet what search engine can manage it? No babble fish here, at all.
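The "different spelling" of the flavors can be made visible by encoding one Russian word in several Cyrillic character sets. The mapping of flavor names to Python codecs below (KOI to KOI8-R, WIN to Windows-1251, ALT to code page 866, ISO to ISO 8859-5, MAC to Mac Cyrillic) is my assumption about what the flavors stand for; LAT, presumably a transliteration into Latin letters, is left out.

```python
# Assumed mapping from kulichki "flavor" to a Python codec name.
FLAVORS = {
    "KOI": "koi8_r",
    "WIN": "cp1251",
    "ALT": "cp866",
    "ISO": "iso8859_5",
    "MAC": "mac_cyrillic",
}

word = "Овидий"  # "Ovid" in Russian

# The same six letters become five different byte strings.
encoded = {flavor: word.encode(codec) for flavor, codec in FLAVORS.items()}
for flavor, raw in encoded.items():
    print(flavor, raw.hex(" "))

# A byte-level search with the KOI spelling finds nothing on a WIN page.
win_page = "Овидий, Amores".encode("cp1251")
found_raw = encoded["KOI"] in win_page

# Decoding the page first, with the right codec, makes the match possible.
found_decoded = word in win_page.decode("cp1251")

print(found_raw, found_decoded)
```

So an engine that indexes raw bytes sees five unrelated words, and only one that knows (or guesses) the character set of each page sees one.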

Third, in the past Inference Find (infind.com) was found to be useful in fetching phrases.

Inference Find does it with "Parallel Search." It's a metasearch engine. It finds the same two sites as Raging Search. No brains itself; depends upon hearsay.
http://kulichki-mac.rambler.ru/~risunok/literatu/antique/ovid/ovid_amor3.html is the flavor it finds. (Well, that's not the same as "kulichi-alt.rambler.ru")

MetaEureka.com [fuerat pretiosius] works pretty well.

Shouldn't Raging Search index all of kulichki? Does it recognize duplications? Or does it only index a few things from a domain name? Maybe it doesn't know about kulichki.com, but only kulichki-???.rambler.ru? Maybe RagingSearch is trying to be quick, and only pops up one from kulichki-???.rambler.ru?

Google seems to know about the kulichki-???.rambler.ru... series. Found: http://kulichki.rambler.ru/~risunok/literatu/antique/ovid/ovid_amor3.html
and kulichki-win, -alt, -iso, -lat, -mac.
(Hmmm. does "kulichki.rambler.ru" equal "kulichki-koi.rambler.ru"?)

As a rule of thumb, you should have more success finding a quote with a metasearch engine, than with any single one of the participating engines. But your numbers of hits may be cut down by time limitations, or some fumbling between the center and the quarterback, the metasearch and the search engines.
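The rule of thumb, and the way a time limit eats into it, can be sketched with a toy merger. The engine names and result lists are invented placeholders, and `max_engines` is a crude stand-in for "only these engines answered before the clock ran out."

```python
def metasearch(engine_results: dict, max_engines=None) -> list:
    """Merge per-engine hit lists, keeping first-seen order and
    dropping duplicate URLs. Only the first `max_engines` engines
    are consulted (None means all of them)."""
    merged = []
    seen = set()
    for name in list(engine_results)[:max_engines]:
        for url in engine_results[name]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Invented example data: three engines with partly overlapping hits.
engine_results = {
    "engine_a": ["http://example.org/ovid-1", "http://example.org/ovid-2"],
    "engine_b": ["http://example.org/ovid-2", "http://example.org/ovid-3"],
    "engine_c": ["http://example.org/ovid-4"],
}

all_hits = metasearch(engine_results)
cut_short = metasearch(engine_results, max_engines=2)

print(len(all_hits), len(cut_short))
```

The merged list is at least as long as any single engine's list, which is the whole attraction; the cut-short run shows how a slow engine's unique hit simply never arrives.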

Fourth, I'm having trouble searching for Latin and Chinese in English search engines. Do Russians have the same trouble searching for English, Latin and Chinese in Russian search engines?

(For instance, the search on kulichki.com can't find "fuerat pretiosius" and I don't know enough Russian to know why not, or even what all the options are.)

Fifth, surely Ovid is more widely represented upon the web than what we have found with RagingSearch, Google, AllTheWeb, Infind, MetaEureka and NorthernLights. Perhaps some of P. Cook's Latin texts links?
~M

of course suggestions and comments are welcome...


(c) 1952-2032: [fravia+], all rights reserved