Computers & Geosciences, Volume 22, Number 2, 1996

John C. Butler
Department of Geosciences
University of Houston
Houston, TX 77204

Searching The Web for Fun ..... Profit ...... and Potential Headaches

As you become proficient in moving through cyberspace, there will come a time when you ask "is there information about geoscience-related newsgroups, or geophysics degree programs at the University of Houston, or general statistical packages on the Net, or what is Enid Grubbman's e-mail address at Miami University?" On the ANON home page [updated] you will find several references that may prove helpful. The ANON page has links to listings of Geosciences Resources, People, Places, Universities, Education Resources, and Software.

None of these lists of lists should be viewed as definitive. One of the simultaneous strengths and weaknesses of the way that the Internet has evolved is that relatively few people are paid to maintain Web resources and maintenance is most often accomplished by volunteers. This is but one of a complex of issues faced by those who use the Net and want to see its utility increased.

Sooner rather than later will come the need to locate specific information, and a number of searching services are available on the Internet. This issue will focus on WWW search procedures. [The following material has been added to replace a link no longer available. If the reader has little experience in searching, the following Scavenger Hunt, which was designed as a physical geology homework exercise, may be a useful starting point.] Save this file (save the source) to your desktop or main directory and access it with any browser that allows you to specify a particular file. In Netscape, Command-O will allow the user to specify which file is to be viewed. If you put the hypertext mark-up language file in the same folder (or directory), you will have a duplicate of the Searching Exercise file. Thus, any material which you can view with a browser can be downloaded to your machine. This raises other critical issues regarding intellectual property, copyrights, and plagiarism. I will try to arrange a guest article on these subjects for a future issue.

Each of these search algorithms rests on a set of rules of which the user should be aware. Some search the full text of documents, whereas others search only the titles of files or the contents of the IP address.

The World Wide Web Worm (WWWW) search engine is limited to searching: URL references, URL addresses, document titles, and document addresses. If you produce WWW resources which you would like others to locate via this searching strategy, be certain to name the files so that their contents are obvious. A file labeled stuff.html might be in great demand, but the Worm would only locate it if the user specified "stuff" as the key word in the search. The Worm user should read the sections on Instructions, Examples, Search, Failures, Register, and WWWW Paper, which are referenced at the top of the WWWW page at the address given above or in the downloaded Anon Search file.
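The point about file naming can be illustrated with a minimal sketch of a search that looks only at the words in a URL and a document title. This is not the Worm's actual algorithm, which is not described here; the function names and the example.edu addresses are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlparse


def name_keywords(url, title=""):
    """Collect lowercase words found in the URL and title only.

    Hypothetical helper: the document body is never examined, which is
    the assumption this sketch makes about a title/URL-only search.
    """
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path} {title}"
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(keyword, url, title=""):
    """True if the key word appears as a word in the URL or the title."""
    return keyword.lower() in name_keywords(url, title)


# Hypothetical pages: one poorly named, one descriptively named.
vague = ("http://example.edu/geology/stuff.html", "")
clear = ("http://example.edu/geology/ries_crater_diamonds.html",
         "Diamonds at the Ries Crater")

print(matches("stuff", *vague))     # found only under the key word "stuff"
print(matches("diamonds", *vague))  # not found, whatever the file contains
print(matches("diamonds", *clear))  # found because the name is descriptive
```

Under these assumptions the poorly named file can be reached only by someone who already knows to search for "stuff", whereas the descriptively named file answers to each word in its name and title.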

Other search strategies in the file search on the text of the html documents. The Web Crawler was an early text search service. Users should take time to read the explanations: Help, Facts, Top 25 Sites, Submit URLs, Random Links, and No-forms Search, which appear as links near the top of the home page. The design is described in the Facts section. Note that a developer can submit the URL of a particular site. The Web Crawler "agent" will visit the site and "read" the text. Most of these search agents allow submission of URLs, which should be considered when the developer is "semi satisfied" with the product.

Yahoo, Lycos, and Inktomi (now Hotbot), the other text-based search algorithms listed in the Search file, are similar to the Web Crawler, but each has its own spin on how the search takes place and how the results are displayed. All require submission of key words and all allow use of some form of Boolean operation(s). The sequence in which "matches" are displayed is based on the frequency of occurrence of the key word(s) in the documents, with those at the top of the list having the highest frequency. All of these search engines allow the user to specify how many "matches" are displayed starting from the top of the list, and all allow the user to examine the entire set of matches. The matches are presented as links to the articles. The search engines differ in the size of the database which is accessed by the searching algorithm and in how the database was assembled. Inktomi (the newest of the four referenced above) has a paper on Counting URLs which should be reviewed by the user.

A limited, but personally revealing, set of tests is described in the remainder of this month's column. While looking for some resources for the physical geology course I was teaching, I found a short article on diamonds found at the Ries Crater. The article is "buried" in files maintained by the Open University and is located at: Find Me.
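As a brief aside, the ordering rule described above for the text-based engines (Boolean combination of key words, matches ranked by key-word frequency, a user-chosen number of matches shown) can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any of the four engines: the function names, the scoring by summed word counts, and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_and(documents, keywords, limit=None):
    """Boolean AND search ranked by key-word frequency.

    documents: dict mapping a document name to its text.
    Returns (name, score) pairs, highest score first. The score is
    assumed here to be the total count of all key words in the text.
    """
    terms = [k.lower() for k in keywords]
    results = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        # Boolean AND: every key word must occur at least once.
        if all(counts[t] > 0 for t in terms):
            results.append((name, sum(counts[t] for t in terms)))
    # Highest frequency first; ties are broken alphabetically by name.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:limit]


# Hypothetical documents for illustration only.
docs = {
    "a.html": "The Ries crater is an impact crater in Germany.",
    "b.html": "Ries crater: diamonds at the Ries crater rim near the crater.",
    "c.html": "A crater on the Moon.",
}

for name, score in search_and(docs, ["Ries", "Crater"]):
    print(name, score)
```

With these sample documents, c.html is excluded because it never mentions Ries, and b.html is listed ahead of a.html because the key words occur in it more often. Passing limit=1 would display only the top match.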

The number of matches for Ries and Crater (Boolean AND) obtained from each of the four search engines is given below.

  1. Inktomi - 13
  2. Lycos - 7
  3. Web Crawler - 4
  4. Yahoo - 0

None of the searches uncovered the article, which means that either the URL has not been submitted or a search agent has not found that site. This is particularly frustrating as the amount of useful material about the Ries Crater on the Web is unknown. The four matches found by Web Crawler are anomalous in that Ries does not occur, but "...ries" as part of summaries, repositories, etc., does occur. Perhaps this resulted from a problem in coding the text of these four sources. Five of the seven Lycos matches occur within the set of 13 Inktomi matches. The top three documents located by Inktomi are identical versions of the same document stored at three different addresses. Up to a point this is good, as these sites literally can disappear overnight.
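The anomalous Web Crawler matches are consistent with a key word being matched as a fragment of longer words rather than as a whole word. Whether that is what actually happened is not known; the sketch below only contrasts the two behaviors, and the function names and sample text are hypothetical.

```python
import re


def substring_match(keyword, text):
    """True if the key word occurs anywhere, even inside a longer word."""
    return keyword.lower() in text.lower()


def whole_word_match(keyword, text):
    """True only if the key word occurs as a complete word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical snippets for illustration only.
unrelated = "Summaries of data held in several repositories"
relevant = "Diamonds found at the Ries Crater"

print(substring_match("Ries", unrelated))   # matches inside "Summaries"
print(whole_word_match("Ries", unrelated))  # no whole-word occurrence
print(whole_word_match("Ries", relevant))   # genuine occurrence
```

A searcher who suspects this kind of false match can check whether the key word appears as a complete word in the returned document before following up on it.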

A second test was a search on the key word Petrography.

  1. Lycos - 656
  2. Inktomi - 647
  3. Web Crawler - 76
  4. Yahoo - 4

Most, but certainly not all, of the more than 600 references identified by Lycos and Inktomi are parts of detailed descriptions of courses, academic programs, and faculty research efforts published by academic institutions. The searcher should keep this in mind as the number of such Web "publications" continues to increase. At some point a coding scheme to identify materials as a) course descriptions, b) research publications, c) course presentations, etc. may help discriminate among kinds of Web contributions.

Each user should read the pertinent material supplied with a particular search algorithm and experiment prior to deciding which one(s) will be used routinely.

Although there are obvious shortcomings, as noted above, life on the Web without these algorithms truly would be like "drinking from a firehose."
