Doug Oard's Research Software Page

The research software that is available on this page has been produced as a byproduct of the work described on my main research page. I am making it available here in order to help others avoid reinventing the wheel and as a way of making life easier for anyone interested in duplicating my results. My students and I have tried to make it as easy as possible to reproduce the software environments that we have used, but path names and other system-specific information will need to be manually edited to get any of this to work on your system. The documentation for these routines ranges from terse to nonexistent, but I'll be happy to answer questions by email.

This software may be used for research purposes in a manner consistent with the SMART and SVDPACK distribution statements. Requests for other uses of this software should be referred to the author so that we can sort out the tangled interconnection of authorship in these files. Persons interested in commercial applications of these techniques should be aware that Bellcore holds two patents on Latent Semantic Indexing.

SMART Modifications

These modifications are posted with the permission of the SMART project at Cornell University. In order to use them you should first download a clean copy of SMART version 11.0 from the SMART FTP site at ftp://ftp.cs.cornell.edu/pub/smart/ and then untar the files you need on top of it in the appropriate order.
Solaris Patch
An example of how to modify the SMART 11.0 distribution to run on Solaris 2.x (SunOS 5.x). There are some useful instructions on how to do this on the smart-people mailing list, but I find it always helps to have an example in which all the pieces are put together.
ISO 8859-1 Character Set Patch
A simple and fairly well documented patch that provides 8-bit clean processing for characters in the Latin-1 (ISO 8859-1) character set with appropriate recognition. We used this to run experiments in Spanish. Stemming for other languages is NOT included, but there is a Spanish stopword list available on the SMART FTP site.
Adaptive Multilingual Filtering Patch
This tar file contains all of the modified and new SMART routines that I used for my thesis research. Components are included which compute Latent Semantic Indexing (LSI) feature vectors, perform adaptive text filtering using those vectors, use that as the basis of three different techniques for cross-language text filtering, and evaluate information filtering results. There is also some code there for Gaussian user modeling, a technique that proved no more effective than LSI in my experiments. My dissertation contains the best description of what all these parts do and how I used them. The dissertation also describes the test collections that I used and tells you how they can be obtained, but they cannot be redistributed with these routines because almost all of the data and the tools used to preprocess it are proprietary. I'll be happy to provide the shell scripts and C programs that I used to control the preprocessing to anyone who wants them, but I have not included them in this tar file because they are so specific to the data formats produced by the tools that I used and to the structure of our file system that I doubt that they would be of much use to most people. My internal documentation in these files is abysmal, as are some of the design decisions about where in the SMART hierarchy to place things. But it should provide a useful starting point for people who are interested in adapting some of the components that I have implemented for other purposes.
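
For readers who have not met LSI-based filtering before, the general idea behind the components listed above can be sketched in a few lines. This is only an illustration and is not the code in the tar file: it uses Python with NumPy's dense SVD rather than SMART and SVDPACK, the function names (lsi_fit, lsi_project, cosine) are hypothetical, the term-document counts are made up, and the user profile is assumed to be a simple average of the LSI vectors of documents the user liked.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Truncated SVD of a terms x documents matrix, keeping k dimensions."""
    u, s, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k]


def lsi_project(u_k, s_k, term_vector):
    """Fold a term-space vector into the k-dimensional LSI space."""
    return (term_vector @ u_k) / s_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Made-up counts: rows are 5 terms, columns are 4 training documents.
# Documents 0 and 1 share one vocabulary, documents 2 and 3 another.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

u_k, s_k = lsi_fit(term_doc, k=2)
doc_vecs = [lsi_project(u_k, s_k, term_doc[:, j]) for j in range(4)]

# Assumed profile: the user marked documents 0 and 1 as relevant.
profile = (doc_vecs[0] + doc_vecs[1]) / 2

# Score two unseen documents against the profile.
on_topic = np.array([1, 1, 1, 0, 0], dtype=float)
off_topic = np.array([0, 0, 0, 1, 1], dtype=float)
score_on = cosine(profile, lsi_project(u_k, s_k, on_topic))
score_off = cosine(profile, lsi_project(u_k, s_k, off_topic))
print(score_on, score_off)
```

The on-topic document should score much higher than the off-topic one. A real collection would need a sparse SVD routine (which is the role SVDPACK plays here) and a term weighting scheme, both of which this sketch leaves out.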

Last modified: Wed Apr 16 08:18:55 2003
Doug Oard oard@glue.umd.edu