$Id: FLAT-DATABASES-HOWTO.txt 2588 2003-03-14 14:58:30Z kdj $

CREATING OBDA-COMPLIANT INDEXED SEQUENCE FILES

The Open Biological Database Access (OBDA) standard specifies a way of
generating indexes for entry-based sequence files (e.g. FASTA, EMBL)
so that the entries can be looked up and retrieved quickly. These
indexes are created using the org.biojava.bio.program.indexdb package
and accessed using the org.biojava.bio.seq.db.flat package.

The org.biojava.bio.program.indexdb package has the same functionality
as the previously available BioJava flat file indexing API. The main
reasons to use it are to take advantage of the BioSequence Registry
system (see BIODATABASE_ACCESS.txt) or to share the same indexed files
with programs written in other languages, such as those using BioPerl
or BioPython.

There are four steps to creating an org.biojava.bio.program.indexdb
database:

  1) select a root directory in which all the indexed databases will
     be stored.
  2) move the flat files into a good location.
  3) choose a symbolic name for the database.
  4) run the org.biojava.app.BioFlatIndex program to load the sequence
     files into the database.

1. Select a Root Directory

Select a directory in which the flat file indexes will be stored.
This directory should be writable by you, and readable by everyone who
will be running applications that access the sequence data.
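
A quick check of the chosen directory before indexing can be sketched
as follows. This is only an illustration: the class name RootDirCheck
and the method isUsableRoot are hypothetical helpers, not part of
BioJava, and the check covers only the account running it (it does not
verify that other users can read the directory).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of BioJava.
final class RootDirCheck {

    /** True if root exists, is a directory, and is writable by the current user. */
    static boolean isUsableRoot(Path root) {
        return Files.isDirectory(root) && Files.isWritable(root);
    }

    public static void main(String[] args) {
        // The default path is the example root used elsewhere in this document.
        Path root = Paths.get(args.length > 0 ? args[0] : "/usr/share/biodb");
        String verdict = isUsableRoot(root) ? "is usable" : "is missing or not writable";
        System.out.println(root + " " + verdict);
    }
}
```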

2. Move the Flat Files Into a Good Location

The indexer records the path to the source files (e.g. FASTA, or local
copies of GenBank, EMBL or SwissProt). This means that you must not
change the location or name of the source files after running the
indexer. Pick a good stable location for the source files and move
them there.

3. Choose a Symbolic Name for the Database

Choose a good symbolic name for the database. For example, if you are
mirroring GenBank, "genbank" might be a good choice. The indexer will
create files in a subdirectory by this name located underneath the
root directory.
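
The resulting layout can be sketched as path arithmetic. The class and
method names below (IndexLayout, databaseDir, configFile) are
hypothetical and exist only for illustration. The config.dat file name
is the one described under MOVING DATABASE FILES.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of BioJava.
final class IndexLayout {

    /** Directory holding one database's index: <root>/<symbolic name>. */
    static Path databaseDir(String root, String name) {
        return Paths.get(root, name);
    }

    /** Location of the config.dat file inside the database directory. */
    static Path configFile(String root, String name) {
        return databaseDir(root, name).resolve("config.dat");
    }

    public static void main(String[] args) {
        System.out.println(databaseDir("/usr/share/biodb", "genbank"));
        System.out.println(configFile("/usr/share/biodb", "genbank"));
    }
}
```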

4. Run the org.biojava.app.BioFlatIndex Program

The final step is to run the org.biojava.app.BioFlatIndex
program. This program is located in the BioJava distribution, in the
apps.jar jarfile.

The first time you run the program, the typical usage is as follows:

  java org.biojava.app.BioFlatIndex -c -l /usr/share/biodb -d genbank \
   -i flat -f fasta data/*.fa

The following command line options are required:

  -c   create a new index
  -l   path to the root directory
  -d   symbolic name for the new database
  -i   indexing scheme (discussed below)
  -f   source file format

The -c option must be present to create the database. If the database
already exists, -c will reinitialize the index, wiping out its current
contents.

The -l option specifies the root directory for the database indexes.

The -d option chooses the symbolic name for the new database. If the
-c option is specified, this will cause a new directory to be created
underneath the root directory.

The -i option selects the indexing scheme. Currently only one
indexing scheme is supported: "flat". The alternative, "bdb", selects
an index based on the BerkeleyDB library. It is generally the faster
of the two, but it would require a Java BerkeleyDB library to be
installed on your system. We have chosen not to support it at the
moment due to cross-platform portability issues. "flat" is a sorted
text-based index that uses a binary search algorithm to rapidly search
for entries. Although not as fast as bdb, the flat indexing system
performs well even for large databases, and it has no requirements
beyond Java itself. Once an indexing scheme has been selected there
is no way to change it other than recreating the index from scratch
using the -c option.
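
The idea behind a sorted, binary-searched index can be sketched as
follows. This is a simplified in-memory illustration, not the actual
on-disk layout used by the indexer: the record format (an ID followed
by tab-separated fields) and the names FlatIndexSketch and lookup are
assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only; not the BioJava implementation.
final class FlatIndexSketch {

    /**
     * Binary search over records sorted by key (String.compareTo order).
     * The key is the text before the first tab. Returns the matching
     * record, or null if no record has that key.
     */
    static String lookup(List<String> sortedRecords, String id) {
        int lo = 0;
        int hi = sortedRecords.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String record = sortedRecords.get(mid);
            int tab = record.indexOf('\t');
            String key = tab < 0 ? record : record.substring(0, tab);
            int cmp = key.compareTo(id);
            if (cmp == 0) {
                return record;
            } else if (cmp < 0) {
                lo = mid + 1;   // target sorts after the middle record
            } else {
                hi = mid - 1;   // target sorts before the middle record
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical records: ID, file id, byte offset.
        List<String> index = Arrays.asList(
            "A001\tfileid_0\t0",
            "B002\tfileid_0\t120",
            "C003\tfileid_1\t0");
        System.out.println(lookup(index, "B002"));
    }
}
```

Each lookup halves the remaining range, so the number of comparisons
grows with the logarithm of the number of entries, which is why a
sorted text index stays usable as the database grows.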

The -f option specifies the format of the source database files. It
must be one of the formats that BioJava supports, including "genbank",
"swiss", "embl" or "fasta". Consult the org.biojava.bio.seq.io package
documentation for the complete list. All files placed in the index
must share the same format.

To update an existing index run BioFlatIndex without the -c option and
list the files to be added or reindexed. The -l and -d options are
required, but the indexing scheme and source file format do not have
to be specified for updating as they will be read from the existing
index.
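
The difference between the create and update invocations can be
summarised with a small sketch that assembles the command line for
each mode. The class BioFlatIndexArgs and its build method are
hypothetical helpers written for this document. Only the options
themselves (-c, -l, -d, -i, -f) come from the description above, and
the sketch assumes the BioJava jar files are already on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class BioFlatIndexArgs {

    /**
     * Builds the command line for BioFlatIndex. When create is true the
     * -c, -i and -f options are included; when updating they are left
     * out, because scheme and format are read from the existing index.
     */
    static List<String> build(boolean create, String root, String dbName,
                              String scheme, String format, List<String> files) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("org.biojava.app.BioFlatIndex");
        if (create) {
            cmd.add("-c");
        }
        cmd.add("-l");
        cmd.add(root);
        cmd.add("-d");
        cmd.add(dbName);
        if (create) {
            cmd.add("-i");
            cmd.add(scheme);
            cmd.add("-f");
            cmd.add(format);
        }
        cmd.addAll(files);
        return cmd;
    }

    public static void main(String[] args) {
        // Update: add one more (hypothetical) file to the "genbank" index.
        List<String> update = build(false, "/usr/share/biodb", "genbank",
                                    null, null, Arrays.asList("data/new.fa"));
        System.out.println(String.join(" ", update));
    }
}
```

The returned list could be handed to java.lang.ProcessBuilder, or
simply printed and pasted into a shell.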

MOVING DATABASE FILES

If you must change the location of the source sequence files after you
create the index, there is a way to do so. Inside the root directory
you will find a subdirectory named after the database, and inside that
you will find a text file named "config.dat". An example config.dat is
shown here:

  index	flat/1
  fileid_0	/share/data/alnfile.fasta	294
  fileid_1	/share/data/genomic-seq.fasta	171524
  fileid_2	/share/data/hs_owlmonkey.fasta	416
  fileid_3	/share/data/test.fasta	804
  fileid_4	/share/data/testaln.fasta	4620
  primary_namespace	ACC
  secondary_namespaces	ID
  format	URN:LSID:open-bio.org:fasta

For each source file you have moved, find its corresponding "fileid"
line and change the path. Be careful not to change anything else in
the file or to inadvertently replace tab characters with spaces.
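
When many files have moved, editing by hand invites exactly the
tab-versus-space mistakes mentioned above. The following is a minimal
sketch of doing the edit programmatically. It assumes, based on the
example shown, that each fileid line consists of tab-separated fields
with the path in the second field. The class ConfigPathRewriter is a
hypothetical helper, not part of BioJava, and the new directory
/mnt/seq/ is an invented example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class ConfigPathRewriter {

    /**
     * Returns a copy of the config.dat lines in which every fileid line
     * whose path starts with oldDir has that prefix replaced by newDir.
     * All other lines, fields and tab separators are left unchanged.
     */
    static List<String> rewrite(List<String> lines, String oldDir, String newDir) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Split on tabs only; the -1 limit keeps trailing empty fields.
            String[] fields = line.split("\t", -1);
            boolean isFileLine = fields.length >= 2
                    && fields[0].startsWith("fileid_")
                    && fields[1].startsWith(oldDir);
            if (isFileLine) {
                fields[1] = newDir + fields[1].substring(oldDir.length());
                out.add(String.join("\t", fields));
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "index\tflat/1",
            "fileid_0\t/share/data/alnfile.fasta\t294",
            "primary_namespace\tACC");
        for (String line : rewrite(lines, "/share/data/", "/mnt/seq/")) {
            System.out.println(line);
        }
    }
}
```

In practice the lines would be read from and written back to
config.dat, for example with java.nio.file.Files.readAllLines and
Files.write, after making a backup copy of the original file.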


---------------------------------------------------------------------------- 

ChangeLog

$Log$