Bioinformatics databases are often distributed as flatfiles or as dumps for loading into relational databases. How you interact with a database depends very much on its content and on the application for which it is being used. Many algorithms that make use of databases require an index to be generated so that the database can be searched; however, each tool frequently has its own index format.
Bioinformatics databases are typically distributed as flatfiles (i.e. plain text formats contained in one or more files), with each file usually containing thousands of records. Retrieval of individual entries from these databases requires the database to be indexed, for example with EMBOSS. We previously maintained local indexed copies of many databases; however, these databases are now generally available through web-service interfaces, so there is currently little demand for maintaining our own copies, and consequently only a minimal set of local databases is maintained. See retrieving sequences from databases for details on how to make use of these services. While web-service based entry retrieval works perfectly well for many applications, it is not appropriate for high-throughput usage, where connection rates will be throttled by service providers. If you have a requirement for a flatfile database to be provided locally, please contact us to discuss your requirements.
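To illustrate why an index is needed for entry retrieval, the following is a minimal Python sketch that records the byte offset of each record in a FASTA-style flatfile and then seeks directly to a requested entry. It is not the EMBOSS index format; the function names (build_offset_index, fetch_record) and the example records are hypothetical and chosen purely for illustration.

```python
import io


def build_offset_index(handle):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # The ID is taken to be the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(handle, index, record_id):
    """Seek straight to a record and read it up to the next header line."""
    handle.seek(index[record_id])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


# Tiny in-memory stand-in for a flatfile (hypothetical records).
flatfile = io.BytesIO(
    b">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n"
)
index = build_offset_index(flatfile)
print(fetch_record(flatfile, index, "seq2"))
```

Without the index, every lookup would have to scan the file from the start; with it, the cost of a lookup no longer depends on how many records precede the one requested. Production indexing tools store the index on disk so that it only has to be built once.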
Indexed Reference Databases
In order for a database to be searched by an application, it usually requires indexing in a format which is specific to that application. Examples of such applications are BLAST, bwa, bowtie and hmmer. These indexing processes can take a considerable time, so we maintain pre-indexed copies of various databases along with other useful reference datasets.
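For those who do need to build an index themselves, the sketch below assembles the usual indexing command for each of these tools as an argument list and prints it. The file names (genome.fa, genome, profiles.hmm) are hypothetical, the BLAST command assumes a nucleotide database, and bowtie2-build assumes Bowtie 2 (Bowtie 1 provides bowtie-build, which takes its arguments in the same order). Nothing is executed here; each list could be passed to subprocess.run on a system where the tool is installed.

```python
import shlex


def index_commands(fasta, prefix, hmm_file):
    """Return the indexing command for each tool as an argument list.

    All three arguments are illustrative placeholders, not real paths.
    """
    return {
        # BLAST+: use "-dbtype prot" instead for protein sequences.
        "blast": ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", prefix],
        # BWA: -p sets the prefix of the index files that are written.
        "bwa": ["bwa", "index", "-p", prefix, fasta],
        # Bowtie 2: the reference comes first, then the index basename.
        "bowtie2": ["bowtie2-build", fasta, prefix],
        # HMMER: hmmpress prepares a profile HMM file for hmmscan.
        "hmmer": ["hmmpress", hmm_file],
    }


commands = index_commands("genome.fa", "genome", "profiles.hmm")
for tool, cmd in commands.items():
    print(f"{tool}: {shlex.join(cmd)}")
```

Each tool writes its own set of index files, which is why an index built for one application cannot be reused by another.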
As with flatfile databases, we previously maintained a large volume of BLAST databases; however, these were seeing only limited usage. Consequently, the scope of our holdings has been reduced to UniProt, RefSeq and the NCBI non-redundant sequence databases nr and nt (for protein and nucleotide sequences respectively), along with certain popular organism genomes from Ensembl. Other databases or taxonomic subsets of databases can be made available on request. See for details on available BLAST databases.
Organism Reference Databases
We also maintain indexed databases of common reference datasets required for the analysis of specific organisms. These typically include the reference genome sequence, indexed for both the BWA and Bowtie read alignment algorithms. Additional databases may be available for particular organisms, e.g. the Human 1000 Genomes variant call set and the GATK resource bundle. Available datasets can be found in '/data/databases/reference_data' on BSS managed systems. Please contact us if you have requirements which are not currently met by these resources.
Relational databases are usually used to manage a collection of data, generally in association with a software interface that permits data curation and/or querying. A relational database is stored using software such as MySQL or Postgres. Distribution of bioinformatics databases in relational form is becoming increasingly common as the complexity of the data increases. We maintain copies of certain other databases in relational form (e.g. GO, the Gene Ontology) to provide programmatic interfaces to the data using publicly available APIs (Application Programming Interfaces), which are extremely useful for developing tools that incorporate data from these sources.
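As a rough illustration of programmatic access to relational data, the sketch below uses Python's built-in sqlite3 module in place of MySQL or Postgres. The two-table schema (term, is_a) is hypothetical and is not the schema of the Gene Ontology database, and the three rows are illustrative only: the relationships are simplified and should not be read as the current ontology.

```python
import sqlite3

# In-memory database standing in for a MySQL or Postgres server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE term (
        id   INTEGER PRIMARY KEY,
        acc  TEXT UNIQUE,
        name TEXT
    );
    CREATE TABLE is_a (
        child_id  INTEGER REFERENCES term(id),
        parent_id INTEGER REFERENCES term(id)
    );
    """
)

# Illustrative rows only (hypothetical, simplified relationships).
conn.executemany(
    "INSERT INTO term (id, acc, name) VALUES (?, ?, ?)",
    [
        (1, "GO:0008150", "biological_process"),
        (2, "GO:0008152", "metabolic process"),
        (3, "GO:0009987", "cellular process"),
    ],
)
conn.executemany("INSERT INTO is_a VALUES (?, ?)", [(2, 1), (3, 1)])


def child_terms(connection, parent_acc):
    """Return (accession, name) for each direct child of the given term."""
    rows = connection.execute(
        """
        SELECT child.acc, child.name
        FROM is_a
        JOIN term AS child  ON child.id  = is_a.child_id
        JOIN term AS parent ON parent.id = is_a.parent_id
        WHERE parent.acc = ?
        ORDER BY child.acc
        """,
        (parent_acc,),
    )
    return rows.fetchall()


print(child_terms(conn, "GO:0008150"))
```

A join across linked tables such as this is awkward to express against a flatfile, which is one reason more complex datasets tend to be distributed in relational form.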
We can also build relational databases to suit your project's storage requirements. Please see the Projects page for examples of the kinds of work we have previously carried out.