The methods developed by the MIG Unit are made available online to the wider community of biologists, bioanalysts and bioinformaticians through the MIG software listed below. Because of the computing power they require and the volume of data they process, these applications are not necessarily suitable for use on users' workstations.
Moreover, several projects to set up Web Services are under development in the unit, the most advanced of which forms the framework of AGMIAL.
Here is a list of our main software:
- AGMIAL is an integrated system for bacterial genome annotation. It is currently used at INRA for the newly sequenced bacterial genomes Lactobacillus bulgaricus, Lactobacillus sakei and Flavobacterium psychrophilum, as well as for the re-annotation of Lactococcus lactis, Enterococcus faecalis and Enterococcus faecium.
- AlvisIR is a tool for document indexing and semantic search. It deals with synonyms, disambiguation and concept hierarchies, and uses an indexing component named Zebra. Our group has developed a web front-end to the search engine, which is integrated with Alvis NLP/ML. The software will soon be available under a freeware license. Part of this work has been funded by the European project Alvis and the French project Quaero.
- Alvis NLP/ML is a pipeline that annotates text documents using Natural Language Processing (NLP) tools for sentence and word segmentation, named-entity recognition, term analysis, semantic typing and relation extraction (see the paper by Nedellec et al. in Handbook on Ontologies, 2009, for a comprehensive overview). Most of these tools rely on resources such as terminologies or ontologies. Alvis NLP/ML contains several tools for (semi-)automatic acquisition of these resources, using Machine Learning (ML) techniques. New components can be easily integrated into the pipeline. The software will soon be available under a freeware license. Part of this work has been funded by the European project Alvis and the French project Quaero.
- AnovArray is a set of SAS subroutines for analysing microarray- and macroarray-type expression data. It quantifies biological and technological sources of variation and detects genes that are differentially expressed between several conditions. The statistical methods used are analysis of variance (ANOVA) and the False Discovery Rate (FDR) method for calculating probabilities adjusted for multiple hypothesis testing.
- Asium builds an ontology (hierarchy of concepts) from a text analysis. It is associated with the LP2LP software, which transforms Link Parser outputs into Asium inputs, and with another program that transforms an Asium output into an RDF output (contact Philippe Veber or Claire Nedellec).
- BasyLiCA is a user-friendly open-source interface and database dedicated to the automatic storage and standardized treatment of Live Cell Array data.
- Beluga is a platform we developed for indexing scientific literature, in order to extract associations between features over periods of time. Beluga offers several modules based on indexing documents according to the following features: references, authors, terms, countries, keywords, sources and institutions. Diachronic analysis of the corpus makes it possible to describe the topic structure of the documents through the underlying network of co-evolution between authors and terminology. For that purpose, learning processes, scoring and data visualization are used (download).
- Bioalvis is an instance of the AlvisNLP/ML/IR platform devoted to the molecular biology of micro-organisms; it indexes 400,000 PubMed references. It integrates the AlvisNLP/ML pipeline (for the annotation of article abstracts) and the AlvisIR search engine. See Bossy et al. at the 1st International Semantic Web Applications and Tools for Life Sciences Workshop (SWAT4LS). More details (.pdf) on the architecture and the resources.
- Cadixe is a graphical editor developed within the Caderige and ExtraPloDocs projects. It is used to annotate a text with XML tags, which are displayed using configurable conventions (color, style, font, size, etc.) defined by CSS (Cascading Style Sheets).
- Dynamocell visualizes a metabolic pathway together with its enzymatic and genetic regulations. It can also integrate the major available tools for the analysis of metabolic networks (contact Vincent Fromion).
- ESAP (Extended Simulated Analysis Process) is a program for predicting loop conformation in proteins. It is based on a Monte-Carlo technique in the space of dihedral angles (download).
- FADO (Favored or Avoided Distances between Occurrences) detects favored or avoided distances between occurrences of two motifs along sequences. It is available upon request (contact Sophie Schbath).
- GOR IV is a program for predicting the secondary structure of proteins. Three states are taken into consideration: alpha helix (H), beta strand (b) and aperiodic structures (C). The program is based on statistical considerations originating from information theory. It does not use multiple alignments. It achieves a Q3 of 65% (download).
- GOR V is derived from GOR IV by introducing information from multiple amino acid sequence alignments obtained with PSI-BLAST (Altschul et al. Nucl. Acids Res. 25, 3389, 1997). Its prediction accuracy, Q3, reaches 73.5%.
- hmmtiling implements the approach presented in our paper "Transcriptional landscape estimation from tiling array data using a model of signal shift and drift" (Nicolas et al., Bioinformatics, 2009). It takes as input the log intensities measured along the genome and outputs an estimated transcriptional landscape with a prediction of the breakpoints (typically promoters and terminators).
- ISLAND is a program that simulates the progress of a genome mapping project by the anchoring method. In particular, it provides the average number of contigs obtained, their average length and the average proportion of the genome covered by the contigs, as a function of genome length, number of clones and anchors, and clone length.
- KAKSI is a program for assigning protein secondary structure. The assignment into alpha helix (H), beta strands (b), turns (T) and coils (c) is based on characteristic distances between alpha carbons and on phi-psi angle values. The program also computes the curvature of the main chain (download).
- KASKAD is an integrative "knowledge grabber": a tool for scanning the biomedical literature to collect information about gene networks over time. The program is written in Perl/Tk and is freely available on the web (download). The main starting steps are the selection of the corpus (online from MEDLINE, or an existing one), the selection of a gene list with its patterns and the selection of a stage list with its patterns. Patterns have to be defined by the user.
- MuGeN (Multi-Genome Navigator) is an interactive tool for exploring several annotated genomes complemented by the results of in silico analyses. It can also run in batch mode to generate images in various formats, which means that it can be integrated into websites to display annotated physical maps. MuGeN is listed on the FreshMeat and Bioinformatics.Org portals.
- OSS-HMM (Optimal Secondary Structure prediction Hidden Markov Model) is software for secondary structure prediction (three states: alpha helix, H; beta strand, b; and coils, C) based on a hidden Markov model formalism. Used with a single sequence, it provides a Q3 of 68.8%. Used with a multiple sequence alignment, it provides a Q3 of 75.5%. This tool can also be used to generate protein sequences having a given secondary structure pattern (download).
- PCM (Pairwise Correlation Method) is a Matlab program for partitioning a co-occurrence matrix. It is used in DOMIRE (“DOMain Identification from Recurrence” in proteins); see Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ and Lee BK. Protein domain assignment from the recurrence of locally similar structures. PROTEINS: Structure, Function, and Bioinformatics, 2011; 79:853–866 (download).
- RBA_B168 1.0 computes the optimal resource distribution that maximizes biomass formation with respect to the extracellular medium, for the Gram-positive model bacterium Bacillus subtilis.
- RenBio is a program that identifies gene and protein names in a textual document using machine learning techniques. Its general-purpose version is AlvisNER. AlvisNER is integrated into Alvis NLP/ML and RenBio is integrated into Bioalvis (contact Robert Bossy).
- R'HOM (Research of HOMogeneous regions in DNA sequences) is software that uses hidden Markov chain models to segment DNA sequences into homogeneous regions. R'HOM makes it possible to estimate a more realistic model of DNA sequence composition than a homogeneous Markov chain model and then to segment the sequence under this model. It has been used in particular to look for horizontal transfers in B. subtilis and to estimate models designed to calculate the significance of word counts. R'HOM has been developed in cooperation with the Laboratoire Statistique et Génome in Evry. It is free.
- R'MES is a set of C++ programs devoted to the detection of motifs with an exceptional frequency in sequences (DNA, protein or other). It is freely available with a user guide and an online manual. R'MES has a companion tool, RMESPlot, which is available at http://mulcyber.toulouse.inra.fr/projects/rmesplot and provides a graphical user interface for visualizing the results generated by R'MES. RMESPlot comes with its own user guide.
- SHOW (Structure Homogeneities Watcher) is an adaptation of R'HOM that makes it possible to define a complex hidden Markov chain model flexibly and then to use this model in various ways through segmentation (forward-backward, Viterbi), estimation (EM) and simulation algorithms. So far, SHOW has mainly been used to predict bacterial genes, but it has also been used for other purposes such as splice site detection in humans. In the future, it should facilitate the development of models for studying numerous biological problems. SHOW has been developed in collaboration with the Laboratoire Statistique et Génome in Evry.
- SIMPA (SIMilar Peptide Analysis) is a program for predicting the secondary structure of proteins. Three states are taken into consideration: alpha helix (H), beta strand (b) and aperiodic structures (C). The program is based on the nearest-neighbour approach. It achieves a Q3 of 67% (download).
- STFilter (Sentence Filter) takes a set of abstracts in MEDLINE format as input and extracts the "relevant" sentences. The notion of relevance/irrelevance is learned automatically from examples of sentences classified as relevant and irrelevant. The learned classifiers available in STFilter deal with gene interactions in Bacillus subtilis, in Drosophila and in chicken. The software is free and is written in Java.
- SMF (Symmetric Matrix Factorization) is a Matlab program for partitioning a co-occurrence matrix. It is used in DOMIRE (“DOMain Identification from Recurrence” in proteins); see Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ and Lee BK. Protein domain assignment from the recurrence of locally similar structures. PROTEINS: Structure, Function, and Bioinformatics, 2011; 79:853–866 (download).
- svcR is an R package that takes a numerical matrix as input and computes clusters using a support vector clustering (SVC) method. We have implemented an original 2D-grid labeling approach to speed up cluster extraction. In this sense, svcR can be seen as an efficient cluster extractor when clusters are separable in a 2D map. We also showed that this SVC approach, using a Jaccard-radial basis kernel, can help to classify a set of terms into ontological classes and to define regular expression rules for extracting information from documents. The case study concerns a set of terms and documents about developmental and molecular biology (download).
- SVD (Singular Value Decomposition) is a Matlab program for partitioning a co-occurrence matrix. It is used in DOMIRE (“DOMain Identification from Recurrence” in proteins); see Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ and Lee BK. Protein domain assignment from the recurrence of locally similar structures. PROTEINS: Structure, Function, and Bioinformatics, 2011; 79:853–866 (download).
- treemm is dedicated to the unsupervised clustering of bacterial promoter sequences. It is based on the modelling of distinct classes of bipartite motifs designed to represent the binding sites of different Sigma factors. It accounts for the non-random distribution of such motifs across a tree that summarizes the correlation between promoter activity profiles. The approach was described in our paper "Condition-Dependent Transcriptome Reveals High-Level Regulatory Architecture in Bacillus subtilis" (Nicolas et al., Science, 2012).
- TyDI is a collaborative tool for the manual validation and annotation of terms, either originating from terminologies or extracted from a training corpus of textual documents. It is used on the output of so-called term extractors (such as Yatea), which identify candidate terms (e.g. compound nouns). With TyDI, a user can validate candidate terms and specify synonymy/hyperonymy relations. These annotations can then be exported in several formats and used in other natural language processing tools (contact Wiktoria Golik or Claire Nedellec).
- trish2 is a library for handling PATRICIA trees, a structure suited to large dictionary lookups. The library provides C functions to create, search, write and read PATRICIA trees. Tree nodes can be associated with arbitrary data, so a PATRICIA tree can act as a hash table. An executable is also provided for fast search in a list of fixed string patterns.
- VAST (Vector Alignment Search Tool) is a program for comparing protein 3D structures (download).
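
Several entries in the list above (GOR IV, GOR V, OSS-HMM, SIMPA) quote a Q3 score. As a reminder of what this figure measures, Q3 is the percentage of residues whose predicted three-state label matches the reference assignment. The short Python sketch below illustrates that definition only; the function name `q3_score` and the example strings are hypothetical and are not taken from any of the MIG programs.

```python
def q3_score(predicted: str, reference: str) -> float:
    """Return the Q3 accuracy, in percent, of a three-state prediction.

    Both arguments are strings with one state label per residue,
    e.g. H (alpha helix), b (beta strand) and C (other).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    # Count the positions where the predicted state equals the reference state.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


# Hypothetical 10-residue example (not real data): 8 of 10 states agree.
reference = "CHHHHCbbbC"
predicted = "CHHHCCbbCC"
print(f"Q3 = {q3_score(predicted, reference):.1f}%")  # prints "Q3 = 80.0%"
```

In practice the reference states come from an assignment computed on an experimental 3D structure (the kind of assignment KAKSI produces), and the score is averaged over a benchmark set of proteins rather than computed on a single chain.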