London Algorithmic Workshop (LAW) 2007

 

King's College London, UK

7-8 February 2007



Abstracts of the talks

 

Tugkan Batu (London School of Economics)
Oblivious String Embeddings and Edit Distance Approximations
Abstract:
We introduce an oblivious embedding that maps strings of length n under
edit distance to strings of length at most n/r under edit distance for
any value of parameter r. For any given r, our embedding provides a
distortion of \tilde{O}(r^{1+\mu}) for some \mu=o(1), which we prove to
be (almost) optimal. The embedding can be computed in
\tilde{O}(2^{1/\mu}n) time.

We also show how to use the main ideas behind the construction of our
embedding to obtain an efficient algorithm for approximating the edit
distance between two strings. More specifically, for any 1> \epsilon
\geq 0, we describe an algorithm to compute the edit distance ed(S,R)
between two strings S and R of length n in time
\tilde{O}(n^{1+\epsilon}), within an approximation factor of
\min\{n^{\frac{1-\epsilon}{3}
+o(1)},(ed(S,R)/n^\epsilon)^{\frac{1}{2}+o(1)}\}.
For the case of \epsilon=0, we get a \tilde{O}(n)-time algorithm that
approximates the edit distance within a factor of
\min\{n^{\frac{1}{3}+o(1)},\ed(S,R)^{\frac{1}{2}+o(1)}\}, improving the
recent result of Bar-Yossef et al.~\cite{BarYossefJKK04}.

Joint work with Funda Ergun and Cenk Sahinalp.
 
Marie-Pierre Béal (Université de Marne-la-Vallée)
On the synchronization delay of complete local automata
Abstract:
A (non-deterministic) local automaton is an automaton for which there
is an integer d such that any label of a path of length d is a
synchronizing word. The smallest integer d such that the property
holds is called the synchronization delay of the automaton.
Czeizler and Kari proved that an n-state local complete automaton
has a synchronization delay at most n-1. We give an alternative
proof of this result that makes it a consequence of previous
work from Césari on monoids of unambiguous relations.
 
Alexander Chashkin (Moscow State University)
Two algorithms for computing monotone Boolean functions
abstract:
Two new algorithms for computing monotone Boolean functions are presented.
These algorithms have a simple structure and are efficient for functions
with small numbers of variables.
The first algorithm is over a complete Boolean basis and the second one is
over a monotone Boolean basis.
 
Frantisek Franek (McMaster University)
On a lower bound of the maximum number of runs
Abstract:
A run is a succinct notion capturing the idea of a maximal
non-extensible repetition. Knowing all runs in a string is thus
important in many context -- data compression, cryptography,
computational biology. It is natural thus to investigate the
maximum-number-of-runs function q(n) = max{ r(x) | |x|=n }, where
r(x) is the number of runs in a string x. In 2000 Kolpakov and Kucherov
determined that q(n) has a linear upper bound. In 2006 Rytter radically
improved this to q(n) < 5n. Later Rytter, and independently Smyth,
Simpson, Puglisi lowered the upper bound to 3.5n (not yet published).
There are indications that the lower bound can in fact be pushed as low
as 1.5n.
We present two different methods for creating a recursive sequence of
strings "rich in runs". The first method is based on Franek, Smyth,
Simpson method presented in 2003, the second method is based on the idea
of loose cube-free strings. Both methods allow to create a sequence of
binary strings {x_n}, starting with a given string x_0, so that the
limit of the ratio r(x_n)/|x_n| for n tending to infinity is approaching
the value of alpha that is approximately 0.927. Then we present a method
how to use these constructions to produce a family of asymptotic lower
bounds for q(n). These lower bounds are arbitrarily close to (alpha)n,
i.e. for any epsilon, for sufficiently large n's, q(n)>(alpha-epsilon)n.
In 2003 Franek, Smyth, Simpson conjectured that in fact the limit of the
ratio q(n)/n for n tending to infinity is alpha. The fact that two
different construction methods lead to sequences with identical limits
lends some substantiation to the conjecture.
 
Xiangchao Gan (King's College London)
Microarray missing data imputation based on biological knowledge
abstract:
In this workshop, some basic knowledge about microarry experiments is introduced first. Gene expressions measured using microarrays usually suffer from the missing value problem. Although existing missing value imputation algorithms have shown good performance to deal with missing values, they also have their limitations. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. In our algorithm, we take into consideration three kinds of information: the local correlation structure among genes in microarray data, the global correlation structure among arrays and the biological phenomenon of synchronization loss in microarray experiments. Experiments show that our algorithm can achieve a significant reduction of error compared to the available methods.
 
Gregory Gutin (Royal Holloway, University of London)
Algorithms for Directed Maximum Leaf Problems
abstract:
We prove that finding a rooted subtree with at least k leaves in a
directed graph is a fixed parameter tractable problem. Similar
result holds for finding rooted spanning trees in strongly connected
directed graphs. In fact, our algorithms are linear time for a fixed
k. This settles completely one open question asked by Mike Fellows
and solves another one for strongly connected directed graphs. Our
algorithms are based on a combinatorial result which can be seen as
a generalization of many results for a `spanning tree with many
leaves' in undirected graphs and is interesting on its own. In this
regard, we show that every strongly connected digraph on n vertices
with minimum in-degree at least 3, contains a rooted spanning tree
with at least $(n/2)^{1/5}$ leaves.
 
Jan Holub (CTU Prague)
Evolutive Pattern Retrieval using Suffix Automata
abstract:
 
Lucian Ilie (Universite de Marne-la-Vallee & University of Western Ontario)
Repetitions in Strings
Abstract:
Repetitions in strings constitute one of the most fundamental areas of string combinatorics with very important applications to text algorithms, data compression, or analysis of biological sequences. Squares have been studied already a century ago at the foundations of stringology and runs (maximal repetitions) helped obtaining the first linear-time algorithms for computing all repetitions in a string. I shall discuss a number of recent results and open problems concerning the number of squares and runs in a string.
 
Rob Irving (University of Glasgow)
A 5/3-approximation algorithm for a hard case of stable marriage
Abstract:
The stable marriage problem (SM) has been studied over many years, both
for its intrinsic interest and for its important practical applications.
These are well established in large scale centralised matching schemes,
particularly in the medical employment field, and involve the many-one
extension of SM, which, as a result has come to be known as the
Hospitals/Residents problem (HR).

In classical versions of these problems where all preferences are
strict, there may be many stable matchings, but all stable matchings
have the same size. However, when preference lists can contain ties, as
is often the case in practice, stable matchings can be of different
sizes. In this context, the problem of finding a maximum cardinality
stable matching is NP-hard, even when there are severe restrictions on
the nature of the ties.

There is a trivial polynomial-time 2-approximation algorithm, and recent
papers have presented various approximation algorithms, some for special
cases and some for the general case. The best of these, to appear in a
forthcoming STACS paper by Iwama et al, is a 15/8-approximation
algorithm for the general case.

Here we focus on a special case that is important in real applications -
it reflects the practice in at least one large scale medical matching
scheme where hospitals are permitted a single tie, of arbitrary size, at
the end of their list. We describe a polynomial-time 5/3-approximation
algorithm for this case. This is the best performance guarantee known
for an deterministic approximation algorithm for any hard case of stable
marriage.
  
Jens Kleinjung (National Institute for Medical Research)
Automatic derivation of maximally representative database subsets based
on fragments.
Abstract:
The protein structure database (PDB) contains currently more than 40000
structures, with a yearly increase of about 5000 structures.
The size of the PDB poses a considerable challenge to computationally
intensive applications, such as molecular dynamics, structure prediction
and molecular docking simulations.

Protein structures can be clustered into hierarchical classes representing
different levels of similarity, such as 'topology', 'homologous
superfamily' or 'homologous family'. Hierarchical protein structure
clustering is exemplified by the SCOP and CATH databases. Taking a
different viewpoint, those similarities that allow for the clustering of
structures can also be regarded as redundancies. We were interested in
deriving subsets of the PDB database to reduce the computational expense
in studies using the above mentioned applications.

To this end we have developed a method to extract subsets from the PDB
database that are maximally representative for structure fragments of a
given size. Protein structures were translated into character strings by
means of a structural alphabet. Using a combination of a Suffix Tree and a
Genetic Algorithm, random combinations of protein structures (subsets)
were created and assessed through an entropy-based fitness score. We
employed the Suffix Tree for efficient entropy and coverage computations,
while the Genetic Algorithm was used to sample and select subsets.
The entropy score reaches maximal values when structure fragments
(substrings) of a given size are evenly distributed in the subset.
Therefore, maximal representativeness is equivalent with maximal
information content based on a meta-alphabet of constant length words.

Optimised subsets yielded fitness values with Z-values of about 10 sigma
when compared to the fitness of random subsets, demonstrating the high
efficiency of the selection process. We obtained fragment coverage of 90%
for fragment lengths 1 to 3 in subsets that were 10% of the PDB database
size, and about 30% coverage for fragments of length 9.
We conclude from these results that the optimised PDB database subsets
capture most of the fragment composition and properties of the original
set. Further studies on structure-function properties of fragments and
their distribution within the PDB are underway.
   
Thierry Lecroq (University of Rouen)
Fast exact string matching algorithms
Abstract:
We present two different startegies that lead to fast
exact string matching algorithms. First we present an efficient
method to compute the best matching shift of the Boyer-Moore algorithm.
The searching algorithm using this best matching shift is the most
efficient in
particular cases such as the search for patterns of length from 7 to
15 in natural language texts. Then we propose a very fast new family
of string
matching algorithms based on hashing $q$-grams. These new algorithms
can be viewed as a secial case of the Wu-Manber multiple string
matching algorithm. The new algorithms are the fastest on many cases,
in particular on small size alphabets.
 
Barnaby Martin (University of Durham)
The Computational Complexity of the Regular Resolution Width Problem
Abstract:
By exploiting a connection between k-width resolution refutations and winning
strategies in k-pebble existential Ehrenfeucht-Fraisse games, Hertel and Urquart
have recently proved the following problem complete for Exponential time: given
a set of clauses C and a positive integer k, does C have a resolution refutation
of width k?. We address the related problem for regular resolution, i.e.
resolution in which a variable may be resolved only once along any path. We
prove that the ensuing problem is complete for PSPACE. Along the way, we
introduce a novel Ehrenfeucht-Fraisse game problem, itself complete for PSPACE.
 
Bořivoj Melichar (CTU Prague)
Palindromes and Finite Automata
Abstract:
Palindromes are strings of the following forms:
wwR, where wR is reversed form of w,
wcwR, where c is “central marker”.
The first form is often called even palindrome having even number of symbols.
The second form of palindrome is called odd palindrome having odd
number of symbols. It is matter of fact, that reading palindrome from left
to right and rigt to left leads to reading the same string. The examples of
palindromes are:
deed level now I won
An approach how to find palindromes in a string is shown using a finite
automata approach. This is a result of the general effort to use a unified approach
in stringology for at least some problems. This research is performed
by Prague Stringology Club, Prague, Czech Republic.
 
Laurent Mouchard (University of Rouen)
Searching for shortest tags in a set of sequences using a suffix array
Abstract:

Pierre  Peterlongo (IRISA/INRIA Sybiose team)
Utilisation of Subset Seeds on a Reconfigurable Architecture
abstract:
As it's commonly known in the bioinformatic community, the amount of
available biological data is huge and it's evolution is exponential.

The well known program Blast~\cite{Altschul1990} is able to quickly
find local similarities among biological sequences. As this task is
the key of much biological analysis, Blast is by far the most used
bioinformatic tool. Because of the evolution of the size of the banks
and of the size of the data used in studies, one still needs to
increase it's speed as well as its sensibility.

We propose a method both based one the use of subset
seeds~\cite{Kucherov2006} and one the use of a specialised
architecture called ReMiX~\cite{Lavenier2006}.  We generated sets of
alignments obtained with ssearch~\cite{Lipman1985}. On the first tests
we did, subsets seeds detect 98 \% of this alignments while blast
detects 96 \% of them. At first glance, we observe that the difference
concerns relevant biological data.

Furthermore, the use of a ReMiX will permit to increase the speed of
the application of the programme. ReMiX disposes of 512 Gb of flash
memory in close association with a FPGA hardware allowing the
parallelisation of Hamming distance computation.
 
Wojciech Rytter (Warsaw University)
On some Occurrence  and Lexicographic  Properties of  Standard Sturmian  Words
abstract:
 
Bill Smyth (McMaster University)
Fast & Practical Algorithms for Computing All the Runs in a String
Abstract:
A repetition in a string x is a substring w = ue of x, maximum
e >= 2, where u is not itself a repetition in w. A run in x is a substring
w = ueu*of “maximal periodicity”, where ue is a repetition and
u* is a maximum-length possibly empty proper prefix of u. A run may
encode as many as |u| repetitions. The maximum number of repetitions
in any string x = x[1..n] is well known to be \Theta(n log n). In 2000 Kolpakov
& Kucherov showed that the maximum number of runs in x is O(n); they
also described a \Theta(n)-time algorithm, based on Farach’s \Theta(n)-time suffix
tree construction algorithm (STCA), \Theta(n)-time Lempel-Ziv decomposition,
and Main’s \Theta(n)-time leftmost runs algorithm, to compute all the
runs in x. Recently Abouelhoda et al. proposed a \Theta(n)-time Lempel-
Ziv decomposition algorithm based on an “enhanced” suffix array —
that is, a suffix array together with other supporting data structures. In
this paper we introduce a collection of fast space-efficient algorithms for
computing all the runs in a string that appear both in their theoretical
behaviour and in practice to be superior to those previously proposed.
 
Alexander Tiskin (University of Warwick)
Semi-local string comparison: algorithmic application
Abstract:
For two strings, the longest common subsequence (LCS) problem
consists in comparing them by computing the length of their LCS.
In a previous paper, we defined a more general form of string
comparison, called ``the all semi-local LCS problem'', for which
we proposed an efficient geometric output representation, and an
algorithm based on fast implicit highest-score matrix
multiplication. This algorithm turns out to be a useful plug-in,
which allows to unify, and often to improve, various subsequence-
related algorithms considered in the past. The talk will describe
the main algorithm and several of its applications.
 
Gabriel Valiente (Technical University of Catalonia)
Computing the Transposition Distance between Phylogenetic Trees
Abstract:
Unlike most metrics for phylogenetic tree comparison that are based
on elementary edit operations, the transposition distance between
phylogenetic trees can be computed in polynomial time. As a matter of
fact, a linear time algorithm is known for computing the
transposition distance between two fully resolved phylogenetic trees.
In this talk, we recall a bijection between fully resolved
phylogenetic trees and matching permutations, present an embedding of
general phylogenetic trees with $n$ leaves labeled $1,\ldots,n$ into
the symmetric permutation group $\mathcal{S}_{2n-2}$, and present a
linear time algorithm for computing the transposition distance
between two general phylogenetic trees.
 
Michal Voracek (CTU Prague)
The Constrained Longest Common Subsequence Problem for Degenerate Strings - A Finite Automata Approach
abstract:
Given strings X, Y and V. U is a common subsequence of X and Y if U is a
subsequence of both X and Y.
U is constrained common subsequence of X and Y with respect to V if U is
a common subsequence of X and Y and V is a subsequence of U.
U is the constrained longest common subsequence of X and Y with respect
to V
if U is the longest among all the constrained common subsequences of X
and Y with respect to V.
The CLCS problem for given strings X, Y and V is to find their CLCS with
respect to string V.

We introduce a new variant of the CLCS problem by considering X and Y to
be degenerate strings (strings of sets of symbols).
We present a simple finite automata based algorithm for solving the new
presented problem working in O(|\Sigma| |X| |Y| |V| ) time and space.
Our algorithm can also be used for solving the CLCS problem for multiple
strings.
 

 

 
  

This workshop is sponsored by the Department of Computer Science, King's College London