|
|
|||||||||
|
|||||||||
|
|
|||||||||
The project development platforms are to be discussed with the supervisor.
Background: mainly CSMACMB or CSMTSP
Subject: to be agreed with the supervisor. Typically, a project aimed at implementing algorithms in the domains of:
|
|
Deliverables: report, software and its documentation.
Aim: the project is on the design of a compact indexing data structure, called Compact Suffix Array. The project consists in implementing the algorithms to query the index.
Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.
Deliverables: report, program and its documentation.
References:
V. Mäkinen and G. Navarro.
Compressed Compact Suffix Arrays.
CPM 2004: 420-433.
S. Grabowski, G. Navarro, R. Przywarski, A. Salinger and V. Mäkinen.
A Simple Alphabet-independent Fm-index.
Int. J. Found. Comput. Sci. 17(6): 1365-1384 (2006).
Aim: the goal of the project is to implement a string-searching algorithm running in real-time on any text and using a fixed amount of memory.
Plan: literature review based on the article below, algorithm implementation, test on real data, documentation.
Deliverables: report, program and its documentation.
Reference:
D. Breslauer, R. Grossi and F. Mignosi.
Simple Real-Time Constant-Space
String Matching, Combinatorial Pattern Matching ,
volume 6661 of LNCS, pages 173--183. Springer, 2011.
Aim: searching a book, a library, or a collection of Web sites is rather done with an index. Searching remains to execute operations on the index, which is much faster than exploring the data themselves. The project consists in implementing an algorithm that builds an approximate index on a text and some operations related to it.
Plan: study of the indexing methods described in the article, implementation of the index, documentation of software.
Deliverables: report, approximate index software and its documentation.
Reference:
M. Crochemore, G. Tischler.
Gapped Suffix Arrays: a New Index Structure
for Fast Approximate Matching,
In E. Chavez and S. Lonardi, editors, String
Processing and Information Retrieval - SPIRE 2010 , volume 6393 of LNCS,
pages 359--364. Springer, 2010.
Aim: the project is on the detection of segments (strings) that do not occur in genomic sequences. Present solutions make use of index data structures such as Suffix tree, Suffix array, or Suffix automaton. The project consists in implementing an algorithm that reports absent words and in testing its time and space efficiency on real data.
Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.
Deliverables: report, program and its documentation.
Reference:
J. Herold, S. Kurtz, and R. Giegerich.
Efficient computation of absent
words in genomic sequences,
BMC Bioinformatics 2008, 9:167, doi:10.1186/1471-2105-9-167.
Aim: the project is on the detection of microsatellites in genomic sequences based on the method described in the article below. The originality of the method is that it accounts for some polymorphism in the detected segments. The project consists in implementing the method and running it on real data.
Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.
Deliverables: report, program and its documentation.
Reference:
J. Tang et al.
Large-scale identification of
polymorphic microsatellites using an in silico approach,
BMC Bioinformatics 2008, 9:374, doi:10.1186/1471-2105-9-374.
Aim: the project focuses on a new heuristics for Multiple Sequence Alignement (MSA). Heuristics are important to solve efficiently the problem that is known to be NP-hard. The project consists in implementing the method described in the article below and running it on real data.
Plan: literature review based on article below, choice of an implementation, test on real data, documentation.
Deliverables: report, program and its documentation.
Reference:
A. Szabo, A. Novak, I. Miklos, and J. Hein
T. D. Otto et al.
Reticular Alignment: A progressive corner-cutting
method for multiple sequence alignment,
BMC Bioinformatics 2010, 11:570, doi:10.1186/1471-2105-11-570.
Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all runs, that is, maximal repetitions of exponent at least 2. A local period at a position $i$ on a string is the shortest nonempty square centered at $i$.
Plan: Implementation of a simple algorithm that reports all local periods, search for sites presenting similar software, study and implementation of the algorithm in reference below, extend the algorithm to detect centers of runs, documentation.
Deliverables: report, local period and run detection software and its documentation.
References:
J.-P. Duval, R. Kolpakov, G. Kucherov, T. Lecroq, and A. Lefebvre.
Linear-time computation
of local periods. Theoretical Computer Science, 326(1-3) (2004) 229-240.
Aim: the project consists in implementing a text compression algorithm based on the notion of antidictionary. The algorithm reduces the size of the input by considering segments that do not appear in it: the minimal forbidden factors. An antidictionary is a set of well chosen minimal forbidden factors for the computation of which several strategies have to be implemented.
Plan: study of forbidden factors, study and implementation of compression and decompression algorithms related to a given antidictionary, Report with experiments on the choice of an antidictionary, software and its documentation.
Deliverables: compression/decompression software, its documentation, and report on the experiments in the form of a Web site.
Reference:
M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi.
Data compression using antidictonaries.
Proceedings of the I.E.E.E. 88 (11), 2000, pp. 1756–1768.
Aim:
the project consists in implementing a coding system based on special automata.
In the first part, the system is given by a list of "forbidden" strings and
a set of unconstrained position. The system is then represented by an automaton
that is able to encode any source text into a text in which none of the given
strings appear.
In the second part, the project will consider how to generate lists of forbidden
strings from a given text. This will be applied to the detection of periodic
forbidden segments in coding genes.
The software will implement the two parts of the coding system.
The reference below contains material for the project.
Plan: study of periodic coding systems, study and implementation of a suffix automaton construction algorithm, implementation of algorithms used to build the coding system, implementation of algorithms used to detect periodic forbidden segments, examples and tests on coding genes, documentation.
Deliverables: report, entire coding system software and its documentation.
Reference:
M.-P. Béal, M. Crochemore, and G. Fici.
Presentation of constrained systems with
Unconstrained Positions
IEEE Transactions on Information Theory, 51(5):1891--1900, 2005.
Aim: searching a book, a library, or a collection of Web sites is rather done with an index. Searching remains to execute operations on the index, which is much faster than exploring the data themselves. The project consists in implementing an algorithm that builds an approximate index on a text and some operations related to it.
Plan: study of exact indexing methods, search for references on approximate indexing, study the article below, implementation of the index, documentation of software.
Deliverables: report, approximate index software and its documentation.
Reference:
M. Crochemore, C. Epifanio, A. Gabriele, and F. Mignosi.
On the Suffix automaton with mismatches,
Implementation and Application of Automata, 12th International Conference, CIAA 2007,
LNCS 4783, pp. 144-156. Springer-Verlag, Berlin, 2007.
Aim: the project is on the detection of similarities between texts. many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all repeats including don't care symbols.
Plan: review of existing software to detect plagiarism, study of algorithmic methods, implementation of several algorithm, experimentation on texts in natural language and on source codes, documentation.
Deliverables: report, plagiarism software and its documentation.
Reference:
R. W. Irving
Plagiarism
and collusion detection using the Smith-Waterman algorithm,
TR-2004-164, Dept of Computing Science, University of Glasgow, 2004.
Aim: the project consists in implementing a motif inference algorithm aimed at reporting repetitions in molecular sequences. Motifs are strings drawn from an alphabet containing a don't care that matches all letters. Interesting motifs are maximal and irreducible. The software will report lists of motifs satisfying some extra properties. The reference below contains material for the project.
Plan: study of motif definitions, study of generating algorithms, search for sites presenting similar algorithms or software, implementation, documentation.
Deliverables: report, motif inference software and its documentation.
Reference:
N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot.
A Basis for Repeated Motifs in Pattern Discovery and Text Mining
Technical report 2002-10, Institut Gaspard-Monge, University of Marne-la-Vallée, 2002.
Aim: the project consists in implementing a text compression algorithm described in the article below. The algorithm makes use of an algorithm that sorts the suffixes of a text.
Plan: study and implementation of a few suffix sorting algorithms, study and implementation of the compression and decompression algorithms, experiments on the choice of the suffix sorting method, documentation.
Deliverables: report, compression/decompression software and its documentation.
Reference:
A. J. Ferreira , A. L. Oliveira , M. A. T. Figueiredo.
Time and Memory Efficient Lempel-Ziv
Compression Using Suffix Arrays. To appear.
Aim: the project consists in implementing a text compression algorithm described in the article below. The algorithm is a variation of a standard method used in many compression software.
Plan: study of LZ method for text compression, study and implementation of the compression and decompression algorithms, experiments, documentation.
Deliverables: report, compression/decompression software and its documentation.
Reference:
For an optional project on compression:
A. Apostolico.
Of Lempel-Ziv-Welch Parses with Refillable Gaps.
Data Compression Conference (DCC 2005), 2005, pp. 338-347.
Aim: Recent cybersecurity incidents suggest that Internet worms can spread so fast that in-time human-mediated reaction is not possible, and therefore initial response to cyberattacks has to be automated. The first step towards combating new unknown worms is to be able to detect and identify them at the first stages of their spread. The paper below presents a method for detecting new worms based on identifying similar packet contents directed to multiple destination hosts.
Plan: Study and Implementation of a method described in the paper, possible extension of the technique, evaluation of the implementation, documentation.
Deliverables: report, worm detection software and its documentation.
References:
P. Akritidis, K. Anagnostakis, E. P. Markatos.
Efficient Content-based Worm Detection.
Proceedings of the 40th IEEE International Conference on Communications (ICC '05), 2005.
Aim: the project is on the detection of repetitive elements in genome survey sequences. These sequences offer a large view of a genome because they include non coding DNA, and their detection is important for sequencing projects. The project consists in implementing the method described in the article below and running it on real data.
Plan: literature review based on article below, choice of an implementation, test on real data, documentation.
Deliverables: report, program and its documentation.
Reference:
T. D. Otto et al.
ReRep: Computational detection
of repetitive sequences in genome survey sequences (GSS),
BMC Bioinformatics 2008, 9:366 doi:10.1186/1471-2105-9-366.
Aim: Automata provide highly efficient implementations of methods related to coding, searching, parsing, etc. But their size matters. The aim of the project is to implement a minimization procedure for a specific kind of automata. The algorithm processes the automaton in an original way compared to known methods and runs in linear time.
Plan: study AFT automata and their minimization in the reference below, search for references on the problem, implementation of the minimization algorithm, documentation of the software.
Deliverables: report, minimization software and its documentation.
Reference:
M.-P. Béal and M. Crochemore,
Minimizing local automata,
IEEE International Symposium on Information Theory}
pp. 1376-1380, IEEE, 2007.
Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all leftmost occurrences of maximal repetitions having an exponent at least 2. Implementation will use either the suffix tree of the input string or its suffix automaton.
Plan: study of run detection in reference below, search for sites presenting similar software, implementation of chosen suffix data structure, implementation of run detection algorithm, documentation.
Deliverables: report, run detection software and its documentation.
References:
M. G. Main.
Detecting leftmost maximal periodicities.
Discret. Appl. Math., 25:145--153, 1989.
R. Kolpakov and G. Kucherov.
Finding
maximal repetitions in a word in linear time.
In Proceedings of the 40th IEEE Annual Symposium on Foundations of Computer Science,
pages 596--604, New York, 1999. IEEE Computer Society Press.
Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all repeats including don't care symbols.
Plan: study of repeat detection algorithm in the reference below, search for sites presenting similar software, implementation of the repeat detection algorithm, documentation.
Deliverables: report, repeat detection software and its documentation.
Reference:
R. Kolpakov and G. Kucherov,
Finding repeats with fixed gaps,
7th International Symposium on String Processing and Information Retrieval (SPIRE'99),
2000, IEEE, pp. 162--168.
Aim: the notion of index on a text is central in many methods for text processing and for the management of textual databases. Suffix Arrays is one of these methods based on the sorted list of suffixes of the input text. The project consists in implementing a linear-time sorting algorithm and other elements related to Suffix Array construction and to Burrows-Wheeler text compression.
Plan: study of the sorting problem in the literature starting with the reference below. Implementation of the sorting algorithm and the LCP computation to obtain a Suffix Array construction software. Then, using this work, implementation of the algorithms described in the second reference below.
Deliverables: report, suffix sorting and associated software and their documentation.
References:
J. Kärkkäinen and P. Sanders,
Simple linear work suffix array construction,
in ICALP'03, LNCS 2719, Spinger, 2003, pp. 943--955.
M. Crochemore, J. Désarménien and D. Perrin,
A note on the Burrows-Wheeler transformation,
Theoret. Comput. Sci., 2005, to appear.
Aim: the project consists in building a Web interface in Java to visualize approximate string matching algorithms. Users of the interface will be able to enter texts and patterns and to see how several algorithms process them in detail. Examples of animations may be found at Exact String Matching Algorithms and Sequence comparison.
Plan: search for sites presenting animations of algorithms, study of several approximate string matching algorithms, proposal for their animation, implementation, documentation.
Deliverables: report, animation software and its documentation.
References:
M. Crochemore and W. Rytter,
Jewels of Stringology,
World Scientific, 2002, 310 pages.
G. Navarro and M. Raffinot,
Flexible Pattern Matching in Strings,
Cambridge University Press, 2002, 221 pages.
G.A. Stephen,
String Searching Algorithms,
World Scientific Publishing Co., 1994, 243 pages.
Aim: the project is to implement a method for comparing molecular sequences based on alignment. The prescribed algorithm is able to run on sequences compressed with Lempel-Ziv method and it runs efficiently on sequences having a small entropy. The alignment algorithm refines the standard dynamic programming method. Implementation will propose a visualization of the execution.
Plan: study the sub-quadratic sequence alignment algorithm in reference below, proposal for its visualization, implementation, documentation.
Deliverables: report, animation software and its documentation.
Reference:
M. Crochemore, G. Landau, and M. Ziv-Ukelson.
A sub-quadratic sequence alignment algorithm for
unrestricted cost matrices