Department of Informatics
The Strand, London WC2R 2LS, UK
  Undergraduate enquiries:
Postgraduate enquiries:
Fax:
+44 (0)20 7848 2005
+44 (0)20 7848 2588
+44 (0)20 7848 2851
 
 



2011/12 MSc individual projects

Supervisor: Maxime Crochemore
Email: @kcl   Office: Strand Building S6.33


Project titles

The project development platforms are to be discussed with the supervisor.


Detailed Descriptions

Free topic

Background: mainly CSMACMB or CSMTSP

Subject: to be agreed with the supervisor. Typically, a project aimed at implementing algorithms in the domains of:
  • data compression
  • bioinformatics
  • data retrieval
  • applications of text processing
  • visualization of algorithms
  • combinatorics on words
Examples: antivirus, worm, plagiarism, Web search.

Deliverables: report, software and its documentation.

Compact suffix array

Aim: the project is on the design of a compact indexing data structure, called Compact Suffix Array. The project consists in implementing the algorithms to query the index.

Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.

Deliverables: report, program and its documentation.

References:
V. Mäkinen and G. Navarro. Compressed Compact Suffix Arrays. CPM 2004: 420-433.
S. Grabowski, G. Navarro, R. Przywarski, A. Salinger and V. Mäkinen. A Simple Alphabet-independent Fm-index. Int. J. Found. Comput. Sci. 17(6): 1365-1384 (2006).

Real-time string matching

Aim: the goal of the project is to implement a string-searching algorithm running in real-time on any text and using a fixed amount of memory.

Plan: literature review based on the article below, algorithm implementation, test on real data, documentation.

Deliverables: report, program and its documentation.

Reference:
D. Breslauer, R. Grossi and F. Mignosi. Simple Real-Time Constant-Space String Matching, Combinatorial Pattern Matching , volume 6661 of LNCS, pages 173--183. Springer, 2011.

Approximate indexing

Aim: searching a book, a library, or a collection of Web sites is rather done with an index. Searching remains to execute operations on the index, which is much faster than exploring the data themselves. The project consists in implementing an algorithm that builds an approximate index on a text and some operations related to it.

Plan: study of the indexing methods described in the article, implementation of the index, documentation of software.

Deliverables: report, approximate index software and its documentation.

Reference:
M. Crochemore, G. Tischler. Gapped Suffix Arrays: a New Index Structure for Fast Approximate Matching, In E. Chavez and S. Lonardi, editors, String Processing and Information Retrieval - SPIRE 2010 , volume 6393 of LNCS, pages 359--364. Springer, 2010.


Former projects

Absent words

Aim: the project is on the detection of segments (strings) that do not occur in genomic sequences. Present solutions make use of index data structures such as Suffix tree, Suffix array, or Suffix automaton. The project consists in implementing an algorithm that reports absent words and in testing its time and space efficiency on real data.

Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.

Deliverables: report, program and its documentation.

Reference:
J. Herold, S. Kurtz, and R. Giegerich. Efficient computation of absent words in genomic sequences, BMC Bioinformatics 2008, 9:167, doi:10.1186/1471-2105-9-167.

Polymorphic microsatellites

Aim: the project is on the detection of microsatellites in genomic sequences based on the method described in the article below. The originality of the method is that it accounts for some polymorphism in the detected segments. The project consists in implementing the method and running it on real data.

Plan: literature review based on article below, choice of an algorithmic methods, test on real data, documentation.

Deliverables: report, program and its documentation.

Reference:
J. Tang et al. Large-scale identification of polymorphic microsatellites using an in silico approach, BMC Bioinformatics 2008, 9:374, doi:10.1186/1471-2105-9-374.

Reticular Alignment

Aim: the project focuses on a new heuristics for Multiple Sequence Alignement (MSA). Heuristics are important to solve efficiently the problem that is known to be NP-hard. The project consists in implementing the method described in the article below and running it on real data.

Plan: literature review based on article below, choice of an implementation, test on real data, documentation.

Deliverables: report, program and its documentation.

Reference:
A. Szabo, A. Novak, I. Miklos, and J. Hein T. D. Otto et al. Reticular Alignment: A progressive corner-cutting method for multiple sequence alignment, BMC Bioinformatics 2010, 11:570, doi:10.1186/1471-2105-11-570.

Local Periodicities

Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all runs, that is, maximal repetitions of exponent at least 2. A local period at a position $i$ on a string is the shortest nonempty square centered at $i$.

Plan: Implementation of a simple algorithm that reports all local periods, search for sites presenting similar software, study and implementation of the algorithm in reference below, extend the algorithm to detect centers of runs, documentation.

Deliverables: report, local period and run detection software and its documentation.

References:
J.-P. Duval, R. Kolpakov, G. Kucherov, T. Lecroq, and A. Lefebvre. Linear-time computation of local periods. Theoretical Computer Science, 326(1-3) (2004) 229-240.

DCA Compression

Aim: the project consists in implementing a text compression algorithm based on the notion of antidictionary. The algorithm reduces the size of the input by considering segments that do not appear in it: the minimal forbidden factors. An antidictionary is a set of well chosen minimal forbidden factors for the computation of which several strategies have to be implemented.

Plan: study of forbidden factors, study and implementation of compression and decompression algorithms related to a given antidictionary, Report with experiments on the choice of an antidictionary, software and its documentation.

Deliverables: compression/decompression software, its documentation, and report on the experiments in the form of a Web site.

Reference:
M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi. Data compression using antidictonaries. Proceedings of the I.E.E.E. 88 (11), 2000, pp. 1756–1768.

Coding

Aim: the project consists in implementing a coding system based on special automata. In the first part, the system is given by a list of "forbidden" strings and a set of unconstrained position. The system is then represented by an automaton that is able to encode any source text into a text in which none of the given strings appear.
In the second part, the project will consider how to generate lists of forbidden strings from a given text. This will be applied to the detection of periodic forbidden segments in coding genes.
The software will implement the two parts of the coding system. The reference below contains material for the project.

Plan: study of periodic coding systems, study and implementation of a suffix automaton construction algorithm, implementation of algorithms used to build the coding system, implementation of algorithms used to detect periodic forbidden segments, examples and tests on coding genes, documentation.

Deliverables: report, entire coding system software and its documentation.

Reference:
M.-P. Béal, M. Crochemore, and G. Fici. Presentation of constrained systems with Unconstrained Positions IEEE Transactions on Information Theory, 51(5):1891--1900, 2005.

Approximate indexing

Aim: searching a book, a library, or a collection of Web sites is rather done with an index. Searching remains to execute operations on the index, which is much faster than exploring the data themselves. The project consists in implementing an algorithm that builds an approximate index on a text and some operations related to it.

Plan: study of exact indexing methods, search for references on approximate indexing, study the article below, implementation of the index, documentation of software.

Deliverables: report, approximate index software and its documentation.

Reference:
M. Crochemore, C. Epifanio, A. Gabriele, and F. Mignosi. On the Suffix automaton with mismatches, Implementation and Application of Automata, 12th International Conference, CIAA 2007, LNCS 4783, pp. 144-156. Springer-Verlag, Berlin, 2007.

Plagiarism

Aim: the project is on the detection of similarities between texts. many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all repeats including don't care symbols.

Plan: review of existing software to detect plagiarism, study of algorithmic methods, implementation of several algorithm, experimentation on texts in natural language and on source codes, documentation.

Deliverables: report, plagiarism software and its documentation.

Reference:
R. W. Irving Plagiarism and collusion detection using the Smith-Waterman algorithm, TR-2004-164, Dept of Computing Science, University of Glasgow, 2004.

Motifs

Aim: the project consists in implementing a motif inference algorithm aimed at reporting repetitions in molecular sequences. Motifs are strings drawn from an alphabet containing a don't care that matches all letters. Interesting motifs are maximal and irreducible. The software will report lists of motifs satisfying some extra properties. The reference below contains material for the project.

Plan: study of motif definitions, study of generating algorithms, search for sites presenting similar algorithms or software, implementation, documentation.

Deliverables: report, motif inference software and its documentation.

Reference:
N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot. A Basis for Repeated Motifs in Pattern Discovery and Text Mining Technical report 2002-10, Institut Gaspard-Monge, University of Marne-la-Vallée, 2002.

LZ Compression

Aim: the project consists in implementing a text compression algorithm described in the article below. The algorithm makes use of an algorithm that sorts the suffixes of a text.

Plan: study and implementation of a few suffix sorting algorithms, study and implementation of the compression and decompression algorithms, experiments on the choice of the suffix sorting method, documentation.

Deliverables: report, compression/decompression software and its documentation.

Reference:
A. J. Ferreira , A. L. Oliveira , M. A. T. Figueiredo. Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays. To appear.

LZW Compression

Aim: the project consists in implementing a text compression algorithm described in the article below. The algorithm is a variation of a standard method used in many compression software.

Plan: study of LZ method for text compression, study and implementation of the compression and decompression algorithms, experiments, documentation.

Deliverables: report, compression/decompression software and its documentation.

Reference:

For an optional project on compression:
A. Apostolico. Of Lempel-Ziv-Welch Parses with Refillable Gaps. Data Compression Conference (DCC 2005), 2005, pp. 338-347.

Worm Detection

Aim: Recent cybersecurity incidents suggest that Internet worms can spread so fast that in-time human-mediated reaction is not possible, and therefore initial response to cyberattacks has to be automated. The first step towards combating new unknown worms is to be able to detect and identify them at the first stages of their spread. The paper below presents a method for detecting new worms based on identifying similar packet contents directed to multiple destination hosts.

Plan: Study and Implementation of a method described in the paper, possible extension of the technique, evaluation of the implementation, documentation.

Deliverables: report, worm detection software and its documentation.

References:
P. Akritidis, K. Anagnostakis, E. P. Markatos. Efficient Content-based Worm Detection. Proceedings of the 40th IEEE International Conference on Communications (ICC '05), 2005.

Repetitions in GSS

Aim: the project is on the detection of repetitive elements in genome survey sequences. These sequences offer a large view of a genome because they include non coding DNA, and their detection is important for sequencing projects. The project consists in implementing the method described in the article below and running it on real data.

Plan: literature review based on article below, choice of an implementation, test on real data, documentation.

Deliverables: report, program and its documentation.

Reference:
T. D. Otto et al. ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS), BMC Bioinformatics 2008, 9:366 doi:10.1186/1471-2105-9-366.

Minimising automata

Aim: Automata provide highly efficient implementations of methods related to coding, searching, parsing, etc. But their size matters. The aim of the project is to implement a minimization procedure for a specific kind of automata. The algorithm processes the automaton in an original way compared to known methods and runs in linear time.

Plan: study AFT automata and their minimization in the reference below, search for references on the problem, implementation of the minimization algorithm, documentation of the software.

Deliverables: report, minimization software and its documentation.

Reference:
M.-P. Béal and M. Crochemore, Minimizing local automata, IEEE International Symposium on Information Theory} pp. 1376-1380, IEEE, 2007.

Runs in strings

Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all leftmost occurrences of maximal repetitions having an exponent at least 2. Implementation will use either the suffix tree of the input string or its suffix automaton.

Plan: study of run detection in reference below, search for sites presenting similar software, implementation of chosen suffix data structure, implementation of run detection algorithm, documentation.

Deliverables: report, run detection software and its documentation.

References:
M. G. Main. Detecting leftmost maximal periodicities. Discret. Appl. Math., 25:145--153, 1989.
R. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In Proceedings of the 40th IEEE Annual Symposium on Foundations of Computer Science, pages 596--604, New York, 1999. IEEE Computer Society Press.

Repeats

Aim: many problems on string processing arise from the fact that strings may contain consecutive repetitions. The project consists in implementing an algorithm that reports all repeats including don't care symbols.

Plan: study of repeat detection algorithm in the reference below, search for sites presenting similar software, implementation of the repeat detection algorithm, documentation.

Deliverables: report, repeat detection software and its documentation.

Reference:
R. Kolpakov and G. Kucherov, Finding repeats with fixed gaps, 7th International Symposium on String Processing and Information Retrieval (SPIRE'99), 2000, IEEE, pp. 162--168.

Sorting Suffixes

Aim: the notion of index on a text is central in many methods for text processing and for the management of textual databases. Suffix Arrays is one of these methods based on the sorted list of suffixes of the input text. The project consists in implementing a linear-time sorting algorithm and other elements related to Suffix Array construction and to Burrows-Wheeler text compression.

Plan: study of the sorting problem in the literature starting with the reference below. Implementation of the sorting algorithm and the LCP computation to obtain a Suffix Array construction software. Then, using this work, implementation of the algorithms described in the second reference below.

Deliverables: report, suffix sorting and associated software and their documentation.

References:
J. Kärkkäinen and P. Sanders, Simple linear work suffix array construction, in ICALP'03, LNCS 2719, Spinger, 2003, pp. 943--955.
M. Crochemore, J. Désarménien and D. Perrin, A note on the Burrows-Wheeler transformation, Theoret. Comput. Sci., 2005, to appear.

Animation of string algorithms

Aim: the project consists in building a Web interface in Java to visualize approximate string matching algorithms. Users of the interface will be able to enter texts and patterns and to see how several algorithms process them in detail. Examples of animations may be found at Exact String Matching Algorithms and Sequence comparison.

Plan: search for sites presenting animations of algorithms, study of several approximate string matching algorithms, proposal for their animation, implementation, documentation.

Deliverables: report, animation software and its documentation.

References:
M. Crochemore and W. Rytter, Jewels of Stringology, World Scientific, 2002, 310 pages.
G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press, 2002, 221 pages.
G.A. Stephen, String Searching Algorithms, World Scientific Publishing Co., 1994, 243 pages.

Alignment of sequences

Aim: the project is to implement a method for comparing molecular sequences based on alignment. The prescribed algorithm is able to run on sequences compressed with Lempel-Ziv method and it runs efficiently on sequences having a small entropy. The alignment algorithm refines the standard dynamic programming method. Implementation will propose a visualization of the execution.

Plan: study the sub-quadratic sequence alignment algorithm in reference below, proposal for its visualization, implementation, documentation.

Deliverables: report, animation software and its documentation.

Reference:
M. Crochemore, G. Landau, and M. Ziv-Ukelson. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices


Informatics Homepage     Local Homepage