Extracting and rendering representative sequences

Title: Extracting and rendering representative sequences
Publication Type: Book Chapter
Year of Publication: 2011
Authors: Gabadinho, A, Ritschard, G, Studer, M, Müller, NS
Editors: Fred, A, Dietz, JLG, Liu, K, Filipe, J
Book Title: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Series Title: Communications in Computer and Information Science
Number: Vol. 128
Place Published: Berlin
ISBN Number: 978-3-642-19031-5
Keywords: categorical sequences, pairwise dissimilarities, discrepancy of sequences, representatives, summarizing sets of sequences, visualization

This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of the sequences in their neighborhoods. The proposed heuristic for extracting the representative subset takes as its main arguments a pairwise distance matrix, a representativeness criterion, and a distance threshold under which two sequences are considered redundant or, equivalently, in the neighborhood of each other. It first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
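
The two-step heuristic described in the abstract (score the candidates, then eliminate redundancy until the requested coverage is reached) can be illustrated with a minimal Python sketch. This is not the TraMineR implementation: the function name `extract_representatives`, the use of neighborhood density as the representativeness criterion, the inclusive threshold test, and the toy distance matrix are all assumptions made for illustration.

```python
def extract_representatives(dist, radius, coverage=0.25):
    """Illustrative sketch: keep non-redundant candidates until `coverage` is reached.

    dist     -- square pairwise distance matrix (list of lists)
    radius   -- distance threshold defining redundancy / neighborhood (assumed inclusive)
    coverage -- required share of sequences lying in some representative's neighborhood
    """
    n = len(dist)
    # Assumed representativeness score: neighborhood density, i.e. the number
    # of sequences lying within `radius` of each candidate.
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) for i in range(n)]
    # Step 1: candidate list sorted by decreasing score (ties broken by index).
    candidates = sorted(range(n), key=lambda i: (-density[i], i))
    # Step 2: scan the candidates, skip any that is redundant with a
    # representative already kept, and stop once the coverage is reached.
    representatives, covered = [], set()
    for c in candidates:
        if len(covered) / n >= coverage:
            break
        if any(dist[c][r] <= radius for r in representatives):
            continue
        representatives.append(c)
        covered.update(j for j in range(n) if dist[c][j] <= radius)
    return representatives, len(covered) / n


# Toy data (hypothetical): six "sequences" placed on a line, with the
# absolute difference of their positions standing in for a sequence distance.
positions = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]

reps, cov = extract_representatives(dist, radius=1.5, coverage=0.8)
print(reps, round(cov, 2))  # prints: [1, 3] 0.83
```

In this toy run, the densest candidate (index 1) is kept first, its two neighbors are discarded as redundant, and index 3 is then added, which brings five of the six sequences into a neighborhood and satisfies the 80% coverage target. Raising the target to 100% would also force the isolated sequence at index 5 into the representative set.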