MAGDA - Alarm Supervision in Telecommunication Networks
by Eric Fabre
MAGDA (Modélisation et Apprentissage pour une Gestion Distribuée des Alarmes) is a research project supported by the French National Research Network in Telecommunications programme RNRT (Réseau National de Recherche en Télécommunication). The challenge for the MAGDA project is twofold: 1) adapting and mixing different formalisms towards the general objective of fault diagnosis in telecommunication networks, using alarm correlation, and 2) building an experimental platform to validate this approach.
Today, there are no widely-recognized standards for tools that could help alarm correlation. Thus, it is worth mentioning some key points of this research that could be regarded as useful achievements in and of themselves:
- Validation of black-box methods: Alarm logs are easily available in practice, but tools are lacking to exploit them properly. It is important to check the relevance of data mining techniques to discover structures in these logs, in particular, frequent chronicles of events.
- Modelling methodology: It is quite easy to build a model of alarm production and fault propagation for toy examples, but the construction of a model for a real network is quite challenging, and requires some structuring and methodology. In particular, it must be object-oriented and based on generic elements that are instantiated and interconnected. This model will be described in UML. The direct use of this model (for simulation and fault diagnosis) is also an objective of MAGDA.
- Model-based diagnosis: The effectiveness of model based approaches for alarm correlation will be evaluated and compared to (possibly associated with) black-box models. Key points to check are the relevance of a formalism for concurrency of events, the relevance of a stochastic model, the influence of an incomplete knowledge of the model and the possibility to cope with a changing model (reconfiguration).
- Distributed diagnosis. As very large systems must be modelled, a centralised diagnosis algorithm handling the global state of the system is unaffordable. Therefore, a distributed diagnosis algorithm will be developed, composed of asynchronous local algorithms that supervise part of the network evolution and synchronize their results. This approach is well-suited to highly concurrent systems.
- Real case experiments will be carried out on an experimental platform composed of a real network manager connected to a simulator of a simple network. Fault scenarios will be elaborated from alarm logs collected on a true network.
All the above mentioned points are very likely to lead to innovations in alarm correlation and network monitoring. We present now only the part of the project devoted to diagnosis using discrete-event control theory.
Model-based online monitoring
We consider a telecommunication network as a network of asynchronously interacting finite-state machines. Such systems are subject to spontaneous faults, occurrences of which may trigger alarms. Also, network elements get services from other elements and, in turn, provide services to several alternative network elements. This causes both fault and alarm propagation throughout the network.
As a first idea, we have proposed modelling such a situation as follows:
- The mechanism of a spontaneous fault occurrence and alarm propagation for a given element is modelled by a Petri Net of capacity one (PN). Places of the PN capture features of interest concerning the state of the network element, eg, nominal, faulty, disabled. Having a token in a given place indicates that the associated feature is active.
- Observations are attached to transitions. Each time a transition fires, it produces an alarm message that is eventually received by the supervisor of the network (losses of messages are possible). The location of tokens and their moves are not observable.
- The interaction of network elements is modelled by common places between the PN models of each element. We consider a true-concurrency semantics for the PN: transition firings are only partially ordered in time, according to the causality structure induced by the underlying PN directed graph structure.
Alarm interpretation is then regarded as the problem of inferring, from the observation of alarms, the hidden state history of the PN. As some of the events (in particular, spontaneous faults) are typically random in nature, we consider some kind of probabilistic form for our PN. To prepare for distributed diagnostics, our requirement was that stochastic independence should match concurrency, implying that if two transitions of the PN are concurrent, all interleavings of their reception by the supervisor should have equal likelihood.
We have proposed a new class of stochastic PN that satisfy the above property: Partially Stochastic Petri Nets (PSPN). The well known Hidden Markov Models (HMM) theory has been extended to them. HMM are stochastic automata for which is it desired to infer the most likely hidden state (or transition) sequence from an observed sequence of transition signatures. This machinery is very popular in pattern recognition and in speech recognition. The basic algorithm is the so-called Viterbi algorithm, which computes on-line the most likely state history. In our PSPN framework, transitions are associated with tiles that describe local state changes. The Viterbi algorithm then reconstructs hidden trajectories by concatenating tiles that match the observations. Therefore it is renamed the Viterbi puzzle.
The following topics require further development for MAGDA and are currently under investigation:
- developing of a theory of distributed, asynchronous deployment of the Viterbi puzzle, which means that the puzzle is now played in co-operation by several players, where each has its own partial set of tiles
- extending this theory to dynamically-evolving models of PN. This is necessary since some network elements are only logical, not physical, and thus are subject to reconfiguration. Our aim is to directly generalise Viterbi puzzles to puzzles with a dynamically-evolving set of tiles
- extending the theory further to accommodate situations where the model may be incomplete, i.e. it is not guaranteed that the observed alarm sequence can be produced by the hypothesised model
- modelling: the process of deriving the PN model from available information is itself a challenge, as this must be automated as part of the deployment and configuration of the fault management system
- deployment of the algorithm: as the MAGDA deployment architecture will rely on OMG-CORBA model, our first objective is to use this model as a target model for the deployment of our abstract distributed Viterbi puzzle.
Currently, the Viterbi puzzle group at IRISA is comprised of three permanent researchers (Albert Benveniste, Eric Fabre, and Claude Jard) and two post-doctoral fellows (Mark Smith from MIT and AT&T Labs Research and Laurie Ricker from Queens University in Canada). Laurie Ricker is supported by the ERCIM PhD fellowship programme. After this project, she will continue her research activity at CWI with Jan van Schuppen.
The MAGDA project is a collaboration between two academic research centres (IRISA/INRIA Rennes and LIPN Paris) and three industrial companies (France-Télécom/CNET, Alcatel/CRC and ILOG).
Links:
Magda web site: http://magda.elibel.tm.fr/
Please contact:
Eric Fabre - INRIA/IRISA
Tel: + 33 2 9984 7326
E-mail: Eric.Fabre@irisa.fr