A survey on unsupervised learning algorithms for detecting abnormal points in streaming data.
One of the critical tasks of data stream analysis is anomaly detection. Various methods based on different assumptions have been reported in the literature. However, there is still a lack of experimental comparison of those methods, which makes it difficult to choose a specific one. In this paper, we compare unsupervised abnormal point detection methods for data streams on various datasets, with emphasis on their performance and runtime, as well as on the presence of concept drift, seasonality, trend, and cycle as characteristics of the datasets. Our experiments show that forecasting-based methods handle seasonality and trend best, and that lightweight models performing online gradient descent have a lower execution time. The details of our experiments are available online.
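The latency criterion favors detectors that update their state in constant time per instance. As an illustrative baseline only (not one of the surveyed methods), here is a minimal online anomaly scorer that maintains a running mean and variance with Welford's algorithm and flags points whose z-score exceeds a threshold:

```python
import math

class StreamingZScore:
    """Minimal online anomaly scorer: one constant-time update per
    instance, so per-point latency stays flat as the stream grows."""

    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def update(self, x):
        """Score x against the statistics seen so far, then absorb it."""
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) / std > self.threshold
        else:
            anomalous = False
        # Welford's incremental mean/variance update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous
```

A detector like this handles neither seasonality nor concept drift, which is precisely why the survey distinguishes methods by the stream characteristics they can accommodate.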
Experimental study of similarity measures for clustering uncertain time series
Uncertain time series (uTS) are time series whose values are not precisely known. Each value in such time series can be given as a best estimate and an error deviation on that estimate. This kind of time series is preponderant in transient astrophysics, where transient objects are characterized by the time series of their light curves, which are uncertain because of many factors including moonlight, twilight and atmospheric conditions. An example of a uTS dataset can be found at https://www.kaggle.com/c/PLAsTiCC-2018. As with traditional time series, machine learning can be used to analyze uTS. In the literature, this analysis is generally performed using uncertain similarity measures. In particular, uTS clustering has been performed using FOTS, an uncertain similarity measure based on eigenvalue decomposition [1]. Elsewhere, the uncertain Euclidean distance (UED), which is based on uncertainty propagation, has been proposed and used to perform the classification of uTS [2]. Given UED's performance on supervised classification, the goal of this work is to assess the effectiveness of this uncertain measure for uTS clustering. A preliminary experiment has been conducted in that direction; the source code and results of the experiment are publicly available online. In the experiment, FOTS, UED and the Euclidean distance are compared as measures for uTS clustering using the datasets from [2]. The obtained results reveal that UED is a promising uncertain measure for uTS clustering. As a future direction, an extended experiment with other uncertain similarity measures such as DUST and PROUD [3] will be conducted.
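To make the idea of an uncertainty-propagating distance concrete, here is a minimal sketch of a UED-style measure, assuming each uTS value is an (estimate, error) pair and using first-order propagation rules. This is an illustration of the principle, not the reference implementation from [2]:

```python
import math

def ued(ts_a, ts_b):
    """Uncertain Euclidean distance between two uncertain time series.

    Each series is a list of (value, error) pairs; errors are combined
    with first-order uncertainty propagation, so the result is itself
    a (distance, error) pair."""
    dist_sq, err_sq = 0.0, 0.0
    for (xa, ea), (xb, eb) in zip(ts_a, ts_b):
        d = xa - xb
        e = math.sqrt(ea ** 2 + eb ** 2)   # error on the difference
        dist_sq += d * d
        err_sq += (2 * abs(d) * e) ** 2    # error on d^2 (first order)
    dist = math.sqrt(dist_sq)
    # propagate through the final square root: err(sqrt(s)) = err(s) / (2*sqrt(s))
    err = math.sqrt(err_sq) / (2 * dist) if dist > 0 else math.sqrt(err_sq)
    return dist, err
```

Returning an error alongside the distance is what lets a clustering algorithm weigh how trustworthy each pairwise comparison is.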
Published on Jul 2022
Dimensionality Reduction and Multivariate Time Series Classification
In this work we tackle the problem of dimensionality reduction when classifying multivariate time series (MTS). Multivariate time series classification is a challenging task, especially as sparsity in raw data, computational runtime and dependency among dimensions increase the difficulty of dealing with such complex data. In a recent work, a novel subspace model named EMMV (Ensemble de M-histogrammes Multi-Vues) [1], which combines M-histograms and multi-view learning with an ensemble learning technique to handle the MTS classification task, was reported. This model has shown good results when compared to state-of-the-art MTS classification methods. Before performing the classification itself, EMMV reduces the dimension of the multivariate time series using correlation analysis, and then applies a random selection of the views. In this work, we explore two further alternatives to the dimensionality reduction method used in EMMV, the goal being to check the efficiency of randomness in EMMV. The first technique, named Temporal Laplacian Eigenmaps [2], comes from manifold learning, and the second, named Fractal Redundancy Elimination [3], comes from fractal theory. Both are nonlinear dimensionality reduction algorithms, in contrast to correlation analysis, which is linear, meaning that the former are able to eliminate more correlations than the latter. We then conduct several experiments on available MTS benchmarks in order to compare the different techniques, and discuss the obtained results.
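Correlation-based reduction of MTS dimensions can be sketched as a greedy filter: keep a dimension only if it is not strongly correlated with one already kept. This is a simplified illustration of the linear baseline (threshold and greedy order are assumptions, not EMMV's exact procedure):

```python
def drop_correlated_dims(corr, threshold=0.9):
    """Greedy correlation-based dimension selection.

    `corr` is a symmetric matrix of absolute pairwise correlations
    between the MTS dimensions; a dimension is kept unless it is
    correlated above `threshold` with a dimension already kept."""
    kept = []
    for d in range(len(corr)):
        if all(corr[d][k] < threshold for k in kept):
            kept.append(d)
    return kept
```

Being linear, this only removes linear redundancy; the manifold-learning and fractal alternatives studied in the paper target the nonlinear correlations such a filter misses.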
Published on Jul 2022
Top-k Learned Clauses for Modern SAT Solvers
Clause learning is one of the most important components of a conflict-driven clause learning (CDCL) SAT solver that is effective on industrial SAT instances. Since the number of learned clauses is proved to be exponential in the worst case, it is necessary to identify the most relevant clauses to maintain and to delete the irrelevant ones. As reported in the literature, several learned clause deletion strategies have been proposed. However, the diversity in both the number of clauses to be removed at each reduction step and the results obtained with each strategy makes it difficult to determine which criterion is better. Thus, the problem of selecting which learned clauses are to be removed during the search remains very challenging. In this paper, we propose a novel approach to identify the most relevant learned clauses without favoring or excluding any of the proposed measures, but by adopting the notion of a dominance relationship among those measures. Our approach bypasses the problem of differences in the results obtained with different measures and reaches a compromise between their assessments. Furthermore, the proposed approach also reduces the non-trivial problem of choosing the number of clauses to delete at each reduction of the learned clause database.
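The dominance idea can be sketched with Pareto dominance over clause quality measures. The two measures used here (LBD, where lower is better, and activity, where higher is better) are common in CDCL solvers but are only illustrative of the kind of measures the paper combines:

```python
def dominates(c1, c2):
    """True if clause c1 Pareto-dominates c2: at least as good on every
    measure and strictly better on at least one.
    Measures assumed here: lbd (lower is better), activity (higher is better)."""
    at_least_as_good = c1["lbd"] <= c2["lbd"] and c1["activity"] >= c2["activity"]
    strictly_better = c1["lbd"] < c2["lbd"] or c1["activity"] > c2["activity"]
    return at_least_as_good and strictly_better

def non_dominated(clauses):
    """Keep clauses dominated by no other clause -- natural candidates to
    retain, since no single measure has to be favored over the others."""
    return [c for c in clauses
            if not any(dominates(o, c) for o in clauses if o is not c)]
```

Because dominance only compares clauses pairwise, no weighting between measures is needed, which is how the approach avoids committing to one criterion.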
A Distributed and Incremental Algorithm for Large-Scale Graph Clustering
Graph clustering is one of the key techniques for understanding the structures present in graph data. In addition to cluster detection, the identification of hubs and outliers is also a critical task, as it plays an important role in the analysis of graph data. Recently, several graph clustering algorithms have been proposed and used in many application domains such as biological network analysis, recommendation systems and community detection. Most of these algorithms are based on the structural clustering algorithm SCAN. Yet, the SCAN algorithm was designed for small graphs, without significant support for big and dynamic graphs. In this paper, we propose DISCAN, a novel distributed and incremental graph clustering algorithm based on SCAN. We present an implementation of DISCAN on top of the BLADYG framework, and experimentally show the efficiency of DISCAN on both large and dynamic networks.
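SCAN's core primitive is the structural similarity between adjacent nodes, computed from their shared neighborhoods; clusters, hubs and outliers are then derived from thresholding it. A minimal sketch of that similarity (adjacency given as sets of neighbors):

```python
import math

def structural_similarity(adj, u, v):
    """SCAN's structural similarity between nodes u and v:
    sigma(u, v) = |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|),
    where N[x] is the closed neighborhood of x (x plus its neighbors)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```

In a distributed setting like DISCAN's, each such computation only needs the two nodes' neighbor lists, which is what makes the similarity amenable to partitioned and incremental evaluation.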
A gradual-pattern-based approach for recommendation in a repeated-consumption context
Recommender systems were designed to address the problem of data overload. The goal is to select, from a large number of items, the small subset relevant to a given user. Taking into account the repetitive and periodic nature of the interactions between users and items has improved the performance of existing systems. However, these systems do not take into account the numerical data associated with those interactions. In this article, we propose a recommendation approach based on gradual patterns, which make it possible to model co-variations between items. The experimental results obtained with the proposed approach on the dataset used are encouraging.
Ambiguity in the representation of emotions: a survey of multimodal databases
Emotion recognition is a fundamental building block for endowing machines with emotional intelligence. The first models were designed to recognize strongly expressed, easily identifiable emotions. However, we rarely experience such emotions in everyday life. Most of the time, we find it difficult to identify with certainty our own emotion and that of others: this is emotional ambiguity. The databases at the root of the development of recognition systems must make it possible to introduce ambiguity into the emotional representation. This paper summarizes the main emotional representations and proposes a survey of multimodal databases for emotion recognition, with a study of their position on this issue. The paper then discusses the possibility of representing the ambiguity of emotions from the selected databases.
Comparative evaluation of unsupervised methods for detecting abnormal points in data streams
Several methods based on varied assumptions exist for anomaly detection in data streams. The choice of a method depends on its performance on specific types of data. Data streams can be characterized by the presence of seasonality, trend, cycle and concept drift (a change in the statistical properties of the data). In this work, we compare, in terms of latency (the time needed to process one instance) and performance, a set of anomaly detection methods for data streams with diverse assumptions, on both univariate and multivariate datasets (on which we identified the characteristics present).
A study of groundwater level prediction using convolutional, recurrent and residual neural models
Forecasting the level of groundwater tables, also called the piezometric level or hydraulic head, is a task with socio-economic stakes. A good forecast can enable the regulation of water consumption, prevent floods and optimize water exploitation. We therefore address the challenge of the EGC 2022 conference, which consists in predicting the evolution of groundwater levels over the next three months. In this article, we propose to use three types of neural networks (convolutional, recurrent and residual) that collaborate to predict the hydraulic head every 24 hours from October 15, 2021 to January 15, 2022. The source code of our approach, as well as the results, are publicly available on GitHub.
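A forecasting setup like this one typically starts by slicing the level series into supervised (input, target) windows that the networks consume. A minimal sketch of that preprocessing step, with `input_len` and `horizon` as illustrative parameters (not the challenge's actual settings):

```python
def make_windows(series, input_len, horizon):
    """Turn a univariate series into supervised pairs: each pair is
    (`input_len` past values, the next `horizon` values to predict)."""
    pairs = []
    for i in range(len(series) - input_len - horizon + 1):
        pairs.append((series[i:i + input_len],
                      series[i + input_len:i + input_len + horizon]))
    return pairs
```

With daily (24-hour) predictions over three months, `horizon` would cover roughly 90 steps; the window length is a tuning choice left to each model.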
Exploring convolutional neural networks with transfer learning for diagnosing Lyme disease from skin lesion images.
Background and objective: Lyme disease, which is one of the most common infectious vector-borne diseases, manifests itself in most cases with erythema migrans (EM) skin lesions. Recent studies show that convolutional neural networks (CNNs) perform well in identifying skin lesions from images. Lightweight CNN-based pre-scanner applications for resource-constrained mobile devices can help users with early diagnosis of Lyme disease and prevent the transition to a severe late form thanks to appropriate antibiotic therapy. Also, resource-intensive CNN-based robust computer applications can assist non-expert practitioners with an accurate diagnosis. The main objective of this study is to extensively analyze the effectiveness of CNNs for diagnosing Lyme disease from images and to find out the best CNN architectures considering resource constraints. Methods: First, we created an EM dataset with the help of expert dermatologists from Clermont-Ferrand University Hospital Center of France. Second, we benchmarked this dataset for twenty-three CNN architectures customized from VGG, ResNet, DenseNet, MobileNet, Xception, NASNet, and EfficientNet architectures in terms of predictive performance, computational complexity, and statistical significance. Third, to improve the performance of the CNNs, we used custom transfer learning from ImageNet pre-trained models as well as pre-trained the CNNs with the skin lesion dataset HAM10000. Fourth, for model explainability, we utilized Gradient-weighted Class Activation Mapping to visualize the regions of input that are significant to the CNNs for making predictions. Fifth, we provided guidelines for model selection based on predictive performance and computational complexity. Results: The customized ResNet50 architecture gave the best classification accuracy of 84.42%±1.36, AUC of 0.9189±0.0115, precision of 83.1%±2.49, sensitivity of 87.93%±1.47, and specificity of 80.65%±3.59.
A lightweight model customized from EfficientNetB0 also performed well with an accuracy of 83.13%±1.2, AUC of 0.9094±0.0129, precision of 82.83%±1.75, sensitivity of 85.21%±3.91, and specificity of 80.89%±2.95. All the trained models are publicly available at https://dappem.limos.fr/download.html and can be used by others for transfer learning and for building pre-scanners for Lyme disease. Conclusion: Our study confirmed the effectiveness of even some lightweight CNNs for building Lyme disease pre-scanner mobile applications to assist people with an initial self-assessment, referring them to an expert dermatologist for further diagnosis.
On the design of a similarity function for sparse binary data with application on protein function annotation
Automatic protein function annotation is a challenging task that is fundamental in many medical applications. Indeed, the capability to predict whether a protein has a given function is a key step for disease understanding and drug design. For such reasons, many authors have proposed computational methods for protein function prediction. One key element present in many proposals is the similarity function. Such functions are often used to compute the pairwise similarity between two proteins. It is commonly accepted that proteins with similar structures share the same function. Nevertheless, no previous work has focused on proposing a similarity function specifically designed for protein function annotation. In this work, we analyze the best similarity functions for the protein function annotation task and propose a new one. We performed experiments in a simple pairwise similarity scenario and also using our proposal as part of a more complex protein function annotation method. Based on the results, we can state that our proposal is a valid alternative as a building block of many protein function annotation methods.
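For sparse binary data, similarity functions are usually defined over the sets of positive entries. As a point of comparison only (the paper's proposal is a different, tailored measure), here is the classic Jaccard similarity on that representation:

```python
def jaccard(a, b):
    """Jaccard similarity between two sparse binary vectors, each given
    as the set of indices of its non-zero entries. A standard baseline
    for sparse binary data: |a ∩ b| / |a ∪ b|."""
    if not a and not b:
        return 1.0  # two all-zero vectors are conventionally identical
    return len(a & b) / len(a | b)
```

Representing each protein by its set of positive features keeps both storage and the similarity computation proportional to the number of non-zeros rather than to the full dimensionality.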
Pilot study of eDOL, a new mHealth application and web platform for medical follow-up of chronic pain patients
The pharmacopoeia of analgesics is old; their effectiveness is limited, they have undesirable effects, and little progress has been made in recent years. Thus, innovation is limited despite prolific basic research. Better characterization of patients could help to identify the predictors of successful treatments through research programs, and therefore enable physicians to make better decisions in the initial choice of treatment and in the follow-up of their patients. Nevertheless, the current assessment of chronic pain patients provides only fragmentary data on their painful daily experiences. Thus, it is essential to modify the temporality in which patients' sensations are assessed, with real-life monitoring of different parameters, i.e. subjective and objective markers of chronic pain. Consequently, recent studies have highlighted the urgent need to develop self-management and chronic pain management programs through e-health programs, and to enhance their therapeutic value.
The contribution of entropy to fuzzy c-means on categorical data
The fuzzy c-means clustering method with fuzzy centroids, FC (Kim et al., 2004), is an extension of the fuzzy k-modes method (FKM) (Huang and Ng, 1999) that uses a fuzzy representation of the cluster centers. After deriving the objective function of this method, it was shown in (Djiberou Mahamadou et al., 2020) that the center update formula does not guarantee the convergence of the method. The authors subsequently proposed two extensions of FC named FC* and CFE (Categorical Fuzzy Entropy c-means). While FC* uses hard center updates, CFE incorporates the notion of entropy into the objective function to penalize the weights. This allows a more balanced distribution of the weight mass over all the values of the attribute under consideration. The entropy thus favors obtaining fuzzy centroids. In this work, we compared the CFE, FC* and FKM methods on nine real-world datasets.
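The effect of the entropy penalty can be illustrated on a single categorical attribute: minimizing a cost of the form sum_t w_t·d_t + λ·sum_t w_t·log(w_t) under sum_t w_t = 1 yields a softmax over -d_t/λ, so weight mass spreads over all category values instead of collapsing onto one. The sketch below illustrates this principle only; it is not CFE's exact update rule, and the dissimilarity used is an assumption:

```python
import math

def entropy_weights(counts, lam=1.0):
    """Entropy-regularized centroid weights for one categorical attribute.

    `counts` maps each category value to its frequency among the cluster
    members; d_t is taken here (illustratively) as the fraction of members
    NOT having value t. The closed-form minimizer is a softmax over -d_t/lam."""
    total = sum(counts.values())
    exps = {t: math.exp(-(1 - c / total) / lam) for t, c in counts.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

With a small λ the weights approach a hard (one-hot) centroid, as in FC*; a larger λ yields the fuzzier centroids that the entropy term is meant to favor.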
Ontology-based data integration in a distributed context of coalition air missions
The IBC (Knowledge Base Integration) project addresses an issue of ontology-based data integration. It aims at combining data residing in different actors (aircraft, drone, satellite...) during an air mission scenario and providing users with a unified view of all available data, in a communication constrained environment. We describe the solution we have implemented based on mediation. We use rule languages to process queries using an OWL2 domain ontology and RDF triples to store data. We also give a performance analysis of our prototype.
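Answering queries over RDF data ultimately reduces to matching triple patterns against the stored triples. The toy sketch below shows only that core step, with in-memory tuples and `None` standing for a query variable; it is not the project's actual rule-based mediator, and the example triples are invented:

```python
def match(triples, pattern):
    """Match one triple pattern against an in-memory list of RDF-style
    (subject, predicate, object) tuples. None acts as a variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]
```

A mediator composes many such pattern matches (plus rule application over the OWL2 ontology) and, in a communication-constrained setting, pushes them down to the actor holding the relevant triples.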
Uncertain Time Series Classification with Shapelet Transform
Time series classification is a task that aims at classifying chronological data.
It is used in a diverse range of domains such as meteorology, medicine and physics.
In the last decade, many algorithms have been built to perform this task with very appreciable accuracy.
However, applications where time series have uncertainty have been under-explored.
Using uncertainty propagation techniques, we propose a new uncertain dissimilarity measure based on Euclidean distance.
We then propose the uncertain shapelet transform algorithm for the classification of uncertain time series.
The extensive experiments we conducted on state-of-the-art datasets show the effectiveness of our contribution.
The source code of our contribution and the datasets we used are all available on a public repository.
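The shapelet transform represents each series by its distance to a set of discriminative subsequences. A minimal sketch of one such feature with the ordinary Euclidean distance is shown below; the uncertain variant proposed here replaces that inner distance with the uncertain dissimilarity measure:

```python
import math

def shapelet_feature(series, shapelet):
    """One shapelet-transform feature: the minimum Euclidean distance
    between `shapelet` and every same-length subsequence of `series`."""
    m = len(shapelet)
    best = math.inf
    for i in range(len(series) - m + 1):
        d = math.sqrt(sum((series[i + j] - shapelet[j]) ** 2 for j in range(m)))
        best = min(best, d)
    return best
```

Computing this feature for each shapelet turns every time series into a fixed-length vector, on which any standard classifier can then be trained.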
Named Entity Recognition in Low-resource Languages using Cross-lingual distributional word representation
Named Entity Recognition (NER) is a fundamental task in many NLP applications that seek to identify and classify expressions such as people, location, and organization names.
Many NER systems have been developed, but the annotated data needed for good performance are not available for low-resource languages, such as Cameroonian languages.
In this paper we exploit the low frequency of named entities in text to define a new suitable cross-lingual distributional representation for named entity recognition.
We build the first Ewondo (a Bantu low-resource language of Cameroon) named entities recognizer by projecting named entity tags from English using our word representation.
In terms of Recall, Precision and F-score, the obtained results show the effectiveness of the proposed distributional representation of words.
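Projecting named entity tags from English can be sketched as annotation projection through word alignments. The sketch below makes a simplifying 1-to-1 alignment assumption and is only an illustration of the projection step, not the paper's representation-based method:

```python
def project_tags(src_tags, alignment, tgt_len):
    """Copy named-entity tags from a tagged source (English) sentence onto
    a target-language sentence through word alignments.

    `src_tags` are the source tags, `alignment` maps a source token index
    to a target token index (assumed 1-to-1 here), and unaligned target
    tokens default to the outside tag 'O'."""
    tgt_tags = ["O"] * tgt_len
    for i, tag in enumerate(src_tags):
        if tag != "O" and i in alignment:
            tgt_tags[alignment[i]] = tag
    return tgt_tags
```

The projected tags then serve as (noisy) supervision for training the target-language recognizer, sidestepping the lack of manually annotated Ewondo data.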
Classification of uncertain time series by shapelet transform
Time series classification is a task that consists in classifying chronological data.
It is used in various domains such as meteorology, medicine and physics.
Several high-performing techniques have been proposed over the last ten years to accomplish this task.
However, they do not explicitly take the uncertainty in the data into account.
Using uncertainty propagation, we propose a new uncertain dissimilarity measure based on the Euclidean distance.
We also show how to classify uncertain time series by coupling this measure with the shapelet transform method, one of the best-performing methods for this task.
An experimental evaluation of our contribution is carried out on the UCR time series repository.
Knowledge-based data integration in a military mission
The IBC ("Intégration de Bases de Connaissance" - Knowledge Bases Integration) project addresses the question of ontology-based data integration, in the context of the MMT (Man Machine Teaming) initiative. It aims at combining data residing in different actors (aircraft, drone, satellite, ...) during an air mission scenario and providing users with a unified view of all available data, in a communication-constrained environment.
A Word Representation to Improve Named Entity Recognition in Low-resource Languages
Named Entity Recognition (NER) is a fundamental task in many NLP applications that seek to identify and classify expressions such as people, location, and organization names.
Many NER systems have been developed, but the annotated data needed for learning is not available for low-resource languages, such as Cameroonian languages.
In this paper we exploit the low frequency of named entities in text to define a new suitable word representation for named entity recognition.
We build the first Ewondo (a Bantu language of Cameroon) named entities recognizer by projecting named entity tags from English using our word representation.
In terms of Recall, Precision and F-score, the obtained results show the effectiveness of the proposed word representation.
Categorical Fuzzy Entropy C-means
Hard and fuzzy clustering algorithms are part of the partition-based clustering family.
They are widely used in real-world applications to cluster numerical and categorical data.
While in hard clustering an object is assigned to a cluster with certainty, in fuzzy clustering an object can be assigned to different clusters given a membership degree.
For both types of methods, an entropy term can be incorporated into the objective function, mostly to avoid solutions raising too much uncertainty.
In this paper, we present an extension of a fuzzy clustering method for categorical data using fuzzy centroids.
The new algorithm, referred to as Categorical Fuzzy Entropy (CFE), integrates an entropy term in the objective function.
This allows a better fuzzification of the cluster prototypes.
Experiments on ten real-world data sets and statistical comparisons show that the new method can efficiently handle categorical data.
Evidential Clustering for Categorical Data
Evidential clustering methods assign objects to clusters with a degree of belief, allowing for a better representation of cluster overlap and outliers. Based on the theoretical framework of belief functions, they generate credal partitions, which extend crisp, fuzzy and possibilistic partitions. Despite their ability to provide rich information about the partition, no evidential clustering algorithm for categorical data has yet been proposed. This paper presents a categorical version of ECM, an evidential variant of k-means. The proposed algorithm, referred to as cat-ECM, considers a new dissimilarity measure and introduces an alternating minimization scheme in order to obtain a credal partition. Experimental results with real and synthetic data sets show the potential and the efficiency of cat-ECM for clustering categorical data.