3D reverse engineering, in which a CAD model is inferred given a 3D scan of a physical object, is a research direction that offers many promising practical applications. This paper proposes TransCAD, an end-to-end transformer-based architecture that predicts the CAD sequence from a point cloud. TransCAD leverages the structure of CAD sequences by using a hierarchical learning strategy. A loop refiner is also introduced to regress sketch primitive parameters. Rigorous experimentation on the DeepCAD and Fusion360 datasets shows that TransCAD achieves state-of-the-art results. The analysis of results is supported by a proposed metric for CAD sequences, the mean Average Precision of CAD Sequence, which addresses the limitations of existing metrics.
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples.
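To make the object-level masking idea above concrete, the sketch below masks a fraction of per-object features and reconstructs them from the unmasked ones with a small transformer encoder. The module name, feature dimensions and mask ratio are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of object-level masked modeling in the spirit of MBBR.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedObjectReconstructor(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, obj_feats, mask_ratio=0.3):
        # obj_feats: (B, N, D) features of the detected objects in each scene
        B, N, _ = obj_feats.shape
        n_mask = max(1, int(mask_ratio * N))
        idx = torch.rand(B, N).argsort(dim=1)[:, :n_mask]       # objects to mask
        rows = torch.arange(B).unsqueeze(1)
        masked = obj_feats.clone()
        masked[rows, idx] = self.mask_token                      # replace with mask token
        recon = self.head(self.encoder(masked))                  # context-aware reconstruction
        return nn.functional.mse_loss(recon[rows, idx], obj_feats[rows, idx])

# Toy usage: 4 scenes with 12 detected objects each
loss = MaskedObjectReconstructor()(torch.randn(4, 12, 256))
```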
Dimitrios Mallis, Sk Aziz Ali, Elona Dupont, Kseniya Cherenkova, Ahmet Serdar Karadeniz, Mohammad Sadil Khan, Anis Kacem, Gleb Gusev, and Djamila Aouada
International Conference on Computer Vision Workshops (ICCVW), 2023
Recent breakthroughs in geometric Deep Learning (DL) and the availability of large Computer-Aided Design (CAD) datasets have advanced the research on learning CAD modeling processes and relating them to real objects. In this context, 3D reverse engineering of CAD models from 3D scans is considered to be one of the most sought-after goals for the CAD industry. However, recent efforts assume multiple simplifications that limit their applicability in real-world settings. The SHARP Challenge 2023 aims at pushing the research a step closer to the real-world scenario of CAD reverse engineering through dedicated datasets and tracks. In this paper, we define the proposed SHARP 2023 tracks, describe the provided datasets, and propose a set of baseline methods along with suitable evaluation metrics to assess the performance of the track solutions. All proposed datasets along with useful routines and the evaluation metrics are publicly available.
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase, without requiring training on explicit grounding annotations.
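As a rough illustration of such reasoning supervision, the sketch below aligns the question-guided attention distribution over image regions with the one induced by the correct reasoning. The use of a KL-divergence term and all variable names are assumptions for illustration; the abstract describes the objective only as a similarity loss.

```python
# Minimal sketch of aligning two attention distributions over image regions.
# The KL-divergence formulation is an assumption, not necessarily the paper's exact loss.
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn_question, attn_reasoning, eps=1e-8):
    # attn_question, attn_reasoning: (B, R) attention weights over R regions,
    # each row assumed to sum to 1 (e.g. the output of a softmax).
    p = attn_reasoning.clamp_min(eps)   # target: attention guided by the correct reasoning
    q = attn_question.clamp_min(eps)    # prediction: attention guided by the question alone
    return F.kl_div(q.log(), p, reduction="batchmean")

# Toy usage with 8 samples and 36 image regions
a_q = torch.softmax(torch.randn(8, 36), dim=-1)
a_r = torch.softmax(torch.randn(8, 36), dim=-1)
loss = attention_alignment_loss(a_q, a_r)
```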
This paper proposes a novel paradigm for the unsupervised learning of object landmark detectors. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we propose a self-training approach where, departing from generic keypoints, a landmark detector and descriptor is trained to improve itself, tuning the keypoints into distinctive landmarks. To this end, we propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. With a shared backbone for the landmark detector and descriptor, the keypoint locations progressively converge to stable landmarks, filtering out the less stable ones. Compared to previous works, our approach can learn points that are more flexible in terms of capturing large viewpoint changes. We validate our method on a variety of difficult datasets, including LS3D, BBCPose, Human3.6M and PennAction, achieving new state-of-the-art results.
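A minimal sketch of the alternation described above is given below: cluster keypoint descriptors into pseudo-labels, train the descriptor to be discriminative for those labels, and repeat. The toy data, the small network and the use of a plain cross-entropy objective in place of the full contrastive setup are simplifying assumptions.

```python
# Minimal sketch of the self-training alternation: pseudo-labels from clustering,
# then discriminative training on those labels, repeated over several rounds.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

descriptor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(descriptor.parameters(), lr=1e-3)
keypoint_feats = torch.randn(1000, 64)          # stand-in for features at detected keypoints
K = 10                                          # number of landmark pseudo-classes

for round_idx in range(3):                      # outer self-training rounds
    with torch.no_grad():
        desc = descriptor(keypoint_feats)
    pseudo_labels = torch.tensor(
        KMeans(n_clusters=K, n_init=10).fit_predict(desc.numpy()))
    classifier = nn.Linear(128, K)              # re-initialised head per round
    head_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    for step in range(100):                     # inner representation-learning phase
        logits = classifier(descriptor(keypoint_feats))
        loss = nn.functional.cross_entropy(logits, pseudo_labels)
        opt.zero_grad(); head_opt.zero_grad()
        loss.backward()
        opt.step(); head_opt.step()
```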
In this work we introduce an incremental learning framework for Click-Through-Rate (CTR) prediction and demonstrate its effectiveness for Taboola's massive-scale recommendation service. Our approach enables rapid capture of emerging trends through warm-starting from previously deployed models and fine-tuning on "fresh" data only. Past knowledge is maintained via a teacher-student paradigm, where distillation from the teacher mitigates the catastrophic forgetting phenomenon. Our incremental learning framework enables significantly faster training and deployment cycles (a 12x speedup). We demonstrate a consistent Revenue Per Mille (RPM) lift over multiple traffic segments and a significant CTR increase on newly introduced items.
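A minimal sketch of the warm-start-plus-distillation recipe is shown below: the previously deployed model serves as a frozen teacher, and a warm-started student is fine-tuned on fresh data with an added distillation term. The model, loss weighting and training loop are illustrative assumptions, not the production setup.

```python
# Minimal sketch of warm-start plus teacher distillation for incremental CTR training.
# Architecture, loss weights and data handling are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def incremental_update(prev_model, fresh_x, fresh_y, alpha=0.5, steps=100):
    teacher = copy.deepcopy(prev_model).eval()        # frozen previously deployed model
    student = copy.deepcopy(prev_model)               # warm-started from the same weights
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        logits = student(fresh_x)
        with torch.no_grad():
            soft = torch.sigmoid(teacher(fresh_x))    # teacher's CTR estimates as soft targets
        loss = bce(logits, fresh_y) + alpha * bce(logits, soft)  # fresh data + distillation
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# Toy usage with a linear CTR model and random "fresh" data
student = incremental_update(nn.Linear(16, 1),
                             torch.randn(256, 16),
                             torch.randint(0, 2, (256, 1)).float())
```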
Object parts, also known as landmarks, convey information about an object's shape and spatial configuration in 3D space, especially for deformable objects. The goal of landmark detection is to have a model that, for a particular object instance, can estimate the locations of its parts. Research in this field is mainly driven by supervised approaches, where a sufficient amount of human-annotated data is available. As annotating landmarks for all objects is impractical, this thesis focuses on learning landmark detectors without supervision. Despite good performance in limited scenarios (objects showcasing minor rigid deformation), unsupervised landmark discovery mostly remains an open problem. Existing work fails to capture semantic landmarks, i.e. points similar to the ones assigned by human annotators, and may not generalise well to highly articulated objects such as the human body, complicated backgrounds or large viewpoint variations.
In this thesis, we propose a novel self-training framework for the discovery of unsupervised landmarks. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we depart from generic keypoints and train a landmark detector and descriptor to improve itself, tuning the keypoints into distinctive landmarks. We propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. Our detector can discover highly semantic landmarks that are more flexible in terms of capturing large viewpoint changes and out-of-plane rotations (3D rotations). New state-of-the-art performance is achieved on multiple challenging datasets.
This paper addresses the problem of unsupervised discovery of object landmarks. We take a different path compared to existing works, based on two novel perspectives: (1) Self-training: starting from generic keypoints, we propose a self-training approach where the goal is to learn a detector that improves itself, becoming more and more tuned to object landmarks. (2) Correspondence: we identify correspondence as a key objective for unsupervised landmark discovery and propose an optimization scheme which alternates between recovering object landmark correspondence across different images via clustering and learning an object landmark descriptor without labels. Compared to previous works, our approach can learn landmarks that are more flexible in terms of capturing large changes in viewpoint. We show the favourable properties of our method on a variety of difficult datasets, including LS3D, BBCPose and Human3.6M.
The aim of this study was to compare metrics for quantifying enteric methane (CH4) emissions from individual cows during milking, using frequent spot measurements and peak-analysis methods. An infrared gas analyser was used to measure the CH4 emitted by cows, and eructation peaks were identified using the Signal Processing Toolbox in Matlab. CH4 emissions were quantified by gas peak height, peak amplitude and average concentration, and were expressed in grams per day and as CH4 yield (grams per kilogram of dry matter intake (DMI)). Peak-analysis measurements of CH4 were obtained from 36 cows during 2,474 milkings, during which cows were fed a ration containing between 39 and 70% forage. Spot measurements of CH4 were compared to a separate dataset of 196 chamber CH4 records from another group of 105 cows, which were fed a ration containing between 25 and 80% forage. The results showed that the metrics of CH4 peak height and CH4 peak amplitude demonstrated positive relationships between daily CH4 emissions and DMI (both r=0.37) and negative relationships between CH4 yield and DMI (r=-0.43 and -0.38, respectively), similar to those observed in the chamber measurements (r=0.57 for daily emissions and r=-0.40 for CH4 yield). The CH4 metrics of peak height and peak amplitude were highly repeatable (ranging from 0.76 to 0.81), comparable to the high repeatability of production traits (ranging from 0.63 to 0.99), and were more repeatable than chamber CH4 measurements (0.31 for daily emissions and 0.03 for CH4 yield). This study recommends quantifying CH4 emissions from the maximum amplitude of an eructation.
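As an illustration of the peak-analysis metrics above, the sketch below extracts eructation peaks from a spot CH4 trace and summarises them by peak height, peak amplitude (prominence) and average concentration. The original analysis used Matlab's Signal Processing Toolbox; this SciPy version, and all thresholds, are illustrative assumptions.

```python
# Minimal sketch of eructation-peak extraction from a CH4 concentration trace.
# Thresholds, units and the minimum spacing between peaks are illustrative assumptions.
import numpy as np
from scipy.signal import find_peaks

def summarise_milking(ch4_ppm, sample_rate_hz=1.0, min_prominence=50.0):
    # ch4_ppm: CH4 concentration trace recorded during one milking
    peaks, props = find_peaks(ch4_ppm, prominence=min_prominence,
                              distance=int(10 * sample_rate_hz))  # assume >=10 s between eructations
    return {
        "mean_peak_height": float(np.mean(ch4_ppm[peaks])) if len(peaks) else 0.0,
        "mean_peak_amplitude": float(np.mean(props["prominences"])) if len(peaks) else 0.0,
        "mean_concentration": float(np.mean(ch4_ppm)),
        "n_eructations": int(len(peaks)),
    }
```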
Audio source separation is the task of isolating sound sources that are simultaneously active in a room captured by a set of microphones. Convolutive audio source separation with an equal number of sources and microphones has a number of shortcomings, including the complexity of frequency-domain ICA, the permutation ambiguity and the problem's scalability with an increasing number of sensors. In this paper, the authors propose a multiple-microphone audio source separation algorithm based on previous work of Mitianoudis and Davies (2003). Complex FastICA is substituted with RobustICA, increasing robustness and performance. The permutation ambiguity is solved using two methodologies. The first uses the Likelihood Ratio Jump solution, which is modified to decrease computational complexity in the case of multiple microphones. The second applies the MuSIC algorithm as a preprocessing step to the previous solution, with promising results.
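One ingredient of such frequency-domain pipelines is re-ordering the separated sources consistently across frequency bins. The sketch below does this with a simple amplitude-envelope correlation heuristic, used here only as a stand-in for the Likelihood Ratio Jump and MuSIC-based solutions the paper actually proposes; shapes and the greedy matching are illustrative assumptions.

```python
# Minimal sketch of per-frequency permutation alignment via envelope correlation.
# This is a generic baseline heuristic, not the paper's Likelihood Ratio Jump method.
import numpy as np
from itertools import permutations

def align_permutations(S):
    # S: (F, N, T) separated sources per frequency bin (N sources, T time frames)
    F_bins, N, _ = S.shape
    ref = np.abs(S[0])                              # amplitude envelopes at the reference bin
    aligned = S.copy()
    for f in range(1, F_bins):
        env = np.abs(S[f])
        best, best_score = None, -np.inf
        for perm in permutations(range(N)):         # brute force is fine for small N
            score = sum(np.corrcoef(ref[i], env[p])[0, 1] for i, p in enumerate(perm))
            if score > best_score:
                best, best_score = perm, score
        aligned[f] = S[f][list(best)]
        ref = 0.5 * ref + 0.5 * np.abs(aligned[f])  # running reference envelope
    return aligned

# Toy usage: 257 frequency bins, 2 sources, 100 frames of complex STFT-domain outputs
aligned = align_permutations(np.random.randn(257, 2, 100) + 1j * np.random.randn(257, 2, 100))
```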
Purpose – The rapid growth of social media has rendered opinion and sentiment mining an important area of research with a wide range of applications. This paper aims to focus on the Greek language and the microblogging platform Twitter, investigating methods for extracting the emotion of individual tweets as well as population emotion for different subjects (hashtags).
Design/methodology/approach – The authors propose and investigate the use of emotion lexicon-based methods as a means of extracting emotion/sentiment information from social media. The authors compare several approaches for measuring the intensity of six emotions: anger, disgust, fear, happiness, sadness and surprise. To evaluate the effectiveness of the methods, the authors develop a benchmark dataset of tweets, manually rated by two humans.
Findings – A new sentiment lexicon is developed for use in Web applications. The authors then assess the performance of the methods with the new lexicon and find improved results.
Research limitations/implications – The automated emotion results seem promising and correlate with real user emotion. At this point, the authors make some interesting observations about the lexicon-based approach which lead to the need for a new, better emotion lexicon.
Practical implications – The authors examine the variation of emotion intensity over time for selected hashtags and associate it with real-world events.
Originality/value – The originality of this research lies in the development of a training set of tweets, manually annotated by two independent raters. The authors "transfer" the sentiment information of these annotated tweets, in a meaningful way, to the set of words that appear in them.
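As a rough illustration of lexicon-based intensity scoring, the sketch below averages per-emotion word scores over the words of a tweet that appear in the lexicon. The toy lexicon entries and the simple whitespace tokenisation are illustrative assumptions; the paper compares several such aggregation schemes.

```python
# Minimal sketch of lexicon-based emotion intensity scoring for a tweet.
# Lexicon entries and tokenisation are toy assumptions for illustration.
emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
lexicon = {                                    # word -> intensity per emotion (toy values)
    "awful": {"anger": 0.6, "disgust": 0.7, "fear": 0.3, "happiness": 0.0,
              "sadness": 0.5, "surprise": 0.1},
    "great": {"anger": 0.0, "disgust": 0.0, "fear": 0.0, "happiness": 0.8,
              "sadness": 0.0, "surprise": 0.2},
}

def tweet_emotion(tweet):
    words = [w for w in tweet.lower().split() if w in lexicon]
    if not words:
        return {e: 0.0 for e in emotions}
    return {e: sum(lexicon[w][e] for w in words) / len(words) for e in emotions}

print(tweet_emotion("what a great surprise"))
```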
Audio source separation is the task of isolating sound sources that are simultaneously active in a room captured by a set of microphones. Convolutive audio source separation with an equal number of sources and microphones has a number of shortcomings, including the complexity of frequency-domain ICA, the permutation ambiguity and the problem's scalability with an increasing number of sensors. In this paper, the authors propose a multiple-microphone audio source separation algorithm based on previous work of Mitianoudis and Davies. Complex FastICA is substituted with RobustICA, increasing robustness and performance. The permutation ambiguity is solved using the Likelihood Ratio Jump solution, which is modified to decrease computational complexity in the case of multiple microphones.
The rapid growth of social media has rendered opinion and sentiment mining an important area of research with a wide range of applications. We focus on the Greek language and the microblogging platform Twitter, investigating methods for extracting the sentiment of individual tweets as well as population sentiment for different subjects (hashtags). The proposed methods are based on a sentiment lexicon. We compare several approaches for measuring the intensity of "Anger", "Disgust", "Fear", "Happiness", "Sadness" and "Surprise". To evaluate the effectiveness of our methods, we develop a benchmark dataset of tweets, manually rated by two humans. Our automated sentiment results seem promising and correlate with real user sentiment. Finally, we examine the variation of sentiment intensity over time for selected hashtags and associate it with real-world events.