We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework assesses the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.
@article{Mallis2025CADAssistant,title={CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers},author={Mallis, Dimitrios and Karadeniz, Ahmet Serdar and Cavada, Sebastian and Rukhovich, Danila and Foteinopoulou, Niki and Cherenkova, Kseniya and Kacem, Anis and Aouada, Djamila},journal={ArXiv},year={2025},}
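The iterative plan-act-observe behavior described above can be sketched in a few lines of Python. All names below are illustrative stand-ins, not the actual framework: the real system uses a VLLM planner that emits Python executed against the FreeCAD API.

```python
# Minimal plan-act-observe loop in the spirit of a tool-augmented CAD agent.
# `plan_step` stands in for the VLLM planner; `execute` stands in for the
# Python interpreter equipped with CAD tools. Both are hypothetical.

def run_agent(plan_step, execute, state, max_steps=5):
    """Repeatedly ask the planner for an action, execute it, and feed the
    observation back so the next action reflects the evolving design state."""
    history = []
    for _ in range(max_steps):
        action = plan_step(state, history)    # e.g. a code snippet from the planner
        if action is None:                    # planner signals completion
            break
        observation = execute(action, state)  # run the action, mutate the design state
        history.append((action, observation))
    return state, history

# Toy usage: a "planner" that draws two lines, then stops.
def toy_planner(state, history):
    return {"cmd": "add_line"} if len(history) < 2 else None

def toy_executor(action, state):
    state["lines"] = state.get("lines", 0) + 1
    return f"lines={state['lines']}"

final_state, trace = run_agent(toy_planner, toy_executor, {})
```

The key design point mirrored here is that the observation from each executed action re-enters the planner's context, which is what lets the agent adapt to the evolving CAD state.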
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and training dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained on a procedurally generated dataset of one million CAD sequences. CAD-Recode significantly outperforms existing methods across the DeepCAD, Fusion360, and real-world CC3D datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
@article{Rukhovich2025CADRecode,title={CAD-Recode: Reverse Engineering CAD Code from Point Clouds},author = {Rukhovich, Danila and Dupont, Elona and Mallis, Dimitrios and Cherenkova, Kseniya and Kacem, Anis and Aouada, Djamila},journal = {ArXiv},year = {2025},}
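The "CAD as Python code" representation can be illustrated with a toy sketch-extrude program. The class and functions below are invented for illustration and are much simpler than the code CAD-Recode actually emits, but they convey the idea that an executable program is itself the CAD model.

```python
# Hypothetical sketch-extrude program in the CAD-as-code style.
# A real CAD kernel would build a solid; here we only track primitives.

class Sketch:
    """A 2D sketch represented as a list of parametric primitives."""
    def __init__(self):
        self.primitives = []

    def circle(self, cx, cy, r):
        self.primitives.append(("circle", cx, cy, r))
        return self  # chaining keeps the "program" compact

def extrude(sketch, depth):
    """Stand-in for a CAD extrude operation: report what would be built."""
    return {"n_primitives": len(sketch.primitives), "depth": depth}

# Executing the program reconstructs the (toy) model.
solid = extrude(Sketch().circle(0, 0, 5).circle(10, 0, 2), depth=3.0)
```

Because the representation is ordinary Python, it is readable by any LLM that has seen code, which is what enables the editing and question-answering use cases mentioned in the abstract.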
This work introduces PICASSO, a framework for the parameterization of 2D CAD sketches from hand-drawn and precise sketch images. PICASSO converts a given CAD sketch image into parametric primitives that can be seamlessly integrated into CAD software. Our framework leverages rendering self-supervision to enable the pre-training of a CAD sketch parameterization network using sketch renderings only, thereby eliminating the need for corresponding CAD parameterization. Thus, we significantly reduce reliance on parameter-level annotations, which are often unavailable, particularly for hand-drawn sketches. The two primary components of PICASSO are (1) a Sketch Parameterization Network (SPN) that predicts a series of parametric primitives from CAD sketch images, and (2) a Sketch Rendering Network (SRN) that renders parametric CAD sketches in a differentiable manner and facilitates the computation of a rendering (image-level) loss for self-supervision. We demonstrate that the proposed PICASSO can achieve reasonable performance even when fine-tuned with only a small number of parametric CAD sketches. Extensive evaluation on the widely used SketchGraphs and CAD as Language datasets validates the effectiveness of the proposed approach on zero- and few-shot learning scenarios.
@article{Karadeniz2025PICASSO,title={PICASSO: A Feed-Forward Framework for Parametric Inference of CAD Sketches via Rendering Self-Supervision},author={Karadeniz, Ahmet Serdar and Mallis, Dimitrios and Mejri, Nesryne and Cherenkova, Kseniya and Kacem, Anis and Aouada, Djamila},journal={Winter Conference on Applications of Computer Vision (WACV)},year={2025},}
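The rendering (image-level) loss idea can be demonstrated without any network: softly render predicted primitives onto a grid and compare against a target raster. The crude Gaussian rasterizer below is an illustrative stand-in for the SRN, not the actual model.

```python
import numpy as np

# Illustrative image-level loss for rendering self-supervision: primitives
# (here, 2D points) are rendered softly and compared in image space, so no
# parameter-level annotation is needed. Names and renderer are assumptions.

def soft_render_points(points, size=32, sigma=2.0):
    """Render 2D points as a sum of Gaussians (a crude soft rasterizer)."""
    ys, xs = np.mgrid[0:size, 0:size]
    canvas = np.zeros((size, size))
    for px, py in points:
        canvas += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return np.clip(canvas, 0.0, 1.0)

def rendering_loss(pred_points, target_points):
    """Image-level L2 loss between rendered prediction and rendered target."""
    diff = soft_render_points(pred_points) - soft_render_points(target_points)
    return float(np.mean(diff ** 2))

loss_same = rendering_loss([(8, 8)], [(8, 8)])    # identical geometry
loss_diff = rendering_loss([(8, 8)], [(24, 24)])  # displaced geometry
```

Because every step is differentiable, the loss can supervise a parameterization network from renderings alone, which is the core of the pre-training scheme described above.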
This work presents DAVINCI, a unified architecture for single-stage Computer-Aided Design (CAD) sketch parameterization and constraint inference directly from raster sketch images. By jointly learning both outputs, DAVINCI minimizes error accumulation and enhances the performance of constrained CAD sketch inference. Notably, DAVINCI achieves state-of-the-art results on the large-scale SketchGraphs dataset, demonstrating effectiveness on both precise and hand-drawn raster CAD sketches. To reduce DAVINCI’s reliance on large-scale annotated datasets, we explore the efficacy of CAD sketch augmentations. We introduce Constraint-Preserving Transformations (CPTs), i.e. random permutations of the parametric primitives of a CAD sketch that preserve its constraints. This data augmentation strategy allows DAVINCI to achieve reasonable performance when trained with only 0.1% of the SketchGraphs dataset. Furthermore, this work contributes a new version of SketchGraphs, augmented with CPTs. The newly introduced CPTSketchGraphs dataset includes 80 million CPT-augmented sketches, thus providing a rich resource for future research in the CAD sketch domain.
@article{Karadeniz2024DAVINCI,title={DAVINCI: A Single-Stage Architecture for Constrained CAD Sketch Inference},author={Karadeniz, Ahmet Serdar and Mallis, Dimitrios and Mejri, Nesryne and Cherenkova, Kseniya and Kacem, Anis and Aouada, Djamila},journal={British Machine Vision Conference (BMVC)},year={2024},}
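A Constraint-Preserving Transformation can be sketched concretely: permute the order of a sketch's primitives and remap the indices that its constraints reference, so the same geometric relations hold after the shuffle. The data layout below is an assumption made for illustration.

```python
import random

# Illustrative CPT: reorder primitives and remap constraint references so the
# constrained sketch is unchanged up to primitive ordering.

def cpt_permute(primitives, constraints, rng=None):
    rng = rng or random.Random(0)
    order = list(range(len(primitives)))
    rng.shuffle(order)                                # order[new] = old index
    inverse = {old: new for new, old in enumerate(order)}
    new_primitives = [primitives[old] for old in order]
    new_constraints = [(kind, inverse[i], inverse[j]) for kind, i, j in constraints]
    return new_primitives, new_constraints

prims = ["lineA", "lineB", "arcC"]
cons = [("perpendicular", 0, 1), ("tangent", 1, 2)]
p2, c2 = cpt_permute(prims, cons)

# Each constraint still relates the same primitives after permutation.
assert all(p2[i] == prims[oi] and p2[j] == prims[oj]
           for (_, i, j), (_, oi, oj) in zip(c2, cons))
```

Since a sketch with n primitives admits up to n! such reorderings, this is a cheap way to multiply a constrained-sketch dataset, which is how the 80-million-sketch CPTSketchGraphs resource is obtained.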
3D reverse engineering, in which a CAD model is inferred given a 3D scan of a physical object, is a research direction that offers many promising practical applications. This paper proposes TransCAD, an end-to-end transformer-based architecture that predicts the CAD sequence from a point cloud. TransCAD leverages the structure of CAD sequences by using a hierarchical learning strategy. A loop refiner is also introduced to regress sketch primitive parameters. Rigorous experimentation on the DeepCAD and Fusion360 datasets shows that TransCAD achieves state-of-the-art results. The result analysis is supported by a proposed metric for CAD sequences, the mean Average Precision of CAD Sequence, that addresses the limitations of existing metrics.
@article{dupont2024transcad,title={TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds},author={Dupont, Elona and Cherenkova, Kseniya and Mallis, Dimitrios and Gusev, Gleb and Kacem, Anis and Aouada, Djamila},journal={European Conference on Computer Vision (ECCV)},year={2024},}
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples.
@article{Anastasakis2023WACV,title={Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction},author={Anastasakis, Zacharias and Mallis, Dimitrios and Diomataris, Markos and Alexandridis, George and Kollias, Stefanos and Pitsikalis, Vassilis},journal={Winter Conference on Applications of Computer Vision (WACV)},year={2024},}
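The object-level masked modeling idea can be shown in miniature: hide a subset of object features in a scene and predict them from the visible ones. The "model" below is just the mean of visible features, the crudest possible context predictor; the paper learns this mapping with a network.

```python
# Toy analogue of Masked Bounding Box Reconstruction: reconstruct masked
# object features from the unmasked objects in the same scene.
# The mean-of-visible predictor is an illustrative stand-in for the network.

def reconstruct_masked(objects, masked_idx):
    """Predict each masked object's feature vector from the visible objects."""
    visible = [f for i, f in enumerate(objects) if i not in masked_idx]
    mean = [sum(col) / len(visible) for col in zip(*visible)]
    return {i: mean for i in masked_idx}

scene = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # per-object feature vectors
recon = reconstruct_masked(scene, masked_idx={1})
```

The training signal is the gap between the reconstruction and the true masked feature; closing it forces the model to encode how objects in a scene relate, which is exactly the information relationship detection needs.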
Dimitrios Mallis, Sk Aziz Ali, Elona Dupont, Kseniya Cherenkova, Ahmet Serdar Karadeniz, Mohammad Sadil Khan, Anis Kacem, Gleb Gusev, and Djamila Aouada
International Conference on Computer Vision Workshops (ICCVW), 2023
Recent breakthroughs in geometric Deep Learning (DL) and the availability of large Computer-Aided Design (CAD) datasets have advanced the research on learning CAD modeling processes and relating them to real objects. In this context, 3D reverse engineering of CAD models from 3D scans is considered to be one of the most sought-after goals for the CAD industry. However, recent efforts assume multiple simplifications limiting applicability in real-world settings. The SHARP Challenge 2023 aims at pushing the research a step closer to the real-world scenario of CAD reverse engineering through dedicated datasets and tracks. In this paper, we define the proposed SHARP 2023 tracks, describe the provided datasets, and propose a set of baseline methods along with suitable evaluation metrics to assess the performance of the track solutions. All proposed datasets along with useful routines and the evaluation metrics are publicly available.
@article{Mallis2023SHARPC2,title={SHARP Challenge 2023: Solving CAD History and pArameters Recovery from Point clouds and 3D scans. Overview, Datasets, Metrics, and Baselines},author={Mallis, Dimitrios and Ali, Sk Aziz and Dupont, Elona and Cherenkova, Kseniya and Karadeniz, Ahmet Serdar and Khan, Mohammad Sadil and Kacem, Anis and Gusev, Gleb and Aouada, Djamila},journal={International Conference on Computer Vision Workshops (ICCVW)},year={2023},}
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model’s visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model’s visual perception capability and lead to a performance increase, without requiring training on explicit grounding annotations.
@article{Parelli2023InterpretableVQ,title={Interpretable Visual Question Answering via Reasoning Supervision},author={Parelli, Maria and Mallis, Dimitrios and Diomataris, Markos and Pitsikalis, Vassilis},journal={International Conference on Image Processing (ICIP)},year={2023},}
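The attention-alignment idea admits a compact numerical sketch: treat the question-guided and reasoning-guided attentions as distributions and penalize their divergence. The symmetric KL divergence used below is an illustrative stand-in; the similarity loss in the paper may differ.

```python
import numpy as np

# Illustrative similarity loss between two attention distributions over the
# same scene elements: zero when they agree, large when they focus on
# different regions. The symmetric-KL choice is an assumption.

def kl(p, q, eps=1e-9):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def attention_alignment_loss(attn_question, attn_reasoning):
    p = attn_question / attn_question.sum()   # normalize to distributions
    q = attn_reasoning / attn_reasoning.sum()
    return 0.5 * (kl(p, q) + kl(q, p))

a = np.array([0.7, 0.2, 0.1])
loss_agree = attention_alignment_loss(a, a)        # identical attention
loss_clash = attention_alignment_loss(a, a[::-1])  # reversed attention
```

Minimizing such a term pulls the question-conditioned attention toward the regions highlighted by the correct reasoning, which is the mechanism behind the grounding improvement described above.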
This paper proposes a novel paradigm for the unsupervised learning of object landmark detectors. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we propose a self-training approach where, departing from generic keypoints, a landmark detector and descriptor is trained to improve itself, tuning the keypoints into distinctive landmarks. To this end, we propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. With a shared backbone for the landmark detector and descriptor, the keypoint locations progressively converge to stable landmarks, filtering out those that are less stable. Compared to previous works, our approach can learn points that are more flexible in terms of capturing large viewpoint changes. We validate our method on a variety of difficult datasets, including LS3D, BBCPose, Human3.6M and PennAction, achieving new state-of-the-art results.
@article{Mallis2023landm,author={Mallis, Dimitrios and Sanchez, Enrique and Bell, Matt and Tzimiropoulos, Georgios},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},title={From Keypoints to Object Landmarks via Self-Training Correspondence: A novel approach to Unsupervised Landmark Discovery},year={2023},}
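The alternation at the heart of the self-training scheme (cluster to get pseudo-labels, then sharpen features toward their pseudo-class) can be caricatured numerically. The center-pull update below stands in for contrastive learning on a real detector/descriptor network; everything here is illustrative.

```python
import numpy as np

# Toy sketch of the clustering / feature-learning alternation:
# (1) assign each descriptor to its nearest cluster center (pseudo-label),
# (2) move descriptors toward their assigned center (stand-in for the
#     contrastive update that makes each pseudo-class distinctive).

def assign_pseudo_labels(features, centers):
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

def self_train(features, centers, steps=10, lr=0.5):
    for _ in range(steps):
        labels = assign_pseudo_labels(features, centers)          # clustering step
        features = features + lr * (centers[labels] - features)   # feature update
    return features, assign_pseudo_labels(features, centers)

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 2))                  # initial "generic keypoint" features
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])     # two pseudo-classes
feats, labels = self_train(feats, centers)
```

After a few rounds the features collapse onto stable clusters, mirroring how generic keypoints progressively converge to distinctive, repeatable landmarks in the full method.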
In this work we introduce an incremental learning framework for Click-Through-Rate (CTR) prediction and demonstrate its effectiveness for Taboola’s massive-scale recommendation service. Our approach enables rapid capture of emerging trends through warm-starting from previously deployed models and fine-tuning on "fresh" data only. Past knowledge is maintained via a teacher-student paradigm, where the teacher acts as a distillation mechanism, mitigating the catastrophic forgetting phenomenon. Our incremental learning framework enables significantly faster training and deployment cycles (12x speedup). We demonstrate a consistent Revenue Per Mille (RPM) lift over multiple traffic segments and a significant CTR increase on newly introduced items.
@article{Katsileros2022AnIL,title={An Incremental Learning framework for Large-scale CTR Prediction},author={Katsileros, Petros and Mandilaras, Nikiforos and Mallis, Dimitrios and Pitsikalis, Vassilis and Theodorakis, Stavros and Chamiel, Gil},journal={Conference on Recommender Systems},year={2022},}
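The teacher-student objective can be sketched with a toy logistic model: fit the student on fresh labels while a penalty keeps its predictions near the previously deployed teacher. The squared-difference distillation term and all names below are illustrative assumptions, not the production loss.

```python
import numpy as np

# Illustrative distillation objective: cross-entropy on fresh data plus a
# penalty keeping the student's predictions close to the teacher's, which is
# what mitigates catastrophic forgetting when training on fresh data only.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill_loss(w_student, X, y, teacher_probs, alpha=0.5):
    p = sigmoid(X @ w_student)
    bce = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    kd = np.mean((p - teacher_probs) ** 2)  # stay close to the teacher
    return bce + alpha * kd

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w0 = np.zeros(2)                          # warm-started student (toy)
teacher_close = np.full(2, 0.5)           # teacher agrees with current student
teacher_far = np.array([0.0, 1.0])        # teacher disagrees strongly
```

The `alpha` knob trades plasticity (learning the fresh trend) against stability (retaining the teacher's past knowledge).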
Object parts, also known as landmarks, convey information about an object’s shape and spatial configuration in 3D space, especially for deformable objects. The goal of landmark detection is to have a model that, for a particular object instance, can estimate the locations of its parts. Research in this field is mainly driven by supervised approaches, where a sufficient amount of human-annotated data is available. As annotating landmarks for all objects is impractical, this thesis focuses on learning landmark detectors without supervision. Despite good performance in limited scenarios (objects showcasing minor rigid deformation), unsupervised landmark discovery mostly remains an open problem. Existing work fails to capture semantic landmarks, i.e. points similar to the ones assigned by human annotators, and may not generalise well to highly articulated objects like the human body, complicated backgrounds or large viewpoint variations.
In this thesis, we propose a novel self-training framework for the discovery of unsupervised landmarks. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we depart from generic keypoints and train a landmark detector and descriptor to improve itself, tuning the keypoints into distinctive landmarks. We propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. Our detector can discover highly semantic landmarks, that are more flexible in terms of capturing large viewpoint changes and out-of-plane rotations (3D rotations). New state-of-the-art performance is achieved in multiple challenging datasets.
@phdthesis{Mallisthesis,title={Unsupervised Landmark Discovery via Self-Training Correspondence},author={Mallis, Dimitrios},school={University of Nottingham},note={PhD Thesis},year={2022},}
This paper addresses the problem of unsupervised discovery of object landmarks. We take a different path compared to existing works, based on two novel perspectives: (1) Self-training: starting from generic keypoints, we propose a self-training approach where the goal is to learn a detector that improves itself, becoming more and more tuned to object landmarks. (2) Correspondence: we identify correspondence as a key objective for unsupervised landmark discovery and propose an optimization scheme which alternates between recovering object landmark correspondence across different images via clustering and learning an object landmark descriptor without labels. Compared to previous works, our approach can learn landmarks that are more flexible in terms of capturing large changes in viewpoint. We show the favourable properties of our method on a variety of difficult datasets including LS3D, BBCPose and Human3.6M.
@article{unsupervLandm2020,title={Unsupervised Learning of Object Landmarks via Self-Training Correspondence},author={Mallis, Dimitrios and Sanchez, Enrique and Bell, Matt and Tzimiropoulos, Georgios},journal={Neural Information Processing Systems},year={2020},}
The aim of this study was to compare metrics for quantifying enteric methane (CH4) emissions from individual cows during milking using frequent spot measurements and peak analysis methods. An infrared gas analyser was used to measure the CH4 emitted by cows, and eructation peaks were identified using the MATLAB Signal Processing Toolbox. CH4 emissions were quantified by gas peak height, peak amplitude and average concentration, and were expressed in grams per day and as CH4 yield (grams per kilogram of dry matter intake (DMI)). Peak analysis measurements of CH4 were obtained from 36 cows during 2,474 milkings, during which cows were fed a ration containing between 39 and 70% forage. Spot measurements of CH4 were compared to a separate dataset of 196 chamber CH4 records from another group of 105 cows, which were fed a ration containing between 25 and 80% forage. The results showed that the metrics of CH4 peak height and CH4 peak amplitude demonstrated similar positive relationships between daily CH4 emissions and DMI (both r=0.37), and a negative relationship between CH4 yield and DMI (r=-0.43 and -0.38, respectively), as observed in the chamber measurements (r=0.57 for daily emissions and r=-0.40 for CH4 yield). The CH4 metrics of peak height and peak amplitude were highly repeatable (ranging from 0.76 to 0.81), comparable to the high repeatability of production traits (ranging from 0.63 to 0.99), and were more repeatable than chamber CH4 measurements (0.31 for daily emissions and 0.03 for CH4 yield). This study recommends quantifying CH4 emissions from the maximum amplitude of an eructation.
@article{emmissionstudy,title={Modified approach to estimating daily methane emissions of dairy cows by measuring filtered eructations during milking},journal={Landbauforschung},year={2019},author={Bell, Matt and Garnsworthy, Phil and Mallis, Dimitrios and Eckard, Richard and Moate, Peter and Yan, Tianhai},}
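The eructation-peak idea can be shown in miniature: find local maxima above a threshold in a CH4 concentration trace and summarize them by height. The study used MATLAB's Signal Processing Toolbox; the plain-Python peak finder below is an illustrative analogue, with invented numbers.

```python
# Toy peak analysis of a CH4 concentration trace: detect eructation peaks as
# strict local maxima above a threshold, then take the maximum amplitude,
# the metric the study recommends. Trace values are invented for illustration.

def find_peaks(signal, threshold):
    """Indices of strict local maxima exceeding `threshold`."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > threshold
            and signal[i] > signal[i - 1]
            and signal[i] > signal[i + 1]]

trace = [0.1, 0.2, 1.5, 0.3, 0.2, 2.0, 0.4, 0.1]   # hypothetical CH4 readings
peaks = find_peaks(trace, threshold=1.0)
max_amplitude = max(trace[i] for i in peaks)
```

Filtering to eructation peaks, rather than averaging the raw trace, is what makes the spot measurements repeatable enough to compare with chamber records.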
Audio source separation is the task of isolating sound sources that are active simultaneously in a room captured by a set of microphones. Convolutive audio source separation with an equal number of sources and microphones has a number of shortcomings, including the complexity of frequency-domain ICA, the permutation ambiguity, and the problem’s scalability with an increasing number of sensors. In this paper, the authors propose a multiple-microphone audio source separation algorithm based on a previous work of Mitianoudis and Davies (2003). Complex FastICA is substituted by Robust ICA, increasing robustness and performance. The permutation ambiguity is solved using two methodologies. The first uses the Likelihood Ratio Jump solution, which is now modified to decrease computational complexity in the case of multiple microphones. The application of the MUSIC algorithm, as a preprocessing step to the previous solution, forms a second methodology with promising results.
@article{Mallis2018ConvolutiveAS,title={Convolutive audio source separation using robust ICA and an intelligent evolving permutation ambiguity solution},author={Mallis, Dimitrios and Sgouros, Thomas and Mitianoudis, Nikolaos},journal={Evolving Systems},year={2018},}
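The permutation ambiguity itself is easy to demonstrate: per-frequency ICA recovers the sources of each bin in an arbitrary order, so bins must be aligned before reassembling the signals. The envelope-correlation alignment below is a simplified stand-in for the Likelihood Ratio Jump solution, not the paper's method.

```python
import numpy as np

# Illustration of resolving the permutation ambiguity for one frequency bin:
# keep or swap the bin's two channels depending on which ordering best
# correlates with a reference pair of amplitude envelopes.

def align_bin(ref_env, bin_env):
    """Return bin channels in the order that best matches the reference."""
    keep = (np.corrcoef(ref_env[0], bin_env[0])[0, 1]
            + np.corrcoef(ref_env[1], bin_env[1])[0, 1])
    swap = (np.corrcoef(ref_env[0], bin_env[1])[0, 1]
            + np.corrcoef(ref_env[1], bin_env[0])[0, 1])
    return bin_env if keep >= swap else bin_env[::-1]

t = np.linspace(0, 1, 100)
ref = np.stack([np.sin(2 * np.pi * 3 * t) ** 2,   # envelope of source 1
                np.cos(2 * np.pi * 3 * t) ** 2])  # envelope of source 2
permuted = ref[::-1]                               # this bin came out swapped
aligned = align_bin(ref, permuted)
```

With many microphones the number of candidate orderings per bin grows factorially, which is why the paper's reduced-complexity variant of the Likelihood Ratio Jump matters.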
Purpose – The rapid growth of social media has rendered opinion and sentiment mining an important area of research with a wide range of applications. This paper aims to focus on the Greek language and the microblogging platform Twitter, investigating methods for extracting the emotion of individual tweets as well as population emotion for different subjects (hashtags).
Design/methodology/approach – The authors propose and investigate the use of emotion lexicon-based methods as a means of extracting emotion/sentiment information from social media. The authors compare several approaches for measuring the intensity of six emotions: anger, disgust, fear, happiness, sadness and surprise. To evaluate the effectiveness of the methods, the authors develop a benchmark dataset of tweets, manually rated by two humans.
Findings – Development of a new sentiment lexicon for use in Web applications. The authors then assess the performance of the methods with the new lexicon and find improved results.
Research limitations/implications – The automated emotion results seem promising and correlate to real user emotion. At this point, the authors make some interesting observations about the lexicon-based approach which lead to the need for a new, better emotion lexicon.
Practical implications – The authors examine the variation of emotion intensity over time for selected hashtags and associate it with real-world events.
Originality/value – The originality in this research is the development of a training set of tweets, manually annotated by two independent raters. The authors “transfer” the sentiment information of these annotated tweets, in a meaningful way, to the set of words that appear in them.
@article{Kalamatianos2018TowardsTC,title={Towards the creation of an emotion lexicon for microblogging sentiment analysis},author={Kalamatianos, Georgios and Symeonidis, Symeon and Mallis, Dimitrios and Arampatzis, Avi},journal={Journal of Systems and Information Technology},year={2018},}
Audio source separation is the task of isolating sound sources that are active simultaneously in a room captured by a set of microphones. Convolutive audio source separation with an equal number of sources and microphones has a number of shortcomings, including the complexity of frequency-domain ICA, the permutation ambiguity, and the problem’s scalability with an increasing number of sensors. In this paper, the authors propose a multiple-microphone audio source separation algorithm based on a previous work of Mitianoudis and Davies. Complex FastICA is substituted by Robust ICA, increasing robustness and performance. The permutation ambiguity is solved using the Likelihood Ratio Jump solution, which is now modified to decrease computational complexity in the case of multiple microphones.
@article{Mallis2016ConvolutiveAS,title={Convolutive Audio Source Separation Using Robust ICA and Reduced Likelihood Ratio Jump},author={Mallis, Dimitrios and Sgouros, Thomas and Mitianoudis, Nikolaos},journal={Artificial Intelligence Applications and Innovations},year={2016},}
The rapid growth of social media has rendered opinion and sentiment mining an important area of research with a wide range of applications. We focus on the Greek language and the microblogging platform Twitter, investigating methods for extracting the sentiment of individual tweets as well as population sentiment for different subjects (hashtags). The proposed methods are based on a sentiment lexicon. We compare several approaches for measuring the intensity of “Anger”, “Disgust”, “Fear”, “Happiness”, “Sadness”, and “Surprise”. To evaluate the effectiveness of our methods, we develop a benchmark dataset of tweets, manually rated by two humans. Our automated sentiment results seem promising and correlate to real user sentiment. Finally, we examine the variation of sentiment intensity over time for selected hashtags, and associate it with real-world events.
@article{Kalamatianos2015SentimentAO,title={Sentiment analysis of Greek tweets and hashtags using a sentiment lexicon},author={Kalamatianos, Georgios and Mallis, Dimitrios and Symeonidis, Symeon and Arampatzis, Avi},journal={Panhellenic Conference on Informatics},year={2015},doi={10.1145/2801948.2802010},}
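The lexicon-based scoring at the core of this line of work can be sketched in a few lines: look each word of a tweet up in an emotion lexicon and average the per-emotion intensities. The tiny lexicon and all values below are invented for illustration; the actual lexicon covers Greek vocabulary at scale.

```python
# Minimal lexicon-based emotion scoring: average per-word intensities over
# the words of a tweet, for each of the six emotions studied.
# The three-word lexicon here is a made-up illustration.

LEXICON = {
    "happy": {"happiness": 0.9},
    "afraid": {"fear": 0.8},
    "angry": {"anger": 0.85},
}
EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise")

def score_tweet(text):
    """Per-emotion intensity of a tweet as the mean over its lexicon hits."""
    words = text.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return {e: (sum(h.get(e, 0.0) for h in hits) / len(hits) if hits else 0.0)
            for e in EMOTIONS}

scores = score_tweet("so happy today")
```

Aggregating such per-tweet scores over all tweets carrying a hashtag gives the population-level emotion series whose variation over time the papers relate to real-world events.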