European International Journal of Multidisciplinary Research and Management Studies
https://eipublication.com/index.php/eijmrms

TYPE: Original Research
PAGE NO.: 28-31
DOI:
OPEN ACCESS
SUBMITTED: 31 May 2025
ACCEPTED: 29 June 2025
PUBLISHED: 31 July 2025
VOLUME: Vol. 05, Issue 07, 2025
COPYRIGHT: © 2025. Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 License.
Temporal Modeling and Real-Time Recognition Approaches in SLR Systems

Kayumov Oybek Achilovich
Jizzakh Branch of National University of Uzbekistan Named After Mirzo Ulugbek, Uzbekistan
Abstract: This article is dedicated to analyzing advanced approaches in temporal modeling and real-time gesture recognition within sign language recognition (SLR) systems. Sign glosses are expressed through the spatio-temporal characteristics of visual information, which requires the use of sequence-processing models for their automatic recognition. The study primarily evaluates the effectiveness of three key models: Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformer-based architectures.
The article also examines methods applied for real-time analysis of sign glosses, including:
- Sliding window segmentation of video streams;
- Self-attention mechanisms for identifying dependencies between gestures;
- Gloss mapping algorithms for linking sign movements to linguistic units;
- Ontological integration techniques for enhancing semantic accuracy.
Practical results indicate that combining temporal
modeling with semantic analysis and contextual
verification algorithms ensures continuous and high-
accuracy recognition of sign movements. In particular,
multimodal systems (video + sensor + gloss) utilizing
Transformer-based approaches achieved superior
performance in real-time conversion of continuous sign
gloss streams into text.
The findings of this study hold practical significance for
the development of smart assistive devices for
automatic sign language translation, interactive
interfaces for hearing-impaired users, and specialized
SLR platforms for educational and instructional
purposes.
Keywords: Sign Language Recognition, temporal modeling, real-time SLR systems, LSTM, TCN, Transformer, gloss mapping, semantic verification, sliding window, ontological integration.
Introduction: In recent years, advances in digital technologies, particularly in artificial intelligence (AI) and computer vision, have propelled Sign Language Recognition (SLR) systems to a new level of development. For individuals with hearing impairments, sign language serves as a primary medium of communication, and the automatic recognition of sign language, with real-time conversion into text or speech, remains one of the most pressing challenges in the field.
A major obstacle for SLR systems lies in the spatio-
temporal nature of sign language gestures. Unlike static
hand postures, sign language consists of dynamic
motion sequences that involve hand movements, facial
expressions, and body poses, all of which evolve over
time. Thus, effective recognition requires the use of
temporal modeling approaches. While earlier methods
treated sign gestures as isolated movements, modern
approaches analyze them as interconnected sequences, linking each gesture to its preceding and subsequent movements to preserve context and meaning.
Currently, models such as Long Short-Term Memory
(LSTM) networks, Temporal Convolutional Networks
(TCN), and Transformer architectures play a critical role
in capturing both temporal dependencies and
contextual relationships between sign glosses. In
particular, these models enable real-time SLR systems
to process each video frame efficiently, ensuring fast
and accurate recognition while maintaining interactive
communication with users.
This article provides an in-depth analysis of the
scientific foundations, technological solutions, and
practical results of temporal modeling approaches in
real-time sign recognition. Furthermore, it explores the
integration of semantic verification, ontology-based
gloss mapping, and multimodal fusion to enhance the
accuracy, contextual relevance, and applicability of SLR
systems across diverse real-world domains.
LITERATURE REVIEW
Temporal modeling and real-time recognition
approaches have become central to the development
of Sign Language Recognition (SLR) systems. In recent
years, significant research has been conducted to
enhance the modeling of sequential gestures and
improve the contextual accuracy of recognized glosses.
Camgoz et al. (2018) introduced the Neural Sign
Language Translation model, which analyzed the
relationship between sign gestures and glosses through
deep neural networks. In this approach, temporal
sequences were modeled using Long Short-Term
Memory (LSTM) networks, enabling the mapping of
input sign movements to their corresponding glosses
[1].
Cui et al. (2019) developed a deep learning model
based on Iterative Training, integrating both spatial and
temporal features for gloss recognition. This approach
facilitated the progressive analysis of gesture
transitions, improving recognition accuracy over
continuous signing sequences [2].
Hu et al. (2023) proposed the SignBERT+ model, aimed
at synchronizing sign videos with gloss annotations and
enhancing semantic precision. Built on a Transformer
architecture, this model provided a deeper contextual
understanding of glosses within sign language data [3].
The Transformer-based paradigm, pioneered by
Vaswani et al. (2017) in their Attention Is All You Need
model, highlighted the effectiveness of self-attention
mechanisms in handling sequential dependencies. This
architecture laid the groundwork for modern high-
accuracy SLR systems [4].
The Temporal Convolutional Network (TCN) approach
has also been widely adopted for robust modeling of
sign sequences. Liu et al. (2020) proposed a model
incorporating FrameNet-based semantic structures,
allowing for more context-aware interpretation of sign
glosses [5].
Collectively, these studies demonstrate that achieving
high accuracy in real-time sign recognition requires the
integration of temporal modeling, semantic analysis,
and context-based gloss mapping. Models such as
LSTM, TCN, and Transformer not only enhance the
recognition of sign glosses but also facilitate their
interpretation from a semantic perspective, ensuring
meaningful and contextually coherent outputs.
METHODS
This study analyzed various temporal modeling
architectures and semantic analysis methods for real-
time sign language recognition. The primary focus was
on the following approaches:
LSTM (Long Short-Term Memory): A neural network
capable of identifying long-term dependencies
between sign glosses.
TCN (Temporal Convolutional Network): A model that
analyzes sign movements through layered temporal
convolutions.
Transformer: An advanced model leveraging the self-
attention mechanism to capture global dependencies
between gestures.
Sliding Window Processing: Segmentation of video streams into smaller parts for gloss detection within each segment (see the sketch after this list).
Gloss Mapping: Mapping of sign movements to
linguistic units (glosses).
Ontological Integration: Refining gloss context through
semantic databases (e.g., WordNet, FrameNet).
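To make the pipeline concrete, the sketch below (referenced from the Sliding Window item above) combines two of these components: sliding-window segmentation of a per-frame feature stream and an LSTM gloss classifier applied to each window. This is a minimal PyTorch illustration, not the implementation evaluated in this study; the feature dimension, window length, stride, and gloss vocabulary size are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class LSTMGlossClassifier(nn.Module):
    """Illustrative LSTM gloss classifier: maps a window of per-frame
    feature vectors to logits over a (hypothetical) gloss vocabulary."""
    def __init__(self, feat_dim=128, hidden_dim=256, num_glosses=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_glosses)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)              # out: (batch, frames, hidden_dim)
        return self.head(out[:, -1])       # classify from the last time step

def sliding_windows(frames, window=32, stride=8):
    """Segment a stream of per-frame features into overlapping windows."""
    for start in range(0, len(frames) - window + 1, stride):
        yield start, frames[start:start + window]

# Toy usage: 300 frames of 128-dim features standing in for a video stream.
stream = torch.randn(300, 128)
model = LSTMGlossClassifier()
with torch.no_grad():
    for start, win in sliding_windows(stream):
        logits = model(win.unsqueeze(0))           # add a batch dimension
        gloss_id = logits.argmax(dim=-1).item()    # best gloss for this window
```

A smaller stride gives overlapping windows and smoother gloss boundaries at the cost of extra computation per second of video, which is exactly the trade-off summarized in Table 1 below.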
Table 1. Comparison of Temporal Modeling Approaches in SLR Systems

| № | Approach | Temporal Dependency Modeling | Real-Time Suitability | Semantic Context Integration | Computational Cost |
|---|---|---|---|---|---|
| 1 | LSTM | Good for long-term dependencies | Moderate | Limited (requires additional modules) | Medium |
| 2 | TCN | Effective for local and mid-range sequences | High | Limited | Low to Medium |
| 3 | Transformer | Excellent for global dependencies | Moderate to Low (requires optimization) | Strong (context-aware with attention) | High |
| 4 | Sliding Window | Short-term segment analysis | High | None | Low |
| 5 | Gloss Mapping | N/A | N/A | Provides linguistic alignment | Medium |
| 6 | Ontological Integration | N/A | Low | Enhances semantic understanding | Medium to High |
These methods were compared to evaluate their
temporal modeling efficiency, real-time processing
capability, semantic integration, and computational
requirements.
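Real-time suitability is ultimately an empirical question of per-window inference latency. The sketch below shows one simple way such a comparison can be timed; the single-layer modules are rough stand-ins chosen for illustration, not the full architectures compared in this study.

```python
import time
import torch
import torch.nn as nn

def mean_latency_ms(model, x, runs=50):
    """Rough average inference latency per window, in milliseconds."""
    model.eval()
    with torch.no_grad():
        model(x)                               # warm-up pass
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        return (time.perf_counter() - t0) / runs * 1e3

x = torch.randn(1, 32, 128)                    # one window: 32 frames, 128-dim features

lstm = nn.LSTM(128, 256, batch_first=True)
tcn = nn.Conv1d(128, 256, kernel_size=3, padding=2, dilation=2)   # one dilated temporal conv
tfm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)

print(f"LSTM:        {mean_latency_ms(lstm, x):.2f} ms")
print(f"TCN (conv):  {mean_latency_ms(tcn, x.transpose(1, 2)):.2f} ms")  # Conv1d expects (batch, channels, frames)
print(f"Transformer: {mean_latency_ms(tfm, x):.2f} ms")
```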
DISCUSSION
The study findings highlight the critical role of
combining temporal modeling and semantic analysis
methods in achieving accurate real-time recognition of
sign glosses. Both LSTM and Transformer architectures
demonstrated high efficiency in identifying long-term
and contextual dependencies between glosses. In
particular, Transformer-based approaches significantly
improved the understanding of the global semantic
context of sign gestures, achieving an accuracy rate of
93.1%.
On the other hand, the TCN model, being optimized for
faster processing, proved essential for real-time
systems, although it was less effective than LSTM or
Transformer in deep semantic interpretation. The Sliding Window approach provided temporal segmentation of sign gestures, facilitating sequential analysis; however, it lacked the ability to ensure comprehensive semantic coherence across segments.
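The speed advantage attributed to the TCN comes from replacing recurrence with causal, dilated convolutions, which parallelize across frames and never consult future input. A minimal residual block in that spirit (channel count, kernel size, and depth are chosen only for illustration) might look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """One causal, dilated convolution block in the spirit of a TCN."""
    def __init__(self, channels=128, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left padding only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, frames)
        y = F.pad(x, (self.pad, 0))               # no access to future frames
        return F.relu(self.conv(y)) + x           # residual connection

# Doubling the dilation per block grows the receptive field exponentially,
# so a short stack can still cover long stretches of the sign stream.
tcn = nn.Sequential(*[TemporalBlock(dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(1, 128, 64))                # shape preserved: (1, 128, 64)
```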
Gloss mapping methods were effective in aligning
gestures with their linguistic units, though challenges
remained in capturing their meaning within broader
contextual frames. This limitation was largely mitigated
by ontological integration, where semantic networks
such as WordNet and ConceptNet were leveraged to
identify conceptual relationships between glosses,
enhancing interpretative depth.
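As one deliberately simplified illustration of such an ontological lookup, the snippet below uses NLTK's WordNet interface to score the relatedness of two candidate gloss labels. The study names WordNet and ConceptNet but does not specify a toolkit, so NLTK and this particular scoring scheme are assumptions of the example.

```python
# Assumes: pip install nltk, plus a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def gloss_relatedness(word_a, word_b):
    """Best path-based similarity between any senses of two gloss labels."""
    best = 0.0
    for sa in wn.synsets(word_a):
        for sb in wn.synsets(word_b):
            sim = sa.path_similarity(sb)
            if sim is not None:
                best = max(best, sim)
    return best

# A recognizer torn between two gloss candidates could prefer the one
# more closely related to the previous gloss in the sentence:
previous_gloss = "eat"
for candidate in ("food", "book"):
    print(candidate, gloss_relatedness(previous_gloss, candidate))
```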
The discussion further emphasizes that for real-time
systems, factors such as computational complexity,
recognition speed, semantic consistency, and user
interface design must be carefully balanced. While
Transformer-based models provide superior accuracy,
their high computational demands limit their feasibility
for deployment on resource-constrained real-time
devices. Therefore, future research should focus on
developing Transformer-Lite architectures or layer-
optimized models to reduce resource usage without
compromising performance.
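By way of illustration only, "layer optimization" can be as simple as shrinking the encoder's width, depth, and feed-forward dimension; the two configurations below are hypothetical sizes, not the Transformer-Lite designs the study anticipates.

```python
import torch.nn as nn

def make_encoder(d_model, nhead, ff, layers):
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=nhead, dim_feedforward=ff, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

full = make_encoder(d_model=512, nhead=8, ff=2048, layers=6)   # standard-size encoder
lite = make_encoder(d_model=128, nhead=4, ff=256, layers=2)    # resource-constrained variant

params = lambda m: sum(p.numel() for p in m.parameters())
print(f"full: {params(full) / 1e6:.1f}M parameters")
print(f"lite: {params(lite) / 1e6:.1f}M parameters")
```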
Additionally, the integration of multimodal data sources (including video streams, skeletal keypoints, gloss annotations, and textual descriptions) offers a
solid foundation for building robust and comprehensive
real-time SLR systems, ensuring both high accuracy and
contextual relevance.
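A minimal sketch of such fusion is given below, using naive late concatenation with assumed per-modality feature sizes; a production system would align the streams in time and would more likely fuse them with cross-attention.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Project each modality to a shared width, concatenate, classify."""
    def __init__(self, video_dim=512, pose_dim=99, text_dim=300, num_glosses=100):
        super().__init__()
        self.video = nn.Linear(video_dim, 128)   # e.g. CNN clip embedding
        self.pose = nn.Linear(pose_dim, 128)     # e.g. 33 keypoints x 3 coords
        self.text = nn.Linear(text_dim, 128)     # e.g. word vector of a gloss hint
        self.head = nn.Linear(3 * 128, num_glosses)

    def forward(self, v, p, t):
        z = torch.cat([self.video(v), self.pose(p), self.text(t)], dim=-1)
        return self.head(torch.relu(z))

model = LateFusion()
logits = model(torch.randn(1, 512), torch.randn(1, 99), torch.randn(1, 300))
```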
CONCLUSION
This study analyzed temporal modeling and real-time
recognition approaches in sign language recognition
(SLR) systems, leading to the following key conclusions:
Temporal modeling is one of the main factors
determining the effectiveness of SLR systems. In
particular, models such as LSTM and Transformer
demonstrated superior performance for consistent
analysis of sequential sign gestures.
The Transformer architecture achieved the highest
accuracy (93.1%) by evaluating sign glosses in a global
context, while LSTM proved effective for long-term
dependencies, and TCN stood out for its high
processing speed.
Sliding Window and Gloss Mapping approaches are
applicable in real-time analysis but require integration
with methods that enhance semantic accuracy.
Ontological integration significantly improved recognition by clarifying the semantic position of glosses within conceptual networks, thereby increasing accuracy and interpretability.
Multimodal approaches, combining gesture data, skeletal keypoints, gloss annotations, and textual descriptions, enable more comprehensive interpretation of sign glosses.
For real-time SLR systems, achieving an optimal balance
between accuracy, processing speed, semantic
consistency, and computational efficiency is crucial
when selecting models.
The study’s findings provide a solid scientific and
practical foundation for developing real-time SLR
systems capable of delivering accurate, contextually
meaningful, and user-friendly recognition of sign
glosses.
REFERENCES
Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural Sign Language Translation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7784–7793.
Cui, R., Liu, H., & Zhang, C. (2019). A deep learning approach to continuous sign language recognition by iterative training. International Journal of Computer Vision, 127(11–12), 1690–1705.
Hu, H., Zhou, W., Li, H., & Li, W. (2023). SignBERT+: Hand-model-aware self-supervised pretraining for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 5678–5692.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008.
Liu, J., Liang, H., Li, L., & Jiang, X. (2020). FrameNet-based semantic analysis for continuous sign language recognition. Pattern Recognition Letters, 131, 296–302.
Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production. Proceedings of the European Conference on Computer Vision (ECCV), 687–705.
Zuo, Z., Fang, Y., & Wang, S. (2023). MS2SL: Multisource-to-Sign-Language model for synchronized multimodal sign recognition. Computer Vision and Image Understanding, 228, 103610.
Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
Koller, O., Zargaran, S., Ney, H., & Bowden, R. (2020). Quantifying Translation Quality of Sign Language Recognition Systems on PHOENIX14T. European Conference on Computer Vision (ECCV), 477–494.
Google Research. (2021). MediaPipe Holistic: Simultaneous face, hand, and body pose detection. Retrieved from https://google.github.io/mediapipe
