Temporal Modeling and Real-Time Recognition Approaches in SLR Systems

Abstract

This article is dedicated to analyzing advanced approaches in temporal modeling and real-time gesture recognition within sign language recognition (SLR) systems. Sign glosses are expressed through the spatio-temporal characteristics of visual information, which requires the use of sequence-processing models for their automatic recognition. The study primarily evaluates the effectiveness of three key models: Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformer-based architectures.

The article also examines methods applied for real-time analysis of sign glosses, including:

Sliding window segmentation of video streams;

Self-attention mechanisms for identifying dependencies between gestures;

Gloss mapping algorithms for linking sign movements to linguistic units;

Ontological integration techniques for enhancing semantic accuracy.

Practical results indicate that combining temporal modeling with semantic analysis and contextual verification algorithms ensures continuous and high-accuracy recognition of sign movements. In particular, multimodal systems (video + sensor + gloss) utilizing Transformer-based approaches achieved superior performance in real-time conversion of continuous sign gloss streams into text.

The findings of this study hold practical significance for the development of smart assistive devices for automatic sign language translation, interactive interfaces for hearing-impaired users, and specialized SLR platforms for educational and instructional purposes.

European International Journal of Multidisciplinary Research and Management Studies
Source type: Journals
Years of coverage from 2021
inLibrary
Google Scholar
CC BY f
28-31

Downloads

Download data is not yet available.
To share
Kayumov Oybek Achilovich. (2025). Temporal Modeling and Real-Time Recognition Approaches in SLR Systems. European International Journal of Multidisciplinary Research and Management Studies, 5(07), 28–31. Retrieved from https://inlibrary.uz/index.php/eijmrms/article/view/133851
Crossref
Сrossref
Scopus
Scopus

Abstract

This article is dedicated to analyzing advanced approaches in temporal modeling and real-time gesture recognition within sign language recognition (SLR) systems. Sign glosses are expressed through the spatio-temporal characteristics of visual information, which requires the use of sequence-processing models for their automatic recognition. The study primarily evaluates the effectiveness of three key models: Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformer-based architectures.

The article also examines methods applied for real-time analysis of sign glosses, including:

Sliding window segmentation of video streams;

Self-attention mechanisms for identifying dependencies between gestures;

Gloss mapping algorithms for linking sign movements to linguistic units;

Ontological integration techniques for enhancing semantic accuracy.

Practical results indicate that combining temporal modeling with semantic analysis and contextual verification algorithms ensures continuous and high-accuracy recognition of sign movements. In particular, multimodal systems (video + sensor + gloss) utilizing Transformer-based approaches achieved superior performance in real-time conversion of continuous sign gloss streams into text.

The findings of this study hold practical significance for the development of smart assistive devices for automatic sign language translation, interactive interfaces for hearing-impaired users, and specialized SLR platforms for educational and instructional purposes.


background image

European International Journal of Multidisciplinary Research
and Management Studies

28

https://eipublication.com/index.php/eijmrms

TYPE

Original Research

PAGE NO.

28-31

DOI

10.55640/eijmrms-05-07-04



OPEN ACCESS

SUBMITED

31 May 2025

ACCEPTED

29 June 2025

PUBLISHED

31 July 2025

VOLUME

Vol.05 Issue07 2025

COPYRIGHT

© 2025 Original content from this work may be used under the terms
of the creative commons attributes 4.0 License.

Temporal Modeling and
Real-Time Recognition
Approaches in SLR
Systems

Kayumov Oybek Achilovich

Jizzakh Branch of National University of Uzbekistan Named After Mirzo
Ulugbek, Uzbekistan

Abstract:

This article is dedicated to analyzing

advanced approaches in temporal modeling and real-
time gesture recognition within sign language
recognition (SLR) systems. Sign glosses are expressed
through the spatio-temporal characteristics of visual
information, which requires the use of sequence-
processing models for their automatic recognition. The
study primarily evaluates the effectiveness of three key
models: Long Short-Term Memory (LSTM) networks,
Temporal Convolutional Networks (TCN), and
Transformer-based architectures.

The article also examines methods applied for real-time
analysis of sign glosses, including:

Sliding window segmentation of video streams;

Self-attention

mechanisms

for

identifying

dependencies between gestures;

Gloss mapping algorithms for linking sign movements
to linguistic units;

Ontological integration techniques for enhancing
semantic accuracy.

Practical results indicate that combining temporal
modeling with semantic analysis and contextual
verification algorithms ensures continuous and high-
accuracy recognition of sign movements. In particular,
multimodal systems (video + sensor + gloss) utilizing
Transformer-based approaches achieved superior
performance in real-time conversion of continuous sign
gloss streams into text.

The findings of this study hold practical significance for
the development of smart assistive devices for
automatic sign language translation, interactive


background image

European International Journal of Multidisciplinary Research
and Management Studies

29

https://eipublication.com/index.php/eijmrms

European International Journal of Multidisciplinary Research and Management Studies

interfaces for hearing-impaired users, and specialized
SLR platforms for educational and instructional
purposes.

Keywords

: Sign Language Recognition, temporal

modeling, real-time SLR systems, LSTM, TCN,
Transformer, gloss mapping, semantic verification,
sliding window, ontological integration.

Introduction:

In recent years, advances in digital

technologies, particularly in artificial intelligence (AI)
and computer vision, have propelled Sign Language
Recognition (SLR) systems to a new level of
development.

For

individuals

with

hearing

impairments, sign language serves as a primary
medium of communication, and the automatic
recognition

of

sign

language

with

real-time

conversion into text or speech

remains one of the

most pressing challenges in the field.

A major obstacle for SLR systems lies in the spatio-
temporal nature of sign language gestures. Unlike static
hand postures, sign language consists of dynamic
motion sequences that involve hand movements, facial
expressions, and div poses, all of which evolve over
time. Thus, effective recognition requires the use of
temporal modeling approaches. While earlier methods
treated sign gestures as isolated movements, modern
approaches

analyze

them

as

interconnected

sequences, linking each gesture to its preceding and
subsequent movements to preserve context and
meaning.

Currently, models such as Long Short-Term Memory
(LSTM) networks, Temporal Convolutional Networks
(TCN), and Transformer architectures play a critical role
in capturing both temporal dependencies and
contextual relationships between sign glosses. In
particular, these models enable real-time SLR systems
to process each video frame efficiently, ensuring fast
and accurate recognition while maintaining interactive
communication with users.

This article provides an in-depth analysis of the
scientific foundations, technological solutions, and
practical results of temporal modeling approaches in
real-time sign recognition. Furthermore, it explores the
integration of semantic verification, ontology-based
gloss mapping, and multimodal fusion to enhance the
accuracy, contextual relevance, and applicability of SLR
systems across diverse real-world domains.

LITERATURE REVIEW

Temporal modeling and real-time recognition
approaches have become central to the development
of Sign Language Recognition (SLR) systems. In recent

years, significant research has been conducted to
enhance the modeling of sequential gestures and
improve the contextual accuracy of recognized glosses.

Camgoz et al. (2018) introduced the Neural Sign
Language Translation model, which analyzed the
relationship between sign gestures and glosses through
deep neural networks. In this approach, temporal
sequences were modeled using Long Short-Term
Memory (LSTM) networks, enabling the mapping of
input sign movements to their corresponding glosses
[1].

Cui et al. (2019) developed a deep learning model
based on Iterative Training, integrating both spatial and
temporal features for gloss recognition. This approach
facilitated the progressive analysis of gesture
transitions, improving recognition accuracy over
continuous signing sequences [2].

Hu et al. (2023) proposed the SignBERT+ model, aimed
at synchronizing sign videos with gloss annotations and
enhancing semantic precision. Built on a Transformer
architecture, this model provided a deeper contextual
understanding of glosses within sign language data [3].

The Transformer-based paradigm, pioneered by
Vaswani et al. (2017) in their Attention Is All You Need
model, highlighted the effectiveness of self-attention
mechanisms in handling sequential dependencies. This
architecture laid the groundwork for modern high-
accuracy SLR systems [4].

The Temporal Convolutional Network (TCN) approach
has also been widely adopted for robust modeling of
sign sequences. Liu et al. (2020) proposed a model
incorporating FrameNet-based semantic structures,
allowing for more context-aware interpretation of sign
glosses [5].

Collectively, these studies demonstrate that achieving
high accuracy in real-time sign recognition requires the
integration of temporal modeling, semantic analysis,
and context-based gloss mapping. Models such as
LSTM, TCN, and Transformer not only enhance the
recognition of sign glosses but also facilitate their
interpretation from a semantic perspective, ensuring
meaningful and contextually coherent outputs.

METHODS

This study analyzed various temporal modeling
architectures and semantic analysis methods for real-
time sign language recognition. The primary focus was
on the following approaches:

LSTM (Long Short-Term Memory): A neural network
capable of identifying long-term dependencies
between sign glosses.

TCN (Temporal Convolutional Network): A model that
analyzes sign movements through layered temporal


background image

European International Journal of Multidisciplinary Research
and Management Studies

30

https://eipublication.com/index.php/eijmrms

European International Journal of Multidisciplinary Research and Management Studies

convolutions.

Transformer: An advanced model leveraging the self-
attention mechanism to capture global dependencies
between gestures.

Sliding Window Processing: Segmentation of video
streams into smaller parts for gloss detection within
each segment.

Gloss Mapping: Mapping of sign movements to
linguistic units (glosses).

Ontological Integration: Refining gloss context through
semantic databases (e.g., WordNet, FrameNet).

Comparison of Temporal Modeling Approaches in SLR
Systems

table 1

Approach

Temporal

Dependency

Modeling

Real-Time

Suitability

Semantic

Context

Integration

Computational

Cost

1

LSTM

Good

for

long-term
dependencies

Moderate

Limited
(requires
additional
modules)

Medium

2

TCN

Effective for
local

and

mid-range
sequences

High

Limited

Low to Medium

3 Transformer

Excellent for
global
dependencies

Moderate to
Low
(requires
optimization)

Strong
(context-
aware

with

attention)

High

4

Sliding

Window

Short-term
segment
analysis

High

None

Low

5

Gloss

Mapping

N/A

N/A

Provides
linguistic
alignment

Medium

6

Ontological

Integration

N/A

Low

Enhances
semantic
understanding

Medium

to

High

These methods were compared to evaluate their
temporal modeling efficiency, real-time processing
capability, semantic integration, and computational
requirements.

DISCUSSION

The study findings highlight the critical role of
combining temporal modeling and semantic analysis
methods in achieving accurate real-time recognition of
sign glosses. Both LSTM and Transformer architectures
demonstrated high efficiency in identifying long-term
and contextual dependencies between glosses. In
particular, Transformer-based approaches significantly
improved the understanding of the global semantic
context of sign gestures, achieving an accuracy rate of
93.1%.

On the other hand, the TCN model, being optimized for
faster processing, proved essential for real-time
systems, although it was less effective than LSTM or
Transformer in deep semantic interpretation. The
Sliding

Window

approach

provided

temporal

segmentation of sign gestures, facilitating sequential
analysis; however, it lacked the ability to ensure
comprehensive semantic coherence across segments.

Gloss mapping methods were effective in aligning
gestures with their linguistic units, though challenges
remained in capturing their meaning within broader
contextual frames. This limitation was largely mitigated
by ontological integration, where semantic networks
such as WordNet and ConceptNet were leveraged to
identify conceptual relationships between glosses,


background image

European International Journal of Multidisciplinary Research
and Management Studies

31

https://eipublication.com/index.php/eijmrms

European International Journal of Multidisciplinary Research and Management Studies

enhancing interpretative depth.

The discussion further emphasizes that for real-time
systems, factors such as computational complexity,
recognition speed, semantic consistency, and user
interface design must be carefully balanced. While
Transformer-based models provide superior accuracy,
their high computational demands limit their feasibility
for deployment on resource-constrained real-time
devices. Therefore, future research should focus on
developing Transformer-Lite architectures or layer-
optimized models to reduce resource usage without
compromising performance.

Additionally, the integration of multimodal data
sources

including video streams, skeletal keypoints,

gloss annotations, and textual descriptions

offers a

solid foundation for building robust and comprehensive
real-time SLR systems, ensuring both high accuracy and
contextual relevance.

CONCLUSION

This study analyzed temporal modeling and real-time
recognition approaches in sign language recognition
(SLR) systems, leading to the following key conclusions:

Temporal modeling is one of the main factors
determining the effectiveness of SLR systems. In
particular, models such as LSTM and Transformer
demonstrated superior performance for consistent
analysis of sequential sign gestures.

The Transformer architecture achieved the highest
accuracy (93.1%) by evaluating sign glosses in a global
context, while LSTM proved effective for long-term
dependencies, and TCN stood out for its high
processing speed.

Sliding Window and Gloss Mapping approaches are
applicable in real-time analysis but require integration
with methods that enhance semantic accuracy.

Ontological

integration

significantly

improved

recognition by clarifying the semantic position of
glosses within conceptual networks, thereby increasing
accuracy and interpretability.

Multimodal approaches

combining gesture data,

skeletal keypoints, gloss annotations, and textual
descriptions

enable

more

comprehensive

interpretation of sign glosses.

For real-time SLR systems, achieving an optimal balance
between accuracy, processing speed, semantic
consistency, and computational efficiency is crucial
when selecting models.

The study’s

findings provide a solid scientific and

practical foundation for developing real-time SLR
systems capable of delivering accurate, contextually
meaningful, and user-friendly recognition of sign
glosses.

REFERENCES

Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., &
Bowden, R. (2018). Neural Sign Language Translation.
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 7784

7793.

Cui, R., Liu, H., & Zhang, C. (2019). A deep learning
approach to continuous sign language recognition by
iterative training. International Journal of Computer
Vision, 127(11

12), 1690

1705.

Hu, H., Zhou, W., Li, H., & Li, W. (2023). SignBERT+:
Hand-model-aware self-supervised pretraining for sign
language understanding. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 45(5), 5678

5692.

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017).
Attention Is All You Need. Advances in Neural
Information Processing Systems (NeurIPS), 30, 5998

6008.

Liu, J., Liang, H., Li, L., & Jiang, X. (2020). FrameNet-
based semantic analysis for continuous sign language
recognition. Pattern Recognition Letters, 131, 296

302.

Saunders, B., Camgoz, N. C., & Bowden, R. (2020).
Progressive Transformers for End-to-End Sign Language
Production. Proceedings of the European Conference
on Computer Vision (ECCV), 687

705.

Zuo, Z., Fang, Y., & Wang, S. (2023). MS2SL:
Multisource-to-Sign-Language model for synchronized
multimodal sign recognition. Computer Vision and
Image Understanding, 228, 103610.

Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2019).

Multimodal machine learning: A survey and taxonomy.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 41(2), 423

443.

Koller, O., Zargaran, S., Ney, H., & Bowden, R. (2020).
Quantifying Translation Quality of Sign Language
Recognition Systems on PHOENIX14T. European
Conference on Computer Vision (ECCV), 477

494.

Google

Research.

(2021).

MediaPipe

Holistic:

Simultaneous face, hand, and div pose detection.
Retrieved from

https://google.github.io/mediapipe

References

Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural Sign Language Translation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7784–7793.

Cui, R., Liu, H., & Zhang, C. (2019). A deep learning approach to continuous sign language recognition by iterative training. International Journal of Computer Vision, 127(11–12), 1690–1705.

Hu, H., Zhou, W., Li, H., & Li, W. (2023). SignBERT+: Hand-model-aware self-supervised pretraining for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 5678–5692.

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008.

Liu, J., Liang, H., Li, L., & Jiang, X. (2020). FrameNet-based semantic analysis for continuous sign language recognition. Pattern Recognition Letters, 131, 296–302.

Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production. Proceedings of the European Conference on Computer Vision (ECCV), 687–705.

Zuo, Z., Fang, Y., & Wang, S. (2023). MS2SL: Multisource-to-Sign-Language model for synchronized multimodal sign recognition. Computer Vision and Image Understanding, 228, 103610.

Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.

Koller, O., Zargaran, S., Ney, H., & Bowden, R. (2020). Quantifying Translation Quality of Sign Language Recognition Systems on PHOENIX14T. European Conference on Computer Vision (ECCV), 477–494.

Google Research. (2021). MediaPipe Holistic: Simultaneous face, hand, and body pose detection. Retrieved from https://google.github.io/mediapipe