Authors

  • Kawaljeet Singh Chadha
    University of the Cumberlands, Williamsburg, KY, USA

DOI:

https://doi.org/10.71337/inlibrary.uz.ijdsml.111738

Keywords:

Insurance fraud detection Machine learning ETL pipeline temporal behavior modeling Real-time fraud scoring.

Abstract

The insurance industry is also affected by insurance fraud, which incurs massive financial losses and operational inefficiencies. Current fraud detection methods tend to be based on rule-based systems and static Extract, Transform, Load (ETL) pipelines, which are unable to keep up with the pace of rapidly evolving fraud tactics. However, these conventional approaches exhibit high false-positive rates, limited flexibility, and cannot perform real-time analysis, causing delayed detection and increased operational costs. This article describes the integration of machine learning (ML) techniques into Extract, Transform, and Load (ETL) pipelines to facilitate real-time, data-driven fraud identification during insurance claims processing. This system features embedded supervised machine learning classifiers within the ETL workflow, enabling dynamic analysis of claims data during ingestion and transformation. Temporal behavior modelling, behavior modelling, and external data source enrichment, co-enabled with fraud auto-registry, will allow the system to improve the detection of complex behaviors over time. Scalability and near real-time processing are supported by the pipeline orchestration, resulting in timely fraud risk scoring. The results of experiments demonstrate that the proposed methods yield a significant improvement in detection accuracy and latency reduction compared to traditional methods. By incorporating dimensionality reduction techniques, further optimization of model performance can be achieved. With this approach, claims processing can effectively evolve in lockstep with dynamic and ever-changing scales, adapting without impacting efficiency and resiliency. Ultimately, an ML-augmented ETL pipeline is proposed, which provides insurers with a powerful tool for reducing fraud losses while maintaining agility and compliance.


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

410

INTERNATIONAL JOURNAL OF DATA SCIENCE AND MACHINE LEARNING (ISSN: 2692-5141)

Volume 05, Issue 01, 2025, pages 410-436

Published Date: - 25-06-2025

Doi: -

https://doi.org/10.55640/ijdsml-05-01-30


Machine Learning

Augmented ETL Pipelines for Fraud

Resistant

Insurance Claims Processing

Kawaljeet Singh Chadha

University of the Cumberlands, Williamsburg, KY, USA


ABSTRACT

The insurance industry is also affected by insurance fraud, which incurs massive financial losses and operational
inefficiencies. Current fraud detection methods tend to be based on rule-based systems and static Extract,
Transform, Load (ETL) pipelines, which are unable to keep up with the pace of rapidly evolving fraud tactics.
However, these conventional approaches exhibit high false-positive rates, limited flexibility, and cannot perform
real-time analysis, causing delayed detection and increased operational costs. This article describes the integration
of machine learning (ML) techniques into Extract, Transform, and Load (ETL) pipelines to facilitate real-time, data-
driven fraud identification during insurance claims processing. This system features embedded supervised machine
learning classifiers within the ETL workflow, enabling dynamic analysis of claims data during ingestion and
transformation. Temporal behavior modelling, behavior modelling, and external data source enrichment, co-
enabled with fraud auto-registry, will allow the system to improve the detection of complex behaviors over time.
Scalability and near real-time processing are supported by the pipeline orchestration, resulting in timely fraud risk
scoring. The results of experiments demonstrate that the proposed methods yield a significant improvement in
detection accuracy and latency reduction compared to traditional methods. By incorporating dimensionality
reduction techniques, further optimization of model performance can be achieved. With this approach, claims
processing can effectively evolve in lockstep with dynamic and ever-changing scales, adapting without impacting
efficiency and resiliency. Ultimately, an ML-augmented ETL pipeline is proposed, which provides insurers with a
powerful tool for reducing fraud losses while maintaining agility and compliance.

Key words:

Insurance fraud detection, Machine learning, ETL pipeline, temporal behavior modeling, Real-time fraud

scoring.

1. INTRODUCTION

The global insurance industry continues to face a significant and persistent problem of insurance fraud, resulting in
substantial financial losses of billions of dollars per year. Specifically, this type of fraudulent activity occurs during
the claims processing stage when dishonest policyholders or organized groups file false, excessive, or otherwise
fraudulent claims to obtain unearned monetary compensation. Fraud detection in insurance claims remains a
challenge due to the complexity of these claims, which involve numerous data points with diverse data formats and
a high volume of daily transactions.

Insurance companies, in the past, have relied on rule-based detection systems and static ETL pipelines to process
claims and alert them to suspicious activities. For the rule-based system, predefined conditions and thresholds are


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

411

applied to identify potential fraud. However, such systems tend to be inflexible and struggle to adjust to ever-
changing patterns of fraudulent tactics and newly emerging patterns of deceit. As a result, these methods typically
lead to a high rate of false positives, thereby straining claims adjusters with inappropriate investigations and low
operational efficiency. At the same time, the core focus of traditional ETL pipelines, which extract, clean, and load
data in preparation for downstream analytics, has remained outside of native support for predictive intelligence or
real-time decision-making. So fraud detection often falls behind itself instead of being proactive, missing chances
to intervene early.

In recent times, advances in machine learning (ML) have led to new possibilities for enhancing fraud detection
within the insurance domain. ML models can learn from historical data, identify complex patterns, and evolve to
keep pace with changing fraud behaviour over time. With ML algorithms embedded directly into the claims
processing pipeline, the pipeline becomes intelligent and automated, featuring real-time risk scoring and more
sophisticated fraud detection. This type of integration enables the system to consume not only structured data,
such as claim amounts and policyholder information but also unstructured data, including descriptions or
documents, allowing for the identification of anomalies that traditional rules would otherwise miss. Based on this,
the goal of this approach is inspired by the successful applications of predictive analytics in life insurance
underwriting, where supervised machine learning models (e.g., decision trees, regression techniques, and neural
networks) have been applied to more precisely assess applicant risk. In such cases, Correlation-Based Feature
Selection (CFS) and Principal Component Analysis (PCA) have been instrumental methods for input variable
refinement, noise removal, and model robustness as well as interpretability.

This thesis fills the gap between static claims processing pipelines and dynamic, intelligent fraud detection by
designing a machine learning-augmented Extract, Transform, Load (ETL) process for insurance claims. This thesis
includes the objectives of creating an ETL pipeline that utilizes supervised ML models for real-time fraud risk scoring,
intelligently features engineering strategies including temporal behaviour modelling and external data sources
enrichment, evaluates the performance of different ML a

lgorithms and dimensionality reduction techniques’

effectiveness for identifying fraudulent claims and showing the operational viability and scalability of the solution
using industry standard orchestration and visualization tools. This research aims to address these objectives by
establishing a practical and scalable approach that provides fraud resistance to insurance claims processing systems
without disrupting existing operational workflows. With this advancement, detection accuracy can be improved,
financial losses can be reduced, and trust between the insurer and policyholder can be increased.

2. Related Work

2.1 Traditional ETL Pipelines in Insurance Operations

In insurance operations, claims data has been traditionally managed via traditional ETL pipelines. These pipelines
focus on extracting data from diverse sources (claim forms, policy databases, and external records), transforming
the data into a standard format, and loading it into a centralised data warehouse. These processes are usually rule-
based, batch-oriented, and automate basic data cleaning, validation, and aggregation tasks. This approach
guarantees data consistency and facilitates reporting but cannot identify fraudulently paid claims in a dynamic
manner (in real-time). These pipelines tend to be static, which makes them less effective at adapting to emerging
new fraud patterns (

3

).

As illustrated in the image below, the ETL process involves three key stages: extract, transform, and load. In the
extraction phase, data is collected from various sources such as claim forms, policy databases, and external records.


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

412

Figure 1: ETL: Extract, Transform, and Load.

2.2 Current Landscape of Machine Learning in Fraud Detection

Fraud detection has been transformed by machine learning, as these systems can now learn complex patterns based
on historical data (

26

). To classify claims as legit or potentially fraudulent in the insurance domain, several

supervised learning models (such as Random Forest, XGBoost and neural networks) are already used. Here, these
models predict fraud likelihood by analysing attributes such as claimant behavior, claim amounts, and temporal
trends. They can handle high-dimensional data and nonlinear relationships more effectively, making them superior
to traditional rule-based systems. However, they are typically deployed separately from core ETL processes, which
prevents them from being integrated into real-time workflows.

2.3 Previous Integration of ML within ETL Flows

Machine learning is being embedded within ETL pipelines, making fraud detection a natural part of the data
processing lifecycle. These approaches typically involve adding feature engineering steps in the transformation
phase to extract temporal and behavioral features that enhance the quality of model inputs. After feature
extraction, ML inference can be performed before loading results into operational stores, allowing earlier fraud risk
scoring. While promising results have been reached, these systems face several challenges, including scaling to
larger models, managing model versions, and handling imbalanced data, an issue inherent in fraud datasets.
Moreover, most existing efforts do not allow for easy integration of diverse insurance products or external data
sources, limiting flexibility and adaptability (

13

).

2.4 Gaps in the Research Identified

The current state of research reveals several gaps that must be addressed to achieve ETL pipelines that ML
completely powers for fraud detection. Intelligent feature engineering, modelling of temporal behavioral aspects,
and supervised learning are not coupled in a standard pipeline framework. Little is known about the operational
performance of these systems in near-real-time settings either. Additionally, further exploration is needed to strike
a balance between utilizing dimensionality reduction techniques and maintaining model interpretability while
achieving high fraud detection accuracy. Closing these gaps would help make insurance claims fraud detection more
effective and scalable, enabling the development of adaptive and fraud-resistant systems in turn.


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

413

3. Conceptual Framework

An intelligent ETL pipeline for fraud-resistant insurance claims processing requires a well-structured conceptual
framework for development (

35

). As a foundation, this framework focuses on incorporating machine learning into

traditional ETL stages by converting static data workflows into adaptive, real-time ones that can detect fraudulent
activity while claims are being processed.

As illustrated in the image below, the framework integrates various influencing factors such as the economic
environment, governance and control, individual traits, and contested practices that collectively shape the
consequences of fraudulent behaviors.

Figure 2: Conceptual framework for the study of fraud and scandals (Van Driel, 2019).

3.1 Intelligent ETL Systems' Overview

This goes beyond basic data extraction, transformation, and loading through an intelligent ETL system that includes
advanced analytics, with machine learning built into the data flow. Instead of merely preparing data for offline
analysis, such a system continuously analyses incoming claims data, dynamically tagging risk and alerts. This enables
a move in fraud detection closer to the source of the data, eliminating virtually all the time between the submission
of a claim and the identification of fraud. Embedded intelligence is a core component of this approach, augmenting
the ETL stage with embedded intelligence (

21

). Risk-related metadata can be collected and associated with data

elements as the data is extracted. Feature engineering techniques from transformation stages add behavioral,
temporal, and contextual attributes to the dataset. Finally, loading processes embed real-time scoring from
machine learning models to route suspicious claims to manual review or payment stages before they reach.

3.2. Embedded Intelligence: ML inference within pipeline stages

Embedding machine learning inference within ETL pipeline stages ensures that predictive insights are generated
during data processing. Enabling this requires the integration of lightweight machine learning (ML) models that can
run efficiently under pipeline constraints such as latency and resource usage. Temporal features can be engineered
in the transformation phase by using sliding windows to capture changes in claimant behavior over time, for
example. Supervised classifiers are trained to estimate fraud risk scores, which are fed by these features. The claims


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

414

data is scored, and the results are attached to the data as new attributes before being loaded into operational
systems. Achieving this level of real-time processing and data consistency often involves leveraging scalable
database solutions like MongoDB, which bridges performance and reliability in big data environments (

8

).

Furthermore, MongoDB’s capabilities enable handling high data volumes and streaming workloads, making it

suitable for embedding ML inference within ETL pipelines that require real-time scoring and dynamic data updates
(

9

). Another essential element of this embedded intelligence is model versioning and management. This needs to

happen in a pipeline that allows models to be updated or completely swapped out with no interruption in data flow,
enabling continuous learning and adaptation to emerging fraud tactics. To maintain pipeline accuracy and relevance
over time, dynamic model lifecycle management is critical.

As shown in the Table below, key components of embedded machine learning inference include lightweight models,
temporal feature engineering, and fraud risk scoring, all designed to enable real-time, adaptive decision-making
within ETL pipelines.

Table 1: Key Components of Embedded Machine Learning Inference

Component

Description

Purpose

Lightweight ML

Models

Efficient supervised classifiers embedded

within the ETL pipeline.

Enable real-time inference under

latency/resource constraints.

Temporal Feature

Engineering

Use of sliding windows and time-based

aggregations during transformation stage.

Capture behavioral patterns and changes

in claimant behavior over time.

Fraud Risk Scoring

Inference applied during transformation to

generate a fraud probability or risk score.

Identify high-risk claims early in the data

flow.

Attribute

Augmentation

Fraud scores and derived features are

appended to the original claim record.

Enhance downstream analysis and

decision-making systems.

Model Versioning

Management of multiple versions of ML

models used for scoring.

Ensure smooth deployment and rollback

with minimal disruption to the pipeline.

Model Swapping and

Updating

Hot-swapping capability for deploying new or

retrained models into the pipeline.

Support continuous learning and

adaptability to new fraud patterns.

Dynamic Lifecycle

Management

Monitoring model performance and retraining

based on concept drift and fraud pattern

evolution.

Maintain relevance, accuracy, and

effectiveness of embedded models over

time.

3.3 Principles for the creation of the ML-ETL pipeline.

The design of machine learning-augmented ETL pipelines for insurance fraud detection is informed by several


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

415

guiding principles to guarantee that these systems are not only effective but also practical for deployment in the
real world. Modularity is paramount: each part of the pipeline (extraction, transformation or modelling) should be
made of loosely related components. With this modular structure, the individual parts of the pipeline can be
updated or replaced without requiring the replacement of the entire pipeline, making the system more flexible and
easier to maintain over time. Scalability is also important, as the pipeline must be able to scale up to the vast
volumes of claims data that modern insurance operations utilize. This can be achieved by running horizontally on a
distributed infrastructure, leveraging parallel processing and a distributed computing framework to ensure
performance in heavy data loads. Given the time-sensitivity of fraud detection, real-time processing capabilities are
crucial.

The pipeline should be tailored to minimize latency, enabling near real-time scoring and decision-making to detect
fraud before it occurs and prevent financial losses (

28

). What also makes the pipeline super effective is that it is

extensible and allows integration with many other external data sources (s., fraud registries, watch lists, third-party
APIs). It adds this external intelligence to enrich the feature set used by machine learning models and, in doing so,
improves detection accuracy. Finally, a key principle is interpretability, as results produced by fraud detection
models need to be explainable to aid downstream investigation processes and regulatory compliance. As a
consequence of this requirement, careful feature selection is necessary to satisfy it, along with transparent models
or explainability techniques that enable human analysts to trust and understand automated decisions. These
principles are combined to form the basis for constructing resilient, adaptable, and efficient ML-augmented ETL
pipelines for insurance fraud detection.

3.4 Expected Performance Improvements, Adaptability

Embedding machine learning directly into ETL workflows enables insurance firms to achieve significant performance
gains in both accuracy and efficiency within their fraud detection processes. Faster intervention through early risk
scoring reduces the window during which fraudulent claims can be paid out. Additionally, the continuous learning
feature enables adaptation to new fraud schemes over time, addressing one of the fundamental deficiencies of
static rule-based systems (

18

). The pipeline is also modular and extensible, allowing insurers to customize fraud

detection models and features for different types of insurance product lines, such as auto, health, or property
insurance. Adding a layer of sophistication that allows patterns to be observed across multiple claims or over
particular temporal periods enables the detection of staged or orchestrated fraud attempts. In addition, the larger
the scope of external data sources that are incorporated, the more comprehensive set of information is available
for risk assessment, thus enhancing the ability to uncover anomalies that internal data, standing alone, might not
reveal. In general, this framework is a practical recipe for building ETL pipelines that allow turning insurance claims
processing from a reactive, batch-oriented process into a proactive, intelligent process.

4. Intelligent Feature Pipeline Design

Many of the machine learning models that achieved success in this task, specifically in detecting fraud in insurance,
were based on high-quality and relevant features extracted from raw claims data (

12

). To design an intelligent

feature pipeline is to undergo a set of systematic processes for enriching, transforming, and organizing data
attributes to help machine learning algorithms distinguish between legitimate and fraudulent claims. This section
outlines key strategies for feature engineering in ETL workflows that enable better fraud detection.

As illustrated in the image below, the healthcare sector already leverages machine learning across various pillars

such as medical imaging diagnosis, smart health records, and crowdsourced data collection

to improve decision-


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

416

making and predictive capabilities.

Figure 3: Pillars of machine learning for healthcare services.

4.1 Dynamic Attributes and Ingestion Level Risk Tagging

During the data ingestion stage, incoming claims are assigned preliminary risk indicators based on available
metadata and initial validations (

14

). This could be as simple as risk tags generated from frequently occurring

unusual claims, frequent claims billing within an anomaly timeframe, or unusual policyholder details. This also
creates a dynamic list of attributes that change with additional data, the first layer of which is established during
early tagging, allowing prioritization of claims for more intensive analysis later. Attributes such as claim frequency
within a specific period or the behaviour of a claimant in terms of past behaviour are dynamic in the sense that they
are constantly updated when new claims enter the system. This provides the pipeline with a running context of risk,
which it can use for downstream feature generation and model scoring (

33

).

4.2 Feature enrichment with external data sources.

The integration of external data sources into the feature engineering pipeline is crucial in enhancing the accuracy
and robustness of fraud detection systems. Additionally, I utilize external data when available (public fraud
registries, industry watch lists, third-party and third-party databases) to supplement the information often missing
or incomplete in internal claim records. The system can uncover hidden patterns of suspicious activity, as the
claimant's details are cross-referenced with these external sources to match prior fraud allegations or known
associations with fraudulent entities. Fraud watchlists maintained by regulatory bodies or industry coalitions, for
example, enable us to flag potential exposure early in the claim processing process when an offender previously
engaged in fraudulent activity is identified. In addition, the geo-location data is used to detect claims coming from
a high-risk area where the fraud incidence rate is high. The geographic insight adds a spatial dimension to risk
assessment that data internal to the Firm cannot alone provide.

Government agencies and regulatory authorities provide Application Programming Interfaces (APIs) for real-time
data access. With these APIs, the ETL pipeline can automatically update when new flags, fraud schemes, or policy
changes are detected, dynamically adjusting the latest intelligence. The continual influx of external data allows


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

417

machine learning models to detect more sophisticated fraud tactics, such as those involving identity manipulation,
collusion, or coordinated fraud rings, resulting in an overall better performance of the fraud detection system.

As illustrated in Table 2, engineered features used in fraud detection span various types, such as temporal, external,
behavioral, and contextual attributes, each contributing to a more robust and insightful model.

Table 2: Sources and Examples of Engineered Features

Feature Type

Source Example

Description

Temporal Features Claim timestamp history

Days between claims, frequency in a 6-month
window

External Data

Fraud watch lists, geo-location APIs Region risk levels, entity flag status

Behavioral Patterns

Claimant’s claim submission

behavior

Claim amount trends, policy changes over time

Contextual
Features

Third-party databases, public records Employment verification, identity consistency checks

4.3 Generating Time-Based and Behavioral Features

Identifying staged or recurring fraud attempts requires temporal behaviour analysis. The behaviour dynamics in
terms of features such as the time elapsed since the last claim, frequency of claims over sliding windows, and
changes in claim patterns over time are critical for effective detection (

4

,

5

). Unsupervised features could

encompass the claimant's average claim value, types of claims or correlations between claim times and external
events (e.g. policy renewals). During transformation, techniques such as sliding windows or lag features can be
applied to capture these temporal patterns and make models capable of detecting anomalies that are missed by
static snapshots.

4.4 Modular feature templates for different insurance products.

The risk profile and claim characteristics of insurance products vary significantly. The ETL pipeline can customize
feature engineering processes for specific product types (e.g., auto, health, or property insurance) by designing
modular feature templates. These sets of features and methods define which features are considered relevant and
which method to use for extracting them from a specific domain. For example, auto insurance might highlight
vehicle history, repair costs, and accident reports as key features. In contrast, health insurance might focus on
treatment patterns and provider networks

simplified pipeline maintenance with ease of pace, allowing for

adjustments when new products or emerging fraud patterns require changes (

29

).

5. Temporal Behavior Modeling in Claims

To distinguish genuine from fraudulent claims, it is essential to understand how claims evolve. Fraudulent activities


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

418

often follow patterns that can only be illustrated by analyzing successive years of claims data. By integrating
temporal behavior modelling into the feature engineering process, fraud detection systems can better understand
these evolving dynamics, which in turn improves the accuracy of their predictions (

30

).

5.1 Time Dimension of Claims Analysis

Examining claims in isolation may not yield much; many fraudulent schemes involve a series of small steps over
time. Staged fraud, for example, may include filing numerous smaller claims spread out over weeks to evade
detection or repeatedly filing claims with very slight variations that appear valid as a single claim. This allows frauds
to be identified by looking for unusual frequencies, sequences, or timing, which may only be detectable when claims
are viewed within a temporal context. It also provides temporal analysis, which helps it detect new and evolving
fraud tactics with agility, moving beyond static rules. It also facilitates a better understanding of claimant behaviour,
enabling the differentiation between occasional anomalies and systematic fraudulent activities.

5.2. Temporal Feature Engineering Techniques

Techniques such as sliding windows, lag features, pattern mining, and frequency analysis are key methods used to
extract temporal features from claims data. Claim activities are aggregated using sliding windows over fixed periods
(such as the past month) to indicate recent trends, spikes, or clusters of activity. Past claim values or time since the
last claim are featured in lag features, which pick outbursts of rapid, successive claims or unusual changes (

19

).

Frequency analysis is how often a specific type of claim or behavior occurs over certain timeframes of the day or
week. Finally, pattern mining is used to discover sequences or combinations of claim attributes that are prevalent
in fraudulent cases (e.g., multiple claims associated with the same repair shop or geographic region). These
techniques collectively provide a comprehensive set of temporal features to augment the input for a machine-
learning model.

5.3 Integrating Supervised Classifiers

Static claim attributes and these temporal features are then fed into supervised machine learning classifiers such
as Random Forest or XGBoost. By including temporal data, models can identify patterns of how events unfold over
time and weigh more recent suspicious activity. This multi-dimensional input enhances the prediction of false
positives and false negatives.

As illustrated in the image below, these models fall within the broader context of Artificial Intelligence (AI),
specifically under machine learning approaches that leverage decision trees and ensemble techniques


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

419

Figure 4: Supervised Machine Learning

5.4 Benefits: Staged Fraud and Behavioral Anomalies Detection

The advantage of temporal behaviour modelling lies in its ability to detect complex fraud patterns and time spread
(staged fraud), where fraudsters go to great lengths to ensure that the fraudulent activity is spread out over time,
thereby evading traditional detection methods. The system analyses the sequence and frequency of claims,
identifying suspicious patterns that, when viewed individually, may appear benign but together reveal fraudulent
intent. This approach sheds light on more coordinated efforts at utilizing static claims rules that only concern
individual claim characteristics (

27

).

Temporal modelling has also been successful in detecting behavioral anomalies, such as the number of claims
submitted by a claimant, short intervals between consecutive claims, and hikes in the amount claimed. Often, these
changes indicate a deviation from normal behaviour and are essential early signs of potential fraud. This captures
the temporal dynamics of the data, enabling the system to perform proactive risk management, where airlines can
examine flagged claims immediately and take preventive actions to avoid incurring substantial financial damage
(

22

). On the whole, incorporating technological modelling into fraud detection pipelines enables a greater ability to

recognize subtle and evolving fraud schemes, thereby decreasing false negatives and improving the overall accuracy
of detection. The approach taken is dynamic, such that emerging fraud patterns are detected early, and insurance
claims processing systems are made resilient.

As shown in Table 3, temporal modeling enables the detection of various behavioral anomalies, such as staged
fraud, rapid claim frequency, and identity reuse patterns that might otherwise go unnoticed through static analysis.

Table 3: Behavioral Anomalies Detectable Through Temporal Modeling

Anomaly Type

Description

Staged Fraud

Spread-out fraudulent events designed to appear independent

Rapid Claim Frequency

Multiple claims within a short, unusual time frame


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

420

Anomaly Type

Description

Amount Escalation Pattern Gradual increase in claim amounts leading up to a large claim

Identity Reuse Patterns

Recurring use of same address/contact info across different identities

6. Machine Learning Models and Training

Fraud detection systems within ETL pipelines require appropriate machine-learning models and effective training
techniques. In this section, the model choices, data preparation strategies, and training processes implemented to
optimize fraud identification performance are explained.

6.1 Selection and Justification of the Model

Several supervised machine learning algorithms have demonstrated exemplary performance in fraud detection
problems (

23

). Random Forest, XGBoost and Multilayer Perceptron (MLP) are the models chosen for this study.

Each of them brings in its strength to a successful classification of fraudulent claims. An ensemble learning method,
such as Random Forest, builds multiple decision trees and aggregates their predictions. It is robust to noisy data,
applicable to high-dimensional feature spaces, and provides interpretability through feature importance measures.
It is well-suited for datasets typical of insurance claims, which can be complex (

36

,

37

). The gradient boosting

framework XGBoost is fast and accurate, as it builds the tree sequentially, correcting previous errors. Since
fraudulent claims constitute a small percentage of total claims, working with imbalanced datasets becomes
essential, posing a common and significant challenge in fraud detection. A Multilayer Perceptron (MLP) is a neural
network that is that can capture the relationship and interaction between features, even if this relationship is not
linear. MLPs require more computational resources and tuning and can model complex patterns by between-based
methods. Hence, they may be able to detect more subtle signs of fraud.

6.2. Data Preprocessing and Partitioning

Several steps are involved in preparing the claims data for machine learning. Imputation techniques specific to
feature types address missing values, ensuring that gaps in the data do not divert the model's training off track. This
means that categorical variables are encoded using one-hot encoding or target encoding based on the feature
distribution and model requirements. To perform an unbiased evaluation, the dataset is partitioned into training,
validation, and testing subsets. This approach preserves the proportion of fraudulent to legitimate claims across
data splits, thereby enhancing representativeness and ensuring better generalization.

6.3 Dealing with Imbalanced Data

Fraudulent claims are the minority in insurance fraud datasets, which are known for being severely imbalanced.
This does not balance out, and the model may become biased in favour of the majority classes, resulting in poor
fraud detection performance. To compensate for this, techniques such as the Synthetic Minority Over-sampling
Technique (SMOTE) are used to over-sample the minority class with synthetic examples, thereby balancing the
dataset. Otherwise, during model training, weighted loss functions are used so that fraud cases are penalised more
heavily in the loss function to push the model to learn to detect fraud, even in the presence of class imbalance.


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

421

6.4 Cross-Validation and Hyperparameter Tuning

Hyperparameter choice is crucial to model performance (

39

). Methods of cross-validation (such as k-fold cross-

validation) are used to assess the stability of a model and to avoid overfitting. It works by splitting the training data
into several folds, iterating the training and validation of the model to find the correct parameters. Combinations
of hyperparameters (such as tree depth, learning rate, and neuron counts) are explored using grid search or
randomized search techniques to fine-tune model behavior. This rigorous tuning maximizes the mode

l’s detection

accuracy and robustness.

As illustrated in the diagram below, training data and labels are combined with different sets of hyperparameter
values and passed through a learning algorithm. Each configuration is then evaluated based on its resulting
performance.

Figure 5: Model evaluation, model selection, and algorithm selection in machine learning

6.5 Interpretability Considerations

For regulatory compliance and operational trust, model interpretability is key to insurance applications. Tree-based
models produce feature importance scores that are used to identify key factors driving fraud predictions, which
investigators can use to understand and validate alerts. Techniques for explainability, such as SHAP (Shapley
Additive explanations), offer both local and global insights into the decisions made by models, contributing to
transparency. The interpretability of intrusion strategies supports human investigators in prioritizing investigations
and refining detection strategies (

38

).

7. Methodology

The dataset, preprocessing steps, ETL workflow implementation, machine learning pipeline integration, and
evaluation strategy described in this section enable us to develop and assess the fraud detection system within an
ETL context.

As shown in Table below, various tools such as Apache Airflow, n8n, and MLflow play critical roles in orchestrating
machine learning-enabled ETL pipelines by supporting tasks like workflow scheduling, low-code integration, and


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

422

model version tracking.

Table 4: Key Tools for Pipeline Orchestration

Tool/Platform Role in Pipeline

Example Usage

Apache Airflow Workflow orchestration

Scheduling data extraction and ML scoring

n8n

Visual low-code ETL orchestration Connecting APIs and triggering model runs

MLflow

Model tracking and versioning

Logging accuracy of each deployed model

PostgreSQL

Centralized data warehouse

Storing and joining processed claim datasets

7.1 Dataset Overview and Data Preprocessing

It contains historic insurance claims from multiple lines of business (auto, health, property insurance). Claim
attributes included in each record are claimant demographics, claim amounts, dates, claim types and status labels
to indicate fraud or legitimacy. Methods for targeted imputation are employed to address missing data. To reduce
distortion from outliers, median imputation is used for filling numerical features with missing values; categorical
variables are imputed with the most frequent category. Domain knowledge is used to guide the selective exclusion
or further investigation in cases of systematically missing data. Model efficiency is improved, and the curse of
dimensionality is relieved by using dimensionality reduction (

31

). The first approach is Correlation-based Feature

Selection (CFS), which builds on constructing subsets of features that have a high correlation with the resulting
fraud label while keeping the minimum correlation among themselves; the second is Principle component Analysis
(PCA), which transforms the features into orthogonal components onto which the maximum variance of the original
variables fall. Both aim to reduce redundancy while preserving predictive power.

7.2 ETL Workflow Engineering

It orchestrates data extraction, transformation, feature engineering, and loading (ETL) stages integrated with
machine learning scoring. Tools like Apache Airflow and n8n enable workflow automation and scheduling. The
extraction phase retrieves claim data from operational databases, as well as from external sources such as fraud
registries and public records. Cleaning, normalization, and feature enrichment are applied to the original data

including temporal behavior features

during the transformation stages (

16

). The pipeline hosts machine learning

models that are embedded within the pipeline to score claims in near real-time or in batches, depending on
operational requirements. Large-volume historical data analysis lends itself to batch processing, for example, while
real-time scoring enables fraud flagging in real time at the time of claim submission.

7.3 Machine Learning Pipeline Versioning

Continuous improvement and deployment of machine learning components are facilitated by encapsulating them
in modular, version-controlled artefacts. Versioning stores a record of changes in algorithms, hyper parameters,
and training data, ensuring relative independence and auditability. Some integration with ETL workflows is achieved


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

423

through APIs, aided by containerization technologies like Docker, which ensures scalability and portability across
environments. The ability to update the ML models independently of the entire ETL process provides added value
in this situation.

As illustrated in the Figure below MLOps Components Machine Learning Life Cycle, each stage

from data

ingestion, experimentation, and model training to deployment and monitoring

is designed to be traceable and

reproducible.

Figure 6:

MLOps Components Machine Learning Life Cycle

7.4 Evaluation Strategy.

Standard classification metrics, such as accuracy, precision, recall, F1-score, and Area under the Receiver Operating
Characteristic Curve (AUC), are used to evaluate the model's performance. Accuracy is a measure of the proportion
of flagged claims that are truly fraudulent, reflecting the ability to detect all fraudulent cases. There is a tradeoff
between false positives and false negatives, so balancing these metrics achieves this (

24

). They also monitor

operational metrics, including fraud detection rates and processing latencies. The critical issue is that the detection
rate indicates the percentage of fraudulent claims that are correctly identified and prevented, thereby minimizing
financial losses. Processing latency refers to the time elapsed from claim submission to fraud scoring; challenges in
this Area impact both customer experience and workflow efficiency.

7.5 Visualization and Reporting Tools

Interactive dashboards are built-in using tools such as Apache Superset and Power BI to support informed decision-
making. These visualizations display real-time fraud trends, the performance of the mother model, and the
distribution of alerts. Fraud heat maps and temporal activity charts provide investigators with insights into hotspots
and unusual behavior. Data insights are summarized in regular reports, enabling management to make informed,
data-driven policy adjustments and allocate resources effectively.

8. Experimental Results

In the evaluation phase, the machine learning-augmented ETL pipeline was assessed for model performance, error


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

424

types, and operational impacts. Results demonstrate the value of incorporating intelligent fraud detection into
claims processing workflows.

8.1. Comparing Model Performance

The performance of the selected machine learning models

Random Forest, XGBoost, and Multilayer Perceptron

(MLP)

is compared (

32

). The feature sets with and without temporal behavior features, along with both

correlation-based feature selection (CFS) and principal component analysis (PCA) for dimensionality reduction,
were tested over each model. Results indicate that temporal models outperform traditional models (without
temporal features) with significant improvement in recall and F1 scores. Capturing time-based patterns in claims
(such as staging and repeating fraud) is very important, and this improvement reflects that.

XGBoost achieved the highest overall accuracy and balanced precision-recall metrics across the models, partly due
to its superior performance on imbalanced classes and its ability to capture complex relationships. Although the
majority of features are not predictive of fraud, Random Forest proved to be competitive (although with slightly
lower precision) and also exhibited relatively high interpretability. In contrast, an MLP required more training time
but was capable of detecting subtle fraud patterns that a Random Forest could not identify. CFS-based
dimensionality reduction produced more interpretable and slightly better detection rates of feature subsets than
PCA, which, although effective in capturing variance, sometimes resulted in outputs that combined features in a
less symptomatically meaningful manner for fraud identification.

As shown in the Figure below, the comparison illustrates the measurable benefits of incorporating temporal
features, particularly for complex classifiers like XGBoost and MLP.

Figure 7:

Comparing Model Performance


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

425

8.2 Error analysis.

False positives and negatives were analyzed in depth, revealing insights into how the models functioned (

10

). Often,

the false positives were due to unusual but legitimate customer behaviors (e.g., large claims after rare events).
However, reducing false positives is key to avoiding unnecessary investigations and ensuring it doesn't erode the
customer's trust. The number of false negatives or undetected fraud cases was primarily associated with
sophisticated schemes that allowed them to mimic legitimate claim patterns closely. To overcome these challenges,
it is recommended that additional external data sources be incorporated and more behavioral modelling be
employed.

8.3 Visualization of Results.

Receiver Operating Characteristic (ROC) curves and confusion matrices from the visualization tools were used to
assess the discrimination ability and classification errors of the models. The ROC curves for models with temporal
features had steeper curves, which indicates better sensitivity. Temporal clusters of suspect activities in a fraud
heat map guided the focus of fraud analysts' investigations in high-risk periods and segments. Seasonal and event-
driven types of fraud incidence were illustrated through time series plots.

8.4 Operational Impact Assessment.

The ML-augmented pipeline was successfully integrated into the claim, yielding and demonstrating measurable
operational improvements. This resulted in detection latency being drastically lowered to the extent that near real-
time flagging and (in some cases) intervention were possible. The fraud detection rate increased, resulting in
reduced payouts on fraudulent claims and potential cost savings. Deployment tests using cloud-based infrastructure
validated the scalability of the pipeline by operating on large claim volumes without degradation in scoring speed
or accuracy.

As illustrated in table below, incorporating temporal features significantly improves the performance of all
evaluated machine learning models, particularly in terms of recall and F1 score.

Table 5:

Performance Comparison of ML Models with and without Temporal Features

Model

Feature Set

Accuracy Precision Recall

F1
Score

Notes

Random Forest

With Temporal +
CFS

91%

0.88

0.89 0.885

High interpretability, competitive
performance

Random Forest

Without Temporal
+ PCA

85%

0.81

0.78 0.795 Misses sequential fraud patterns

XGBoost

With Temporal +
CFS

94%

0.91

0.92

0.915

Best performance overall; excels at
imbalanced data

XGBoost

Without Temporal

88%

0.85

0.84 0.845 Effective but less interpretable


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

426

Model

Feature Set

Accuracy Precision Recall

F1
Score

Notes

+ PCA

Multilayer
Perceptron

With Temporal +
CFS

92%

0.87

0.90 0.885

Learns subtle patterns, but longer
training time

Multilayer
Perceptron

Without Temporal
+ PCA

86%

0.82

0.80 0.81

Underperforms without temporal
context

9. Deployment and Scalability Considerations

The implementation of the fraud ETL Machine Learning augmented pipeline requires a suitable deployment
strategy, as well as the ability to scale up within the insurance IT infrastructure. This section covers the infrastructure
stack, CI/CD workflows, real-time deployment approach, and monitoring mechanisms.

As illustrated in Figure below Deployment and Scalability Considerations, the system must be capable of integrating
seamlessly with existing insurance IT ecosystems while supporting high-volume, real-time data processing demands

Figure 8:

Deployment and Scalability Considerations

9.1. Infrastructure Stack

Cloud-native technologies are recommended for handling large volumes of data and the complex processing
requirements of insurance claims, leveraging the data processing capabilities of Apache Spark, including parallel
transformation and feature engineering at scale. By containerizing with Docker, consistent deployment
environments are made possible, dependencies are isolated, and upgrades are simplified (Watada et al., 2019).


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

427

Elastic computing and storage resources are made available by cloud platforms, such as Amazon Web Services
(AWS) or Google Cloud Platform (GCP), which can be dynamically scaled according to workload demands. Container
orchestration is supported by managed services such as AWS Elastic Kubernetes Service (EKS) or Google Kubernetes
Engine (GKE), which maintain high availability and fault tolerance.

9.2 CI/CD Workflows for ETL and ML Updates

Continuous Integration and Continuous Delivery pipelines automate testing, packaging, and deployment of both
ETL workflows and machine learning models. CI/CD tools supporting code repositories (e.g., Jenkins, GitLab CI)
enable frequent code updates and reduce the time required to implement improvements or bug fixes in production.
Data processing scripts, machine learning models, and configuration files in a system are subject to version control,
which ensures their traceability and rollback capability. Data Quality, Model Performance, and integration points
are validated by automated tests before deployment, reducing Operational Risks (

40

).

9.3 Real-time deployment with Message Queues

Event-driven architecture for real-time fraud detection is made possible with message queue systems, such as
Apache Kafka or Amazon Kinesis, which enable writing code to listen for events and take actions as needed. Events
generated by the submission of the claims through online portals or call centers are immediately streamed to the
ETL pipeline. Then, iterative feature computation and ML scoring are triggered on these events in near real-time,
enabling claims that should raise suspicion to be flagged immediately (

17

). The decoupled nature of message

queues makes the system more resilient, allowing for buffered bursts of claim submissions and asynchronous
processing.

9.4 Monitoring and Drift Detection

Maintaining accurate models and ensuring high operational efficiency post-deployment is critical. Monitoring
dashboards track metrics such as scoring latency, fraud detection rates, and false favorable ratios continuously.
Statistical tests are used to detect model drift, which occurs when fraud patterns or claim behaviors change, by
comparing them with historical prediction distributions (

7

). Whenever the performance degrades above predefined

thresholds, automated alerts prompt retraining or recalibration. It enables detailed logging of processing steps and
decision-making, which constitutes audit trails for regulatory compliance in the insurance industry.

10. Limitations

Several vital limitations exist for machine learning-augmented ETL pipelines in insurance fraud detection
implementations. The problems associated with these limitations reduce the system's performance and its level of
applicability and adoption in real-world insurance operations. Some of the critical constraints are discussed below
under the respective subsections.

As summarized in the Table below, machine learning

augmented ETL pipelines for fraud detection face several key

limitations. These include the scarcity of labeled data due to underreporting, challenges in model interpretability,
difficulty in generalizing across domains, and integration issues with legacy systems.


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

428

Table 6: Summary of Key Limitations in ML-Augmented ETL for Fraud Detection

Limitation

Description

Labeled Data Scarcity

Fraud cases are underreported or misclassified, affecting model training

Model Interpretability

Complex models lack transparency for audit/regulatory purposes

Cross-Domain Generalization Models trained for auto insurance may not work well for health or property

Legacy System Integration

ETL incompatibility and workflow disruption challenges

10.1 Data quality and labelling challenges.

One fundamental limitation is that the labelled data needed for training a fraud detection model is limited and of
poor quality. Identifying fraudulent claims in insurance is inherently complex, and underreporting or mislabeling of
such claims may consequently arise. On some suspected fraud claims, they are unable to confirm, while on others,
they risk wrongly flagging genuine claims. As a result, datasets are imbalanced

there are many more legitimate

claims than equivalent confirmed fraud cases. Class imbalance in this manner creates challenges for training the
machine learning model, as the algorithms may learn to be biased towards the majority class and, consequently,
fail to identify rare fraudulent events accurately. Furthermore, learning can be steered in the wrong direction by
noisy or erroneous labels, resulting in poor model generalization on future data (

11

). Data collection is subsequently

complicated by privacy regulations and data-sharing restrictions between and within insurance companies. Without
enough high-quality data, model effectiveness and robustness are not achievable.

10.2 Trust and Model Interpretability

Several state-of-the-art machine learning models, including gradient boosting machines and deep neural networks,
achieve high accuracy but lack transparency. Such "black-box" models generate little explanation of their decision-
making processes, such that claims adjusters, fraud investigators, and regulators cannot easily explain why a claim
was flagged as suspicious. Claim approvals or denials in the insurance sector have essential financial and
reputational bearing. Therefore, many regulatory frameworks require audit trails and interpretable reasoning for
automated decisions. Interpretability can be completely absent, which can leave users mistrusting the software,
thwarting adoption by society as a whole and making compliance with industry standards difficult. Predictive
accuracy and model transparency remain challenging to balance (

20

). Simple models, such as decision trees, are

more easily interpretable but may struggle to handle intricate fraud patterns. A crucial part of this was to integrate
explainable AI methods into the pipeline, which introduced complexity and computational overhead.

As shown in the Figure below, The Main Building Blocks of the Paper, one of the critical elements is balancing
predictive performance with interpretability


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

429

Figure 9: The main building blocks of the paper

10.3 Generalization to Products and Regions

Insurance fraud exhibits significant differences based on product type, customer demographics, and regional
regulations. For example, staged accidents are commonly associated with automobile insurance fraud, as well as
false billing or exaggerated treatments, which are also linked to health insurance fraud. Various countries have
different claim processes that attempt to follow their regulations, which significantly impact data patterns.
Differences in feature distributions, fraud schemes, and customer behaviour mean that data from one insurance
line or geographic area may not apply very well across datasets from another. This limits the scalability of a single
unified fraud detection engine. Models need to be retrained or adapted over time to maintain their performance,
which requires (often ongoing) access to data and relevant expertise. The continual need for customization
increases operational complexity and cost.

10.4 Legacy Systems Integration

Legacy IT systems have evolved over many years, and most insurance companies run complex legacy IT systems (

6

).

Unfortunately, integrating an advanced machine learning-augmented ETL pipeline with these existing systems can
be challenging due to differences in data formats, communication protocols, and system architectures. Modern ETL
workflows may require real-time data streaming, incremental processing, or containerized deployment methods,
which legacy systems may not be designed to support. If the data is not compatible, then there can be delays,
errors, or data loss in the execution pipeline. Additionally, insurance operations are characterized by established
workflows and business processes that involve numerous stakeholders. The changes tend to require extensive
testing, change management, and training, which slows down the deployment and reduces the initial effectiveness.

10.5 Latency and Operational Constraints

Fraud detection in real-time is a necessity, and therefore, it demands rigorous latency processing (

1

). Consequently,

it requires the ETL pipeline to ingest, transform, and score claims information promptly with minimal delay, allowing


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

430

for timely intervention. However, feature engineering, temporal modelling, and ML inference are complex, which
can add a considerable amount of computational overhead. The key issue is typically the balance between model
complexity and processing speed. Low latency may not unnecessarily overcomplicate modelling, but excessive
latency can compromise the value of early fraud detection, and overly simplistic models may miss more subtle
signals of fraud. To ensure consistency in pipeline performance, operational constraints, including fluctuating claim
submission volumes, network reliability, and system resource limitations, must be accounted for.

11. Future Work

Machine learning-augmented ETL pipelines for insurance fraud detection represent a significant step forward, and
there remains considerable promise in exploring other promising directions. While the experiments indicate an
uncertain future, future research along this path should improve model transparency, privacy preservation,
automation, data diversity and operational integration. They will help build more robust, adaptive, and trusted
fraud detection systems.

As shown in Figure below, How Machine Learning Helps to Detect Web Transaction Fraud, intelligent systems can
proactively flag anomalous behavior by analyzing real-time patterns and contextual data.

Figure 10:

How Machine Learning helps to detect web transaction fraud

11.1 Increasing Model Transparency with Explainable AI

Adopting complex machine learning models is one of the foremost challenges, and a key concern is their
interpretability. The following steps involve incorporating explainable AI (XAI) techniques to demystify model
predictions and provide clear, easy-to-understand explanations for humans. Questions about identifying the
influence of features on individual predictions can be answered with methods like SHAP (Shapley Additive
explanations) and LIME (Local Interpretable Model-agnostic Explanations), which provide such frameworks.
Embedding XAI into the ETL pipeline would automatically enhance the level of confidence that claims adjusters and
fraud analysts have, as well as provide support for regulatory compliance by delivering transparent decision-making


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

431

trials. The computational efficiency of these techniques needs to be further optimized for real-time fraud scoring,
and explanations must be tailored to user roles, ranging from technical experts to business stakeholders (

34

).

11.2 Federated Learning for Privacy-Preserving Collaboration

Insurers are limited in sharing sensitive customer data due to data privacy and regulatory constraints, thereby
prohibiting the development of holistic fraud detection models across the industry. In this case, federated learning
emerges as a promising solution to address this issue by enabling collaborative model training across decentralized
datasets without requiring the transfer of raw data. Future implementations can utilize federated learning to
aggregate knowledge from multiple insurers, thereby capturing more widespread patterns of fraud and enhancing
detection performance. Using this approach, data remains within each organization's secure domain, creating
privacy by design while adhering to relevant privacy regulations, including the GDPR and HIPAA. Considering the
research to be far from exhaustive, future studies should focus on developing robust federated learning protocols
that can address diverse data sources with varying data quality and potential adversarial attacks. Practical
frameworks for coordinating participating insurers, model aggregation, and reward mechanisms will also be
necessary (

2

).

11.3 Adaptive AutoML for Continuous Model Evolution

Fraud tactics are constantly evolving as fraudsters devise new methods to circumvent detection. Without timely
retraining and tuning, static models can easily become obsolete. The opportunity of Automated Machine Learning
(AutoML) systems in adaptive mode is to automate the process of selecting, tuning, and deploying machine learning
models on incoming data streams. Such frameworks should also be able to monitor different model performance
metrics, detect concept drift, and trigger retraining workflows without manual intervention, which will be a focus
of future work

(

25

). This would mean the fraud detection system won't become outdated over time, requiring less

operational burden from the data science team. In addition, research can explore integrating AutoML with domain
knowledge and business rules to leverage the full potential of AutoML while incorporating domain-specific expertise
and business rules for enhanced oversight in automated learning.

11.4 Exploiting multimodal data for expanding feature sets

Traditional fraud detection models generally work only on structured claim data. However, structured and semi-
structured data sources, such as claim documents, customer communications, social media activity, and sensor data
(from telematics devices), have potential signals that can enhance the accuracy of detection. Future research should
utilize natural language processing (NLP), computer vision, and sensor data analytics to develop methods for
effectively integrating and processing these disparate data types. To combine these heterogeneous inputs into
cohesive feature sets for machine learning, developing multimodal data fusion techniques will be a crucial aspect
of the project. Some challenges in this area include ensuring data quality, addressing noise and inconsistencies in
the data, and resolving privacy issues associated with non-traditional data sources.

11.5 Deeper Integration into Claims Workflow and Business Processes

In terms of accurate scoring, it is not the only determinant of effective fraud detection but also a timely and efficient
operational response (

15

). Further work should investigate how to integrate the results of fraud detection more

deeply into the claims management systems and business process workflows. Enabling this integration would
automatically prioritize suspicious claims for human review, send fraud investigation signals, or dynamically adjust


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

432

claim processing policies. Optimizing resource allocation and increasing customer service involves reducing delays
and erroneous claim denials by leveraging workflow automation and a business rules engine. Dashboards and alert
systems can also be developed to present actionable insights tailored to different roles, thereby improving user
engagement and decision-making.

12. CONCLUSION

Integrating machine learning into ETL pipelines for insurance claims processing represents a valuable advancement
in the ongoing fight against fraud. However, traditional methods have their limitations; they are based on static
rules and manual review processes, which are not sufficient to handle the ever-growing volume and complexity of
claims data. This work demonstrates that directly embedding machine learning models into ETL workflows yields a
powerful paradigm for enhancing fraud detection accuracy, latency, and operational efficiency. This work makes a
key contribution to the intelligent design of the ETL pipeline, with feature engineering being a significant part of the
journey. Temporal behavior modelling facilitates the detection of suspicious patterns in the time dimension, e.g.,
staged or repetitive fraudulent behavior, which is usually beyond the scope of traditional static analyses. Beyond
the inherent data, external data sources are also incorporated to enrich the features, making the system stronger
in identifying high-risk claims at an early stage of the processing pipeline. With these innovations, fraud detection
can be adopted for a more dynamic and adaptive framework vs. a rigid rule-based framework.

A comparative evaluation of different supervised machine learning models, including Random Forest, XGBoost, and

Multilayer Perceptron’s, highlights the importance of selecting the right model for the unique characteristics of

insurance data. Correlation-Based Feature Selection (CFS) and Principal Component Analysis (PCA) dimensionality
reduction techniques were effective in refining feature input to yield better models with reduced computational
overhead. This reveals that combining domain knowledge and machine learning knowledge is necessary to optimize
the detection system. Another practical strength of the proposed framework is that it is compatible with real-world
operational environments. The pipeline is cyclical and supports scalability, as well as continuous integration and
deployment (CI/CD) practices, utilizing modern orchestration tools such as Apache Airflow and n8n, alongside
cloud-native platforms and containerized deployments. As a result, fraud detection models can be updated
frequently to adapt to changing fraudulent behavior without discontinuing existing workflows. The use case also
supports real-time scoring capabilities, enabling insurers to respond effectively and quickly assess risk in real time.

While these advances were significant, challenges persist in pushing the technology to broader adoption. Fraud
data is typically scarce, imbalanced, and often inaccurately labeled, which limits the ability to train universally
effective machine learning models. Moreover, many machine learning algorithms can be computationally complex,
making them notoriously difficult to interpret. This is a problem in the insurance context, as a lack of interpretability
is a significant issue for regulatory compliance and user trust in operations. As such, an area for future work is the
integration of explainable AI techniques to provide transparency into the model's decisions. Further, the
generalizability of fraud detection models across different insurance products and geographic markets is
insufficient. Due to the wide variation in fraud tactics across product types, regional regulations, and other factors,
it requires localized adaptation or transfer learning strategies. The technical and organizational challenges of
integrating new machine learning components into existing systems and legacy systems are further addressed,
necessitating the careful application of established principles of change management and data harmonization.

Looking forward, the realm of federated learning offers insurers the opportunity to collaborate on training models
without explicitly sharing data, thereby broadening the scope and robustness of fraud detection systems. To achieve
additional adaptability, automated machine learning (AutoML) frameworks will be able to tune and retrain models


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

433

to match changing fraud trends continuously. Additionally, integrating unstructured data sources, such as claims
documents, social media, and sensor data, can facilitate the detection of more fraud signals through multimodal
analytics. The machine learning

augmented ETL pipeline described in this study is a scalable, adaptive, and

intelligent solution for modern insurance fraud detection. It bridged the data engineering with advanced analytics
to offer both improved detection accuracy and operational feasibility. This integrated approach enables insurers to
lower financial losses, increase customer trust, and fulfil governmental requirements. As fraud tactics become
increasingly sophisticated and complex, it will be critical to sustain innovation and the careful implementation of
emerging technologies to retain the effectiveness of fraud prevention.

REFERENCES

1.

Abakarim, Y., Lahby, M., & Attioui, A. (2018, October). An efficient real time model for credit card fraud
detection based on deep learning. In

Proceedings of the 12th international conference on intelligent

systems: theories and applications

(pp. 1-7).

https://dl.acm.org/doi/abs/10.1145/3289402.3289530

2.

Bello, H. O., Ige, A. B., & Ameyaw, M. N. (2024). Adaptive machine learning models: concepts for real-time
financial fraud prevention in dynamic environments.

World Journal of Advanced Engineering Technology

and Sciences

,

12

(02), 021-034.

https://doi.org/10.30574/wjaets.2024.12.2.0266

3.

Beteto, A., Melo, V., Lin, J., Alsultan, M., Dias, E. M., Korte, E., ... & Lambert, J. H. (2022). Anomaly and cyber
fraud detection in pipelines and supply chains for liquid fuels.

Environment Systems and Decisions

,

42

(2),

306-324.

https://link.springer.com/article/10.1007/s10669-022-09843-5

4.

Chavan, A. (2022). Importance of identifying and establishing context boundaries while migrating from
monolith to microservices. Journal of Engineering and Applied Sciences Technology, 4, E168.

http://doi.org/10.47363/JEAST/2022(4)E168

5.

Chavan, A. (2023). Managing scalability and cost in microservices architecture: Balancing infinite scalability
with financial constraints. Journal of Artificial Intelligence & Cloud Computing, 2, E264.

http://doi.org/10.47363/JAICC/2023(2)E264

6.

Crotty, J., & Horrocks, I. (2017). Managing legacy system costs: A case study of a meta-assessment model
to identify solutions in a large financial services company.

Applied computing and informatics

,

13

(2), 175-

183.

https://doi.org/10.1016/j.aci.2016.12.001

7.

Darville, J., Yavuz, A., Runsewe, T., & Celik, N. (2023). Effective sampling for drift mitigation in machine
learning using scenario selection: A microgrid case study.

Applied Energy

,

341

, 121048.

https://doi.org/10.1016/j.apenergy.2023.121048

8.

Dhanagari, M. R. (2024). MongoDB and data consistency: Bridging the gap between performance and
reliability.

Journal

of

Computer

Science

and

Technology

Studies,

6

(2),

183-198.

https://doi.org/10.32996/jcsts.2024.6.2.21

9.

Dhanagari, M. R. (2024). Scaling with MongoDB: Solutions for handling big data in real-time.

Journal of

Computer Science and Technology Studies, 6

(5), 246-264.

https://doi.org/10.32996/jcsts.2024.6.5.20


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

434

10.

Drakesmith, M., Caeyenberghs, K., Dutt, A., Lewis, G., David, A. S., & Jones, D. K. (2015). Overcoming the
effects of false positives and threshold bias in graph theoretical analyses of neuroimaging
data.

Neuroimage

,

118

, 313-333.

https://doi.org/10.1109/ACCESS.2019.2945930

11.

Elmes, A., Alemohammad, H., Avery, R., Caylor, K., Eastman, J. R., Fishgold, L., ... & Estes, L. (2020).
Accounting for training data error in machine learning applied to earth observations.

Remote

Sensing

,

12

(6), 1034.

https://doi.org/10.3390/rs12061034

12.

Fursov, I., Kovtun, E., Rivera-Castro, R., Zaytsev, A., Khasyanov, R., Spindler, M., & Burnaev, E. (2022).
Sequence

embeddings

help

detect

insurance

fraud.

IEEE

Access

,

10

,

32060-32074.

https://doi.org/10.1109/ACCESS.2022.3149480

13.

Goel, G., & Bhramhabhatt, R. (2024). Dual sourcing strategies.

International Journal of Science and Research

Archive

, 13(2), 2155.

https://doi.org/10.30574/ijsra.2024.13.2.2155

14.

Hardy, B., Mohoric, T., Exner, T., Dokler, J., Brajnik, M., Bachler, D., ... & Athar, A. (2024). Knowledge
infrastructure for integrated data management and analysis supporting new approach methods in
predictive

toxicology

and

risk

assessment.

Toxicology

in

Vitro

,

100

,

105903.

https://doi.org/10.1016/j.tiv.2024.105903

15.

Kalluri, K. (2022). Optimizing Financial Services Implementing Pega's Decisioning Capabilities for Fraud
Detection.

International Journal of Innovative Research in Engineering & Multidisciplinary Physical

Sciences

,

10

(1), 1-9.

16.

Karwa, K. (2024). The future of work for industrial and product designers: Preparing students for AI and
automation trends. Identifying the skills and knowledge that will be critical for future-proofing design
careers.

International Journal of Advanced Research in Engineering and Technology

,

15

(5).

https://iaeme.com/MasterAdmin/Journal_uploads/IJARET/VOLUME_15_ISSUE_5/IJARET_15_05_011.pdf

17.

Khurana, R. (2020). Fraud detection in ecommerce payment systems: The role of predictive ai in real-time
transaction security and risk management.

International Journal of Applied Machine Learning and

Computational Intelligence

,

10

(6), 1-32.

https://neuralslate.com/

18.

Konneru, N. M. K. (2021). Integrating security into CI/CD pipelines: A DevSecOps approach with SAST, DAST,
and SCA tools.

International Journal of Science and Research Archive

. Retrieved from

https://ijsra.net/content/role-notification-scheduling-improving-patient

19.

Kumar, A. (2019). The convergence of predictive analytics in driving business intelligence and enhancing
DevOps efficiency. International Journal of Computational Engineering and Management, 6(6), 118-142.
Retrieved from

https://ijcem.in/wp-content/uploads/THE-CONVERGENCE-OF-PREDICTIVE-ANALYTICS-IN-

DRIVING-BUSINESS-INTELLIGENCE-AND-ENHANCING-DEVOPS-EFFICIENCY.pdf

20.

Lepri, B., Oliver, N., Letouzé, E., Pentland, A., & Vinck, P. (2018). Fair, transparent, and accountable
algorithmic decision-making processes: The premise, the proposed solutions, and the open
challenges.

Philosophy & Technology

,

31

(4), 611-627.

https://link.springer.com/article/10.1007/S13347-

017-0279-X


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

435

21.

Machireddy, J. R. (2024). Integrating Machine Learning-Driven RPA with Cloud-Based Data Warehousing for
Real-Time Analytics and Business Intelligence.

Hong Kong Journal of AI and Medicine

,

4

(1), 98-121.

https://hongkongscipub.com/

22.

Misiura, A. (2015).

Enterprise risk management in the airline industry-risk management structures and

practices

(Doctoral dissertation, Brunel University London).

http://bura.brunel.ac.uk/handle/2438/11087

23.

Mittal, S., & Tyagi, S. (2019, January). Performance evaluation of machine learning algorithms for credit
card fraud detection. In

2019 9th International Conference on Cloud Computing, Data Science & Engineering

(Confluence)

(pp. 320-324). IEEE.

https://doi.org/10.1109/CONFLUENCE.2019.8776925

24.

Mori, T., & Uchihira, N. (2019). Balancing the trade-off between accuracy and interpretability in software
defect

prediction.

Empirical

Software

Engineering

,

24

,

779-825.

https://link.springer.com/article/10.1007/s10664-018-9638-1

25.

Nelson, J., & Temple, S. (2020, April).

MLOps Framework for Continuous Integration and Deployment

.

26.

Njoku, D. O., Iwuchukwu, V. C., Jibiri, J. E., Ikwuazom, C. T., Ofoegbu, C. I., & Nwokoma, F. O. (2024). Machine
learning approach for fraud detection system in financial institution: A web base application.

Machine

Learning

,

20

(4), 01-12.

27.

Nyati, S. (2018). Transforming telematics in fleet management: Innovations in asset tracking, efficiency, and
communication. International Journal of Science and Research (IJSR), 7(10), 1804-1810. Retrieved from

https://www.ijsr.net/getabstract.php?paperid=SR24203184230

28.

Olayinka, O. H. (2021). Big data integration and real-time analytics for enhancing operational efficiency and
market responsiveness.

Int J Sci Res Arch

,

4

(1), 280-96.

https://doi.org/10.30574/ijsra.2021.4.1.0179

29.

Pillai, V. (2022).

Anomaly Detection for Innovators: Transforming Data into Breakthroughs

. Libertatem

Media Private Limited.

30.

Raju, R. K. (2017). Dynamic memory inference network for natural language inference. International Journal
of Science and Research (IJSR), 6(2).

https://www.ijsr.net/archive/v6i2/SR24926091431.pdf

31.

Reddy, G. T., Reddy, M. P. K., Lakshmanna, K., Kaluri, R., Rajput, D. S., Srivastava, G., & Baker, T. (2020).
Analysis of dimensionality reduction techniques on big data.

Ieee Access

,

8

, 54776-54788.

https://doi.org/10.1109/ACCESS.2020.2980942

32.

Sahin, E. K. (2020). Assessing the predictive capability of ensemble tree methods for landslide susceptibility
mapping using XGBoost, gradient boosting machine, and random forest.

SN Applied Sciences

,

2

(7), 1308.

https://link.springer.com/article/10.1007/s42452-020-3060-1

33.

Sardana, J. (2022). Scalable systems for healthcare communication: A design perspective.

International

Journal of Science and Research Archive

.

https://doi.org/10.30574/ijsra.2022.7.2.0253

34.

Sarma, W., Nagavalli, S. P., & Sresth, V. (2020). Leveraging AI-Driven Algorithms to Address Real-World
Challenges in E-Commerce: Enhancing User Experience, Fraud Detection, and Operational


background image

AMERICAN ACADEMIC PUBLISHER

https://www.academicpublishers.org/journals/index.php/ijdsml

436

Efficiency.

INTERNATIONAL JOURNAL OF RESEARCH AND ANALYTICAL REVIEWS

,

7

, 2348-1269.

http://www.ijrar.org/

35.

Sartzetaki, M., Karagkouni, A., & Dimitriou, D. (2023). A conceptual framework for developing intelligent
services (a platform) for transport enterprises: The designation of key drivers for action.

Electronics

,

12

(22),

4690.

https://doi.org/10.3390/electronics12224690

36.

Singh, V. (2022). Visual question answering using transformer architectures: Applying transformer models
to improve performance in VQA tasks. Journal of Artificial Intelligence and Cognitive Computing, 1(E228).

https://doi.org/10.47363/JAICC/2022(1)E228

37.

Singh, V. (2023). Enhancing object detection with self-supervised learning: Improving object detection
algorithms using unlabeled data through self-supervised techniques. International Journal of Advanced
Engineering and Technology.

https://romanpub.com/resources/Vol%205%20%2C%20No%201%20-

%2023.pdf

38.

Sukhadiya, J., Pandya, H., & Singh, V. (2018). Comparison of Image Captioning Methods.

INTERNATIONAL

JOURNAL

OF

ENGINEERING

DEVELOPMENT

AND

RESEARCH

,

6

(4),

43-48.

https://rjwave.org/ijedr/papers/IJEDR1804011.pdf

39.

Van Rijn, J. N., & Hutter, F. (2018, July). Hyperparameter importance across datasets. In

Proceedings of the

24th ACM SIGKDD international conference on knowledge discovery & data mining

(pp. 2367-2376).

https://dl.acm.org/doi/abs/10.1145/3219819.3220058

40.

Yarram, S., & Bittla, S. R. (2023). Predictive Test Automation: Shaping the Future of Quality Engineering in
Enterprise Platforms.

Available at SSRN 5132329

.

https://ssrn.com/abstract=5132329

References

Abakarim, Y., Lahby, M., & Attioui, A. (2018, October). An efficient real time model for credit card fraud detection based on deep learning. In Proceedings of the 12th international conference on intelligent systems: theories and applications (pp. 1-7). https://dl.acm.org/doi/abs/10.1145/3289402.3289530

Bello, H. O., Ige, A. B., & Ameyaw, M. N. (2024). Adaptive machine learning models: concepts for real-time financial fraud prevention in dynamic environments. World Journal of Advanced Engineering Technology and Sciences, 12(02), 021-034. https://doi.org/10.30574/wjaets.2024.12.2.0266

Beteto, A., Melo, V., Lin, J., Alsultan, M., Dias, E. M., Korte, E., ... & Lambert, J. H. (2022). Anomaly and cyber fraud detection in pipelines and supply chains for liquid fuels. Environment Systems and Decisions, 42(2), 306-324. https://link.springer.com/article/10.1007/s10669-022-09843-5

Chavan, A. (2022). Importance of identifying and establishing context boundaries while migrating from monolith to microservices. Journal of Engineering and Applied Sciences Technology, 4, E168. http://doi.org/10.47363/JEAST/2022(4)E168

Chavan, A. (2023). Managing scalability and cost in microservices architecture: Balancing infinite scalability with financial constraints. Journal of Artificial Intelligence & Cloud Computing, 2, E264. http://doi.org/10.47363/JAICC/2023(2)E264

Crotty, J., & Horrocks, I. (2017). Managing legacy system costs: A case study of a meta-assessment model to identify solutions in a large financial services company. Applied computing and informatics, 13(2), 175-183. https://doi.org/10.1016/j.aci.2016.12.001

Darville, J., Yavuz, A., Runsewe, T., & Celik, N. (2023). Effective sampling for drift mitigation in machine learning using scenario selection: A microgrid case study. Applied Energy, 341, 121048. https://doi.org/10.1016/j.apenergy.2023.121048

Dhanagari, M. R. (2024). MongoDB and data consistency: Bridging the gap between performance and reliability. Journal of Computer Science and Technology Studies, 6(2), 183-198. https://doi.org/10.32996/jcsts.2024.6.2.21

Dhanagari, M. R. (2024). Scaling with MongoDB: Solutions for handling big data in real-time. Journal of Computer Science and Technology Studies, 6(5), 246-264. https://doi.org/10.32996/jcsts.2024.6.5.20

Drakesmith, M., Caeyenberghs, K., Dutt, A., Lewis, G., David, A. S., & Jones, D. K. (2015). Overcoming the effects of false positives and threshold bias in graph theoretical analyses of neuroimaging data. Neuroimage, 118, 313-333. https://doi.org/10.1109/ACCESS.2019.2945930

Elmes, A., Alemohammad, H., Avery, R., Caylor, K., Eastman, J. R., Fishgold, L., ... & Estes, L. (2020). Accounting for training data error in machine learning applied to earth observations. Remote Sensing, 12(6), 1034. https://doi.org/10.3390/rs12061034

Fursov, I., Kovtun, E., Rivera-Castro, R., Zaytsev, A., Khasyanov, R., Spindler, M., & Burnaev, E. (2022). Sequence embeddings help detect insurance fraud. IEEE Access, 10, 32060-32074. https://doi.org/10.1109/ACCESS.2022.3149480

Goel, G., & Bhramhabhatt, R. (2024). Dual sourcing strategies. International Journal of Science and Research Archive, 13(2), 2155. https://doi.org/10.30574/ijsra.2024.13.2.2155

Hardy, B., Mohoric, T., Exner, T., Dokler, J., Brajnik, M., Bachler, D., ... & Athar, A. (2024). Knowledge infrastructure for integrated data management and analysis supporting new approach methods in predictive toxicology and risk assessment. Toxicology in Vitro, 100, 105903. https://doi.org/10.1016/j.tiv.2024.105903

Kalluri, K. (2022). Optimizing Financial Services Implementing Pega's Decisioning Capabilities for Fraud Detection. International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences, 10(1), 1-9.

Karwa, K. (2024). The future of work for industrial and product designers: Preparing students for AI and automation trends. Identifying the skills and knowledge that will be critical for future-proofing design careers. International Journal of Advanced Research in Engineering and Technology, 15(5). https://iaeme.com/MasterAdmin/Journal_uploads/IJARET/VOLUME_15_ISSUE_5/IJARET_15_05_011.pdf

Khurana, R. (2020). Fraud detection in ecommerce payment systems: The role of predictive ai in real-time transaction security and risk management. International Journal of Applied Machine Learning and Computational Intelligence, 10(6), 1-32. https://neuralslate.com/

Konneru, N. M. K. (2021). Integrating security into CI/CD pipelines: A DevSecOps approach with SAST, DAST, and SCA tools. International Journal of Science and Research Archive. Retrieved from https://ijsra.net/content/role-notification-scheduling-improving-patient

Kumar, A. (2019). The convergence of predictive analytics in driving business intelligence and enhancing DevOps efficiency. International Journal of Computational Engineering and Management, 6(6), 118-142. Retrieved from https://ijcem.in/wp-content/uploads/THE-CONVERGENCE-OF-PREDICTIVE-ANALYTICS-IN-DRIVING-BUSINESS-INTELLIGENCE-AND-ENHANCING-DEVOPS-EFFICIENCY.pdf

Lepri, B., Oliver, N., Letouzé, E., Pentland, A., & Vinck, P. (2018). Fair, transparent, and accountable algorithmic decision-making processes: The premise, the proposed solutions, and the open challenges. Philosophy & Technology, 31(4), 611-627. https://link.springer.com/article/10.1007/S13347-017-0279-X

Machireddy, J. R. (2024). Integrating Machine Learning-Driven RPA with Cloud-Based Data Warehousing for Real-Time Analytics and Business Intelligence. Hong Kong Journal of AI and Medicine, 4(1), 98-121. https://hongkongscipub.com/

Misiura, A. (2015). Enterprise risk management in the airline industry-risk management structures and practices (Doctoral dissertation, Brunel University London). http://bura.brunel.ac.uk/handle/2438/11087

Mittal, S., & Tyagi, S. (2019, January). Performance evaluation of machine learning algorithms for credit card fraud detection. In 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (pp. 320-324). IEEE. https://doi.org/10.1109/CONFLUENCE.2019.8776925

Mori, T., & Uchihira, N. (2019). Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering, 24, 779-825. https://link.springer.com/article/10.1007/s10664-018-9638-1

Nelson, J., & Temple, S. (2020, April). MLOps Framework for Continuous Integration and Deployment.

Njoku, D. O., Iwuchukwu, V. C., Jibiri, J. E., Ikwuazom, C. T., Ofoegbu, C. I., & Nwokoma, F. O. (2024). Machine learning approach for fraud detection system in financial institution: A web base application. Machine Learning, 20(4), 01-12.

Nyati, S. (2018). Transforming telematics in fleet management: Innovations in asset tracking, efficiency, and communication. International Journal of Science and Research (IJSR), 7(10), 1804-1810. Retrieved from https://www.ijsr.net/getabstract.php?paperid=SR24203184230

Olayinka, O. H. (2021). Big data integration and real-time analytics for enhancing operational efficiency and market responsiveness. Int J Sci Res Arch, 4(1), 280-96. https://doi.org/10.30574/ijsra.2021.4.1.0179

Pillai, V. (2022). Anomaly Detection for Innovators: Transforming Data into Breakthroughs. Libertatem Media Private Limited.

Raju, R. K. (2017). Dynamic memory inference network for natural language inference. International Journal of Science and Research (IJSR), 6(2). https://www.ijsr.net/archive/v6i2/SR24926091431.pdf

Reddy, G. T., Reddy, M. P. K., Lakshmanna, K., Kaluri, R., Rajput, D. S., Srivastava, G., & Baker, T. (2020). Analysis of dimensionality reduction techniques on big data. Ieee Access, 8, 54776-54788. https://doi.org/10.1109/ACCESS.2020.2980942

Sahin, E. K. (2020). Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Applied Sciences, 2(7), 1308. https://link.springer.com/article/10.1007/s42452-020-3060-1

Sardana, J. (2022). Scalable systems for healthcare communication: A design perspective. International Journal of Science and Research Archive. https://doi.org/10.30574/ijsra.2022.7.2.0253

Sarma, W., Nagavalli, S. P., & Sresth, V. (2020). Leveraging AI-Driven Algorithms to Address Real-World Challenges in E-Commerce: Enhancing User Experience, Fraud Detection, and Operational Efficiency. INTERNATIONAL JOURNAL OF RESEARCH AND ANALYTICAL REVIEWS, 7, 2348-1269. http://www.ijrar.org/

Sartzetaki, M., Karagkouni, A., & Dimitriou, D. (2023). A conceptual framework for developing intelligent services (a platform) for transport enterprises: The designation of key drivers for action. Electronics, 12(22), 4690. https://doi.org/10.3390/electronics12224690

Singh, V. (2022). Visual question answering using transformer architectures: Applying transformer models to improve performance in VQA tasks. Journal of Artificial Intelligence and Cognitive Computing, 1(E228). https://doi.org/10.47363/JAICC/2022(1)E228

Singh, V. (2023). Enhancing object detection with self-supervised learning: Improving object detection algorithms using unlabeled data through self-supervised techniques. International Journal of Advanced Engineering and Technology. https://romanpub.com/resources/Vol%205%20%2C%20No%201%20-%2023.pdf

Sukhadiya, J., Pandya, H., & Singh, V. (2018). Comparison of Image Captioning Methods. INTERNATIONAL JOURNAL OF ENGINEERING DEVELOPMENT AND RESEARCH, 6(4), 43-48. https://rjwave.org/ijedr/papers/IJEDR1804011.pdf

Van Rijn, J. N., & Hutter, F. (2018, July). Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2367-2376). https://dl.acm.org/doi/abs/10.1145/3219819.3220058

Yarram, S., & Bittla, S. R. (2023). Predictive Test Automation: Shaping the Future of Quality Engineering in Enterprise Platforms. Available at SSRN 5132329. https://ssrn.com/abstract=5132329