Authors

  • Soli Maya Yacobovitch
    Project analyst USA, Florida

DOI:

https://doi.org/10.37547/tajet/Volume06Issue10-08

Keywords:

machine learning big data sorting algorithms optimization

Abstract

The use of machine learning methods to optimize big data sorting algorithms has become an urgent research topic due to the growing volume of information and the requirements for their rapid processing. Machine learning provides opportunities to automate and improve traditional sorting methods, allowing you to reduce the cost of computing resources and time. This is achieved by analyzing the characteristics of the data and preprocessing them using classification and regression. The main advantages of using machine learning in sorting big data include improving the accuracy and adaptability of algorithms to different types of data, which is especially important for areas with large amounts of information, such as finance, medicine and logistics. Progressive machine learning algorithms such as supporting vectors, decision trees, and gradient boosting demonstrate high efficiency and potential for further development and integration.


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

67

https://www.theamericanjournals.com/index.php/tajet

PUBLISHED DATE: - 12-10-2024

DOI: -

https://doi.org/10.37547/tajet/Volume06Issue10-08

PAGE NO.: - 67-74

APPLICATION OF MACHINE LEARNING
METHODS TO ENHANCE THE PERFORMANCE
OF BIG DATA SORTING ALGORITHMS


Soli Maya Yacobovitch

Project analyst, USA, Florida

INTRODUCTION

With the growing volumes of data generated by
various sources such as social media, the Internet
of Things (IoT), and business applications,
processing and sorting large datasets has become
one of the key tasks of modern information
systems. Sorting algorithms play a crucial role in
structuring data for subsequent analysis; however,
as the volume of information increases, traditional
processing methods begin to face limitations in
terms of speed and performance. These challenges
drive the search for new approaches capable of
improving the efficiency of big data processing.

One promising direction for addressing this issue
is the use of machine learning methods. Machine
learning can optimize sorting algorithms by
analyzing and pre-classifying data, which

significantly reduces processing time and
decreases the load on computational resources. In
this context, the relevance of the topic lies in the
need to develop new solutions for handling large
and dynamic datasets, which are becoming an
integral part of business processes in industries
such as finance, healthcare, and e-commerce.

The aim of this work is to explore machine learning
methods applicable to improving the performance
of big data sorting algorithms, as well as to identify
their advantages and potential limitations in the
context of processing large volumes of
information.

1. Traditional Sorting Algorithms and Their
Limitations

Big data represents large, dynamic, and diverse

RESEARCH ARTICLE

Open Access

Abstract


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

68

https://www.theamericanjournals.com/index.php/tajet

sets of information, also characterized by the
complexity of ensuring their accuracy and high
value. Due to their vast size, such data cannot be
processed using traditional database management
methods. In this context, speed implies the rapid
creation and processing of data. Diversity refers to
various data types, including both structured and
unstructured forms. Accuracy indicates the
challenges of ensuring data reliability, while value
emphasizes the hidden advantages for businesses.

Optimizing the handling of big data goes beyond a
technological task and becomes a key success
factor for organizations. Companies that
successfully apply data analytics can identify new
opportunities, improve productivity, and make
more informed decisions, providing them with a
significant

market

advantage.

Conversely,

shortcomings in big data management lead to
increased costs and missed opportunities.
Therefore, developing effective approaches to data
processing

is

essential

for

maintaining

competitiveness [1].

The foundation for successful big data
management lies in the proper use of data
structures and algorithms that optimize the
storage and analysis of information. Data
structures ensure efficient data organization, while
algorithms determine how information will be
processed. In the context of big data, where real-
time processing is required, choosing optimal
solutions for data handling becomes especially
critical. This review presents key data structures
and algorithms suitable for big data, as well as their
contribution to enhancing the efficiency of
working with large information sets.

One of the main challenges is the scale of the data.
In today's world, enormous amounts of
information are generated, ranging from user
activity on social networks to data from IoT
sensors. Traditional data processing systems
cannot handle such volumes, which creates storage

issues, increases processing time, and complicates
scalability [2].

The first key element in building such a network is
architecture based on principles of fault tolerance
and redundancy. This involves using multiple data
centers and data transmission routes to minimize
downtime and ensure service availability even in
case of failures at one of the nodes. Methods such
as automatic switching to backup nodes and
dynamic resource redistribution are applied to
achieve this.

The next aspect is the integration of technologies
to ensure data security at all levels. A multi-cloud
environment is exposed to a greater number of
potential threats, making it important to
implement

multi-layered

encryption,

authentication, and authorization systems. The use
of hybrid solutions, such as cloud firewalls and
intrusion detection systems, allows for effective
protection against cyberattacks.

Real-time network monitoring also plays a crucial
role in managing networks within a multi-cloud
infrastructure. The use of tools for traffic analysis
and rapid anomaly detection enables timely
responses to emerging issues and helps prevent
incidents. Intelligent monitoring systems with
artificial intelligence elements can automatically
adjust network parameters, enhancing resilience
and optimizing resource use.

Thus, building a reliable and resilient network in a
multi-cloud

environment

requires

a

comprehensive approach that includes ensuring
fault tolerance, strengthening security measures,
and applying modern monitoring technologies.
This allows not only for maintaining a high level of
data availability and protection but also for
adapting the network to dynamically changing
operating conditions [3].

2. Machine Learning Methods for Sorting
Optimization


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

69

https://www.theamericanjournals.com/index.php/tajet

Machine learning is a field focused on the
development and optimization of algorithms
designed to predict future or unknown data.
Combined with cloud computing resources, it
enables the efficient processing and integration of
large volumes of data, regardless of their sources.

Machine learning algorithms can be utilized at
every stage of working with big data, from
segmentation and analysis to modeling. After these
stages, it becomes possible to gain a holistic view
of the data, identify patterns and insights, which
are then packaged into understandable and useful
formats. The interaction between machine
learning and big data represents a continuous
process: the created algorithms adapt and improve
as new data is received.

Within the context of big data, machine learning is
used to handle the growing volume of information.
Its algorithms analyze incoming data, identify
patterns, and transform them into actionable
insights that can be applied to automate business
processes and improve decision-making.

Examples of machine learning applications in big
data are diverse. One example is marketing
automation, where algorithms allow for
personalized customer interactions, eliminating
pain points, and creating seamless communication
across various channels, including messaging and
branding. Consumer behavior analysis becomes
more

accurate

through

supervised

and

unsupervised algorithms, enabling a better
understanding of the target audience. This, in turn,
contributes

to

more

precise

customer

segmentation and the formation of effective
marketing campaigns.

In the media and entertainment industry, machine
learning is used to analyze audience preferences
and provide content that matches their interests.
Sentiment analysis, one of the machine learning
technologies, helps predict users' reactions to new
products or features, allowing companies to adjust

their strategies even during the development
phase.

Recommendation systems are another area where
machine learning plays a key role. These systems
analyze user behavior and, based on the data,
suggest products or services that may be of
interest. For example, platforms like Netflix
actively use machine learning to personalize
content recommendations.

In risk management, machine learning helps
predict and mitigate risks in financial operations,
such as through automating credit scoring
processes

or

digitizing

credit

decisions.

Regression, decision trees, and neural networks
are among the most effective methods used in this
field.

Machine learning is also actively applied in areas
such as healthcare and pharmaceuticals. It can
analyze medical data, improve diagnostics, and
develop new treatment methods. In particular,
machine learning helps detect diseases at early
stages by analyzing medical reports and patient
histories.

Predictive analytics, based on machine learning
algorithms, allows businesses to forecast future
trends and prepare for them. For example, in the
automotive industry, such analytics help track
vehicle malfunctions and inform manufacturers of
potential defects, contributing to increased
product reliability [4].

Next, a more detailed examination of the Support
Vector Machine (SVM) method is provided. SVM is
an important machine learning approach aimed at
solving classification and regression tasks. This
algorithm is trained on labeled data to determine
the optimal hyperplane that separates different
classes. The main goal of SVM is to find a boundary
that maximizes the distance between the nearest
points of different classes. This allows the
algorithm to effectively process new data and


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

70

https://www.theamericanjournals.com/index.php/tajet

produce accurate classification. Once the model is
trained and the hyperplane is established, SVM can
classify new data based on the previously created
model. This is particularly useful in tasks such as
image processing and text analysis, where a clear
distinction between categories is required. The

application of the Support Vector Machine method
is also relevant in commercial scenarios where
precise and fast object classification is crucial.

Example of using SVM in Python:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Loading the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Creating an SVM model
model = SVC(kernel='linear', C=1)

# Training the model
model.fit(X_train, y_train)

# Making predictions on the test data
predictions = model.predict(X_test)

# Evaluating accuracy
accuracy = accuracy_score(y_test, predictions)

print(f"Model accuracy: {accuracy:.2f}")

In turn, when discussing unsupervised learning
algorithms, they do not use pre-labeled data,
making them useful for tasks involving the
discovery of hidden structures in data. Clustering
helps to identify patterns and relationships that

may not be immediately apparent. Clustering
algorithms are often applied in marketing for
customer segmentation, as well as for anomaly
detection in large datasets. Some of the most
popular methods include K-means, hierarchical
clustering, and DBSCAN, which are widely used to


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

71

https://www.theamericanjournals.com/index.php/tajet

group data into clusters based on similarity [5].

Example of K-means algorithm implementation in
Python:

import numpy as np

class KMeans:
def __init__(self, n_clusters: int, max_iters: int = 100):
self.n_clusters = n_clusters # number of clusters
self.max_iters = max_iters # maximum number of iterations
self.centroids = None # to store the coordinates of centroids

def fit(self, X: np.ndarray):
"""Performs clustering using the K-means method"""
# Randomly initialize centroids
np.random.seed(42) # for reproducibility
random_indices = np.random.choice(len(X), self.n_clusters, replace=False)
self.centroids = X[random_indices]

for i in range(self.max_iters:
# Step 1: Find the nearest centroid for each sample
clusters = self._assign_clusters(X)

# Save the current centroids to check for changes
old_centroids = self.centroids.copy()

# Step 2: Recalculate centroids
self._update_centroids(X, clusters)

# Check for convergence (if centroids do not change, stop the algorithm)
if np.all(old_centroids == self.centroids):
break

def _assign_clusters(self, X: np.ndarray) -> np.ndarray:
"""Assigns each object to the nearest cluster (nearest centroid)"""
distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
return np.argmin(distances, axis=1)

def _update_centroids(self, X: np.ndarray, clusters: np.ndarray):


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

72

https://www.theamericanjournals.com/index.php/tajet

"""Recalculates centroids based on the average position of objects in each
cluster"""
for k in range(self.n_clusters):
self.centroids[k] = X[clusters == k].mean(axis=0)

def predict(self, X: np.ndarray) -> np.ndarray:
"""Returns the nearest cluster for each object in X"""
return self._assign_clusters(X)

# Example usage:
if __name__ == "__main__":
# Generate random data
X = np.array([
[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]
])

# Run K-means algorithm for 2 clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

# Predict clusters for data points
clusters = kmeans.predict(X)

print("Cluster centroids:", kmeans.centroids)
print("Clusters:", clusters)

The application of algorithms in supply chain
management also allows companies to better
control inventory and plan deliveries, reducing

costs and minimizing delays. Algorithms can
analyze historical data and forecast demand,
enabling businesses to make more informed
decisions [6].

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate a random dataset
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

73

https://www.theamericanjournals.com/index.php/tajet


# Create a K-means model
kmeans = KMeans(n_clusters=3, random_state=0)

# Train the model
kmeans.fit(X)

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200,
marker='*', c='red')

plt.show()

3. Application of Machine Learning Methods in
Real-World Big Data Sorting Tasks

Regression models are used to evaluate the
relationships

between

various

variables.

Specifically, linear regression is applied to solve
problems related to predicting the value of one
variable based on another. This method has long
been used in statistics and mathematical modeling.
Its primary task is to find coefficients in an
equation that reflects the relationship between
two variables, allowing this relationship to be
described by a simple linear equation.

Linear regression is used to analyze both a single
predictor variable and multiple variables. This
method is one of the foundational techniques in
machine learning and is used to create simple
predictive models. For example, students often
study this algorithm to learn the basics of working
with data. On platforms like Kaggle, many
educational datasets, such as auto insurance data,
can be found where linear regression helps analyze
the relationship between various variables, such as
the number of claims and the total amount of
payouts.

Logistic regression also plays a significant role in
machine learning, especially when solving binary

classification tasks. Unlike linear regression,
logistic regression predicts the probability of an
object belonging to one of two classes. This
algorithm is widely used in various fields, such as
predicting real estate prices based on property
characteristics. It allows for the description of
relationships between a dichotomous dependent
variable and independent variables, making it a
convenient tool for classification tasks [7].

Decision trees and their more complex
counterparts, such as random forests, organize
data into a hierarchical structure, where each level
involves splitting data based on the values of
various attributes. These algorithms are also
widely used for creating predictive models;
however, they are prone to overfitting if steps are
not taken to reduce variance. In this context, the
bagging method, which is based on creating
multiple models on different subsamples, helps
stabilize results.

Modern machine learning methods include
powerful algorithms like gradient boosting
machines, which are particularly effective when
working with tabular data. Algorithms such as
XGBoost and CatBoost demonstrate high accuracy
in classification and regression tasks and are often


background image

THE USA JOURNALS

THE AMERICAN JOURNAL OF ENGINEERING AND TECHNOLOGY (ISSN

2689-0984)

VOLUME 06 ISSUE10

74

https://www.theamericanjournals.com/index.php/tajet

used in real-world projects for big data analysis.

Neural networks also find wide application in
machine learning, especially in tasks related to
image and text processing. Convolutional neural
networks (CNN), recurrent neural networks
(RNN), and multilayer perceptrons (MLP) model
complex relationships and provide high predictive
accuracy in various fields such as medicine, time
series forecasting, and speech recognition. These
methods continue to evolve, offering new
opportunities for data analysis [8].

Thus, various machine learning and statistical
analysis methods, such as regression models,
decision trees, boosting, and neural networks,
enable the solution of a wide range of tasks, from
price prediction to image analysis, making them
indispensable tools in modern data science.

CONCLUSION

Machine learning plays a crucial role in optimizing
big data sorting algorithms, offering effective
solutions to improve the performance and
adaptability of information processing systems.
The implementation of machine learning methods
enhances sorting processes by reducing task
execution time and alleviating the load on
computational resources. Promising directions for
future research include the development of more
efficient algorithms capable of accounting for the
diversity and dynamism of big data.

REFERENCES

1.

Kobak D., Linderman G. C. Initialization is
critical for preserving global data structure in
both t-SNE and UMAP //Nature biotechnology.

2021.

T. 39.

No. 2.

pp. 156-157.

2.

Wang J. et al. Big data analytics for intelligent
manufacturing systems: A review //Journal of
Manufacturing Systems.

2022.

T. 62.

P.

738-752.

3.

Adadi A. A survey on data-efficient algorithms
in big data era //Journal of Big Data.

2021.

T. 8.

No. 1.

P. 24.

4.

Li W. et al. A comprehensive survey on
machine learning-based big data analytics for
IoT-enabled smart healthcare system //Mobile
networks and applications.

2021.

T. 26.

P.

234-252.

5.

Patil P. U. et al. Grading and sorting technique
of dragon fruits using machine learning
algorithms //Journal of Agriculture and Food
Research.

2021.

T. 4.

P. 100118.

6.

Koleva L. S., Filipov G. A. Classification of chest
X-Ray images using Orange Data Mining Tool
//Electrotechnica & Electronica (E+ E).

2023.

T. 58.

No. 2.

7.

Sarker I. H. Machine learning: Algorithms, real-
world applications and research directions
//SN computer science.

2021.

T. 2.

No. 3.

P. 160.

8.

Dargan S. et al. A survey of deep learning and
its applications: a new paradigm to machine
learning //Archives of Computational Methods
in Engineering.

2020.

T. 27.

P. 1071-1092.

9.

Natti, M. (2023). Reducing Oracle RAC Wait
Events by Using Instance-Specific Block
Allocation for Production Applications. The
Eastasouth Journal of Information System and
Computer

Science,

1(01),

65

68.

https://doi.org/10.58812/esiscs.v1i01.447

References

Kobak D., Linderman G. C. Initialization is critical for preserving global data structure in both t-SNE and UMAP //Nature biotechnology. – 2021. – T. 39. – No. 2. – pp. 156-157.

Wang J. et al. Big data analytics for intelligent manufacturing systems: A review //Journal of Manufacturing Systems. – 2022. – T. 62. – P. 738-752.

Adadi A. A survey on data-efficient algorithms in big data era //Journal of Big Data. – 2021. – T. 8. – No. 1. – P. 24.

Li W. et al. A comprehensive survey on machine learning-based big data analytics for IoT-enabled smart healthcare system //Mobile networks and applications. – 2021. – T. 26. – P. 234-252.

Patil P. U. et al. Grading and sorting technique of dragon fruits using machine learning algorithms //Journal of Agriculture and Food Research. – 2021. – T. 4. – P. 100118.

Koleva L. S., Filipov G. A. Classification of chest X-Ray images using Orange Data Mining Tool //Electrotechnica & Electronica (E+ E). – 2023. – T. 58. – No. 2.

Sarker I. H. Machine learning: Algorithms, real-world applications and research directions //SN computer science. – 2021. – T. 2. – No. 3. – P. 160.

Dargan S. et al. A survey of deep learning and its applications: a new paradigm to machine learning //Archives of Computational Methods in Engineering. – 2020. – T. 27. – P. 1071-1092.

Natti, M. (2023). Reducing Oracle RAC Wait Events by Using Instance-Specific Block Allocation for Production Applications. The Eastasouth Journal of Information System and Computer Science, 1(01), 65–68. https://doi.org/10.58812/esiscs.v1i01.447