Data preprocessing techniques in machine learning

CC BY f
27-31
12
2
Поделиться
Рахимов, Н., & Хасанов, Д. (2023). Data preprocessing techniques in machine learning . Информатика и инженерные технологии, 1(2), 27–31. извлечено от https://inlibrary.uz/index.php/computer-engineering/article/view/24959
Crossref
Сrossref
Scopus
Scopus

Аннотация

In this paper, importance of preprocessing and techniques in this field such as data cleaning, dimensionality reduction, smoothing, normalization are illustrated. During the research we mentioned some details of techniques above. However, our research includes only theoretical aspect of data preprocessing. The data preprocessing phase while arduous and time-intensive stands as the cornerstone of data science, possessing paramount significance. Neglecting the meticulous cleansing and structuring of data has the potential to undermine the integrity and efficacy of subsequent modeling endeavors.

Похожие статьи


background image

27

13.

Каршиев А. А., Маматкулова У. Е., Шобутаев К. С. РЕАЛИЗАЦИЯ

КВАЛИМЕТРИЧЕСКОГО ПОДХОДА В УПРАВЛЕНИИ КАЧЕСТВОМ
ОБРАЗОВАНИЯ

СТУДЕНТОВ

СОВРЕМЕННОГО

УНИВЕРСИТЕТА

//Европейский журнал исследований и рефлексии в области образовательных
наук. – 2019. – Т. 2019.

14.

Obid o’g, Assistent Salimov Jamshid, Assistent Abror Mamaraimov

Kamalidin o'g, and Assistent Normatov Nizomiddin Kamoliddin o‘g. "Numpy Library
Capabilities. Vectorized Calculation In Numpy Va Type Of Information." Eurasian
Research Bulletin 15 (2022): 132-137.

15.

Ziyoda,

Maydonova,

and

Normatov

Nizommiddin.

"RAQAMLI

IQTISODIYOTDA SUN'IY INTELLEKT TEXNOLOGIYALARINI TURLI
SOHALARDA AVTOMATLASHTIRISH VOSITALARI." International Journal of
Contemporary Scientific and Technical Research (2023): 246-250.

16.

Nizomiddin, Normatov. "TA’LIMDA DASTURLASH JARAYONINI

BAHOLASHGA ASOSLANGAN AVTOMATLASHTIRILGAN TIZIMNI TADBIQ
ETISH." International Journal of Contemporary Scientific and Technical Research
(2023): 24-28.

17.

Kamoliddin o‘g’li, Normatov Nizomiddin, and Ergashev Sirojiddin Baxtiyor

o‘g‘li. "ERWIN DASTURI YORDAMIDA IDEF0, IDEF3 VA DFD STANDAT
DIAGARAMMALARIDAN FOYDALANIB TIZIM SIFATIDA YARATILGAN
UNIVERSITETNING MONITORING BO ‘LIMI LOYIHASI." Новости
образования: исследование в XXI веке 1.6 (2023): 378-386.


DATA PREPROCESSING TECHNIQUES IN MACHINE LEARNING

Raximov Nodir Odilovich,

Khasanov Dilmurod

Tashkent University of Information Technologies

Abstract.

In this paper, importance of preprocessing and techniques in this field

such as data cleaning, dimensionality reduction, smoothing, normalization are
illustrated. During the research we mentioned some details of techniques above.
However, our research includes only theoretical aspect of data preprocessing. The data
preprocessing phase while arduous and time-intensive stands as the cornerstone of data
science, possessing paramount significance. Neglecting the meticulous cleansing and
structuring of data has the potential to undermine the integrity and efficacy of
subsequent modeling endeavors.

Keywords:

data preprocessing, data cleaning, normalization, exploratory data

analysis, dimensionality reduction.

Introduction

When confronted with real-world data, Data Scientists invariably find it

necessary to employ preprocessing techniques to enhance data usability. Such
techniques serve the dual purpose of rendering the data more amenable for utilization


background image

28

within machine learning (ML) algorithms and mitigating complexity to forestall
overfitting, ultimately yielding a superior model. Upon comprehending the nuances of
your dataset and identifying primary data intricacies through Exploratory Data
Analysis (EDA), the subsequent stage entails data preprocessing, which involves
preparing the dataset for its application in a model. Ideally, one would hope for a
dataset devoid of imperfections. Nevertheless, real-world data is inherently prone to
various issues necessitating remediation. For instance, within an organizational
context, inconsistencies such as typographical errors, missing data, disparate scales,
and other anomalies frequently manifest. These real-world challenges must be rectified
to enhance data utility and comprehensibility. This pivotal phase, where data is
cleansed, and most issues are resolved, is aptly termed "data preprocessing"[2].

Preprocessing methods

Neglecting the data preprocessing step carries significant repercussions for

subsequent machine learning model utilization. Many models are incapable of
accommodating missing values, and various adverse data characteristics, including
outliers, high dimensionality, and noise, can adversely impact model performance.
Thus, by undertaking data preprocessing, the dataset is enhanced in terms of
completeness and accuracy. This pivotal phase is indispensable for effecting essential
data adjustments prior to supplying the dataset to the machine learning model, thereby
ensuring the model's effectiveness and reliability[1,3].

Fig 1. Typical architecture of preprocessing

Dimensionality reduction – is primarily concerned with the reduction of input

features within training data. In the realm of real-world datasets, an abundance of
attributes is typically encountered. Failing to curtail this multitude of features can
potentially compromise the performance of a model when it is later applied to this
dataset. Effectively reducing the number of features while preserving a significant
portion of the dataset's variability yields several advantageous outcomes, including:


background image

29

Conservation of computational resources: diminished feature space necessitates

fewer computational resources during modeling.

Enhanced model performance: a streamlined feature set often contributes to

improved model performance.

Mitigation of overfitting: Dimensionality reduction assists in preventing

overfitting, wherein a model becomes excessively complex and memorizes training
data, leading to a substantial drop in performance on test data.

Alleviation of multicollinearity: this technique helps alleviate multicollinearity,

a condition characterized by high correlations among one or more independent
variables. Moreover, the application of dimensionality reduction techniques
contributes to the reduction of noise within the dataset. Now, let us delve into the
primary approaches to dimensionality reduction that can be employed to enhance data
suitability for subsequent analysis[4,6].

Feature selection pertains to the process of identifying and retaining the most

salient variables (features) associated with the prediction variable. In essence, it
involves selecting attributes that exert the greatest influence on the model.

Normalization is a data preparation method employed to standardize the values

of attributes within a dataset, thereby enhancing the effectiveness and precision of
machine learning models. The primary objective of normalization is to mitigate any
potential biases and discrepancies stemming from variations in feature scales. Various
forms of data normalization are available. Suppose you possess a dataset denoted as X,
containing N rows (entries) and D columns (features). In this context, X[:,i] signifies
feature i, while X[j,:] signifies entry j.

-

Z normalization(Standardization);

-

Min-Max normalization;

-

Unit vector normalization.

For instance, Z normalization method Gaussian function is used to change the

value of elements in dataset.

𝑋̂[: , 𝑖] =

𝑋[:,𝑖]− 𝜇

𝑖

𝜎

𝑖

, 𝜇

𝑖

=

1

𝑁

𝑋[𝑘, 𝑖], 𝜎

𝑖

= √

1

𝑁−1

(𝑋[𝑘, 𝑖] − 𝜇

𝑖

)

2

𝑁

𝑘=1

𝑁

𝑘=1

(1)


In (1.1)

standardization does not

change the type of distribution:

𝑋̂ = 𝑎𝑋 + 𝑏 → 𝑓

𝑋̂

(𝑥) =

1

|𝑎|

𝑓(

𝑥−𝑏

𝑎

)

(2)

This transformation sets the mean of data to 0 and the standard deviation to 1. In

most cases, standardization is used feature-wise Min-Max normalization[9]:

𝑋̂[: , 𝑖] =

𝑋[:,𝑖]− min (𝑋[:,𝑖]

max(𝑋[:,𝑖])−min (𝑋[:,𝑖])

(3)

Z normalization result :

[

13 16 19
22 23 38
47 56 70

]

[

0.013 0.051

0.1

0.155 0.172 0.43
0.586 0.741 0.92

]


background image

30

Data cleaning - a paramount facet of the data preprocessing phase pertains to the

identification and rectification of flawed and imprecise observations within the dataset,
thereby augmenting its overall quality. This methodology entails the discernment of
inadequacies, inaccuracies, redundancies, inconsequentialities, or void entries in the
dataset. Following the identification of these anomalies, it becomes imperative to
effectuate corrective measures, which may encompass data modification or exclusion.
The strategic approach employed in this context is contingent upon the problem domain
under consideration and the overarching objectives of the project. We will now
delineate some of the prevalent data analysis challenges and expound upon the methods
for their resolution. Noisy data typically encompasses inconsequential or erroneous
entries within a dataset, which may manifest as meaningless data points, incorrect
records, or duplicated observations. For instance, consider a scenario where a database
column labeled 'age' contains negative values, rendering the observation nonsensical.
Another circumstance pertains to the removal of extraneous or irrelevant data.
Especially, in regression problems we have to work on real numbers, in this case
normalization is one of the most important techniques[11].

Conclusion

Through the research above data preprocessing techniques help to improve

accuracy of the prediction result. For example, Dimensionality reduction makes easier
computational process and reduces execution time for prediction or analysis model.
Moreover, other techniques such as normalization makes understandable dataset, so
dataset values are transformed the mean of data to 0 and the standard deviation to 1.
Some theoretical explanations are given according to preprocessing techniques’ details
in this article. The next researches, we will try to touch mathematical aspect of these
processes.

References:

1.

Jiang S., Li J., Zhang S., Gu Q., Lu C., Liu H. Landslide risk prediction by

using GBRT algorithm: Application of artificial intelligence in disaster prevention of
energy mining. Process. Saf. Environ. Prot. 2022, 166, 384–392.

2.

N.Rahimov, D.Khasanov,“The application of multiple linear regression

algorithm and python for crop yield prediction in agriculture”, Harvard educational and
scientific review, Vol.2. Issue 1 Pg. 181-187.

3.

N.Raximov, J.Kuvandikov, D.Khasanov, “The importance of loss function

in artificial intelligence”, International Conference on Information Science and
Communications

Technologies

(ICISCT

20222),

DOI:

10.1109/ICISCT55600.2022.10146883

4.

Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning:

Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.

5.

Oliver Theobald. Machine Learning for Absolute Beginners

.

– Scatterplot

Press. 2017. pg.43-98

6.

M.Tojiyev,O.Primqulov,D.Xasanov, “Image segmentation in OpenCV and

Python,

DOI:10.5958/2249-7137.2020.01735.8


background image

31

7.

N.Raximov, O.Primqulov, B.Daminova,“Basic concepts and stages of

research development on artificial intelligence”, International Conference on
Information

Science

and

Communications

Technologies

(ICISCT),

www.ieeexplore.ieee.org/document/9670085/metrics#metrics

8.

N.Rahimov, D.Khasanov, “The mathematical essence of logistic regression

for machine learning”, International Journal of Contemporary Scientific and Technical
Research. Pg. 102-105.

9.

Khasanov Dilmurod, Tojiyev Ma’ruf,Primqulov Oybek., “Gradient Descent

In Machine”. International Conference on Information Science and Communications
Technologies (ICISCT),

https://ieeexplore.ieee.org/document/9670169

10.

Babomurodov O. Zh., Rakhimov N. O. Stages of knowledge extraction from

electronic information resources. Eurasian Union of Scientists. International Popular
Science Bulletin. Issue. № 10(19)/2015. – pp. 130-133. ISSN: 2411-6467

11.

N Raximov, M Doshchanova, O Primqulov, J Quvondikov. Development of

architecture of intellectual information system supporting decision-making for health
of sportsmen.// 2022 International Congress on Human-Computer Interaction,
Optimization and Robotic Applications (HORA)Prasun Biswas, “Loss function in deep
learning and python implementation”(web article), www.towardsdatascience.com,
2021.

12.

Rahimov Nodir, Khasanov Dilmurod. (2022). The Mathematical Essence

Of

Logistic

Regression

For

Machine

Learning.

https://doi.org/10.5281/zenodo.7239169

13.

T. Maruf, "Hazard recognition system based on violation of the integrity of

the field and changes in the intensity of illumination on the video image," 2022
International Conference on Information Science and Communications Technologies
(ICISCT),

Tashkent,

Uzbekistan,

2022,

pp.

1-3,

doi:

10.1109/ICISCT55600.2022.10146933

14.

Ma’ruf Tojiyev, Ravshan Shirinboyev, Jahongirjon Bobolov. Image

Segmentation By Otsu Method. International Journal of Contemporary Scientific and
Technical Research, (Special Issue), 2023. 64–72,

https://zenodo.org/record/7630893


MULOHAZALAR VA MATRITSALARNING OʻZORO BOGʻLANISHI

Maniyozov Oybek Azatboyevich

Toshkent axborot texnologiyalari univeristeti Fargʻona filiali

maniyozovo@gmail.com

Annotatsiya

: Ushbu maqolada mulohalar va matritsalar bogʻlanishi,

mulohazalarni shifrlash, shirflangan ma’lumotlarni yechish, matritsalar ko’paytmasi
hamda teskari matritsalarni Matlab programmasida ishlatish haqida ma’lumot berilgan.

Kalit soʻzlar:

Mulohaza, sodda mulohaza, shifr matritsa, teskari matritsa,

matritsalar koʻpaytmasi.

Библиографические ссылки

Jiang S., Li J., Zhang S., Gu Q., Lu C., Liu H. Landslide risk prediction by using GBRT algorithm: Application of artificial intelligence in disaster prevention of energy mining. Process. Saf. Environ. Prot. 2022, 166, 384–392.

N.Rahimov, D.Khasanov,“The application of multiple linear regression algorithm and python for crop yield prediction in agriculture”, Harvard educational and scientific review, Vol.2. Issue 1 Pg. 181-187.

N.Raximov, J.Kuvandikov, D.Khasanov, “The importance of loss function in artificial intelligence”, International Conference on Information Science and Communications Technologies (ICISCT 20222), DOI:10.1109/ICISCT55600.2022.10146883

Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.

Oliver Theobald. Machine Learning for Absolute Beginners. – Scatterplot Press. 2017. pg.43-98

M.Tojiyev,O.Primqulov,D.Xasanov, “Image segmentation in OpenCV and Python, DOI:10.5958/2249-7137.2020.01735.8

N.Raximov, O.Primqulov, B.Daminova,“Basic concepts and stages of research development on artificial intelligence”, International Conference on Information Science and Communications Technologies (ICISCT), www.ieeexplore.ieee.org/document/9670085/metrics#metrics

N.Rahimov, D.Khasanov, “The mathematical essence of logistic regression for machine learning”, International Journal of Contemporary Scientific and Technical Research. Pg. 102-105.

Khasanov Dilmurod, Tojiyev Ma’ruf,Primqulov Oybek., “Gradient Descent In Machine”. International Conference on Information Science and Communications Technologies (ICISCT), https://ieeexplore.ieee.org/document/9670169

Babomurodov O. Zh., Rakhimov N. O. Stages of knowledge extraction from electronic information resources. Eurasian Union of Scientists. International Popular Science Bulletin. Issue. № 10(19)/2015. – pp. 130-133. ISSN: 2411-6467

N Raximov, M Doshchanova, O Primqulov, J Quvondikov. Development of architecture of intellectual information system supporting decision-making for health of sportsmen.// 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)Prasun Biswas, “Loss function in deep learning and python implementation”(web article), www.towardsdatascience.com, 2021.

Rahimov Nodir, Khasanov Dilmurod. (2022). The Mathematical Essence Of Logistic Regression For Machine Learning. https://doi.org/10.5281/zenodo.7239169

T. Maruf, "Hazard recognition system based on violation of the integrity of the field and changes in the intensity of illumination on the video image," 2022 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan, 2022, pp. 1-3, doi: 10.1109/ICISCT55600.2022.10146933

Ma’ruf Tojiyev, Ravshan Shirinboyev, Jahongirjon Bobolov. Image Segmentation By Otsu Method. International Journal of Contemporary Scientific and Technical Research, (Special Issue), 2023. 64–72, https://zenodo.org/record/7630893

inLibrary — это научная электронная библиотека inConference - научно-практические конференции inScience - Журнал Общество и инновации UACD - Антикоррупционный дайджест Узбекистана UZDA - Ассоциации стоматологов Узбекистана АСТ - Архитектура, строительство, транспорт Open Journal System - Престиж вашего журнала в международных базах данных inDesigner - Разработка сайта - создание сайтов под ключ в веб студии Iqtisodiy taraqqiyot va tahlil - ilmiy elektron jurnali yuridik va jismoniy shaxslarning in-Academy - Innovative Academy RSC MENC LEGIS - Адвокатское бюро SPORT-SCIENCE - Актуальные проблемы спортивной науки GLOTEC - Внедрение цифровых технологий в организации MuviPoisk - Смотрите фильмы онлайн, большая коллекция, новинки кинопроката Megatorg - Доска объявлений Megatorg.net: сайт бесплатных частных объявлений Skinormil - Космецевтика активного действия Pils - Мультибрендовый онлайн шоп METAMED - Фармацевтическая компания с полным спектром услуг Dexaflu - от симптомов гриппа и простуды SMARTY - Увеличение продаж вашей компании ELECARS - Электромобили в Ташкенте, Узбекистане CHINA MOTORS - Купи автомобиль своей мечты! PROKAT24 - Прокат и аренда строительных инструментов