27
13.
Каршиев А. А., Маматкулова У. Е., Шобутаев К. С. РЕАЛИЗАЦИЯ
КВАЛИМЕТРИЧЕСКОГО ПОДХОДА В УПРАВЛЕНИИ КАЧЕСТВОМ
ОБРАЗОВАНИЯ
СТУДЕНТОВ
СОВРЕМЕННОГО
УНИВЕРСИТЕТА
//Европейский журнал исследований и рефлексии в области образовательных
наук. – 2019. – Т. 2019.
14.
Obid o’g, Assistent Salimov Jamshid, Assistent Abror Mamaraimov
Kamalidin o'g, and Assistent Normatov Nizomiddin Kamoliddin o‘g. "Numpy Library
Capabilities. Vectorized Calculation In Numpy Va Type Of Information." Eurasian
Research Bulletin 15 (2022): 132-137.
15.
Ziyoda,
Maydonova,
and
Normatov
Nizommiddin.
"RAQAMLI
IQTISODIYOTDA SUN'IY INTELLEKT TEXNOLOGIYALARINI TURLI
SOHALARDA AVTOMATLASHTIRISH VOSITALARI." International Journal of
Contemporary Scientific and Technical Research (2023): 246-250.
16.
Nizomiddin, Normatov. "TA’LIMDA DASTURLASH JARAYONINI
BAHOLASHGA ASOSLANGAN AVTOMATLASHTIRILGAN TIZIMNI TADBIQ
ETISH." International Journal of Contemporary Scientific and Technical Research
(2023): 24-28.
17.
Kamoliddin o‘g’li, Normatov Nizomiddin, and Ergashev Sirojiddin Baxtiyor
o‘g‘li. "ERWIN DASTURI YORDAMIDA IDEF0, IDEF3 VA DFD STANDAT
DIAGARAMMALARIDAN FOYDALANIB TIZIM SIFATIDA YARATILGAN
UNIVERSITETNING MONITORING BO ‘LIMI LOYIHASI." Новости
образования: исследование в XXI веке 1.6 (2023): 378-386.
DATA PREPROCESSING TECHNIQUES IN MACHINE LEARNING
Raximov Nodir Odilovich,
Khasanov Dilmurod
Tashkent University of Information Technologies
Abstract.
In this paper, importance of preprocessing and techniques in this field
such as data cleaning, dimensionality reduction, smoothing, normalization are
illustrated. During the research we mentioned some details of techniques above.
However, our research includes only theoretical aspect of data preprocessing. The data
preprocessing phase while arduous and time-intensive stands as the cornerstone of data
science, possessing paramount significance. Neglecting the meticulous cleansing and
structuring of data has the potential to undermine the integrity and efficacy of
subsequent modeling endeavors.
Keywords:
data preprocessing, data cleaning, normalization, exploratory data
analysis, dimensionality reduction.
Introduction
When confronted with real-world data, Data Scientists invariably find it
necessary to employ preprocessing techniques to enhance data usability. Such
techniques serve the dual purpose of rendering the data more amenable for utilization
28
within machine learning (ML) algorithms and mitigating complexity to forestall
overfitting, ultimately yielding a superior model. Upon comprehending the nuances of
your dataset and identifying primary data intricacies through Exploratory Data
Analysis (EDA), the subsequent stage entails data preprocessing, which involves
preparing the dataset for its application in a model. Ideally, one would hope for a
dataset devoid of imperfections. Nevertheless, real-world data is inherently prone to
various issues necessitating remediation. For instance, within an organizational
context, inconsistencies such as typographical errors, missing data, disparate scales,
and other anomalies frequently manifest. These real-world challenges must be rectified
to enhance data utility and comprehensibility. This pivotal phase, where data is
cleansed, and most issues are resolved, is aptly termed "data preprocessing"[2].
Preprocessing methods
Neglecting the data preprocessing step carries significant repercussions for
subsequent machine learning model utilization. Many models are incapable of
accommodating missing values, and various adverse data characteristics, including
outliers, high dimensionality, and noise, can adversely impact model performance.
Thus, by undertaking data preprocessing, the dataset is enhanced in terms of
completeness and accuracy. This pivotal phase is indispensable for effecting essential
data adjustments prior to supplying the dataset to the machine learning model, thereby
ensuring the model's effectiveness and reliability[1,3].
Fig 1. Typical architecture of preprocessing
Dimensionality reduction – is primarily concerned with the reduction of input
features within training data. In the realm of real-world datasets, an abundance of
attributes is typically encountered. Failing to curtail this multitude of features can
potentially compromise the performance of a model when it is later applied to this
dataset. Effectively reducing the number of features while preserving a significant
portion of the dataset's variability yields several advantageous outcomes, including:
29
Conservation of computational resources: diminished feature space necessitates
fewer computational resources during modeling.
Enhanced model performance: a streamlined feature set often contributes to
improved model performance.
Mitigation of overfitting: Dimensionality reduction assists in preventing
overfitting, wherein a model becomes excessively complex and memorizes training
data, leading to a substantial drop in performance on test data.
Alleviation of multicollinearity: this technique helps alleviate multicollinearity,
a condition characterized by high correlations among one or more independent
variables. Moreover, the application of dimensionality reduction techniques
contributes to the reduction of noise within the dataset. Now, let us delve into the
primary approaches to dimensionality reduction that can be employed to enhance data
suitability for subsequent analysis[4,6].
Feature selection pertains to the process of identifying and retaining the most
salient variables (features) associated with the prediction variable. In essence, it
involves selecting attributes that exert the greatest influence on the model.
Normalization is a data preparation method employed to standardize the values
of attributes within a dataset, thereby enhancing the effectiveness and precision of
machine learning models. The primary objective of normalization is to mitigate any
potential biases and discrepancies stemming from variations in feature scales. Various
forms of data normalization are available. Suppose you possess a dataset denoted as X,
containing N rows (entries) and D columns (features). In this context, X[:,i] signifies
feature i, while X[j,:] signifies entry j.
-
Z normalization(Standardization);
-
Min-Max normalization;
-
Unit vector normalization.
For instance, Z normalization method Gaussian function is used to change the
value of elements in dataset.
𝑋̂[: , 𝑖] =
𝑋[:,𝑖]− 𝜇
𝑖
𝜎
𝑖
, 𝜇
𝑖
=
1
𝑁
∑
𝑋[𝑘, 𝑖], 𝜎
𝑖
= √
1
𝑁−1
∑
(𝑋[𝑘, 𝑖] − 𝜇
𝑖
)
2
𝑁
𝑘=1
𝑁
𝑘=1
(1)
In (1.1)
standardization does not
change the type of distribution:
𝑋̂ = 𝑎𝑋 + 𝑏 → 𝑓
𝑋̂
(𝑥) =
1
|𝑎|
𝑓(
𝑥−𝑏
𝑎
)
(2)
This transformation sets the mean of data to 0 and the standard deviation to 1. In
most cases, standardization is used feature-wise Min-Max normalization[9]:
𝑋̂[: , 𝑖] =
𝑋[:,𝑖]− min (𝑋[:,𝑖]
max(𝑋[:,𝑖])−min (𝑋[:,𝑖])
(3)
Z normalization result :
[
13 16 19
22 23 38
47 56 70
]
[
0.013 0.051
0.1
0.155 0.172 0.43
0.586 0.741 0.92
]
30
Data cleaning - a paramount facet of the data preprocessing phase pertains to the
identification and rectification of flawed and imprecise observations within the dataset,
thereby augmenting its overall quality. This methodology entails the discernment of
inadequacies, inaccuracies, redundancies, inconsequentialities, or void entries in the
dataset. Following the identification of these anomalies, it becomes imperative to
effectuate corrective measures, which may encompass data modification or exclusion.
The strategic approach employed in this context is contingent upon the problem domain
under consideration and the overarching objectives of the project. We will now
delineate some of the prevalent data analysis challenges and expound upon the methods
for their resolution. Noisy data typically encompasses inconsequential or erroneous
entries within a dataset, which may manifest as meaningless data points, incorrect
records, or duplicated observations. For instance, consider a scenario where a database
column labeled 'age' contains negative values, rendering the observation nonsensical.
Another circumstance pertains to the removal of extraneous or irrelevant data.
Especially, in regression problems we have to work on real numbers, in this case
normalization is one of the most important techniques[11].
Conclusion
Through the research above data preprocessing techniques help to improve
accuracy of the prediction result. For example, Dimensionality reduction makes easier
computational process and reduces execution time for prediction or analysis model.
Moreover, other techniques such as normalization makes understandable dataset, so
dataset values are transformed the mean of data to 0 and the standard deviation to 1.
Some theoretical explanations are given according to preprocessing techniques’ details
in this article. The next researches, we will try to touch mathematical aspect of these
processes.
References:
1.
Jiang S., Li J., Zhang S., Gu Q., Lu C., Liu H. Landslide risk prediction by
using GBRT algorithm: Application of artificial intelligence in disaster prevention of
energy mining. Process. Saf. Environ. Prot. 2022, 166, 384–392.
2.
N.Rahimov, D.Khasanov,“The application of multiple linear regression
algorithm and python for crop yield prediction in agriculture”, Harvard educational and
scientific review, Vol.2. Issue 1 Pg. 181-187.
3.
N.Raximov, J.Kuvandikov, D.Khasanov, “The importance of loss function
in artificial intelligence”, International Conference on Information Science and
Communications
Technologies
(ICISCT
20222),
DOI:
10.1109/ICISCT55600.2022.10146883
4.
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.
5.
Oliver Theobald. Machine Learning for Absolute Beginners
– Scatterplot
Press. 2017. pg.43-98
6.
M.Tojiyev,O.Primqulov,D.Xasanov, “Image segmentation in OpenCV and
Python,
DOI:10.5958/2249-7137.2020.01735.8
31
7.
N.Raximov, O.Primqulov, B.Daminova,“Basic concepts and stages of
research development on artificial intelligence”, International Conference on
Information
Science
and
Communications
Technologies
(ICISCT),
www.ieeexplore.ieee.org/document/9670085/metrics#metrics
8.
N.Rahimov, D.Khasanov, “The mathematical essence of logistic regression
for machine learning”, International Journal of Contemporary Scientific and Technical
Research. Pg. 102-105.
9.
Khasanov Dilmurod, Tojiyev Ma’ruf,Primqulov Oybek., “Gradient Descent
In Machine”. International Conference on Information Science and Communications
Technologies (ICISCT),
https://ieeexplore.ieee.org/document/9670169
10.
Babomurodov O. Zh., Rakhimov N. O. Stages of knowledge extraction from
electronic information resources. Eurasian Union of Scientists. International Popular
Science Bulletin. Issue. № 10(19)/2015. – pp. 130-133. ISSN: 2411-6467
11.
N Raximov, M Doshchanova, O Primqulov, J Quvondikov. Development of
architecture of intellectual information system supporting decision-making for health
of sportsmen.// 2022 International Congress on Human-Computer Interaction,
Optimization and Robotic Applications (HORA)Prasun Biswas, “Loss function in deep
learning and python implementation”(web article), www.towardsdatascience.com,
2021.
12.
Rahimov Nodir, Khasanov Dilmurod. (2022). The Mathematical Essence
Of
Logistic
Regression
For
Machine
Learning.
https://doi.org/10.5281/zenodo.7239169
13.
T. Maruf, "Hazard recognition system based on violation of the integrity of
the field and changes in the intensity of illumination on the video image," 2022
International Conference on Information Science and Communications Technologies
(ICISCT),
Tashkent,
Uzbekistan,
2022,
pp.
1-3,
doi:
10.1109/ICISCT55600.2022.10146933
14.
Ma’ruf Tojiyev, Ravshan Shirinboyev, Jahongirjon Bobolov. Image
Segmentation By Otsu Method. International Journal of Contemporary Scientific and
Technical Research, (Special Issue), 2023. 64–72,
https://zenodo.org/record/7630893
MULOHAZALAR VA MATRITSALARNING OʻZORO BOGʻLANISHI
Maniyozov Oybek Azatboyevich
Toshkent axborot texnologiyalari univeristeti Fargʻona filiali
Annotatsiya
: Ushbu maqolada mulohalar va matritsalar bogʻlanishi,
mulohazalarni shifrlash, shirflangan ma’lumotlarni yechish, matritsalar ko’paytmasi
hamda teskari matritsalarni Matlab programmasida ishlatish haqida ma’lumot berilgan.
Kalit soʻzlar:
Mulohaza, sodda mulohaza, shifr matritsa, teskari matritsa,
matritsalar koʻpaytmasi.