European International Journal of Multidisciplinary Research
and Management Studies
https://eipublication.com/index.php/eijmrms
TYPE: Original Research
PAGE NO.: 6-11
DOI
OPEN ACCESS
SUBMITTED: 07 December 2024
ACCEPTED: 08 January 2025
PUBLISHED: 10 February 2025
VOLUME: Vol. 05, Issue 02, 2025
COPYRIGHT: © 2025 Original content from this work may be used under the terms
of the Creative Commons Attribution 4.0 License.
Machine Learning-Based Classification of Mental Health Status on Social Media:
A Case Study of Kunduz, Afghanistan
Rohullah Adeeb
Department of Information Systems, Computer Science Faculty, Kunduz
University, Afghanistan
Hekmatullah Hekmat
Department of Information Systems, Computer Science Faculty, Kunduz
University, Afghanistan
Abdullah Zahirzada
Department of Information Systems, Computer Science Faculty, Kunduz
University, Afghanistan
Abstract:
The proliferation of user-generated content
results from social media's rapid development. An
interdisciplinary field called computational cyber-
psychology uses machine learning techniques to
investigate fundamental psychological tendencies. Our
study uses social media usage patterns to infer users'
mental health status. The emergence of social media
platforms in Afghanistan has profoundly affected the
country's young people. This study aims to implement a
predictive model using three machine-learning
algorithms. The three selected algorithms are Naïve
Bayes, Random Forest, and K-Nearest Neighbor (K-NN).
The dataset used in this study was collected
through a questionnaire from students at public
and private universities in Kunduz in 2024. Data preprocessing is
performed before implementing the predictive models. The
dataset is prepared carefully to ensure well-balanced
samples in each category. The study finds that
Random Forest is the best of the three classifiers, with an accuracy of
77%. The findings may benefit other researchers
and policy or decision-makers in Afghanistan.
Keywords:
Random Forest, K-NN, Social Media, Mental
Health, Afghanistan.
Introduction:
A notable aspect of the Internet's
explosive growth has been the creation of social
media, which has drawn a large number of people
eager to express themselves on sites like Facebook,
Twitter, and others. Users are creating a cyberspace
that interacts with and reflects the actual world
through their online habits. As a result, a person's
online behavior may be a good reflection of their
psychological traits offline. Owing to everyday pressures,
external circumstances, and other factors, an increasing number
of people are experiencing mental illnesses
such as depression, anxiety, and stress. These
mental illnesses can have a serious negative impact on
users' lives and can even lead to suicide. In the past,
individuals with mental health issues would have been
told to see a therapist or could have gone out on their
own to get psychotherapy. Individuals become aware
of mental health issues through surveys or gut feelings.
Lack of resources may prevent psychotherapy from
becoming accessible, even in cases where a mental
health issue is identified. These days, the Internet
offers a fresh way to handle such circumstances [1].
Due to increased social comparison and validation-seeking
tendencies, people who use social media
regularly are more likely to suffer from mental
health conditions like anxiety, depression, and
loneliness, according to studies. Social media usage
frequently reflects offline habits, which can be used to
create mental health assessment prediction models
[2]. Research from the World Health Organization in
2018 found two million Afghans struggling with mental
distress, and these numbers are likely much higher
today. Still, many suffer in silence. The mental well-
being of Afghans has become a pressing concern that
demands immediate attention. Afghanistan lacks
qualified mental health professionals, such as
psychiatrists, psychologists, and counselors.
Therefore, it is beneficial to identify the factors
associated with mental health in Afghanistan through
a predictive mental health model. This study also aims to find a
suitable classifier for this task. Three popular
classifiers, Naïve Bayes, Random Forest, and K-Nearest
Neighbor, are selected for the study. The dataset used
in this study was collected through a questionnaire
from students at public and private universities in Kunduz
during 2024, and comprises records of
students aged between 18 and 27 years.
Classifiers Selected
Classification is a supervised learning technique whose
primary objective is to construct models based on
known data and predict new data categories. In
classification, models are built by splitting a supplied
dataset into training and test sets. One or more
classification algorithms run through the training set,
and the classifier models are subsequently developed.
The test set is then used to assess the accuracy of the
models [3]. Previous studies show that the dataset and
application have a major impact on the accuracy and
efficiency of machine learning algorithms. For example,
algorithms like Support Vector Machines (SVM) have
demonstrated efficacy in mental health prediction
research because of their capacity to handle
high-dimensional data, while Random Forest has
demonstrated strong results in a variety of prediction
tasks. Using a range of classifiers makes a fair evaluation
possible and offers insights into which
algorithms work best in particular mental health prediction
settings [4]. This study's inclusion of Naïve Bayes, K-NN,
and Random Forest serves as a comparative basis to
identify the most suitable classifier for social media-based
mental health prediction. Subsections A to C briefly
introduce the classifiers selected for this study, while
Section III describes the methodology of the work.
A. Naïve Bayes: In Naïve Bayes learning, a Bayesian
probabilistic model assigns a posterior class probability
P(Y=Yj|X=Xi) to an instance. The simple Naïve Bayes
algorithm uses these probabilities to assign an
example to a class. The Naïve Bayes classifier converges
faster than logistic regression, so it requires less
training data. The method is widely used in a variety of
applications with multiple features. The classifier can be
trained with reasonable accuracy in a supervised
learning setting, and its performance is satisfactory in
many complex real-life problems. However, Naïve Bayes
is generally less precise than other classifiers and can
result in a higher error rate [5].
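The posterior computation described above can be illustrated with a minimal sketch. It is not the WEKA implementation used in this study: the function names, the toy records, the "low"/"high" labels, and the use of Laplace (add-one) smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def posterior(model, row):
    """Return P(Y = y | X = row) for every class y, using add-one smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(Y = y)
        for i, value in enumerate(row):
            # smoothed likelihood P(X_i = value | Y = y)
            score *= (value_counts[(i, label)][value] + 1) / (count + len(feature_values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

def predict(model, row):
    """Assign the row to the class with the largest posterior probability."""
    probs = posterior(model, row)
    return max(probs, key=probs.get)

# Hypothetical toy records: (daily usage bucket, restlessness score 1-5)
rows = [("<1h", 1), ("<1h", 2), ("5-6h", 5), ("5-6h", 4), ("3-4h", 4), ("1-2h", 2)]
labels = ["low", "low", "high", "high", "high", "low"]
model = train_naive_bayes(rows, labels)
probs = posterior(model, ("5-6h", 5))
print(predict(model, ("5-6h", 5)), probs)
```

The "naive" part is the product over features: each attribute is treated as conditionally independent given the class, which is what keeps training to simple counting.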
B. K-NN: The K-Nearest-Neighbors (K-NN)
approach is a nonparametric classification algorithm, so it
makes no assumptions about the underlying data
distribution. It is known for being both
straightforward and efficient. It is a supervised learning
algorithm: a labeled training dataset, with data
points divided into several classes, is provided in order to
predict the class of unlabeled data. Different
criteria can be used to identify the class to
which the unlabeled data belongs. Typically, K-NN is
employed as a classifier that categorizes data
based on the nearest training examples in the
feature space. The approach is employed because it is
easy to use and quick to compute. For continuous data, it
finds the closest neighbors using the
Euclidean distance [6].
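As a minimal sketch of this idea, the following code classifies a query point by majority vote among its k nearest training points under the Euclidean distance. The function names, the choice k = 3, and the toy two-feature data are assumptions for illustration only and are not taken from the study's dataset.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Label the query with the most common class among its k nearest neighbors."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two 1-5 questionnaire scores per respondent.
train_x = [(1, 1), (2, 1), (1, 2), (5, 4), (4, 5), (5, 5)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_x, train_y, (4, 4)))
print(knn_predict(train_x, train_y, (2, 2)))
```

An odd k is a common choice for two-class problems because it avoids tied votes.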
C. Random Forest: The random forest algorithm,
proposed by L. Breiman in 2001, has been extremely
successful as a general-purpose classification and
regression method. The approach, which combines
several randomized decision trees and aggregates their
predictions by averaging, has shown excellent
performance in settings where the number of variables
is much larger than the number of observations.
Moreover, it is versatile enough to be applied to large-
scale problems, is easily adapted to various ad hoc
learning tasks, and returns measures of variable
importance [7].
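The two ingredients named above, randomized trees and aggregation by averaging, can be sketched in a deliberately simplified form. In the code below each "tree" is only a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, whereas a real random forest grows full trees and samples features at every split. All names, the toy data, and the parameter values are assumptions for illustration.

```python
import random

def fit_stump(rows, labels, feature):
    """Best single-threshold rule on one feature: predict `above` when x > t."""
    best = None
    for t in sorted({row[feature] for row in rows}):
        for above in (0, 1):
            preds = [above if row[feature] > t else 1 - above for row in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return feature, t, above

def stump_predict(stump, row):
    feature, t, above = stump
    return above if row[feature] > t else 1 - above

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)         # random feature for this stump
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def forest_predict(forest, row):
    """Average the 0/1 votes of all stumps and threshold the mean at one half."""
    mean_vote = sum(stump_predict(s, row) for s in forest) / len(forest)
    return 1 if mean_vote > 0.5 else 0

# Hypothetical toy data: two 1-5 scores per respondent, binary target.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 5), (5, 4), (5, 5), (4, 4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(forest_predict(forest, (1, 1)), forest_predict(forest, (5, 5)))
```

Because every stump sees a different resample of the data, individual errors tend to differ between stumps, and averaging the votes smooths them out.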
METHODOLOGY
Machine learning is used to discover previously
unknown and hidden patterns in large amounts of data
in order to obtain valuable information. As stated earlier, one
of the objectives of this work is to find a suitable
classifier among the most popular machine learning
techniques.
A. Data Source:
In Kunduz, Afghanistan, during the 2024 academic
year, a questionnaire survey was administered at
public and private universities to collect the data for
this study. The collection contains the records of 304
students in total, 171 males and 132 females. The
dataset contains 20 attributes in total. The attributes
can be categorized into groups based on factors.
B. Data Preprocessing:
There were no missing or outlier values to clean up
because the dataset used in this study was designed on the basis of
related studies and a background review of
mental health status prediction models. However,
relevant features were selected in order to obtain a
precise model.
C. Selected Features:
Each sample in the dataset consists of 20 attributes. An
essential step in this initial stage is to identify the
attributes relevant to the study, which is carried out through
an extensive literature review. The successful implementation
of any data mining model depends on having meaningful
factors/attributes. In fact, adding unnecessary features
might harm data mining, since it becomes harder to
draw conclusions from samples that are filled with redundant
and irrelevant information.
Table 1. Attributes used in this study.

| No | Attribute | Values |
|----|-----------|--------|
| 1 | Gender | Male; Female |
| 2 | What is your age? | Under 18; 18-22; 23-26; >27 |
| 3 | Relationship Status | Single; Married; Engaged |
| 4 | Occupation Status | Salaried worker; University student; Jobless |
| 5 | What type of organizations are you affiliated with? | University; Company; Government |
| 6 | Do you use social media? | Yes; No |
| 7 | What social media platforms do you commonly use? | Facebook; YouTube; WhatsApp |
| 8 | What is the average time you spend on social media every day? | Less than 1 hour; Between 1 & 2 hours; Between 3 & 4 hours; Between 5 & 6 hours |
| 9 | How often do you find yourself using social media without a specific purpose? | 1; 2; 3; 4; 5 |
| 10 | How often do you get distracted by social media when you are busy doing something? | 1; 2; 3; 4; 5 |
| 11 | Do you feel restless if you haven't used social media in a while? | 1; 2; 3; 4; 5 |
| 12 | On a scale of 1 to 5, how easily distracted are you? | 1; 2; 3; 4; 5 |
| 13 | On a scale of 1 to 5, how much are you bothered by worries? | 1; 2; 3; 4; 5 |
| 14 | Do you find it difficult to concentrate on things? | 1; 2; 3; 4; 5 |
| 15 | On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media? | 1; 2; 3; 4; 5 |
| 16 | Following the previous question, how do you feel about these comparisons, generally speaking? | 1; 2; 3; 4; 5 |
| 17 | How often do you look to seek validation from features of social media? | 1; 2; 3; 4; 5 |
| 18 | How often do you feel depressed or down? | 1; 2; 3; 4; 5 |
| 19 | On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate? | 1; 2; 3; 4; 5 |
| 20 | On a scale of 1 to 5, how often do you face issues regarding sleep? | 1; 2; 3; 4; 5 |
D. Evaluation Metrics:
Evaluation metrics are applied to assess the
efficiency and performance of the deployed predictive
model. Four commonly used measures are applied in
this study [5]: Accuracy, Precision, Recall, and
F-Measure. Each is summarized briefly below:
Accuracy: As the first significant measure of how
well the model performs, this metric is the one most frequently
employed in classification. It is stated as a percentage
(0 percent to 100 percent). Equation 1 below can be
used to determine it:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
True positives (TP) is the number of data rows in the test set
that had a positive target and were predicted to have a positive
target. True negatives (TN) is the number of
test set data rows that had a negative target and were
predicted to have a negative target. False positives (FP)
is the number of test set data rows
with a negative target but a predicted positive target.
False negatives (FN)
is the number of test set data rows with a positive
target but a predicted negative target [8].
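The four counts and Equation 1 can be computed directly from paired lists of actual and predicted targets. The sketch below uses hypothetical function names and a made-up eight-row test set, not the study's data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for paired actual and predicted targets."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # positive target, predicted positive
        elif a != positive and p != positive:
            tn += 1   # negative target, predicted negative
        elif a != positive and p == positive:
            fp += 1   # negative target, predicted positive
        else:
            fn += 1   # positive target, predicted negative
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Equation 1: share of all test rows that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical test set of eight rows (1 = positive target, 0 = negative target).
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```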
Precision: This metric originated in the field of
information retrieval, but it also has applications in
classification and is a valuable addition to measuring
performance. It is also given as a percentage.
Equation 2 below can be used to determine it:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
Recall: Recall is frequently paired with Precision in the area
of information retrieval, and it likewise provides useful
information for assessing performance. It is also given
as a percentage. Equation 3 below can be used to
determine it:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
F-Measure: The F-Measure is the harmonic mean of precision
and recall, where precision is the percentage of predicted
positive instances that are actually positive and recall is the
percentage of actual positive instances that are
correctly recognized by the algorithm [9].
Equation 4 below can be used to determine it:
$$\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
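Equations 2 to 4 translate into three short functions. In the sketch below the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something specified in this paper.

```python
def precision(tp, fp):
    """Equation 2: share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Equation 3: share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(p, r):
    """Equation 4: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives.
p = precision(6, 2)
r = recall(6, 4)
print(p, r, f_measure(p, r))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot reach a high F-Measure by excelling at only one of precision and recall.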
E. Implementation:
According to the literature review, no single classifier
consistently generates correct predictions for all cases.
This study uses the three machine learning techniques
discussed in Section II to create a prediction model for
the mental health status of young adults on social
media. The predictive model can be built once
preprocessing is finished. The model is trained
and tested using the well-known machine learning tool
WEKA [10]. All parameters for the models'
implementation are left at their default values, and the ratio of the
training set to the test set is 80:20.
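For readers who want to see what an 80:20 holdout split amounts to for 304 records, a minimal sketch follows. It is written in Python rather than WEKA, so it is not the procedure actually run in this study: the function name, the random seed, and the rounding rule (truncation) are assumptions, and WEKA's own split may differ in which records land in each set.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # truncates 243.2 to 243
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for the 304 questionnaire records (record identifiers only).
records = list(range(304))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before the cut matters when records are stored in collection order, since otherwise the test set would contain only the last respondents surveyed.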
RESULTS
Several experiments were carried out in order to
implement the most suitable predictive model. The
evaluation metrics described in Section III are used to
assess the models. Table 2 displays the outcomes.
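Table 2. Accuracy and performance measures of the models.

| Classifier | Accuracy | Precision | Recall | F-Measure |
|------------|----------|-----------|--------|-----------|
| Random Forest | 77% | 73% | 78% | 76% |
| Naïve Bayes | 72% | 73% | 60% | 66% |
| KNN | 75% | 70% | 82% | 75% |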
As indicated in Table 2, Random Forest produces the best
overall outcome, with the highest accuracy and F-Measure,
although K-NN attains the highest recall. Because the nature of the
data can have a significant impact on how well each method
performs, this is by no means proof that
Random Forest is always better than other algorithms.
It can be argued, however, that among the commonly used
classifiers compared here, Random Forest is the most
accurate tool for implementing the prediction model with this dataset.
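As a quick consistency check, the F-Measure column of Table 2 can be recomputed from the reported Precision and Recall using Equation 4. The sketch below uses only the rounded percentages printed in the table, so small differences from the reported F-Measure are to be expected; the variable and function names are illustrative.

```python
# Precision, Recall, and F-Measure as reported in Table 2 (rounded percentages).
table2 = {
    "Random Forest": {"precision": 0.73, "recall": 0.78, "f_measure": 0.76},
    "Naive Bayes":   {"precision": 0.73, "recall": 0.60, "f_measure": 0.66},
    "KNN":           {"precision": 0.70, "recall": 0.82, "f_measure": 0.75},
}

def harmonic_mean(p, r):
    """Equation 4 applied to a precision/recall pair."""
    return 2 * p * r / (p + r)

recomputed = {
    name: harmonic_mean(m["precision"], m["recall"]) for name, m in table2.items()
}

for name, value in recomputed.items():
    reported = table2[name]["f_measure"]
    print(f"{name}: recomputed {value:.3f}, reported {reported:.2f}")
```

For all three classifiers the recomputed value lies within one percentage point of the reported one, which is consistent with the table entries having been rounded before publication.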
CONCLUSION
As of July 2024, there were 5.45 billion internet users
worldwide, which amounted to 67.1 percent of the
global population. Afghanistan was home to 3.70
million social media users in January 2024, equating to
8.6 percent of the total population. A total of 27.67
million cellular mobile connections were active in
Afghanistan in early 2024, with this figure equivalent
to 64.6 percent of the total population. In this study, a
predictive model is used to address the issue of mental
health status on social media among young adults in
Kunduz province, Afghanistan. When the performance
of the various models was compared, Random Forest
produced the best results, with Accuracy, Precision,
Recall, and F-Measure of 77%, 73%, 78%, and 76%,
respectively. The findings of this study could help other
researchers and policy or decision makers in
Afghanistan.
REFERENCES
[1] Hao, B., Li, L., Li, A., & Zhu, T. (2013). Predicting mental
health status on social media: a preliminary study on
microblog. In Cross-Cultural Design. Cultural Differences
in Everyday Life: 5th International Conference, CCD
2013, Held as Part of HCI International 2013, Las Vegas,
NV, USA, July 21-26, 2013, Proceedings, Part II 5 (pp.
101-110). Springer Berlin Heidelberg.
[2] C. E. Montenegro, P. F. Geng, and R. P. Oliver, “Social
media and mental health: A perspective on youth and
online behavior,” Journal of Social Computing, vol. 12,
no. 4, pp. 67-81, 2021.
[3] Zahirzada, A., Chanmas, G., & Chan, J. H. (2023). A data
mining model for predicting diarrhea in Afghan children.
Authorea Preprints.
[4] K. S. Ray, J. T. Hogan, and A. N. Kumar, “Comparative
study of classifiers for mental health prediction on social
media,” in Proc. IEEE Int. Conf. Machine Learning
Applications (ICMLA), New York, USA, 2020, pp. 554-560.
[5] Zahirzada, A., & Lavangnananda, K. (2021, January).
Implementing predictive model for low birth weight in
Afghanistan. In 2021 13th International Conference on
Knowledge and Smart Technology (KST) (pp. 67-72).
IEEE.
[6] Zahirzada, A., Zaheer, N., & Shahpoor, M. A. (2023).
Machine Learning Algorithms to Predict Anemia in
Children Under the Age of Five Years in Afghanistan: A
Case of Kunduz Province. Journal of Survey in Fisheries
Sciences, 10(4S), 752-762.
[7] Biau, G., & Scornet, E. (2016). A random forest guided
tour. Test, 25, 197-227.
[8] Altabrawee, H., Ali, O. A. J., & Ajmi, S. Q. (2019).
Predicting students’ performance using machine
learning techniques. Journal of University of
Babylon for Pure and Applied Sciences, 27(1), 194-205.
[9] Br, T. Der. (2022). The F-Measure Paradox. February.
[10] Frank, E., Hall, M., Trigg, L., Holmes, G., & Witten,
I. H. (2004). Data mining in bioinformatics using Weka.
Bioinformatics, 20(15), 2479–2481.
