The American Journal of Engineering and Technology
18
https://www.theamericanjournals.com/index.php/tajet
TYPE
Original Research
PAGE NO.
18-23
10.37547/tajet/Volume07Issue07-03
OPEN ACCESS
SUBMITED
18 June 2025
ACCEPTED
29 June 2025
PUBLISHED
04 July 2025
VOLUME
Vol.07 Issue 07 2025
CITATION
Denis PINCHUK. (2025). Scalable Computer Vision in Enterprises:
Deployment, Limitations and Future Directions. The American Journal of
Engineering and Technology, 7(07).
https://doi.org/10.37547/tajet/Volume07Issue07-03
COPYRIGHT
© 2025 Original content from this work may be used under the terms
of the creative commons attributes 4.0 License.
Scalable Computer Vision
in Enterprises:
Deployment, Limitations
and Future Directions.
Denis PINCHUK
Senior Data Science Engineer at The Walt Disney
Company, USA
Abstract:
Computer vision (CV) is increasingly
embedded in enterprise workflows. This article presents
a comprehensive analysis of how CV systems are being
used to automate complex visual tasks, replace
repetitive labor, and enhance decision-making in
different industries at scale. Special attention is given to
the key determinants of CV effectiveness and
operational
challenges
companies
face
when
implementing the technology. The author notes that
treating computer vision not as a static tool but as an
evolving infrastructure, organizations can unlock
substantial value while preparing for the next
generation of AI-driven optimization.
Keywords:
computer vision, artificial intelligence, neural
networks, enterprise workflows, business process
optimization.
INTRODUCTION
Computer vision (CV), a subfield of artificial intelligence
focused on enabling machines to interpret and process
visual information, has transitioned from experimental
research into widespread industrial application. As
enterprises increasingly seek to automate visually
intensive tasks, CV systems are being integrated into
core business workflows across diverse sectors,
including manufacturing, retail, logistics, insurance,
agriculture, and public safety. These systems promise
not only operational efficiency and cost savings, but also
new levels of scalability. However, real-world
deployments reveal a complex landscape marked by
significant technical, infrastructural, and organizational
challenges.
This article provides a comprehensive overview of how
computer vision is currently used to optimize business
The American Journal of Engineering and Technology
19
https://www.theamericanjournals.com/index.php/tajet
processes, with an emphasis on practical use cases,
deployment barriers, and emerging trends. It draws on
both peer-reviewed research and first-hand industry
experience to examine the determinants of system
effectiveness,
including
data
quality,
model
architecture, computational constraints, and integration
strategies. The analysis highlights the need to treat CV
not as a static automation tool, but as a continuously
evolving component of enterprise infrastructure
—
one
that demands adaptive pipelines, robust evaluation
metrics, and long-term cross-functional collaboration.
By critically evaluating existing applications and future
directions, this paper aims to contribute to a deeper
understanding of the conditions under which computer
vision can deliver sustained value in real-world
enterprise environments.
Where Computer Vision Optimizes Business Processes
Computer vision replaces humans in tasks that require
constant visual attention
—
where people are prone to
fatigue, distraction, or inconsistency. The aim isn’t
simply to automate, but to do so with greater precision,
scale, and reliability.
In manufacturing, computer vision systems are widely
used to detect defects and monitor production lines.
They can spot inconsistencies in materials or shapes in
milliseconds. Research shows that modern CV solutions
can achieve 97% inspection accuracy [1]. Companies like
Siemens already have AI-driven quality control systems
designed to identify anomalies and defects across
different industries. One example is the automotive
industry, where manufacturers must avoid scratches,
dents, poor welds, and defective electronic components
[2].
In retail, computer vision delivers measurable
improvements. One common use is shelf monitoring:
instead of relying on employees to walk around the
store checking stock, cameras and vision algorithms
detect when a product is running low or misplaced and
automatically signal staff. This streamlines restocking,
cuts labor costs, and helps avoid lost sales due to empty
shelves [3]. A more analytical application involves
optimizing product placement. The system tracks how
customers interact with displays and uses that data to
determine which shelf arrangements maximize
purchase behavior. Though this is a different business
goal
—
one focused on logistics, the other on psychology
and sales
—
both rely on nearly identical CV technology.
In logistics and warehousing there are CV applications
like package sorting or movement tracking. In Amazon’s
fulfillment centers, for instance, an AI-powered trio of
robotic arms sort, stack, and consolidate millions of
items and customer orders. The latest version of one of
them called Sparrow uses advanced computer vision to
handle over 200 million unique products of all different
shapes, sizes, and weights [4]. One of the recent studies
shows that a computer vision platform can lead to as
much as a 45% reduction in the time required for
inventory counting and a 9% increase in inventory
accuracy [5].
In insurance, companies use computer vision to assess
vehicle damage from photos submitted after an
accident, reducing the need for physical inspections and
speeding up claims processing. Tractable, for example,
offers a CV-based platform that enables insurers to
process claims up to ten times faster than traditional
methods, with damage detection and cost estimation
models trained on millions of data-rich images [6].
And in agriculture, large-scale farms employ drone
footage and satellite imagery enhanced by computer
vision models to monitor crop health, detect pest
infestations, and predict yield. These models can analyze
patterns in plant coloration or canopy size to trigger
alerts or guide precision interventions. For instance,
leveraging AI and CV, John Deere is building AI-equipped
robot sprayers reducing herbicide usage by only
targeting weeds [7].
Security and surveillance represent another widespread
application. Large companies, airports, and factories use
CV to detect unauthorized access, track movement
patterns, or recognize suspicious behavior. These
systems analyze dozens of camera feeds in real time and
can flag anomalies before human operators notice
them. Research shows the AI сamera market is expected
to generate $35,5 billion in sales by 2034 as smart
surveillance and analytics gain popularity [8].
An illustrative example from my professional experience
involves the deployment of computer vision systems at
Walt Disney amusement parks to ensure that park
visitors remain secure. We use a network of installed
The American Journal of Engineering and Technology
20
https://www.theamericanjournals.com/index.php/tajet
cameras and custom-built CV algorithms to monitor
guest behavior in real time. These models are trained to
identify potentially dangerous situations
—
like someone
standing up on a ride or entering a restricted area
—
and
automatically notify operators. Previously, this kind of
monitoring required a person to watch 10 or 12 screens
at once, which is not only exhausting but also error-
prone. Our system can process all video feeds
simultaneously and with greater consistency, allowing us
to detect risks faster and more reliably.
Key Determinants of CV System Effectiveness
The success of a CV solution in optimizing business
processes depends on several interrelated factors. The
first and arguably most important component is data
quality. If the raw images used for training or prediction
are poor, there is little that even the advanced model
can salvage. Lighting conditions, resolution, occlusions,
and changes in the physical environment all introduce
variance. On our own project, for example, we’ve had to
continually monitor how changes in the park
environment
—
such as the installation of a new ride
that casts shadows in different areas
—
affect model
performance. Even if the camera itself hasn’t moved, the
new visual context can alter what the model “sees.” This
underscores the importance not only of high-quality
data at the training stage, but also of consistent, well-
understood inputs at inference time.
Beyond quality, data preprocessing and balancing are
also critical. In real-world scenarios, it's common for
datasets to contain far more examples of "normal"
conditions than of rare or hazardous events. Without
deliberate augmentation, rebalancing, or synthetic data
generation, the model may struggle to generalize
properly
—
especially in safety-critical use cases where
false negatives are unacceptable. As research has
shown, imbalanced training data can lead to
performance degradation in CV models, particularly in
anomaly classification tasks [9].
Another major factor is model architecture. Selecting
the right neural network structure depends not just on
accuracy, but also on computational efficiency,
deployment scalability, and hardware compatibility. In
our project, we’ve recently begun the proces
s of
migrating from one model to a more modern one
—
not
because the old one failed, but because it couldn’t scale
as efficiently meeting real-time requirements across
video streams. This kind of transition is resource-
intensive. Even with an established annotation pipeline
and pre-balanced datasets, adopting a new model often
means rebuilding much of the system infrastructure. It’s
effectively a new project, albeit one built on familiar
tracks.
That’s why model selection is both a technical and a
business decision. On the one hand, models trained
several years ago may still perform reasonably well, but
newer models released are often more adaptive, faster,
and easier to scale. Where an older model may require
extensive tuning to support inference from hundreds of
video streams, newer transformer-based or quantized
CNN architectures can handle such demands natively
and with far lower latency. In some cases, the primary
motivation for switching isn’
t accuracy at all
—
it’s cost.
If a modern model can generate predictions with 50%
less GPU load or achieve true real-time inference, the
operational savings across data centers can quickly
justify the upfront migration costs.
From a business process standpoint, companies must
strike a balance between locking themselves into a long-
term model (creating a “one
-
way door”) and preparing
for regular evolution. Ideally, a model pipeline should be
designed with modularity in mind
—
allowing retraining,
adaptation, or substitution of core components without
rebuilding everything from scratch. This flexibility is
especially important in rapidly evolving environments
where camera views, lighting, or visual targets may shift
over time.
Ultimately, the effectiveness of a CV system is not static.
It is shaped by a continuous feedback loop between
data, model, environment, and business constraints.
The companies that benefit most from computer vision
are those that treat it not as a static tool, but as a living
infrastructure
—
designed to evolve alongside
technology, context, and operational needs.
Implementation Challenges and Common Pitfalls
Despite the growing maturity of computer vision
technology, many implementation efforts still fall short
due to recurring and often underestimated challenges.
One of the most common problems is the lack of high-
quality training data discussed earlier. Computer vision
is data-hungry, and while organizations may have access
The American Journal of Engineering and Technology
21
https://www.theamericanjournals.com/index.php/tajet
to large volumes of images or video, these assets are
often unusable without proper annotation.
Sometimes data is simply missing
—
there aren’t enough
examples of edge cases or rare classes. Other times, the
annotations are too crude to be useful; for example,
drawing bounding boxes around objects may work for
simple detection tasks, but segmentation problems
require pixel-level precision. Without careful annotation
strategies, models end up overfitting to narrow patterns
—
essentially memorizing what an object “usually looks
like” rather than learning how to generalize to new
contexts.
To mitigate this, teams often turn to techniques like data
augmentation
—
flipping, cropping, rotating, or adding
noise to existing samples to artificially expand the
dataset. More advanced strategies include synthetic
data generation, where entire scenes are created
digitally to simulate rare or difficult-to-capture
conditions. Research has shown that training with high-
quality synthetic data, especially when combined with
domain adaptation techniques, can yield performance
close to real-data-trained models [10]. Recent advances
in self-supervised learning also opens possibilities for
models to bootstrap their own learning without fully
labeled data [11]. However, these methods require
expertise to implement effectively, and poor execution
can lead to models that perform well in simulation but
break down in the real world.
Another major barrier is the computational cost. Even
when the model performs well during testing, it may
require expensive GPU clusters to run at scale or meet
real-time demands. Many state-of-the-art vision models
are extremely resource-intensive, requiring high-
performance GPUs, expensive cloud infrastructure, and
significant energy consumption [12]. Updating to more
efficient architectures or shifting to edge-computing
strategies can reduce hardware load
—
but such
transitions must be weighed against engineering
overhead and financial cost.
Technical infrastructure often becomes a limiting factor.
Unlike typical web applications, computer vision
requires tight integration between hardware (cameras,
network interfaces, storage systems) and software.
Bandwidth constraints, latency, and storage limitations
become serious concerns, especially when dealing with
high-resolution live video. Organizations may also lack
the backend maturity to support continuous
deployment (CI/CD) pipelines or robust monitoring
systems to detect model drift
—
where prediction
accuracy degrades as real-world conditions shift over
time.
A further challenge, often overlooked, is organizational
and operational misalignment. Business leaders may
evaluate new CV deployments not against a defined set
of key performance indicators (KPIs), but against legacy
manual processes
—
often without establishing success
criteria or appropriate benchmarks. Such misalignment
can lead to unrealistic expectations, misinterpretation
of results, and the premature abandonment of
promising technologies.
I encountered this firsthand while working with the
Florida Department of Transportation. The goal was to
monitor vehicle counts and classify traffic composition
on public roads to inform infrastructure planning and
safety initiatives. Traditionally, the department relied on
inductive loops
—
physical sensors embedded in
roadways
—
to collect such data. Seeking a more
scalable solution, they aimed to repurpose existing
surveillance cameras for this task.
I was tasked with designing and deploying a CV-based
system capable of analyzing live video streams of
relatively low quality. The model needed to classify
vehicles in real time by type (e.g., passenger car, truck).
I selected YOLO (You Only Look Once), an object
detection architecture recognized for its high processing
speed [13]. While this choice involved a minor trade-off
in accuracy compared to more computationally
intensive models, YOLO’s real
-time performance was
essential, as legal constraints prohibited the department
from storing any footage.
The resulting system achieved 97% classification
accuracy
—
surpassing the 90% accuracy of the legacy
inductive loop system. Yet even this comparison
undersells the advantages of the CV-based approach.
Unlike physical sensors, which must be installed
individually across thousands of roadways, CV models
can be deployed across existing camera networks with
minimal marginal cost. Moreover, such systems are
inherently more maintainable: they do not degrade
physically, require no manual recalibration, and offer
The American Journal of Engineering and Technology
22
https://www.theamericanjournals.com/index.php/tajet
centralized scalability. The department gained a more
reliable, flexible, and cost-efficient source of traffic
intelligence
—
one that fundamentally changed the
nature of their data infrastructure.
This case underscores why the evaluation of CV systems
requires a shift in mindset. Traditional performance
metrics such as accuracy or cost must be contextualized
within
broader
considerations
—
deployment
scalability, operational resilience, and long-term
maintainability. In many instances, success cannot be
determined through a direct comparison to legacy
systems, but rather through a holistic assessment of
how well the technology aligns with the organization’s
evolving goals and constraints.
CONCLUSION AND FUTURE RESEARCH DIRECTIONS
Despite its immense potential, computer vision remains
a field where technical success and business value do
not always align seamlessly. Many organizations
underestimate the complexity of implementation.
Challenges
like
data
imbalance
and
shifting
environments can derail promising pilot projects if not
addressed proactively. Additionally, success metrics
often remain misaligned across teams, and traditional
benchmarks fail to capture the strengths of scalable CV
deployments. This calls for a mindset shift: evaluating
computer vision not merely as an automation upgrade,
but as an evolving system that demands infrastructure
thinking, adaptive strategies, and long-term cross-
functional collaboration.
Looking ahead, the future of enterprise computer vision
lies in greater modularity, improved model efficiency,
and tighter integration with self-learning systems. As
new architectures like transformer-based CV models
mature and edge computing becomes more accessible,
companies will be able to deploy vision solutions more
flexibly and cost-effectively.
A particularly promising direction is the integration of
large language models with CV systems, enabling more
context-aware, multimodal AI solutions. This fusion
allows enterprises not only to "see" but also to
"understand" visual input in more complex operational
contexts
—
such as interpreting surveillance footage
alongside written reports or automating inspection
workflows using both image data and text-based
specifications. Simultaneously, advances in synthetic
data generation and self-supervised learning are
reducing dependency on fully labeled datasets, making
it easier to scale in data-constrained domains.
Ultimately, organizations that treat computer vision as a
strategic capability
—
not a plug-and-play tool
—
will be
better positioned to harness its full transformative
power across industries.
REFERENCES
1.
El Melhaoui Ouafae,
Jeffali, Sara Said, S Elouaham, Integration of
artificial intelligence algorithms for defect
detection and shape recognition in mechanical
quality
control
//
Interactions,
2024.
2.
Siemens, AI
‑
based quality inspection, URL:
3.
Rocco Pietrini, Marina Paolanti, Adriano Mancini,
Emanuele Frontoni, Primo Zingaretti, A deep
learning-based system for shelf visual monitoring
// Expert Systems with Applications, 2024.
4.
Amazon, Amazon fulfillment center robotics and
AI,
2024.
URL:
https://www.aboutamazon.com/news/operation
s/amazon-fulfillment-center-robotics-ai
5.
William
Villegas-Ch,
Alexandra
Maldonado
Navarro, Santiago Sanchez-Viteri, Optimization of
inventory management through computer vision
and machine learning technologies // Intelligent
Systems
with
Applications,
2024.
DOI:
6.
InsurTech Digital, Insurance claims AI unicorn
Tractable closes $65 M Series E, 2023. URL:
https://insurtechdigital.com/articles/insurance-
claims-ai-unicorn-tractable-closes-65m-series-e
7.
Chris Padwick, Director of Machine Learning and
Computer Vision, John Deere, Transforming
Agriculture with AI and Computer Vision, 2024.
URL:
The American Journal of Engineering and Technology
23
https://www.theamericanjournals.com/index.php/tajet
8.
Transparency
Market
Research,
Artificial
Intelligence (AI) Camera Market to Generate US$
35,5 Billion in Sales by 2034 as Smart Surveillance
and Analytics Gain Popularity, 2025.
9.
Manisha Saini, Seba Susan, Tackling class
imbalance in computer vision: a contemporary
review // Artificial Intelligence Review, 2023. DOI:
10.
Lukas Malte Kemeter, Rasmus Hvingelby, Paulina
Sierak, Tobias Schön, Bishwajit Gosswami,
Towards Reducing Data Acquisition and Labeling
for Defect Detection using Simulated Data, 2024.
arXiv:2406.19175v1 [cs.LG]
11.
Zhong Hong, Exploring self
‑
supervised learning:
training without labeled data, Medium, 2023. URL:
12.
AI Stack, How Much GPU Resources Are Required
for AI Development and ML Model Training, 2024.
URL:
https://ai-stack.ai/en/ai-model-training-gpu-
13.
Momina Liaqat Ali, Zhou Zhang, The YOLO
Framework: A Comprehensive Review of
Evolution, Applications, and Benchmarks in Object
Detection // Computers, 2024, 13(12), 336. DOI:
