MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention

Bibliographic Details
Main Authors: Liliek Triyono, Rahmat Gernowo, Prayitno
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-02-01
Series: Frontiers in Computer Science
Subjects: indoor navigation, computer vision, markers, assistive technology, mobile devices
Online Access: https://www.frontiersin.org/articles/10.3389/fcomp.2025.1510252/full
author Liliek Triyono
Rahmat Gernowo
Prayitno
collection DOAJ
description Aruco markers are crucial for navigation in complex indoor environments, especially for those with visual impairments. Traditional CNNs handle image segmentation well, but transformers excel at capturing long-range dependencies, essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model combining CNNs and MobileViT in a dual-path encoder to optimize global and spatial image details. This design reduces complexity and boosts segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting Aruco markers, making it a promising tool to improve navigation aids for the visually impaired.
format Article
id doaj-art-29131a43a9a44b1182b3dee1f504667b
institution Kabale University
issn 2624-9898
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Computer Science
spelling doaj-art-29131a43a9a44b1182b3dee1f504667b 2025-02-10T14:10:02Z
Volume 7, article 1510252, DOI 10.3389/fcomp.2025.1510252
Liliek Triyono: Doctoral Program of Information System, Diponegoro University, Central Java, Semarang, Indonesia; Department of Electrical Engineering, Politeknik Negeri Semarang, Semarang, Indonesia
Rahmat Gernowo: Doctoral Program of Information System, Diponegoro University, Central Java, Semarang, Indonesia
Prayitno: Department of Electrical Engineering, Politeknik Negeri Semarang, Semarang, Indonesia
title MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention
topic indoor navigation
computer vision
markers
assistive technology
mobile devices