LIGHTWEIGHT HYBRID U-NET WITH VISION TRANSFORMER BLOCKS FOR CROSS-DOMAIN MEDICAL IMAGE SEGMENTATION

Inanemoh Jossy; Jubril Abu Al-Amin; Isah Mohammed Monday

DJCCMT

Delta Journal of Computing, Communications and Media Technologies

ISSN: 3092-8478

Advancing research and innovation at the intersection of computing technology and media. A publication of Southern Delta University, Ozoro.

Delta Journal of Computing, Communications and Media Technologies (DJCCMT) is an open-access, double-blind peer-reviewed journal that brings together reasoned thought, research, and industry practice in Computing, Artificial Intelligence, Robotics, System Engineering, Data Science, Analytics, Embedded Systems, Information and System Security, Media Studies, Communication Technologies, Information Science, Library Science, Educational Technologies, Applied Computing, and related disciplines in a reader-friendly format. The journal is published online monthly, with print issues in February, May, August, and November.

Volume 2 · Issue 1 · July 2025

Abstract

This study proposes a lightweight hybrid model that integrates U-Net with Vision Transformer (ViT) blocks for accurate and efficient segmentation across two medical imaging domains: cardiac MRI and breast ultrasound. The model pairs a compact U-Net backbone with lightweight ViT modules inspired by MobileViT and is designed for deployment on resource-constrained platforms such as Google Colab. It was trained and evaluated on two public datasets: the ACDC cardiac MRI dataset, for segmenting the left ventricle (LV), right ventricle (RV), and myocardium, and the BUSI breast ultrasound dataset, for segmenting benign and malignant lesions. Performance was benchmarked against U-Net, Attention U-Net, and TransUNet using the Dice coefficient. The proposed hybrid model achieves segmentation accuracy comparable to TransUNet (Dice ≈ 0.92 on ACDC and ≈ 0.85 on BUSI) while reducing parameter count by 40% and VRAM usage by approximately 35%. The model also generalizes well across domains, with only a 3% reduction in Dice score when fine-tuned across domains, compared with a degradation of up to 7% in the baseline models. These findings indicate that the proposed lightweight U-Net–ViT hybrid offers an effective balance between accuracy, efficiency, and adaptability, making it well suited to low-resource medical imaging applications.
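Because every comparison in the abstract is reported as a Dice coefficient, a minimal sketch of that metric may help readers interpret the figures. It is an illustration only, not the authors' evaluation code: the function names `dice_score` and `mean_dice`, the flattened integer label maps, the label numbering (1 = RV, 2 = myocardium, 3 = LV), and the convention of scoring 1.0 when a label is absent from both masks are all assumptions made here.

```python
from typing import Dict, Sequence, Tuple


def dice_score(pred: Sequence[int], target: Sequence[int], label: int = 1) -> float:
    """Dice = 2|P n T| / (|P| + |T|) for one label over flattened masks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of pixels")
    p = [v == label for v in pred]      # predicted pixels carrying this label
    t = [v == label for v in target]    # ground-truth pixels carrying this label
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    if denom == 0:
        # Assumed convention: label absent from both masks counts as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


def mean_dice(pred: Sequence[int], target: Sequence[int],
              labels: Sequence[int]) -> Tuple[Dict[int, float], float]:
    """Per-label Dice scores and their unweighted mean (background excluded by the caller)."""
    scores = {lab: dice_score(pred, target, lab) for lab in labels}
    return scores, sum(scores.values()) / len(scores)


# Toy 8-pixel example with hypothetical labels: 0 = background, 1 = RV, 2 = myocardium, 3 = LV.
pred = [0, 1, 1, 2, 2, 3, 3, 0]
target = [0, 1, 1, 2, 3, 3, 3, 0]
per_label, average = mean_dice(pred, target, labels=[1, 2, 3])
print(per_label, round(average, 4))
```

In this toy case label 1 matches exactly, label 2 overlaps on one of three labeled pixels counted across both masks, and label 3 overlaps on two of five, so the per-label scores differ even though only one pixel is mislabeled. A binary task such as lesion segmentation would use the same function with a single foreground label.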

Keywords

Image Segmentation, Vision Transformer, Lightweight, Medical Imaging
