Communication barriers between the deaf and hearing communities often limit inclusivity in education, healthcare, and daily interactions. This research presents VOCAL CODE, a vision-based deep learning framework that translates American Sign Language (ASL) gestures into text and speech in real time. Using a standard webcam for gesture capture and a Convolutional Neural Network (CNN) for classification, the system eliminates the need for costly sensors or wearable devices. An integrated auto-correction module improves text accuracy, while a text-to-speech engine provides auditory feedback for seamless interaction. In experimental evaluation, the model achieved an accuracy of approximately 98%, demonstrating the model's robustness and adaptability across varied conditions. The proposed solution offers a cost-effective, scalable, and user-friendly approach to assistive communication, promoting accessibility and inclusivity for individuals with hearing impairments. Future work aims to extend the system to recognize full-word gestures, incorporate multilingual capabilities, and optimize real-time performance on mobile and web platforms.
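
To make the described pipeline concrete, the sketch below shows one plausible minimal implementation of the webcam-to-speech loop: capture a frame, classify a cropped region with a CNN, and speak accumulated letters aloud. The paper does not publish its code, so the libraries (OpenCV, Keras/TensorFlow, pyttsx3), the model file name `asl_cnn.h5`, the 26-letter label set, the region-of-interest coordinates, and the confidence threshold are all illustrative assumptions; the auto-correction module is omitted for brevity.

```python
import cv2
import numpy as np
import pyttsx3
from tensorflow.keras.models import load_model

# Hypothetical artifacts: the paper does not release its trained model or label set.
MODEL_PATH = "asl_cnn.h5"                                  # assumed Keras-format CNN
LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]   # assumed 26 static letters


def main():
    model = load_model(MODEL_PATH)
    tts = pyttsx3.init()
    cap = cv2.VideoCapture(0)  # standard webcam, as in the paper
    sentence = []
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Assumed preprocessing: crop a fixed region of interest, resize to the
            # CNN's input size, and normalize pixel values to [0, 1].
            roi = frame[100:300, 100:300]
            x = cv2.resize(roi, (64, 64)).astype("float32") / 255.0
            probs = model.predict(x[np.newaxis, ...], verbose=0)[0]
            idx = int(np.argmax(probs))
            if probs[idx] > 0.9:  # confidence threshold to suppress noisy frames
                cv2.putText(frame, LABELS[idx], (100, 90),
                            cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
            cv2.rectangle(frame, (100, 100), (300, 300), (255, 0, 0), 2)
            cv2.imshow("VOCAL CODE (sketch)", frame)
            key = cv2.waitKey(1) & 0xFF
            if key == ord(" "):      # space: accept the current letter
                sentence.append(LABELS[idx])
            elif key == ord("s"):    # s: speak the accumulated text aloud
                tts.say("".join(sentence))
                tts.runAndWait()
            elif key == ord("q"):    # q: quit
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
```

Running the classifier on every frame but committing a letter only on an explicit keypress (and above a confidence threshold) is one simple way to keep the interaction real-time while filtering out transient misclassifications between hand poses.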