An IoT-enhanced automatic music composition system integrating audio-visual learning with transformer and SketchVAE