ML/AI production and research
I am working on multimodal content understanding, with a heavy focus on audio understanding models. In my research I mainly use vision-transformer-based models together with tokenization/quantization methods to help content understanding models represent audio embeddings better. My work includes designing pipelines for model pretraining in an audio-only setting, evaluating models on audio classification tasks, and fusing audio into prebuilt vision-and-language pipelines.
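As a rough illustration of the tokenization/quantization idea mentioned above (a minimal sketch, not the actual research code; the codebook size, embedding dimension, and random data are all assumptions), continuous audio frame embeddings can be mapped to discrete tokens by nearest-neighbor lookup against a fixed codebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 256 codebook entries, 64-dim embeddings.
codebook = rng.normal(size=(256, 64))   # learned codebook (random here)
frames = rng.normal(size=(100, 64))     # 100 audio frame embeddings

# Assign each frame to its nearest codebook entry: one discrete token id
# per frame, computed from squared Euclidean distances.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # shape (100,), ints in [0, 256)

# Looking the token ids back up in the codebook yields the quantized
# embeddings a downstream model would consume.
quantized = codebook[tokens]            # shape (100, 64)
```

The discrete token ids are what let an audio stream plug into transformer pipelines that already operate on token sequences.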
My audio work achieved strong performance, which led to my internship being extended into a part-time researcher role. I am working towards publishing these results.