Hi, my name is
Videet Mehta.
I'm a student at MIT studying Computer Science. I'm passionate about frontier AI research and its applications in the real world.
About Me
Hi! I'm Videet Mehta, a Computer Science student at MIT passionate about pushing the boundaries of AI. My interests focus on optimizing large language models and understanding their reasoning capabilities. In the future, I want to work in either applied AI or core AI research.
Fast-forward to today, and I've had the privilege of working at a commodities trading firm, an AI industry research lab, a gaming start-up, and an MIT NLP lab.
Additionally, I'm proud to have represented the USA at the International Olympiad in Artificial Intelligence and to have won a gold medal!
Here are a few technologies I've been working with recently:
- Python
- PyTorch
- JAX
- SQL
- Node.js
- React

Where I've Worked
- Developed a multivariate ridge-regularized model to forecast hyperscale data-center load, isolating demand anomalies across industrial, residential, and weather variables and accurately modeling ~60% of future data-center load (see the sketch after this list).
- Implemented GPU optimizations by writing custom CUDA kernels to load large weather foundation models onto limited GPU compute, cutting fine-tuning costs by 50%.
- Designed a next-gen AI architecture combining graph and sequence modeling to capture complex power-grid dynamics and predict electricity locational marginal prices (LMPs), outperforming existing industry benchmarks by nearly 25% on MAE/RMSE metrics.
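To make the first bullet concrete, here is a minimal, hypothetical sketch of a ridge-regularized multivariate forecast in scikit-learn; the feature names, synthetic data, and hyperparameters are illustrative assumptions, not the actual production pipeline.

```python
# Minimal sketch (not the production model): ridge-regularized multivariate
# regression for data-center load forecasting. Features and data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical exogenous drivers: industrial load, residential load, temperature.
X = np.column_stack([
    rng.normal(50, 10, n),   # industrial demand (MW)
    rng.normal(30, 5, n),    # residential demand (MW)
    rng.normal(20, 8, n),    # temperature (°C)
])
# Synthetic target: future data-center load (MW) with noise standing in for anomalies.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 5, n)

model = Ridge(alpha=10.0)  # L2 regularization keeps correlated drivers stable
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print("walk-forward R^2 per fold:", scores.round(3))
```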
Some Things I've Built
Featured Project
Adaptive Splash Attention CUDA Kernel
Inspired by Goncalves et al.'s paper on Adaptive Sparse Attention and their Triton implementation, I decided to implement the kernel in CUDA.
This project implements Adaptive Sparse Attention in CUDA, which outperforms Flash Attention at longer sequence lengths by dynamically learning sparsity patterns. The α-entmax normalization enables higher throughput on sparse sequences while maintaining accuracy comparable to dense attention.
We achieved up to 99.6% sparsity while keeping outputs within 1e-5 of vanilla self-attention. The current implementation gives up significant speed to reach that accuracy, and I'm working on recovering it by trading additional memory for speed.
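To illustrate the core idea (without the CUDA fusion), here is a plain PyTorch sketch that swaps softmax for sparsemax, the α=2 special case of α-entmax, inside scaled dot-product attention; the shapes and the sparsemax routine are illustrative, not the project's kernel code.

```python
import math
import torch

def sparsemax(z, dim=-1):
    # Sparsemax (the alpha=2 case of alpha-entmax): projects scores onto the
    # probability simplex, producing exact zeros for low-scoring entries.
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    cumsum = z_sorted.cumsum(dim)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = 1 + k * z_sorted > cumsum                  # entries kept in the support
    k_z = support.sum(dim=dim, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(dim, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0)

def sparse_attention(q, k, v):
    # Scaled dot-product attention with softmax swapped for sparsemax, so many
    # attention weights are exactly zero (the sparsity a sparse kernel exploits).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return sparsemax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 128, 64)   # (batch, heads, seq, head_dim) -- illustrative
out = sparse_attention(q, k, v)
weights = sparsemax(q @ k.transpose(-2, -1) / math.sqrt(64), dim=-1)
print("fraction of exactly-zero attention weights:", (weights == 0).float().mean().item())
```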
- PyTorch
- CUDA
- GPU Profiling
- C++
Featured Project
Reasoning in Diffusion Language Models
I wanted to learn about diffusion language models and how their internal reasoning mechanisms emerge during the diffusion process. I developed mutual information estimation frameworks using multiple estimators (binning, KSG) to track information accumulation patterns and discrete "eureka moments" where models gain sudden insight into the correct answer.
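As an illustration of the binning estimator mentioned above, here is a small, hypothetical sketch; the variable names and toy data are assumptions, not the project's evaluation code.

```python
# Hypothetical sketch of a binning-based mutual information estimator of the kind
# used to track information accumulation across diffusion steps.
import numpy as np

def binned_mutual_information(x, y, bins=32):
    # Discretize both variables, estimate the joint distribution with a 2-D
    # histogram, and compute I(X;Y) = sum p(x,y) * log( p(x,y) / (p(x) p(y)) ).
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Toy usage: correlated variables should show higher MI than independent ones.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print("correlated:", binned_mutual_information(x, x + 0.1 * rng.normal(size=x.size)))
print("independent:", binned_mutual_information(x, rng.normal(size=x.size)))
```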
Implemented Feature-wise Linear Modulation (FiLM) adapters for latent reasoning injection, enabling fine-grained control over diffusion trajectories by conditioning on reasoning representations at each layer.
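A minimal sketch of what a FiLM adapter looks like, assuming a per-layer conditioning vector of reasoning features; the dimensions and placement are illustrative, not the project's actual modules.

```python
# Minimal FiLM adapter sketch: a conditioning vector (e.g., a latent reasoning
# representation) produces per-channel scale (gamma) and shift (beta) that
# modulate a layer's hidden states. Dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class FiLMAdapter(nn.Module):
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * hidden + beta.unsqueeze(1)

adapter = FiLMAdapter(cond_dim=128, hidden_dim=768)
hidden = torch.randn(4, 64, 768)        # hidden states at one diffusion layer
reasoning = torch.randn(4, 128)         # injected reasoning representation
modulated = adapter(hidden, reasoning)  # same shape as `hidden`
print(modulated.shape)
```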
- PyTorch
- Distributed Training
- LLM Reasoning
- SLURM
Featured Project
Cipher ML
Cipher eliminates traditional barriers to machine learning development, such as the need for deep technical expertise, complex coding, and manual algorithm selection, by providing a natural language interface where users can simply describe what they want to predict in plain English. The platform leverages GPT-4 to analyze users' data and recommend optimal algorithms, while offering an automated pipeline that handles the complete ML workflow from data upload to deployment. It provides intelligent explanations with AI-generated insights about models and predictions, and enables one-click deployment with instant Docker containerization for production use.
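As a rough illustration of the idea, not Cipher's actual code, here is a hypothetical sketch that asks GPT-4 to choose a scikit-learn estimator from a plain-English task description and then fits it; the prompt, candidate list, and dataset are all assumptions.

```python
# Hypothetical sketch of Cipher's core loop, not its implementation:
# map a plain-English prediction goal to a scikit-learn estimator via an LLM,
# then fit it in an automated pipeline.
from openai import OpenAI
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

CANDIDATES = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(),
}

def recommend_model(task_description: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Task: {task_description}\n"
                       f"Answer with exactly one of: {', '.join(CANDIDATES)}.",
        }],
    )
    choice = reply.choices[0].message.content.strip().lower()
    return choice if choice in CANDIDATES else "logistic_regression"

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = CANDIDATES[recommend_model("Predict the flower species from four measurements")]
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```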
- Python
- Node
- scikit-learn
- GPT API
Featured Project
Parameter-Efficient Fine-Tuning in Audio-Visual Language Models
This project explores Parameter-Efficient Fine-Tuning (PEFT) strategies for mWhisper-Flamingo, a state-of-the-art multilingual audio-visual speech recognition model. With 3B parameters, the full model requires significant computing resources to fine-tune. We demonstrate that using LoRA with rank-16 and tuning only Query & Value projections recovers over 85% of noisy-speech performance while training 700x fewer parameters.
Our evaluation compares PEFT techniques like Linear Adapters, LoRA, and Soft Prompting, showing that LoRA-16 QV achieves the optimal balance between parameter efficiency and Word Error Rate performance. This makes fine-tuning large multimodal models more accessible to researchers with limited resources.
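For readers unfamiliar with the setup, here is a minimal PyTorch sketch of rank-16 LoRA applied only to Query and Value projections; the module names and dimensions are illustrative rather than mWhisper-Flamingo's actual layers.

```python
# Minimal LoRA sketch (rank 16) on attention Q and V projections, as described
# above. The base weights stay frozen; only the low-rank factors train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the trainable low-rank update B @ A.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Toy setup: wrap only the Q and V projections, as in the LoRA-16 QV configuration.
d_model = 1024  # illustrative hidden size
q_proj = LoRALinear(nn.Linear(d_model, d_model), rank=16)
v_proj = LoRALinear(nn.Linear(d_model, d_model), rank=16)

params = [*q_proj.parameters(), *v_proj.parameters()]
trainable = sum(p.numel() for p in params if p.requires_grad)
total = sum(p.numel() for p in params)
print(f"trainable params: {trainable:,} of {total:,}")
```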
- PyTorch
- Python
- Distributed Training
- OpenAI Whisper
Recent Posts
View All Posts →
- Proximal Policy Optimization (PPO): Learning foundational and popular reinforcement learning algorithms
- Understanding Adaptive Sparse Flash Attention: Attempting to beat traditional Flash Attention with sparse-attention optimizations
- Gaussian Mixture Models: Understanding Gaussian Mixture Models, a statistical machine learning model
What's Next?
Get In Touch
I'm always open to new opportunities and connecting with other students and professionals. Whether you have a question or just want to say hi, feel free to reach out!
Say Hello