

This job is no longer accepting applications
Founding AI Engineer (Video x Multimodal Models) at AuraFarming
Location: Remote (India only)
Type: Full-time, Founding Team
About Us
We're building AuraFarming, the Cursor for video creation.
A next-generation AI video platform where users can upload any video, describe what changes they want, and our system will automatically recreate or clone it with new visuals, new products, and new music.
Think Sora + Veo + Gemini + Runway combined into one integrated creative IDE.
Our goal is to build the unified layer between video understanding and video generation.
The Role
We're hiring a Founding AI Engineer to lead our multimodal intelligence stack.
You'll design the pipeline that lets our platform watch, understand, and recreate videos.
Your work will define the heart of the system: transforming raw media into structured scene representations and turning them into generative prompts for top-tier models like Sora, Veo, and Runway.
This is a zero-to-one role: build, fine-tune, and iterate. You'll be working directly with the founders and the founding full-stack engineer.
Your Mission
Design the video understanding engine: detect scenes, subjects, motion, music, voice, and text.
Build the prompt compiler that converts user edits into model-ready JSON instructions.
Integrate Gemini 1.5 Pro, GPT-4o, and LLaVA-Video for multimodal reasoning.
Connect with video generation APIs: Sora, Veo, Runway, Higgsfield, WAN 2.5.
Prototype video-to-video delta generation (upload + describe, then regenerate).
Collaborate on backend integration and optimization for latency and cost.
Own R&D for diffusion-based and transformer-based video models (AnimateDiff, I2VGen-XL, VideoCrafter2).
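To make the "prompt compiler" step concrete, here is a minimal sketch of how user edits might be compiled into model-ready JSON instructions. The schema, function name, and fields are illustrative assumptions, not the company's actual format:

```python
import json

def compile_edit(scene_id, user_request, keep, replace):
    """Compile a user edit into a structured JSON instruction.

    `keep` lists scene elements to preserve; `replace` maps original
    elements to their requested substitutes. The schema is a hypothetical
    example of a model-ready edit command.
    """
    instruction = {
        "scene": scene_id,
        "request": user_request,
        "constraints": {"preserve": keep},
        "edits": [{"target": k, "replace_with": v} for k, v in replace.items()],
    }
    return json.dumps(instruction, indent=2)

# Example: swap one product while keeping camera motion and music intact.
spec = compile_edit(
    "scene_01",
    "swap the soda can for our energy drink",
    keep=["camera motion", "background music"],
    replace={"soda can": "energy drink"},
)
```

A structured intermediate like this is what lets the same user edit be re-targeted to different generation backends (Sora, Veo, Runway) without re-parsing free-form text.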
Core Competencies
Model Integration
Experience with multimodal APIs (Gemini, GPT-4o, Claude, LLaVA).
Knowledge of diffusion pipelines (AnimateDiff, Stable Video Diffusion).
Ability to call and orchestrate video generation endpoints (Sora, Veo, Runway).
Computer Vision / Audio
Familiar with ffmpeg, frame extraction, CLIP embeddings, Whisper transcription.
Understands temporal modeling and scene segmentation.
Experience with image/video captioning, visual grounding, or action recognition.
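As a small example of the frame-extraction work described above, here is a sketch that builds an ffmpeg command for sampling frames at a fixed rate (for downstream CLIP embedding or captioning). The helper name and paths are illustrative; running it requires ffmpeg on PATH:

```python
import subprocess

def ffmpeg_frame_cmd(video, out_dir, fps=1):
    """Build the ffmpeg argv that samples `fps` frames per second
    from `video` into numbered JPEGs under `out_dir`."""
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.jpg"]

def extract_frames(video, out_dir, fps=1):
    """Run the sampling command; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_frame_cmd(video, out_dir, fps), check=True)

cmd = ffmpeg_frame_cmd("clip.mp4", "frames", fps=2)
```

Sampled frames can then be embedded with CLIP for scene segmentation, while the audio track goes to Whisper for transcription.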
Prompt Engineering & Reasoning
Design structured prompt schemas for multimodal models.
Ability to parse user deltas into JSON commands.
Experience fine-tuning or prompting LLMs for structured output.
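Parsing user deltas into JSON commands also means validating what the model returns. A minimal sketch, assuming a hypothetical three-action schema (the prompt text and field names are illustrative):

```python
import json

SYSTEM_PROMPT = (
    "You are a video-edit compiler. Reply with ONLY a JSON object:\n"
    '{"action": "<replace|remove|restyle>", "target": "<scene element>", '
    '"value": "<new content>"}'
)

def parse_delta(raw_reply):
    """Parse a model reply into an edit command, rejecting malformed output.

    Raises ValueError when the reply does not match the expected schema.
    """
    cmd = json.loads(raw_reply)
    if cmd.get("action") not in {"replace", "remove", "restyle"}:
        raise ValueError(f"unknown action: {cmd.get('action')}")
    if "target" not in cmd:
        raise ValueError("missing target")
    return cmd

delta = parse_delta('{"action": "replace", "target": "music", "value": "lo-fi"}')
```

Validating at this boundary keeps a single malformed model reply from propagating bad instructions into the generation pipeline.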
Programming Stack
Python, PyTorch, FastAPI, Celery, ffmpeg, PostgreSQL.
Working familiarity with OpenAI, Google AI, Replicate, or RunPod APIs.
Mindset
Thinks in systems: how to turn raw media into data, not demos.
Ships fast and iterates with founders.
Wants to invent the next layer of creative AI, not just use existing APIs.
Target Tech Stack
Core Models: Gemini 1.5 Pro, GPT-4o, LLaVA-Video, AnimateDiff, Veo, Sora
Frameworks: PyTorch, FastAPI, Celery, ffmpeg
Infrastructure: RunPod, Modal, AWS, R2 Storage
Integrations: ElevenLabs (voice), Suno/Mubert (music), Stripe (credits)
Why Join AuraFarming
Ground floor founding position with deep product ownership.
Direct involvement in cutting-edge multimodal video systems.
Work with founders shipping real generative products, not research demos.
Equity upside and full creative control over AI direction.
Fast execution culture: idea to prototype in days, not months.
Compensation
Competitive salary (India benchmark)
₹6L-12L per year
Founder-level equity allocation
Performance-linked upside
Who You Are
2-4+ years experience in AI/ML or computer vision
1+ years working with multimodal or diffusion models
Strong Python + PyTorch background
Experience shipping production-grade model pipelines
Self-sufficient, execution-first, comfortable building fast
How to Apply
Send your GitHub, LinkedIn, and a short note with your application.
Apply here: https://tally.so/r/w89RjY
Job ID: 128872073