Why I'm excited about Sora and Image Generation Models
AI Image and Video Generation models like DALL.E and Sora boost efficiency and productivity
I recently read this post on
(would recommend for AI tips & tricks) in which Navin talks about why he doesn’t find DALL.E/Midjourney useful for actual business/work use cases and isn’t excited about OpenAI’s video generation model, Sora, for similar reasons. Contrary to Navin’s viewpoint, my experience with these tools has been pretty transformative. At work, we have been using AI image and video tools to streamline our workflow, with notable time and cost savings. I initially intended to leave a brief comment on Navin’s post, but ultimately ended up with this full blog post.
Let me start with how we are using image generation tools at work (with links to the specific tools), and then share some thoughts on where I see this going.
My company, XPR POS, offers Self-Ordering Solutions for restaurants, i.e. Kiosks and Mobile Apps that help their customers skip the line and order food. While a lot of work goes into selecting the right hardware and designing efficient and reliable software, what truly impresses clients is the graphics and presentation on the screen. We end up putting a significant amount of work into presenting the food images and creating banners, videos, and GIFs for a great overall experience.
Here is a list of problems that AI tools are helping us solve -
Background removal - A lot of the food images we get from clients have unwanted background objects that need to be removed, e.g. cropping out a burger from a messy background. This used to be hard and required a lot of manual work by a graphic designer. Most of the time they would end up editing the image pixel by pixel, and it would still result in rough edges. This was one of the first and biggest problems AI solved for us. Since last year, we have been using an AI-based tool called Photo Room. We can upload a set of images in bulk and it outputs them with the backgrounds cleanly removed. Many hours of work are now reduced to a few minutes (a rough sketch of this bulk workflow follows this list).
Stock images - For most of our demos, and for clients that haven’t done food photography of their menu, we end up using stock images. Finding good license-free images is time consuming, or you pay for images, which increases cost. Now we are using Clip Drop - Stable Diffusion XL, which has an API that can be used for bulk image generation. We are also exploring DALL.E, which has an API (a small generation sketch follows this list). Midjourney produces good quality but doesn’t have an API yet.
Upscaling images - Sometimes the food images provided by the client are low-resolution and look pixelated on the large kiosk screens. Until now, the only option was to re-shoot the images, which is expensive. However, there are now some great AI tools for upscaling. We are currently evaluating Clip Drop - Image Upscaler and Magnific AI. (Be warned, though, that the upscalers sometimes re-imagine a completely different image, so QC the output; a simple QC check is sketched after this list.)
Animations & Promotions - Sometimes subtle animations help highlight a product on the Kiosk, for example a cup of coffee that actually shows steam rising from it. This requires hours of work by a graphic designer. Tools by RunwayML and Pika make this much easier. You will still need a graphic designer (for now), but they will spend far less time building the animations frame by frame.
Videos - We create a ton of videos that play on the home screen when the Kiosk is idle. It is generally a slideshow of promotional items animating and zooming in various ways, with text interspersed in between. Again, this is time consuming and requires expensive video editing software. This is one problem I am hoping Sora will solve for us - it could be a huge time and cost saver. Note that the demo page only shows text-to-video, but check the Technical Report for examples of prompting with images.
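For readers who want to wire this up themselves, here is a minimal Python sketch of the bulk background-removal workflow, assuming a PhotoRoom-style HTTP API. The endpoint URL, header name, and form field are assumptions rather than details from this post, so verify them against the provider’s current documentation.

```python
# Hypothetical sketch: bulk background removal through an HTTP API.
# The endpoint, header name, and form field below are assumptions modelled on
# PhotoRoom-style APIs -- check the provider's current docs before relying on them.
from pathlib import Path
import requests

API_KEY = "your-api-key"                            # assumption: key-based auth
ENDPOINT = "https://sdk.photoroom.com/v1/segment"   # assumption: verify in the docs

def remove_background(image_path: Path, out_dir: Path) -> Path:
    """Upload one image and save the background-removed PNG."""
    with image_path.open("rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={"x-api-key": API_KEY},
            files={"image_file": f},
            timeout=60,
        )
    resp.raise_for_status()
    out_path = out_dir / f"{image_path.stem}_cutout.png"
    out_path.write_bytes(resp.content)
    return out_path

if __name__ == "__main__":
    out_dir = Path("cutouts")
    out_dir.mkdir(exist_ok=True)
    for img in Path("menu_photos").glob("*.jpg"):   # bulk: loop over a folder of photos
        print("processed:", remove_background(img, out_dir))
```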
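Here is a similar sketch for the stock-image case, using OpenAI’s Python SDK to generate food photos in bulk with DALL.E. The menu items and prompt wording are purely illustrative, and it assumes an OPENAI_API_KEY is set in the environment; Clip Drop’s Stable Diffusion XL API could be swapped in the same way.

```python
# Minimal sketch: generating "stock" food images in bulk with the OpenAI Python SDK.
# Menu items and prompt wording are illustrative; expects OPENAI_API_KEY in the environment.
import requests
from openai import OpenAI

client = OpenAI()

MENU_ITEMS = ["classic cheeseburger", "margherita pizza", "iced caramel latte"]

for item in MENU_ITEMS:
    result = client.images.generate(
        model="dall-e-3",
        prompt=f"Professional food photograph of a {item}, plain white background, "
               f"soft studio lighting, appetizing, for a restaurant kiosk menu",
        size="1024x1024",
        n=1,
    )
    url = result.data[0].url                      # DALL.E 3 returns a temporary image URL
    image_bytes = requests.get(url, timeout=60).content
    filename = item.replace(" ", "_") + ".png"
    with open(filename, "wb") as f:
        f.write(image_bytes)
    print("saved:", filename)
```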
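And here is one simple way to automate the QC step mentioned for upscaling: downscale the upscaled result back to the original resolution and flag anything that drifts too far from the source. The 0.12 threshold is an arbitrary starting point, not a recommendation from this post.

```python
# Sketch of a QC check for upscalers that "re-imagine" the image: downscale the
# result back to the original size and measure the mean pixel difference.
from PIL import Image
import numpy as np

def upscale_drifted(original_path: str, upscaled_path: str, threshold: float = 0.12) -> bool:
    """Return True if the upscaled image differs too much from the original."""
    original = Image.open(original_path).convert("RGB")
    upscaled = Image.open(upscaled_path).convert("RGB")

    # Bring the upscaled image back down to the original resolution for comparison.
    downscaled = upscaled.resize(original.size, Image.LANCZOS)

    a = np.asarray(original, dtype=np.float32) / 255.0
    b = np.asarray(downscaled, dtype=np.float32) / 255.0

    mean_abs_diff = float(np.abs(a - b).mean())
    return mean_abs_diff > threshold

if __name__ == "__main__":
    if upscale_drifted("burger_lowres.jpg", "burger_upscaled.png"):
        print("Flag for manual review: the upscaler may have changed the image.")
    else:
        print("Looks consistent with the original.")
```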
Here are a few limitations of these AI tools -
As with all AI tools, whether text or image models, the quality is not always reliable. There will always be a small percentage of data that they do not process correctly and that will have to be manually fixed.
Unlike text models like GPT, where the output is editable (source code, blog posts, etc.), with image models your only option is to re-run the prompt several times and hope it produces the desired output (a tiny sketch of this re-run-and-review loop follows below).
This is why you still need a graphic designer to verify the output and fix the things the AI did not process correctly. But where you earlier needed full-time folks for graphics work, you now only need them part-time.
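To make that re-run-and-review loop concrete, here is a tiny sketch that generates several candidates from the same prompt and saves them all for a designer to pick from. `generate_image` is a placeholder for whichever image API you use, not a real library function.

```python
# Sketch of the "re-run and pick" workflow: since image output isn't editable like
# text, generate several candidates and let a designer choose the best one.
from typing import Callable, List

def generate_candidates(generate_image: Callable[[str], bytes],
                        prompt: str, attempts: int = 4) -> List[str]:
    """Run the same prompt several times and save every candidate for review."""
    saved = []
    for i in range(attempts):
        image_bytes = generate_image(prompt)      # re-run the same prompt
        path = f"candidate_{i + 1}.png"
        with open(path, "wb") as f:
            f.write(image_bytes)
        saved.append(path)
    return saved                                  # a designer reviews and picks one
```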
Looking ahead, I can see a lot of disruption happening in online marketing and advertising. A few examples -
Logo Design - GPT-4 + DALL.E is ideal for brainstorming during the design process and generating logo concepts.
Stock Images & Videos - Stock images and videos are used in numerous ads. Stock images can now be generated cheaply using image models, and I can easily see videos like this being used without any expensive cameras or drone footage.
Online Retail - Clothing brands can now do one photoshoot with their model and automatically present every new collection on the same model without having to reshoot each new garment, like this. They could even use an AI-generated model rather than a real person, like this. We will soon have models walking around in those garments with something like Alibaba’s Animate Anyone. Any retail brand - perfume, footwear, etc. - can showcase its products with professional backgrounds and settings using this. Until recently, this was out of reach for small businesses that couldn’t afford expensive photoshoots.
Even longer term, we may see AI generate full-length video advertisements, animated shorts, or even movies, but it’s hard to predict how long that will take. Actually, it might not be that long, considering that the “Will Smith eating spaghetti” AI video was just about a year ago, and look at where we are now with Sora!
Kaustubh, your article is very insightful. More than just being fascinated by AI, identifying its utility in day-to-day work is critical. The blog beautifully describes this for your sector in detail. Now I’m just looking forward to the release of Sora to a wider audience.
I guess I should have qualified my statement: I don't find DALL.E/Midjourney useful for actual business/work use cases (unless your business itself is production of images/videos and related creatives).