
SD – News 24

Differential Diffusion: Giving Each Pixel Its Strength

Diffusion models have revolutionized image generation and editing, producing state-of-the-art results in conditioned and unconditioned image synthesis. While current techniques enable user control over the degree of change in an image edit, the controllability is limited to global changes over an entire edited region. This paper introduces a novel framework that enables customization of the amount of change per pixel or per image region.
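The core idea — a per-pixel strength map that controls how much each region is allowed to change — can be illustrated with a simple pixel-wise blend. This is only a hypothetical sketch; the paper's actual method applies the per-pixel strengths inside the diffusion process across timesteps, not as a post-hoc blend:

```python
def blend_with_strength(original, edited, strength):
    """Blend two images pixel-wise: strength[i] in [0, 1] sets how much
    each pixel may change (0 = keep original, 1 = fully edited).
    Images are flat lists of floats here for simplicity."""
    return [o * (1.0 - s) + e * s
            for o, e, s in zip(original, edited, strength)]

# A 4-pixel example: the last two pixels are allowed to change fully.
out = blend_with_strength([0.0, 0.0, 0.0, 0.0],
                          [1.0, 1.0, 1.0, 1.0],
                          [0.0, 0.5, 1.0, 1.0])
# out == [0.0, 0.5, 1.0, 1.0]
```

The strength map here plays the role of the per-region change budget the framework exposes to the user.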




Proteus serves as a sophisticated enhancement over OpenDalleV1.1, leveraging its core functionalities to deliver superior outcomes. Key areas of advancement include heightened responsiveness to prompts and augmented creative capacities. To achieve this, it was fine-tuned using approximately 220,000 GPTV-captioned images from copyright-free stock images (with some anime included), which were then normalized. Additionally, DPO (Direct Preference Optimization) was employed through a collection of 10,000 carefully selected high-quality, AI-generated image pairs.


Filmgrainer is an image processing algorithm that adds noise to an image resembling photographic film grain. It's implemented in Python and runs as a command-line utility on Linux platforms, installable with pip.
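As a rough illustration of the idea only (not Filmgrainer's actual algorithm, which models grain more carefully), grain-like noise can be approximated by perturbing each pixel with Gaussian noise and clamping:

```python
import random

def add_grain(pixels, strength=0.08, seed=None):
    """Add Gaussian noise resembling film grain to a flat list of
    pixel intensities in [0, 1], clamping the result to that range."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, strength)))
            for p in pixels]

# Lightly grain a 3-pixel strip; seed fixed for reproducibility.
grainy = add_grain([0.2, 0.5, 0.8], strength=0.05, seed=42)
```

Real film grain is intensity-dependent and spatially correlated, which is what dedicated tools like Filmgrainer handle beyond this naive per-pixel version.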



Unlike common text-to-video models (like OpenAI's Sora), this model is for personalized videos using photos of your friends, family, or pets. By training an embedding with these images, it creates custom videos featuring your loved ones, bringing a unique touch to your memories.




GLIGEN is a novel way to specify the precise location of objects in text-to-image models. I present here an intuitive DARK GUI that makes it significantly easier to use GLIGEN with ComfyUI.


Detect Anything You Want with Grounding DINO

Cosmopedia is a dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.


Google Gemma Models

running on the Hugging Face Inference Client



YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives.



PIVOT: Prompting with Iterative Visual Optimization

The demo below showcases a version of the PIVOT algorithm, which uses iterative visual prompts to optimize and guide the reasoning of Vision-Language Models (VLMs). Given an image and a description of an object or region, PIVOT iteratively searches for the point in the image that best corresponds to the description. This is done through visual prompting, where instead of reasoning with text, the VLM reasons over images annotated with sampled points in order to pick the best points. In each iteration, we take the points previously selected by the VLM, resample new points around their mean, and repeat the process.
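The sample-select-resample loop described above can be sketched minimally as follows. A plain scoring function stands in for the VLM (which in the real system judges points from an annotated image); the shrinking spread and the exact sampling schedule are assumptions for illustration:

```python
import random

def pivot_search(score, center=(0.5, 0.5), spread=0.25,
                 n_samples=8, n_iters=5, top_k=3, seed=0):
    """Iteratively sample candidate points, keep the best-scoring ones,
    and resample around their mean with a shrinking spread.
    `score` stands in for the VLM's judgment of each candidate point."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        points = [(center[0] + rng.gauss(0, spread),
                   center[1] + rng.gauss(0, spread))
                  for _ in range(n_samples)]
        best = sorted(points, key=score, reverse=True)[:top_k]
        center = (sum(p[0] for p in best) / top_k,
                  sum(p[1] for p in best) / top_k)
        spread *= 0.5  # anneal the sampling radius each iteration
    return center

# Toy target: points closer to (0.8, 0.3) score higher.
found = pivot_search(lambda p: -((p[0] - 0.8)**2 + (p[1] - 0.3)**2))
```

The loop is essentially a cross-entropy-style search in which the VLM replaces an explicit objective.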


Introducing Stable Cascade

Today we are releasing Stable Cascade in research preview, a new text-to-image model building upon the Würstchen architecture. This model is being released under a non-commercial license that permits non-commercial use only.


ComfyUI ProPost

A set of custom ComfyUI nodes for performing basic post-processing effects. These effects can help to take the edge off AI imagery and make it feel more natural. We only have five nodes at the moment, but we plan to add more over time.


Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt.



While this node is connected, it will turn your sampler's CFG scale into something else. This method works by rescaling the CFG at each step by evaluating the potential average min/max values, aiming at a desired output intensity (by intensity I mean overall brightness/saturation/sharpness). The base intensity has been arbitrarily chosen by me, and your sampler's CFG scale will make this target vary. I have set the "central" CFG at 8, meaning that at 4 you will aim at half of the desired range while at 16 it will be doubled. This makes it behave somewhat like the usual CFG when you're around the normal values.
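The described mapping from the sampler's CFG scale to a target intensity is linear around the central value of 8. A minimal sketch of just that mapping (the per-step rescaling from estimated min/max values is omitted, and the function name is hypothetical):

```python
CENTRAL_CFG = 8.0  # the "central" CFG value chosen by the author

def intensity_target(cfg_scale, base_intensity=1.0):
    """Map the sampler's CFG scale to a target output intensity:
    at the central CFG the target equals the base intensity;
    at half / double the central CFG it is halved / doubled."""
    return base_intensity * (cfg_scale / CENTRAL_CFG)

# CFG 4 aims at half the range, CFG 16 at double.
halved, normal, doubled = (intensity_target(4.0),
                           intensity_target(8.0),
                           intensity_target(16.0))
```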


YOLO-World + EfficientSAM

This is a demo of zero-shot object detection and instance segmentation using YOLO-World and EfficientSAM.


SDXL-Lightning is a lightning-fast text-to-image generation model. It can generate high-quality 1024px images in a few steps. For more information, please refer to our research paper: SDXL-Lightning: Progressive Adversarial Diffusion Distillation. We open-source the model as part of the research.



We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.



This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.

YOLO-World is pre-trained on large-scale datasets, including detection, grounding, and image-text datasets.
YOLO-World is the next-generation YOLO detector, with a strong open-vocabulary detection capability and grounding ability.
YOLO-World presents a prompt-then-detect paradigm for efficient user-vocabulary inference, which re-parameterizes vocabulary embeddings as parameters into the model and achieves superior inference speed. You can try to export your own detection model without extra training or fine-tuning in our online demo!
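The prompt-then-detect idea is to encode the user's vocabulary once, offline, and reuse the resulting embeddings at inference instead of running a text encoder per image. A toy sketch of the concept (hypothetical names, with a dot product standing in for YOLO-World's actual matching head):

```python
def encode_vocabulary(text_encoder, class_names):
    """Run the text encoder once, offline, for the user's vocabulary.
    The embeddings then become fixed parameters of the detector."""
    return {name: text_encoder(name) for name in class_names}

def detect(region_features, vocab_embeddings):
    """Label each region feature with the class whose precomputed
    embedding matches best (dot-product similarity)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return [max(vocab_embeddings, key=lambda n: dot(feat, vocab_embeddings[n]))
            for feat in region_features]

# Toy encoder: "cat" and "dog" map to orthogonal embeddings.
vocab = encode_vocabulary(lambda n: [1.0, 0.0] if n == "cat" else [0.0, 1.0],
                          ["cat", "dog"])
labels = detect([[0.9, 0.1], [0.2, 0.8]], vocab)
# labels == ["cat", "dog"]
```

Because the vocabulary is baked in ahead of time, per-image inference pays no text-encoding cost, which is the source of the claimed speedup.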


MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

We propose MotionCtrl, a unified and flexible motion controller for video generation. This controller is designed to independently and effectively manage both camera and object motions in the generated videos.


AVID is a text-guided video inpainting method, versatile across a spectrum of video durations and tasks.


Midjourney v6

The Dev Team is going to let the community test an alpha version of the Midjourney v6 model over the winter break, starting tonight, December 21st, 2023.


Midjourney’s V6 Brings New Era of AI Image Generation

Midjourney's V6, the latest iteration of the esteemed AI image generation tool, has just been released in alpha, marking a significant milestone in the realm of artificial intelligence and digital creativity. This new version arrives as a much-anticipated upgrade for enthusiasts and professionals alike, bringing with it a suite of enhancements that promise to redefine the standards of AI-generated imagery.


DragNUWA enables users to manipulate backgrounds or objects within images directly, and the model seamlessly translates these actions into camera movements or object motions, generating the corresponding video.


From Audio to Photoreal Embodiment

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset will be publicly released.

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

NeRF and poses from 2–4 unposed synthetic/generated/real images in ~1.3 seconds.


Generative Models by Stability AI

Following the launch of SDXL-Turbo, we are releasing SD-Turbo.


X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model



All the videos are generated using AI, for research purposes only. Some models might produce factually incorrect or biased outputs.


Personalized Restoration via Dual-Pivot Tuning

By using a few reference images of an individual, we personalize a diffusion prior within a blind image restoration framework. This results in a natural image that closely resembles the individual's identity, while retaining the visual attributes of the degraded image.



Scene Integrated Generation for Neural Radiance Fields


Dubbing for Everyone

Data-Efficient Visual Dubbing using Neural Rendering Priors


Sketch Video Synthesis

Understanding semantic intricacies and high-level concepts is essential in image sketch generation, and this challenge becomes even more formidable when applied to the domain of videos. To address this, we propose a novel optimization-based framework for sketching videos represented by frame-wise Bézier curves. In detail, we first propose a cross-frame stroke initialization approach to warm up the location and the width of each curve. Then, we optimize the locations of these curves by utilizing a semantic loss based on CLIP features and a newly designed consistency loss using the self-decomposed 2D atlas network. Built upon these design elements, the resulting sketch video showcases impressive visual abstraction and temporal coherence. Furthermore, by transforming a video into SVG lines through the sketching process, our method unlocks applications in sketch-based video editing and video doodling, enabled through video composition, as exemplified in the teaser.
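Each stroke in such a representation is a parametric curve; a cubic Bézier, for instance, is evaluated from four control points. A generic illustration of that evaluation (not the paper's optimization code):

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1].
    Each control point is an (x, y) tuple; optimizing stroke
    locations means moving control points like these."""
    u = 1.0 - t
    return tuple(u**3 * a + 3 * u**2 * t * b + 3 * u * t**2 * c + t**3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

# The curve starts at p0, ends at p3, and arcs toward p1/p2 in between.
start = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 0.0)  # (0.0, 0.0)
mid = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 0.5)    # (2.0, 1.5)
end = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 1.0)    # (4.0, 0.0)
```

Because gradients flow through this polynomial, losses like the CLIP-based semantic loss can directly update the control-point locations, which is what makes the frame-wise curve representation optimizable.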