Google has announced Gemini 1.0, its next-generation foundation model, first teased at I/O 2023 in May, and is making it available through Bard. Gemini is Google's "most capable and general model," able to understand, operate across, and combine text, code, audio, images, and video. Being "natively multimodal," Google says, gives it improved comprehension, reasoning, and coding ability.
The standard approach to building multimodal models has been "training separate components for different modalities and then stitching them together." While such models excel at some tasks, Google claims they "struggle with more conceptual and complex reasoning." Gemini, by contrast, was "pre-trained from the start on different modalities," using TPU v4 and TPU v5e accelerators. Google also introduced today the TPU v5p (pictured below) as its "most powerful, efficient, and scalable" AI accelerator, aimed specifically at advanced models.
To demonstrate Gemini's "sophisticated reasoning" capabilities, Google showed it digesting 200,000 scientific research papers, selecting the relevant ones, and summarizing the results in about an hour. Another major focus is coding, with Gemini able to "understand, explain, and generate high-quality code" in Python, Java, C++, and Go.
Gemini 1.0 is available in three sizes spanning data centers to phones:

- Gemini Ultra: the largest and most capable model, for highly complex tasks
- Gemini Pro: the best model for scaling across a broad range of tasks
- Gemini Nano: the most efficient model, for on-device tasks
In terms of performance, Google showed Gemini Ultra outperforming GPT-4 on text-based benchmarks that assess logic, math, and programming. According to the company, Gemini Ultra is the "first model to outperform human experts on MMLU (massive multitask language understanding)," scoring 90.0%. The benchmark "uses a combination of 57 subjects such as math, physics, history, law, medicine, and ethics for testing both world knowledge and problem-solving abilities"; GPT-4 scores 86.4%.
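To make the headline number concrete, here is a minimal sketch of how an MMLU-style score aggregates per-subject results into a single accuracy percentage. The subjects and counts below are invented toy data, not the real benchmark, and the micro-averaging shown is one common aggregation choice, not necessarily the exact scheme Google used.

```python
def mmlu_score(results):
    """results: {subject: (num_correct, num_questions)} -> overall accuracy in %.

    Micro-average: pool every question across all subjects, then divide.
    """
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return 100.0 * correct / total

# Hypothetical per-subject results (the real benchmark spans 57 subjects).
toy = {
    "math": (45, 50),
    "law": (40, 50),
    "medicine": (48, 50),
}
print(round(mmlu_score(toy), 1))  # 88.7
```

A single percentage like 90.0% thus compresses performance across all 57 subjects, which is why small gaps between models can hide large differences on individual subjects.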
On the multimodal front, Gemini Ultra outperforms GPT-4V in image, video, and audio testing, and Google DeepMind has published a technical report with more details. According to Google, Gemini Ultra surpassed previous state-of-the-art models on the image benchmarks it tested, even without assistance from optical character recognition (OCR) systems, which extract text from images for further processing. These benchmarks highlight Gemini's native multimodality and show early signs of its more complex reasoning abilities.
Gemini is said to have undergone "the most comprehensive safety evaluations of any Google AI model to date," with new safeguards in place to account for its multimodal capabilities, and Google says it is working hard to combat bias and toxicity.
The first way to get a feel for the new foundation model is Bard with Gemini Pro. This "specifically tuned version" of Gemini Pro, available today, provides more advanced reasoning, planning, and writing, as well as better content comprehension and summarization. Google specifically notes that performance exceeds GPT-3.5 (in six out of eight benchmarks, including MMLU and GSM8K) and calls this the single biggest quality upgrade to Bard since launch.
Google says Bard is now the most preferred free chatbot compared to leading alternatives in blind evaluations with third-party raters. Bard with Gemini Pro is available today in English across 170 countries and territories, with the UK and Europe to follow "in the near future." Gemini Pro will initially power text-based prompts, with support for "other modalities coming soon."
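In a blind evaluation of this kind, raters see two anonymized responses to the same prompt and pick the one they prefer; the "most preferred" claim comes from the resulting win rates. A minimal sketch, using invented vote data purely for illustration:

```python
from collections import Counter

def preference_rate(votes, model):
    """votes: list of winning model labels, one per blind comparison.

    Returns the fraction of comparisons in which `model` was preferred.
    """
    counts = Counter(votes)
    return counts[model] / len(votes)

# Hypothetical rater choices: "A" and "B" are anonymized chatbots.
votes = ["A", "A", "B", "A", "B", "A"]
print(f"{preference_rate(votes, 'A'):.0%}")  # 67%
```

Anonymizing the responses matters: it keeps brand recognition from influencing raters, so the win rate reflects answer quality alone.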
Meanwhile, Gemini Ultra will be released early next year. Google is currently "completing extensive trust and safety checks" and refining the model before making it more broadly available to developers and enterprise customers. It will arrive via a new "Bard Advanced" tier, which Google pitches as early access to its most advanced models and capabilities, such as Gemini Ultra.
Gemini will come to Google Search, Chrome, Duet AI, and Ads in the coming months. In early Search tests, Gemini reduced Search Generative Experience (SGE) latency by 40%.