Claude 3 Released: Could It Completely Surpass GPT-4?
Written by: You Xin
Just now, Anthropic announced the launch of the Claude 3 model series, which sets new industry benchmarks across a wide range of cognitive tasks. The series includes three state-of-the-art models, arranged in increasing order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each subsequent model offers increasingly powerful performance, allowing users to choose the best balance of intelligence, speed, and cost for their specific applications.
Opus and Sonnet are now available on claude.ai and through the Claude API, which is now generally available in 159 countries and regions. Haiku will be released soon.
Claude 3 Model Series
A New Standard of Intelligence
Opus, Anthropic's most intelligent model, excels on most common AI system evaluation benchmarks, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), basic mathematics (GSM8K), and more. It demonstrates near-human-level understanding and fluency on complex tasks, leading the frontier of general intelligence.
All Claude 3 models show enhanced capabilities in analysis and prediction, detailed content creation, code generation, and conversing in non-English languages such as Spanish, Japanese, and French.
Here is a comparison of the Claude 3 models against peers on multiple capability benchmarks [1]:
Near-Instant Results
Claude 3 models can power live customer chats, auto-completion, and data extraction tasks where responses must be immediate.
Haiku is the fastest and most cost-effective model on the market for its intelligence category. It can read an information-dense arXiv research paper (~10k tokens) containing charts and graphs in under three seconds. After launch, Anthropic expects to improve performance further.
For the vast majority of workloads, Sonnet is over twice as fast as Claude 2 and Claude 2.1, with a higher level of intelligence. It excels at tasks that require quick responses, such as knowledge retrieval or sales automation. Opus runs at speeds similar to Claude 2 and 2.1 but with a higher level of intelligence.
Powerful Visual Capabilities
Claude 3 models have sophisticated vision capabilities comparable to those of other leading models. They can process a wide range of visual formats, including photos, charts, graphics, and technical diagrams. Anthropic is particularly excited to offer this new modality to enterprise customers, some of whom have up to 50% of their knowledge base encoded in formats such as PDFs, flowcharts, or presentation slides.
Reduced Refusals
Previous Claude models often made unnecessary refusals, which suggested a lack of contextual understanding. Anthropic has made significant progress in this area: Opus, Sonnet, and Haiku are noticeably less likely than previous models to refuse prompts that come close to the system's guardrails. As shown in the figure below, Claude 3 models demonstrate a more nuanced understanding of requests, recognize genuine harm, and refuse harmless prompts much less often.
Improved Accuracy
Businesses of all sizes rely on Anthropic's models to serve their customers, making it crucial for Anthropic's model outputs to maintain high accuracy at scale. To assess this, Anthropic used a set of complex factual questions targeting known weaknesses of current models. Responses were classified as correct answers, incorrect answers (or hallucinations), and acknowledgments of uncertainty, where the model indicates it does not know the answer rather than providing incorrect information. Compared to Claude 2.1, Opus achieved a twofold increase in accuracy (or correct answers) on these challenging open-ended questions while also reducing the level of incorrect answers.
In addition to generating more trustworthy responses, Anthropic will soon enable citations in its Claude 3 models, allowing them to point to precise sentences in references to validate their answers.
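The three-way classification described above (correct, incorrect, acknowledged uncertainty) can be illustrated with a short tally. This is a minimal sketch, not Anthropic's evaluation code: the label names, the summarize function, and the sample grades are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical label names for the three categories described in the text.
LABELS = ("correct", "incorrect", "uncertain")


def summarize(grades):
    """Return the fraction of graded responses that fall in each category."""
    if not grades:
        raise ValueError("no grades to summarize")
    unknown = set(grades) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(grades)
    return {label: counts[label] / len(grades) for label in LABELS}


# Made-up grades for five responses; these are not real evaluation results.
sample_grades = ["correct", "uncertain", "incorrect", "correct", "uncertain"]
print(summarize(sample_grades))
```

Tracking "uncertain" separately from "incorrect" is what lets a comparison like the one above distinguish a model that admits it does not know from one that answers wrongly.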
Long Context and Near-Perfect Recall
The Claude 3 series models will offer a 200,000-token context window at launch. However, all three models can accept inputs of over 1 million tokens, which Anthropic may provide to specific customers needing enhanced processing capabilities.
To effectively handle long-context prompts, models require strong recall abilities. The "Needle In A Haystack" (NIAH) evaluation measures a model's ability to accurately recall information from a large data corpus. Anthropic enhanced the robustness of this benchmark by using one of 30 random needle/question pairs for each prompt and testing on a diverse crowd-sourced corpus of documents.
Claude 3 Opus not only achieved near-perfect recall with over 99% accuracy but, in some cases, even identified the limitations of the evaluation itself by recognizing "needle" sentences that appeared to be artificially inserted into the original text.
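The general shape of a "Needle In A Haystack" test can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the sample documents, the needle/question pairs, and the substring-based scoring are hypothetical and do not reproduce Anthropic's actual harness or data.

```python
import random


def build_niah_prompt(documents, pairs, rng):
    """Insert one randomly chosen needle into the haystack and append its question.

    documents: list of haystack texts (hypothetical stand-in for the corpus)
    pairs: list of (needle_sentence, question) tuples
    rng: a random.Random instance, so runs can be reproduced
    """
    needle, question = rng.choice(pairs)
    haystack = list(documents)
    # Place the needle at a random position between documents.
    position = rng.randrange(len(haystack) + 1)
    haystack.insert(position, needle)
    prompt = "\n\n".join(haystack) + "\n\nQuestion: " + question
    return prompt, needle, question


def recalled(answer, expected_fact):
    """Assumed scoring rule: the answer must contain the expected fact."""
    return expected_fact.lower() in answer.lower()


# Made-up corpus and needle/question pairs, for illustration only.
documents = ["Essay about gardening.", "Notes on sailing.", "A recipe for bread."]
pairs = [
    ("The secret code for the vault is 7421.", "What is the secret code for the vault?"),
    ("The meeting was moved to the blue room.", "Where was the meeting moved?"),
]

prompt, needle, question = build_niah_prompt(documents, pairs, random.Random(0))
print(needle in prompt)  # True: the needle is always somewhere in the prompt
```

In a real run, the prompt would be sent to the model and the reply checked with a rule like recalled; a stricter or model-graded check may be preferable to substring matching.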
Responsible Design
Anthropic developed the Claude 3 series to be as trustworthy as it is capable. Anthropic has several dedicated teams tracking and mitigating a broad range of risks, from misinformation and CSAM to biological misuse, election interference, and autonomous replication skills. Anthropic continues to develop methods such as Constitutional AI that improve the safety and transparency of its models, and has tuned the models to mitigate privacy concerns that may arise from new modalities.
Addressing biases in increasingly complex models is an ongoing effort, and Anthropic has made progress with this release. As indicated in the model card, Claude 3 shows less bias than Anthropic's previous models according to the Bias Benchmark for Question Answering (BBQ). Anthropic remains committed to advancing techniques that reduce bias and promote greater neutrality in its models, so that they do not favor any particular partisan stance.
While the Claude 3 model series has advanced in biological knowledge, cyber-related knowledge, and autonomy compared to previous models, it remains at AI Safety Level 2 (ASL-2) under Anthropic's Responsible Scaling Policy. Anthropic's red team evaluations (conducted in line with Anthropic's White House commitments and the 2023 U.S. executive order) concluded that the models currently present negligible potential for catastrophic risk. Anthropic will continue to closely monitor future models to assess their proximity to the ASL-3 threshold. More safety details are provided in the Claude 3 model card.
Easier to Use
Claude 3 models are better at following complex, multi-step instructions. They are particularly adept at adhering to brand voice and response guidelines, and at developing customer-facing experiences that users can trust. Additionally, Claude 3 models are better at producing popular structured outputs in formats such as JSON, which makes it easier to instruct Claude for use cases like natural language classification and sentiment analysis.
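To show why JSON output helps with classification and sentiment analysis, here is a minimal sketch of how an application might validate a model's JSON reply before using it. The reply string, the "sentiment" and "text" field names, and the allowed labels are assumptions made for this example; they are not a schema defined by Anthropic, and no API call is made.

```python
import json

# Hypothetical label set that the application asked the model to choose from.
ALLOWED_LABELS = ("positive", "negative", "neutral")


def parse_sentiment(reply):
    """Parse a JSON reply and check that it carries an allowed sentiment label."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"reply is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("sentiment")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected sentiment label: {label!r}")
    return {"sentiment": label, "text": str(data.get("text", ""))}


# Made-up model reply, standing in for what an API call might return.
reply = '{"sentiment": "positive", "text": "Great support!"}'
print(parse_sentiment(reply)["sentiment"])  # positive
```

Validating the reply in code, rather than trusting it blindly, keeps downstream logic safe even when a model occasionally returns malformed or off-schema output.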
Model Details
Claude 3 Opus is Anthropic's most intelligent model, demonstrating market-leading performance on highly complex tasks. It excels in open-ended prompts and unseen scenarios with human-like understanding capabilities. Opus showcases the limits of what generative AI can achieve.
Claude 3 Sonnet strikes an ideal balance between intelligence and speed, especially for enterprise workloads. It delivers strong performance at a lower cost than its peers and is designed for high endurance in large-scale AI deployments.
Claude 3 Haiku is Anthropic's fastest and most compact model, capable of near-instant responses. It answers simple queries and requests with unparalleled speed. Users will be able to build seamless AI experiences that simulate human interaction.
Model Availability
Opus and Sonnet are available today through Anthropic's API, which is now generally available, so developers can sign up and start using these models immediately. Haiku will be released soon. Sonnet powers the free experience on claude.ai, while Opus is available to Claude Pro subscribers.
Sonnet is also available through Amazon Bedrock and Google Cloud's Vertex AI Model Garden, with Opus and Haiku coming soon.
Smarter, Faster, Safer
Anthropic believes that model intelligence is nowhere near its limits and plans to release frequent updates to the Claude 3 model series over the coming months. Anthropic is also excited to release a range of features that enhance its models' capabilities, particularly for enterprise use cases and large-scale deployments. These new features will include tool use (also known as function calling), interactive coding (also known as REPL), and more advanced agentic capabilities.