
Anthropic makes “jailbreak” advance to stop AI models producing harmful results



Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against dangers posed by the cutting-edge technology.

In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.

The development by Anthropic, which is in talks to raise $2 billion at a roughly $60 billion valuation, comes amid growing industry concern over “jailbreaking”: attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta introduced a prompt guard model in July last year, which researchers swiftly found ways to bypass, though these have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] risks [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models are released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The proposed solution is built on a so-called “constitution” of rules that define what is permitted and restricted, and which can be adapted to capture different types of material.
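The paper's classifier internals are not described in the article; the sketch below only illustrates the general pattern of an input and output screening layer governed by a constitution of rules. The function names, example rules and the keyword check standing in for a trained classifier are all illustrative assumptions, not Anthropic's published implementation.

```python
# Minimal illustrative sketch of a classifier layer wrapping a language model.
# Names, rules and the keyword-based check are hypothetical stand-ins.

CONSTITUTION = [
    # Natural-language rules defining permitted and restricted content.
    "Allow: general chemistry education, e.g. describing the periodic table.",
    "Restrict: step-by-step synthesis routes for chemical weapons agents.",
]

def violates_constitution(text: str, constitution: list[str]) -> bool:
    """Return True if `text` appears to breach a restriction.

    In the real system this would be a trained classifier model conditioned
    on the constitution; a simple keyword check stands in for it here.
    """
    restricted_terms = ("synthesis route", "nerve agent", "precursor ratio")
    lowered = text.lower()
    return any(term in lowered for term in restricted_terms)

def guarded_generate(prompt: str, model) -> str:
    """Screen the prompt before generation and the response after it."""
    if violates_constitution(prompt, CONSTITUTION):      # input classifier
        return "Request declined."
    response = model.generate(prompt)                    # underlying LLM
    if violates_constitution(response, CONSTITUTION):    # output classifier
        return "Response withheld."
    return response
```

The key design point is that the filtering layer sits outside the underlying model, so the rules can be updated or extended without retraining the chatbot itself; the real classifiers would be models rather than keyword matchers.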

Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of jailbreak attempts with the classifiers in place, compared with 14 per cent without the safeguards.

Leading technology companies are trying to reduce the misuse of their models while maintaining their usefulness. Often, when moderation measures are put in place, models can become overly cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies that already pay huge sums for the computing power needed to train and run models. Anthropic said the classifiers would add nearly 24 per cent to “inference overhead”, the cost of running a model.

Security experts have argued that the accessible nature of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we would have in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”


