CiteBar
  • Log in
  • Join

Safety-aligned language models can be compromised by malicious inputs 86%

Truth rate: 86%
u1727780071003's avatar u1727780207718's avatar u1727780010303's avatar u1727780186270's avatar
  • Pros: 0
  • Cons: 0
Safety-aligned language models can be compromised by malicious inputs
Pros: 0
  • Cons: 0
  • ⬆

Be the first who create Pros!



Cons: 0
  • Pros: 0
  • ⬆

Be the first who create Cons!


Refs: 1
  • CS 194/294-196 (LLM Agents) - Lecture 12, Dawn Song

Info:
  • Created by: citebot
  • Created at: Jan. 28, 2025, 6:10 a.m.
  • ID: 19289

Related:
Simple prompts can reveal system instructions in language models 87%
87%
u1727779919440's avatar u1727780252228's avatar u1727780031663's avatar u1727780140599's avatar
Simple prompts can reveal system instructions in language models

Attackers can extract private data by querying language models 84%
84%
u1727779945740's avatar u1727694216278's avatar u1727780140599's avatar u1727780050568's avatar u1727694254554's avatar u1727780237803's avatar u1727779927933's avatar u1727780040402's avatar u1727780037478's avatar u1727780216108's avatar
Attackers can extract private data by querying language models

Data quality issues compromise predictive modeling accuracy 86%
86%
u1727779958121's avatar u1727779915148's avatar u1727780124311's avatar u1727780347403's avatar u1727780094876's avatar

Kids should be digital role models for safety 71%
71%
u1727694227436's avatar u1727780182912's avatar u1727780304632's avatar u1727779941318's avatar u1727779953932's avatar u1727780207718's avatar u1727780202801's avatar u1727780190317's avatar

New technologies can compromise human health safety 77%
77%
u1727779950139's avatar u1727694232757's avatar u1727780074475's avatar u1727780295618's avatar u1727780269122's avatar
New technologies can compromise human health safety

Government surveillance can compromise activist personal safety always 73%
73%
u1727780212019's avatar u1727779966411's avatar u1727694239205's avatar u1727694210352's avatar u1727780024072's avatar u1727780256632's avatar u1727779906068's avatar u1727780016195's avatar u1727780247419's avatar u1727779950139's avatar u1727779923737's avatar u1727780127893's avatar u1727780304632's avatar u1727780228999's avatar u1727780224700's avatar u1727780110651's avatar u1727780219995's avatar

Weak passwords compromise smart lock safety 74%
74%
u1727780016195's avatar u1727780314242's avatar u1727780152956's avatar u1727780050568's avatar u1727780237803's avatar u1727780224700's avatar u1727780324374's avatar
Weak passwords compromise smart lock safety

Transparency can be compromised through false data input 80%
80%
u1727780110651's avatar u1727779906068's avatar u1727780067004's avatar u1727780148882's avatar u1727780136284's avatar u1727780309637's avatar

False information can compromise national security and safety 88%
88%
u1727780115101's avatar u1727694210352's avatar u1727780286817's avatar u1727779927933's avatar u1727780050568's avatar
False information can compromise national security and safety

Small input changes can cause AI models to give wrong outputs 94%
94%
u1727780010303's avatar u1727694254554's avatar u1727780333583's avatar u1727780031663's avatar u1727779923737's avatar u1727780182912's avatar u1727779919440's avatar u1727780282322's avatar u1727780078568's avatar u1727780247419's avatar
Small input changes can cause AI models to give wrong outputs
© CiteBar 2021 - 2025
Home About Contacts Privacy Terms Disclaimer
Please Sign In
Sign in with Google