Multimedia Privacy


Tutorial for ACM Multimedia 2016, given together with Gerald Friedland, with contributions from Julia Bernd and Yiannis Kompatsiaris. The presentation covered an introduction to the problem of disclosing personal information through multimedia sharing, the associated security risks, methods for conducting multimodal inferences, and technical frameworks that could help mitigate such risks.



  1. 1. Multimedia Privacy Gerald Friedland Symeon Papadopoulos Julia Bernd Yiannis Kompatsiaris ACM Multimedia, Amsterdam, October 16, 2016
  2. 2. What’s the Big Deal?
  3. 3. Overview of Tutorial • Part I: Understanding the Problem • Part II: User Perceptions About Privacy • Part III: Multimodal Inferences • Part IV: Some Possible Solutions • Part V: Future Directions
  4. 4. Part I: Understanding the Problem
  5. 5. What Can a Mindreader Read? • These vulnerabilities are a problem with any type of public or semi-public post; they are not specific to a particular type of information, e.g. text, image, or video. • However, let’s focus on multimedia data: images, audio, video, social media context, etc.
  6. 6. Multimedia on the Internet Is Big! Source: Domosphere
  7. 7. Resulting Problem • More multimedia data = Higher demand for retrieval and organization tools. • But multimedia retrieval is hard! • Researchers work on making retrieval better (cf. latest advances in Deep Learning for content-based retrieval). • Industry develops workarounds to make retrieval easier right away.
  8. 8. Hypothesis • Retrieval is already good enough to cause major issues for privacy that are not easy to solve. • Let’s take a look at some retrieval approaches: • Image tagging • Geo-tagging • Multimodal Location Estimation • Audio-based user matching
  9. 9. Workaround: Manual Tagging
  10. 10. Workaround: Geo-Tagging Source: Wikipedia
  11. 11. Geo-Tagging Allows easier clustering of photo and video series, among other things.
  12. 12. Geo-Tagging Everywhere Part of the location-based service hype: But: Geo-coordinates + Time = Unique ID!
  13. 13. Support for Geo-Tags • Social media portals provide APIs to connect geo-tags with metadata, accounts, and web content. • Allows easy search, retrieval, and ad placement. • Portal / % geo-tagged* / total geo-tagged: YouTube 3.0% / 3M; Flickr 4.5% / 180M (*estimate, 2013)
  14. 14. Hypothesis • Since geo-tagging is a workaround for multimedia retrieval, it allows us to peek into a future where multimedia retrieval works perfectly. • What if multimedia retrieval actually just worked?
  15. 15. Related Work “Be careful when using social location sharing services, such as Foursquare.”
  16. 16. Related Work Mayhemic Labs, June 2010: “Are you aware that Tweets are geo-tagged?”
  17. 17. Can you do real harm? • Cybercasing: Using online (location-based) data and services to enable physical-world crimes. • Three case studies: G. Friedland and R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (HotSec 10), Washington, D.C, August 2010.
  18. 18. Case Study 1: Twitter • Pictures in Tweets can be geo-tagged • From a tech-savvy celebrity we found: • Home location (several pics) • Where the kids go to school • Where he/she walks the dog • “Secret” office
  19. 19. Celebs Unaware of Geo-Tagging Source: ABC News
  20. 20. Celebs Unaware of Geotagging
  21. 21. Google Maps Shows Address...
  22. 22. Case Study 2: Craigslist “For Sale” section of Bay Area Craigslist.com: • 4 days: 68,729 pictures total - 1.3% geo-tagged
  23. 23. Users Are Unaware of Geo-Tagging • Many “anonymized” ads had geo-location • Sometimes selling high-value goods, e.g. cars, diamonds, etc. • Sometimes “call Sunday after 6pm” • Multiple photos allow interpolation of coordinates for higher accuracy
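To make the geo-tagging risk concrete, here is a minimal sketch of how anyone can read GPS coordinates out of a photo's EXIF header. It assumes the Pillow library and a hypothetical file name `photo.jpg`; it is illustrative and not the tooling used in the cybercasing study.

```python
# Minimal sketch: read GPS coordinates from a JPEG's EXIF header (Pillow assumed installed).
from PIL import Image
from PIL.ExifTags import GPSTAGS

GPSINFO_TAG = 34853  # standard EXIF tag id for the GPS IFD

def exif_gps(path):
    exif = Image.open(path)._getexif() or {}
    gps = {GPSTAGS.get(k, k): v for k, v in exif.get(GPSINFO_TAG, {}).items()}
    if "GPSLatitude" not in gps:
        return None

    def to_degrees(dms, ref):
        d, m, s = (float(x) for x in dms)
        deg = d + m / 60.0 + s / 3600.0
        return -deg if ref in ("S", "W") else deg

    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

print(exif_gps("photo.jpg"))  # e.g. (37.8715, -122.2730) for a geo-tagged photo
```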
  24. 24. Craigslist: Real Example
  25. 25. Geo-Tagging Resolution Measured accuracy: +/- 1m iPhone 3G picture Google Street View
  26. 26. What About Inference? Owner Valuable
  27. 27. Case Study 3: YouTube Recall: • Once data is published, the Internet keeps it (often with many copies). • APIs are easy to use and allow quick retrieval of large amounts of data. Can we find people on vacation using YouTube?
  28. 28. Cybercasing on YouTube Experiment: Cybercasing using the YouTube API (240 lines in Python)
  29. 29. Cybercasing on YouTube Input parameters Location: 37.869885,-122.270539 Radius: 100km Keywords: kids Distance: 1000km Time-frame: this_week
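The original 240-line script is not reproduced here; the following is only a sketch of the kind of location-constrained query the experiment describes, written against the current YouTube Data API v3 (the 2010 study used the older GData API). The API key is a placeholder, and the parameter values mirror the inputs on the slide above.

```python
# Sketch of a location-constrained YouTube search (YouTube Data API v3).
# Assumes google-api-python-client is installed; API_KEY is a placeholder.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder
youtube = build("youtube", "v3", developerKey=API_KEY)

response = youtube.search().list(
    part="snippet",
    q="kids",                          # keyword from the experiment
    type="video",
    location="37.869885,-122.270539",  # Berkeley, CA
    locationRadius="100km",
    order="date",
    maxResults=50,
).execute()

for item in response.get("items", []):
    vid = item["id"]["videoId"]
    channel = item["snippet"]["channelId"]
    # Next step in the experiment: crawl each uploader's other videos (the "user hull")
    # and look for recent uploads far from home (e.g. >1000 km away, this week).
    print(vid, channel, item["snippet"]["title"])
```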
  30. 30. Cybercasing on YouTube Output • Initial videos: 1000 (max_res) • User hull: ~50k videos • Vacation hits: 106 • Cybercasing targets: >12
  31. 31. The Threat Is Real!
  32. 32. Question Do you think geo-tagging should be illegal? a) No, people just have to be more careful. The possibilities still outweigh the risks. b) Maybe it should be regulated somehow to make sure no harm can be done. c) Yes, absolutely! This information is too dangerous.
  33. 33. But… Is this really about geo-tags? (remember: hypothesis)
  34. 34. But… Is this really about geo-tags? No, it’s about the privacy implications of multimedia retrieval in general.
  35. 35. Question And now? What do you think should be done? a) Nothing can be done. Privacy is dead. b) I will think before I post, but I don’t know that it matters. c)We need to educate people about this and try to save privacy. (Fight!) d) I’ll never post anything ever again! (Flight!)
  36. 36. Observations • Many applications encourage heavy data sharing, and users go along with it. • Multimedia isn’t only a lot of data; it’s also a lot of implicit information. • Both users and engineers are often unaware of the hidden retrieval possibilities of shared (multimedia) data. • Local anonymization and privacy policies may be ineffective against cross-site inference.
  37. 37. Dilemma • People will continue to want social networks and location-based services. • Industry and research will continue to improve retrieval techniques. • Government will continue to do surveillance and intelligence-gathering.
  38. 38. Solutions That Don’t Work • I blur the faces •Audio and image artifacts can still give you away • I only share with my friends •But who are they sharing with, on what platforms? • I don’t do social networking •Others may do it for you!
  39. 39. Further Observations • There is not much incentive to worry about privacy, until things go wrong. • People’s perception of the Internet does not match reality (enough).
  40. 40. Basics: Definitions and Background
  41. 41. Definition • Privacy is the right to be let alone (Warren and Brandeis) • Privacy is: a) the quality or state of being apart from company or observation b) freedom from unauthorized intrusion (Merriam-Webster’s)
  42. 42. Starting Points • Privacy is a human right. Every individual has a need to keep something about themselves private. • Companies have a need for privacy. • Governments have a need for privacy (currently heavily discussed).
  43. 43. Where We’re At (Legally) Keep an eye out for multimedia inference!
  44. 44. A Taxonomy of Social Networking Data • Service data: Data you give to an OSN to use it, e.g. name, birthday, etc. • Disclosed data: What you post on your page/space • Entrusted data: What you post on other people’s pages, e.g. comments • Incidental data: What other people post about you • Behavioural data: Data the site collects about you • Derived data: Data that a third party infers about you based on all that other data B. Schneier. A Taxonomy of Social Networking Data, Security & Privacy, IEEE, vol.8, no.4, pp.88, July-Aug. 2010
  45. 45. Privacy Bill of Rights In February 2012, the US Government released CONSUMER DATA PRIVACY IN A NETWORKED WORLD: A FRAMEWORK FOR PROTECTING PRIVACY AND PROMOTING INNOVATION IN THE GLOBAL DIGITAL ECONOMY http://www.whitehouse.gov/sites/default/files/privacy-final.pdf
  46. 46. Privacy Bill of Rights 1) Individual Control: Consumers have a right to exercise control over what personal data organizations collect from them and how they use it. 2) Transparency: Consumers have a right to easily understandable and accessible information about privacy and security practices. 3) Respect for Context: Consumers have a right to expect that organizations will collect, use, and disclose personal data in ways consistent with the context in which consumers provide the data.
  47. 47. Privacy Bill of Rights 4) Security: Consumers have a right to secure and responsible handling of personal data. 5) Access and Accuracy: Consumers have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to citizens if the data is inaccurate.
  48. 48. Privacy Bill of Rights 6) Focused Collection: Consumers have a right to reasonable limits on the personal data that organizations collect and retain. 7) Accountability: Consumers have a right to have personal data handled by organizations with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
  49. 49. One View The Privacy Bill of Rights could serve as a requirements framework for an ideally privacy-aware Internet service. ...if it were adopted.
  50. 50. Limitations • The Privacy Bill of Rights is subject to interpretation. • What is “reasonable”? • What is “context”? • What is “personal data”? • The Privacy Bill of Rights presents technical challenges.
  51. 51. Personal Data Protection in EU • The Data Protection Directive* (aka Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data) is an EU directive adopted in 1995 which regulates the processing of personal data within the EU. It is an important component of EU privacy and human rights law. • The General Data Protection Regulation, in progress since 2012 and adopted in April 2016, will supersede the Data Protection Directive and be enforceable as of 25 May 2018 • Objectives • Give control of personal data to citizens • Simplify regulatory environment for businesses * A directive is a legal act of the European Union, which requires member states to achieve a particular result without dictating the means of achieving that result.
  52. 52. When is it legitimate…? Collecting and processing the personal data of individuals is only legitimate in one of the following circumstances (Article 7 of the Directive): • The individual gives unambiguous consent • If data processing is needed for a contract (e.g. electricity bill) • If processing is required by a legal obligation • If processing is necessary to protect the vital interests of the person (e.g. processing medical data of an accident victim) • If processing is necessary to perform tasks of public interest • If the data controller or a third party has a legitimate interest in doing so, as long as this does not affect the interests of the data subject or infringe his/her fundamental rights
  53. 53. Obligations of data controllers in EU Respect for the following rules: • Personal data must be collected and used for explicit and legitimate purposes • It must be adequate, relevant, and not excessive in relation to those purposes • It must be accurate and updated when needed • Data subjects must be able to access, correct, and remove incorrect data about themselves • Personal data should not be kept longer than necessary • Data controllers must protect personal data (incl. from unauthorized access by third parties) using appropriate measures of protection (security, accountability)
  54. 54. Handling sensitive data Definition of sensitive data in EU: • religious beliefs • political opinions • health • sexual orientation • race • trade union membership Processing sensitive data comes under stricter set of rules (Article 8)
  55. 55. Enforcing data protection in EU? • The Directive states that every EU country must provide one or more independent supervisory authorities to monitor its application. • In principle, all data controllers must notify their supervisory authorities when they process personal data. • The national authorities are also in charge of receiving and handling complaints from individuals.
  56. 56. Data Protection: US vs EU • The US has no legislation comparable to the EU’s Data Protection Directive. • US privacy legislation is adopted on an ad hoc basis, e.g. when certain sectors and circumstances require it (HIPAA, CTPCA, FCRA) • The US takes a more laissez-faire approach • In general, US privacy legislation is considered “weaker” than the EU’s
  57. 57. Example: What Is Sensitive Data? Public records indicate you own a house.
  58. 58. Example: What Is Sensitive Data? A geo-tagged photo taken by a friend reveals who attended your party!
  59. 59. Example: What Is Sensitive Data? Facial recognition match with a public record: Prior arrest for drug offense!
  60. 60. Example: What Is Sensitive Data? 1) Public records indicate you own a house 2) A geo-tagged photo taken by a friend reveals who attended your party 3) Facial recognition match with a public record: Prior arrest for drug offense! → “You associate with convicts”
  61. 61. Example: What Is Sensitive Data? “You associate with convicts” What will this do for your reputation when you: • Date? • Apply for a job? • Want to be elected to public office?
  62. 62. Example: What Is Sensitive Data? But: Which of these is the sensitive data? a) Public record: You own a house b) Geotagged photo taken by a friend at your party c) Public record: A friend’s prior arrest for a drug offense d) Conclusion: “You associate with convicts.” e) None of the above.
  63. 63. Who Is to Blame? a) The government, for its Open Data policy? b) Your friend who posted the photo? c) The person who inferred data from publicly available information?
  64. 64. Part II: User Perceptions About Privacy
  65. 65. Study 1: Users’ Understandings of Privacy
  66. 66. The Teaching Privacy Project • Goal: Create a privacy curriculum for K-12 and undergrad, with lesson plans, teaching tools, visualizations, etc. • NSF sponsored. (CNS‐1065240 and DGE-1419319; all conclusions ours.) • Check It Out: Info, public education, and teaching resources: http://teachingprivacy.org
  67. 67. Based on Several Research Strands • Joint work between Friedland, Bernd, Serge Egelman, Dan Garcia, Blanca Gordo, and many others! • Understanding of user perceptions comes from: • Decades of research comparing privacy comprehension, preferences, concerns, and behaviors, including by Egelman and colleagues at CMU • Research on new Internet users’ privacy perceptions, including Gordo’s evaluations of digital-literacy programs • Observation of multimedia privacy leaks, e.g. “cybercasing” study • Reports from high school and undergraduate teachers about students’ misperceptions • Summer programs for high schoolers interested in CS
  68. 68. Common Research Threads • What happens on the Internet affects the “real” world. • However: Group pressure, impulse, convenience, and other factors usually dominate decision making. • Aggravated by lack of understanding of how sharing on the Internet really works. • Wide variation in both comprehension and actual preferences.
  69. 69. Multimedia Motivation • Many current multimedia R&D applications have a high potential to compromise the privacy of Internet users. • We want to continue pursuing fruitful and interesting research programs! • But we can also work to mitigate negative effects by using our expertise to educate the public about effects on their privacy.
  70. 70. What Do People Need to Know? Starting point: 10 observations about frequent misperceptions + 10 “privacy principles” to address them Illustrations by Ketrina Yim.
  71. 71. Misconception #1 • Perception: I keep track of what I’m posting. I am in control. Websites are like rooms, and I know what’s in each of them. • Reality: Your information footprint is larger than you think! • An empty Twitter post has kilobytes of publicly available metadata. • Your footprint includes what others post about you, hidden data attached by services, records of your offline activities… Not to mention inferences that can be drawn across all those “rooms”!
  72. 72. Misconception #2 • Perception: Surfing is anonymous. Lots of sites allow anonymous posting. • Reality: There is no anonymity on the Internet. •Bits of your information footprint — geo-tags, language patterns, etc. — may make it possible for someone to uniquely identify you, even without a name.
  73. 73. Misconception #3 • Perception: There’s nothing interesting about what I do online. • Reality: Information about you on the Internet will be used by somebody in their interest — including against you. •Every piece of information has value to somebody: other people, companies, organizations, governments... •Using or selling your data is how Internet companies that provide “free” services make money.
  74. 74. Misconception #4 • Perception: Communication on the Internet is secure. Only the person I’m sending it to will see the data. • Reality: Communication over a network, unless strongly encrypted, is never just between two parties. •Online data is always routed through intermediary computers and systems… •Which are connected to many more computers and systems...
  75. 75. Misconception #5 • Perception: If I make a mistake or say something dumb, I can delete it later. Anyway, people will get what I mean, right? • Reality: Sharing information over a network means you give up control over that information — forever! •The Internet never forgets. Search engines, archives, and reposts duplicate data; you can’t “unshare”. •Websites sell your information, and data can be subpoenaed. •Anything shared online is open to misinterpretation. The Internet can’t take a joke!
  76. 76. Misconception #6 • Perception: Facial recognition/speaker ID isn’t good enough to find this. As long as no one can find it now, I’m safe. • Reality: Just because it can’t be found today, it doesn’t mean it can’t be found tomorrow. •Search engines get smarter. •Multimedia retrieval gets better. •Analog information gets digitized. •Laws, privacy settings, and privacy policies change.
  77. 77. Misconception #7 • Perception: What happens on the Internet stays on the Internet. • Reality: The online world is inseparable from the “real” world. •Your online activities are as much a part of your life as your offline activities. •People don’t separate what they know about Internet-you from what they know about in-person you.
  78. 78. Misconception #8 • Perception: I don’t chat with strangers. I don’t “friend” people on Facebook that I don’t know. • Reality: Are you sure? Identity isn’t guaranteed on the Internet. •Most information that “establishes” identity in social networks may already be public. •There is no foolproof way to match a real person with their online identity.
  79. 79. Misconception #9 • Perception: I don’t use the Internet. I am safe. • Reality: You can’t avoid having an information footprint by not going online. •Friends and family will post about you. •Businesses and government share data about you. •Companies track transactions online. •Smart cards transmit data online.
  80. 80. Misconception #10 • Perception: There are laws that keep companies and people from sharing my data. If a website has a privacy policy, that means they won’t share my information. It’s all good. • Reality: Only you have an interest in maintaining your privacy! •Internet technology is rarely designed to protect privacy. •“Privacy policies” are there to protect providers from lawsuits. •Laws are spotty and vary from place to place. •Like it or not, your privacy is your own responsibility!
  81. 81. What Came of All This? Example: “Ready or Not?” educational app LINK
  82. 82. What Came of All This? Example: “Digital Footprints” video
  83. 83. Study 2: Perceived vs. Actual Predictability of Personal Information in Social Nets Papadopoulos and Kompatsiaris with Eleftherios Spyromitros-Xioufis, Giorgos Petkos, and Rob Heyman (iMinds)
  84. 84. Personal Information in OSNs Participation in OSNs comes at a price! • User-related data is shared with: • a) other OSN users, b) the OSN itself, c) third parties (e.g. ad networks) • Disclosure of specific types of data: • e.g. gender, age, ethnicity, political or religious beliefs, sexual preferences, employment status, etc. • Information isn’t always explicitly disclosed! • Several types of personal information can be accurately inferred based on implicit cues (e.g. Facebook likes) and machine learning! (cf. Part III)
  85. 85. Inferred Information & Privacy in OSNs • Study of user awareness with regard to inferred information largely neglected by social research. • Privacy usually presented as a question of giving access or communicating personal information to some party, e.g.: “The claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.” (Westin, 1970) [1] Alan Westin. Privacy and freedom. Bodley Head, London, 1970.
  86. 86. Inferred Information & Privacy in OSNs • However, access control is non-existent for inferred information: • Users are unaware of the inferences being made. • Users have no control over the way inferences are made. • Goal: Investigate whether and how users intuitively grasp what can be inferred from their disclosed data!
  87. 87. Main Research Questions 1. Predictability: How predictable are different types of personal information, based on users’ OSN data? 2. Actual vs. perceived predictability: How realistic are user perceptions about the predictability of their personal information? 3. Predictability vs. sensitivity: What is the relationship between perceived sensitivity and predictability of personal information? • Previous work has focused mainly on Q1 • We address Q1 using a variety of data and methods, and additionally we address Q2 and Q3
  88. 88. Data Collection • Three types of data about 170 Facebook users: • OSN data: Likes, posts, images -- collected through a test Facebook application • Answers to questions about 96 personal attributes, organized into 9 categories, e.g. health factors, sexual orientation, income, political attitude, etc. • Answers to questions related to their perceptions about the predictability and sensitivity of the 9 categories http://databait.eu http://www.usemp-project.eu
  89. 89. Example From Questionnaire • What is your sexual orientation? → ground truth (responses: heterosexual 147, homosexual 14, bisexual 7, n/a 2) • Do you think the information on your Facebook profile reveals your sexual orientation? Either because you yourself have put it online, or it could be inferred from a combination of posts. → perceived predictability (responses: yes 134, no 33, n/a 3) • How sensitive do you find the information you had to reveal about your sexual orientation? (1=not sensitive at all, 7=very sensitive) → perceived sensitivity
  90. 90. Features Extracted From OSN Data • likes: binary vector denoting presence/absence of a like (#3.6K) • likesCats: histogram of like category frequencies (#191) • likesTerms: Bag-of-Words (BoW) of terms in description, title, and about sections of likes (#62.5K) • msgTerms: BoW vector of terms in user posts (#25K) • lda-t: Distribution of topics in the textual contents of both likes (description, title, and about section) and posts • Latent Dirichlet Allocation with t=20,30,50,100 • visual: concepts depicted in user images (#11.9K), detected using CNN, top 12 concepts per image, 3 variants • visual-bin: hard 0/1 encoding • visual-freq: concept frequency histogram • visual-conf: sum of detection scores across all images
  91. 91. Experimental Setup • Evaluation method: repeated random sub-sampling • Data split randomly 𝑛=10 times into train (67%) / test (33%) • Model fit on train / accuracy of inferences assessed on test • 96 questions (user attributes) were considered • Evaluation measure: area under ROC curve (AUC) • Appropriate for imbalanced classes • Classification algorithms • Baseline: 𝑘-nearest neighbors, decision tree, naïve Bayes • SoA: Adaboost, random forest, regularized logistic regression
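As a sketch of the evaluation protocol described above (not the project's actual code), repeated random sub-sampling with AUC can be written with scikit-learn roughly as follows; `X` and `y` stand for the feature matrix and one binary personal attribute, and the logistic regression classifier is just one of the algorithms listed on the slide.

```python
# Sketch of the evaluation protocol: n=10 random 67/33 splits, AUC measured on the test part.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def repeated_subsampling_auc(X, y, n_repeats=10, test_size=0.33):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.mean(scores), np.std(scores)  # mean AUC and its variability over the splits
```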
  92. 92. Predictability per Attribute [chart: AUC per personal attribute; example attributes include nationality, is employed, can be moody, smokes cannabis, plays volleyball]
  93. 93. What Is More Predictable? Rank: Perceived | Actual predictability (rank change vs. perceived) | Predictability SoA*
      1: Demographics | Demographics (–) | Demographics
      2: Relationship status and living condition | Political views (+3) | Political views
      3: Sexual orientation | Sexual orientation (–) | Religious views
      4: Consumer profile | Employment/Income (+4) | Sexual orientation
      5: Political views | Consumer profile (−1) | Health status
      6: Personality traits | Relationship status and living condition (−4) | Relationship status and living condition
      7: Religious views | Religious views (–)
      8: Employment/Income | Health status (+1)
      9: Health status | Personality traits (−3)
      * Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013.
  94. 94. Predictability Versus Sensitivity
  95. 95. Part III: Multimodal Inferences
  96. 96. Personal Data: Truly Multimodal • Text: posts, comments, content of articles you read/like, etc. • Images/Videos: posted by you, liked by you, posted by others but containing you • Resources: likes, visited websites, groups, etc. • Location: check-ins, GPS of posted images, etc. • Network: what your friends look like, what they post, what they like, community where you belong • Sensors: wearables, fitness apps, IoT
  97. 97. What Can Be Inferred? A lot….
  98. 98. Three Main Approaches • Content-based • What you post is what/where/how/etc. you are • Supervised learning • Learn by example • Network-based • Show me your friends and I’ll tell you who you are
  99. 99. Content-Based Beware of your posts…
  100. 100. Location Multimodal Location Estimation
  101. 101. Multimodal Location Estimation http://mmle.icsi.berkeley.edu
  102. 102. Multimodal Location Estimation We infer the location of a video based on its visual stream, audio stream, and tags: • Use geo-tagged data as training data • Allows faster search, inference, and intelligence-gathering, even without GPS. G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation," pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.
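As a deliberately simplified illustration of the idea (tags only, no audio or visual stream, unlike the actual multimodal system), here is a sketch of 1-nearest-neighbour location estimation: each test video gets the geo-tag of the training video whose tag set is most similar. The coordinates and tag sets are invented examples.

```python
# Simplified illustration: estimate a video's location from tags alone by copying the
# geo-tag of the most tag-similar geo-tagged training video (Jaccard similarity, 1-NN).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def estimate_location(test_tags, training_videos):
    """training_videos: list of (tag_set, (lat, lon)) pairs from geo-tagged uploads."""
    best_tags, best_loc = max(training_videos, key=lambda tv: jaccard(test_tags, tv[0]))
    return best_loc

train = [({"berkeley", "sathergate", "campanile"}, (37.8719, -122.2585)),
         ({"goldengate", "bridge", "fog"}, (37.8199, -122.4783))]
print(estimate_location({"campanile", "tower"}, train))  # -> the Berkeley coordinates
```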
  103. 103. Intuition for the Approach [graph over videos with example tag sets: {berkeley, sathergate, campanile}, {berkeley, haas}, {campanile}, {campanile, haas}] • Node: geolocation of a video • Edge: correlated locations (e.g. common tag, visual, acoustic features) • Edge potential: strength of an edge (e.g. posterior distribution of locations given common tags)
  104. 104. MediaEval J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran: "Multimodal Location Estimation of Consumer Media: Dealing with Sparse Training Data," in Proceedings of IEEE ICME 2012, Melbourne, Australia, July 2012.
  105. 105. YouTube Cybercasing Revisited: With Geo-Tags (old experiment) vs. Multimodal Location Estimation (no geo-tags) • Initial videos: 1000 (max) vs. 107 • User hull: ~50k vs. ~2000 • Potential hits: 106 vs. 112 • Actual targets: >12 vs. >12
  106. 106. Account Linking Can we link accounts based on their content?
  107. 107. Using Internet Videos: Dataset Test videos from Flickr (~40 sec) • 121 users to be matched, 50k trials • 70% have heavy noise • 50% speech • 3% professional content H. Lei, J. Choi, A. Janin, and G. Friedland: “Persona Linking: Matching Uploaders of Videos Across Accounts”, at IEEE International Conference on Acoustic, Speech, and Signal Processing (ICASSP), Prague, May 2011.
  108. 108. Matching Users Within Flickr Algorithm: 1) Take 10 seconds of the soundtrack of a video 2) Extract the Spectral Envelope 3) Compare using Manhattan Distance
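A rough sketch of this audio matching step, assuming the librosa library is available and using placeholder file names: average the short-time magnitude spectrum over ~10 seconds to get a spectral envelope, then compare envelopes with the Manhattan (L1) distance. The actual system's feature extraction and scoring may differ.

```python
# Rough sketch: compare two recordings via their average spectral envelopes (L1 distance).
import numpy as np
import librosa

def spectral_envelope(path, seconds=10, sr=16000, n_fft=1024):
    y, _ = librosa.load(path, sr=sr, duration=seconds)
    mag = np.abs(librosa.stft(y, n_fft=n_fft))       # (freq_bins, frames)
    env = mag.mean(axis=1)                           # average magnitude per frequency bin
    return env / (np.linalg.norm(env, 1) + 1e-12)    # normalise so envelopes are comparable

def manhattan(a, b):
    return float(np.abs(a - b).sum())

# Smaller distance -> more likely the same recording environment / uploader.
d = manhattan(spectral_envelope("video_a.wav"), spectral_envelope("video_b.wav"))
print(d)
```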
  109. 109. Spectral Envelope
  110. 110. User ID on Flickr Videos
  111. 111. Persona Linking Using Internet Videos Result: •On average, having 40 seconds in the test and training sets leads to a 99.2% chance for a true positive match!
  112. 112. Another Linkage Attack Exploiting users’ online activity to link accounts • Link based on where and when a user is posting • Attack model is individual targeting • Datasets: Yelp, Flickr, Twitter • Methods • Location profile • Timing profile
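A minimal sketch of the timing-profile idea (not the paper's exact model): build an hour-of-week posting histogram for each account and rank candidate accounts by cosine similarity to the targeted account's histogram. Account ids and timestamps are assumed inputs.

```python
# Sketch: link accounts by posting-time behaviour (hour-of-week histograms + cosine similarity).
import numpy as np

def timing_profile(timestamps):
    """timestamps: list of datetime objects for one account."""
    hist = np.zeros(7 * 24)
    for t in timestamps:
        hist[t.weekday() * 24 + t.hour] += 1
    return hist / (hist.sum() + 1e-12)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidates(target_times, candidates):
    """candidates: dict mapping candidate account id -> list of posting datetimes."""
    target = timing_profile(target_times)
    scores = {acc: cosine(target, timing_profile(ts)) for acc, ts in candidates.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])  # most similar candidates first
```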
  113. 113. When a User Is Posting
  114. 114. Where a User Is Posting - Twitter locations - Yelp locations
  115. 115. De-Anonymization Model Targeted account (YELP users are ID’d) Candidate list
  116. 116. Datasets • Three social networks: Yelp, Twitter, Flickr • Two types of data sets • Ground truth data set • Yelp-Twitter: 2,363 -> 342 (with geotags) -> 57 (in SF bay) • Flickr-Twitter: 6,196 -> 396 (with geotags) -> 27 (in SF bay) • Candidate Twitter list data set: 26,204
  117. 117. Performance on Matching
  118. 118. Supervised Learning Learn by example
  119. 119. Inferring Personal Information • Supervised learning algorithms • Learn a mapping (model) from inputs 𝒙𝑖 to outputs 𝑦𝑖 by analyzing a set of training examples 𝐷 = {(𝒙𝑖, 𝑦𝑖)}, 𝑖 = 1…𝑁 • In this case • 𝑦𝑖 corresponds to a personal user attribute, e.g. sexual orientation • 𝒙𝑖 corresponds to a set of predictive attributes or features, e.g. user likes • Some previous results • Kosinski et al. [1]: likes features (SVD) + logistic regression: Highly accurate inferences of ethnicity, gender, sexual orientation, etc. • Schwartz et al. [2]: status updates (PCA) + linear SVM: Highly accurate inference of gender [1] Kosinski, et al. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013. [2] Schwartz, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 2013.
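A schematic reconstruction of the Kosinski-style pipeline, not their actual code: a binary user-by-like matrix is compressed with SVD and fed to logistic regression, evaluated with cross-validated AUC. The choice of 100 SVD components and the scikit-learn components are illustrative assumptions.

```python
# Schematic Kosinski-style pipeline: binary user-like matrix -> SVD -> logistic regression.
# likes_matrix is an (n_users x n_likes) 0/1 matrix; y is one binary attribute (e.g. gender).
from scipy.sparse import csr_matrix
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def attribute_predictor(likes_matrix, y, n_components=100):
    X = csr_matrix(likes_matrix)
    model = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=0),  # compress the sparse likes
        LogisticRegression(max_iter=1000),
    )
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    return model.fit(X, y), auc
```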
  120. 120. What Do Your Likes Say About You? M. Kosinski, D. Stillwell, T. Graepel. “Private Traits and Attributes are Predictable from Digital Records of Human Behavior”. PNAS 110: 5802 – 5805, 2013
  121. 121. M. Kosinski, D. Stillwell, T. Graepel. “Private Traits and Attributes are Predictable from Digital Records of Human Behavior”. PNAS 110: 5802 – 5805, 2013 Results: Prediction Accuracy
  122. 122. M. Kosinski, D. Stillwell, T. Graepel. “Private Traits and Attributes are Predictable from Digital Records of Human Behavior”. PNAS 110: 5802 – 5805, 2013 The More You Like…
  123. 123. Our Results: USEMP Dataset (Part II) Testing different classifiers
  124. 124. Our Results: USEMP Dataset (Part II) Testing different features
  125. 125. Our Results: USEMP Dataset (Part II) Testing combinations of features
  126. 126. Caution: Reliability of Predictions [diagram: an ensemble of N models, each trained on a random α% subsample of the training set]
  127. 127. Caution: Reliability of Predictions Percentage of users for which individual models have low agreement (Sx < 0.5), and classification accuracy for those users. MyPersonality dataset (subset)
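A sketch of one way to measure per-user reliability: train several models on random α% subsamples and compute, per test user, how strongly the ensemble members agree. The agreement score used here (fraction of models voting for the majority class, binary labels assumed) is an illustrative stand-in for the slide's Sx score, which is not defined in this listing.

```python
# Sketch: per-user agreement of an ensemble trained on random alpha% subsamples (binary y).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def ensemble_agreement(X_train, y_train, X_test, n_models=10, alpha=0.7, seed=0):
    rng = np.random.RandomState(seed)
    votes = []
    for _ in range(n_models):
        Xs, ys = resample(X_train, y_train, n_samples=int(alpha * len(y_train)),
                          random_state=rng.randint(1_000_000))
        votes.append(LogisticRegression(max_iter=1000).fit(Xs, ys).predict(X_test))
    votes = np.array(votes)                                  # (n_models, n_test_users)
    majority = np.round(votes.mean(axis=0))                  # per-user majority label
    return np.mean(votes == majority, axis=0)                # low values -> unreliable users
```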
  128. 128. Conclusions • Representing users as feature vectors and using supervised learning can help achieve pretty good accuracy in several cases. However: • There will be several cases where the output of the trained model will be unreliable (close to random). • For many classifiers and for abstract feature representations (e.g. SVD), it is very hard to explain why a particular user has been classified as belonging to a given class.
  129. 129. Network-Based Learning Show me your friends…. with Georgios Rizos
  130. 130. Network-Based Classification • People with similar interests tend to connect → homophily • Knowing about one’s connections could reveal information about them • Knowing about the whole network structure could reveal even more…
  131. 131. My Social Circles A variety of affiliations: • Work • School • Family • Friends …
  132. 132. SoA: User Classification (1) Graph-based semi-supervised learning: • Label propagation (Zhu and Ghahramani, 2002) • Local and global consistency (Zhou et al., 2004) Other approaches to user classification: • Hybrid feature engineering for inferring user behaviors (Pennacchiotti et al., 2011 , Wagner et al., 2013) • Crowdsourcing Twitter list keywords for popular users (Ghosh et al., 2012)
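A compact sketch of the label propagation idea (Zhu and Ghahramani style) written directly against an adjacency matrix: iteratively average neighbours' label distributions while clamping the labelled users. This is an illustration of the technique, not any of the cited implementations.

```python
# Sketch of label propagation on a user graph: propagate neighbours' label distributions,
# clamping the known (labelled) users back to their ground truth after every step.
import numpy as np

def label_propagation(A, labels, n_classes, n_iter=50):
    """A: (n, n) adjacency matrix; labels: array of class ids, -1 for unlabelled users."""
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalised transition matrix
    F = np.zeros((n, n_classes))
    known = labels >= 0
    F[known, labels[known]] = 1.0
    for _ in range(n_iter):
        F = P @ F                        # each user takes the average of its neighbours
        F[known] = 0.0                   # clamp labelled nodes
        F[known, labels[known]] = 1.0
    return F.argmax(axis=1)              # predicted class per user
```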
  133. 133. SoA: Graph Feature Extraction (2) Use of community detection: • EdgeCluster: Edge centric k-means (Tang and Liu, 2009) • MROC: Binary tree community hierarchy (Wang et al., 2013) Low-rank matrix representation methods: • Laplacian Eigenmaps: k eigenvectors of the graph Laplacian (Belkin and Niyogi, 2003 , Tang and Liu, 2011) • Random-Walk Modularity Maximization: Does not suffer from the resolution limit of ModMax (Devooght et al., 2014) • Deepwalk: Deep representation learning (Perozzi et al., 2014)
  134. 134. Overview of Framework [pipeline diagram: online social interactions (retweets, mentions, etc.) → social interaction user graph → ARCTE graph feature representation → feature weighting with partial/sparse annotation → supervised user label learning → classified users]
  135. 135. ARCTE: Intuition
  136. 136. Evaluation: Datasets Ground truth generation: • SNOW2014 Graph: Twitter list aggregation & post-processing • IRMV-PoliticsUK: Manual annotation • ASU-YouTube: User membership to group • ASU-Flickr: User subscription to interest group
      Dataset | Labels | Vertices | Vertex Type | Edges | Edge Type
      SNOW2014 Graph (Papadopoulos et al., 2014) | 90 | 533,874 | Twitter Account | 949,661 | Mentions + Retweets
      IRMV-PoliticsUK (Greene & Cunningham, 2013) | 5 | 419 | Twitter Account | 11,349 | Mentions + Retweets
      ASU-YouTube (Mislove et al., 2007) | 47 | 1,134,890 | YouTube Channel | 2,987,624 | Subscriptions
      ASU-Flickr (Tang and Liu, 2009) | 195 | 80,513 | Flickr Account | 5,899,882 | Contacts
  137. 137. Example: Twitter Twitter Handle Labels @nytimes usa, press, new york @HuffPostBiz finance @BBCBreaking press, journalist, tv @StKonrath journalist Examples from SNOW 2014 Data Challenge dataset
  138. 138. Evaluation: SNOW 2014 dataset SNOW2014 Graph (534K, 950K): Twitter mentions + retweets, ground truth based on Twitter list processing
  139. 139. Evaluation: ASU-YouTube • ASU-YouTube (1.1M, 3M): YouTube subscriptions, ground truth based on membership to groups
  140. 140. Part IV: Some Possible Solutions
  141. 141. Solution 1: Disclosure Scoring Framework with Georgios Petkos
  142. 142. Problem and Motivation • Several studies have shown that privacy is a challenging issue in OSNs. •Madejski et al. performed a study with 65 users asking them to carefully examine their profiles → all of them identified a sharing violation. • Information about a user may appear not only explicitly, but also implicitly, and may therefore be inferred (also think of institutional privacy). • Different users have different attitudes towards privacy and online information sharing (Knijnenbourg, 2013). Madejski et al., “A study of privacy setting errors in an online social network”. PERCOM, 2012 Knijnenbourg, “Dimensionality of information disclosure behavior”. IJHCS, 2013
  143. 143. Disclosure Scoring “A framework for quantifying the type of information one is sharing, and the extent of such disclosure.” Requirements: • It must take into account the fact that privacy concerns are different across users. • Different types of information have different significance to users. • Must take into account both explicit and inferred information.
  144. 144. Related Work • Privacy score [Liu10]: based on the concepts of visibility and sensitivity • Privacy Quotient and Leakage [Srivastava13] • Privacy Functionality Score [Ferrer10] • Privacy index [Nepali13] • Privacy Scores [Sramka15]
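To make the visibility-times-sensitivity idea concrete, here is a toy sketch of a Liu-style disclosure score aggregated over attributes. The attribute names, weights, and the confidence discount for inferred attributes are invented for illustration; this is not the USEMP PScore definition.

```python
# Toy sketch of a visibility-x-sensitivity disclosure score (illustrative, not PScore itself).
# Each attribute carries: sensitivity in [0,1], visibility in [0,1] (how widely it is exposed),
# and confidence in [0,1] (1.0 if explicitly disclosed, the classifier score if inferred).
attributes = {
    "home_location":   {"sensitivity": 0.8, "visibility": 0.9, "confidence": 1.0},   # explicit
    "political_views": {"sensitivity": 0.7, "visibility": 0.4, "confidence": 0.75},  # inferred
    "employment":      {"sensitivity": 0.3, "visibility": 0.6, "confidence": 0.9},
}

def disclosure_score(attrs):
    # Sum of sensitivity * visibility, discounted by how confident the (inferred) value is.
    return sum(a["sensitivity"] * a["visibility"] * a["confidence"] for a in attrs.values())

print(round(disclosure_score(attributes), 3))
```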
  145. 145. Types of Personal Information aka Disclosure Dimensions
  146. 146. Overview of PScore [framework diagram: observed data (URLs, likes, posts) → inference algorithms → disclosure dimensions and user attributes] • Explicitly Disclosed / Inferred • Value / Predicted Value • Confidence of Prediction • Level of Sensitivity • Level of Disclosure • Reach of Disclosure
  147. 147. Example
  148. 148. Visualization Bubble color/size proportional to disclosure score → red/big corresponds to more sensitive/risky
  149. 149. Visualization Hierarchical exploration of types of personal information. http://usemp-mklab.iti.gr/usemp/
  150. 150. Solution 2: Personalized Privacy-Aware Image Classification with Eleftherios Spyromitros-Xioufis and Adrian Popescu (CEA-LIST)
  151. 151. Privacy-Aware Image Classification • Photo sharing may compromise privacy • Can we make photo sharing safer? • Yes: build “private” image detectors that alert whenever a “private” image is shared • Personalization is needed because privacy is subjective! - Would you share such an image? - Does it depend on with whom?
  152. 152. Previous Work, and Limitations • Focus on generic (“community”) notion of privacy • Models trained on PicAlert [1]: Flickr images annotated according to a common privacy definition • Consequences: • Variability in user perceptions not captured • Over-optimistic performance estimates • Justifications are barely comprehensible [1] Zerr et al., I know what you did last summer!: Privacy-aware image classification and search, CIKM, 2012.
  153. 153. Goals of the Study • Study personalization in image privacy classification • Compare personalized vs. generic models • Compare two types of personalized models • Semantic visual features • Better justifications and privacy insights • YourAlert: more realistic than existing benchmarks
  154. 154. Personalization Approaches • Full personalization: • A different model for each user, relying only on their feedback • Disadvantage: requires a lot of feedback • Partial personalization: • Models rely on user feedback + feedback from other users • Amount of personalization controlled via instance weighting
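A sketch of the "partial personalization" idea with scikit-learn: train one model per user on their own examples plus a generic pool, with the user-specific examples weighted w times higher via sample weights. The weight w and the logistic regression classifier are illustrative choices, not the study's exact setup.

```python
# Sketch of a partially personalized privacy classifier: generic (PicAlert-style) examples
# plus one user's own examples, with the user's examples weighted w times higher.
import numpy as np
from sklearn.linear_model import LogisticRegression

def hybrid_privacy_model(X_generic, y_generic, X_user, y_user, w=2.0):
    X = np.vstack([X_generic, X_user])
    y = np.concatenate([y_generic, y_user])
    weights = np.concatenate([np.ones(len(y_generic)), w * np.ones(len(y_user))])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

# w close to 0 approaches a purely generic model; a large w approaches a fully personalized one.
```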
  155. 155. Visual and Semantic Features • vlad [1]: aggregation of local image descriptors • cnn [2]: deep visual features • semfeat [3]: outputs of ~17K concept detectors • Trained using cnn • Top 100 concepts per image [1] Spyromitros-Xioufis et al., A comprehensive study over vlad and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014. [2] Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, ArXiv, 2014. [3] Ginsca et al., Large-Scale Image Mining with Flickr Groups, MultiMedia Modeling, 2015.
  156. 156. Explanations via Semfeat • Semfeat can be used to justify predictions • A tag cloud of the most discriminative visual concepts • Explanations may often be confusing • Concept detectors are not perfect • Semfeat vocabulary (ImageNet) is not privacy-oriented [example tag cloud: knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian]
  157. 157. semfeat-LDA: Enhanced Explanations • Project semfeat to a latent space (second level semantic representation) • Images treated as text documents (top 10 concepts) • Text corpus created from private images (Pic+YourAlert) • LDA is applied to create a topic model (30 topics) • 6 privacy-related topics are identified (manually): children (dribbler, child, godson, wimp, niece); drinking (drinker, drunk, tipper, thinker, drunkard); erotic (slattern, erotic, cover-girl, maillot, back); relatives (great-aunt, second-cousin, grandfather, mother, great-grandchild); vacations (seaside, vacationer, surf-casting, casting, sandbank); wedding (groom, bride, celebrant, wedding, costume)
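A sketch of the semfeat-LDA step: each image's top detected concepts are treated as a tiny text document and a 30-topic LDA model is fit with scikit-learn. The concept strings below are placeholder examples, and the exact preprocessing in the study may differ.

```python
# Sketch: second-level semantic representation via LDA over each image's top semfeat concepts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" = the top concepts detected in one image (placeholder examples).
image_concept_docs = [
    "groom bride celebrant wedding costume",
    "drinker drunk tipper cigar-smoker smoker",
    "seaside vacationer sandbank surf-casting casting",
]

vec = CountVectorizer(token_pattern=r"[^\s]+")   # keep hyphenated concept names intact
X = vec.fit_transform(image_concept_docs)
lda = LatentDirichletAllocation(n_components=30, random_state=0).fit(X)
doc_topics = lda.transform(X)                    # per-image topic distribution
```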
  158. 158. semfeat-LDA: Example [figure: 1st-level semantic representation (knitwear, young-back, hand-glass, cigar-smoker, smoker, drinker, Freudian) mapped to a 2nd-level, topic-based semantic representation]
  159. 159. YourAlert: A Realistic Benchmark • User study • Participants annotate their own photos (informed consent, only extracted features shared) • Annotation based on the following definitions: • Private: “would share only with close OSN friends or not at all” • Public: “would share with all OSN friends or even make public” • Resulting dataset: YourAlert • 1.5K photos, 27 users, ~16 private/40 public per user • Main advantages: •Facilitates realistic evaluation of privacy models •Allows development of personalized models Publicly available at: http://mklab.iti.gr/datasets/image-privacy/
  160. 160. Generic Models: PicAlert vs. YourAlert
  161. 161. Key Findings • Almost perfect performance for PicAlert with CNN • semfeat performs similarly to CNN • Significantly worse performance for YourAlert • Similar performance for all features • Additional findings • Using more generic training examples does not help • Large variability in performance across users
  162. 162. Personalized privacy models • Evaluation carried out on YourAlert • A modified k-fold cross-validation for unbiased estimates • Personalized model types • ‘user’: only user-specific examples from YourAlert • ‘hybrid’: a mixture of user-specific examples from YourAlert and generic examples from PicAlert • User-specific examples are weighted higher
  163. 163. Evaluation of Personalized Models PicAlert YourAlert u1 3-fold cv k=1 test set u2 u3 Model type: ‘user’
  164. 164. Evaluation of Personalized Models PicAlert YourAlert u1 3-fold cv k=1 test set u2 u3 Model type: ‘user’
  165. 165. Evaluation of Personalized Models PicAlert YourAlert u1 3-fold cv k=1 test set u2 u3 Model type: ‘user’
  166. 166. Evaluation of Personalized Models [same setup as above] Model type: ‘hybrid w=1’
  169. 169. Evaluation of Personalized Models [same setup as above] Model type: ‘hybrid w=2’
  172. 172. Results
  173. 173. Privacy Insights via Semfeat [example: concepts most associated with private photos: child, mate, son; with public photos: uphill, lakefront, waterside]
  174. 174. Identifying Recurring Privacy Themes • A prototype semfeat-LDA vector for each user • The centroid of the semfeat-LDA vectors of their private images • K-means (k=5) clustering on the prototype vectors
  175. 175. Would you share the following? With whom would you share the photos in the following slides: a) family b) friends c) colleagues d) your Facebook friends e) everyone (public)
  176. 176. Part V: Future Directions
  177. 177. Towards Private Multimedia Systems We should: • Research methods to help mitigate risks and offer choice. • Develop privacy policies and APIs that take into account multimedia retrieval. • Educate users and engineers on privacy issues. ...before panic slows progress in the multimedia field.
  178. 178. The Role of Research Research can help: • Describe and quantify risk factors • Visualize and offer choices in UIs • Identify privacy-breaking information • Filter out “irrelevant information” through content analysis
  179. 179. Reality Check Can we build a privacy-proof system? No. We can’t build a theft-proof car either. However, we can make a system more or less privacy-proof, just as we can make a car more or less theft-proof.
  180. 180. Emerging Issue: Internet of Things Graphic by Applied Materials using International Data Corporation data.
  181. 181. Emerging Issue: Wearables Source: Amish Gandhi via SlideShare
  182. 182. Multimedia Things • Much of the IoT data collected is multimedia data. •Requires (exciting!) new approaches to real-time multimedia content analysis. → •Presents new threats to security and privacy. → •Requires new best practices for Security and Privacy by Design and new privacy enhancing technologies (PETs). → •Presents opportunities to work on privacy enhancements to multimedia!
  183. 183. Example IoT Advice From Future of Privacy Forum • Get creative with using multimedia affordances (visual, audio, tactile) to alert users to data collection. • Respect for context: Users may have different expectations for data they input manually and data collected by sensors. • Inform users about how their data will be used. • Choose de-identification practices according to your specific technical situation. •In fact, multimedia expertise can contribute to improving de- identification! • Build trust by allowing users to engage with their own data, and to control who accesses it. Source: Christopher Wolf, Jules Polonetsky, and Kelsey Finch, A Practical Privacy Paradigm for Wearables. Future of Privacy Forum, 2015.
  184. 184. One Privacy Design Practice Above All Think about privacy (and security) as you BEGIN designing a system or planning a research program. Privacy is not an add-on!
  185. 185. Describing Risks A Method from Security Research • Build a model for potential attacks as a set of: • attacker properties • attack goals • Proof your system against it as much as possible. • Update users’ expectations about residual risk.
  186. 186. Attacker Properties: Individual Privacy • Resources • individual/institutional/moderate resource • Target Model • targeted individual/easiest k of N/everyone • Database access • full (private, public) data access/well-indexed access/poorly indexed access/hard retrieval/soft retrieval (multimedia)
  187. 187. Goals of Privacy Attacks • Cybercasing (attack preparation) • Cyberstalking • Socio-Economic profiling • Espionage (industry, country) • Cybervetting • Cyberframing
  188. 188. Towards Privacy-Proof MM Systems • Match users’ expectations of privacy in system behavior (e.g. include user evaluation) • If that’s not possible, educate users about risks • Ask yourself: What is the best trade-off for the users between privacy, utility, and convenience? • Don’t expose as much information as possible, expose only as much information as is required!
  189. 189. Engineering Rules From the Privacy Community • Inform users of the privacy model and quantify the possible audience: • Public/link-to-link/semi-public/private • How many people will see the information (avg. friends-of-friends on Facebook: 70k people!) • If users expect anonymity, explain the risks of exposure • Self-posting of PII, hidden meta-data, etc. • Provide tools that make it easier to stay (more) anonymous based on expert knowledge (e.g. erase EXIF)
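For example, stripping hidden metadata before upload can be as simple as re-saving only the pixel data, which leaves the original EXIF block (including GPS) behind. A minimal sketch with Pillow; the file names are placeholders and a production tool would handle more formats and edge cases.

```python
# Minimal sketch: strip EXIF (incl. GPS) by re-saving only the pixel data with Pillow.
from PIL import Image

def strip_metadata(src, dst):
    img = Image.open(src)
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))   # copy pixels only; EXIF/GPS is not carried over
    clean.save(dst)

strip_metadata("photo.jpg", "photo_clean.jpg")
```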
  190. 190. Engineering Rules from the Privacy Community • Show users what metadata is collected by your service/app and to whom it is made available (AKA “Privacy Nutrition Label”) • At the least, offer an opt-out! • Make settings easily configurable (Facebook is not easily configurable) • Offer methods to delete and correct data • If possible, trigger search engine updating after deletion • If possible, offer “deep deletion” (i.e. delete re-posts, at least within-system)
  191. 191. Closing Thought Exercise: Part 1 Take two minutes to think about the following questions: • What’s your area of expertise? What are you working on right now? • How does it interact with privacy? What are the potential attacks and potential consequences? • What can you do to mitigate negative privacy effects? • What can you do to educate users about possible privacy implications?
  192. 192. Closing Thought Exercise: Part 2 • Turn to the person next to you and share your thoughts. Ask each other questions! • You have five minutes.
  193. 193. Acknowledgments Work together with: • Jaeyoung Choi, Luke Gottlieb, Robin Sommer, Howard Lei, Adam Janin, Oana Goga, Nicholas Weaver, Dan Garcia, Blanca Gordo, Serge Egelman, and others • Georgios Petkos, Eleftherios Spyromitros-Xioufis, Adrian Popescu, Rob Heyman, Georgios Rizos, Polychronis Charitidis, Thomas Theodoridis and others
  194. 194. Thank You! Acknowledgements: • This material is based upon work supported by the US National Science Foundation under Grant No. CNS-1065240 and DGE-1419319, and by the European Commission under Grant No. 611596 for the USEMP project. • Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding bodies.
