Introducing Differential Privacy

Secure & Private AI (Introduction)

  • Advances in AI are held back by a lack of data. Research breakthroughs and answers to life-changing problems need data that large enterprises avoid sharing for security and privacy reasons. That is why we need ways to solve modern problems using private data.

Differential Privacy (Introduction)

  • Differential privacy is the science that enables us to build Deep Learning models that learn only what they are supposed to learn from the data. Further lessons will cover some of the state-of-the-art techniques that make this possible.

What is Differential Privacy at a high level

  • The origin of DP: it started in 2003 with statistical database queries and has more recently been applied to fields like Machine Learning.
  • Meaning of DP: in common terms, the goal of DP is to ensure that statistical analyses applied to private data do not compromise any individual's information.

1 - DEFINITION OF PRIVACY:

  • First Def: Privacy is preserved if:

    "After a statistical analysis, the analyzer doesn't know anything about the people in the data-set; they remain 'unobserved'."

    - This definition is too strict for statistical analysis, whose purpose is to learn helpful information about the data-set as a whole without learning specific things about individuals that might be harmful or sensitive.

  • Second Def: "Anything that can be learned about a participant from a statistical database can be learned without access to the database."

    - This definition fails for several reasons:

    • It would not allow us to learn anything from private data-sets, defeating the reason to use them at all.
    • It encourages learning and propagating private information that has already been made public, which is in the best case impossible to know and in the worst case spreads harmful information.
  • Third Def (Cynthia Dwork, The Algorithmic Foundations of Differential Privacy): "Differential privacy describes a promise, made by a data holder, or curator, to a data subject, and the promise is like this: 'You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data-sets, or information sources are available.'"

Can we just Anonymize Data?

  • Even if we anonymize our data, someone else may release a related data-set that makes it possible to "de-anonymize" ours. The most famous example is the Netflix Prize, a competition with a one-million-dollar prize, where researchers from the University of Texas de-anonymized users in the anonymized Netflix data-set by finding similarities between it and the public IMDb movie rating data-set.
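
To make the idea of linking two data-sets concrete, here is a minimal sketch in plain Python. It is not the method the researchers used; it is a toy illustration that assumes ratings match exactly, and every name, ID, and rating in it is invented.

```python
# Toy "anonymized" release: real names replaced by opaque IDs (invented data).
anonymized = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    "user_42": {"Movie A": 1, "Movie D": 3, "Movie E": 5},
}

# Toy public profiles that carry real names (invented data).
public = {
    "alice": {"Movie A": 5, "Movie C": 4},
    "bob": {"Movie D": 3, "Movie E": 5},
}

def overlap(ratings_a, ratings_b):
    # Number of (movie, rating) pairs that appear in both profiles.
    return len(set(ratings_a.items()) & set(ratings_b.items()))

def link(public_profiles, anonymized_records):
    # For each named profile, pick the anonymized ID with the largest overlap.
    matches = {}
    for name, ratings in public_profiles.items():
        best_id = max(
            anonymized_records,
            key=lambda uid: overlap(ratings, anonymized_records[uid]),
        )
        matches[name] = best_id
    return matches

matches = link(public, anonymized)
print(matches)  # {'alice': 'user_17', 'bob': 'user_42'}
```

Neither data-set contains a name next to a full rating history, yet combining them ties each name to an "anonymous" record, including ratings that were never made public.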

Introduction to Canonical Database

Simple database: a database with a single column and one row for each person.

import torch

# the number of entries in our dataset
num_entries = 5000

# one row per person; each entry is True or False with probability 0.5
db = torch.rand(num_entries) < 0.5

db

The question is how to define privacy in the context of this simple database. Given that we are performing some type of query against the data-set, privacy is preserved for a person if removing them from the database does not change the query output, which means that the person was not leaking any information into the output of the query.
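
As a small illustration of this definition, the sketch below runs a query on a tiny hand-written database, with and without one person. The sum query and the name sum_query are assumptions chosen for the example; the text above does not fix a particular query.

```python
import torch

def sum_query(db):
    # Example query (an assumption for illustration): count the entries equal to 1.
    return db.sum().item()

# Tiny hand-written database so the result is easy to check by eye.
db = torch.tensor([1, 0, 1, 1, 0], dtype=torch.uint8)

# Remove the person at index 2 with a boolean mask.
person = 2
db_without_person = db[torch.arange(db.size(0)) != person]

full_result = sum_query(db)
reduced_result = sum_query(db_without_person)

print(full_result, reduced_result)  # 3 2
```

The output changes when this person is removed, so under the definition above the sum query leaks information about them.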

The next question we might ask is: could we construct a query whose output doesn't change no matter who we remove from the data-set?

Project Intro: Build a Private Database in Python

Assignment 1: Generate Parallel Databases

My solution for this:

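import torch

# function that removes entry "i" from the database
def remove_entry(db, i):
    # keep every index except i
    return db[torch.arange(db.size(0)) != i]

# function that creates a database and all of its parallel databases
def create_db_and_parallels(num_entries):
    # random database of 0/1 entries
    db = (torch.rand(num_entries) > 0.5).type(torch.uint8)
    # one parallel database per removed entry
    parallel_dbs = [remove_entry(db, i) for i in range(num_entries)]
    return db, parallel_dbs

db, parallel_dbs = create_db_and_parallels(5000)
# each parallel database has one entry fewer than the original
print(parallel_dbs[0].shape)  # torch.Size([4999])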
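
A quick way to sanity-check this solution is to confirm the number and size of the parallel databases, then see how much a query changes across them. The sketch below is self-contained and uses a small database; the slicing-based remove_entry and the sum query are assumptions for illustration, not part of the assignment.

```python
import torch

def remove_entry(db, i):
    # Alternative to the boolean mask: join the parts before and after index i.
    return torch.cat((db[:i], db[i + 1:]))

def create_db_and_parallels(num_entries):
    db = (torch.rand(num_entries) > 0.5).type(torch.uint8)
    parallel_dbs = [remove_entry(db, i) for i in range(num_entries)]
    return db, parallel_dbs

torch.manual_seed(0)  # fixed seed so repeated runs use the same database
num_entries = 20
db, parallel_dbs = create_db_and_parallels(num_entries)

# There is one parallel database per person, each with one entry fewer.
assert len(parallel_dbs) == num_entries
assert all(p.size(0) == num_entries - 1 for p in parallel_dbs)

# Compare a sum query on the full database with every parallel database.
full_result = db.sum().item()
max_change = max(abs(full_result - p.sum().item()) for p in parallel_dbs)
print(max_change)
```

Because every entry is 0 or 1, removing one person can change the sum by at most 1.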

The answer is, surprisingly, yes! The field of Private AI, and more specifically techniques like Differential Privacy and Federated Learning, allows us to do exactly this. The field is still in its early days and has not yet reached its peak of research breakthroughs.

This field will be very important in the future because so much data is held privately by big companies, which hinders research advances in AI.