When using any AI tool, privacy, security and compliance need to be considered carefully. This guide outlines a framework for the key points to consider, and describes the measures we have taken with CoLoop to ensure your data privacy, ownership and ultimate control are fully respected. The guide is ordered to begin with the issues and questions most specific to AI before moving on to more generic ones.

Third-party Vendors

3rd party vendors are a particular consideration when evaluating AI tools, because the vast majority of AI application providers rely in some way on privately hosted “Foundational Models”.

Foundational models, GPT and OpenAI

  • Foundational Models are large AI models, with millions to billions of parameters, trained on a vast corpus of data.
  • They are often the starting point for adaptation to downstream tasks, e.g. transcription, summarisation, search, etc.
  • The most famous of these at the time of writing are the GPT-n* series of models from OpenAI

So why does this matter?

  • The source code and “weights” (the specific configuration obtained through training) for the GPT-n series are privately held and closely guarded IP*
  • At the time of writing no other AI company possesses models with performance equivalent to the GPT-n models across almost all categories (this may be about to change with Llama-2)
  • In short, for most, access to best-in-class technology only comes through privately operated foundational models whose precise configuration is deliberately obscured

*GPT-2 is publicly available

What to look out for in new AI tools?

  • When looking at any new AI tool it is important to review which 3rd party vendors they are working with.
  • You should check that they have adequate provisions in place to guarantee that those 3rd parties are subject to restrictions at least as strict as those applying to the company providing the tool.
  • This could be…
    • Adherence to industry standards (ISO / SOC)
    • Written guarantees
    • Data processing agreements (read more about this)
  • These agreements should align with the developer’s guarantees about the tool, i.e. if they say they aren’t using your data for training, the 3rd party agreements should state that as well!

What do we do at CoLoop?

  1. Equivalent Provisions: All 3rd party providers we work with are bound by provisions equivalent to those in our own Data Processing Addendum (DPA) and Privacy Policy. This includes OpenAI, who are contractually restricted from using any data they come into contact with to improve their products and services.
  2. Data Minimization: All systems are engineered to limit access to data strictly to what their function requires. Data is only shared with each service where required.
  3. Vendor Management: All 3rd parties are vetted to ensure compliance with our standards when being considered.
  4. Physical Access Controls: As a cloud service provider we rely on hosting providers such as AWS and GCP for the physical facilities used to store and process user data, and on those providers’ physical access controls to prevent unauthorized access to the facilities used to process or store that data.

Data Usage

Foundational Models are adapted or improved for specific tasks through training. A base model such as GPT-4 could be improved at answering questions about life sciences by fine-tuning it on many thousands of pieces of text, say academic papers about life sciences. The resulting model would perform better on text-based tasks relating to life sciences.
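To make the idea concrete, here is a minimal, hedged sketch of domain fine-tuning using the open-source Hugging Face transformers and datasets libraries and the publicly available GPT-2 model; the corpus file name and hyperparameters are hypothetical, and this is not how any vendor’s proprietary GPT-n models are actually adapted:

```python
# Illustrative fine-tuning sketch: adapting a small public base model (GPT-2)
# to a domain corpus. File path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical corpus: one life-science abstract per line in a plain text file.
corpus = load_dataset("text", data_files={"train": "life_science_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-life-sciences", num_train_epochs=1),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # after training, the weights reflect patterns in the domain corpus
```

The key point for privacy is that training changes the model’s weights, so whatever is in the training corpus becomes part of the model.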

How does this impact privacy?

  • Language models are primarily trained using Next Token Prediction, which essentially means predicting which sub-word comes next in a sentence, e.g. “The cat sat on the …”
  • If the underlying data it was trained on contained many references to a piece of personally identifiable information, e.g. somebody’s name, “John Doe”;
  • When prompted with “John” it would become more likely to predict “Doe” as the next word.
  • This could have disastrous consequences, as the underlying model would then be liable to ‘leak’ pieces of its training data, such as “John Doe” and whatever else he was mentioned in conjunction with, if prompted correctly (see the toy example below).
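To illustrate the mechanism with a toy example (raw counts rather than a neural network, and entirely made-up “training data”), the snippet below builds a simple next-word predictor; the same effect, at a vastly larger scale, is what allows a language model to memorise and regurgitate personal data it was trained on:

```python
from collections import Counter, defaultdict

# Toy "training data" containing repeated references to a fictional person.
training_text = (
    "john doe visited the clinic . john doe reported symptoms . "
    "the cat sat on the mat . john doe lives in london ."
).split()

# Count how often each word follows another (a simple bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str):
    """Return the most likely next word given the training counts."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("john"))  # -> "doe": the association has been memorised
```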

How do I know whether a service is learning from my data?

  • You need to check the privacy policy and any other associated service agreements or data processing agreements for something to the effect of “your data may be used for the improvement of our products and services”
  • It is important to note that simply passing data into an already-trained AI model to get a result (termed ‘inference’) does not cause it to learn from your data

What to look out for?

  • Make sure the tools you are looking at are clear about whether your data will be used for training
  • It is important to be specific on this point!
    • Many products gather anonymised analytics for reasonable purposes such as KPI tracking, e.g. how many people logged in this week.
    • Often these in-product analytics are too noisy or too sparse to have any training value
  • Companies are extremely unlikely to be actively dishonest on this point
  • The penalties for misuse of customer data in this fashion are severe and increasingly under scrutiny
  • Do also remember that these policies can be updated from time to time - you should be notified in writing when this occurs!

What do we do at CoLoop?

Transparency

AI models are in many ways still a black box. As with the human brain, researchers are aware of their properties and behaviour: they know broadly which neurons light up in response to certain stimuli and how they adapt under training, and they know what happens when broad groups of neurons are disabled and how that can affect the performance of the overall system on a set of baseline tasks. What they typically lack is a deep, fundamental understanding of the internal processes that turn inputs into outputs. Also, as with the human brain, this doesn’t stop somebody from interpreting the output of the system perfectly well.

When evaluating a system it is important to draw a distinction between explainability and interpretability, which, according to “Explainable AI: A Review of Machine Learning Interpretability Methods”, are defined as:

1. Interpretability: “the ability to explain or to present in understandable terms to a human … The more interpretable a machine learning system is, the easier it is to identify cause-and-effect relationships within the system’s inputs and outputs”.

2. Explainability: the depth to which the internal procedures that take place during training and inference are understood.

When is this a problem?

  • AI systems can ‘make decisions’ that impact the lives of real people, for instance: deciding whether somebody is approved for a mortgage, whether a patient is eligible for a certain treatment, or whether an autonomous car should protect the pedestrian or the driver…
  • Most individuals and legal frameworks increasingly, and rightly, expect these systems to be able to explain their decision making to affected parties. (“Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges”)
  • To truly realise the potential of AI there are many situations where it would be necessary to defer entire decisions to a machine; however, we cannot live in a world where the rationalisation for life-changing decisions becomes “because the computer said no…”

When is this not a problem?

  • The popular model right now for AI tooling is the ‘copilot’ or assistant that preserves the agency of the professional while supporting them in decision making
  • While the inner workings of these systems may not be strictly explainable they remain reasonably interpretable
  • This means it is possible for a user to rationalise about where an output came from given the input, for instance by showing the two side by side.
  • This is a popular approach with large language models that I would refer to as ‘post-hoc’ interpretability (“Explainable AI: A Review of Machine Learning Interpretability Methods”)

What should I look for in AI tools to make sure this is addressed?

  • When looking at an AI tool it is important to make sure you can, at the very least, rationalise in a post-hoc manner where an output came from.
  • Is it really collaborating with you like a junior researcher, or are you required to treat the outputs of a less qualified colleague as gospel, without question?

What do we do at CoLoop?

  • At CoLoop we link every single generated output right back to the original source material
  • You can even click through and see the quote in-situ in the transcript and replay the original audio if you want to
  • We also provide up and down votes with every generation to allow users to flag when the result does not align with their interpretation
  • These votes are tracked internally as a KPI and used as a benchmark against which we compare future updates

AI Ethics (Bias / Accuracy)

  • The debate around AI ethics in research, beyond the points discussed above, focuses largely on bias.
  • Bias in AI models is defined as systematic misrepresentations, attribution errors, or factual distortions that favor certain groups or ideas, perpetuate stereotypes, or make incorrect assumptions based on learned patterns. (“Should ChatGPT be Biased? Challenges and Risks of Bias in Large Language Models”)
  • Language models are prone to absorb biases present in the textual data they are trained on.
  • These biases can often manifest as “hallucinations” (also called “confabulations” or “delusions”): confident-sounding generations that contain factual inaccuracies. Common categories of bias include:
  1. Demographic bias: over- / under-representation of certain groups
  2. Cultural bias: perpetuation of stereotypes
  3. Linguistic bias: overrepresentation of dominant languages, e.g. English, over ‘low resource’ languages, e.g. Swahili
  4. Temporal bias: lack of up-to-date information
  5. Confirmation bias: reinforcement of pre-existing stereotypes
  6. Ideological / Political bias: preference for a particular political view or ideology
  • At present, significant effort is expended by large AI model providers such as OpenAI to address this through “AI alignment”
  • It is also worth noting that AI models suffer to a much lesser extent from the cognitive biases that affect human researchers, such as in-group bias, relativity bias and others

How are these effects mitigated in AI research tools?

  • There are many approaches that can be taken to mitigate the effects of bias in AI models; some of these include:
  1. Fine tuning / alignment: special training regimes undertaken by language model providers to eliminate biases or other harmful outputs (try asking ChatGPT which political party it voted for!)
  2. Curated datasets: review and curation of datasets before training to ensure they are balanced
  3. Post-hoc emergent analysis: analysis of the output of models against a test set of inputs (e.g. prompts for language models) to detect biases, which can then be eliminated with fine-tuning or corrected with a second pass over the generated output
  4. Grounded Generation: providing the model with additional instructions, context or information to base its generation on
  5. Traceable / Interpretable outputs: building models into AI tools in a way that enables researchers to reason about where a generation came from, to ensure transparency (see the sketch after this list)
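As a concrete illustration of points 4 and 5, here is a minimal sketch of grounded generation with traceable outputs: a prompt is built from numbered source quotes and the model is asked to cite a source id after every claim. The quotes, ids and wording are hypothetical, and this is a generic pattern rather than a description of any particular product’s implementation:

```python
# Illustrative grounded-generation prompt builder. The sources are placeholders;
# a real tool would retrieve quotes from transcripts or documents.
sources = [
    {"id": "S1", "quote": "I mostly shop online because delivery is faster."},
    {"id": "S2", "quote": "I stopped visiting the store after they cut opening hours."},
]

def build_grounded_prompt(question: str, sources: list) -> str:
    """Compose a prompt that grounds the answer in numbered sources and
    asks the model to cite the source id after every claim."""
    context = "\n".join(f"[{s['id']}] {s['quote']}" for s in sources)
    return (
        "Answer the question using ONLY the sources below. "
        "Cite the source id, e.g. [S1], after every claim.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt("Why are customers visiting the store less?", sources)
print(prompt)  # this prompt would then be sent to a language model for completion
```

Because every claim in the answer must point back to a numbered source, a researcher can check the ‘reasoning’ against the original material rather than taking the output on trust.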

What to look out for in AI tools?

  • Does the tool use AI models that have been aligned / evaluated for bias, e.g. Claude, GPT-n etc.?
  • Are these models deployed in a manner that keeps researchers in the loop to check outputs for existing or residual bias even after alignment?
  • Are outputs traceable or interpretable, so the system’s ‘reasoning’ can be interpreted and checked?

How do we address this at CoLoop?

  • At CoLoop we deploy all of the steps outlined above to limit the effects of bias on our outputs
  • All generations are fully interpretable and can be traced back to their sources
  • We are additionally working to implement more post-hoc analysis on top of what we already offer to further limit the prevalence of these types of bias

Data Privacy & Security

How is the confidentiality of data ensured at CoLoop?

At CoLoop, data privacy and security are immensely important to us, and we are currently deployed with clients in sensitive areas such as government and healthcare. Some of the measures we take to guarantee this include the following:

  • All data is encrypted at rest and in transit using AES-256 and TLS or equivalent (see the sketch after this list)
  • Systems are built and configured to secure industry standards such as NIST, including regular audits
  • Data is stored and processed on machines housed in secure tier 3 and 4 data centres hosted by GCP and AWS
  • Data is logically siloed from other companies’ data
  • Legally binding agreements are in place (Privacy Policy, Data Processing Addenda and NDAs) to protect client information
  • Unnecessary PII elements are removed from data before processing by different components, to ensure data minimisation
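For readers who want a concrete picture of what “encrypted at rest with AES-256” means, here is a minimal, generic sketch using the open-source Python cryptography library. It is illustrative only and not a description of CoLoop’s internal implementation; in real deployments, key management is handled by the cloud provider’s key management services rather than in application code:

```python
# Generic AES-256-GCM encryption sketch (illustrative; not production key management).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # 256-bit key -> AES-256
aesgcm = AESGCM(key)

plaintext = b"interview transcript: respondent 12, session 3 ..."
nonce = os.urandom(12)                     # must be unique for every encryption
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# Decryption recovers the original bytes only with the same key and nonce.
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext
```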

Data Localization*

Data localization with any SaaS tool has become an important topic in recent times. Different jurisdictions have different rules for what can be done with data derived from citizens living under their laws, e.g. CCPA in the US, GDPR in the EEA and UK GDPR in the UK. Most regulation kicks in when data transfers take place.

  • Data transfers are the export of data from one jurisdiction to another
  • The concern around this point for regulators is not so much where the data actually resides but rather what laws it is subject to
  • In this section I will discuss specifically how this affects a UK / EEA research agency subject to GDPR

Is it GDPR compliant if the data is hosted in the US?

  • If the service you are interacting with is owned / operated by a UK / EEA based software company, then even if they host your data on servers physically outside this region, no cross-border transfer has occurred
  • Provided they take reasonable steps to minimise the data that is processed, collected and stored, and do so in a secure fashion, this can still be GDPR compliant.
  • By being registered in the UK / EEA they are still subject to and within the reach of UK / EEA GDPR

What if the software company is using US based 3rd parties like OpenAI?

  • If the service you are interacting with uses 3rd parties that are US owned / operated, then it should establish data processing agreements that contain standard contractual clauses (SCCs) with those 3rd parties.
  • The US based 3rd party should also be registered under the US Data Privacy Framework to establish a route for recourse if they do not fulfill their obligations
  • You should ask for a list of US based 3rd parties and establish whether all of them have entered into data processing agreements that contain SCCs and whether they are properly registered to handle data subject to UK / EEA GDPR.
  • Please note there are other more complex mechanisms to ensure compliance such as Binding Corporate Rules however these are outside the scope of this article.

How do we address this at CoLoop?

  • Data is processed by CoLoop outside of the EEA / UK in a GDPR-compliant manner
  • Data minimisation features are enabled within CoLoop to enable users to store high classification Personal Data in the region of origin
  • All data is encrypted at rest and in transit using AES-256 and TLS respectively
  • All third parties used by CoLoop have entered into binding DPAs with SCCs; you can find a list of these here
  • Refer to the section on privacy and security to understand more about the security features we implement to ensure your data remains secure.

*In order to answer this question we consulted a UK based law firm; this section is based on their responses to our questions.

Data Retention

Providers of AI tools may wish to retain data for training, evaluation and, in some cases, prevention of abuse. Information on this can usually be found in the privacy policy published by the provider of the AI tool. You should make sure you are clear on the retention policy and how to go about removing your data. It is worth noting that good engineering practice also involves storing backups, which may also need to be removed.

What do we do at CoLoop?

At CoLoop, data is only processed or retained on your instruction. Personal Information and Client Data are deleted within at most 7 days of a request to CoLoop, whether written or submitted via the platform.