Security & Ethics - how does CoLoop remain secure
Discover how CoLoop maintains robust security measures and ethical standards to protect user data. Learn about third-party vendor management, data localization compliance, and approaches to mitigate AI biases in research tools.
At CoLoop, we are committed to maintaining the highest standards of privacy, security and compliance. In this document, we outline the key points and describe the measures CoLoop puts in place to ensure your data privacy, ownership and ultimate control are respected.
Third-party Vendors
3rd party vendors are a particular consideration when evaluating AI tools, because the vast majority of AI application providers rely in one way or another on privately hosted “Foundational Models”.
Foundational models, GPT and OpenAI
- Foundational Models are large AI models, with millions to billions of parameters, that are trained on a vast corpus of data.
- They are often the starting point for adaptation to downstream tasks such as transcription, summarisation and search (a short sketch follows this list).
- The most famous of these at the time of writing are the GPT-n* series of models from OpenAI
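To make “adaptation for downstream tasks” concrete, here is a minimal sketch that applies a publicly released foundation model to one such task (speech-to-text transcription) using the open-source Hugging Face transformers library. The model name and audio file are illustrative assumptions only; this is not a description of how CoLoop or any specific product is built.

```python
# Minimal sketch: applying an openly released foundation model to a
# downstream task (transcription). Model name and file path are
# illustrative assumptions, not any product's actual pipeline.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # an openly released speech foundation model
)

result = transcriber("interview_recording.wav")  # hypothetical audio file
print(result["text"])
```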
So why does this matter?
- The source code and “weights” (specific configuration obtained through training) for the GPT-n series are privately held and closely guarded IP*
- At the time of writing, no other AI company possesses models with equivalent performance to the GPT-n models across almost all categories (this may be about to change with Llama-2)
- In short, for most providers, access to best-in-class technology only comes from privately operated foundational models whose precise configuration is deliberately obscured
*GPT-2 is publicly available
What to look out for in new AI tools?
- When looking at any new AI tool it is important to review which 3rd party vendors they are working with.
- You should check that they have adequate provisions in place to guarantee that those 3rd parties are subject to the same or higher restrictions as the company providing the tool.
- This could be…
- Adherence to industry standards (ISO / SOC)
- Written guarantees
- Data processing agreements (read more about this)
- These agreements should align with the developer’s guarantees about the tool, i.e. if they say they aren’t using your data for training, the 3rd party agreements should state that as well!
What do we do at CoLoop?
- Equivalent Provisions: All 3rd party providers we work with are bound by provisions equivalent to those in our own Data Processing Addendum (DPA) and Privacy Policy. This includes OpenAI, who are contractually restricted from using any data they come into contact with for the improvement of their products and services.
- Data Minimization: All systems are engineered to provide limited access to data, strictly defined by their function. Data is only shared with each service where required (a minimal sketch follows this list).
- Vendor Management: All 3rd parties are vetted against our standards when being considered, and a register of them is maintained in our AI Overview Document.
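The Data Minimization point above is an engineering pattern as much as a policy. As a generic illustration (with hypothetical field and service names, not CoLoop’s actual schema), each downstream service can be given an allow-list of the fields it strictly needs, and only those fields are ever sent:

```python
# Generic sketch of data minimisation: each downstream service only ever
# receives the fields it strictly needs. Field and service names are
# hypothetical, not CoLoop's actual schema.
from typing import Any

# Per-service allow-lists: the only fields each vendor is permitted to see.
SERVICE_ALLOWED_FIELDS = {
    "transcription_service": {"audio_url", "language"},
    "summarisation_service": {"transcript_text"},
}

def minimise(record: dict[str, Any], service: str) -> dict[str, Any]:
    """Return a copy of `record` containing only the fields `service` needs."""
    allowed = SERVICE_ALLOWED_FIELDS[service]
    return {key: value for key, value in record.items() if key in allowed}

record = {
    "audio_url": "https://example.com/interview.wav",
    "language": "en",
    "participant_name": "John Doe",            # should never leave the source system
    "participant_email": "john@example.com",   # should never leave the source system
}

payload = minimise(record, "transcription_service")
# payload == {"audio_url": "https://example.com/interview.wav", "language": "en"}
```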
Data Usage
Foundational Models are adapted or improved for specific tasks through training. A base model such as GPT-4 could be improved to answer questions about life sciences by fine-tuning it on many thousands of pieces of text from, say, academic papers about life sciences. The resulting model would perform better on text-based tasks related to life sciences.
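To make the mechanism concrete, here is a minimal fine-tuning sketch using the open-source Hugging Face transformers library, with a small openly available model (GPT-2) standing in for a foundation model and a hypothetical file of life-science abstracts as the domain corpus. It illustrates the general idea of fine-tuning only; it is not how GPT-4 itself is adapted.

```python
# Minimal fine-tuning sketch: adapt a small open foundation model (GPT-2)
# to a domain by continuing training on domain text. Illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stands in for a much larger foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain corpus: one plain-text file of life-science abstracts.
dataset = load_dataset("text", data_files={"train": "life_science_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-life-science",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the saved checkpoint is now specialised for the domain
```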
How does this impact privacy?
- Language models are primarily trained using Next Token Prediction, which means predicting which sub-word (token) comes next in a sentence, e.g. “The cat sat on the …”
- If the data a model was trained on contained many references to a piece of personally identifiable information, e.g. somebody’s name, “John Doe”:
- When prompted with “John” it would become more likely to predict “Doe” as the next word (the sketch below illustrates this mechanism).
- This could have disastrous consequences, as the underlying model would then be liable to ‘leak’ pieces of its training data, such as “John Doe” and whatever else that name was mentioned in conjunction with, if prompted in the right way.
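This mechanism is easy to see in code: a causal language model assigns a probability to every possible next token, and sequences it saw frequently during training receive higher probability. The sketch below inspects those probabilities using GPT-2 (the publicly available model mentioned earlier); it demonstrates the mechanism rather than an actual leak.

```python
# Sketch: inspect next-token probabilities of a causal language model.
# Tokens that frequently followed the prompt in training data score higher.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p:.3f}")

# If "John Doe" had appeared many times in the training data, a prompt
# ending in " John" would, in the same way, place unusually high
# probability on the token " Doe".
```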
How do I know whether a service is learning from my data?
- You need to check the privacy policy and any other associated service agreements or data processing agreements for something to the effect of “your data may be used for the improvement of our products and services”
- It is important to note that simply passing data into an AI model after it’s been trained to get a result (termed ‘Inference’) does not cause it to learn from your data
What to look out for?
- Make sure the tools you are looking at are clear about whether your data will be used for training
- It is important to be specific on this point!
- Many products gather anonymised analytics for reasonable purposes such as KPI tracking, e.g. how many people logged in this week.
- Often these in-product analytics are too noisy or too few to have any value for training
- Companies are extremely unlikely to be actively dishonest on this point
- The penalties for misuse of customer data in this fashion are extremely severe and increasingly under review
- Do also remember that these policies can be updated from time to time - you should be notified in writing when this occurs!
What do we do at CoLoop?
- At CoLoop we NEVER use your data for any kind of training.
- Data is used exclusively in the manner outlined in our Data Processing Addendum (DPA) and Privacy Policy
Transparency
AI models are in many ways still a black box. As with the human brain, researchers understand their broad properties and behaviour: they know roughly which neurons light up in response to certain stimuli and how they adapt under training, and they know what happens when broad groups of neurons are disabled in some way and how that affects the performance of the overall system on a set of baseline tasks. What they typically lack is a deep, fundamental understanding of what goes on inside that enables the outputs to be derived from the inputs. Also, just like the human brain, this doesn’t stop somebody from interpreting the output of the system perfectly well.
When evaluating a system it is important to make a distinction between explainability and interpretability, which “Explainable AI: A Review of Machine Learning Interpretability Methods” defines as follows:
1. Interpretability: “the ability to explain or to present in understandable terms to a human … The more interpretable a machine learning system is, the easier it is to identify cause-and-effect relationships within the system’s inputs and outputs”.
2. Explainability: the depth to which the internal procedures that take place during training and inference are understood.
When is this a problem?
- AI systems can ‘make decisions’ which impact the lives of real people for instance: deciding whether somebody is approved for a mortgage; deciding whether a patient is eligible for a certain treatment or deciding whether an autonomous car should protect the pedestrian or the driver…
- Most individuals and legal frameworks are increasingly and rightly expecting that these systems should be able to explain their decision making to affected parties. (“Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges”)
- To truly realise the potential of AI there are many situations where it would be necessary to defer entire decisions to a machine. We cannot, however, live in a world where the rationalisation for life-changing decisions becomes “because the computer said no…”
When is this not a problem?
- The popular model right now for AI tooling is the ‘copilot’ or assistant that preserves the agency of the professional while supporting them in decision making
- While the inner workings of these systems may not be strictly explainable they remain reasonably interpretable
- This means it is possible for a user to rationalise about where an output came from given the input by showing both of these side by side for instance.
- This is a popular approach with large language models that we would describe as ‘post-hoc’ interpretability (“Explainable AI: A Review of Machine Learning Interpretability Methods”)
What should I look for in AI tools to make sure this is addressed?
- When looking at an AI tool it is important to make sure you can at the very least rationalise in a post-hoc manner where the output comes from.
- Is it really collaborating with you, the way a junior researcher would, or are you required to treat the outputs of a less qualified colleague as gospel, without question?
What do we do at CoLoop?
- At CoLoop we link every single generated output right back to the original source material (a simplified sketch of the idea follows this list)
- You can even click through and see the quote in-situ in the transcript and replay the original audio if you want to
- We also provide up and down votes with every generation to allow users to flag when the result does not align with their interpretation
- These votes are tracked internally as a KPI and used as a benchmark against which we compare future updates
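For illustration only, the sketch below shows one simple way source-linked output can be represented: every generated claim carries pointers back to the transcript span and audio timestamps it was derived from. The field names are invented for this example and do not describe CoLoop’s internal data model.

```python
# Hypothetical sketch of source-linked generation: each generated claim
# points back to the transcript span and audio timestamps it came from.
# Field names are invented for illustration, not CoLoop's actual model.
from dataclasses import dataclass

@dataclass
class SourceReference:
    transcript_id: str
    char_start: int        # character offsets of the quote in the transcript
    char_end: int
    audio_start_s: float   # timestamps for replaying the original audio
    audio_end_s: float

@dataclass
class GeneratedClaim:
    text: str
    sources: list[SourceReference]

claim = GeneratedClaim(
    text="Participants found the onboarding flow confusing.",
    sources=[SourceReference("interview_03", 1042, 1187, 612.4, 631.0)],
)

for ref in claim.sources:
    print(f"See {ref.transcript_id}, characters {ref.char_start}-{ref.char_end}, "
          f"audio from {ref.audio_start_s:.0f}s")
```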
AI Ethics (Bias / Accuracy)
- Beyond the points discussed above, the debate around AI ethics in research focuses largely on bias.
- Bias in AI models is defined as systematic misrepresentations, attribution errors, or factual distortions that favor certain groups or ideas, perpetuate stereotypes, or make incorrect assumptions based on learned patterns. (“Should ChatGPT be Biased? Challenges and Risks of Bias in Large Language Models ”)
- Language models are prone to absorb biases present in the textual data they are trained on.
- These biases can often manifest as “Hallucinations” (also called “confabulations” or “delusions”): confident-sounding generations that contain factual inaccuracies. Common forms of bias include:
- Demographic bias: over / under representation of certain groups
- Cultural bias: perpetuation of stereotypes
- Linguistic bias: overrepresentation of dominant languages e.g. English over ‘low resource’ languages e.g. Swahili
- Temporal bias: lack of up to date information
- Confirmation bias: reinforcement of pre-existing stereotypes
- Ideological / Political bias: preference for a particular political view or ideology
- At present significant effort is expended by large AI model providers such as OpenAI to address this through “AI alignment”
- It’s also worth noting that AI models suffer to a much lesser extent from cognitive biases that affect human researchers, such as in-group bias, relativity bias and others
How are these effects mitigated in AI research tools?
- There are many approaches that can be taken to mitigate the effects of bias in AI models, some of which include:
- Fine tuning / alignment: Special training regimes undertaken by language model providers to eliminate biases or other harmful outputs (try asking ChatGPT which political party it voted for..!)
- Curated datasets: Review and curation of datasets before training to ensure they are balanced
- Post-hoc emergent analysis: analysis of model outputs against a test set of inputs (e.g. prompts for language models) to detect biases, which can then be eliminated with further fine-tuning or removed with a second pass that corrects the generated output
- Grounded Generation: Providing the model with additional instructions, context or information to base its generation on (a minimal sketch follows this list)
- Traceable / Interpretable outputs: Building models into AI tools in a way that enables researchers to reason about where the generation came from to ensure transparency
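Of the approaches above, grounded generation is the most visible to end users: rather than letting the model answer from its own (potentially biased) training data, it is handed the relevant source passages and instructed to answer only from them, citing what it used. The sketch below shows the idea with the OpenAI Python SDK; the model name, passages and prompt wording are placeholders, not CoLoop’s pipeline.

```python
# Minimal sketch of grounded generation: the model is instructed to answer
# only from supplied passages and to cite them. Model name, passages and
# prompt wording are placeholders, not CoLoop's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

passages = {
    "P1": "Respondent 4: I gave up on the sign-up form twice before finishing it.",
    "P2": "Respondent 9: The pricing page was the only part I found really clear.",
}
context = "\n".join(f"[{pid}] {text}" for pid, text in passages.items())

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-completions model; name is illustrative
    messages=[
        {"role": "system",
         "content": ("Answer ONLY from the numbered passages provided. "
                     "Cite passage IDs like [P1]. If the passages do not "
                     "contain the answer, say so.")},
        {"role": "user",
         "content": (f"Passages:\n{context}\n\n"
                     "Question: What friction did participants report "
                     "during sign-up?")},
    ],
)
print(response.choices[0].message.content)
```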
What to look out for in AI tools?
- Does the tool use AI models that have been aligned / evaluated for bias e.g. Claude, GPT-n etc.
- Are these models deployed in a manner that puts researchers in the loop to check outputs for existing or residual bias even after alignment?
- Are outputs traceable or interpretable so the system’s ‘reasoning’ can be interpreted and checked?
How do we address this at CoLoop?
- At CoLoop we deploy all of the steps outlined above to limit the effects of bias on our outputs
- All generations are fully interpretable and can be traced back to their sources
- We are additionally working to implement more post-hoc analysis on top of what we already offer to further limit the prevalence of these types of bias
Data Privacy & Security
How is the confidentiality of data ensured at CoLoop?
At CoLoop, data privacy and security are immensely important to us, and we are currently deployed with clients in sensitive areas such as government and healthcare. Although the full list of implemented controls is available on our trust center, some of the most important measures we take to guarantee this include the following:
- All data is encrypted securely at rest and in transit using AES-256 and TLS or equivalent
- We are fully compliant with SOC 2, with the certificate available on our trust center
- Systems are built and configured in line with industry standards such as those published by NIST
- Data is stored and processed on machines housed in secure data centres hosted by AWS
- Data belonging to different customers is logically segregated on our systems
- We have legally binding agreements in place to protect client information, including Privacy Policy, Data Processing Agreements and NDAs
- We make it possible for clients to remove unnecessary PII data points before processing to ensure data minimisation (a minimal illustration follows this list)
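As a simple illustration of what removing PII before processing can look like, the sketch below redacts obvious identifiers (email addresses and phone numbers) from transcript text with regular expressions. Real-world PII removal, including names, usually requires more than regexes; this is a generic example, not CoLoop’s implementation.

```python
# Generic sketch of pre-processing PII removal: redact obvious identifiers
# (emails, phone numbers) before text is sent for further processing.
# Real PII detection is more involved; this is illustrative only.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with a labelled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(redact("You can reach me at john.doe@example.com or +44 20 7946 0958."))
# -> "You can reach me at [EMAIL REDACTED] or [PHONE REDACTED]."
```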
Data Localization
Data localization with any SaaS tool has become an important topic in recent times. Different jurisdictions have different rules for what can be done with data derived from citizens living under their laws. The main concern around this point for regulators is not so much where the data physically resides but rather what laws it is subject to. Currently, CoLoop is fully GDPR compliant, but we are happy to discuss any other legal and regulatory requirements that you might have.
Is it GDPR compliant if the data is hosted in the US?
- If the service you are interacting with is owned / operated by a UK / EEA based software company, even if the data is hosted on servers physically outside this region, the data processing can be GDPR-compliant. The company is responsible for ensuring the data is processed, collected and stored securely and in line with all GDPR requirements.
What if the software company is using a non-EU based 3rd party like OpenAI?
- If the service you are interacting with uses 3rd parties that are non-EU owned or operated, then they should establish a data processing agreement that contains standard contractual clauses (SCCs) with those 3rd parties. Additionally, apart from the SCCs, they could rely on other transfer mechanisms, such as the US Data Privacy Framework, or Adequacy Decisions. The company should be able to produce a list of all the subprocessors that handle your data, and the transfer mechanisms in place.
How do we address this at CoLoop?
- Data is processed by CoLoop outside of the EEA / UK in a GDPR-compliant manner
- Data minimisation features are available within CoLoop to allow users to store high-classification Personal Data in the region of origin. We also perform Data Protection Impact Assessments (DPIA) and Data Transfer Impact Assessments (DTIA) wherever necessary.
- All data is encrypted at rest and in transit using AES-256 and TLS respectively, and all the third parties used by CoLoop have entered into binding DPAs with adequate transfer mechanisms in place. You can find a list of all of CoLoop’s subprocessors here.
Data Retention
Providers of AI tools may wish to retain data for training, evaluation and, in some cases, prevention of abuse. Information on this can usually be found in the privacy policy published by the provider of the AI tool. You should make sure you are clear on the retention policy and how to go about removing your data. It is worth noting that good engineering practice also involves storing backups, which may need to be removed as well.
What do we do at CoLoop?
At CoLoop, data is only processed or retained on your instruction. Personal Information and Client Data is deleted within at most 7 days of a request to CoLoop, whether written or submitted via the platform. Similarly, our subprocessors retain data only for as long as it is needed for the agreed processing to take place.