1) QUALITY OF THE DATA TRAINING SET
Given the court’s random case allocation criteria and the fact that the case sample we were given access to includes cases from all 12 Commercial Courts of Barcelona, we believe that the 3,000 finalized cases can be considered representative of the types of documents, debts, structure and amounts owed by individuals requesting a debt waiver in order to obtain a fresh start.
2) DATA PROTECTION CONCERNS
The data protection concerns centre on two requirements: the anonymization of every single file contained in each of the 3,000 court cases, and the assurance that no data inferred about the data subjects will be processed.
The Catalan government requires us to contract a third-party company to anonymize the documents included in each of the 3,000 court cases, so that all processing takes place on anonymized data.
The anonymization of court cases is an ongoing process performed iteratively, through several processing and evaluation cycles on limited sets of court cases, gaining knowledge and effectiveness at each cycle. Current anonymization technology generally shows very good performance in detecting personal data. Once personal data are detected, the anonymization process must replace them with synthetic data (i.e. newly invented names, streets, numbers, emails, etc., unrelated to the original data).
3) HETEROGENEOUS DOCUMENTS, STRUCTURE AND FORMAT
Challenges and technical limitations of document anonymization – the anonymization process currently has technical limitations in some special cases in which personal data cannot be detected, replaced or censored, due to the inherent limitations of anonymization technology when confronted with heterogeneous sets of documents that do not follow a specific structure and present different formats. These special cases have to be anonymized manually unless a technical solution is found to handle them automatically with an AI algorithm or a custom script.
Moreover, other special issues include: lack of consistency between fields of synthetic data inside the same document, lack of consistency between fields of synthetic data across documents, misalignment of synthetic data with the textual context, etc. All of these cases can hinder the AI textual content extraction algorithms applied to the anonymized documents.
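The cross-document consistency problem can be addressed by reusing the same synthetic value whenever the same original value reappears within a case. The following is a minimal sketch of that idea, assuming the detection step has already isolated the personal data values; the class name, the synthetic name pool and the seeding are our own illustrative choices, not the anonymization vendor’s actual method:

```python
import random

# Hypothetical pool of synthetic values; a real system would use larger,
# locale-appropriate generators for names, streets, emails, etc.
SYNTHETIC_NAMES = ["Joan Vila", "Marta Puig", "Pere Soler"]

class ConsistentReplacer:
    """Replace detected personal data with synthetic values, reusing the
    same replacement for the same original value across all documents of
    a case, so that synthetic fields stay consistent."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)   # seeded for reproducibility
        self._mapping: dict[str, str] = {}

    def replace(self, original: str) -> str:
        # First occurrence: pick a synthetic value and remember it.
        if original not in self._mapping:
            self._mapping[original] = self._rng.choice(SYNTHETIC_NAMES)
        # Later occurrences (same or other documents): reuse it.
        return self._mapping[original]

replacer = ConsistentReplacer()
first_doc = replacer.replace("Maria Garcia")
second_doc = replacer.replace("Maria Garcia")  # another document, same case
assert first_doc == second_doc  # consistent across documents
```

One mapping instance per court case keeps replacements consistent within the case while leaving different cases unlinkable to each other.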
Handwritten documents – some (generally few) of the documents provided are handwritten, which means that the AI language models cannot extract the information they contain. The larger the number of handwritten documents in a case, the greater the number of errors the AI system will generate, to the point of rendering the system practically useless.
The claimant’s document filing differs from the procedural requirements and timing – in some cases claimants’ solicitors are required to provide additional documents beyond those initially submitted to the court.
Heterogeneous internal PDF formats – most of the 3,000 cases we have had access to contained PDF files with several different internal formats. As explained earlier, the proceedings’ docket is fully digitalized, but some of the files are Word documents converted to PDF, some are fully scanned PDFs, and other PDFs contain pictures or scanned sections inside. This heterogeneity of document formats presents a major challenge to the adequate extraction of the information provided.
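Before extraction, files of these mixed kinds can at least be triaged so that each is routed to the right strategy (direct text extraction vs. OCR). The sketch below is our own crude byte-level heuristic, not the project’s actual pipeline: born-digital PDFs usually declare `/Font` resources for their text layer, while fully scanned PDFs typically only embed images. Real files would need a proper PDF parser.

```python
def classify_pdf(raw: bytes) -> str:
    """Crude triage of a PDF's internal format based on markers in the
    raw bytes. Only illustrates the idea of routing heterogeneous files
    to different extraction strategies."""
    has_font = b"/Font" in raw                             # text layer present
    has_image = b"/Image" in raw or b"/DCTDecode" in raw   # embedded scans
    if has_font and has_image:
        return "mixed"      # e.g. Word-to-PDF with scanned annexes inside
    if has_font:
        return "digital"    # text extractable directly
    if has_image:
        return "scanned"    # needs OCR before extraction
    return "unknown"

# Synthetic byte snippets standing in for real files:
assert classify_pdf(b"%PDF-1.7 ... /Font ...") == "digital"
assert classify_pdf(b"%PDF-1.4 ... /Image /DCTDecode ...") == "scanned"
```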
The AI system is designed so that errors can be relatively easily detected by the user of the system: 1) the system generates alerts when required documents are missing from the court case folder; 2) the system displays the source documents below the extracted information, so that the user can check whether the extraction has been done correctly. Moreover, the system can detect both missing and inaccurate information.
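The missing-document alert described in point 1) amounts to a set difference between the documents a proceeding requires and those found in the case folder. A minimal sketch, in which the document type names are invented placeholders rather than the system’s actual checklist:

```python
# Hypothetical required document types for a standard no-assets insolvency case.
REQUIRED_DOCS = {"petition", "creditor_list", "asset_inventory", "income_statement"}

def missing_document_alerts(folder_docs: set[str]) -> list[str]:
    """Return one alert per required document type absent from the case folder."""
    return [f"ALERT: missing required document '{doc}'"
            for doc in sorted(REQUIRED_DOCS - folder_docs)]

alerts = missing_document_alerts({"petition", "creditor_list"})
# Two documents are absent: asset_inventory and income_statement.
assert len(alerts) == 2
```

Because the check is deterministic, the user always sees the same alerts for the same folder contents, which supports the oversight principle described below.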
Constant and easy oversight for error detection is one of the essential design principles of the AI system.
As of now we can identify two major risks presented by the AI judicial assistant:
a) Overreliance on the AI judicial assistant’s proposals – judges, law clerks and court officials may become overly confident in the proposals the system makes at each stage of the judicial proceeding. This could result in errors, as well as undermine the human intervention required at each phase of the proceeding.
b) Under-representation of exceptions – as explained earlier, the AI judicial assistant is designed to serve the most standard cases of insolvency proceedings of natural persons with no assets. Given that the system does not address all possible cases and all the possible nuances arising in them, system users must remain active in detecting the exceptions, special circumstances of the claimant, or particular cases that will surely appear.
AI systems perform well on standard, repetitive cases, but they require active human involvement to closely examine the special, non-standard circumstances that real case law regularly presents.