Dr. Daniel Ferrés, Universitat Pompeu Fabra
1) HARDWARE ARCHITECTURE
This project uses a local in-house hardware architecture with a local open-source Large Language Model (LLM). This approach was chosen for three main reasons: a) to avoid sending data to external companies, b) to prevent possible security issues with personal data such as leakage, theft or misuse, and c) to have more control over the algorithms and the models.
The hardware architecture of the project consists of three main hardware environments: 1) an AI Training platform, 2) a front-end server, and 3) an AI Generation back-end server.
2) SOFTWARE ARCHITECTURE
There are three phases: 1) the Model Training phase, 2) the Front-end Interface phase, and 3) the Prediction and Generation phase.
3) INTERFACE
The communication of the user with the front-end interface is done inside a secure Information Technologies (IT) infrastructure.
The front-end interface uses secure protocols to receive and send encrypted electronic Court Cases.
The user interacts with a simple, intuitive and functional interface.
In this phase of the project the interface includes 3 main sets of tasks: Document Classification, Textual Field Extraction, and Auto Generation.
The interface is currently designed mainly for desktop environments.
4) PREDICTION AND GENERATION PHASE
The AI Generation server is a back-end server with a REST API that receives requests from the front-end server. The back-end server includes several Artificial Intelligence algorithms with LLMs.
This server can process two electronic Court Case (ECC) formats:
1) a folder with several documents corresponding to the ECC,
2) a single PDF file that contains all the documents of the ECC merged together.
In the second format, a splitting script divides the merged file into the individual Court Case PDF documents and names them with the original names embedded in the single PDF. The textual data must then be extracted from the PDF files in order to provide the algorithms with an electronic textual input of the case. This is done by the PDF text extraction module, which is based on existing AI open-source software that can handle both text-based and image-based PDFs.
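The handling of the two input formats can be sketched as follows. This is a minimal illustration, not the project's actual code: `collect_case_documents` and `split_merged_pdf` are hypothetical names, and the real splitting script would rely on a PDF library to cut the merged file at document boundaries.

```python
from pathlib import Path


def split_merged_pdf(merged: Path) -> list[Path]:
    """Placeholder for the splitting script described above.

    A real implementation would use a PDF library to divide the merged
    file into the individual Court Case documents and name each part
    with the original filename embedded in the single PDF.
    """
    raise NotImplementedError("requires a PDF library")


def collect_case_documents(ecc_path: Path) -> list[Path]:
    """Return the individual PDF documents of an electronic Court Case (ECC).

    Handles the two input formats: a folder of separate PDFs, or a
    single merged PDF that must first be split.
    """
    if ecc_path.is_dir():
        # Format 1: a folder with one PDF per document.
        return sorted(ecc_path.glob("*.pdf"))
    # Format 2: a single merged PDF.
    return split_merged_pdf(ecc_path)
```

The returned list of PDF paths would then be passed to the PDF text extraction module before classification.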
The Document Classification module classifies each document in the court case into one of 7 possible classes. This information is returned to the front-end server and displayed to the user. In some cases the system shows an alert message that informs the user that some required mandatory documents (and/or complementary ones) were not found, and thus it is not possible to process the whole case until the missing documents are delivered to the system. The classification module is currently based on a rule-based algorithm that uses the following information to classify the documents: 1) the document filenames, and 2) the internal textual content of the documents. A Machine Learning algorithm for this task is currently in development.
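A rule-based classifier of this kind can be sketched as below. The class names and patterns are purely illustrative assumptions (the real system uses 7 project-defined classes and rules); the sketch only shows the two-step logic of checking the filename first and the textual content second.

```python
import re

# Hypothetical filename patterns mapped to hypothetical class labels.
FILENAME_RULES = {
    r"petition|demanda": "Petition",
    r"tax|renta": "Income Tax Return",
}

# Hypothetical content patterns, checked when the filename is uninformative.
CONTENT_RULES = {
    r"income tax return": "Income Tax Return",
    r"list of creditors": "Creditor List",
}


def classify_document(filename: str, text: str) -> str:
    """Classify a document using 1) its filename, then 2) its content."""
    for pattern, label in FILENAME_RULES.items():
        if re.search(pattern, filename, re.IGNORECASE):
            return label
    for pattern, label in CONTENT_RULES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return label
    return "Unknown"
```

Documents that fall through all rules would trigger the alert message described above until the mandatory documents are delivered.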
The user can also generate a Case Management Order draft for required documents.
The Textual Field Extraction module returns a set of textual fields extracted from the most relevant classes of documents of the Court Case (see the list of relevant document classes above and the set of extraction fields in section 2.1). This set of fields is displayed in the front-end to inform the user about the following data about the debtor: 1) personal data, 2) assets, 3) income tax return, and 4) creditors. A generic local Large Language Model (LLM) is used in combination with prompting techniques to extract the relevant data.
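A prompting approach of this kind might look like the sketch below. The prompt wording, field names and JSON output convention are assumptions for illustration; the actual prompts and extraction fields are defined in section 2.1 of the project.

```python
import json

# Hypothetical prompt template asking the local LLM for structured output.
EXTRACTION_PROMPT = """You are a legal assistant. From the court case text \
below, extract the following fields about the debtor and answer ONLY with \
a JSON object with these keys: name, national_id, assets, annual_income, \
creditors.

Text:
{document_text}
"""


def build_extraction_prompt(document_text: str) -> str:
    """Build the prompt sent to the local LLM."""
    return EXTRACTION_PROMPT.format(document_text=document_text)


def parse_llm_answer(answer: str) -> dict:
    """Parse the JSON object in the model's answer, tolerating extra text."""
    start, end = answer.find("{"), answer.rfind("}")
    return json.loads(answer[start:end + 1])
```

Requesting JSON output and parsing it defensively keeps the extraction robust to the extra prose that generic LLMs often wrap around their answers.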
The final task of this part of the project is the Automatic Textual Generation of the documents returned to the user. Currently two kinds of files with proposed content are generated: the Admission Auto draft and the Exoneration Auto draft. This module uses a set of predefined automated rules and legal text templates that, in combination with the textual fields extracted in the previous phase, generate the proposed output files and send them to the front-end server to be returned to the final user. This task does not use AI algorithms directly; it uses AI indirectly through the information extracted by the previous module, the Textual Field Extraction module, where state-of-the-art LLM algorithms are used. The rules that compose this task have been predefined by a law expert and generate the textual output according to the values of the fields extracted in the Textual Field Extraction task. The specific predefined text fragments (sentences and paragraphs) are stored in files and in the internal database. This module does not require training or tuning.
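The template-filling mechanism can be sketched as follows. The template text and field names are invented for illustration; the real sentences and paragraphs are predefined by a law expert and stored in files and in the internal database.

```python
from string import Template

# Hypothetical fragment of an Admission Auto draft template.
ADMISSION_TEMPLATE = Template(
    "The petition filed by $debtor_name, holder of ID $national_id, "
    "is ADMITTED for processing. Declared creditors: $creditors."
)


def generate_admission_draft(fields: dict) -> str:
    """Fill the legal template with the extracted textual fields."""
    return ADMISSION_TEMPLATE.substitute(fields)
```

Because generation is deterministic template substitution driven by the extracted fields, this module needs no training or tuning, as noted above.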