Transforming documents into data to improve decision making, actions and outcomes in investment, financing, legal, compliance…
Structured data, such as collections of numbers ready to be processed, is not the only kind of data available. Nowadays, AI can automatically extract data from a wide variety of texts: notes to balance sheets, contracts, legal judgements, financial analysis reports, product and service descriptions on websites, chatbot conversations, reviews, reports, presentations, memos, emails, financial and insurance product information documents (e.g. KIDs), posts on social media and so on. This material is known as unstructured data.
Even audio and video content (e.g. earnings calls) can be transformed into written text and then “distilled” into data.
Texts may already be in digital format, ready to be processed, or they can be digitised with the appropriate computer vision or OCR technology.
The uses of datalysed text are manifold: from financial applications (credit management, risk management, investment choices etc.) to compliance, legal, marketing and sales, or simply the monetisation of the data itself.
The ultimate goal is not to replace humans in the decision-making process, but to make that process more efficient: reducing timescales and allowing people to base decisions and actions on as much information as possible, in order to achieve the best possible results (AI as Augmented or Actionable Intelligence).
(Although it is not within the scope of this article, it is worth noting that images can also be datalysed: think for instance of satellite photos, maps, CT scans, spectrograms etc.)
The data obtained can be made available through API services or visualised in dedicated dashboards with integrated alert systems, which automatically notify the user of any positive or negative anomalies in the key factors on which the evaluation is based (the so-called KPIs).
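A simple version of such a KPI alert rule can be sketched as follows. This is a hypothetical illustration, not the actual alerting logic of any Datrix product: it flags a KPI whenever its latest value deviates from its historical mean by more than a chosen number of standard deviations, labelling the anomaly positive or negative.

```python
import statistics

def kpi_alerts(history, latest, n_sigma=2.0):
    """Flag KPIs whose latest value deviates from the historical mean
    by more than n_sigma standard deviations.

    history: dict mapping KPI name -> list of past values
    latest:  dict mapping KPI name -> most recent value
    Returns a dict of {kpi: "positive" | "negative"} for anomalies.
    """
    alerts = {}
    for kpi, values in history.items():
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)  # population std. deviation
        if spread > 0 and abs(latest[kpi] - mean) > n_sigma * spread:
            alerts[kpi] = "positive" if latest[kpi] > mean else "negative"
    return alerts
```

In a dashboard, a function like this would run on each data refresh, with the flagged KPIs driving the on-screen alerts.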
The data obtained from texts can be combined with any structured data the company may already have, whether its own proprietary data (e.g. for a bank, data obtained from transactions with existing clients) or data downloaded from public databases (e.g. balance sheets, stock prices etc.). Advanced analysis models can then be developed using machine learning, in particular with a view to finding correlations and predictive signals.
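As a minimal illustration of the "finding correlations" step, one can measure how a text-derived signal (say, a daily sentiment score) moves together with a structured series (say, daily returns) using the Pearson correlation coefficient. The function below is a generic textbook sketch, not any group's actual model:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series,
    e.g. a text-derived sentiment series and a structured price series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 or -1 suggests the text signal may carry predictive information worth feeding into a fuller machine learning model; a value near 0 suggests it does not, at least linearly.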
The Datrix group has developed several solutions in this field. In particular, the tech company PaperLit, drawing on over ten years' experience in transforming the publishing industry, has developed digitalisation solutions whose algorithms read the characters in a document and convert them into machine-readable digital text, with the aim of obtaining a high search-engine ranking for the resulting material. “A typical case might be one in which we need to start from a paper-based document, possibly damaged by time, or from a scan or printout,” says Luca Filigheddu, CEO of PaperLit. PaperLit also specialises in summarisation solutions, through which texts can be reduced to an appropriate length after neural networks have determined the relevance of their content.
The tech company FinScience has developed solutions for natural language comprehension and analysis (Natural Language Processing, or NLP). Giving a machine the capacity to process natural language is extremely complex: each language has its own specific rules, which can differ greatly from one language to another, not to mention the frequent use of conventions and idioms that depend heavily on context to be understood.
Many machine learning methodologies are used to solve the problems frequently encountered in NLP: for example, categorisation algorithms associate a sentiment with a given text, and statistical models learn to recognise the main subject of a piece of content.
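One classic categorisation algorithm for the sentiment task is Naive Bayes. The toy classifier below is a generic sketch with invented example data, not FinScience's implementation: it learns per-class word frequencies from labelled texts and scores new texts with Laplace-smoothed log-probabilities.

```python
import math
from collections import Counter, defaultdict

def train_sentiment(docs):
    """docs: list of (tokens, label) pairs; returns a trained model."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_sentiment(model, tokens):
    """Return the most probable label under Laplace-smoothed Naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, count in class_counts.items():
        logp = math.log(count / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # add-one smoothing keeps unseen words from zeroing the score
            logp += math.log((word_counts[label][t] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Invented toy training data, for illustration only
training = [
    (["strong", "growth", "results"], "positive"),
    (["excellent", "profit", "growth"], "positive"),
    (["weak", "loss", "decline"], "negative"),
    (["poor", "results", "loss"], "negative"),
]
model = train_sentiment(training)
```

Production systems use far richer features and models, but the principle is the same: a classifier trained on labelled examples assigns each new text to a sentiment class.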
“At FinScience we have faced many challenges,” says Ilaria Bianchini, Head of Research Tech, “such as using unsupervised machine learning to group similar documents together and summarise their content; extracting the emotions contained in a text with proprietary deep learning algorithms; identifying the keywords in a text (the words present in the text are represented as nodes in a network, and we then determine which nodes are the most important inside this network, much as Google’s famous PageRank does); and using supervised learning to classify large numbers of legal documents starting from a set of tags.
To solve these problems we need to combine domain knowledge, for example by collaborating with lawyers in the case of legal applications, with technical knowledge of algorithms and programming.”
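The PageRank-style keyword idea described in the quote can be sketched in a few lines: build a co-occurrence graph over a sliding window of tokens, then run a few power-iteration steps of PageRank so that words connected to many other well-connected words rise to the top. This is a generic TextRank-like illustration, not FinScience's proprietary algorithm; the window size and damping factor are assumed values.

```python
from collections import defaultdict
from itertools import combinations

def keyword_rank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank-style centrality in their co-occurrence graph."""
    # Words appearing within the same sliding window become linked nodes.
    neighbours = defaultdict(set)
    for i in range(max(len(tokens) - window + 1, 1)):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:
                neighbours[a].add(b)
                neighbours[b].add(a)
    nodes = list(neighbours)
    score = {n: 1.0 / len(nodes) for n in nodes}
    # Power iteration: each node shares its score equally among neighbours.
    for _ in range(iterations):
        score = {
            n: (1 - damping) / len(nodes)
            + damping * sum(score[m] / len(neighbours[m]) for m in neighbours[n])
            for n in nodes
        }
    return sorted(nodes, key=score.get, reverse=True)
```

The most central words, the best-scoring nodes in the network, are returned first as candidate keywords.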
FinScience has applied this “textual” expertise to the datalisation of contracts underlying non-performing loans (NPLs), the buying and selling of real estate, the development of alternative investment indicators and quantamental strategies, ESG evaluations of companies (in particular, measuring the distance between in-house sustainability reports and public sentiment) and the improved reliability of SME default-risk estimation models.