More about Classora's technology...
The current technology used by Classora Technologies comes from the product Classora Knowledge Base, the first knowledge base in Spanish on the Internet.
In order to provide up-to-date and genuinely useful information, the knowledge base must constantly incorporate public data from the available sources. Given the huge amount of data on the Internet, these sources range from fully structured official platforms (such as Eurostat, the National Statistics Institute, or FIFA) to unofficial public sources written in plain text or with little structure (such as blogs, e-commerce stores, or even Wikipedia). To that end, Classora has developed three types of robots for data management:
1) ETL robots: responsible for the massive uploading of reports from official public sources. They are used for either absolute or incremental data uploading.
2) Data scanner robots: responsible for seeking and updating the data of a unit of knowledge. They use specific sources to perform this task.
3) Content aggregators: they do not connect to external sources. Instead, they generate new information using Classora's internal database.
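The three robot types above can be sketched as a small class hierarchy. This is purely illustrative; the class and method names are our own, not Classora's actual code, and the "database" is just an in-memory dictionary.

```python
# Hypothetical sketch of Classora's three robot types; names are invented.
from abc import ABC, abstractmethod

class Robot(ABC):
    @abstractmethod
    def run(self, db: dict) -> dict:
        """Apply this robot's changes to an in-memory 'database'."""

class ETLRobot(Robot):
    """Massive report uploads from official public sources."""
    def __init__(self, source_rows):
        self.source_rows = source_rows
    def run(self, db):
        db.setdefault("reports", []).extend(self.source_rows)
        return db

class DataScannerRobot(Robot):
    """Seeks and refreshes the data of a single unit of knowledge."""
    def __init__(self, unit, fresh_values):
        self.unit, self.fresh_values = unit, fresh_values
    def run(self, db):
        db.setdefault("units", {}).setdefault(self.unit, {}).update(self.fresh_values)
        return db

class ContentAggregator(Robot):
    """Derives new information from the internal database alone."""
    def run(self, db):
        db["report_count"] = len(db.get("reports", []))
        return db

db = {}
for robot in (ETLRobot([{"gdp_growth": 1.2}]),
              DataScannerRobot("Spain", {"capital": "Madrid"}),
              ContentAggregator()):
    db = robot.run(db)
```

Note how only the first two robots touch external data; the aggregator works entirely from what is already stored, as described above.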
Classora's ETL robots perform the following procedures:
- Extraction: parsing the information in the different data sources.
- Transformation: filtering, cleaning and structuring the data.
- Loading and enrichment: new data are linked to old information.
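The three steps above can be illustrated with a toy pipeline. The CSV source, field names, and target structure are invented for the example; a real loader would of course handle many formats and far larger volumes.

```python
# Toy illustration of the Extraction / Transformation / Loading steps.
import csv
import io

raw = "country;population\nSpain; 47450795 \nFrance;68042591\n"

# Extraction: parse the information in the source format.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Transformation: filter, clean and structure the data.
clean = {r["country"]: int(r["population"].strip())
         for r in rows if r["population"].strip()}

# Loading and enrichment: link new data to information already stored.
knowledge_base = {"Spain": {"capital": "Madrid"}}
for country, population in clean.items():
    knowledge_base.setdefault(country, {})["population"] = population
```

After the run, the new population figures sit alongside the previously loaded facts instead of replacing them.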
In absolute terms, however, Classora Knowledge Base handles only a small fraction of the information available on the Internet. Moreover, each new source of information adds to the complexity of integrating it with the previously loaded data, because the number of variables grows. Without manual monitoring (which becomes increasingly expensive and impractical), data quality is bound to decline as the data volume increases.
This can be avoided, however, by investing in R&D and innovation; our company is therefore constantly improving its loading robots in order to incorporate new data sources with lower levels of structure, in more languages, and with better integration with the previously loaded data. The main problem we face is one imposed by technological evolution: the transformation of unstructured data into structured information.
ETL: Extraction, Transformation and Load
ETL processes are the most important components of a Business Intelligence infrastructure and the ones that deliver the greatest added value. Although these processes may seem transparent to platform users, they gather data from every necessary source and prepare the information to be presented through the reporting and analysis tools. Thus, the accuracy of any platform that manages data integration depends entirely on its ETL processes. In Classora's case, the ETL robots complete and enrich every piece of data with the corresponding metadata (loading date, source, reliability of the data, refresh rate, meaning, connections, etc.) for its subsequent automatic processing.
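The metadata enrichment described above can be sketched as a small wrapper that attaches provenance fields to each loaded fact. The field names mirror the list in the paragraph but are our own choice, not Classora's schema.

```python
# Hedged sketch of per-fact metadata enrichment; field names are invented.
from datetime import date

def enrich(value, *, source, reliability, refresh_days, loaded_on=None):
    """Wrap a raw datum with the metadata a loader might keep per fact."""
    return {
        "value": value,
        "source": source,                 # where the datum came from
        "reliability": reliability,       # e.g. a 0.0-1.0 confidence score
        "refresh_days": refresh_days,     # how often the fact should be re-scanned
        "loaded_on": (loaded_on or date.today()).isoformat(),
    }

fact = enrich(47_450_795, source="INE", reliability=0.95,
              refresh_days=365, loaded_on=date(2024, 1, 15))
```

Keeping this metadata alongside the value is what lets later automatic processing decide when a fact is stale and which source to trust.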
Implementing effective and reliable ETL processes brings up many challenges:
1) Data volume grows exponentially, and ETL processes have to handle huge amounts of information. Some systems update incrementally, while others require a complete reload on each iteration.
2) As information systems grow more complex, so does the disparity of the sources, and therefore the difficulty of integrating them. ETL processes need ample connectivity and greater flexibility.
3) The transformations involved in ETL processes can be very complex. Data need to be aggregated, analyzed, computed, and statistically processed. Specific transformations that are computationally costly are sometimes needed.
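The incremental-versus-full distinction in point 1 boils down to whether a batch upserts changed records or rebuilds the store from scratch. A minimal illustration, with invented timestamps and records:

```python
# Incremental load upserts changed rows; a full load rebuilds the store.
store = {1: ("2024-01-01", "old"), 2: ("2024-01-01", "old")}
batch = [(2, "2024-02-01", "new"), (3, "2024-02-01", "new")]

def load_incremental(store, batch):
    for key, ts, value in batch:      # upsert only the records in the batch
        store[key] = (ts, value)
    return store

def load_full(batch):
    return {key: (ts, value) for key, ts, value in batch}  # rebuild from scratch

inc = load_incremental(dict(store), batch)
full = load_full(batch)
```

An incremental load keeps untouched records (key 1 survives); a full reload discards anything absent from the new batch, which is why it is so much costlier at scale.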
Currently, there are commercial tools and even free software with great capacity for data extraction. In fact, speed and performance do not pose a big technical problem in extraction and loading. Data transformation is where the bottleneck actually lies: at this point, unstructured information must be converted into structured information so it can be integrated with the data already in the target system.
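A toy example of that bottleneck: turning a plain-text sentence into a structured record. The sentence and pattern are invented, and a real system needs far more than a regular expression, but it shows the shape of the problem.

```python
# Toy unstructured-to-structured transformation; pattern is illustrative only.
import re

text = "Spain has a population of 47,450,795 inhabitants."
m = re.search(r"(\w[\w ]*?) has a population of ([\d,]+)", text)
record = {"entity": m.group(1),
          "population": int(m.group(2).replace(",", ""))}
```

The hard part in practice is that real sentences rarely follow one fixed template, which is where natural language processing comes in.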
NLP (Natural Language Processing) is one of the early cornerstones of Artificial Intelligence (AI). Machine translation, for instance, was born in the late 1940s, before the expression "Artificial Intelligence" was even coined. In general terms, NLP deals with the formulation and investigation of computationally effective mechanisms for communication between people and machines through natural languages.
At this stage, however, natural language interpretation algorithms are still far from complete. The main problem is the ambiguity of human language, which appears at several levels:
1) At the lexical level, one word can have several meanings, and the right one must be deduced from context. Much research in natural language processing has studied ways to resolve lexical ambiguities using dictionaries, grammars, knowledge bases, and statistical correlations, but the solutions still need further development.
2) At the referential level, resolving anaphora and cataphora requires determining which preceding or following linguistic items they refer to.
3) At the structural level, semantics is required to determine the hierarchy of the propositional phrases that form the different syntactic trees, e.g. "He is a student of Philosophy of education".
4) At the pragmatic level, a sentence often does not mean what is literally said. Elements such as irony and sarcasm play an important role in the interpretation of the message.
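The lexical ambiguity in point 1 can be illustrated with a simplified Lesk-style disambiguator: choose the sense whose gloss words overlap most with the words around the ambiguous term. The senses and glosses below are invented for the example.

```python
# Simplified Lesk-style word-sense disambiguation; glosses are invented.
SENSES = {
    "bank": {
        "financial institution": {"money", "deposit", "loan", "account"},
        "river edge": {"river", "water", "shore", "slope"},
    }
}

def disambiguate(word, context_words):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(context_words)
    return max(SENSES[word], key=lambda sense: len(SENSES[word][sense] & context))

sense = disambiguate("bank", ["she", "opened", "an", "account", "at", "the", "bank"])
```

This is exactly the dictionary-based approach the paragraph mentions, and its weakness is visible too: if the context shares no words with any gloss, the choice is arbitrary.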
To resolve these and other kinds of ambiguity, the central problem in NLP is the translation of natural language input into an unambiguous internal representation, such as parse trees. This is precisely the approach chosen by most of the public knowledge bases available on the Internet, including Classora's initial approach with CQL (Classora Query Language).
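A parse tree of the kind mentioned above can be represented as nested tuples. The constituency labels (S, NP, VP, ...) are standard; nothing here is Classora's actual CQL representation.

```python
# A tiny hand-built parse tree for "he is a student", as nested tuples:
# (label, child, child, ...) for nodes, plain strings for leaf words.
sentence = ("S",
            ("NP", ("PRP", "he")),
            ("VP", ("VBZ", "is"),
                   ("NP", ("DT", "a"), ("NN", "student"))))

def leaves(tree):
    """Recover the surface words from the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]
```

Because each word hangs off exactly one labeled node, the tree commits to a single reading of the sentence, which is what makes the representation unambiguous.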