Our client in the medical industry wanted to build a system for analyzing medical articles about drugs. The client also wanted to use the system to research alternatives to these drugs, their molecules, and the therapy areas in which they are applied. We designed and developed the whole system from scratch. It allows users to analyze data from multiple datasources, process it, and see the processing results in a web application. Initially, the system was meant for internal use (a proof-of-concept version); later the client wanted a production version to share with their users.
What the client had
We were provided with a list of websites containing medical articles about the drugs, their alternatives, molecules, and therapy areas that the client wanted to monitor. The client also gave us a list of KPIs, filters, and functionalities to include in the web application, so that users could narrow down the results of their analyses.
Challenges and our solutions
The first challenge in this project was the number of datasources: there were thousands of websites on the client's list. Some datasources expose their own APIs, and we created connectors for them (a connector is an independent agent program that downloads data from a given datasource). For datasources without APIs, we used scraping instead. Combining connectors with scraping improved the accuracy of finding relevant data. Because different websites return data in various formats, we implemented a normalization step to unify the data in our system, so that further components can process it. The connectors save both the raw and the normalized articles in Amazon S3.
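The normalization step can be sketched as follows. The field names and source formats below are illustrative assumptions (the real schemas are not described here); the point is that every datasource-specific shape is mapped onto one common record.

```python
# Sketch of the normalization step: connectors return articles in
# source-specific formats, and we map them onto one common schema.
# All field names below are illustrative assumptions, not the real schema.

def normalize_api_record(record: dict) -> dict:
    """Normalize a record from a hypothetical JSON API datasource."""
    return {
        "title": record["headline"].strip(),
        "author": record.get("byline", "unknown"),
        "published": record["pub_date"],          # assume ISO 8601 already
        "body": record["content"].strip(),
        "source": record["site"],
    }

def normalize_scraped_record(record: dict) -> dict:
    """Normalize a record produced by a hypothetical scraper."""
    return {
        "title": " ".join(record["title"].split()),   # collapse whitespace
        "author": record.get("author") or "unknown",
        "published": record["date"],
        "body": " ".join(record["text"].split()),
        "source": record["url"],
    }
```

In the real system both the raw record and the normalized one are then saved to Amazon S3 (for example via boto3's `put_object`), which is omitted here to keep the sketch self-contained.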
Duplicated documents and different language versions
Because of the multiple datasources, deduplication had to be implemented. The same article can be downloaded from several datasources, either in exactly the same form or with slight differences between websites. The client wanted to keep information about duplicated articles, and also to store different versions of the same document (when the versions differ slightly from each other). We therefore implemented custom logic for deduplicating the data and storing its versions. The same challenge applies to different language versions of the same document, because it was important for the client to cover as many languages as possible. Deduplication is complex because duplicates are not simply removed: any changes are detected and the articles in the database are updated. Deduplicated data is stored in Amazon DynamoDB and in an Amazon ElasticSearch deduplication index.
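A minimal sketch of this dedup-and-version logic, assuming a content hash as the dedup key and the title as the document identifier (both assumptions; the real keys are not described here):

```python
import hashlib

def content_key(article: dict) -> str:
    """Dedup key: hash of the normalized body plus language.
    Language is part of the key so that translations are kept
    as separate documents rather than collapsed together."""
    text = " ".join(article["body"].split()).lower()
    raw = f'{article.get("lang", "en")}::{text}'
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def upsert(store: dict, article: dict) -> str:
    """Insert or update an article; duplicates are recorded, never dropped.
    Returns 'new', 'duplicate' or 'updated' (illustrative labels)."""
    key = content_key(article)
    doc_id = article["title"].lower()   # assumption: title identifies the document
    existing = store.get(doc_id)
    if existing is None:
        store[doc_id] = {"key": key, "versions": [article]}
        return "new"
    if existing["key"] == key:
        return "duplicate"              # same content seen via another datasource
    existing["versions"].append(article)  # changed content: keep a new version
    existing["key"] = key
    return "updated"
```

In production the `store` role is played by Amazon DynamoDB plus the ElasticSearch deduplication index; here a plain dict stands in for both.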
Architecture and Infrastructure
Because of the number of datasources, the duplicates and language versions, and the fact that the system works continuously (data is processed every day), we wanted to download and process data from the different datasources simultaneously. We therefore used microservices: the components of the system work independently. This protects the processing when, for example, there is a problem with one datasource, since we can continue processing data from the other datasources. We also needed internal messaging between the components, and we used Amazon RDS (among others) for this purpose. Each process has its own operations log with information about the state of the process and about any failures or errors. After a failure, the system can determine the point at which it stopped working and restart from there.
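The restart-from-checkpoint behavior can be sketched like this; in the real system the operations log lives in a database, while here a plain dict stands in for it:

```python
# Sketch of checkpoint-based recovery: each process records the last
# completed position in its operations log, so after a crash it can be
# restarted and resume exactly where it stopped.

def process_batch(items, handler, ops_log, process_name):
    """Process items in order, checkpointing progress in ops_log."""
    start = ops_log.get(process_name, -1) + 1    # resume point after a restart
    for i in range(start, len(items)):
        handler(items[i])
        ops_log[process_name] = i                # checkpoint after each item
```

On restart, already-processed items are skipped and work continues from the failed item, so nothing is processed twice and nothing is lost.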
The whole system runs using Amazon services (ECS, S3, RDS, DynamoDB, ElasticSearch).
The client also wanted a web application that lets users analyze the data and narrow it down with various filters. Processing components prepare the data for these filters and calculate the KPIs, and the web application shows the results of their work. Additionally, we created a user interface so that users can easily search through the processing results. The web application was built with Java and Spring Boot.
The web application allows users to select any time range (based on the publication date of the articles), search for products (drugs, alternatives and molecules), and see the results for the following KPIs:
- Top Reactions (the most frequent keywords connected with the products)
- Top Authors and Authors Cloud (a selected number of authors with the highest number of articles)
- Tag Cloud (tags for a given product)
- Volume Trend for searched products (the frequency of occurrence of a given product, alternative or molecule in the articles)
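Two of these KPIs can be sketched over normalized articles as follows. The article fields and the substring-based product matching are simplified assumptions; in the real system these aggregates are prepared by the processing components, not computed in the web application.

```python
from collections import Counter

def top_authors(articles, n=3):
    """Top Authors: the authors with the highest number of articles."""
    return Counter(a["author"] for a in articles).most_common(n)

def volume_trend(articles, product):
    """Volume Trend: per-day count of articles mentioning a product.

    Matching by lowercase substring is a deliberate simplification;
    real matching would use the normalized product/molecule names.
    """
    trend = Counter()
    for a in articles:
        if product.lower() in a["body"].lower():
            trend[a["published"][:10]] += 1   # bucket by publication date
    return dict(trend)
```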
Additionally, we created KPIs based on Amazon's Natural Language Processing services, Comprehend and Comprehend Medical. These services analyze text and return additional keywords connected with medicine.
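For example, Comprehend Medical's `detect_entities_v2` call returns a list of entities with confidence scores, from which keyword candidates can be filtered. The sketch below only parses that response shape; the actual call (`boto3.client("comprehendmedical").detect_entities_v2(Text=...)`) is omitted, and the sample response in the usage is hand-built for illustration.

```python
# Sketch of extracting keyword candidates from an Amazon Comprehend Medical
# detect_entities_v2 response. The response is a dict with an "Entities"
# list, where each entity carries "Text", "Category" and a "Score".

def medical_keywords(response: dict, min_score: float = 0.8) -> list:
    """Keep entity texts whose confidence score passes a threshold."""
    return [
        e["Text"]
        for e in response.get("Entities", [])
        if e["Score"] >= min_score
    ]
```

A low-scoring entity is dropped, so only confident medical keywords feed the KPIs.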
The combination of technologies and ideas implemented in the project resulted in a system that satisfies the needs of our client. We built the entire pipeline: downloading data from numerous datasources, deduplicating it, processing it, and displaying the results in a web application.
Back-end development, Front-end development, Machine Learning
Amazon ECS, Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon ElasticSearch, Amazon Comprehend, Amazon Comprehend Medical, Python, Java, Spring