The client wanted a system for analyzing medical articles about drugs from selected therapy areas, their alternatives, and molecules. The system is based on Big Data processing. We were provided with a list of websites containing such articles. The system handles multiple language versions of the articles, data versioning, and a custom deduplication process. The UI lets users narrow the results down to those containing specific information. Faceting and KPIs are used for this purpose; they are computed in a separate process that uses Machine Learning.
Challenges and solutions
The first challenge in this project was the number of data sources and their APIs. Because the APIs return data in various formats, we implemented a normalization process that unifies the data so that the downstream components of the system can process it. The connectors that download the articles save both the raw and the normalized data in S3.
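A minimal sketch of what such a normalization step might look like. The two sources, their field names, and the target schema are hypothetical and serve only as an illustration; the actual connectors and formats are not described here.

```python
from dataclasses import dataclass, asdict
from typing import Any, Callable, Dict


@dataclass
class NormalizedArticle:
    """Hypothetical common schema shared by all downstream components."""
    source: str
    source_id: str
    title: str
    body: str
    language: str


def from_source_a(raw: Dict[str, Any]) -> NormalizedArticle:
    # Assumed flat payload: {"id", "title", "abstract", "lang"}
    return NormalizedArticle(
        source="source_a",
        source_id=str(raw["id"]),
        title=raw["title"].strip(),
        body=raw["abstract"].strip(),
        language=raw.get("lang", "en").lower(),
    )


def from_source_b(raw: Dict[str, Any]) -> NormalizedArticle:
    # Assumed nested payload: {"document": {"uid", "headline", "text", "locale"}}
    doc = raw["document"]
    return NormalizedArticle(
        source="source_b",
        source_id=str(doc["uid"]),
        title=doc["headline"].strip(),
        body=doc["text"].strip(),
        # Reduce a locale such as "en-US" to a two-letter language code.
        language=doc.get("locale", "en")[:2].lower(),
    )


# One normalizer per data source; adding a source means adding one entry.
NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], NormalizedArticle]] = {
    "source_a": from_source_a,
    "source_b": from_source_b,
}


def normalize(source: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a raw payload from a known source into the common schema."""
    if source not in NORMALIZERS:
        raise ValueError(f"No normalizer registered for source {source!r}")
    return asdict(NORMALIZERS[source](raw))
```

Keeping one small function per source isolates format differences in the connectors, so the rest of the pipeline only ever sees the common schema.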
For the same reason (data coming from multiple data sources), we needed to implement a deduplication process. The same article can be downloaded from several different data sources; it can be identical across websites, or it can differ slightly between them. The client wanted to keep information about duplicated articles and to store the different versions of the same document when they differ slightly from each other. We therefore implemented custom logic for deduplicating the data and for storing its versions. The same challenge applies to different language versions of the same document, because it was important for the client to cover as many languages as possible. The deduplication process is complex because duplicates are not simply removed: if an incoming article contains changes, they are detected and the corresponding article in the database is updated. Once the data is deduplicated, it is stored in DynamoDB and in an Elasticsearch deduplication index.
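The idea of keeping duplicates and versions instead of discarding them can be sketched as follows. This is an illustration under assumptions, not the project's actual logic: matching articles by a normalized title plus language, comparing versions by a hash of the body, and the in-memory `DedupStore` (standing in for DynamoDB and the Elasticsearch index) are all hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Dict, List


def _canonical(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace so trivial differences vanish.
    return re.sub(r"\W+", " ", text.lower()).strip()


def identity_key(title: str, language: str) -> str:
    # Assumption: two articles are "the same document" if language and
    # canonical title match. Each language version gets its own record.
    return hashlib.sha256(f"{language}|{_canonical(title)}".encode("utf-8")).hexdigest()


def content_hash(body: str) -> str:
    return hashlib.sha256(_canonical(body).encode("utf-8")).hexdigest()


@dataclass
class ArticleRecord:
    versions: List[dict] = field(default_factory=list)   # distinct contents seen
    duplicates: List[str] = field(default_factory=list)  # origins of exact repeats


class DedupStore:
    """In-memory stand-in for the real storage layer (hypothetical)."""

    def __init__(self) -> None:
        self.records: Dict[str, ArticleRecord] = {}

    def ingest(self, article: dict) -> str:
        key = identity_key(article["title"], article["language"])
        digest = content_hash(article["body"])
        origin = f'{article["source"]}:{article["source_id"]}'
        version = {"hash": digest, "body": article["body"], "origin": origin}

        record = self.records.get(key)
        if record is None:
            self.records[key] = ArticleRecord(versions=[version])
            return "new"
        if any(v["hash"] == digest for v in record.versions):
            # Same content from another source: remember it, do not drop it.
            if origin not in record.duplicates:
                record.duplicates.append(origin)
            return "duplicate"
        # Same document, changed content: keep it as an additional version.
        record.versions.append(version)
        return "new_version"
```

Separating the identity key from the content hash is what lets a single record hold both exact duplicates and slightly different versions of one article.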
Another challenge was managing the components of the system. To keep track of them, information about the state of the data downloads and about the processing steps is saved in RDS tables.
The client also wanted a UI for analyzing the data and narrowing it down with filters. The web application relies on the backend to apply these filters.
For the UI, the client also asked us to implement the following KPIs:
- Top Reactions (the keywords most frequently connected with the products) for the searched products in a given time range – a diagram showing a selected number of these keywords
- Top Authors for the searched products in a given time range – a diagram showing a selected number of top authors of the articles
- Authors Cloud – a cloud showing a selected number of top authors
- Tag Cloud for the searched products in a given time range – a cloud showing the tags for a given product
- Volume Trend for the searched products in a given time range – a diagram showing how often a given product occurs in the articles in a given time range
- Volume Trend for the searched competitor products in a given time range – a diagram showing how often a given competitor’s product occurs in the articles in a given time range
- Volume Trend for the molecules in a given time range – a diagram showing how often a given molecule occurs in the articles in a given time range
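Two of the KPIs listed above, Top Reactions and Volume Trend, could be computed along these lines. This is a simplified sketch: it assumes each article already carries a publication date plus extracted `products` and `reactions` lists (hypothetical field names), and it aggregates in memory rather than in the search backend.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def _matches(article: dict, product: str, start: date, end: date) -> bool:
    # An article counts if it mentions the product and falls in the range.
    return product in article["products"] and start <= article["published"] <= end


def top_reactions(articles: Iterable[dict], product: str,
                  start: date, end: date, n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent reaction keywords for a product."""
    counts: Counter = Counter()
    for article in articles:
        if _matches(article, product, start, end):
            counts.update(article["reactions"])
    return counts.most_common(n)


def volume_trend(articles: Iterable[dict], product: str,
                 start: date, end: date) -> List[Tuple[str, int]]:
    """Return (month, article count) pairs for a product, oldest first."""
    counts: Counter = Counter()
    for article in articles:
        if _matches(article, product, start, end):
            counts[article["published"].strftime("%Y-%m")] += 1
    return sorted(counts.items())
```

The same two patterns (a ranked frequency count and a per-period count) would cover the author and molecule KPIs by swapping the field that is counted or matched.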
To implement these KPIs in the web application, we created an additional data-processing component, which uses Machine Learning.
The combination of technologies and ideas implemented in the project resulted in a system that satisfies the needs of our client. A major part of the development process was regular communication with the client, who could provide feedback and comments on their needs.