For a client in the medical industry, we developed a system that searches for documents about a particular type of doctor. We designed the system and are responsible for all of its components. The client provided input files, URLs to scrape, and a database containing a list of doctors. Our task was to use these sources to identify the doctors the client was interested in. The project combines Machine Learning models, Big Data processing and Named-Entity Recognition.
Challenges and solutions
As with most Big Data projects, the sheer amount of data created several challenges during development.
Firstly, the client provided an input CSV file and URLs for scraping, and these two inputs had to be processed separately. For this reason, we used two ECS containers.
Secondly, the system found documents of many different types, and some of them were on topics other than the one the client was interested in. We needed logic for separating relevant documents from irrelevant ones. For this purpose, we used Machine Learning models: an SVM model and a TF-IDF model.
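To make the relevance filter more concrete, here is a minimal sketch of how TF-IDF features and an SVM can be combined. It assumes scikit-learn, which this article does not name, and the training sentences, labels and the `is_relevant` helper are invented for illustration; the real training data and model settings are not described here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data for illustration only: 1 = relevant, 0 = not relevant.
train_texts = [
    "cardiologist presents new study on heart surgery outcomes",
    "the surgeon performed a complex heart transplant at the clinic",
    "heart specialist receives award for patient care",
    "football team wins the championship after penalty shootout",
    "new smartphone released with a larger battery",
    "stock markets fall as interest rates rise",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each document into a weighted term vector;
# the linear SVM learns a boundary between the two classes.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(),
)
model.fit(train_texts, train_labels)


def is_relevant(text: str) -> bool:
    """Return True if the classifier labels the document as relevant."""
    return bool(model.predict([text])[0] == 1)


if __name__ == "__main__":
    print(is_relevant("heart surgeon publishes study at the clinic"))
    print(is_relevant("football championship and a smartphone battery"))
```

Wrapping both steps in one pipeline keeps the vectorizer vocabulary and the classifier in sync, so the same object can be fitted once and reused for every incoming document.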
Thirdly, the system needed to identify the doctors. To find names within the documents, we used Named-Entity Recognition. Once the names were found, we had to determine whether they belonged to the doctors we were interested in. Regular expressions turned out to be useful here. We created four lists for identifying a doctor:
- a list of keywords and phrases that identify doctors
- a list of keywords and phrases that disqualify a given person as a doctor
- a list of disqualifying keywords and phrases together with their distances from the name (this list is very useful for determining that a name belongs to a person unrelated to medicine)
- a list of other names (e.g. the name of a product) used for filtering the names of people (this list was needed because Named-Entity Recognition sometimes classified the name of a product or facility as a person).
After these lists were applied to the names, a final score was computed for each pair of a name and a keyword or phrase.
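The interplay of the four lists can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the list contents, the distance limit, the `score_name` helper and the simple +1/-1 scoring are hypothetical, since the actual lists and scoring formula are not given in this article. The sketch also sums the keyword hits into one score per name rather than keeping one score per name and keyword pair.

```python
import re
from typing import Optional

# Hypothetical example lists; the real ones are not published here.
QUALIFYING = ["cardiologist", "dr"]            # keywords identifying doctors
DISQUALIFYING = ["journalist", "attorney"]     # keywords disqualifying a person
DISQUALIFYING_NEAR = {"patient": 3}            # keyword -> max distance (in words) from the name
NOT_PEOPLE = {"cardio plus"}                   # product or facility names mistaken for people

WORD = re.compile(r"[a-z]+")


def score_name(name: str, text: str) -> Optional[int]:
    """Score a name found by NER; None means the name was filtered out."""
    if name.lower() in NOT_PEOPLE:
        return None
    words = WORD.findall(text.lower())
    name_words = WORD.findall(name.lower())
    n = len(name_words)
    # Positions where the full name starts in the tokenized text.
    starts = [i for i in range(len(words) - n + 1) if n and words[i:i + n] == name_words]
    if not starts:
        return None

    score = sum(1 for kw in QUALIFYING if kw in words)
    score -= sum(1 for kw in DISQUALIFYING if kw in words)

    # Distance-limited keywords only count when they occur close to the name.
    for kw, max_dist in DISQUALIFYING_NEAR.items():
        positions = [i for i, w in enumerate(words) if w == kw]
        near = any(
            (s - p if p < s else p - (s + n - 1)) <= max_dist
            for s in starts
            for p in positions
            if p < s or p >= s + n
        )
        if near:
            score -= 1
    return score


if __name__ == "__main__":
    print(score_name("Anna Kowalska", "Dr. Anna Kowalska, a cardiologist, presented the results."))
    print(score_name("John Smith", "The patient John Smith thanked the journalist."))
    print(score_name("Cardio Plus", "Cardio Plus is a new product."))
```

A positive score suggests the name belongs to a doctor of interest, a negative one suggests a person unrelated to medicine, and names on the filter list are dropped before any scoring happens.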
Finally, we needed a way to present the results. S3 buckets store the reports produced during processing. Whenever an input item is processed, an individual report for that item is saved. A processor triggered once a day merges these individual reports. As a result, the client receives daily reports in CSV format.
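The daily merge step could look roughly like the sketch below. To keep it self-contained, a local directory stands in for the S3 bucket, and the `merge_reports` function, file layout and column names are hypothetical. It also assumes that every individual report shares the same columns.

```python
import csv
from pathlib import Path


def merge_reports(report_dir: Path, output_path: Path) -> int:
    """Merge all CSV reports in report_dir into one file; return the row count."""
    rows_written = 0
    with output_path.open("w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(report_dir.glob("*.csv")):
            # Skip the merged file itself if it lives in the same directory.
            if path.resolve() == output_path.resolve():
                continue
            with path.open(newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames is None:
                    continue  # empty report, nothing to merge
                if writer is None:
                    # The first non-empty report defines the header of the daily report.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

In a setup like the one described, the same logic would read the individual reports from the bucket and write the merged file back, with the daily trigger calling the function once per day.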