Popular articles by drceenish
Ontotext case – Team _A
Identrics NLP Team
Price and promotion optimization for FMCG
Datathon Ontotext Mentors’ Guidelines – Text Mining Classification
The SAP Case using KNIME and Multiple Linear Regression Method
CASE SAP, TEAM 31415
Case_VMWare TEAM anteater
IDENTRICS – Team Vistula
Datathon Sofia Air Mentors’ Guidelines – On IOT Prediction
Datathon Telenor Mentors’ Guidelines – On TelCo predictions
Datathon NSI Mentors’ Guidelines – Economic Time Series Prediction
Popular comments by drceenish
Identrics NLP Team
The Identrics use case is indeed not a typical NLP task, and the team showed a lot of courage in picking it up. You have taken an interesting approach of using the NeuralCoref server and SpaCy, which was a good starting point for this use case.
While using a NeuralCoref library is good, it would have been great to also fine-tune the model on the specific use case data. Looking at some model performance graphs and working out how to optimize the NLP algorithm so that it performs best on this problem are other things the team could have tried. Also, some of the suggested sections of the CRISP-DM methodology are missing.
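To make this concrete, here is a minimal sketch of the kind of NeuralCoref + SpaCy baseline described above. It assumes spaCy 2.x with huggingface's neuralcoref extension installed; the model name and example text are placeholders, not the team's actual pipeline:

```python
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")   # small English pipeline (spaCy 2.x)
neuralcoref.add_to_pipe(nlp)         # register the coreference component

doc = nlp("The team built a pipeline. They used it to resolve mentions.")
print(doc._.has_coref)               # True if any coreference chain was found
print(doc._.coref_clusters)          # clusters of co-referring mentions
print(doc._.coref_resolved)          # text with mentions replaced by their heads
```

Fine-tuning would then mean retraining this component on annotated examples from the use case domain rather than relying on the pretrained weights alone.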
In conclusion, I’ll say that the team did a fantastic job of bootstrapping the available resources in minimal time to get an outcome, but missed some important machine learning optimization steps.
Keep up the great work and please feel free to reach out to me if you have specific questions.
Critical Outliers – VMware Case
Not sure if the name “Critical Outliers” was chosen before seeing the data, but I’ll say that this problem definitely needed that kind of critical thinking from the teams.
It is definitely to the team’s credit that they succeeded in building an end-to-end working system to solve a quite complex but very practical industry problem in the NLP space.
The approach of using LDA is quite classical and is known to work on a variety of data, though the challenge is always to find the optimal model parameters to get it working well. It is very important to choose the model parameters so that the features don’t overlap significantly in the topic space (remember the elbow rule in the learning curve?).
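For illustration, a minimal sketch of that elbow-style sweep, assuming scikit-learn’s LDA with held-out perplexity as the score (the corpus below is a toy placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Toy placeholder corpus; substitute the real documents.
docs = [
    "cloud backup storage service outage",
    "restore backup from cloud storage",
    "virtual machine cpu memory usage",
    "vm memory leak high cpu",
    "license key activation error",
    "activation fails with license error",
    "network switch routing issue",
    "routing table update on switch",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Sweep the number of topics and watch held-out perplexity;
# pick the k where the curve flattens out (the elbow).
for k in range(2, 7):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    print(k, round(lda.perplexity(X_test), 1))
```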
In conclusion, I’ll say it is nice to see a working system, but it would have been impressive to see some implementation of hierarchical models like agglomerative clustering, sketched below.
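As a pointer, here is a hedged sketch of agglomerative clustering over TF-IDF document vectors with scikit-learn (again, a placeholder corpus rather than the case data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Toy placeholder corpus; substitute the real documents.
docs = [
    "support ticket about licensing",
    "license key activation issue",
    "vm crashes on boot",
    "virtual machine fails to start",
]

# Ward linkage needs dense euclidean features, hence .toarray().
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: licensing docs vs. VM-boot docs
```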
Best of luck,
The SAP Case using KNIME and Multiple Linear Regression Method
Hi Team,
Very good data understanding. Let’s take this forward to answer some of the specific questions that are part of this case study.
-Nish
CASE SAP, TEAM 31415
Congratulations to team 31415 for taking up a difficult challenge, primarily because, as some might say, the data size for this problem was very small. However, the best part was that the problem was well defined.
I thoroughly enjoyed reading a very crisp flow of ideas and implementation on the case. I liked that the team thought of using multiple algorithms but discounted them because of the algorithms’ limitations given the data.
The only suggestion I’d like to offer is that the team should have thought twice about using train_test_split at (80, 20). Typically, when there are so few observations (especially in medical research studies), the choice is one-vs-all or leave-one-out. However, neither of those methods could have guaranteed a significant change in the model response.
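For reference, a minimal sketch of leave-one-out evaluation with scikit-learn; the model and data below are synthetic placeholders, not the team’s actual setup:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data standing in for the small dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# Each of the 20 observations serves as the test point exactly once.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(-scores.mean())  # average held-out absolute error over all folds
```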
Best of luck.
Case_VMWare TEAM anteater
The VMWare use case is a very well defined real-life problem. Congratulations to TEAM anteater for competing in this use case.
I’m impressed that the team tried two different approaches, SVD and NMF. As in most NLP problems, there were some strong signals appearing in multiple topics, e.g. “DaaS” appears in both topic 1 and topic 2. Ideally, in such cases, one should optimize the parameters of the algorithms so that the features don’t overlap across topics; a sketch of how to check this follows. Another approach is to try hierarchical models like agglomerative clustering.
Overall, I think the team did a good job of completing the task in a short time; however, I suggest that the choice of algorithm should be based on the outcome of the baseline results.
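To illustrate, a hedged sketch of one way to quantify that overlap: fit NMF and compare the top terms of each topic with Jaccard similarity (a toy corpus, not the VMware data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy placeholder corpus; substitute the real documents.
docs = [
    "daas desktop as a service cloud offering",
    "daas virtual desktop infrastructure rollout",
    "network switch routing firmware bug",
    "router firmware upgrade breaks network",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

nmf = NMF(n_components=2, random_state=0).fit(X)
# Top 5 terms for each topic, taken from the factor matrix rows.
tops = [set(terms[comp.argsort()[-5:]]) for comp in nmf.components_]

# High Jaccard similarity between topic term sets signals overlap:
# consider re-tuning n_components or the regularization.
for i in range(len(tops)):
    for j in range(i + 1, len(tops)):
        jac = len(tops[i] & tops[j]) / len(tops[i] | tops[j])
        print(f"topics {i} and {j}: Jaccard = {jac:.2f}")
```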
Best of luck,