IIITH on Quora

Introduction

When researching colleges to apply to, a Google search about a college is very likely to return a top result from Quora, a popular social question-and-answer website. The same is true for IIIT Hyderabad. Our project studies the topics that people talk about when asking and answering questions about IIIT Hyderabad on Quora, and presents the insights we gained along the way.

Quora often forms the first impression of IIIT for many people

Initial Approach

Our main idea was to create a corpus of questions which are about IIIT Hyderabad, and their top upvoted paired answers wherever available. We aimed to perform topic modelling on this corpus in order to understand the most talked about topics. In addition, we aimed to use metadata about the answer and the author such as the number of views, upvotes, the author's bio etc. in order to gain some more insights.

Scraping and cleaning

Unfortunately, Quora does not provide an official API for retrieving questions that match a query, so we had to scrape them instead, and we used Selenium to do so. The website frequently crashed during the scraping period, so it is possible that we were also being rate limited.

We searched for the terms 'IIIT Hyderabad', 'IIIT H', and 'UGEE' on the website and scraped these questions, and their top upvoted answer whenever the questions had been answered. After scraping, we removed duplicates and false positive matches with unrelated content (sponsored posts or advertisements). In addition, we also cleaned the data to remove any identifying information from the aggregate analytics, maintaining the anonymity of those mentioned in the answers.
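The de-duplication and false-positive filtering described above can be sketched as follows. This is a minimal illustration rather than our actual cleaning script: the record layout, the `is_sponsored` flag and the `normalise` helper are assumptions made for the example.

```python
import re


def normalise(text):
    """Lowercase and collapse punctuation/whitespace so near-identical
    question strings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def clean(records):
    """Drop sponsored posts and duplicate questions, keeping first occurrences.

    Each record is assumed to be a dict with a 'question' string and an
    'is_sponsored' flag set by the scraper (both names are hypothetical).
    """
    seen = set()
    kept = []
    for rec in records:
        if rec.get("is_sponsored"):
            continue  # advertisement or sponsored post: a false positive
        key = normalise(rec["question"])
        if key in seen:
            continue  # same question already scraped under another search term
        seen.add(key)
        kept.append(rec)
    return kept


# Toy records, invented for illustration only.
sample = [
    {"question": "How is life at IIIT Hyderabad?", "is_sponsored": False},
    {"question": "how is life at IIIT Hyderabad", "is_sponsored": False},
    {"question": "Learn to code in 30 days", "is_sponsored": True},
    {"question": "What is the UGEE syllabus?", "is_sponsored": False},
]
cleaned = clean(sample)
```

Duplicates arise naturally here because the same question can match more than one of the three search terms.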

After these steps and some standard data cleaning to prepare the data for further processing and analysis, we were left with a total of 1800 data points across answered and unanswered questions.

Topic Modelling with LDA/NMF and BERT

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are two common topic modelling approaches, and we used both to extract the most common topics from the corpus. Both methods are language agnostic and rely primarily on the frequency of words in the document and corpus. This gives them a slight advantage in handling code-mixing and IIIT-specific terminology, which a pre-trained model might not handle well.
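To show what a frequency-based factorisation does, here is a dependency-free sketch of NMF using the standard multiplicative update rules. It is not the code we ran (in practice a library implementation would be used), and the toy questions, whitespace tokenisation and function names are assumptions made for illustration.

```python
import random
from collections import Counter


def term_document_matrix(docs):
    """Build a documents x vocabulary count matrix from whitespace tokens."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([float(counts[w]) for w in vocab])
    return rows, vocab


def transpose(m):
    return [list(r) for r in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, i.e. how well the factors reconstruct V."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (docs x terms) into non-negative W (docs x topics) and
    H (topics x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_terms(h, vocab, k=3):
    """For each topic row of H, list the k highest-weighted vocabulary terms."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:k]]
            for row in h]


# Toy questions, invented for illustration only.
docs = [
    "ugee exam preparation tips",
    "jee exam rank cutoff",
    "hostel rooms and mess food",
    "hostel life and mess timings",
]
v, vocab = term_document_matrix(docs)
w, h = nmf(v, n_topics=2)
topics = top_terms(h, vocab)
```

Each row of `h` is a topic expressed as weights over the vocabulary, and each row of `w` says how strongly a question loads on each topic. Nothing here depends on the language of the tokens, which is why such methods cope with code-mixed text.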

 

BERTopic is a topic modelling stack built on BERT embeddings that also allows hierarchical clustering of topics. We use BERTopic as well, since the underlying Transformer architecture is more powerful and the resulting topics should be semantically linked in a way that more closely resembles human understanding. Although we noted above that code-mixing and college-specific terminology may not work well with a pre-trained model, we found that BERTopic gave good results nonetheless.

Topics for questions answered by IIITians

Topics for questions answered by non-IIITians


Insights from Topic Modelling

IIIT vs Non-IIIT

To decide whether the author of an answer was associated with the college, we checked their bio for the presence of 'International Institute of Information Technology' or 'IIITH'. Of course, this is not a perfect approach, since people can add this to their bio without it being verified, but we believe this is roughly counterbalanced by the number of people who are indeed associated with the college but do not mention it in their bio.
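The bio check amounts to a simple string match, sketched below. The two search strings come from the description above; the case-insensitive matching, the word-boundary handling and the function name `is_iiitian` are assumptions made for this example.

```python
import re

# The two strings we looked for; matching case-insensitively is an assumption.
AFFILIATION_PATTERN = re.compile(
    r"international institute of information technology|\biiith\b",
    re.IGNORECASE,
)


def is_iiitian(bio):
    """Return True if the author's bio mentions IIIT Hyderabad.

    A missing or empty bio counts as 'not associated', which is one source
    of the undercounting discussed above.
    """
    return bool(bio) and AFFILIATION_PATTERN.search(bio) is not None
```

For example, `is_iiitian("CSE at IIITH")` is true, while an author with an empty bio is always classified as a non-IIITian, however closely they are associated with the college.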

The results from the topic modelling showed that there did indeed seem to be a discrepancy between the kinds of topics that were being answered by IIITians and those being answered by Non-IIITians.

We note that for the questions answered by non-IIITians, a majority of topics revolve around entrance examinations like JEE and UGEE and preparation for them, along with GSoC selections, coding culture and placements.

As for the topics answered by IIITians, there are of course commonalities, but there are also some topics we don't get to see answered by non-IIITians much, such as questions about teaching assistantship, research load and hostels. These topics are pretty specific to the students on campus so it makes sense that these do not see answers from non-IIITians.

Answered vs Unanswered

Additionally, we noted that there was a slight difference in the most common topics in the answered vs the unanswered questions. We initially hypothesized that perhaps there was some intrinsic difference in the topics that were answered vs the ones that remained unanswered. However, unlike the previous case we were unable to come up with a direct explanation.

We then decided to check the semantic similarity of the answered and the unanswered questions in order to gauge how different they truly were, for which we used MiniLM, a distilled language model.

From Unanswered to Answered through Semantic Similarity

We started by computing the semantic similarity between each unanswered question and the answered questions, and retrieving the most semantically similar answered questions. In doing this we discovered something interesting: for almost all the unanswered questions, we were able to retrieve answered questions that were very similar to the given unanswered question.
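The retrieval step is a top-k search by cosine similarity, sketched below. To keep the example self-contained, `embed` is a bag-of-words stand-in and not what we used: in the project the vectors came from MiniLM sentence embeddings. The function names and toy questions are likewise assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector.
    The project used MiniLM sentence embeddings here instead."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, candidates, k=3):
    """Return the k candidates most similar to the query as (score, text) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Toy questions, invented for illustration only.
answered = [
    "What is the hostel life like at IIIT Hyderabad?",
    "How do I prepare for UGEE?",
    "How are placements at IIIT Hyderabad?",
]
matches = most_similar("How should I prepare for the UGEE exam?", answered, k=2)
```

Swapping `embed` for a real sentence encoder leaves the rest unchanged. A learned embedding can also match paraphrases that share few words, which a bag-of-words vector cannot.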

 


In fact, in most situations, the answers to the retrieved questions could be used as-is to respond to the unanswered question, and in most other cases, the answers could be expected to contain information relevant to answering the original unanswered question. This leads us to believe that there is no inherent trait that distinguishes the questions which remain unanswered from those that do get answered.

Insights from Metadata Statistics

IIIT vs Non-IIIT

We then compared the number of unique answer authors, and the average number of upvotes and views per answer, for authors from IIIT versus those not from IIIT.

 

We note that there are far fewer unique authors of answers from IIIT than from outside IIIT, which is of course to be expected.

   

It is interesting that while the number of views for non-IIIT answers is a fair bit higher than the number of views for IIIT answers (which is, again, to be expected), the number of upvotes does not lag behind as much. In other words, IIIT answers receive more upvotes per view, which suggests that answers written by IIIT students are, in general, of higher quality.
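The comparison behind this observation can be sketched as a per-group summary with an upvotes-per-view ratio. The record keys (`author`, `views`, `upvotes`), the function name and the numbers below are all hypothetical and are not our measured results.

```python
from statistics import mean


def summarise(answers):
    """Summarise a group of answers (for example, all answers by IIITians).

    Each answer is assumed to be a dict with 'author', 'views' and 'upvotes'.
    """
    total_views = sum(a["views"] for a in answers)
    total_upvotes = sum(a["upvotes"] for a in answers)
    return {
        "unique_authors": len({a["author"] for a in answers}),
        "mean_views": mean(a["views"] for a in answers),
        "mean_upvotes": mean(a["upvotes"] for a in answers),
        # Normalising by views makes groups with different reach comparable.
        "upvotes_per_1000_views": (
            1000 * total_upvotes / total_views if total_views else 0.0
        ),
    }


# Made-up numbers, purely to show the calculation.
toy_group = [
    {"author": "a", "views": 1000, "upvotes": 50},
    {"author": "a", "views": 3000, "upvotes": 70},
    {"author": "b", "views": 2000, "upvotes": 60},
]
stats = summarise(toy_group)
```

Running `summarise` once for the IIIT group and once for the non-IIIT group gives the numbers to compare. A higher upvotes-per-view ratio for one group is what the quality argument above rests on.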

Conclusion

In conclusion, we note some interesting insights about the discourse on Quora about IIITH. There are differences in the kinds of topics answered by authors from IIIT and by authors from outside IIIT. It also initially seems as though there are differences in the kinds of topics that get answered and those that do not, but these differences are spurious, and no major topics are inherently left unanswered.

In terms of future work, we believe that using semantic similarity to retrieve similar answered questions and form an answer for unanswered questions utilizing these answers is a promising direction. This could also be done using sequence-to-sequence models in order to generate an answer for the question utilizing information from these similar answers.

We would like to thank Prof. PK and the TAs of the course for the opportunity and for their guidance throughout the project, which went through some refinement from its initial stages to what it has finally culminated in. Our team was made up of Yash Mehan, Pratyaksh Gautam, Harshit Gupta, Jatin Agarwala, and Bhaskar Hanuma (L-R in the below photo, with Prof. PK between Jatin and Hanuma).

The Team (QQQuora) with Prof. PK
