Data Challenge Winners
Winner (First Prize): Team ProFighters
Abhay Kumar Gupta and Mayank Deshwal, LNMIIT Jaipur.
This team employed a novel strategy, combining heuristic-based methods with traditional models in an ensemble framework to maximize the F1 score. They also experimented with Large Language Model (LLM) driven text summarization and embedding methods. They made a total of 169 submissions over the course of the competition.
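As an illustration only (not the team's actual code), the sketch below blends a hypothetical heuristic rule with a standard classifier on synthetic stand-in features, then tunes the decision threshold for F1:

```python
# Illustrative ensemble sketch: blend a heuristic rule with a learned model,
# then pick the probability threshold that maximizes validation F1.
# The synthetic data and the rule are placeholders, not the team's features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))                   # stand-in edge features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in labels
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
heuristic = (X_va[:, 0] > 0).astype(float)      # hypothetical hand-written rule
blend = 0.5 * model.predict_proba(X_va)[:, 1] + 0.5 * heuristic

# Sweep thresholds and keep the one with the best validation F1.
thresholds = np.linspace(0.1, 0.9, 17)
best_t = max(thresholds, key=lambda t: f1_score(y_va, (blend >= t).astype(int)))
print(f"best threshold {best_t:.2f}, "
      f"F1 {f1_score(y_va, (blend >= best_t).astype(int)):.3f}")
```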
Runner-up (Second Prize): Team AIED
Shabana K M (IIT Palakkad), Humaira Firdowse (kore.ai), and Sathvik Joel Kothapalli (IIT Madras)
This team leveraged video transcripts, extracting the 20 principal keywords per video with a TF-IDF methodology. A pre-trained Word2Vec model, fine-tuned on the cleaned transcripts, was then used to measure similarities among the keywords of paired videos. This foundation was then used to train models using FastAPI. They made a total of 121 submissions.
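A rough sketch of such a pipeline is below; the exact preprocessing and the pre-trained model are our assumptions. It extracts the top TF-IDF keywords per transcript and scores keyword-pair similarity with Word2Vec:

```python
# Sketch: top-k TF-IDF keywords per transcript, then Word2Vec similarity
# between the keyword sets of a video pair. Transcripts are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

transcripts = ["transcript of video one about fractions and division",
               "transcript of video two about decimals and fractions"]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(transcripts)
vocab = np.array(vec.get_feature_names_out())

def top_keywords(row, k=20):
    # Highest TF-IDF terms for one document (fewer if the vocabulary is small).
    idx = np.argsort(row.toarray().ravel())[::-1][:k]
    return vocab[idx].tolist()

keywords = [top_keywords(tfidf[i]) for i in range(tfidf.shape[0])]

# Train (or fine-tune a pre-trained) Word2Vec on the tokenized transcripts.
w2v = Word2Vec([t.lower().split() for t in transcripts],
               vector_size=100, min_count=1, seed=0)

def pair_similarity(kw_a, kw_b):
    # Mean pairwise cosine similarity between two keyword sets.
    sims = [w2v.wv.similarity(a, b) for a in kw_a for b in kw_b
            if a in w2v.wv and b in w2v.wv]
    return float(np.mean(sims)) if sims else 0.0

print(pair_similarity(keywords[0], keywords[1]))
```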
Honorable Mention 1: Team JeS0
Jyotirmaya Shivottam, Rucha Bhalchandra Joshi, and Dr. Subhankar Mishra
National Institute of Science Education and Research, Bhubaneswar.
This team solved the problem by using a GNN on top of extracted embeddings. They made a total of 17 submissions.
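As an illustration of that shape of solution (our sketch, not the team's code), a two-layer GCN in PyTorch Geometric can propagate precomputed video embeddings over the graph and score candidate edges from the endpoint representations:

```python
# Sketch: GNN over precomputed node embeddings, scoring directed video pairs.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class EdgeGNN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, hid_dim)
        self.score = torch.nn.Linear(2 * hid_dim, 1)

    def forward(self, x, edge_index, pairs):
        # x: (num_videos, in_dim) extracted embeddings; edge_index: known graph.
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Concatenate source and target states for each candidate pair (u, v).
        z = torch.cat([h[pairs[:, 0]], h[pairs[:, 1]]], dim=-1)
        return torch.sigmoid(self.score(z)).squeeze(-1)

# Usage: model = EdgeGNN(in_dim=768); probs = model(x, edge_index, candidate_pairs)
```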
Honorable Mention 2: Team CyPSi
Simran, Aashi Ansari, and Vajratiya Vajrobol
Cyber-Physical System Lab, Institute of Informatics and Communications, University of Delhi.
This team solved the problem using a BERT+biGRU based approach. They made a total of 99 submissions.
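A hedged sketch of the named architecture follows; the checkpoint, pairing scheme, and pooling are our assumptions. BERT token states for a transcript pair are pooled by a bidirectional GRU whose final states feed a two-class head:

```python
# Sketch: BERT encoder + bidirectional GRU pooling + binary classification head.
import torch
from transformers import AutoModel, AutoTokenizer

class BertBiGRU(torch.nn.Module):
    def __init__(self, name="bert-base-uncased", hid=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.gru = torch.nn.GRU(self.bert.config.hidden_size, hid,
                                batch_first=True, bidirectional=True)
        self.clf = torch.nn.Linear(2 * hid, 2)   # labels: 0 / 1

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        _, h = self.gru(states)               # h: (2, batch, hid)
        h = torch.cat([h[0], h[1]], dim=-1)   # concatenate both directions
        return self.clf(h)

# Usage: encode the pair "[CLS] A [SEP] B [SEP]" and classify.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("transcript of video A", "transcript of video B",
          return_tensors="pt", truncation=True)
logits = BertBiGRU()(enc["input_ids"], enc["attention_mask"])
```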
Data Challenge Overview
The data challenge is a new addition to the CODS-COMAD conference, run as a pre-conference competition. It aims to attract teams from the data science, data mining, and machine learning communities; we expect students, practitioners, and researchers to form teams and participate.
The challenge asks you to discover the right sequence of academic videos, addressing the problem of navigating online learning content. As the volume of educational material grows, students often struggle to find an optimal learning path.
Participate in the data challenge by completing this registration form. Organizers will reach out to you with a Kaggle link after the registration deadline.
Prizes
- First place (team prize): ₹ 50,000/-
- Second place (team prize): ₹ 25,000/-
(The second-place prize will be awarded only if the number of participating teams is 25 or more.)
In addition to the cash prizes, winners and runners-up will be presented with certificates during the conference award ceremony. They will also be given time in the conference program to present their solution(s). One member of the winning team will receive a complimentary registration to the conference. Student member(s) of the winning team will also be eligible for the CODS-COMAD student travel grant.
What's the Challenge?
- Dataset: VID-REQ, as detailed in "Auto-req: Automatic detection of pre-requisite dependencies between academic videos." Learn more at ACL proceedings link.
- Collaboration: A combined endeavor by Extramarks Education's data team and IIIT Delhi.
- Focus: Tailored to Indian K12 science and math, with content from Extramarks Education.
Why is this Important?
Unraveling pre-requisites in academic videos unites machine learning, deep learning, and NLP. This groundbreaking research emphasizes our dedication to democratizing education through AI, with a focus on the unique challenges of the Indian K12 context. Join us in solving this engaging problem, fostering innovation, and developing diverse, accessible solutions.
Dataset Information
- Availability & Confidentiality:
- The VID-REQ pre-requisite dataset is open for public access at DATA Link with README.
- Test set of 500 hand-labeled edges remains confidential.
- Dataset Composition:
- Original set contains 2,797 labeled pre-requisite edges.
- An additional 500 edges, labeled in-house, are included for this challenge.
- Evaluation Metric:
- Solutions will be evaluated using the weighted F1 score, which suits binary classification tasks with mild class imbalance.
- Licensing & Permissions:
- Dataset usage complies with the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License, per ACL guidelines.
- Dataset Explanation:
- Total of 2,797 hand-labeled edges: 1,684 non-prerequisite (label: 0) and 1,113 prerequisite (label: 1).
- Labels were assigned by experts after reviewing video pairs for prerequisite relationships.
- Each video comes with a distinct taxonomy, sourced from K12 textbooks via a PDF parser.
- Inter-annotator agreement, measured by Cohen's Kappa coefficient, is 0.64, indicating substantial agreement.
- Dataset Contents:
- Includes annotated video edges, titles, taxonomies, transcripts, binary labels, and our extracted features.
- User Opportunity:
- This dataset serves as a foundation for modeling prerequisite relations between videos. We offer our extracted features for beginners and raw data (such as video transcripts) for advanced users aiming to improve prediction accuracy; a minimal loading sketch follows this list.
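To get oriented with the files, the short sketch below loads the training data and checks the label balance. The file name "train_edges.csv" and the "label" column are our assumptions, not the official schema; consult the dataset README.

```python
# Sketch: inspect the training file. The file name and column names are
# hypothetical; check the README for the actual schema.
import pandas as pd

train = pd.read_csv("train_edges.csv")
print(train.columns.tolist())           # titles, taxonomies, extracted features, ...
print(train["label"].value_counts())    # expect roughly 1,684 zeros and 1,113 ones
```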
Tasks & Problem Overview
Dive into a challenge that tests your creativity, analytical prowess, and problem-solving skills. We're eager to see your innovative approaches to identifying the true pre-requisite edges among candidate edges.
Problem Statement
Is watching Video A crucial before diving into Video B? In essence, does Video A lay the foundational knowledge that enhances comprehension of Video B?
Your Mission
- Utilize the provided features and extract more from the public video transcripts if needed.
- Train a machine learning model to predict one of the two outcomes for each dataset edge:
- 0 - Not a prerequisite.
- 1 - Indeed a prerequisite.
Resources: We'll provide the labeled training data and the label-excluded test data. Once you've trained your model on our dataset, submit your predictions for the test data on the designated platform for evaluation.
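As a starting point, here is a minimal baseline sketch: train a simple classifier on the provided features and write a submission file. Every file and column name here ("train_edges.csv", the "feat_" prefix, "id") is hypothetical; adapt it to the released schema.

```python
# Baseline sketch: train on the provided features, predict 0/1 per test edge,
# and write a submission file. All names below are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train_edges.csv")
test = pd.read_csv("test_edges.csv")
feature_cols = [c for c in train.columns if c.startswith("feat_")]  # assumed prefix

clf = LogisticRegression(max_iter=1000).fit(train[feature_cols], train["label"])
pd.DataFrame({"id": test["id"],
              "label": clf.predict(test[feature_cols])}
             ).to_csv("submission.csv", index=False)
```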
Competition Platform
- Platform: Hosted on Kaggle, covering dataset release, leaderboard, registration, submission, and evaluation.
Compute Platform Guidelines
- Flexible Environment: Participants have the freedom to choose their computing setup.
- Cloud Platforms: Kaggle and Google Colab are available for participants, offering in-browser work without the need for advanced hardware.
- Personal Resources & Advanced Compute: Using personal setups or additional computational power is permitted. As a reference, prior training utilized:
- CPU: Intel i9-12900K
- GPU: Nvidia RTX 3060 12 GB
- Note: Higher-tier systems may expedite processing, but solution quality depends on creativity and model selection. Large models like LLMs or BERT may need high-performance GPUs. For setup assistance, contact the organizing team.
Evaluation and Leaderboard
- Metric: Submissions are evaluated using the weighted F1 score on a 500-edge test set that mirrors the training set's distribution (a small example follows this list).
- Real-World Impact: A live demo during the conference will showcase the algorithms' ability to identify prerequisite video relationships.
- Evaluation Objective: Blend of online and real-world assessments to ensure practical and theoretical excellence.
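For reference, the weighted F1 score averages per-class F1 values weighted by each class's support; the small scikit-learn example below shows the computation on illustrative labels.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1]   # illustrative labels only
y_pred = [0, 0, 1, 1, 1]
# Per-class F1 is computed, then averaged with each class weighted by support.
print(f1_score(y_true, y_pred, average="weighted"))  # 0.8 for this toy case
```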
Timeline and Implementation Details
| # | Milestone | Date & Details |
|---|-----------|----------------|
| 1 | Registration Deadline | August 31, 2023 (AoE) |
| 2 | Competition Start | September 1, 2023. Public release of training data (2,797 edges), competition guidelines, evaluation metrics, and submission format. |
| 3 | Test Data Release | October 1, 2023. Release of the 500-edge test set; predictions required without public labels. |
| 4 | Submission Deadline | November 30, 2023 |
| 5 | Winner Notification | December 10, 2023 |
| 6 | Official Announcement | December 12, 2023 |
| 7 | Presentation & Demo | January 6, 2024 |
Competition Rules & Regulations
Platform & Submissions
- Competition hosted on Kaggle with a public leaderboard.
- Maximum of 15 submissions per team per day.
Team Composition
- Teams may have up to 3 members.
- Composition is fixed once evaluation starts.
Evaluation & Dataset
- Weighted F1 metric for the public leaderboard.
- Redistribution of the dataset without citation is disallowed.
Papers & Results
- Top teams invited to submit system description papers.
- Data preprocessing allowed but must be documented.
- Public release of scores is mandatory; reporting them in papers is encouraged.
Liability & Integrity
- No guarantees on dataset accuracy by organizers.
- Incomplete or deceptive submissions can lead to withheld scores.
Open-Source & Public Tools Policy
- Open-source software and public APIs are encouraged.
- Abide by terms and service limits of these tools.
- Cite and acknowledge all used resources.
- Ensure code clarity and reproducibility.
- Avoid proprietary/confidential tools without permission.
Pre-Trained Model Policy
- Public models are allowed with citation.
- Proprietary models disallowed without owner consent.
- Always disclose pre-trained model usage and modifications.
- Models should be accessible and reproducible.
- If a model was trained on a similar dataset, state this and justify its use.
LLM Usage Policy
- LLMs like ChatGPT can be used for coding help.
- Encourage personal creativity in model and feature selection.
- Mention LLM usage in submission.
- Properly attribute any ChatGPT-assisted code.
Code Sharing Requirement
- Submit final solution source code.
- You retain ownership of your code.
- Include scripts and instructions for execution.
- Code should be well-documented with a clear README.
- The aim is for others to reproduce your solution.
Dispute & Arbitration Terms
- Disqualification: Organizers can disqualify participants for misconduct.
- Plagiarism: Immediate disqualification for plagiarism.
- Novelty: Original solutions encouraged; build upon existing ones with acknowledgment.
- Dispute Resolution: Organizer decisions are final.
- Arbitration: For unresolved disputes, organizers choose the location and arbitrator.
- Governing Law: Host country laws apply to competition terms.
By participating, you fully accept these terms.
Registration Link
Register here. Organizers will reach out to you with a Kaggle link after the registration deadline.
