As more companies implement artificial intelligence (AI) in the workplace, compiling accurate historical datasets to train AI models is becoming increasingly essential.
Although AI can offer significant advantages to companies developing cutting-edge digital products, they must focus on the quality of the data they use for training.
With good data, models can support many product engineering tasks, such as helping with research, coding, prototyping, and collaboration.
But with bad data? As the saying goes, “garbage in, garbage out.”
Techopedia spoke with Nitesh Bansal, CEO of R Systems, a product engineering and digital services company, about how datasets will drive a new level of customization in product experiences, improve product decision-making, and lead to the broader implementation of continuous feedback loops.
Will feedback loops help reinforce the reliability of AI? Where will this impact healthcare and financial services? Let’s explore the role of data curation and AI in 2025.
Key Takeaways
- AI success in 2025 hinges on accurate, well-curated datasets.
- Poor data quality leads to unreliable and biased AI outcomes.
- Continuous feedback loops and human-in-the-loop (HITL) approaches enhance AI reliability.
- Industries like healthcare and finance gain from diverse, curated data.
- Robust data governance ensures long-term accuracy and compliance.
About Nitesh Bansal
Nitesh Bansal is the Managing Director and Chief Executive Officer (CEO) of R Systems, with 25 years of experience in digital and product engineering services. He joined R Systems after a 23-year tenure at Infosys, where he held various leadership positions.
At Infosys, he was SVP and Global Head of Engineering Services with direct responsibilities over sales, delivery, consulting, and R&D.
The Importance of AI Data Curation
Q: Why will data curation for AI be a key task for companies over the next year?
A: Well-organized, clean, and thoughtfully constructed data has never been more important.
Companies that prioritize data curation will give themselves a competitive edge, enabling them to unlock accurate and valuable insights that can be used to drive business growth and optimization.
Conversely, those who employ poor data curation practices will see biased or inaccurate AI outputs, resulting in detrimental consequences for their businesses.
While organizations must assess their data curation processes in the near term, data curation and management need to remain a 'forever priority.' It shouldn't be something that is done once and never discussed or acted on again.
At this juncture, many companies don’t have the vast wealth of data required to build generalized AI models, so we’ll see curriculum learning used more commonly.
However, this approach will require heavily sanitized and structured data, hence the criticality of data curation going forward.
Q: How will the curation of comprehensive data sets drive a new level of customization?
A: In today’s business landscape, generalized large language model construction won’t make sense for many specific contexts, so having comprehensive datasets that power a model is like having the ultimate power for your business.
Extensive, comprehensive data sets allow organizations to customize products and services to meet the needs of their customers, which can ultimately lead to increased customer satisfaction, loyalty, and revenue growth.
For example, AI can analyze numerous data points from various sources, which can help a business uncover connections that might have gone unnoticed. Then, with continuous feedback loops, the business can refine and improve its processes, allowing for less disruption in future cases.
Regardless of the architectural path, the old adage of “garbage in, garbage out” is critical to remember when building or augmenting an AI model. Data cleansing, filtering, formatting, and prep are arguably the most important elements to a model’s quality.
Continuous training and augmentation require a high bar of governance on all inputs and are ultimately critical to building the models that most use cases will demand.
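To make the cleansing, filtering, and formatting step concrete, here is a minimal sketch of such a pass using pandas. The column names and rules are illustrative assumptions, not a prescribed pipeline:

```python
import pandas as pd

def clean_training_data(df: pd.DataFrame) -> pd.DataFrame:
    """One cleanse/filter/format pass before training (columns are illustrative)."""
    df = df.drop_duplicates()                         # drop exact duplicate records
    df = df.dropna(subset=["user_id", "event_time"])  # require key fields to be present
    df = df.assign(event_time=pd.to_datetime(df["event_time"], errors="coerce"))  # normalize timestamps
    df = df.dropna(subset=["event_time"])             # discard unparseable timestamps
    df = df.assign(channel=df["channel"].str.strip().str.lower())  # standardize categories
    return df.reset_index(drop=True)
```

Even a simple pass like this removes the duplicates, missing keys, and malformed values that would otherwise feed "garbage in" to the model.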
Keep the Data Moving Towards Accuracy
Q: How do continuous feedback loops reinforce the reliability of AI models?
A: Continuous feedback loops and ‘human-in-the-loop’ (HITL) are both important for generative AI systems to reinforce reliability and mitigate risks.
Just as MLOps has been critical to maintaining reliability and confidence for traditional machine learning models, AIOps must do the same for AI systems.
In AIOps, continuous integration and delivery practices (similar to DevOps) will need to become more fluid and real-time.
As new data flows in, everything must be checked or iterated to ensure the model is operating with the highest quality inputs.
Incorporating human oversight and feedback can improve the robustness, trustworthiness, and overall performance of generative AI models.
For example, in healthcare, AI-powered insurance pre-authorization assessment models can benefit both from feedback loops and HITL processes.
With these two sources of feedback, models can become more accurate and reliable, while also ensuring that decisions are fair and comply with the insurance company’s policies and other precedents.
This can lead to faster pre-authorization, which can ultimately have a significant impact on patient experience.
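As a rough illustration of how such HITL routing and feedback capture might look in code, the sketch below auto-decides high-confidence cases and queues uncertain ones for a reviewer; the threshold and record fields are assumptions, not a real pre-authorization system:

```python
CONFIDENCE_THRESHOLD = 0.85  # assumption: below this, route to a human reviewer

def triage(model_score: float) -> str:
    """Route a pre-authorization decision: auto-approve, auto-deny, or human review."""
    if model_score >= CONFIDENCE_THRESHOLD:
        return "auto-approve"
    if model_score <= 1 - CONFIDENCE_THRESHOLD:
        return "auto-deny"
    return "human-review"  # low-confidence cases go to the HITL queue

def record_feedback(feedback_log: list, claim_id: str, model_decision: str, human_decision: str) -> None:
    """Store reviewer outcomes so the model can be retrained on corrected labels."""
    feedback_log.append({
        "claim_id": claim_id,
        "model": model_decision,
        "human": human_decision,
        "disagreement": model_decision != human_decision,
    })
```

The logged disagreements are the raw material of the feedback loop: they show where the model and its human overseers diverge, and where retraining should focus.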
It’s Not Just About Your Customers: Keep Data Diverse For a Fuller Picture
Q: How can companies ensure their datasets are diverse enough to improve AI performance across various applications?
A: When training AI, it’s critical for organizations to understand how a subset or population of data within a company does or does not represent its targeted user base.
By prioritizing diverse datasets — whether seasonal, geographical, demographic, or something else — companies can reduce the need for manual intervention, minimize the risk of bias, and enable the models to handle unexpected scenarios and anomalies.
This frees humans to focus on higher-value tasks.
Furthermore, organizations must be vigilant in understanding and preventing bias within models. In many cases, they will need to rely on external data providers along with inferential modeling (in some use cases) to ensure they use a robust and representative training set in constructing their AI models and applications.
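One simple way to surface the representation gaps described above is to compare group shares in a training set against target population shares. In this sketch the grouping key, target shares, and tolerance are illustrative assumptions:

```python
from collections import Counter

def representation_gaps(records, key, target_shares, tolerance=0.05):
    """Return groups whose share in `records` deviates from the target
    population share by more than `tolerance` (key/targets are illustrative)."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            gaps[group] = {"actual": round(actual, 3), "target": target}
    return gaps
```

A check like this can run whenever new data is added, flagging under- or over-represented segments before they bias the model.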
Where AI Data Goes Next
Q: What industries stand to benefit most from curated datasets, and why?
A: Data curation is crucial across all sectors. However, industries with significant regulations, such as healthcare, insurance, finance, legal, and utilities (including telecom), may benefit greatly from including data from external sources to help ensure accurate, diverse data in their models.
For example, healthcare providers can use population health use cases to significantly improve clinical outcomes while reducing the cost of healthcare. Governments can use curated data to track economic growth, monitor public health trends, and optimize resource allocation.
By prioritizing data curation, organizations across all sectors can unlock the full potential of their data, drive innovation, and achieve business goals.
Q: What steps can help maintain data accuracy and relevance over time, especially in industries like healthcare and finance?
A: It is essential that organizations establish a robust data governance framework, which should include enhanced data monitoring, and a consistent application of governance protocols whenever new data is added to models.
The governance framework should define data standards, policies, and procedures, as well as implement data quality checks and validation rules. This could include monitoring and assessing data quality, identifying and addressing data discrepancies, and implementing data cleansing and normalization processes.
It is also important that these organizations ensure data security and compliance with regulatory requirements, regularly update and refresh data, and provide training and education on data management best practices.
Moreover, as model construction becomes faster and more accessible, ensuring that data analysts, data scientists, and other data professionals work from well-engineered data sets will prevent the inadvertent misuse and misapplication of data in models as complexity grows.
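The data quality checks and validation rules mentioned above might be sketched as a small set of per-record rules plus an aggregate report; the field names and thresholds here are illustrative assumptions, not a standard schema:

```python
def validate_record(record: dict) -> list:
    """Apply governance validation rules to one record; return a list of violations."""
    violations = []
    if not record.get("patient_id"):
        violations.append("missing patient_id")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        violations.append("age out of range")
    if record.get("amount", 0) < 0:
        violations.append("negative amount")
    return violations

def quality_report(records: list) -> dict:
    """Aggregate rule violations so data discrepancies can be monitored over time."""
    flagged = [(i, v) for i, r in enumerate(records) if (v := validate_record(r))]
    return {"total": len(records), "flagged": len(flagged), "details": flagged}
```

Running such a report on every data refresh turns the governance framework from a policy document into an enforced, measurable process.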
The Challenges & Evolution of AI Data
Q: What are the biggest challenges in curating datasets for AI applications, and how can they be addressed?
A: Major challenges for AI are delivery latency and the need for rigorous edge-case testing and bias prevention. Many AI applications require data access and delivery speeds that meet user expectations, so the engineered pipelines feeding them must be fast and reliable. Testing is particularly important to ward off adversarial attacks against the AI.
Another major challenge is getting industry buy-in on datasets to ensure accuracy and reflection of real-world scenarios. To resolve these challenges, it is imperative to collaborate with industry experts to validate dataset accuracy, implement robust data governance policies to ensure sensitive data is appropriately handled, and use techniques like data augmentation and transfer learning to customize publicly available data for company-specific use cases.
Additionally, using advanced data processing and analytics tools can help remove variance-causing factors, while investing in data quality and validation processes can ensure dataset consistency and accuracy.
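For readers unfamiliar with data augmentation, a trivial form of it is jittering numeric features to expand a limited company-specific dataset. This is only a sketch under assumed inputs (rows of numeric features, a small Gaussian noise scale), not a recommendation for any particular domain:

```python
import random

def augment_numeric(rows, noise_scale=0.02, copies=2, seed=0):
    """Expand a small numeric dataset by adding jittered copies of each row.

    Multiplies every feature by (1 + small Gaussian noise); parameters are illustrative.
    """
    rng = random.Random(seed)  # fixed seed keeps augmentation reproducible
    augmented = list(rows)     # keep the originals first
    for _ in range(copies):
        for row in rows:
            augmented.append([x * (1 + rng.gauss(0, noise_scale)) for x in row])
    return augmented
```

Real augmentation strategies vary widely by data type (text, images, time series), but the principle is the same: generate plausible variants so the model sees more of the space it will encounter in production.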
Q: How do you see the role of data curation evolving as AI technologies become more advanced?
A: As large language models and generative AI continue to advance, the importance of data curation will only continue to grow. As AI becomes more advanced, elements of master data management and governance will garner their own AI applications.
However, when it comes to data curation, we’ll see ‘partnered AI’ in the near-future. Data curation challenges require a deep, contextual understanding of how the data is sourced, what it means, and how it’s used.
This requires a level of Artificial General Intelligence (AGI) in smaller-scale models that just hasn't been seen yet, so it'll be data professionals, potentially augmented by AI, executing data curation.
Overall, organizations that prioritize data curation will be better equipped to harness the power of these technologies in the future.
Simultaneously, we’ll see industries align on data quality certifications in their domains when using external datasets managed by industry-specific collaborations.