What 'Clean Data' Actually Means for a Business Like Yours

Founders get told to clean their data before using AI. No one explains what that means when you don't have a data team. Here's the practical test.

Your CRM has 4,000 contacts. Some fields are blank. The company names are formatted three different ways. You've been told you need "clean data" before you touch AI, and you have no idea what that means or where to start. So the project sits.

The advice isn't wrong. It's just aimed at companies with data engineers and governance committees. At your scale, it translates into nothing actionable.

The part everyone skips

A 2023 study published on ScienceDirect manipulated data quality across multiple datasets and measured the effect on model performance. Missing values and inconsistent formats damaged accuracy more than moderate random noise did, especially in tree-based models and linear regression, the algorithmic families underlying most off-the-shelf AI tools SMBs actually use. The implication is specific: standardizing the fields that matter for a particular decision produces more reliable outputs than scrubbing every record in your database.

This is not a license to ignore data quality. It reframes what you're doing. You're not cleaning a database. You're preparing specific fields for a specific question.

The fitness test, not the cleanliness standard

A review available on PubMed Central examined how different fields define "data readiness" and synthesized a framework with four components: quality, availability, interoperability, and provenance. The finding matters here because it shifts readiness from an abstract property of the data itself to a relational one. Data is ready when it is available, linkable across your systems, and documented well enough to trace its origin for a specific use case. Not in general. For this decision.

Zhao et al., in an arXiv preprint on data quality in supervised learning, showed that quality problems vary by dimension — erroneous labels, missing features, outliers, and distribution shifts each affect models differently depending on the task. A blanket cleaning pass does not resolve this. Targeting does.

So the practical test for a founder without a data team is: pick the decision you want AI to help with, identify the fields that feed that decision, and ask three questions. Are those fields complete, with no systematic gaps? Are they formatted consistently — dates as dates, currencies as currencies, categories spelled the same way? Do you know where the data came from and whether the collection method changed over the period you're looking at?
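The first two questions can be roughed out in a few lines of Python. This is a minimal sketch, not a standard tool: the field names (company, signup_date, plan), the sample rows, and the assumption that dates should look like YYYY-MM-DD are all hypothetical stand-ins for whatever feeds your own decision.

```python
from datetime import datetime

def _text(value):
    """Treat None as blank and strip surrounding whitespace."""
    return "" if value is None else str(value).strip()

def completeness(records, fields):
    """Share of records with a non-blank value, per field."""
    total = len(records)
    return {
        f: (sum(1 for r in records if _text(r.get(f))) / total if total else 0.0)
        for f in fields
    }

def unparseable_dates(records, field, fmt="%Y-%m-%d"):
    """Non-blank values that do not match the expected date format (an assumption)."""
    bad = []
    for r in records:
        value = _text(r.get(field))
        if not value:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad

def spelling_variants(records, field):
    """Group values that differ only by case, spacing, or punctuation."""
    groups = {}
    for r in records:
        value = _text(r.get(field))
        if value:
            key = "".join(ch for ch in value.lower() if ch.isalnum())
            groups.setdefault(key, set()).add(value)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Hypothetical sample rows standing in for a CRM export.
contacts = [
    {"company": "Acme Inc.", "signup_date": "2023-01-15", "plan": "Pro"},
    {"company": "ACME INC", "signup_date": "15/01/2023", "plan": "pro"},
    {"company": "Beta LLC", "signup_date": "", "plan": "Basic"},
    {"company": "acme inc", "signup_date": "2023-03-02", "plan": None},
]

print(completeness(contacts, ["company", "signup_date", "plan"]))
print(unparseable_dates(contacts, "signup_date"))
print(spelling_variants(contacts, "company"))
```

Note that this only looks at the fields you name. That is the point: the check is scoped to one decision, not the whole database.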

That last question is the one most founders skip, and it's the one that causes the worst failures.

Where this test breaks

Here is the strongest version of the objection: a fitness test that checks completeness and format consistency within the records that exist cannot detect systematic gaps in which records exist.

Say you're building a churn model. Your support ticket data looks complete. What you don't know is that tickets from your first 18 months of operation were logged in a spreadsheet before your CRM was adopted, and only 40% were migrated. Customers from that period have artificially low ticket counts in your training data. The model learns that low ticket counts predict retention. It's wrong, and it will be confidently wrong for a predictable segment of your customers.

The NIST AI Risk Management Framework and ISO/IEC 5259 standards both flag representativeness and bias checks as minimum conditions for trustworthy AI, applicable to small firms. The PubMed Central framework treats provenance not as documentation paperwork but as the mechanism that makes errors detectable after deployment.

The fitness test catches this, but only if the question about where the data came from means asking whether collection methods were consistent across the time period your model will train on, not just confirming who entered the data last Tuesday.
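One rough way to look for this kind of gap is to count records per month and flag months that sit far below the typical volume. The sketch below is illustrative only: the 50% threshold, the function names, and the sample ticket dates are assumptions, and a flagged month is a prompt to go ask what happened, not proof that records are missing. A young business will also have naturally low early volume.

```python
from collections import Counter
from statistics import median

def month_range(first, last):
    """All 'YYYY-MM' labels from first to last, inclusive."""
    y, m = map(int, first.split("-"))
    end = tuple(map(int, last.split("-")))
    months = []
    while (y, m) <= end:
        months.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def monthly_counts(dates):
    """Records per month from ISO date strings, including months with zero records."""
    raw = Counter(d[:7] for d in dates)
    if not raw:
        return {}
    return {month: raw.get(month, 0) for month in month_range(min(raw), max(raw))}

def suspicious_months(counts, ratio=0.5):
    """Months whose volume is below ratio * the median monthly volume."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(month for month, n in counts.items() if n < ratio * typical)

# Hypothetical ticket creation dates: thin early months, steady later ones.
ticket_dates = (
    ["2022-01-10"] * 2
    + ["2022-03-05"] * 3
    + ["2022-04-12"] * 10
    + ["2022-05-20"] * 12
    + ["2022-06-02"] * 11
    + ["2022-07-15"] * 10
)

counts = monthly_counts(ticket_dates)
print(suspicious_months(counts))
```

If the flagged months line up with a period before a system change, that is the segment to exclude from training or to backfill before you trust the model's output for those customers.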

What to actually do Monday morning

Pull the data for the one decision you want to automate. Check completeness in the fields that drive that decision. Standardize the formats. Then trace each field back to its origin and ask whether anything changed in how that data was collected over time — a new CRM, a staff handover, a process change.
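The tracing step does not need special software. A short provenance log, one entry per field, is enough to answer "did anything change during the period I'm training on?" The structure below is a hypothetical sketch: the class name, the field names, and the logged events are made up for illustration, and a spreadsheet with the same three columns would work just as well.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FieldProvenance:
    """Where a field comes from and when its collection method changed."""
    name: str
    source: str
    changes: list = field(default_factory=list)  # list of (date, description)

def changes_in_window(log, start, end):
    """Fields whose collection method changed between start and end, inclusive."""
    hits = {}
    for entry in log:
        inside = [(when, what) for when, what in entry.changes if start <= when <= end]
        if inside:
            hits[entry.name] = inside
    return hits

# Hypothetical log for a churn use case.
provenance_log = [
    FieldProvenance(
        name="ticket_count",
        source="CRM support module",
        changes=[(date(2021, 6, 1), "Moved from spreadsheet to CRM, partial migration")],
    ),
    FieldProvenance(name="plan", source="Billing system"),
]

# Training window you intend to use (also an assumption).
flagged = changes_in_window(provenance_log, date(2020, 1, 1), date(2023, 12, 31))
for name, events in flagged.items():
    for when, what in events:
        print(f"{name}: {when.isoformat()} {what}")
```

Any field that shows up here deserves a closer look before it goes into a model: either shorten the training window to start after the change, or confirm the data on both sides of it is comparable.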

That's it. That's the data readiness check. Not a data audit. Not a governance framework. A fitness test for one use case, run by you, in an afternoon.