Machine learning (ML) has recently enabled many modeling tasks in design, manufacturing, and condition monitoring due to its unparalleled learning ability using existing data. Data have become the limiting factor when implementing ML in industry. However, there is no systematic investigation on how data quality can be assessed and improved for ML-based design and manufacturing. The aim of this survey is to uncover the data challenges in this domain and review the techniques used to resolve them. To establish the background for the subsequent analysis, crucial data terminologies in ML-based modeling are reviewed and categorized into data acquisition, management, analysis, and utilization. Thereafter, the concepts and frameworks established to evaluate data quality and imbalance, including data quality assessment, data readiness, information quality, data biases, fairness, and diversity, are further investigated. The root causes and types of data challenges, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity, are identified and summarized. Methods to improve data quality and mitigate data imbalance and their applications in this domain are reviewed. This literature review focuses on two promising methods: data augmentation and active learning. The strengths, limitations, and applicability of the surveyed techniques are illustrated. The trends of data augmentation and active learning are discussed with respect to their applications, data types, and approaches. Based on this discussion, future directions for data quality improvement and data imbalance mitigation in this domain are identified.