
Digital Humanities Lab


Applied Data Science webinar: Exploiting the Syntactical Structure of the Data for Data Cleaning



Patterns (or regex-like expressions) are widely used to discover meta-knowledge in a given domain. For example, a ‘Year’ attribute should contain only four digits, so a value like “1980-” is erroneous. Modeling the syntactic structure of a given attribute requires developing a suitable syntactic representation of its values. In structured datasets, many attributes, such as ZIP codes and phone numbers, have a set of dominant syntactic structures, with few (if any) values that follow syntactically different patterns.
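As a rough illustration of the idea, the sketch below maps each value to a simple pattern and flags values whose pattern is not dominant. The character classes (`D` for digit, `A` for letter), the function names, and the `min_share` threshold are assumptions made for this example; they are not the representation used in the talk.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Map each character to a class symbol: D = digit, A = letter.

    Any other character (hyphen, space, ...) is kept as-is.
    This encoding is a simplifying assumption for illustration.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def syntactic_outliers(values, min_share=0.5):
    """Return the values whose pattern covers less than min_share of the column."""
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    dominant = {p for p, c in counts.items() if c / len(values) >= min_share}
    return [v for v, p in zip(values, patterns) if p not in dominant]


years = ["1980", "1995", "2003", "1980-", "2011"]
print(syntactic_outliers(years))  # ['1980-']
```

Here four of the five values share the pattern `DDDD`, so “1980-” (pattern `DDDD-`) is reported as a syntactic outlier.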

In this talk, I will discuss different techniques for utilizing the syntactic structure of the data values to detect syntactic outliers, especially those that represent disguised missing values (DMVs), and to discover a new type of dependency between the attributes in a given data table. While syntactic outliers are represented by non-dominating patterns of the data, DMVs have different patterns, since they are fake values that replace missing values. In most cases, DMVs have special structures, such as repeated patterns (‘111-111-1111’ for a phone number), or are values that do not fit with the majority of the values (‘-1’ for the age). DMVs occur frequently, which makes traditional outlier detectors unable to detect such erroneous values. Discovering a set of dominating syntactic patterns that represents the majority of the values in a given attribute, while ignoring the repetition of the DMVs, can help in detecting the DMVs. Moreover, patterns can be leveraged to model the dependencies (or meta-knowledge) between partial values across columns. For instance, in an employee ID ‘F-9-107’, ‘F’ determines the finance department. We call this type of dependency a pattern functional dependency (PFD); PFDs can also be used for discovering more errors that cannot be detected otherwise.
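The two ideas can be sketched in a few lines, under strong simplifying assumptions. In the example below, `dmv_candidates` counts patterns over distinct values so that a frequently repeated placeholder cannot become dominant, and `pfd_violations` checks whether an ID prefix determines the department by majority vote. All function names, thresholds, and sample data are hypothetical; this is not the speaker's algorithm, and it does not cover DMVs such as ‘111-111-1111’ that share the pattern of valid values.

```python
from collections import Counter, defaultdict


def to_pattern(value):
    """D = digit, A = letter, other characters kept (illustrative encoding)."""
    return "".join("D" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def dmv_candidates(values, min_share=0.5):
    """Flag values whose pattern is not dominant among the DISTINCT values.

    Counting each distinct value once ignores how often a placeholder repeats.
    """
    distinct = set(values)
    counts = Counter(to_pattern(v) for v in distinct)
    dominant = {p for p, c in counts.items() if c / len(distinct) >= min_share}
    return sorted(v for v in distinct if to_pattern(v) not in dominant)


def pfd_violations(rows, prefix_len=1):
    """Return (id, department) rows that disagree with the majority
    department observed for their ID prefix."""
    by_prefix = defaultdict(Counter)
    for emp_id, dept in rows:
        by_prefix[emp_id[:prefix_len]][dept] += 1
    violations = []
    for emp_id, dept in rows:
        majority = by_prefix[emp_id[:prefix_len]].most_common(1)[0][0]
        if dept != majority:
            violations.append((emp_id, dept))
    return violations


# '-1' is the most frequent value, yet it is the only one with pattern '-D'.
ages = ["34", "-1", "27", "-1", "-1", "45", "-1", "61", "-1", "52", "-1"]
print(dmv_candidates(ages))  # ['-1']

# Hypothetical employee table: the prefix 'F' should imply 'Finance'.
employees = [
    ("F-9-107", "Finance"),
    ("F-3-221", "Finance"),
    ("H-2-050", "HR"),
    ("F-7-019", "HR"),
    ("H-5-310", "HR"),
]
print(pfd_violations(employees))  # [('F-7-019', 'HR')]
```

In the age column, ‘-1’ accounts for more than half of the rows, so a purely frequency-based detector would treat it as normal; counting over distinct values exposes it as the only value that does not follow the `DD` pattern.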


About the speaker

Hakim Qahtan is an assistant professor at the Data Intensive System Group, Information and Computing Sciences Department. Before joining Utrecht University, he worked as a Postdoctoral Researcher at Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University (HBKU), Qatar (2016-2019). Dr. Qahtan earned his PhD degree from the Machine Intelligence & kNowledge Engineering (MINE) Lab at King Abdullah University of Science and Technology (KAUST) (2016). He completed his B.S. and M.S. in Computer Science at Cairo University, Egypt, and King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia, respectively. He worked as a teaching assistant at Taiz University, Yemen, and as a lecturer at KFUPM, Saudi Arabia. His current research focuses on data cleaning, data stream mining, and explainable machine learning. Dr. Qahtan has published many papers in top conferences and journals, including VLDB, KDD, AAAI, ICDE, TKDE, and CIKM.