Data factories for data products

Jun 11, 2020

Today, data is broken. If “time is money” (Franklin, 1749), then data analytics is a disaster, because more than 80% of the time budget of a data analytics project is spent on data processing and refinement, not on results (see “Data is broken,” link). Figure 1 illustrates the problem with a metaphor: At the executive level, “data” tends to be thought of as a product that can be grabbed off the shelf, like bottled water. For a data scientist, however, it is more like a puddle of raw water. Nobody would drink from a puddle. The water would have to be “treated,” analyzed for harmful ingredients and nutrients, labeled, and bottled to an exact quantity. This is not cheap. The magnitude of the price difference between raw and bottled water speaks for itself: The price of one liter of branded, bottled water at retail buys one thousand liters of raw water in Berlin. These numbers make it crystal clear how much value is created on the way from raw to bottled water.

The gist of our analogy is this: Data needs refinement, and to economize on the process it takes, no surprise, a factory. We use factories for everything else. Henry Ford invented the modern factory and, with it, automotive as an industry. His factory turned auto-making from a hand-made affair into mass production through automation (Womack et al. 1990). Now we need data factories to stop wasting analytics time on data and to make data analytics scalable.

Figure 1: The water analogy of data refinement

Don’t reinvent the wheel: Create a factory

How would one go about creating a data factory? Raw data goes in, and a refined data product comes out. But what happens in between? With cars, we have a rough idea: Sheet metal goes in, is cut, stamped, welded together and painted, and many thousands of parts are added, such as an engine, transmission, seats and lights (Clark and Fujimoto 1991). So, inside a factory, there is a lineup of different departments, such as stamping operations, white body welding, the paint shop and final assembly. What are the different departments in a data factory? One solution is to consult the knowledge base that has evolved from research and industry best practice worldwide, complement it with additional field work and experiments, and then vet it with other scientists at conferences and through publication. Figure 2 presents a result of this approach.

Figure 2: A data factory conceptualization

Toward a data factory framework

Figure 2 depicts the distinct activities required to turn raw data into a data product that can be either fed into an artificial intelligence application or traded. It reveals the different “departments” of a data factory. In a nutshell, raw data rights must be verified before any data can be ingested or harvested (rights, licensing, user consent). Then data ought to be harmonized, properly labeled, or tagged for it to be made discoverable through a catalog of categories and search engines (harmonization). Furthermore, it needs to be scored to provide some indication of quality, because without it any subsequent analytics is pointless – “garbage in, garbage out” (GIGO, quality scoring). Finally, governance mechanisms are required to ensure that data can be exchanged while data sovereignty is maintained for each data provider.
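The sequence of “departments” described above can be sketched as a simple chain of stages. The following is a minimal illustration under stated assumptions, not the framework's actual implementation: the `Dataset` class, the stage functions, the `consent` field, the completeness-based score and the policy string are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical container that travels through the factory."""
    name: str
    records: list
    rights_verified: bool = False
    tags: list = field(default_factory=list)
    quality_score: Optional[float] = None
    access_policy: Optional[str] = None


def verify_rights(ds: Dataset) -> Dataset:
    # Assumption: every record carries a truthy "consent" flag.
    if not all(r.get("consent") for r in ds.records):
        raise PermissionError(f"{ds.name}: missing consent, refusing to ingest")
    ds.rights_verified = True
    return ds


def harmonize(ds: Dataset) -> Dataset:
    # Normalize field names and derive catalog tags from them.
    ds.records = [{k.strip().lower(): v for k, v in r.items()} for r in ds.records]
    ds.tags = sorted({k for r in ds.records for k in r})
    return ds


def score_quality(ds: Dataset) -> Dataset:
    # Illustrative score: share of non-missing values across all cells.
    cells = [v for r in ds.records for v in r.values()]
    ds.quality_score = sum(v is not None for v in cells) / len(cells) if cells else 0.0
    return ds


def apply_governance(ds: Dataset) -> Dataset:
    # Placeholder policy label; real governance would be far richer.
    ds.access_policy = "provider-approved-consumers-only"
    return ds


# Rights first, governance last, refinement in between.
PIPELINE = [verify_rights, harmonize, score_quality, apply_governance]


def run_factory(ds: Dataset) -> Dataset:
    for stage in PIPELINE:
        ds = stage(ds)
    return ds


raw = Dataset("sensor_feed", [
    {"consent": True, "Temp_C": 21.5},
    {"consent": True, "Temp_C": None},
])
product = run_factory(raw)
print(product.tags, product.quality_score, product.access_policy)
```

Placing `verify_rights` first means a dataset without verified rights never reaches the later stages, which mirrors the ordering in the text.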

In-depth case studies and systematic literature reviews of 250+ articles

The conceptualization of our data factory framework builds on an established foundation. It has evolved in a multi-step investigation from (a) in-depth case study analysis in the literature and (b) systematic literature reviews (SLRs) to (c) our own observations building a data factory in practice. Figure 3 summarizes developments in the literature as the foundation of our refinements.

Figure 3: Evolution of the framework foundation in the literature

Pääkkönen & Pakkala (2015) present a first analysis of internal “data factories” using in-depth case studies of big data pioneers. The authors dissect data operations at pioneers like Facebook and Netflix and establish that data preparation at these companies is a ‘process,’ that is, “a series of actions or steps” (Webster), analogous to a ‘factory’ as “a set of […] facilities for […] making wares […] by machinery” (Webster). This stepwise decomposition conforms with the evolution of information system capabilities toward modularization and flexibility, as seen with the emergence of Web services, for example with Microsoft’s .NET framework (Schlueter Langdon 2006, 2003b). Specifically, Pääkkönen & Pakkala reveal three major and common steps of data refinement: (i) data extraction, loading and pre-processing; (ii) data processing; and (iii) data transformation. Because of our focus on data refinement, we explicitly exclude any analysis, analytics and visualization steps.

This in-depth, case-study-based assessment of big data pioneers is corroborated by extensive SLRs. The first study includes 227 articles from peer-reviewed journals extracted from the Scopus database for 1996-2015 (Sivarajah et al. 2017). It confirms three steps in the data preparation process (again, excluding data analysis, analytics and visualization steps): data intake (acquisition and warehousing), processing (cleansing), and transformation (aggregation and integration; p. 273). A second, more recent study considered 49 articles from three different branches of the literature (Stieglitz et al. 2018): computer science (ACM and IEEE), information systems (AIS), and the social sciences (ScienceDirect). This second SLR adds data quality as another distinct and common step in the data refinement process (Stieglitz et al. 2018, Figure 3, p. 165). These four steps, as illustrated in Figure 3, provide the foundation to which we add our observations from constructing a real-world data factory.

Figure 4: Data factory demonstration and best practice during WEB@ICIS 2019

First data factories are emerging

Leading information and communication technology pioneers like Microsoft, IBM and Deutsche Telekom (see “T-System is #1”, link) are already offering advanced data refinement tools. Microsoft offers “Azure Data Factory” as a feature of its Azure cloud, which is raising concerns in Europe that hyperscalers are already expanding their dominance beyond data storage (Clemons et al. 2019). In the Azure Data Factory, users can “create and schedule data-driven workflows (called pipelines) that can ingest data from disparate … [sources and] … move the data as needed to a centralized location for subsequent processing” (Microsoft 2018).
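The general idea of a pipeline that ingests data from disparate sources and moves it to one central location can be sketched in a few lines. This is a generic illustration using only the Python standard library; it is not Azure Data Factory's API, and the source names, reader functions and sample payloads are hypothetical.

```python
import csv
import io
import json


def ingest_csv(text: str) -> list:
    # Parse CSV text into a list of row dictionaries.
    return list(csv.DictReader(io.StringIO(text)))


def ingest_json(text: str) -> list:
    # Parse a JSON array of objects into a list of dictionaries.
    return json.loads(text)


def run_pipeline(sources: dict) -> list:
    """Read every source with its own reader and collect the records
    in one central list, keeping track of where each record came from."""
    central_store = []
    for name, (reader, payload) in sources.items():
        for record in reader(payload):
            central_store.append({"source": name, **record})
    return central_store


# Hypothetical sources with different formats.
sources = {
    "crm_export": (ingest_csv, "customer_id,city\n1,Berlin\n2,Munich\n"),
    "web_events": (ingest_json, '[{"customer_id": "1", "event": "login"}]'),
}

central = run_pipeline(sources)
for row in central:
    print(row)
```

Tagging each record with its `source` keeps provenance visible after the move, which later steps such as rights verification and quality scoring would likely need.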

Deutsche Telekom launched its Telekom Data Intelligence Hub (DIH) as a minimum viable product in Germany in late 2018 at https://dih.telekom.com (Deutsche Telekom 2018). Telekom DIH is an integrated data refinement, analytics and exchange platform-as-a-service offering for B2B customers. Based on this practical experience, we propose a slightly more granular decomposition of data refinement activities to explicitly recognize two issues that have emerged as critical concerns in practice and that require additional data processing steps: data privacy and data sovereignty. Both issues had already surfaced in the SLR by Sivarajah et al., but only as “management challenges,” not explicitly as data refinement steps (p. 274). However, since 2018 the General Data Protection Regulation (GDPR) has mandated data privacy protection throughout the European Union, which necessitates additional data refinement steps, such as consent management, anonymization and user data deletion (European Commission 2018). Similarly, data sovereignty has evolved from a hygiene factor into a key element of a company’s business strategy (e.g., Otto 2011). And Europe is not alone: In 2018, California became the first U.S. state with a comprehensive consumer privacy law when it enacted the California Consumer Privacy Act of 2018 (CCPA), which took effect in 2020 (Cal. Civ. Code §§ 1798.100-1798.199). The CCPA not only grants California residents new rights regarding their personal information; more importantly, it imposes data protection duties on entities conducting business in California.
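The three privacy-related refinement steps named above (consent management, anonymization and user data deletion) might look roughly as follows. This is a simplified sketch, not the Telekom DIH implementation and not legal guidance: the record fields, the salt and the function names are hypothetical, and the second step uses salted hashing, which is pseudonymization rather than full anonymization.

```python
import hashlib


def delete_user(records: list, user_id: str) -> list:
    # User data deletion: drop every record of the requesting user.
    return [r for r in records if r["user_id"] != user_id]


def filter_by_consent(records: list, consented_ids: set) -> list:
    # Consent management: keep only records of users who opted in.
    return [r for r in records if r["user_id"] in consented_ids]


def pseudonymize(records: list, salt: str) -> list:
    # Replace direct identifiers with a salted hash token.
    # Note: this is pseudonymization, a weaker guarantee than anonymization.
    out = []
    for r in records:
        digest = hashlib.sha256((salt + r["user_id"]).encode("utf-8")).hexdigest()
        cleaned = {k: v for k, v in r.items() if k not in ("user_id", "email")}
        cleaned["user_token"] = digest[:16]
        out.append(cleaned)
    return out


# Hypothetical raw records.
records = [
    {"user_id": "u1", "email": "a@example.com", "km_driven": 120},
    {"user_id": "u2", "email": "b@example.com", "km_driven": 75},
    {"user_id": "u3", "email": "c@example.com", "km_driven": 40},
]

after_deletion = delete_user(records, "u3")           # u3 asked to be deleted
consented = filter_by_consent(after_deletion, {"u1"})  # only u1 opted in
product = pseudonymize(consented, salt="demo-salt")
print(product)
```

Each function returns a new list instead of modifying its input, so the raw records stay untouched until a deliberate decision is made to discard them.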

Legal issues may not be so important from a pure computer science and software engineering perspective. For information systems, they certainly matter, because any information system and its architecture would have to correspond with business requirements (Schlueter Langdon 2003a). Therefore, we propose to bookend the data refinement process by data rights management at the beginning to ensure any refinement is compliant with legal requirements in the first place and by data governance at the end to safeguard data sovereignty. Figure 5 illustrates the expanded data factory framework.

Figure 5: Presenting the extended data factory framework at 2019 Data Natives

Internal and outsourced data factories

A data factory can be internal or external: It can be operated internally within the IT function (e.g., under a Chief Information Officer, CIO) or outside of it (e.g., under a Chief Marketing Officer, CMO), or it can be a separate, standalone business entirely. First standalone data factory service offerings have already arrived with large enterprises, for example in Microsoft’s Azure cloud and in the Telekom Data Intelligence Hub. Internal data factories may provide a way forward to extract value from data lakes and convert cost into business advantage by creating data products for internal operations and applications, such as anomaly detection, or into top-line growth with the sale of data products to third parties. Finally, combining a data factory with a data exchange may be an elegant way for large multi-divisional companies to quickly enable and promote a data-centric organization across functional or departmental silos.

This article is based on a longer piece by Schlueter Langdon, C., and R. Sikora. 2019. Creating a Data Factory for Data Products. Proceedings of the 18th Workshop of E-Business at ICIS Munich (December)

References

Clark, K. B., and T. Fujimoto. 1991. Product Development Performance: Strategy, Organization, and Management in the World Auto Industry. Harvard Business School Press: Boston, MA

Clemons, E.K., H. Krcmar, S. Hermes, and J. Choi. 2019. American Domination of the Net: A Preliminary Ethnographic Exploration of Causes, Economic Implications for Europe, and Future Prospects. 52nd Hawaii International Conference on System Sciences (HICSS), DOI: 10.24251/HICSS.2019.737

Deutsche Telekom. 2019. At a glance, link

Deutsche Telekom. 2018. Creating value: Deutsche Telekom makes data available as a raw material. Press Release (September 27), link

European Commission. 2018. General Data Protection Regulation, link

Microsoft. 2018. Introduction to Azure Data Factory (November 11), link

Miller, R. 2019. AWS and Microsoft reap most of the benefits of expanding cloud market. Techcrunch (February 1st), link

Otto, B. 2011. Organizing data governance: Findings from the telecommunications industry and consequences for large service providers. Communications of the AIS 29(1): 45-66

Pääkkönen, P., and D. Pakkala. 2015. Reference Architecture and Classification of Technologies, Products and Services for Big Data Systems. Big Data Research 2: 166–186

Schlueter Langdon, C., and R. Sikora. 2019. Creating a Data Factory for Data Products. Proceedings of the 18th Workshop of E-Business at ICIS Munich (December)

Schlueter Langdon, C. 2006. Designing Information Systems Capabilities to Create Business Value: A Theoretical Conceptualization of the Role of Flexibility and Integration. Journal of Database Management 17(3) (July-September): 1-18

Schlueter Langdon, C. 2003a. Information Systems Architecture Styles and Business Interaction Patterns: Toward Theoretic Correspondence. Journal of Information Systems and E-Business 1(3): 283-304

Schlueter Langdon, C. 2003b. The State of Web Services. IEEE Computer 36(7): 93-95

Sivarajah, U., M.M. Kamal, Z. Irani, and V. Weerakkody. 2017. Critical analysis of Big Data challenges and analytical methods. Journal of Business Research 70: 263–286

Stieglitz, S., M. Mirbabaie, B. Ross, and C. Neuberger. 2018. Social media analytics – Challenges in topic discovery, data collection, and data preparation. International Journal of Information Management 39: 156–168

Womack, J., D. Jones, and D. Roos. 1990. The Machine That Changed the World: The Story of Lean Production. Free Press, Simon & Schuster: New York, NY