Accountability in Big Data Systems

NII Shonan Meeting:

@ Shonan Village Center, March 12-15, 2018


  • Bill Howe, University of Washington
  • Ichiro Satoh, National Institute of Informatics
  • Julia Stoyanovich, Drexel University


Description of the meeting

Motivation and background

Our society is undeniably data-driven. Governments, companies, and universities are now routinely extracting signals from large, heterogeneous, and noisy datasets — Big Data — to inform decisions around criminal sentencing, college admissions, hiring and promotion, loans and financial services, and more.
This technology holds the promise to improve people’s lives, accelerate scientific discovery and innovation, and enable broader participation. Yet, if not used responsibly, Big Data can increase economic inequality, reinforce systemic bias, polarize rather than
democratize, and deny opportunities rather than improve access. Worse yet, all of this can be
done in a way that is non-transparent and defies public scrutiny.

Big Data impacts society at all levels: individuals, communities, nations. Because of the central role played by this technology, it must be designed and deployed responsibly — in accordance with the ethical and moral norms that govern our society, and adhering to the
appropriate legal and policy frameworks. And as journalists [1], legal and policy scholars [2], and governments [3, 4, 5] are calling for greater insight into data-driven algorithmic processes, there is an urgent need to provide scalable systems support for transparent and accountable data
management and analysis. This in turn calls for a broad and coordinated agenda in computer science systems research that focuses on incorporating legal and ethical requirements into the core design, in addition to typical performance requirements. This agenda can only be
formulated in dialog with legal, policy and social science scholars, and executed by collaborative teams of data management, systems and big data researchers.
The societal, legal and policy implications of data-driven algorithmic decision-making transcend national borders. The challenges raised by this technology must therefore be tackled through an international effort.
The proposed Shonan meeting will be a follow-up to the Dagstuhl Seminar “Data, Responsibly” [6], which took place in July 2016. The Shonan meeting will place particular emphasis on the transparency, interpretability, and accountability of data-driven algorithmic decision making.
The high-level goals of this meeting are twofold: (1) to survey the state of the art and push forward a computer science systems agenda in digital accountability; and (2) to establish new international collaborations and strengthen existing ones in this important emerging area of research and practice, particularly among US/EU participants and their Japanese counterparts.

Research questions
Transparency and accountability of data-driven algorithms and pipelines mean different things to different stakeholders, which include researchers, developers, vendors, competitors, regulators, auditors, and end users. We propose to tackle several dimensions of accountability, making explicit the audience that the proposed techniques will serve.

Auditing and Verification
Internet platforms and other entities that derive competitive advantage from data and results of analysis have strong reasons to behave in a data-responsible manner. To gain or retain the trust of their user base, and to adhere to responsible practices in the face of regulation
or legal action, these entities need to demonstrate that their processes are indeed responsible. Two scenarios are useful in this context. The first is external auditing, where code remains a black box, and its behavior is interrogated by observing inputs and outputs. The second is
internal refinement, where some or all details of the algorithm, and of the data on which it is trained, are available. Recent work on auditing and verification includes cryptographic techniques such as zero-knowledge proofs [7], audits [8], and reverse engineering [9]. However, significant computer science research is still needed to support (1) auditing mechanisms that can be used to certify that a black-box data analysis method or pipeline is data-responsible; (2) enhancement mechanisms that can improve data-responsible properties of a method; and (3) fundamental systems approaches that make these mechanisms scalable, robust, and reusable.
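The external-auditing scenario can be illustrated with a minimal sketch: probe a model whose code remains a black box by feeding it inputs and observing only its outputs, and estimate per-group outcome rates (the so-called disparate-impact or "80% rule" check). The model, feature names, threshold, and data below are all hypothetical, invented for illustration; they do not describe any real system.

```python
import random

# Hypothetical black-box model: the auditor can only observe inputs and
# outputs. Internally it (unfairly) boosts the score of one group.
def black_box_score(applicant):
    return applicant["income"] * 0.7 + (10 if applicant["group"] == "A" else 0)

def external_audit(model, applicants, threshold=50):
    """Estimate the positive-outcome rate per group from I/O observations only."""
    counts = {}
    for a in applicants:
        approved = model(a) >= threshold
        g = a["group"]
        ok, total = counts.get(g, (0, 0))
        counts[g] = (ok + approved, total + 1)
    return {g: ok / total for g, (ok, total) in counts.items()}

random.seed(0)
applicants = [{"group": random.choice("AB"), "income": random.uniform(40, 80)}
              for _ in range(1000)]
rates = external_audit(black_box_score, applicants)
ratio = min(rates.values()) / max(rates.values())
print(rates, ratio)  # a disparate-impact ratio below 0.8 would flag the model
```

Note that the auditor never inspects the scoring function itself; the bias surfaces purely from observed input/output behavior, which is exactly what makes black-box auditing feasible and also what limits it.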

Interpretability and Explainability
Recent scholarship on algorithmic accountability has devalued transparency in favor of verification. The claim is that because algorithmic processes are extremely complex or secret (due to trade secrets or privacy concerns), we need to rely on retrospective checks to ensure
that the algorithm is performing as promised. Auditing and verification are valid methods of interrogation, but they have an important shortcoming, in that they put the burden of inquiry exclusively on individuals for whom interrogation may be expensive and ultimately fruitless. The burden instead should fall more squarely on the least cost avoider, which will be the vendor who is in a better position to reveal how the algorithm works, even if only partially.
In a recent article [10] we argued that syntactic transparency (revealing the code of the algorithm and the data on which it operates) is neither necessary nor sufficient to achieve accountability. Instead, methods are needed to generate human-interpretable descriptions of
how an algorithm works (overall or on a subset of the inputs) and how a specific decision was made. Significant computer science research is needed to (1) define properties of an explanation, such as stability, understandability, soundness, and consistency; (2) generate
explanations with these properties for a wide range of data-driven algorithms and multi-step pipelines; (3) develop presentation mechanisms that are appropriate for different stakeholders; and (4) make these methods available within the scope of a general-purpose big data system, while judiciously trading off sophisticated functionality for efficiency and robustness.
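One simple family of such explanation methods can be sketched as perturbation-based local explanation: attribute a single decision to individual features by ablating each feature (replacing it with a baseline value) and observing how much the model's score changes. The model, weights, and feature names below are hypothetical illustrations, not a method endorsed in the text.

```python
# Hypothetical linear decision model standing in for an opaque scorer.
def score(features):
    weights = {"income": 0.5, "debt": -0.8, "years_employed": 0.3}
    return sum(weights[f] * v for f, v in features.items())

def explain(model, instance, baseline=0.0):
    """Attribute one decision to features by ablating each feature in turn."""
    base = model(instance)
    contributions = {}
    for f in instance:
        perturbed = dict(instance)
        perturbed[f] = baseline
        # How much of the score disappears when this feature is removed?
        contributions[f] = base - model(perturbed)
    return contributions

applicant = {"income": 60, "debt": 20, "years_employed": 5}
print(explain(score, applicant))
# each value is the feature's contribution relative to the baseline
```

For a linear model these attributions recover the weighted feature values exactly; for non-linear models the same probe yields only a local approximation, which is one reason properties like stability and soundness of explanations need careful definition.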

Validity and Reproducibility
As society (and science) becomes increasingly data-driven, systems support to ensure the veracity of the models used to make decisions becomes increasingly critical. Our field is currently failing in this regard: Multiple studies have demonstrated poor reproducibility across
multiple fields of science, including psychology [11], clinical medicine [12], and economics [13]. What role can systems research play in improving reproducibility? A number of projects aim to facilitate sharing code and data related to an analysis (e.g., ReproZip [14]), but these
projects are only relevant at the final stage of publication. We aim to open a discussion about upstream support for reliable statistical analysis, including:

  • detecting and controlling for bias and errors in the data;
  • tracking experimental hypotheses to manage publication bias;
  • enforcing corrections for multiple hypothesis testing;
  • ensuring peer review between domain experts and methodological experts;
  • generating artificial samples of data to prevent overuse of limited datasets; and
  • checking the assumptions on which statistical methods are based.
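Among the activities just listed, enforcing corrections for multiple hypothesis testing is easy to make concrete. The sketch below implements the standard Bonferroni and Benjamini–Hochberg procedures that a system could apply automatically when many hypotheses are tested against one dataset; the p-values are made-up illustrative inputs.

```python
def bonferroni(p_values, alpha=0.05):
    """Control the family-wise error rate: reject H0 only when p <= alpha/m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate: find the largest k such that
    p_(k) <= (k/m) * alpha and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(bonferroni(p))          # only the very smallest p-value survives
print(benjamini_hochberg(p))  # FDR control is less conservative
```

A system that tracks every hypothesis tested against a dataset could apply such corrections transparently, rather than relying on each analyst to remember to do so.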

Broader impacts
Beyond the research questions above, this workshop will serve to direct attention of the research community to the timely topic of digital transparency and accountability. This line of research offers ample opportunities for exciting algorithmic contributions, and for making a
tangible positive effect on society. An explicit goal of the workshop is to initiate and strengthen international collaboration on the topic of digital transparency and accountability.


