To Type or Not to Type: Quantifying Detectable Bugs in JavaScript


Overview

JavaScript is growing explosively and is now used in large mature projects even outside the web domain. JavaScript is also a dynamically typed language for which static type systems, notably Facebook's Flow and Microsoft's TypeScript, have been written. What benefits do these static type systems provide?

Leveraging JavaScript project histories, we select a fixed bug and check out the code just prior to the fix. We manually add type annotations to the buggy code and test whether Flow and TypeScript report an error on the buggy code, thereby possibly prompting a developer to fix the bug before its public release. We then report the proportion of bugs on which these type systems reported an error. Evaluating static type systems against public bugs, which have survived testing and review, is conservative: it understates their effectiveness at detecting bugs during private development, not to mention their other benefits such as facilitating code search/completion and serving as documentation. Despite this uneven playing field, our central finding is that both static type systems find an important percentage of public bugs: both Flow 0.30 and TypeScript 2.0 successfully detect 15%!


Methodology

The fact that long-running JavaScript projects have extensive version histories, coupled with the existence of static type systems that support gradual typing and can be applied to JavaScript programs with few modifications, enables us to conduct a “time-travel” experiment.


Corpus Collection


For each project, we extract all bug ids from the issue tracker, then search for them in the project's commit log messages; concurrently, we extract all commit SHAs from the version history and search for them in the project's issues.

We seek to construct a corpus of bugs that is representative and sufficiently large to support statistical inference. As always, achieving representativeness is the main difficulty, which we address by uniform sampling. We cannot sample bugs directly, but rather commits that we must classify into fixes and non-fixes. Why fixes? Because a fix is often labelled as such, its parent is almost certainly buggy and it identifies the region in the parent that a developer deemed relevant to the bug. To identify bug-fixing commits, we consider only projects that use issue trackers, then we look for bug report references in commit messages and commit ids (SHAs) in bug reports. This heuristic is not only noisy; it must also contend with bias in project selection and bias introduced by missing links.
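The two-way linking heuristic described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the `Commit` and `Issue` shapes, the helper names, the `#123` reference convention, and the 7-to-40 hex digit SHA pattern are all assumptions made for the example.

```typescript
// Hypothetical record shapes; real data would come from a project's
// version history and issue tracker.
interface Commit { sha: string; message: string; }
interface Issue { id: number; body: string; }

// Issue ids referenced as "#123" in a commit message (assumed convention).
function issueRefs(message: string): number[] {
  const refs: number[] = [];
  const re = /#(\d+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(message)) !== null) {
    refs.push(Number(m[1]));
  }
  return refs;
}

// Hex strings of 7 to 40 characters that look like full or abbreviated SHAs.
// Like the heuristic itself, this is noisy: ordinary hex-like words match too.
function shaRefs(text: string): string[] {
  const found = text.match(/\b[0-9a-f]{7,40}\b/g);
  return found === null ? [] : found;
}

// Candidate (commit SHA, issue id) links found in either direction.
function linkFixes(commits: Commit[], issues: Issue[]): Array<[string, number]> {
  const links: Array<[string, number]> = [];
  const seen: { [key: string]: boolean } = {};
  const known: { [id: number]: boolean } = {};
  for (const issue of issues) known[issue.id] = true;

  const add = (sha: string, id: number): void => {
    const key = sha + ":" + id;
    if (!seen[key]) { seen[key] = true; links.push([sha, id]); }
  };

  // Direction 1: bug ids mentioned in commit messages.
  for (const c of commits) {
    for (const id of issueRefs(c.message)) {
      if (known[id]) add(c.sha, id);
    }
  }
  // Direction 2: commit SHAs (possibly abbreviated) mentioned in issues.
  for (const issue of issues) {
    for (const ref of shaRefs(issue.body)) {
      for (const c of commits) {
        if (c.sha.indexOf(ref) === 0) add(c.sha, issue.id);
      }
    }
  }
  return links;
}
```

Each link found this way is only a candidate bug fix; as noted above, the commits still have to be classified into fixes and non-fixes.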

We used the standard sample size computation to determine the sample size. On 19/08/2015, there were 3,910,969 closed issues for JavaScript projects on GitHub, which we used to approximate the population. We set the confidence level to 95% and the confidence interval (margin of error) to 5%. The calculation showed that a sample of 384 bugs was sufficient for the experiment, which we rounded up to 400.
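The figure of 384 can be reproduced with a short calculation. The sketch below assumes that "the standard sample size computation" means Cochran's formula with a finite population correction, a worst-case proportion of 0.5, and a z-score of 1.96 for 95% confidence; the function name and parameters are illustrative.

```typescript
// Sample size for estimating a proportion, with finite population correction.
// Assumptions: p = 0.5 (worst case) and z = 1.96 (95% confidence level).
function sampleSize(population: number, z: number, margin: number, p: number = 0.5): number {
  const n0 = (z * z * p * (1 - p)) / (margin * margin); // infinite-population size
  return n0 / (1 + (n0 - 1) / population);              // finite population correction
}

const n = sampleSize(3910969, 1.96, 0.05);
console.log(Math.round(n)); // 384
```

With a population this large, the correction barely changes the uncorrected value of about 384.16, so the result depends almost entirely on the confidence level and the margin of error.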

Please click here for the list of the 400 studied bugs.



Annotation


Procedure 1 defines our manual type annotation procedure. Because we annotate each bug twice, once for each type system, our experiment is a within-subject repeated measure experiment. As such, learning effects may come into play: knowledge gained from creating the annotations for one type checker may speed up annotating for the other. To mitigate learning effects, for a bug \(b\) in \(B\), we first pick a type system \(ts\) from Flow and TypeScript uniformly at random, so that, on average, each type system is the first one applied to the same number of bugs. If \(b\) is not type related “beyond a shadow of a doubt”, for example because it stems from a misunderstanding of the specification, we label it as undetectable under \(ts\) and categorise it, skipping the annotation process. Otherwise, we read the bug report and the fix to identify the patched region, the set of lexical scopes the fix changes. Combining human comprehension and a JavaScript read–eval–print loop (REPL), e.g. Node.js, we attempt to understand the intended behavior of the program and add consistent and minimal annotations that cause \(ts\) to report an error on \(b\). We are experts neither in type systems nor in any project in our corpus. To compensate, we have striven to be conservative: we annotate variables whose types are difficult to infer with any. Then we type check the resulting program. We ignore type errors that we consider unrelated to this goal. We repeat this process until we confirm that \(b\) is \(ts\)-detectable because \(ts\) reports an error within the patched region and the added annotations are consistent, or we deem that \(b\) is not \(ts\)-detectable, or we exceed the time budget \(M\).
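To make the notion of a \(ts\)-detectable bug concrete, here is a small hypothetical example in TypeScript syntax. It is not taken from the corpus; `OrderForm` and `addShipping` are invented for illustration. Annotating the function's parameters is enough for the checker to reject the buggy call, which is left as a comment so that the snippet compiles.

```typescript
// Raw text from an input field, hence a string (hypothetical shape).
interface OrderForm { price: string; }

// Minimal annotations added to the patched region: both operands must be numbers.
function addShipping(price: number, shipping: number): number {
  return price + shipping;
}

const form: OrderForm = { price: "10" };

// Buggy call as it might look just before the fix. Unannotated JavaScript would
// silently evaluate "10" + 5 to the string "105"; with the annotations above,
// the checker rejects it because a string is not assignable to a number:
//   addShipping(form.price, 5);

// The fix converts the field before use, and the program type checks.
const total = addShipping(Number(form.price), 5);
console.log(total); // 15
```

The annotations are confined to the function the fix touches, which mirrors the goal of adding consistent and minimal annotations inside the patched region.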

Results

What we have discovered



Histogram of TC-Detectable Bugs

Histogram of Undetectability

Annotation Effort


What developers say


"A scientifically true effect can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed."


Assessment

The complete assessment on the 400 bugs.

Annotation Facilitator

The tool used to facilitate the annotation procedure.

Case Study

Some bugs are worth closer inspection. Based on three criteria we select bugs for further manual assessment: ones whose TypeScript- or Flow-detectability differs among the three "raters", ones whose TypeScript- and Flow-detectability differ, and ones that are TypeScript-detectable under version 2.0 but not under 1.8. Please click here for a detailed discussion.


Disagreement

Of the 80 uniformly-sampled bugs that we used to calculate inter-rater agreement, each rater needed to make 160 decisions in total, 80 for TypeScript-preventability and 80 for Flow-preventability. 138 of these 160 decisions were unanimously labelled. We define a strong disagreement as a disagreement in which one rater deems the bug preventable while another deems it unpreventable. Of the 22 disagreements, 12 are strong.

Flow vs. TypeScript

Though sharing a similar annotation syntax, Flow and TypeScript differ in some dimensions. We compared Flow and TypeScript in terms of their ability to potentially prevent public bugs had they been used when those bugs were introduced. Flow and TypeScript both catch a nontrivial portion of public bugs. In our dataset, the bugs they can prevent largely overlap, with 6 exceptions: 3 bugs are only Flow-preventable and 3 only TypeScript-preventable.

TypeScript 1.8 vs. 2.0

TypeScript 2.0 was released during this study, giving us the opportunity to measure the effectiveness of its handling of null and undefined. Prior to 2.0, all types were nullable in TypeScript. TypeScript 2.0 added the compiler option --strictNullChecks, which makes most types nonnullable. We reviewed our corpus and found that 22 bugs, an increase of 58%, are preventable under TypeScript 2.0 but not under TypeScript 1.8.
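The kind of bug that --strictNullChecks exposes can be illustrated with a hypothetical lookup function (the names below are invented, not drawn from the corpus). Without the option, undefined is accepted wherever a string is expected, so the buggy call in the comment compiles; with it, the checker reports that the value is possibly undefined.

```typescript
// Returns undefined when the key is absent; the return type says so explicitly.
function lookup(table: { [key: string]: string }, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Buggy version, rejected only when --strictNullChecks is enabled, because the
// result of lookup may be undefined:
//   function shout(table: { [key: string]: string }, key: string): string {
//     return lookup(table, key).toUpperCase();
//   }

// Fixed version: the undefined case is handled before the value is used.
function shout(table: { [key: string]: string }, key: string): string {
  const value = lookup(table, key);
  return value === undefined ? "" : value.toUpperCase();
}

console.log(shout({ greeting: "hi" }, "greeting")); // HI
```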

PUBLICATIONS


Slides: keynote, pdf

CONTACT


Contact us and we'll get back as soon as possible.

University College London, Gower Street, London, UK

z.gao.12 (at) ucl.ac.uk