Academic publishers play a major role in the dissemination of scholarly information. As a society, we need to be able to rely on these publishers to provide information securely, accurately, and with content integrity. We also want to ensure that our personal information (e.g., a site password) is secure, and scholarly publishers have a responsibility to the community to protect our data.
I've been surprised how often scholarly publishers' pages are served over plain HTTP, which (unlike HTTPS) neither encrypts data in transit nor protects it from being modified along the way. Implementing HTTPS has become much easier with initiatives such as Let's Encrypt and Certbot (but I recognize legacy systems can make it more difficult).
As a scholar, I am concerned with content integrity. This is essential when conducting systematic reviews, meta-analyses, or simply reading research and planning new studies. I am also concerned about the security of my and my colleagues' login credentials. Given how often passwords are reused, login pages served over HTTP put the credentials of people visiting scholarly publishers' websites at risk, on those sites and elsewhere.
In order to hold the disseminators of scholarly information accountable, we need to be able to recognize whether this is a widespread issue and where improvements can be made. For example, Science magazine, one of the most acclaimed journals, apparently considers HTTP good enough and makes no statement about why it has not upgraded. Many other publishers are forgoing the same responsibility towards their users.
Publishers that take a negligent or dismissive stance on this issue undervalue both their users' security and their own role in presenting content accurately. In the long run, it will hurt the publishers too: Chrome is starting to label pages as not secure if they use HTTP. Given that users have no choice but to use these sites if the articles are copyrighted, and there is no other way to share the materials, publishers have a significant responsibility to the extended scholarly community.
The https-checker project
The https-checker project aims to address this problem by checking the websites of publishers indexed by CrossRef, the main metadata store for scholarly publications, to get a sense of the overall scope. There are approximately 10,000 members (i.e., publishers) in CrossRef, of which ~7,500 are actively publishing (meaning they published in 2017).
This project began by canvassing the scholarly publishing landscape for publishers that use (and don't use) HTTPS. By identifying the publishers that serve the largest body of work in an insecure way, we can start a dialogue with them to improve the situation. Previously, I had a constructive dialogue with Collabra, which upgraded its webpage to default to HTTPS after I contacted them.
The https-checker project just completed its initial data collection phase. By using pshtt, an open source HTTPS testing tool, and a set of calls to the CrossRef API, it was relatively easy to script an initial canvass of the publishers' security practices.
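To give a sense of what the CrossRef side of such a script might look like, here is a minimal sketch and not the project's actual code. It assumes the public `https://api.crossref.org/members` endpoint with `rows` and `offset` paging, and it assumes the response fields `primary-name` and `counts.current-dois`; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Crossref REST API member listing.
CROSSREF_MEMBERS_URL = "https://api.crossref.org/members"


def members_page_url(offset, rows=100):
    """Build the URL for one page of the member list."""
    query = urllib.parse.urlencode({"rows": rows, "offset": offset})
    return f"{CROSSREF_MEMBERS_URL}?{query}"


def parse_members(payload):
    """Reduce a member-list response to the fields a scan would need.

    The field names ("primary-name", "counts", "current-dois") are
    assumptions about the response shape, not taken from the article.
    """
    members = []
    for item in payload["message"]["items"]:
        counts = item.get("counts") or {}
        members.append({
            "id": item["id"],
            "name": item.get("primary-name", ""),
            "current_dois": counts.get("current-dois", 0),
        })
    return members


def fetch_members_page(offset, rows=100):
    """Download and parse one page (this performs a network request)."""
    url = members_page_url(offset, rows)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_members(json.load(response))
```

Keeping the parsing separate from the download makes it possible to check the parsing against a saved response without touching the network. How each member is mapped to a website for pshtt to scan, and how "active" is defined, is left out here because the text above does not specify it.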
Active, default HTTPS | Active, not default HTTPS | Inactive |
---|---|---|
1,923 | 5,575 | 2,513 |
At first glance, 26% of all 7,498 active publishers default to HTTPS. In general, estimates of websites that default to HTTPS range between 10% and 44%. In other words, scholarly publishers seem to be securing their web pages at similar rates when compared to the overall population of websites. Even so, this does not waive their responsibility to improve the situation.
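For readers who want to check the arithmetic, the 26% figure follows directly from the counts in the table. The variable names below are only illustrative.

```python
# Counts copied from the table above.
active_default_https = 1923
active_not_default_https = 5575
inactive = 2513

active_total = active_default_https + active_not_default_https
all_members = active_total + inactive
share_default_https = active_default_https / active_total

print(f"Active publishers: {active_total:,}")              # Active publishers: 7,498
print(f"All members: {all_members:,}")                     # All members: 10,011
print(f"Default to HTTPS: {share_default_https:.1%}")      # Default to HTTPS: 25.6%
```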
Running a basic logistic regression to predict whether publishers default their pages to HTTPS shows that large publishers are more likely to do so. Publication counts range from one to 1,104,607 per publisher. Our analysis shows a publisher with only 100 publications since 2017 is estimated to have a 27% chance of using HTTPS by default, whereas one with 1,000 publications since 2017 is estimated to have a 32% chance of using HTTPS by default. Given the average number of publications, the estimated probability of a publisher using HTTPS by default is 31% (median: 25%).
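To make the shape of such a model concrete, the sketch below backs out the intercept and slope of a one-predictor logistic curve from the two estimates quoted above. It assumes the predictor is the raw publication count; the fitted model may well have used a transformed count or other terms, so these coefficients are illustrative and are not the project's fitted values.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


def coefficients_from_two_points(x1, p1, x2, p2):
    """Solve logit(p) = intercept + slope * x through two (x, p) points."""
    slope = (logit(p2) - logit(p1)) / (x2 - x1)
    intercept = logit(p1) - slope * x1
    return intercept, slope


def predict(x, intercept, slope):
    """Estimated probability of defaulting to HTTPS at publication count x."""
    return inv_logit(intercept + slope * x)


# The two estimates reported in the text: 27% at 100 and 32% at 1,000.
intercept, slope = coefficients_from_two_points(100, 0.27, 1000, 0.32)

for count in (100, 500, 1000):
    print(count, round(predict(count, intercept, slope), 3))
```

Because the curve is forced through the two reported points, it reproduces them exactly. Values in between or beyond are only as good as the raw-count assumption.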
Moving forward
The https-checker project's next step is opening a dialogue with some of the largest publishers that do not provide HTTPS by default. These conversations will be tracked to get a sense of how active and willing they are to improve the security of their users and the content they serve.
The HTTPS scan can go much deeper than just checking whether a page defaults to HTTPS and uncover other practices that can further improve security. For example, sending an HTTP Strict Transport Security (HSTS) header, and having the domain added to browsers' HSTS preload lists, can help mitigate man-in-the-middle attacks. By using in-depth assessments, we can identify ways websites that already default to HTTPS can further improve their security.
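As one example of what a deeper check could look at, the sketch below parses a `Strict-Transport-Security` header value and tests whether it looks ready for preloading. The function names are hypothetical, and the readiness rule (a `max-age` of at least one year plus the `includeSubDomains` and `preload` directives) is an assumption based on the commonly cited preload list requirements, not something the scan described here is known to apply.

```python
# Assumption: one year in seconds as the minimum max-age for preloading.
ONE_YEAR_SECONDS = 31536000


def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a small dict."""
    policy = {"max_age": None, "include_subdomains": False, "preload": False}
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        name = name.strip().lower()  # directive names are case-insensitive
        value = value.strip().strip('"')
        if name == "max-age" and value.isdigit():
            policy["max_age"] = int(value)
        elif name == "includesubdomains":
            policy["include_subdomains"] = True
        elif name == "preload":
            policy["preload"] = True
    return policy


def looks_preload_ready(policy):
    """True if the policy meets the assumed preload requirements."""
    return bool(
        policy["max_age"] is not None
        and policy["max_age"] >= ONE_YEAR_SECONDS
        and policy["include_subdomains"]
        and policy["preload"]
    )


strong = parse_hsts("max-age=31536000; includeSubDomains; preload")
weak = parse_hsts("max-age=300")
print(looks_preload_ready(strong), looks_preload_ready(weak))
```

A site that already defaults to HTTPS but sends a short `max-age`, or omits the header entirely, would be a natural candidate for the follow-up conversations described above.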
Increased use of secure practices in content transfer on web pages is key to a secure web and truthful information, which ultimately affects users that rely on information published on the internet. Given that misinformation is spreading, it seems like this is low-hanging fruit for 2018.