Web Scraping in Academic Research: Public Is Not the Same as Collectable

That a datum is visible on an open page is a statement about access, not about permission and still less about ethics. Web scraping in academic research lives in exactly that confusion: the ease of automatically collecting millions of public records makes the technically accessible look freely usable. It is not. The line between public data and ethically collectable data is drawn by consent, terms of use, and risk of harm, and none of those limits appears in the code that downloads the page. A reviewer who receives a study built on scraping asks, before any result, how the data were obtained and why that was judged legitimate.

The first limit technical accessibility ignores is contractual. Fiesler and colleagues (2020)² analyzed the terms of service of more than a hundred social platforms and found prohibitions on automated collection that are common but ambiguous, inconsistent, and context-poor. The fact that a page loads in a browser does not mean the site authorizes a robot to traverse it, and a terms-of-use violation has been enough to trigger litigation. The second limit is the expectation of privacy. Someone posting in a forum about a health condition writes for a specific community, not for an indexed research corpus archived indefinitely. Treating that text as free data ignores the contextual integrity of the information. The third limit is harm: re-identification, exposure, and the circulation of attributable quotations from people who never knew they were in a study.

These limits are known but poorly observed, and that has been measured. Taylor and Pagliari (2018)⁴ reviewed 156 health studies using social media data and found that only 50 mentioned any ethical consideration, usually only to state that approval had been obtained or waived. The problem is therefore neither theoretical nor rare: it is the standard practice of treating public data as tacit permission. Takats and colleagues (2022)¹ quantified that habit in a systematic review of 367 studies that used publicly available Twitter data.

Ethics safeguards reported across 367 studies using public Twitter data, from the Takats and colleagues (2022) review. Roughly a third anonymized, sought ethics approval, or discussed the topic; informed consent of the account holder was attempted in zero studies.

The numbers carry the argument. Among 367 published studies, 36% anonymized and paraphrased the messages, 32% sought ethics-board approval, 30% did not even discuss ethical considerations, and informed consent of the message authors was attempted in none of them. The zero for consent is not a detail: it confirms that, in practice, an entire field treated public availability as consent. What that gap produces when carried to the extreme is documented. Chiauzzi and Wicks (2019)⁵ report four cases of researchers scraping a patient community in violation of its terms of use and of research ethics, with retractions and corrections as the outcome. The harm is not hypothetical, and the defense that the data were public did not prevent it.

The practical consequence is not to abandon scraping but to conduct it defensibly. Mancosu and Vegetti (2020)³ show the path with a concrete example: collect only public information and pseudonymize user identifiers with a one-way hash, keeping records analyzable without exposing the people behind them, in line with privacy regulation. Boegershausen and colleagues (2022)⁶, reviewing more than three hundred articles using web data, propose handling the technical, legal, and ethical questions together rather than as an afterthought, across the three stages of collection: selecting the source, designing the collection, and extracting the data. The validity of the data and the legitimacy of the collection are decided in the same choices.
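The pseudonymization step can be sketched in Python. This is an illustrative sketch, not the procedure Mancosu and Vegetti publish: it swaps a plain hash for a keyed hash (HMAC-SHA256), on the assumption that a secret project key makes it harder to recover an identifier by hashing candidate usernames. The record fields and the names make_pseudonymizer and project_key are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(secret_key: bytes):
    """Return a function that maps a user identifier to a stable pseudonym.

    Assumption: a keyed one-way hash (HMAC-SHA256) is used, so the mapping
    cannot be reproduced without the project key.
    """
    def pseudonymize(user_id: str) -> str:
        digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
        # Truncated to 16 hex characters for readability in the dataset.
        return digest.hexdigest()[:16]

    return pseudonymize


# Hypothetical project key: generated once, stored apart from the data,
# and destroyed when re-linking is no longer needed.
project_key = secrets.token_bytes(32)
pseudonymize = make_pseudonymizer(project_key)

# Hypothetical scraped records; only the fields the question requires are kept.
records = [
    {"user_id": "alice_1984", "text_length": 212},
    {"user_id": "bob_the_builder", "text_length": 57},
    {"user_id": "alice_1984", "text_length": 140},
]

# The raw identifier is replaced, never stored next to the pseudonym.
pseudonymized = [
    {"author": pseudonymize(r["user_id"]), "text_length": r["text_length"]}
    for r in records
]
```

The same account always maps to the same pseudonym, so per-author analysis remains possible. Pseudonymizing identifiers does not protect against re-identification through the text itself, which is why verbatim quotations need separate handling.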

The confusion between public and collectable is made worse because the legal ground is unstable and offers false comfort. Court decisions on scraping vary across jurisdictions and shift over time, and when platforms ended open API access, researchers were pushed back toward scraping just as the rules grew less clear. Leaning on the phrase that the data were public is fragile in two ways: it does not settle the ethical question, which is independent of legality, and it does not even secure the legal one, which depends on where and when the study is judged. The question that survives these swings is not whether collection was possible, but whether it respected the people and the systems on the other side of the request.

The operating rule can be written as a checklist to run before the first request. Read the source’s robots.txt and terms of use and respect what they forbid, because contractual permission is distinct from technical access. Submit the protocol to an ethics board when there are people behind the data, rather than assuming a waiver. Minimize collection to what the question requires, pseudonymize identifiers, and never reproduce quotations that allow re-identification. Rate-limit requests so as not to overload the source’s server, which is a third party affected by the collection. And document each of these decisions in the methods section: what separates legitimate scraping from indefensible scraping is not the tool but the recorded justification that the line between the public and the collectable was recognized and respected.
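Two items in that sequence, honoring robots.txt and rate-limiting requests, can be checked in code. The sketch below uses Python's standard urllib.robotparser and runs offline: the robots.txt content, the example.org URLs, the user agent name, and the fake fetch function are all hypothetical, and a real collector would load the live robots.txt and plug in an actual HTTP client.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical crawler name

# Hypothetical robots.txt, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robots_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_requests(urls, policy, user_agent, default_delay=2.0):
    """Split URLs into allowed and skipped, and pick the delay to respect."""
    allowed = [u for u in urls if policy.can_fetch(user_agent, u)]
    skipped = [u for u in urls if u not in allowed]
    # Use the site's Crawl-delay when declared, else a conservative default.
    delay = float(policy.crawl_delay(user_agent) or default_delay)
    return allowed, skipped, delay


def fetch_politely(urls, delay, fetch, sleep=time.sleep):
    """Fetch URLs one at a time, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)
        results.append(fetch(url))
    return results


policy = build_robots_policy(SAMPLE_ROBOTS_TXT)
candidate_urls = [
    "https://example.org/forum/thread/42",
    "https://example.org/private/profile/1",
    "https://example.org/forum/thread/43",
]
allowed, skipped, delay = plan_requests(candidate_urls, policy, USER_AGENT)

# Stand-ins for a real HTTP client and a real sleep, so nothing is requested.
waits = []
pages = fetch_politely(
    allowed,
    delay,
    fetch=lambda url: f"<page for {url}>",
    sleep=waits.append,
)
```

The skipped list is worth keeping: it is part of the record for the methods section. A robots.txt check says nothing about the terms of use, which still have to be read separately.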

References

This analysis reflects Aria's practice in Web Scraping and Data Collection and Generative AI Applied to Research.

