pyHDB - heuristic tool for the Brazilian Newspaper Digital Library

using web scraping technics for Historical research

Authors

  • Eric Brasil, University for International Integration of the Afro-Brazilian Lusophony

DOI:

https://doi.org/10.15848/hh.v15i40.1904

Keywords:

History Methodology, Heuristics, Digital History

Abstract

This article analyzes the relationship between search tools and user interfaces in digital source repositories and the construction of historical knowledge in the digital age. To that end, I examine pyHDB: Heuristic Tool for the Brazilian Digital Newspaper Library of the National Library, characterizing its technical, methodological, and heuristic aspects. The tool is a computer program written in Python that uses web scraping techniques. Its purpose is to help researchers build and record their methodology by creating reports, tabular data, and datasets from the defined search parameters. First, the results generated by the graphical interface of the Hemeroteca Digital Brasileira are critically analyzed. Then pyHDB is presented in detail through three search examples, covering its ethical and technical aspects and its analytical possibilities. Finally, the concluding remarks discuss the advantages of developing and using digital methodological tools for historical research.

Published

2022-12-31

How to Cite

BRASIL, E. pyHDB - heuristic tool for the Brazilian Newspaper Digital Library: using web scraping technics for Historical research. História da Historiografia: International Journal of Theory and History of Historiography, Ouro Preto, v. 15, n. 40, p. 186–217, 2022. DOI: 10.15848/hh.v15i40.1904. Available at: https://historiadahistoriografia.com.br/revista/article/view/1904. Accessed on: 3 Jul. 2024.

Section

Research article