News

Some dataset pointers and project comments

Written on 29.11.2024 15:06 by Ingmar Weber

There are some high-level pointers re datasets here: https://docs.google.com/presentation/d/1CXfA9eWV_GtAcVDwNPcu2hhrQ1FwzWfno_DHGNzMz-s/edit

As mentioned, searching at https://datasetsearch.research.google.com/ for existing datasets is a good idea. If you want to build your own then options include:

(i) Using websites with usable APIs. E.g. Wikimedia has APIs that tell you how often a Wikipedia page is viewed/edited and so on.

(ii) Using existing tools to download data. E.g. Arctic Shift (https://arctic-shift.photon-reddit.com/download-tool) has a great tool for downloading Reddit data, and pytrends (https://pypi.org/project/pytrends/) is a great tool for downloading Google Trends data.

(iii) Scrape your own data. Before doing this, make sure to run a search on Google/GitHub for existing scraping code for your target website. If none exists, write your own.

(iv) Use data from one of the projects we're proposing. E.g. we'll propose projects involving review data from Google Maps (which we have collected) or satellite imagery (which we also have collected).

(v) Collect your own data. This can be done through "data donations" (Google this) and services such as https://takeout.google.com/. You can also run experiments and measure things. E.g. you could try automating certain things using LLMs (and tools such as https://github.com/gregpr07/browser-use) and then do a study to measure how many of a student's typical tasks can already be automated.
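
To illustrate option (i), here is a minimal Python sketch for getting daily view counts of a Wikipedia page from the public Wikimedia pageviews REST API. The function names, the default project "en.wikipedia", and the contact address in the User-Agent header are placeholders chosen for this example, so adapt them to your project and check the API documentation for rate limits before collecting at scale.

```python
import json
import urllib.parse
import urllib.request

# Public Wikimedia REST endpoint for per-article pageviews.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the request URL; start and end are dates in YYYYMMDD format."""
    # Wikipedia titles use underscores instead of spaces.
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/all-agents/{title}/daily/{start}/{end}"


def parse_pageviews(payload):
    """Turn the JSON response into a list of (YYYYMMDD, views) pairs."""
    return [(item["timestamp"][:8], item["views"]) for item in payload["items"]]


def fetch_pageviews(article, start, end, contact="you@example.org"):
    """Download daily view counts (needs network access)."""
    # Wikimedia asks clients to identify themselves; the contact is a placeholder.
    request = urllib.request.Request(
        pageviews_url(article, start, end),
        headers={"User-Agent": f"seminar-project ({contact})"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_pageviews(json.load(response))
```

Calling fetch_pageviews("Albert Einstein", "20241101", "20241107") should return one (date, views) pair per day in that range, which you can then load into a table or plot.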

We'll share project ideas early this coming week. If you want to propose your own project, we strongly encourage you to email us beforehand so that we can give some feedback re feasibility, ethical concerns, and so on. If we find your proposed project inappropriate for the seminar, then we might request that you pick one of our projects.

Have a nice weekend!

Ingmar 
