Geneanet to Gedcom scraping

This is a second-year project at the IUT. We have a client and two of our professors as tutors, and we work together using Git and the Agile method, which means that we deliver several intermediate versions to the client before the final app. Geneanet is a website that hosts the family trees of 4 million users. Public trees can be viewed online, but our client, a genealogy enthusiast who wants to improve his own family tree, finds it tedious to retype every individual into his software, Heredis. Our job is to build an app that needs only a profile URL and one click to export the whole family tree as a Gedcom file that the client can import into Heredis.

The main development steps are:

  • write a program that can scrape a profile, that is, extract the information from the profile's web page through the DOM
  • adapt it to the HTML/CSS variations between pages (some profiles lack specific information, such as a date of birth)
  • run the scraper recursively on every individual in the tree and save the results
  • clean the data (incorrect dates, badly formatted names…)
  • write the information in the Gedcom 5.5 format (a very specific format for genealogy files)
  • build an app that asks the user for a URL and performs all these steps in one click
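
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the `Person` fields, the `fetch_profile` callback (standing in for the real page scraper), and the `dd/mm/yyyy` input date format are all hypothetical, and the Gedcom header is reduced to a few lines, so a real importer may require more (for example a submitter record).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Optional

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


@dataclass
class Person:  # hypothetical record for one scraped profile
    url: str
    given: str
    surname: str
    birth: Optional[str] = None       # raw text as scraped; may be missing
    father_url: Optional[str] = None
    mother_url: Optional[str] = None


def collect_tree(start_url: str,
                 fetch_profile: Callable[[str], Person]) -> Dict[str, Person]:
    """Recursively visit every ancestor reachable from start_url, once each."""
    people: Dict[str, Person] = {}

    def visit(url: Optional[str]) -> None:
        if url is None or url in people:
            return
        people[url] = fetch_profile(url)
        visit(people[url].father_url)
        visit(people[url].mother_url)

    visit(start_url)
    return people


def clean_date(raw: Optional[str]) -> Optional[str]:
    """Turn 'dd/mm/yyyy' (assumed input format) into e.g. '2 JAN 1900'.

    Missing or impossible dates give None so they are simply left out.
    """
    if not raw:
        return None
    try:
        d = datetime.strptime(raw.strip(), "%d/%m/%Y")
    except ValueError:
        return None
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


def to_gedcom(people: Dict[str, Person]) -> str:
    """Write individuals and parent/child families as Gedcom lines."""
    ids = {url: f"@I{n}@" for n, url in enumerate(people, start=1)}
    # One family per distinct (father, mother) pair, so siblings share it.
    families: Dict[tuple, str] = {}
    for p in people.values():
        key = (p.father_url, p.mother_url)
        if key != (None, None) and key not in families:
            families[key] = f"@F{len(families) + 1}@"

    # Simplified header; the CHAR value is an assumption to check
    # against what the importing software expects.
    lines = ["0 HEAD", "1 GEDC", "2 VERS 5.5",
             "2 FORM LINEAGE-LINKED", "1 CHAR UTF-8"]
    for url, p in people.items():
        lines += [f"0 {ids[url]} INDI", f"1 NAME {p.given} /{p.surname}/"]
        date = clean_date(p.birth)
        if date:
            lines += ["1 BIRT", f"2 DATE {date}"]
        key = (p.father_url, p.mother_url)
        if key in families:
            lines.append(f"1 FAMC {families[key]}")   # family where p is a child
        for parents, fam in families.items():
            if url in parents:
                lines.append(f"1 FAMS {fam}")         # family where p is a parent
    for (father, mother), fam in families.items():
        lines.append(f"0 {fam} FAM")
        if father in ids:
            lines.append(f"1 HUSB {ids[father]}")
        if mother in ids:
            lines.append(f"1 WIFE {ids[mother]}")
        for url, p in people.items():
            if (p.father_url, p.mother_url) == (father, mother):
                lines.append(f"1 CHIL {ids[url]}")
    lines.append("0 TRLR")
    return "\n".join(lines) + "\n"
```

Passing the scraper in as `fetch_profile` keeps the traversal, the cleaning, and the file writing testable without any network access: a dictionary lookup can stand in for the real scraper. Keying families on the pair of parents is one possible design choice; it lets siblings end up in the same family record.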

Profile example:

(this project is still in progress)

Click here to see the code.

Arnaud GODET
2nd year computer science student
