incubator-devlake: [Bug] [Database] author_id column in commits table of database contains an email instead of an id
Search before asking
- I had searched in the issues and found no similar issues.
What happened
While exploring the database structure and table contents after scanning a few repositories, we noticed that the author_id column in the commits table contains the same value as the author_email column. For comparison, the pull_requests table also has a column called author_id, but there it holds the unique GitHub id of the user who created the pull request.
What you expected to happen
We expected author_id in commits to hold the same kind of value as author_id in pull_requests (e.g. github:GithubUser:11111111). As designed, the commits table has two columns with identical values, which is redundant. Having this unique identifier in the commits table would help us build Grafana dashboards with more accurate metrics.
Without a unique author_id in the commits table, it’s not possible to obtain all the commits of an individual, because each combination of email + display name currently counts as a different committer.
Actual behavior:
| author_name | author_email | author_id |
|---|---|---|
| Jon Doe | jon.doe@gmail.com | jon.doe@gmail.com |
| Jon Doe | jon.doe@hotmail.com | jon.doe@hotmail.com |
(Both emails belong to the same person)
Expected behavior:
| author_name | author_email | author_id |
|---|---|---|
| Jon Doe | jon.doe@gmail.com | github:GithubUser:11111111 |
| Jon Doe | jon.doe@hotmail.com | github:GithubUser:11111111 |
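The effect on per-person metrics can be reproduced with a small sketch. It uses an in-memory SQLite table as a stand-in for the commits table (an assumption for illustration only: the table is reduced to the three columns shown above, and the real database engine may differ) and loads it with the rows from the two tables.

```python
import sqlite3

# Rows copied from the "Actual behavior" and "Expected behavior" tables.
ACTUAL = [
    ("Jon Doe", "jon.doe@gmail.com", "jon.doe@gmail.com"),
    ("Jon Doe", "jon.doe@hotmail.com", "jon.doe@hotmail.com"),
]
EXPECTED = [
    ("Jon Doe", "jon.doe@gmail.com", "github:GithubUser:11111111"),
    ("Jon Doe", "jon.doe@hotmail.com", "github:GithubUser:11111111"),
]


def distinct_authors(rows):
    """Count distinct author_id values, as a per-author dashboard query would."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in schema; the real commits table has more columns.
    conn.execute(
        "CREATE TABLE commits (author_name TEXT, author_email TEXT, author_id TEXT)"
    )
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", rows)
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT author_id) FROM commits"
    ).fetchone()
    conn.close()
    return count


print(distinct_authors(ACTUAL))    # 2: one person is counted as two committers
print(distinct_authors(EXPECTED))  # 1
```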
How to reproduce
- Have a GitHub connection configured with a token
- Go to Pipelines > Create Pipeline Run.
- Click on Create Pipeline Run.
- Scroll down until ‘Github’ is shown in the Data Providers list.
- Toggle on GitHub Data provider
- Enter the repository owner and name for a repository that contains a few commits and pull requests created by different users.
- Click on ‘Run Pipeline’
- Once the registration of the repository has finished, go to Pipelines > Create Pipeline Run.
- Click on Create Pipeline Run
- Scroll down until the ‘Advanced Mode’ option appears at the bottom.
- Click on ‘Advanced Mode’.
- Create a task in the task editor to launch a GitHub Extractor Task.
- Use the following JSON:
[
  [
    {
      "Plugin": "gitextractor",
      "Options": {
        "url": "Url of the repository registered in the previous step ended with .git",
        "repoId": "Github repository id. It looks like -> github:GithubRepo:384111310",
        "user": "Name of the user who is the owner of the GitHub Token",
        "password": "GitHub Token"
      }
    }
  ]
]
- Click on ‘Run Pipeline’
- Once the scan has finished, connect to the database
- Use the table commits
- Execute the query:
SELECT author_email, author_id FROM lake.commits;
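To check the whole table instead of reading the result row by row, the comparison can be done in a single query. The sketch below runs against an in-memory SQLite stand-in (an assumption for illustration; the table is named commits without the lake schema prefix, and only the two relevant columns are created). The function name `count_email_ids` is hypothetical.

```python
import sqlite3


def count_email_ids(conn):
    """Return (total commits, commits whose author_id equals author_email)."""
    total, same = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN author_id = author_email THEN 1 ELSE 0 END), 0)
        FROM commits
        """
    ).fetchone()
    return total, same


# Stand-in data: the two rows from the "Actual behavior" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author_email TEXT, author_id TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [
        ("jon.doe@gmail.com", "jon.doe@gmail.com"),
        ("jon.doe@hotmail.com", "jon.doe@hotmail.com"),
    ],
)

total, same = count_email_ids(conn)
print(f"{same} of {total} commits have author_id equal to author_email")
```

If both numbers match, every commit in the table carries the email as its author_id, which is the behavior reported here.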
Anything else
Not sure about the exact version of lake we are using, because we are working with the fork of MericoDev.
Version
0.10.0
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (12 by maintainers)
Sure! We will discuss with our managers which topics we can raise and then we can try to find time to schedule a meeting. We will be in touch 🙂
@lukasgomez @marcemv90 Thanks for putting together such a detailed bug report, really appreciate it!
Like @klesh mentioned, the `commits` table is created by the `gitextractor` plugin, which directly extracts data from a git repository, and it’s impossible to tell a git author’s GitHub id using the git repo alone.
If you would like to filter/group commits by a specific GitHub user, I’d also recommend looking into the team configuration feature that’s going to be shipped in v0.12.0. The release candidate images are already available on docker hub, we’re just going through Apache’s formal voting process for releases. If you’d like to get a taste now, feel free to try the images below 😃
Hi, We will try this new feature to try to filter commits by GitHub user and test the release candidate images. Hope this solution helps us to achieve the desired behavior. Thanks for your response and your time 😄 .
Hi, @marcemv90, but the behavior you expected is not viable. As I said, in reality, a `commit` may or may not have its `author_id` pointing to a GitHub user id. Relying on GithubUserID is very limited and won’t be supported.
I suggest that you take a look at the Team Feature, which allows you to connect multiple email addresses to a Unified Identity.
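The idea of connecting several email addresses to one identity can be sketched in a few lines. This only illustrates the concept and is not how the Team Feature is implemented; the mapping, the identity label `user:jon-doe`, the sample commit SHAs, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from commit email to one identity per person.
EMAIL_TO_IDENTITY = {
    "jon.doe@gmail.com": "user:jon-doe",
    "jon.doe@hotmail.com": "user:jon-doe",
}


def commits_by_identity(commits, email_to_identity):
    """Group commit SHAs by identity; unmapped emails fall back to the email."""
    grouped = defaultdict(list)
    for sha, email in commits:
        key = email.lower()
        grouped[email_to_identity.get(key, key)].append(sha)
    return dict(grouped)


# Made-up (sha, author_email) pairs for illustration.
commits = [
    ("a1", "jon.doe@gmail.com"),
    ("b2", "jon.doe@hotmail.com"),
    ("c3", "jane.roe@example.com"),
]
print(commits_by_identity(commits, EMAIL_TO_IDENTITY))
```

Both of Jon Doe's commits end up under a single key, while the unmapped address stays separate until someone adds it to the mapping.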
@Startrekzky I think you are correct. Let me test this out and confirm back if I’m still having issues.
@hezyin @klesh Sorry to pull up an old issue, but can you provide some details on how the Teams functionality can help with this problem? When implementing Teams, it seems that having a record in the `accounts` table is key to the process so that it can then be mapped through the mapping process. `gitextractor` does not create any accounts in this table, so mapping seems problematic.
@lukasgomez Hi Lukas, we’d love to support you guys in implementing DevLake for your team and get your feedback. Would you be open to a quick Zoom conversation to get properly connected? Connecting with users is incredibly helpful for us to keep improving DevLake.
The team feature @klesh mentioned is supported in v0.12.0, which will be released soon. Stay tuned.
Bump, I don’t think this issue should be closed.