<div class= info>
Yesterday:
git
and version control
In this lecture:
</div>
<div class= warn>
@James Percival
in #General & #Random, or DM me.</div>
By the end of this lecture you should:
git
Git is a distributed Automated Version Control/Revision Control tool, which can track, revert and tag changes in simple data files such as the text of program source code. You should have already configured your system to know who you are by running a command like:
git config --global user.name "Gerard Gorman"
git config --global user.email "g.gorman@imperial.ac.uk"
git cheatsheet
git init my_new_repo
git status
git status -uno
git clone https://github.com/jrper/my_new_repo
git add letter_to_granny.txt *.png
git commit -m "Update my letter to Granny."
Remember that for each git
command help is available:
git help
git add -h
GitHub.com is a web-based repository and collaboration system for code (and other stuff) controlled via the git
version control system. Octocat is the name of its mascot, who also has a useful profile page to practise cloning and forking repositories from. Purchased by Microsoft in 2018, GitHub is also the home for a large number of major open source projects, including numpy
.
hg
) revision control tool. Now services git
as well. bzr
) revision control tool.
<div class= exercise>
git clone <url>.
git add <files> and git commit -m "<log message>".
</div>
GitHub public repositories can be searched and read by anyone, whether logged in or anonymous. Writing to a repository, or administering it (i.e. having control over deletion, renaming and write access), requires authentication and express permission. Meanwhile, private repositories can only be accessed at all by authenticated users with the proper permissions. Only paying (and educational) accounts can create new private repositories.
GitHub accounts can either be individual (i.e. personal) or for organizations (i.e. companies, project communities & formal groups). Any existing GitHub account can create and manage a new organization. However, only one individual account is allowed per username, and your interactions with GitHub are linked to your personal identity (i.e. to your individual account). Each code repository must exist under an account (whether an individual or an organization) and has a standard URL assigned to it:
https://github.com/<account_name>/<repository_name>
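As a minimal sketch of this URL convention, the following Python snippet builds a repository URL from its two parts and splits one back up again. The function names `repo_url` and `split_repo_url` are hypothetical helpers introduced for illustration, and the example account and repository are the ones used in the `git clone` line of the cheatsheet.

```python
from urllib.parse import urlparse


def repo_url(account_name, repository_name):
    """Build the standard GitHub URL for a repository."""
    return "https://github.com/%s/%s" % (account_name, repository_name)


def split_repo_url(url):
    """Return (account_name, repository_name) from a standard GitHub URL."""
    # The path looks like "/<account_name>/<repository_name>".
    account_name, repository_name = urlparse(url).path.strip("/").split("/")[:2]
    return account_name, repository_name


url = repo_url("jrper", "my_new_repo")
print(url)
print(split_repo_url(url))
```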
GitHub is based around collaboration, so it's natural to want to interact with repositories you don't own. The easiest way to grant permissions for another GitHub user is to add them as an external collaborator. This works for both individual repositories and for organizations.
Members of organizations can be assigned to subgroups called "teams", each of which can be given read, write or admin rights to the repositories that organization owns. This gives better mass control for projects with large numbers of people. You will revisit the team structure when you start the mini-projects with Gareth Collins later this term.
<div class= exercise>
git clone
.</div>
Extending our example project from Tuesday, a version of the project stored on GitHub might look like
.gitignore
.travis.yml
AUTHORS
CONTRIBUTING
docs/
    conf.py
    index.rst
LICENSE.txt
mycoolproject/
    __init__.py
    cool_module.py
    another_cool_module.py
    tests/
        test_mycoolproject.py
requirements.txt
README.md
setup.py
You will see several changes. The .gitignore file is just the same as that used for vanilla git, containing a list of patterns for files git shouldn't track in your working directory. These files aren't automatically listed with git status or added with git add -A:
.gitignore:
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Sphinx documentation
docs/_build/
/docs/html
/docs/pdf
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
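To get a feel for how the glob-style patterns in the listing above behave, the following minimal sketch uses Python's standard `fnmatch` module. It is only an approximation: real `.gitignore` matching has extra rules (directory-only patterns ending in `/`, negation with `!`, anchoring with a leading `/`) which `fnmatch` does not implement, and the helper name `is_ignored` is hypothetical.

```python
from fnmatch import fnmatchcase

# A few file-level patterns copied from the .gitignore listing above.
PATTERNS = ["*.py[cod]", "*.so", "*.egg", ".coverage.*", "pip-log.txt"]


def is_ignored(filename, patterns=PATTERNS):
    """Return True if the bare filename matches any glob pattern.

    Approximation only: git applies further rules that are not modelled here.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


for name in ["cool_module.py", "cool_module.pyc", "fast.so", ".coverage.host1"]:
    print(name, is_ignored(name))
```

Note that `*.py[cod]` needs one extra character after `.py`, so byte-compiled files such as `cool_module.pyc` match while the source file `cool_module.py` does not.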
The tests/
and the .travis.yml
file will be covered in tomorrow's lecture on testing and continuous integration so we will not cover them in detail today.
However, let's briefly discuss the other new files.
README.md and Markdown
One of the most visible files on any GitHub project is the README.md file placed in the repository root directory (or in a docs/ or a hidden .github/ directory). GitHub automatically runs this file through a Markdown preprocessor and displays the resulting HTML output on the main landing page of the project.
There exist numerous pages giving recommendations for good content, but at the very least you should address the following:
Describing what your code does (or will do) is naturally very important.
For a Python project, the usual answer to the "how?" question is to include pip
installation instructions, whether from the PyPI repository, or directly from source downloaded from the GitHub project page.
"Who?" will depend on the content of the project. You may want to invite anyone to use your code, or you may be trying to solve a very specific problem, and know of another project to use for the general case.
GitHub Flavoured Markdown (the easiest way to write your README.md file) is a markup language something like HTML, but with a very compact and human-readable syntax in its native form (in fact quite similar to the reStructuredText used with Sphinx for Python documentation, which GitHub also supports). GFM is a GitHub extension (see their guide) to the popular Markdown language.
Markdown uses simple punctuation characters to indicate style and formatting in a plain text file. This makes it a lot less cluttered than the equivalent HTML markup file. For example, an HTML-formatted itemized list looks like
<ul>
<li>Item one</li>
<li>Item two</li>
<li>Item three</li>
</ul>
Meanwhile the Markdown equivalent looks like
- Item one
- Item two
- Item three
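To make the correspondence between the two forms concrete, here is a toy Python sketch that turns a simple Markdown list into the equivalent HTML. It is not how GitHub renders Markdown: it only handles single-level `- ` items, does no HTML escaping, and the function name `markdown_list_to_html` is hypothetical.

```python
def markdown_list_to_html(markdown_text):
    """Convert lines of the form '- item' into an HTML unordered list."""
    items = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            items.append(line[2:].strip())
    html_lines = ["<ul>"]
    html_lines += ["<li>%s</li>" % item for item in items]
    html_lines.append("</ul>")
    return "\n".join(html_lines)


example = "- Item one\n- Item two\n- Item three"
print(markdown_list_to_html(example))
```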
<div class= info>
In fact, Markdown is also the language used to create formatted text blocks inside Jupyter notebooks. If you look at some of the text in this notebook, as well as the previous ones in this course, you will see some more examples of ways to mark up your text.
</div>
As well as the full specification linked to above, there is also a quick cheat-sheet available for many of the more common commands.
Create and upload a README.md
for your new public repository you created earlier. It should include:
You can add it either using the web interface (where you can preview your work) or by using git add
, git commit
and git push
commands if you would like the practice.
In particular, try adding lists, links and headings.
GitHub has limited support for bug tracking on repositories via its issues pages. A separate issues page exists for every repository you create. Issues could be serious ("I used your code and now I'm blind.") or minor ("There is a spelling mistake on the special screen which appears on the 29th of February"). The GitHub interface allows you to assign an "owner" to each one, who can then take charge of dealing with the problem as they see fit.
Compared to some other software on the market, the GitHub Issues interface is relatively generic and lightweight. The principal advantage it has is the integration with the code storage side of things. You can easily link to people and code branches by using GitHub Flavoured Markdown special formatting inside your issues or replies. In particular:
- People can be mentioned using the @ sign (just like Slack, e.g. @jrper).
- Issues can be referenced using # (e.g. "See issue #7").

There is also a (new and currently very limited) project management tool called project boards, allowing you to arrange issues and pull requests (see later) on a timeline of "to do" versus "work in progress" versus "done". This can be useful to make sure all collaborators are aware of the current priorities and schedule, especially when people are working on different sites, or in different timezones.
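As a rough illustration of the @ and # references described above, the following Python sketch pulls usernames and issue numbers out of a comment using regular expressions. The patterns are a simplified assumption for illustration, not GitHub's actual parsing rules, and `find_references` is a hypothetical helper.

```python
import re

# "@name" and "#number", provided they do not follow a word character
# (so that email addresses are not mistaken for mentions).
MENTION = re.compile(r"(?<!\w)@([A-Za-z0-9-]+)")
ISSUE = re.compile(r"(?<!\w)#(\d+)")


def find_references(text):
    """Return (usernames, issue_numbers) found in a comment."""
    users = MENTION.findall(text)
    issues = [int(number) for number in ISSUE.findall(text)]
    return users, issues


comment = "Thanks @jrper, this looks like a duplicate. See issue #7."
print(find_references(comment))
```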
<div class= exercise>
Create some issues on your project repository, and on another student's project repository. Try to include some Markdown in your issue.
Start a project board for the project. Organize the issues.
</div>
AUTHORS & CONTRIBUTING
An AUTHORS file is basically a credits (or blame) list for static versions of your project. The typical format is a list of the names of authors (in each contributor's own preferred format), one per line, with optional contact details (an email address or webpage) following in angle brackets.
AUTHORS
Ada Lovelace <ada@babbage.com>
Albert Einstein <a.einstein@princeton.edu>
Bill Gates
Grace Hopper <http://www.cs.yale.edu/homes/tap/Files/hopper-story.html>
Tan Jiazhen
Elon Musk <https://www.spacex.com/elon-musk>
Marie Curie
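Because the format is so regular, an AUTHORS file is easy to process automatically. The sketch below parses lines in the format just described into (name, contact) pairs. It assumes exactly that layout (a name, optionally followed by contact details in angle brackets), and `parse_authors` is a hypothetical helper rather than part of any standard tool.

```python
import re

# A name, optionally followed by contact details in angle brackets.
AUTHOR_LINE = re.compile(r"^(?P<name>[^<]+?)\s*(?:<(?P<contact>[^>]+)>)?$")


def parse_authors(text):
    """Return a list of (name, contact) tuples; contact is None if absent."""
    authors = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = AUTHOR_LINE.match(line)
        if match:
            authors.append((match.group("name"), match.group("contact")))
    return authors


sample = """Ada Lovelace <ada@babbage.com>
Bill Gates
Marie Curie"""
print(parse_authors(sample))
```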
This is useful both for recognition of your collaborators' work, and as a starting point if you ever need to relicense your code (see later). It is perhaps less important than it used to be, now that Revision Control/Version Control Software is in popular use, but it ensures that authors are credited even when users receive software via routes other than GitHub.
The CONTRIBUTING
file is a recommended addition coming out of the GitHub community. It documents the standards and procedures a project expects contributors to follow.
The LICENSE.txt file is the GitHub standard place to put information dealing with the copyright status and software license under which a project is distributed. When starting a new repository, GitHub gives the option to include a LICENSE.txt matching one of several popular open source licences. Otherwise, you are free to add your own.
<div class= warn>
I am not a lawyer! More specifically, I am not your lawyer. Lawyers spend a lot of money on insurance, so that they are safe to give specific legal advice without the fear of liability. While I will try to be as accurate as possible in the information provided here, don't plan on using these notes as a defence in court. </div>
The author, or commissioner (for work done "for hire" for an employer), of software code has certain property rights (called copyrights) to control the ability of other people to copy and distribute their work, just as the authors of a book or the producers of a film do. Depending on the jurisdictions involved, and the particulars of the case, breach of copyright may be either a civil matter (one person sues another for money or to make them stop doing something) or a criminal matter (the State prosecutes an individual, possibly leading to imprisonment).
country | UK | EU | USA | China | India |
---|---|---|---|---|---|
copyright period | life+70 | life+70 | life+70 | life+50 | life+60 |
There are some exceptions to these time periods. In the UK, "where a work is made by Her Majesty or by an officer or servant of the Crown in the course of his duties" it is placed under Crown Copyright. New Crown copyright material that is unpublished has copyright protection for 125 years from date of creation. Published Crown copyright material has protection for 50 years from date of publication. Meanwhile the copyright to the play Peter Pan (which the author J. M. Barrie gifted to the Great Ormond Street children's hospital) is specifically legislated to last forever.
Although various methods exist to register the date at which works were created, there is now generally no need to do anything to copyright your work. Your rights exist automatically from the moment of creation (i.e. when you first wrote the code), and continue to exist unless you explicitly give them up, or until the legally mandated time has passed. In fact, in some jurisdictions (specifically some parts of the EU) authors are unable to opt out of their moral rights over their work.
For computer software specifically (a "literary work"), UK copyright laws allow creators to control the acts of:
In the US in particular (but often not in the EU) software can also be patented. This gives a non-trivial idea (an 'invention') additional protection for a limited period of time (often 20 years) against others making, using, selling or importing/exporting it. Unlike copyright, this doesn't apply to a specific implementation (expression) but to more general concepts (e.g. the one-click button to buy something, for which Amazon held the US patent until September 2017).
<div class= warn>
Although the two are sometimes confused, academic plagiarism is a separate issue from copyright. While a copyright holder can give you permission to use or copy their work, that should not be assumed to be permission to pass their work off as your own, which is never academically acceptable. In particular, when you use others' work during this course, you should provide proper attribution, regardless of the licence you obtained it under. Direct copy-pasting of code for assessed exercises is serious academic misconduct, and would have serious implications if discovered.
</div>
That's enough about copyright in general. Next we'll talk specifically about copyright for software, and the Open Source movement.
The word "free" in English has two main meanings
The free software movement is aimed at encouraging software to be distributed under terms matching the second meaning.
As a copyright holder, you can always grant others the ability to use, copy and distribute your software. The easiest and simplest way to do this is to publish a licence together with your code. As a user & developer, ensuring that software you use has a licence with terms compatible with what you intend to do with it prevents long, costly and embarrassing legal action further down the line.
Although in theory you could always write your own licence, few scientists are also lawyers. Because legal text has legal meaning, it is always safer to use one of the well known and well understood existing copyright licences.
The "most free" thing you may be able to do with code (depending on the local legal system) is to release it into the public domain. This is the same state that literary works are left in after the legally mandated time has expired. At this point, anyone is free to use or reapply the material in any way they see fit.
However, some legal systems (particularly the civil law practised in much of the EU) can make it practically impossible for authors to give up their "moral rights".
In many jurisdictions, especially those based on the English Common Law (including the USA), the transfer or sale of goods or services can carry an implied warranty that they are fit for the usual purpose the product would be put to. For example an item sold as a "child's high chair" would be expected to take the weight of a child without breaking.
Many FLOSS licences include specific wording attempting (as far as they can) to explicitly deny any such warranty. For example, the MIT license states:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Many licences, while retaining copyright over the work and not releasing it into the public domain, otherwise give users relatively unrestricted rights to copy, modify and distribute the code. In particular, they allow the code to be used (often with attribution) as part of greater works released under more restrictive licences (for example, ones which prohibit distributing your own copy of the source for the larger project, or the modified version of the existing code). These are often called "permissive" licences.
On the other hand, a set of licences modelled after the GNU General Public License are intended to ensure that once software is released as "free software", it stays as "free software". As such, they place restrictions on the immediate recipient of the work, in order to ensure that people later down the chain retain their version of the four freedoms:
Specifically, the various versions of the GPL all require that when modified versions of GPL'd projects are distributed, the new version is placed under a GPL licence (e.g. the distributors must also release the source code on demand, and allow other users the right to modify and distribute it). This "carry forward" operation has caused such licences to be called "copyleft" (a play on words from "copyright").
Various bodies, including the Free Software Foundation, the organization behind GNU, have recognised that software is seldom used in isolation. One component interacts with another component, which calls a third component etc. With a "strong" copyleft licence such as the GPL, this requires every piece of code in the ecosystem to also be copyleft. In most practical environments, this is impossible to ensure past a given size, since some components (e.g. the "binary blob" provided to run your graphics card) are liable to be provided under a permissive open source or proprietary commercial licence.
As such, a second class of "weak" copyleft licences, such as the GNU Lesser General Public License, allow their code to be linked to (i.e. called in an automated sense) in derivative works by code not under a (L)GPL licence. Specifically, if the code is called or used as a library then no restriction is implied, but if the code of the library itself is modified then the standard restrictions still apply. The word "lesser" is used in terms of the rights of a theoretical third party user, who may no longer be guaranteed the right to modify the code that links to the original library.
Because "copyleft" licences require derivative works to also be released under suitable "copyleft" licences, it is impossible to release packages containing GPL components entirely under more permissive licences such as BSD.
Package licence (rows) / component licence (columns) | BSD | LGPL | GPL |
---|---|---|---|
BSD | Yes | No | No |
LGPL | Yes | Yes | No |
GPL | Yes | Yes | Yes |
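The table above can be encoded as a small lookup, which makes it easy to check a whole list of dependencies at once. This sketch assumes that each row is the licence you want to release your package under and each column is the licence of a component it contains, which is the reading consistent with the preceding paragraph. It simply restates the table (it is not legal advice), and the names `CAN_INCLUDE` and `can_release_under` are hypothetical.

```python
# Outer key: licence of the released package.
# Inner key: licence of a component included in that package.
CAN_INCLUDE = {
    "BSD": {"BSD": True, "LGPL": False, "GPL": False},
    "LGPL": {"BSD": True, "LGPL": True, "GPL": False},
    "GPL": {"BSD": True, "LGPL": True, "GPL": True},
}


def can_release_under(package_licence, component_licences):
    """True if every component may be included under package_licence."""
    return all(CAN_INCLUDE[package_licence][c] for c in component_licences)


print(can_release_under("BSD", ["BSD", "GPL"]))
print(can_release_under("GPL", ["BSD", "LGPL", "GPL"]))
```

The first call reports that a package containing a GPL component cannot be released entirely under BSD, while the second shows that a GPL package can contain components under any of the three licences.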
Some licences make a distinction between "commercial" and "non-commercial" uses. In particular the work may be freely licensed for non-commercial use, with the right reserved to charge a fee for commercial use. In general "commercial use" can be interpreted fairly broadly as related to income-generating use of any kind, whether direct or indirect. This means that for code under a non-commercial license, not only should you not sell the work itself, you probably shouldn't use it in a way that earns money.
Fortunately, academic study and pure research uses are frequently specifically classed as non-commercial activities, avoiding the awkward question of "the code lets me do my research, for which a funding body pays me, is that commercial?". However, this can be an issue when the intellectual property (IP) produced at the end of a project contractually becomes the property of an industrial partner. Many organizations (including Imperial College) have lawyers on retainer to deal with this kind of question.
The best time to choose a licence for a project is right at the beginning. Provided all the copyright holders agree, a project can always be relicenced, but collecting this agreement becomes increasingly difficult as time passes and contact details and levels of interest change. As such, the bigger and older a project is, the harder switching a licence becomes.
You may be tempted to create your own licence. For example, you may want to exclude a certain group of people (e.g. weapons manufacturers or animal rights activists) from using your code. Alternatively, you may not like one specific element in an existing licence and want to use a licence which is otherwise the same, with that one thing changed.
Unless you have a very good reason, don't do this. Even if you do have a good reason, you probably shouldn't do this. The popular existing licences are the ones which have been tested and refined in law courts, and increasing the number of licences vastly increases the potential licence incompatibility problems.
Unlicense
Apache
MIT
GPL
This is more an ethical than a technical question. The "copyleft" licenses ensure that modified versions of your code stay publicly available, at the price of removing some options for how your immediate users can apply your code.
Now that you've created a repository with a cool name, a sensible licence and some rules for how to contribute, you need to get round to actually writing some code. When adding new features, it is very easy to break existing ones, either through deliberate acts of evil (these are rare), or accidentally (this is really common). There are a couple of techniques to minimise this potential damage. One of them is to test your code, preferably automatically (we'll revisit this tomorrow). Another is to make use of the GitHub collaboration features to always practise code review, so that no new code is placed into the "production" system until someone other than the original author has examined it.
The GitHub flow can be summarised as:
- New work is carried out in git branches or under a GitHub fork of the original repository.
git fetch
git checkout -b feature_branch origin/master
# do some work
git add my_new_file.py my_old_file.py
git commit -m "Add the ability to frotz your foobar."
# repeat as necessary.
git push --set-upstream origin feature_branch
git fetch
git merge origin/master
## if there are conflicts
git mergetool
git commit
### otherwise carry on working
git push
- The work is git pushed to GitHub (in the branch/fork) and a pull request is opened to inform collaborators with sufficient permission that there is new work to examine.
- If the reviewers are satisfied, the work is merged into the main branch (master), otherwise the author of the new work is requested to make changes until they are.
<div class= exercise>
Code reviews are/can be hard work. Doing good code reviews is even harder. There are several things to keep in mind:
One method to both practise critiquing code and to improve the standard of code written is pair programming. Here two coders sit together at one computer, planning to write code to solve a specific problem. One programmer 'drives' by controlling the keyboard and mouse, while the other 'navigates' by watching the screen.
As the driver:
As the navigator:
Above all, keep on topic and keep talking.
There are lots of successful pairing patterns:
Meeting of minds: When both driver and learner are experts.
One final note: don't try group coding with many more than two people.
"A camel is a horse designed by a committee." attributed to Sir Alec Issigonis
<div class= exercise>
Find a partner and prepare to try paired programming to write some Python. There are several exercises, so be the driver at some points and the navigator at others. It's ok if you don't get the entire code finished, concentrate on the interaction with your partner.
Write a script to find all the proper divisors of an integer and so find pairs of [amicable numbers](https://en.wikipedia.org/wiki/Amicable_numbers). The proper divisors of a number are the numbers other than itself which divide it exactly (e.g. the proper divisors of 6 are 1, 2, and 3). Two numbers $n$ and $m$ are an amicable pair if the sum of the proper divisors of $n$ is $m$ and the sum of the proper divisors of $m$ is $n$. Since the sum of the proper divisors of $6$ is $6$, it is a special kind of amicable number called a [perfect number](https://en.wikipedia.org/wiki/Perfect_number).
For testing, the first amicable pairs are (220, 284)
and (1184, 1210)
.
Tips:
Write a script to list the names of the numbers from 0 to 100, as strings, in alphabetical order.
For testing purposes, an acceptable answer for the numbers from 0 up to 5 is
['five', 'four', 'one', 'three', 'two', 'zero']
Tips:
'%s-%s'%(tens[2], ones[1])
, e.g. 'twenty-one'
.
Write a script to turn a `.png` image upside down. Add a linear gradient in transparency from bottom to top.
Tips:
Model answers are available for the [amicable number](https://msc-acse.github.io/ACSE-1/lectures/lecture7-solutions.html#exercise1), [alphabetized numbers](https://msc-acse.github.io/ACSE-1/lectures/lecture9-solutions.html#exercise2) and [inverted image](https://msc-acse.github.io/ACSE-1/lectures/lecture9-solutions.html#exercise3) problems.
</div>
<div class= interlude>
One of the biggest criticisms made against the Python programming language is that it is slow. Compared to compiled languages, this is often true. However, not all Python operations happen at the same speed, so that often the biggest reason for code to be slow is that you are doing slow things. In particular, Python loops are known to take a relatively long time to execute. Where possible, vectorize your code using tools like numpy
.
</div>
import numpy as np
l = [int(1000000*np.random.random()) for _ in range(100000)]
s = set(l)
print("Time to search list")
%time 500 in l
print("Time to search set")
%time 500 in s
import numpy as np
N = 1000
a = np.random.random(N)
b = np.random.random(N)
def crude_sum(x, y):
    c = np.empty(x.shape)
    for i, s in enumerate(zip(x, y)):
        c[i] = s[0]+s[1]
    return c
def numpy_sum(x, y):
    return x+y
%time c = crude_sum(a, b)
%time c = numpy_sum(a, b)
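The `%time` magic used above only works inside IPython or a Jupyter notebook. As a hedged sketch of how the list versus set comparison could be repeated in a plain Python script, the version below uses the standard `timeit` module instead. The variable names, the fixed random seed and the choice of searching for the last element are illustrative assumptions, and the timings will vary from machine to machine.

```python
import random
import timeit

random.seed(0)  # fixed seed so repeated runs use the same data
values = [random.randrange(1000000) for _ in range(100000)]
as_list = values
as_set = set(values)

# Membership tests give the same answer for both containers...
target = values[-1]
assert (target in as_list) == (target in as_set)

# ...but the list is scanned element by element, while the set uses hashing.
repeats = 100
t_list = timeit.timeit(lambda: target in as_list, number=repeats)
t_set = timeit.timeit(lambda: target in as_set, number=repeats)
print("list: %.2e s per lookup" % (t_list / repeats))
print("set:  %.2e s per lookup" % (t_set / repeats))
```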
GitHub provides every repository with free webspace and a free wiki site. These spaces can be used to give detailed information that it would be inappropriate to place in a short README.md.
The GitHub Pages interface uses software called Jekyll to convert your gh-pages
branch into a static webpage. This means that you can rapidly add new pages just by uploading new files containing Markdown-formatted text. In particular, you can make your own personalized blog page with almost no work (apart from writing the content).
# This cell sets the css styles for the rest of the notebook.
# Unless you are particularly interested in that kind of thing, you can safely ignore it
from IPython.display import HTML
def css_styling():
    styles = """<style>
div.warn {
    background-color: #fcf2f2;
    border-color: #dFb5b4;
    border-left: 5px solid #dfb5b4;
    padding: 0.5em;
}
div.exercise {
    background-color: #B0E0E6;
    border-color: #B0E0E6;
    border-left: 5px solid #1E90FF;
    padding: 0.5em;
}
div.info {
    background-color: #F5F5DC;
    border-color: #F5F5DC;
    border-left: 5px solid #DAA520;
    padding: 0.5em;
}
div.interlude {
    background-color: #E6E6FA;
    border-color: #E6E6FA;
    border-left: 5px solid #4B0082;
    padding: 0.5em;
}
div.assessment {
    background-color: #98FB98;
    border-color: #228B22;
    border-left: 5px solid #228B22;
    padding: 0.5em;
}
</style>
"""
    return HTML(styles)
css_styling()