API docs

compare50

class compare50.Comparator

Abstract base class for compare50 comparators which specify how submissions should be scored and compared.

abstract compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

abstract score(submissions, archive_submissions, ignored_files)

Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of compare50.Scores, one for each submission pair.
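The two abstract methods suggest a two-phase flow: a cheap scoring pass over every pair, followed by a detailed comparison of the pairs worth a closer look. The sketch below illustrates that flow with stand-ins only. ComparatorSketch, SharedLines, and Sub are hypothetical names invented for this example, submissions are modeled as sets of lines, ignored_files is modeled as a set of distro lines rather than File objects, and the choice to pair archive submissions only with regular submissions is an assumption.

```python
from abc import ABC, abstractmethod
from collections import namedtuple
from itertools import combinations, product

# Hypothetical stand-in for a submission: a name plus a frozenset of lines.
Sub = namedtuple("Sub", ["name", "lines"])


class ComparatorSketch(ABC):
    """Stand-in mirroring the two abstract methods described above."""

    @abstractmethod
    def score(self, submissions, archive_submissions, ignored_files):
        """Cheap pass: one score per submission pair."""

    @abstractmethod
    def compare(self, scores, ignored_files):
        """Expensive pass: detailed matches for the pairs that were kept."""


class SharedLines(ComparatorSketch):
    """Toy comparator: similarity is the number of shared, non-distro lines."""

    def score(self, submissions, archive_submissions, ignored_files):
        # Assumption: archives are compared to submissions, not to each other.
        pairs = list(combinations(submissions, 2))
        pairs += list(product(submissions, archive_submissions))
        return [(a, b, len((a.lines & b.lines) - ignored_files)) for a, b in pairs]

    def compare(self, scores, ignored_files):
        return [(a.name, b.name, sorted((a.lines & b.lines) - ignored_files))
                for a, b, _ in scores]


distro = frozenset({"#include <stdio.h>"})
alice = Sub("alice", frozenset({"#include <stdio.h>", "int n = 0;", "n++;"}))
bob = Sub("bob", frozenset({"#include <stdio.h>", "int n = 0;", "n--;"}))
old = Sub("archive", frozenset({"n++;"}))

comparator = SharedLines()
scores = comparator.score([alice, bob], [old], distro)
# Keep only the two highest-scoring pairs for the detailed comparison.
top = sorted(scores, key=lambda s: s[2], reverse=True)[:2]
details = comparator.compare(top, distro)
```

Splitting the work this way keeps the expensive comparison limited to the pairs that survive ranking.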

class compare50.Comparison(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)
Variables
  • sub_a – the first submission

  • sub_b – the second submission

  • span_matches – a list of pairs of matching compare50.Spans, wherein the first element of each pair is from sub_a and the second is from sub_b.

  • ignored_spans – a list of compare50.Spans which were ignored (e.g. because they matched distro files)

Represents an in-depth comparison of two submissions.

__init__(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING) → None

Method generated by attrs for class Comparison.

exception compare50.Error

Base class for compare50 errors.

class compare50.File(name, submission)
Variables
  • name – file name (path relative to the submission path)

  • submission – submission containing this file

  • id – integer that uniquely identifies this file (files with the same path will always have the same id)

Represents a single file from a submission.

__init__(name, submission) → None

Method generated by attrs for class File.

classmethod get(id)

Find the File with the given id.

lexer()

Determine which Pygments lexer should be used.

property path

The full path of the file.

read(size=-1)

Open file, read size bytes from it, then close it.

tokens()

Returns the preprocessed tokens of the file.

unprocessed_tokens()

Get the raw tokens of the file.

class compare50.Pass

Abstract base class for compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 receives) as well as a comparator (used to score and compare the preprocessed submissions).
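As a rough illustration of "functions from tokens to tokens", the sketch below chains two preprocessors so that the output of one feeds the next. Tok, drop_whitespace, lowercase, and run_pipeline are stand-ins invented for this example and are not part of compare50; how compare50 itself applies a pass's preprocessor list is not shown here.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; compare50's own Token class is not used.
Tok = namedtuple("Tok", ["start", "end", "type", "val"])


def drop_whitespace(tokens):
    """Yield only tokens whose value is not pure whitespace."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


def lowercase(tokens):
    """Yield tokens with lower-cased values; positions are unchanged."""
    for tok in tokens:
        yield tok._replace(val=tok.val.lower())


def run_pipeline(tokens, preprocessors):
    """Feed the output of each preprocessor into the next one."""
    for preprocessor in preprocessors:
        tokens = preprocessor(tokens)
    return list(tokens)


tokens = [Tok(0, 3, "name", "Int"), Tok(3, 5, "space", "  "), Tok(5, 6, "name", "X")]
result = run_pipeline(tokens, [drop_whitespace, lowercase])
```

Because each surviving token keeps its original start and end, matches found on the processed stream can still be mapped back to character positions in the file.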

class compare50.Score(sub_a, sub_b, score=0)
Variables
  • sub_a – the first submission

  • sub_b – the second submission

  • score – a number indicating the similarity between sub_a and sub_b (higher meaning more similar)

A score representing the similarity of two submissions.

__init__(sub_a, sub_b, score=0) → None

Method generated by attrs for class Score.

class compare50.Span(file, start, end)
Variables
  • file – the ID of the File containing the span

  • start – the character index of the first character in the span

  • end – the character index one past the end of the span

Represents a range of characters in a particular file.
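Since end is one past the last character, a span behaves like a Python slice. The following sketch uses a hypothetical SpanSketch dataclass (not compare50.Span itself) to show the half-open convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for a span: a file id plus a half-open range."""
    file_id: int
    start: int  # index of the first character in the span
    end: int    # index one past the last character in the span

    def text(self, content):
        """Return the characters covered by this span."""
        return content[self.start:self.end]

    def __len__(self):
        return self.end - self.start


content = "int main(void)"
span = SpanSketch(file_id=0, start=4, end=8)
covered = span.text(content)
```

With this convention, the length of a span is simply end minus start, and two spans are adjacent exactly when one's end equals the other's start.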

__init__(file, start, end) → None

Method generated by attrs for class Span.

class compare50.Submission(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False)
Variables
  • path – the file path of the submission

  • files – list of compare50.File objects contained in the submission

  • preprocessor – A function from tokens to tokens that will be run on each file in the submission

  • id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).

Represents a single submission. Submissions may either be single files or directories containing many files.

__init__(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False) → None

Method generated by attrs for class Submission.

classmethod get(id)

Retrieve the submission corresponding to the specified id.

class compare50.Token(start, end, type, val)
Variables
  • start – the character index of the beginning of the token

  • end – the character index one past the end of the token

  • type – the Pygments token type

  • val – the string contents of the token

A result of the lexical analysis of a file. Preprocessors operate on Token streams.

__init__(start, end, type, val) → None

Method generated by attrs for class Token.

compare50.compare(scores, ignored_files, pass_)
Parameters
  • scores ([compare50.Score]) – Scored submission pairs to be compared more granularly

  • ignored_files ({compare50.File}) – files containing distro code

  • pass_ (compare50.Pass) – pass whose comparator should be used to compare the submissions

Returns

Compare50Results corresponding to each of the given scores

Return type

[compare50.Compare50Result]

Performs an in-depth comparison of each submission pair and returns a corresponding list of compare50.Compare50Results.

compare50.expand(span_matches, tokens_a, tokens_b)
Parameters
  • span_matches ([(compare50.Span, compare50.Span)]) – span pairs to be expanded wherein the first element of every pair is from the same file and the second element of every pair is from the same file

  • tokens_a ([compare50.Token]) – the tokens of the file corresponding to the first element of each span_match

  • tokens_b ([compare50.Token]) – the tokens of the file corresponding to the second element of each span_match

Returns

A new list of maximally expanded span pairs

Return type

[(compare50.Span, compare50.Span)]

Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
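The expansion idea can be sketched on plain token lists. The example below is a simplified illustration, not compare50's implementation: expand_match is a hypothetical helper, and it works on half-open token index ranges rather than on character-based Spans.

```python
def expand_match(tokens_a, tokens_b, range_a, range_b):
    """Grow two matching half-open index ranges while neighbouring tokens agree."""
    start_a, end_a = range_a
    start_b, end_b = range_b

    # Extend to the left while the preceding tokens are identical.
    while start_a > 0 and start_b > 0 and tokens_a[start_a - 1] == tokens_b[start_b - 1]:
        start_a -= 1
        start_b -= 1

    # Extend to the right while the following tokens are identical.
    while (end_a < len(tokens_a) and end_b < len(tokens_b)
           and tokens_a[end_a] == tokens_b[end_b]):
        end_a += 1
        end_b += 1

    return (start_a, end_a), (start_b, end_b)


tokens_a = ["x", "=", "a", "+", "b", ";"]
tokens_b = ["print", "a", "+", "b", ";", "done"]

# Start from a match covering only the "+" token in each file.
expanded = expand_match(tokens_a, tokens_b, (3, 4), (2, 3))
```

Here the single-token match grows to cover "a + b ;" in both lists and stops where the neighbouring tokens differ or a list ends.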

compare50.missing_spans(file, original_tokens=None, processed_tokens=None)
Parameters
  • file (compare50.File) – file to be examined

  • original_tokens – the unprocessed tokens of file. May be optionally specified if file has been tokenized elsewhere to avoid tokenizing it again.

  • processed_tokens – the result of preprocessing the tokens of file. May optionally be specified if file has been preprocessed elsewhere to avoid doing so again.

Returns

The spans of file that were stripped by the preprocessor.

Return type

[compare50.Span]

Determine which parts of file were stripped out by the preprocessor.
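One way to picture this computation: compare the positions of the original tokens with those that survive preprocessing, and merge adjacent gaps. The sketch below is an illustration under assumptions, not compare50's code. Tok and stripped_ranges are hypothetical, and it assumes the preprocessor leaves the start and end of surviving tokens unchanged.

```python
from collections import namedtuple

# Hypothetical stand-in for a token with character positions.
Tok = namedtuple("Tok", ["start", "end", "val"])


def stripped_ranges(original_tokens, processed_tokens):
    """Return merged half-open (start, end) ranges of tokens that were removed."""
    kept = {(tok.start, tok.end) for tok in processed_tokens}
    ranges = []
    for tok in original_tokens:
        if (tok.start, tok.end) in kept:
            continue
        if ranges and ranges[-1][1] == tok.start:
            # Adjacent to the previous stripped range: extend it.
            ranges[-1] = (ranges[-1][0], tok.end)
        else:
            ranges.append((tok.start, tok.end))
    return ranges


original = [
    Tok(0, 1, "x"), Tok(1, 2, " "), Tok(2, 5, "# a"), Tok(5, 6, "\n"),
    Tok(6, 7, "y"), Tok(7, 8, " "), Tok(8, 9, "z"),
]
# Pretend the preprocessor removed whitespace and comments.
processed = [tok for tok in original if tok.val in {"x", "y", "z"}]
missing = stripped_ranges(original, processed)
```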

compare50.rank(submissions, archive_submissions, ignored_files, pass_, n=50)
Parameters
  • submissions ([compare50.Submission]) – submissions to be ranked

  • archive_submissions ([compare50.Submission]) – archive submissions to be ranked

  • ignored_files ({compare50.File}) – files containing distro code

  • pass_ (compare50.Pass) – pass whose comparator should be used to rank the submissions

  • n (int) – number of submission pairs to return

Returns

the top n submission pairs

Return type

[compare50.Score]

Rank submissions and return the top n most similar pairs.
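Selecting the top n pairs from a list of scores can be sketched with the standard library. PairScore and top_pairs below are hypothetical stand-ins for illustration; this is not a claim about how compare50.rank is implemented internally.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for a scored pair of submissions.
PairScore = namedtuple("PairScore", ["sub_a", "sub_b", "score"])


def top_pairs(scores, n=50):
    """Return the n highest-scoring pairs, most similar first."""
    return heapq.nlargest(n, scores, key=lambda pair: pair.score)


scores = [
    PairScore("alice", "bob", 12),
    PairScore("alice", "carol", 40),
    PairScore("bob", "carol", 3),
]
best = top_pairs(scores, n=2)
```

heapq.nlargest avoids sorting the full list, which matters when the number of submission pairs is much larger than n.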

compare50.passes

class compare50.passes.exact

Removes nothing, not even whitespace, then uses the winnowing algorithm to compare submissions.

class compare50.passes.misspellings

Compares comments for identically misspelled English words.

class compare50.passes.nocomments

Removes comments, but keeps whitespace, then uses the winnowing algorithm to compare submissions.

class compare50.passes.structure

Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.

class compare50.passes.text

Removes whitespace, then uses the winnowing algorithm to compare submissions.

compare50.preprocessors

compare50.preprocessors.by_character(tokens)

Make a token for each character.

compare50.preprocessors.comments(tokens)

Remove all tokens that aren’t comments.

compare50.preprocessors.extract_identifiers(tokens)

Remove all tokens that don’t represent identifiers.

compare50.preprocessors.normalize_builtin_types(tokens)

Normalize builtin type names.

compare50.preprocessors.normalize_case(tokens)

Make all tokens lower case.

compare50.preprocessors.normalize_identifiers(tokens)

Replace all identifiers with the placeholder name v.

compare50.preprocessors.normalize_numeric_literals(tokens)

Replace numeric literals with their types.

compare50.preprocessors.normalize_string_literals(tokens)

Replace string literals with empty strings.

compare50.preprocessors.split_on_whitespace(tokens)

Split values of tokens on whitespace into new tokens.
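A preprocessor of this kind has to keep character positions consistent when it splits a token. The sketch below shows one way that could work. Tok and split_tokens_on_whitespace are hypothetical stand-ins, not the compare50 implementation.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a token; compare50's own Token class is not used.
Tok = namedtuple("Tok", ["start", "end", "type", "val"])


def split_tokens_on_whitespace(tokens):
    """Yield one token per whitespace-separated piece, with adjusted positions."""
    for tok in tokens:
        for match in re.finditer(r"\S+", tok.val):
            yield Tok(
                tok.start + match.start(),  # offset within the file
                tok.start + match.end(),    # one past the piece's last character
                tok.type,
                match.group(),
            )


comment = Tok(10, 22, "comment", "# two  words")
pieces = list(split_tokens_on_whitespace([comment]))
```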

compare50.preprocessors.strip_comments(tokens)

Remove all comments from tokens.

compare50.preprocessors.strip_whitespace(tokens)

Remove all whitespace from tokens.

compare50.preprocessors.text_printer(tokens)

Print token values. Useful for debugging.

compare50.preprocessors.token_printer(tokens)

Print each token. Useful for debugging.

compare50.preprocessors.words(tokens)

Split tokens into tokens containing just one word.

compare50.comparators

class compare50.comparators.Misspellings(dictionary)
__init__(dictionary)

Initialize self. See help(type(self)) for accurate signature.

compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

score(submissions, archive_submissions, ignored_files)

Number of identically misspelled words.
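The scoring idea can be illustrated with plain sets: collect the words in each submission that are absent from the dictionary, then intersect them. shared_misspellings is a hypothetical helper, and the case-folding it applies is an assumption of this sketch rather than documented compare50 behavior.

```python
def shared_misspellings(words_a, words_b, dictionary):
    """Return the words misspelled identically in both word lists."""
    misspelled_a = {word.lower() for word in words_a} - dictionary
    misspelled_b = {word.lower() for word in words_b} - dictionary
    return misspelled_a & misspelled_b


dictionary = {"the", "loop", "counter", "is", "here", "initialize"}
words_a = ["Initalize", "the", "countr", "here"]
words_b = ["initalize", "loop", "countr", "is", "teh"]

shared = shared_misspellings(words_a, words_b, dictionary)
score = len(shared)
```

A word misspelled in only one submission ("teh" above) does not contribute to the score.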

class compare50.comparators.Winnowing(k, t)

Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

Parameters
  • k – the noise threshold; any matching sequence of tokens shorter than this will be ignored

  • t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched

__init__(k, t)

Initialize self. See help(type(self)) for accurate signature.

compare(scores, ignored_files)

Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons.

score(submissions, archive_submissions, ignored_files)

Number of matching k-grams.
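To make the roles of k and t concrete, here is a minimal sketch of the basic winnowing selection rule from the cited paper (take the rightmost minimum hash in each window), not the robust variant and not compare50's own code. The window size t - k + 1 follows the paper. kgram_hashes, winnow, and fingerprint are hypothetical helpers, and Python's built-in hash is only a stand-in for a real rolling hash.

```python
def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    return [hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]


def winnow(hashes, window):
    """Select (position, hash) fingerprints: the rightmost minimum of each window."""
    fingerprints = []
    for i in range(len(hashes) - window + 1):
        chunk = hashes[i:i + window]
        smallest = min(chunk)
        # Index of the rightmost occurrence of the minimum within the window.
        offset = window - 1 - chunk[::-1].index(smallest)
        pick = (i + offset, smallest)
        # Consecutive windows often select the same position; record it once.
        if not fingerprints or fingerprints[-1] != pick:
            fingerprints.append(pick)
    return fingerprints


def fingerprint(tokens, k, t):
    """Fingerprint a token list with noise threshold k and guarantee threshold t."""
    return winnow(kgram_hashes(tokens, k), window=t - k + 1)


# Hash sequence used as the worked example in the paper, with a window of 4.
example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
selected = winnow(example, window=4)
```

Two submissions can then be scored by counting the fingerprint hashes they have in common, which corresponds to the "number of matching k-grams" idea above.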